I have a challenge for you. I would like to find every link in a block of html and add a name/value pair to the end of the query string. For example: It might find a link <a href="www.google.com">google</a> I want it to change the link to <a href="www.google.com?SID=123">google</a> We need to keep in mind that some links might already have a '?' and we need to use '&' while others will not have the '?'.
In my first attempt to answer this question, I used some fairly simple Regular Expression replaces. It was, however, brought to my attention that certain use cases were not considered. One was that the name-value pair we were adding to the URL might already be part of the URL. Another was that some URLs might have hash signs (page anchors) and the previous method would inappropriately add the name-value pair after the given hash sign.
In this attempt, I have switched over from simple regular expression replaces to using a Java Pattern / Matcher. This will allow us to match a link and examine each one on an individual basis. This code is certainly more robust, but I think it is straightforward. In my experience, more robust, easy to understand code is going to be more maintainable than a fairly complicated regular expression that accomplishes the same effect. Of course, that's just me and is definitely limited by my understanding of regular expressions (they are still hard for me - Steve would probably know how to do this).
In the following code, notice that one of the links already has the name-value pair "source=bennadel.com". Also notice that two of the links have a hash sign. We can easily handle this by splitting the URL based on the hash sign and then treating the base URL as if it never had a hash sign at all.
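As a rough illustration of the per-link approach described above, here is a minimal sketch in plain Java using the same Pattern / Matcher classes. It is not the code from this post: the class name `LinkTagger` and the methods `tagUrl` and `tagLinks` are made up for the example, and it assumes every href value is wrapped in double quotes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class LinkTagger {

    // Captures: (1) href=" prefix, (2) the URL, (3) the closing quote.
    // Assumption: href values are always double-quoted.
    private static final Pattern HREF = Pattern.compile(
        "(href\\s*=\\s*\")([^\"]*)(\")", Pattern.CASE_INSENSITIVE);

    // Adds name=value to a single URL unless the name is already in its query.
    static String tagUrl(String url, String name, String value) {
        // Split on the hash sign so the base URL can be treated
        // as if it never had a page anchor at all.
        int hash = url.indexOf('#');
        String base = (hash >= 0) ? url.substring(0, hash) : url;
        String fragment = (hash >= 0) ? url.substring(hash) : "";

        int q = base.indexOf('?');
        if (q >= 0) {
            // Look through the existing pairs (separated by & or &amp;).
            for (String pair : base.substring(q + 1).split("&(?:amp;)?")) {
                if (pair.equals(name) || pair.startsWith(name + "=")) {
                    return url; // already present: leave the URL untouched
                }
            }
        }

        // Pick "?" when there is no query yet, otherwise "&"
        // (or nothing if the base already ends in a separator).
        String sep = (q < 0) ? "?"
            : (base.endsWith("?") || base.endsWith("&")) ? "" : "&";

        // Re-attach the page anchor after the new pair.
        return base + sep + name + "=" + value + fragment;
    }

    // Walks every href in the HTML and rewrites each URL individually.
    static String tagLinks(String html, String name, String value) {
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String rewritten = m.group(1) + tagUrl(m.group(2), name, value) + m.group(3);
            // quoteReplacement keeps "$" and "\" in URLs from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(rewritten));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this sketch, `LinkTagger.tagUrl("index.cfm?id=4#top", "source", "bennadel.com")` comes back as `index.cfm?id=4&source=bennadel.com#top`, while a URL that already carries `source=bennadel.com` is returned unchanged. Like the approach in the post, `tagLinks` rewrites any href it finds, including the one in a LINK tag.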
Now, as Shuns raised on the previous example, this will also alter the URL of the HREF used in the LINK tag used to link external style sheets. This can be worked around by actually grabbing the tag as part of the regular expression (or forcing the tag to be an anchor). Unfortunately, Shuns didn't raise this until I was practically done with this second attempt and I am far too lazy to go modify my code. As a trade-off, however, modifying that URL is really not too much of a big deal. After all, it will still return the appropriate style sheet as we are just appending a query string value, not changing the base URL in any way.
This gives us the following output:
Notice that everything went quite swimmingly. We did not duplicate our name-value pair in the first URL. Nor did we add any name-value pairs in an inappropriate place.
"In my experience, more robust, easy to understand code is going to be more maintainable than a fairly complicated regular expression that accomplishes the same effect."
Agreed.
However, you also said this in your earlier post:
"Steve, if you are game and perhaps you can alter the first attempt (this one) and alter the regular expression to handle the other use cases - I demand satisfaction (throwing down the gauntlet)."
:-) Since I'm always down for a regex challenge, here's how you can do this with a single regex (let me know if I'm forgetting any of the cases that need to be accounted for or am otherwise messing something up):
<cfset content = reReplaceNoCase(content, '(< a\s[^>]*?href\s*=\s*"[^?##"]*)\??(?![^##"]*?\bsource=)', "\1?source=bennadel.com&", "all") />
(Remove the space between "<" and "a"... I added it to get around this blog's restricted HTML elements rule.)
That handles the following cases:
- Works with relative and absolute URLs, containing or not containing URL queries and/or fragments (page anchors).
- Does not modify URLs which already contain a "source" key in their query.
- Does not modify URLs contained within the href attributes of HTML elements other than anchors.
One issue is that it adds an unnecessary ampersand at the end of the URL query for URLs which did not already include a query. However, since this was more for the fun of solving the problem with a single regex than keeping URLs perfectly clean (as long as they still work identically), I can live with it. You could always run something like:
<cfset content = reReplace(content, '&(?="|##(?!x?[a-f\d]+;))', "", "all") />
...afterwards to pretty safely remove only the superfluous ampersands which were added within the HTML.
Posted by Steve on Apr 16, 2007 at 9:35 PM
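Steve's two expressions carry over to plain Java almost unchanged, which makes them easy to try outside ColdFusion. The sketch below is a translation made under stated assumptions, not code from this thread: the class name `SingleRegexTagger` is hypothetical, the doubled `##` becomes a single `#` (the doubling is only how CFML escapes a literal hash sign inside a string), the space after `<` is removed as instructed, and it assumes Java's regex engine treats these patterns the same way ColdFusion's does.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the two expressions from the comment above.
class SingleRegexTagger {

    // First expression: case-insensitive, like reReplaceNoCase.
    // Group 1 is everything from "<a " up to (not including) any "?", "#" or closing quote.
    // The negative lookahead rejects URLs whose query already has a "source" key.
    private static final Pattern ADD = Pattern.compile(
        "(<a\\s[^>]*?href\\s*=\\s*\"[^?#\"]*)\\??(?![^#\"]*?\\bsource=)",
        Pattern.CASE_INSENSITIVE);

    // Second expression: case-sensitive, like reReplace.
    // Matches an ampersand that sits right before a closing quote or before
    // a "#" that does not look like the start of a numeric character reference.
    private static final Pattern TRIM = Pattern.compile(
        "&(?=\"|#(?!x?[a-f\\d]+;))");

    static String tag(String html) {
        // "$1" is Java's spelling of the "\1" backreference in the CFML replacement.
        String tagged = ADD.matcher(html).replaceAll("$1?source=bennadel.com&");
        // Strip the superfluous trailing ampersands added by the first pass.
        return TRIM.matcher(tagged).replaceAll("");
    }
}
```

Run against `<a href="index.cfm?id=4#top">`, this puts the new pair at the front of the query, giving `index.cfm?source=bennadel.com&id=4#top`, and it leaves `<link href="style.css" />` alone because the pattern insists on an anchor tag.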
@Steve,
Brilliant! Well played my friend, well played:
http://www.bennadel.com/index.cfm?dax=blog:642.view
Posted by Ben Nadel on Apr 17, 2007 at 8:07 AM