I have a challenge for you. I would like to find every link in a block of HTML and add a name/value pair to the end of the query string. For example, it might find a link <a href="www.google.com">google</a>, and I want it to change the link to <a href="www.google.com?SID=123">google</a>. We need to keep in mind that some links might already have a '?', in which case we need to use '&', while others will not have the '?'.
This sounds like a job for Mr. Regular Expression! For this demo, I had to assume a few things:
- We are searching the ENTIRE page content and NOT just the content between two tags.
- The HREF attribute only exists within an anchor tag and will always denote a value that we want to modify.
These two assumptions make the solution much simpler. If either of them is not true, then things get much stickier. And, since I don't have anything else to go on, this is fine with me.
I am going to perform the URL alteration in two passes. The first pass will add the name-value pair to URLs with existing query strings. The second pass will create a query string consisting of the name-value pair for URLs that do NOT have an existing query string. This order is very important. If we did it in reverse order, creating new query strings first, then the second pass would alter ALL the URLs, including the ones previously altered. By only appending to existing query strings in the first run, we can be sure in the second run that we are only modifying URL values that have not yet been modified.
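The two passes described above can be sketched roughly as follows. This is a Python illustration rather than the ColdFusion demo code, and the function name `add_pair` and both patterns are assumptions made for the sketch: they only handle double-quoted HREF values and treat every `href="..."` as a link to alter, as assumed earlier.

```python
import re

PAIR = "source=bennadel.com"  # the name-value pair used in this demo


def add_pair(html: str, pair: str = PAIR) -> str:
    """Append `pair` to every double-quoted href value (hypothetical helper)."""
    # Pass 1: href values that already contain a "?" get "&pair" appended.
    html = re.sub(
        r'(href="[^"?]+\?[^"]*)"',
        lambda m: m.group(1) + "&" + pair + '"',
        html,
        flags=re.IGNORECASE,
    )
    # Pass 2: href values with no "?" at all get "?pair" appended.
    # URLs touched in pass 1 contain a "?", so they cannot match here.
    html = re.sub(
        r'(href="[^"?]+)"',
        lambda m: m.group(1) + "?" + pair + '"',
        html,
        flags=re.IGNORECASE,
    )
    return html
```

If the two passes were swapped, every URL would contain a "?" by the time the appending pass ran, so URLs that started without a query string would receive the pair twice.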
Also, for the purposes of this demo, I am going to alter the HTML before it even gets sent to the browser (using the existing page content buffer). Again, this is not the only way to do this. I am just doing what will be easiest for the demo as my problem domain has not been fully defined.
For this demo, I am adding the name-value pair "source=bennadel.com" to the query string. Let's start off with a simple HTML page that has a few links, some of which have a query string, some of which do not:
- <title>Alter URL Demo</title>
- Hey man, if you are looking for some good images, you should probably try out the search page on <a href="http://www.searchgalleries.com" target="_blank">Search Galleries</a>. It's pretty darn comprehensive and seems to keep track of all the free galleries that you will ever need. If you want to mess with the URL, its easy; just add a "q" query string value to the search url. The general site search URL is <a href="http://www.searchgalleries.com/search/" target="_blank">http://www.searchgalleries.com/search/</a>. So, then, to add a query value to it, such as "mature", you would simply add the query string "q=mature" to the url: <a href="http://www.searchgalleries.com/search/?q=mature" target="_blank">http://www.searchgalleries.com/search/?q=mature</a>. You can even search for more than one value at a given time. So, for instance, if you want to search for mature brunette women, you would put go to the URL:
- <a href="http://www.searchgalleries.com/search/?q=mature+brunette" target="_blank">http://www.searchgalleries.com/search/?q=mature+brunette</a>. Notice that "mature" and "brunette" are separated by a "+" sign. This is the URL encoded form of a space.
- <!--- Get the page context. --->
- <cfset objPageContext = GetPageContext() />
- <!--- Get the page buffer. --->
- <cfset objBuffer = objPageContext.GetOut().GetBuffer() />
- <!---
- Get the content buffer string. This will give us everything
- that has NOT yet been flushed to the browser. This is just
- how I am doing it for this demo and is NOT the only way to
- perform this task. Since this page is small, (and is being
- tested), we can safely assume that the content has not yet
- been flushed to the client.
- --->
- <cfset strContent = objBuffer.ToString() />
- <!---
- There are two cases that we need to consider when altering
- the HREF items in the content. We either have an existing
- query string (denoted by the existence of the "?" character)
- or we do not. If we do have one, we are simply appending our
- name-value pair. If we do not have one, we are creating a
- query string.
- We have to be careful. If we do the NEW query string first,
- then our "ADD" name-value pair might hit the same HREF
- twice. If, however, we only add pairs first, then our
- second go-round will not be affected by our first round.
- I am going to assume that the HREF attribute ONLY exists
- within an anchor tag. I don't think this is a huge stretch,
- and will make our regular expression much simpler.
- Notice that on our first regular expression, we are
- requiring the HREF value to contain at least one character
- and then a "?".
- --->
- <cfset strContent = strContent.ReplaceAll(
- ) />
- <!---
- Now that we have added to existing query strings, we can do
- a second pass that will add a query string to HREFs that do
- not have a query string. Notice that in our second regular
- expression, we are requiring that our HREF value does NOT
- contain a question mark.
- --->
- <cfset strContent = strContent.ReplaceAll(
- ) />
- <!--- Clear the existing content buffer. --->
- <cfset objPageContext.GetOut().ClearBuffer() />
- <!--- Output the updated HTML. --->
- <cfset WriteOutput( strContent ) />
Notice that the ColdFusion code goes after the HTML has already been written to the content buffer. If we didn't do it in this fashion, then we wouldn't have any HTML to work with. Running the above code, the resulting source of our HTML page is:
- <title>Alter URL Demo</title>
- Hey man, if you are looking for some good images, you should probably try out the search page on <a href="http://www.searchgalleries.com?source=bennadel.com" target="_blank">Search Galleries</a>. It's pretty darn comprehensive and seems to keep track of all the free galleries that you will ever need. If you want to mess with the URL, its easy; just add a "q" query string value to the search url. The general site search URL is <a href="http://www.searchgalleries.com/search/?source=bennadel.com" target="_blank">http://www.searchgalleries.com/search/</a>. So, then, to add a query value to it, such as "mature", you would simply add the query string "q=mature" to the url: <a href="http://www.searchgalleries.com/search/?q=mature&source=bennadel.com" target="_blank">http://www.searchgalleries.com/search/?q=mature</a>. You can even search for more than one value at a given time. So, for instance, if you want to search for mature brunette women, you would put go to the URL:
- <a href="http://www.searchgalleries.com/search/?q=mature+brunette&source=bennadel.com" target="_blank">http://www.searchgalleries.com/search/?q=mature+brunette</a>. Notice that "mature" and "brunette" are separated by a "+" sign. This is the URL encoded form of a space.
Notice that the name-value pair has been properly appended to existing query strings and set as the query string for URLs that did not have one. Also notice that it only modified the HREF attributes and did NOT alter the text of the links in any way, even though the text contained similar URLs.
Now, we did this as a post-page-processing technique, but there is no reason that this could not be wrapped up in a ColdFusion user defined function (UDF). Doing it this way, you could pass any kind of HTML you wanted to the function for altering. This would make it more flexible, and it could still be used in the fashion above.
While I can't say that I'm particularly interested in the code to add URL params to every link on the page, the output buffer reading is something I've been wishing I could do in ColdFusion for a while. Thanks Ben!
Yeah, don't worry about the URLs themselves. I just had to put down something for an example, and I try not to think too much very early in the morning. As far as reading from the output buffer, that should be something available since CFMX 6, but it is kind of buried under a bunch of method calls. I only found it through someone else's code samples.
Two outside cases not accounted for (although granted, they weren't mentioned in the requirements) are URLs which include a page anchor (e.g., "index.cfm?q#top") and URLs which already contain the query key you're adding (with any corresponding value).
One very minor critique about the above regexes is that, for efficiency, greedy repetition should be used where appropriate ("[^?]+\?" instead of ".+?\?").
Thanks for the regex tip. You can see that I started to use that mentality for the quotes [^"?]+. Sometimes, I am just not sure which is better. In general, it seems it is always better to tell it what NOT to match rather than to tell it just to be non-greedy?
"In general, it seems it is always better to tell it what NOT to match rather than to tell it just to be non-greedy?"
If it's possible to use such a pattern, then yes, typically. The reason ".+?\?" is slower is that the dot matches the question mark, so when using it with lazy repetition like that, the regex engine matches just one character at a time and backtracks after each step. If you use "[^?]+\?", the regex engine greedily matches everything up to a question mark in one step, and since it's then followed by a question mark (which wasn't allowed in the previous part), no backtracking is required. So fewer steps, and less backtracking.
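To make the comparison concrete, here is a small Python sketch (the helper name `through_first_question_mark` is made up for illustration). It only shows that the two patterns select the same text on a typical URL. It does not measure speed, and the two are not identical in every case: by default the dot will not cross a newline, while the negated class will.

```python
import re

LAZY = re.compile(r".+?\?")       # lazy dot: grows one character at a time
NEGATED = re.compile(r"[^?]+\?")  # negated class: runs greedily up to the "?"


def through_first_question_mark(pattern, text):
    """Return the prefix of `text` up to and including the first "?", or None."""
    match = pattern.match(text)
    return match.group() if match else None


url = "http://www.searchgalleries.com/search/?q=mature"
print(through_first_question_mark(LAZY, url))
print(through_first_question_mark(NEGATED, url))
```

Both calls print the URL up to and including the first "?", so the choice between the patterns here is about how much work the engine does, not about what gets matched.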
The impact of this change would probably be tiny, but lazy repetition in the wrong hands is a performance SNAFU or even server crash waiting to happen (this sort of thing is possible without lazy repetition, but such quantifiers make it easier to trigger catastrophic backtracking). These kinds of problems are usually easy to fix (e.g., throw an atomic group [or a faked one... see http://badassery.blogspot.com/2007/04/faking-atomic-groups.html ] around problematic segments), but generally I try to write regexes to be as efficient as possible (within reason, since optimizations which take advantage of language-specific engine implementations can make a simple regex much less readable, which I don't generally agree with).
One other minor point you seem to have missed is that href is also an attribute of link tags, used mainly for including CSS files, which would almost certainly be included in the page and should really be modified ;)
That makes a lot of sense. I think I should just break down and get RegEx Buddy cause I think you said that walks you through the steps, right? I am still not wrapping my head properly around atomic groups and/or possessive quantifiers... just gotta rock some examples to drive it home. I saw your post a while back and it looks very cool.
Damn you! ;) I have addressed your issue (but not dealt with it).
Here is a different approach that covers more of the use cases:
This uses a Java Pattern / Matcher to loop over the matched URLs and examine them one at a time. This is easier for me than writing a more complicated regular expression. However, Steve, if you are game, perhaps you can take the first attempt (this one) and alter the regular expression to handle the other use cases. I demand satisfaction (throwing down the gauntlet).
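For readers who want a feel for the match-by-match idea, here is a rough Python sketch. It is not the Java Pattern / Matcher code referred to above. The names `HREF`, `with_pair` and `add_pair_to_links` are hypothetical, and the sketch assumes double-quoted HREF values and a name and value that need no URL encoding. It covers the cases raised in this thread: page anchors, URLs that already carry the key, and link tags that should be left alone.

```python
import re

# Only href attributes inside <a ...> tags; <link href="..."> is not matched.
HREF = re.compile(r'(<a\b[^>]*?\shref=")([^"]*)(")', re.IGNORECASE)


def with_pair(url: str, name: str, value: str) -> str:
    """Add name=value to one URL, keeping any #fragment at the end."""
    base, hash_mark, fragment = url.partition("#")
    path, _, query = base.partition("?")
    # Keys already present in the query string ("q" counts as a key too).
    keys = [part.split("=", 1)[0] for part in query.split("&") if part]
    if name not in keys:
        pair = f"{name}={value}"
        query = f"{query}&{pair}" if query else pair
    return f"{path}?{query}{hash_mark}{fragment}"


def add_pair_to_links(html: str, name: str = "source",
                      value: str = "bennadel.com") -> str:
    """Examine each matched anchor href in turn and rewrite its URL."""
    return HREF.sub(
        lambda m: m.group(1) + with_pair(m.group(2), name, value) + m.group(3),
        html,
    )
```

Because each URL is examined on its own, the order-of-passes problem goes away: a single function decides between "?" and "&", skips URLs that already carry the key, and puts the pair before any "#top" style fragment.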
Regarding RegexBuddy, yeah, it can show you all the steps its engine takes to match or fail a pattern using its so-called debugger (I've learned a lot by studying the results from this). I'm a big RegexBuddy evangelist (I love its debugger, real-time matching, and wide regex feature support, and there are plenty of other useful features), but for those interested in a free yet still very solid regex assistant app I'd recommend Expresso ( http://www.ultrapico.com/Expresso.htm ), which uses the .NET library's regex engine.
By the way, just to be clear about what I meant earlier regarding taking advantage of language-specific engine implementations, I meant exploiting optimizations built into certain regex engines, not engine-specific features. I'll readily make use of features available, but I tend to avoid some more specific things (e.g., with some engines "aa*" is faster than "a+", and "(?=c)cat|cap" is faster than "cap|cat").
I'll take a look at your other thread. I'm always down for a regex challenge. :-)
Wow, this has saved us a day's development time and works like a dream - thank you so much.
Glad to help my man :)