I can't remember where it was exactly (maybe Twitter), but the other day, someone asked me a question about replacing a string that was not contained within another string. It was something like, "I want to replace all apostrophes in a string. But, I don't want to do that if they are inside HTML comments." While this might seem like a simple question to understand, it's actually a fairly complicated task to accomplish - at least for me. There might be a way to solve this with a single regular expression (RegEx) pattern used within a simple find-and-replace; but that kind of RegEx badassery is beyond me.
As such, I tend to approach these types of problems with more of a brute-force attitude. Rather than coming up with a regular expression pattern that is very clever, I go the opposite direction and actually dumb my pattern down to match more values. Using a "shoot first and ask questions later" style strategy, I match both the limiting factor - the HTML comment - and the target text as part of my pattern. Then, only once a pattern has been matched, I ask the regular expression engine which value was matched.
When using this approach, it's important to give the limiting factor pattern - the HTML comment - higher precedence in the regular expression by listing it first in the alternation. This way, we'll always be trying to match HTML comments first and never accidentally match a target pattern contained within an HTML comment. To demonstrate this, I am going to be searching for the target pattern, "here", contained within some random HTML markup:
<!--- Store some content to replace. --->
<cfsavecontent variable="content">
Here is some content over here.
<!-- And here is an HTML comment over here. -->
And here is some more content over here.
</cfsavecontent>

<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->

<!---
	Set up the regular expression pattern. Notice that we are
	using both the Verbose and Case-Insensitive flags.
--->
<cfsavecontent variable="patternText">(?xi)

	## I am going to match the comment pattern first since this
	## pattern is really our limiting factor (ie. if we match
	## this one, it negates any subsequent match within it).
	## NOTE: This is our first captured group.
	(
		[<]!--[\w\W]*?--[>]
	)

	## - OR -
	|

	## Now, we're going to match the pattern that we actually
	## want to replace out of the given string.
	## NOTE: This is our second captured group.
	(
		here
	)

</cfsavecontent>

<!--- Using our pattern text, let's compile a pattern. --->
<cfset pattern = createObject( "java", "java.util.regex.Pattern" )
	.compile( javaCast( "string", patternText ) )
	/>

<!---
	Create a matcher for our pattern as applied to our target
	text that we are wanting to alter.
--->
<cfset matcher = pattern.matcher( javaCast( "string", content ) ) />

<!--- Create a string buffer to hold our result. --->
<cfset result = createObject( "java", "java.lang.StringBuffer" ).init() />

<!---
	Keep looping over matches while the matcher can find more
	in the target string.
--->
<cfloop condition="matcher.find()">

	<!---
		Because we are searching for two patterns here within
		captured groups, we can check for the existence of
		captured groups to determine which pattern was matched.
	--->

	<!--- The comment was the first captured group. --->
	<cfset commentMatch = matcher.group( javaCast( "int", 1 ) ) />

	<!--- The target pattern was our second captured group. --->
	<cfset targetMatch = matcher.group( javaCast( "int", 2 ) ) />

	<!---
		Check to see which pattern exists - the group() method
		will return NULL if it did not match, which will delete
		the given ColdFusion variable.
	--->
	<cfif structKeyExists( variables, "commentMatch" )>

		<!---
			The comment was matched - simply add it to the
			results without any modification.
		--->
		<cfset matcher.appendReplacement(
			result,
			matcher.quoteReplacement( commentMatch )
			) />

	<cfelse>

		<!---
			The target pattern was what was matched in this
			case. As such, this time, we want to replace it.
		--->
		<cfset matcher.appendReplacement(
			result,
			javaCast( "string", "****" )
			) />

	</cfif>

</cfloop>

<!---
	Now that we're out of the matching, let's append whatever
	remains of the content (after the last match) to our
	results buffer.
--->
<cfset matcher.appendTail( result ) />

<!--- Output results. --->
<pre>#htmlEditFormat( result.toString() )#</pre>
As you can see, my sample markup contains the phrase, "here," several times, both inside and outside of the HTML comment. The verbose (?x) regular expression pattern that I am providing matches both HTML comments and our target text (NOTE: This pattern does not allow for nested comments). Then, within the regular expression matcher, I check to see which pattern was matched in the given iteration. If the HTML comment was matched, there's nothing to be done and I simply append it to the result buffer; if the target pattern was matched, however, I add its replacement to the result buffer. When we run the above code, we get the following output:
**** is some content over ****.
<!-- And here is an HTML comment over here. -->
And **** is some more content over ****.
As you can see, all instances of, "here," found outside of the HTML comment were replaced with, "****"; the two instances of, "here," contained within the HTML comment, however, were ignored.
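For readers who want to try the same idea outside of ColdFusion, here is a minimal sketch in plain Java, which is what the ColdFusion code is calling into anyway. The class name `CommentAwareReplace` and the method `replaceOutsideComments` are hypothetical names made up for this illustration; the sketch also assumes the target is a literal string (it gets quoted with `Pattern.quote`) and, like the pattern above, it does not handle nested comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of any library.
class CommentAwareReplace {

    // Replaces every occurrence of the literal target (case-insensitive)
    // that is NOT inside an HTML comment.
    static String replaceOutsideComments(String content, String target, String replacement) {
        // Group 1: an HTML comment (the limiting factor), listed first so it wins.
        // Group 2: the literal target text that we actually want to replace.
        Pattern pattern = Pattern.compile(
            "(<!--[\\w\\W]*?-->)|(" + Pattern.quote(target) + ")",
            Pattern.CASE_INSENSITIVE
        );
        Matcher matcher = pattern.matcher(content);
        StringBuffer result = new StringBuffer();

        while (matcher.find()) {
            if (matcher.group(1) != null) {
                // A comment was matched: copy it through unchanged.
                matcher.appendReplacement(result, Matcher.quoteReplacement(matcher.group(1)));
            } else {
                // The target was matched outside of any comment: swap it out.
                matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
            }
        }

        // Append whatever remains after the last match.
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        String content = "Here is some content over here. "
            + "<!-- And here is an HTML comment over here. --> "
            + "And here is some more content over here.";

        // Prints: **** is some content over ****. <!-- And here is an HTML
        // comment over here. --> And **** is some more content over ****.
        System.out.println(replaceOutsideComments(content, "here", "****"));

        // The original apostrophe question, with the same approach.
        // Prints: It&#39;s fine. <!-- don't touch --> Isn&#39;t it?
        System.out.println(
            replaceOutsideComments("It's fine. <!-- don't touch --> Isn't it?", "'", "&#39;")
        );
    }
}
```

Both the copied comment and the replacement go through `Matcher.quoteReplacement`, so a stray `$` or backslash in either one is not misread as a group reference.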
Regular expressions are extremely powerful; but, sometimes, they can be overly complicated. In situations like this where a problem might be solved with a very complicated pattern, I tend to prefer simpler regular expression patterns with more algorithmic steps. This affords me a little more sanity and, I believe, makes the code much more readable.
Why do complex issues always turn out to be so simple?
Amazing demo Ben!
"Simple" is a relative term - there's a lot going on here :) I'm glad you like the demo.
I suppose it's possible you could encounter speed issues with larger source files, but I'd still prefer a simpler expression that is easier for other people to follow. (I've used some fairly complex expressions before, created with the aid of some nifty tools, but then you end up having to break down the entire expression for someone else anyway.) If performance becomes an issue in a specific case, then you could explore the use of more complex regexes to do it more quickly.
In the right circumstances, you might also be able to use XSL or possibly an XML parser to do this, but you'd have to be sure you had well-formed HTML in every case, otherwise it would be considerably more trouble than it's worth to use that approach.
The performance is an interesting discussion. The regular expression has to run through the entire content regardless, so there's that. Of course, there is going to be a cost overhead to having to match the comment; and, I wonder if the lazy nature of the match adds to that?
In the end, I agree 100% with what you're saying - performance at this scale is a secondary issue to ease of use / maintainability.
I believe it was in a comment on one of your blog posts, or in the comments on someone else's blog, that you got that question. (I was looking into the same question recently and ran into your comments and those of others.)
As I just said, I was looking for something similar. I believe it had to do with replacing URLs in comments (also one of your posts).
I tried two weeks ago to replace certain & signs (with &amp;) while certain others should stay. The only way I figured was a double Replace(), as you will replace &amp; with &amp;amp; which would cause an error, so then I replaced &amp;amp; with &amp; again afterwards. Perhaps not the best solution, but essential to get it solved if you are playing with URLs in XML.
It's funny you bring that up because I was just talking to my co-worker yesterday about the use of & in XHTML links (for validation purposes). Apparently, in XHTML standards, you have to escape your link-based & values... which is lame, in my opinion; but, if it has to be XML-compatible, I suppose you gotta do what you gotta do.
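As an aside on the ampersand escaping discussed above: one way to avoid a second clean-up pass is to match only the `&` characters that do not already start an entity, using a negative lookahead. The following is a minimal sketch in Java under that assumption; `AmpersandEscaper` and `escapeBareAmpersands` are hypothetical names for this illustration, and the pattern only recognizes named, decimal, and hexadecimal entity shapes, so a query-string parameter that happens to look like an entity (say, a parameter named `lt` immediately followed by `;`) would be left alone.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of any library.
class AmpersandEscaper {

    // An ampersand that is NOT followed by something shaped like an entity:
    // a name (amp;), a decimal reference (#38;), or a hex reference (#x26;).
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
        "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
    );

    // Escapes bare ampersands in one pass and leaves existing entities as-is.
    static String escapeBareAmpersands(String input) {
        return BARE_AMPERSAND.matcher(input).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // Prints: page.cfm?a=1&amp;b=2
        System.out.println(escapeBareAmpersands("page.cfm?a=1&b=2"));

        // Already escaped, so it is left untouched.
        // Prints: page.cfm?a=1&amp;b=2
        System.out.println(escapeBareAmpersands("page.cfm?a=1&amp;b=2"));
    }
}
```

Because the lookahead never consumes anything, running the method twice on the same string gives the same result as running it once.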
I think you need to be a little careful when you are using regexps to parse HTML content like this. As long as you are responsible for the HTML you are cleaning up, then setting up the regexps is safe; if the content that you are parsing is from an untrusted source, then maybe you should be thinking about using an HTML parser. Using regexps can sometimes mean you end up with fragile code.
I've found that not searching for a string in regex is kinda tough. A while back I had a large table with about 50 elements per database row returned in a cfoutput query. I was modifying the code and wanted to ditch the table so I figured I'd use a regular expression to search and replace everything that was not a cfoutput variable (ie between pound signs).
After playing with the regex engine in my text editor for five times longer than it would have taken to hand cut out all of the variables I came up with this:
That's pretty close to working but not quite there yet.
This post reminded me of that experience so I thought I'd share.
Yes - definitely very true. Browsers are so forgiving that it's a bad idea to ever rely on "valid" XHTML (or HTML). Your regex patterns can start to get really complex if you are trying to accommodate too much variety.
Searching for "not" strings is definitely another one of those things that is much more complicated than it sounds.