I can't remember where it was exactly (maybe Twitter), but the other day, someone asked me a question about replacing a string that was not contained within another string. It was something like, "I want to replace all apostrophes in a string. But, I don't want to do that if they are inside HTML comments." While this might seem like a simple question to understand, it's actually a fairly complicated task to accomplish - at least for me. There might be a way to solve this with a single regular expression (RegEx) pattern used within a simple find-and-replace; but that kind of RegEx badassery is beyond me.
As such, I tend to approach these types of problems with more of a brute-force attitude. Rather than coming up with a regular expression pattern that is very clever, I go the opposite direction and actually dumb my pattern down to match more values. Using a "shoot first and ask questions later" style strategy, I match both the limiting factor - the HTML comment - and the target text as part of my pattern. Then, only once a pattern has been matched, I ask the regular expression engine which value was matched.
When using this approach, it's important to give the limiting factor pattern - the HTML comment - higher precedence in the regular expression by listing it first in the alternation. This way, we'll always be trying to match HTML comments first and never accidentally match a target pattern contained within an HTML comment. To demonstrate this, I am going to be searching for the target pattern, "here", contained within some random HTML markup:
<!--- Store some content to replace. --->
<cfsavecontent variable="content">

	Here is some content over here.

	<!-- And here is an HTML comment over here. -->

	And here is some more content over here.

</cfsavecontent>


<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->


<!---
	Set up the regular expression pattern. Notice that we are
	using both the Verbose and Case-Insensitive flags.
--->
<cfsavecontent variable="patternText">(?xi)

	## I am going to match the comment pattern first since this
	## pattern is really our limiting factor (ie. if we match
	## this one, it negates any subsequent match within it).
	## NOTE: This is our first captured group.

	(
		[<]!--[\w\W]*?--[>]
	)

	## - OR -

	|

	## Now, we're going to match the pattern that we actually
	## want to replace out of the given string.
	## NOTE: This is our second captured group.

	(
		here
	)

</cfsavecontent>

<!--- Using our pattern text, let's compile a pattern. --->
<cfset pattern = createObject( "java", "java.util.regex.Pattern" )
	.compile( javaCast( "string", patternText ) )
	/>

<!---
	Create a matcher for our pattern as applied to our target
	text that we are wanting to alter.
--->
<cfset matcher = pattern.matcher(
	javaCast( "string", content )
	) />

<!--- Create a string buffer to hold our result. --->
<cfset result = createObject( "java", "java.lang.StringBuffer" ).init() />

<!---
	Keep looping over matches while the matcher can find more in
	the target string.
--->
<cfloop condition="matcher.find()">

	<!---
		Because we are searching for two patterns here within
		captured groups, we can check for the existence of
		captured groups to determine which pattern was matched.
	--->

	<!--- The comment was the first captured group. --->
	<cfset commentMatch = matcher.group( javaCast( "int", 1 ) ) />

	<!--- The target pattern was our second captured group. --->
	<cfset targetMatch = matcher.group( javaCast( "int", 2 ) ) />

	<!---
		Check to see which pattern exists - the group() method
		will return NULL if it did not match, which will delete
		the given ColdFusion variable.
	--->
	<cfif structKeyExists( variables, "commentMatch" )>

		<!---
			The comment was matched - simply add it to the
			results without any modification.
		--->
		<cfset matcher.appendReplacement(
			result,
			matcher.quoteReplacement( commentMatch )
			) />

	<cfelse>

		<!---
			The target pattern was what was matched in this case.
			As such, this time, we want to replace it.
		--->
		<cfset matcher.appendReplacement(
			result,
			javaCast( "string", "****" )
			) />

	</cfif>

</cfloop>

<!---
	Now that we're out of the matching, let's append whatever
	remains of the content (after the last match) to our
	results buffer.
--->
<cfset matcher.appendTail( result ) />

<!--- Output results. --->
<pre>#htmlEditFormat( result.toString() )#</pre>
As you can see, my sample markup contains the phrase, "here," several times, both inside and outside of the HTML comment. The verbose (?x) regular expression pattern that I am providing matches both HTML comments and our target text (NOTE: This pattern does not allow for nested comments). Then, within the regular expression matcher, I check to see which pattern was matched in the given iteration. If the HTML comment was matched, there's nothing to be done and I simply append it to the result buffer; if the target pattern was matched, however, I add its replacement to the result buffer. When we run the above code, we get the following output:
**** is some content over ****.

<!-- And here is an HTML comment over here. -->

And **** is some more content over ****.
As you can see, all instances of, "here," found outside of the HTML comment were replaced with, "****"; the two instances of, "here," contained within the HTML comment, however, were ignored.
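Since the ColdFusion code above is really just driving Java's `java.util.regex` classes, the same match-both-then-ask approach can be sketched in plain Java. This is only a minimal sketch, not a definitive implementation: the class name `CommentAwareReplace` and the method `replaceOutsideComments()` are hypothetical names made up for illustration, and, like the pattern above, it assumes HTML comments are not nested.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class (name invented for this sketch).
final class CommentAwareReplace {

    // Replaces every match of targetRegex that sits outside an HTML comment.
    static String replaceOutsideComments(String content, String targetRegex, String replacement) {
        // Group 1 is the HTML comment (the limiting factor). It is listed first
        // so that it wins wherever a comment starts. The target is the second
        // alternative. Assumption: comments are not nested.
        Pattern pattern = Pattern.compile(
            "(<!--[\\w\\W]*?-->)|(?:" + targetRegex + ")",
            Pattern.CASE_INSENSITIVE
        );
        Matcher matcher = pattern.matcher(content);
        StringBuffer result = new StringBuffer();

        while (matcher.find()) {
            String commentMatch = matcher.group(1);
            if (commentMatch != null) {
                // A comment was matched: copy it through without modification.
                matcher.appendReplacement(result, Matcher.quoteReplacement(commentMatch));
            } else {
                // The target was matched: swap in the (literal) replacement.
                matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
            }
        }

        // Append whatever remains after the last match.
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        String content = "Here is some content over here.\n"
            + "<!-- And here is an HTML comment over here. -->\n"
            + "And here is some more content over here.";
        System.out.println(replaceOutsideComments(content, "here", "****"));
    }
}
```

The one difference from the ColdFusion version is that Java lets us test `group(1)` against `null` directly, rather than checking whether the variable still exists. The same helper should also cover the original question: passing `"'"` as the target would replace apostrophes outside of HTML comments while leaving the ones inside comments alone.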
Regular expressions are extremely powerful; but, sometimes, they can be overly complicated. In situations like this where a problem might be solved with a very complicated pattern, I tend to prefer simpler regular expression patterns with more algorithmic steps. This affords me a little more sanity and, I believe, makes the code much more readable.