Ask Ben: Replacing A String That Is Not Inside Of Another String

Posted February 25, 2010 at 9:42 AM by Ben Nadel

Tags: ColdFusion, Ask Ben

I can't remember where it was exactly (maybe Twitter), but the other day, someone asked me a question about replacing a string that was not contained within another string. It was something like, "I want to replace all apostrophes in a string. But, I don't want to do that if they are inside HTML comments." While this might seem like a simple question to understand, it's actually a fairly complicated task to accomplish - at least for me. There might be a way to solve this with a single regular expression (RegEx) pattern used within a simple find-and-replace; but that kind of RegEx badassery is beyond me.

As such, I tend to approach these types of problems with more of a brute-force attitude. Rather than coming up with a regular expression pattern that is very clever, I go the opposite direction and actually dumb my pattern down to match more values. Using a "shoot first and ask questions later" style strategy, I match both the limiting factor - the HTML comment - and the target text as part of my pattern. Then, only once a pattern has been matched, I ask the regular expression engine which value was matched.
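To make the "match first, ask later" idea concrete, here is a minimal sketch in Python rather than ColdFusion. Python's `re` module is used purely for illustration, and the sample text and the `classify_matches` helper are hypothetical names, not code from this post. The pattern matches either an HTML comment or the target word, and the loop then checks which capture group took part in each match.

```python
import re

# Group 1: an HTML comment (the limiting factor), listed first.
# Group 2: the target text we actually care about.
PATTERN = re.compile(r"(<!--[\w\W]*?-->)|(here)", re.IGNORECASE)


def classify_matches(text):
    """Return (kind, matched_text) pairs for every match, in order."""
    results = []
    for match in PATTERN.finditer(text):
        # A group that did not participate in the match is None.
        if match.group(1) is not None:
            results.append(("comment", match.group(1)))
        else:
            results.append(("target", match.group(2)))
    return results


sample = "Over here. <!-- not here --> And here."
for kind, value in classify_matches(sample):
    print(kind, repr(value))
```

Because the whole comment is consumed as a single match, the "here" inside it is never offered to the second group on its own.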

When using this approach, it's important to give the limiting factor pattern - the HTML comment - higher precedence in the regular expression. This way, we'll always be trying to match HTML comments first and never accidentally match a target pattern contained within an HTML comment. To demonstrate this, I am going to be searching for the target pattern, "here", contained within some random HTML markup:

    <!--- Store some content to replace. --->
    <cfsavecontent variable="content">

    Here is some content over here.

    <!-- And here is an HTML comment over here. -->

    And here is some more content over here.

    </cfsavecontent>


    <!--- ----------------------------------------------------- --->
    <!--- ----------------------------------------------------- --->
    <!--- ----------------------------------------------------- --->


    <!---
        Set up the regular expression pattern. Notice that we
        are using both the Verbose and Case-Insensitive flags.
    --->
    <cfsavecontent variable="patternText">(?xi)

        ## I am going to match the comment pattern first since this
        ## pattern is really our limiting factor (ie. if we match
        ## this one, it negates any subsequent match within it).
        ## NOTE: This is our first captured group.

        (
            [<]!--[\w\W]*?--[>]
        )

        ## - OR -

        |

        ## Now, we're going to match the pattern that we actually
        ## want to replace out of the given string.
        ## NOTE: This is our second captured group.

        (
            here
        )

    </cfsavecontent>


    <!--- Using our pattern text, let's compile a pattern. --->
    <cfset pattern = createObject( "java", "java.util.regex.Pattern" )
        .compile( javaCast( "string", patternText ) )
        />

    <!---
        Create a matcher for our pattern as applied to our target
        text that we are wanting to alter.
    --->
    <cfset matcher = pattern.matcher(
        javaCast( "string", content )
        ) />

    <!--- Create a string buffer to hold our result. --->
    <cfset result = createObject( "java", "java.lang.StringBuffer" ).init() />


    <!---
        Keep looping over matches while the matcher can find more
        in the target string.
    --->
    <cfloop condition="matcher.find()">

        <!---
            Because we are searching for two patterns here within
            captured groups, we can check for the existence of
            captured groups to determine which pattern was matched.
        --->

        <!--- The comment was the first captured group. --->
        <cfset commentMatch = matcher.group( javaCast( "int", 1 ) ) />

        <!--- The target pattern was our second captured group. --->
        <cfset targetMatch = matcher.group( javaCast( "int", 2 ) ) />


        <!---
            Check to see which pattern exists - the group() method
            will return NULL if it did not match, which will delete
            the given ColdFusion variable.
        --->
        <cfif structKeyExists( variables, "commentMatch" )>

            <!---
                The comment was matched - simply add it to the results
                without any modification.
            --->
            <cfset matcher.appendReplacement(
                result,
                matcher.quoteReplacement( commentMatch )
                ) />

        <cfelse>

            <!---
                The target pattern was what was matched in this case.
                As such, this time, we want to replace it.
            --->
            <cfset matcher.appendReplacement(
                result,
                javaCast( "string", "****" )
                ) />

        </cfif>

    </cfloop>

    <!---
        Now that we're out of the matching, let's append whatever
        remains of the content (after the last match) to our
        results buffer.
    --->
    <cfset matcher.appendTail( result ) />


    <!--- Output results (CFOutput is needed to evaluate the hash expression). --->
    <cfoutput><pre>#htmlEditFormat( result.toString() )#</pre></cfoutput>

As you can see, my sample markup contains the phrase, "here," several times, both inside and outside of the HTML comment. The verbose (?x) regular expression pattern that I am providing matches both HTML comments as well as our target text (NOTE: This pattern does not allow for nested comments). Then, within the regular expression matcher, I check to see which pattern was matched in the given iteration. If the HTML comment was matched, there's nothing to be done and I simply append it to the result buffer; if the target pattern was matched, however, I add its replacement to the result buffer. When we run the above code, we get the following output:

    **** is some content over ****.

    <!-- And here is an HTML comment over here. -->

    And **** is some more content over ****.

As you can see, all instances of, "here," found outside of the HTML comment were replaced with, "****"; the two instances of, "here," contained within the HTML comment, however, were ignored.
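For comparison, the same match-then-decide replacement can be sketched in Python, where `re.sub()` accepts a callback that plays the role of the CFIF block inside the matcher loop. This is only an illustrative sketch under stated assumptions: the `replace_outside_comments` function and its parameter names are hypothetical, and, like the ColdFusion pattern, it does not handle nested comments.

```python
import re


def replace_outside_comments(text, target, replacement):
    """Replace `target` with `replacement` everywhere except inside HTML comments."""
    # Comment alternative first (group 1), then the escaped target (group 2).
    pattern = re.compile(
        r"(<!--[\w\W]*?-->)|(" + re.escape(target) + r")",
        re.IGNORECASE,
    )

    def pick(match):
        # Group 1 matched: this is a comment, so put it back unchanged.
        if match.group(1) is not None:
            return match.group(1)
        # Otherwise the target matched outside of any comment.
        return replacement

    # The string returned by a callback is inserted literally.
    return pattern.sub(pick, text)


content = (
    "Here is some content over here.\n"
    "<!-- And here is an HTML comment over here. -->\n"
    "And here is some more content over here.\n"
)
print(replace_outside_comments(content, "here", "****"))
```

The same function covers the apostrophe question that prompted this post: calling `replace_outside_comments(html, "'", "&#39;")` would leave apostrophes inside HTML comments alone.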

Regular expressions are extremely powerful; but, sometimes, they can be overly complicated. In situations like this where a problem might be solved with a very complicated pattern, I tend to prefer simpler regular expression patterns with more algorithmic steps. This affords me a little more sanity and, I believe, makes the code much more readable.




Reader Comments

Feb 25, 2010 at 11:17 AM
1 Comments

Why do complex issues always turn out to be so simple

Amazing demo Ben!


Feb 25, 2010 at 11:36 AM
11,246 Comments

@Joost,

"Simple" is a relative term - there's a lot going on here :) I'm glad you like the demo.


Feb 25, 2010 at 12:17 PM
35 Comments

I suppose it's possible you could encounter speed issues with larger source files, but I'd still prefer a simpler expression that is easier for other people to follow. (I've used some fairly complex expressions before, created with the aid of some nifty tools, but then you end up having to break down the entire expression for someone else anyway.) If performance becomes an issue in a specific case, then you could explore the use of more complex regexes to do it more quickly.

In the right circumstances, you might also be able to use XSL or possibly an XML parser to do this, but you'd have to be sure you had well-formed HTML in every case, otherwise it would be considerably more trouble than it's worth to use that approach.


Feb 25, 2010 at 12:42 PM
11,246 Comments

@Dave,

The performance is an interesting discussion. The regular expression has to run through the entire content regardless, so there's that. Of course, there is going to be a cost overhead to having to match the comment; and, I wonder if the lazy nature of the match adds to that?

In the end, I agree 100% with what you're saying - performance at this scale is a secondary issue to ease of use / maintainability.


Feb 27, 2010 at 4:07 AM
4 Comments

I believe it was in a comment on one of your blog posts, or in someone else's blog comments, that you were asked that question (I was looking into the same question recently and ran into your comments and those of others).
As I just said, I was looking for something similar. I believe it had to do with replacing URLs in comments (also one of your posts).

I tried two weeks ago to replace certain & signs (by &amp;) while leaving others as they were. The only way I figured was a double Replace(), as you will otherwise replace &amp; with &amp;amp;, which would cause an error; so I then replaced &amp;amp; with &amp; again afterwards. Perhaps not the best solution, but essential to get it solved if you are playing with URLs in XML.


Feb 27, 2010 at 11:28 AM
11,246 Comments

@Steven,

It's funny you bring that up because I was just talking to my co-worker yesterday about the use of & in XHTML links (for validation purposes). Apparently, in XHTML standards, you have to escape your link-based & values... which is lame, in my opinion; but, if it has to be XML-compatible, I suppose you gotta do what you gotta do.


Feb 28, 2010 at 6:00 AM
1 Comments

I think you need to be a little careful when you are using regexps to parse HTML content like this. As long as you are responsible for the HTML you are cleaning up, then setting up the regexps is safe; if the content that you are parsing is from an untrusted source, then maybe you should be thinking about using an HTML parser. Using regexps can sometimes mean you end up with fragile code.


Mar 3, 2010 at 11:34 PM
11 Comments

I've found that not searching for a string in regex is kinda tough. A while back I had a large table with about 50 elements per database row returned in a cfoutput query. I was modifying the code and wanted to ditch the table so I figured I'd use a regular expression to search and replace everything that was not a cfoutput variable (ie between pound signs).

After playing with the regex engine in my text editor for five times longer than it would have taken to hand cut out all of the variables I came up with this:

[^#]+(?=[ ]#)

That's pretty close to working but not quite there yet.

This post reminded me of that experience so I thought I'd share.


Mar 8, 2010 at 7:16 PM
11,246 Comments

@James,

Yes - definitely very true. Browsers are so forgiving that it's a bad idea to ever rely on "valid" XHTML (or HTML). Your regex patterns can start to get really complex if you are trying to accommodate too much variety.

@Christopher,

Searching for "not" strings is definitely another one of those things that is much more complicated than it sounds.

