reMultiMatch() - Extracting Iterative Regular Expression Patterns In ColdFusion

By Ben Nadel
Tags: ColdFusion

Regular Expressions are simply amazing; they are one of the sexiest aspects of computer science. However, they can get very complicated very fast. As the complexity of the regular expression increases, so does its power. A complex regular expression might be able to, in a single pass, do the work of three smaller regular expressions applied in succession. The problem with this, however, is that the level of complexity of a regular expression increases far faster than its level of power; a pattern that is 3-times more powerful might very well feel 10-times more complex.

In order to find a sweet spot between the ease of simple regular expressions and the power of complex ones, I thought I would try to create a ColdFusion user defined function that allowed multiple patterns to be applied in succession to a target string. The idea here was that we could present a "single pass" approach to the programmer that could be defined using several, smaller regular expressions. What I came up with was reMultiMatch(). Just as with ColdFusion's reMatch() function, the reMultiMatch() function is designed to extract an array of pattern matches contained within a given string. The difference, of course, being that reMultiMatch() allows more than one regular expression pattern to be passed-in:

reMultiMatch( pattern, [pattern,]* string )

Before we dive into how the function works, it might be more helpful to see how it can be used. In the following demo, we have a snippet of HTML that contains several IMG tags. From this HTML, we want to extract the SRC value of each image; but, we only want to do that if the image also has the CSS class, "saucy." This kind of extraction could be performed using a wicked complex regular expression; however, as you'll see below, reMultiMatch() allows us to extract those matches using four (relatively) much simpler patterns:

<!--- Create a piece of demo text (some mock HTML). --->
<cfsavecontent variable="demoText">

    <h2>
        Images
    </h2>

    <ul>
        <li>
            <img src="very-sexy-girl.jpg" class="saucy" />
        </li>
        <li>
            <img id="banner" src="coldfusion-ad.jpg" class="saucy" />
        </li>
        <li>
            <img id="footerBanner" src="fbanner.png" class="footer" />
        </li>
    </ul>

</cfsavecontent>

<!---
    Extract all of the IMG SRC values from our demo text, but
    only if the IMG has the class of "saucy". To do this, we are
    going to use a multi-pass regular expression match that matches
    the following patterns:

    1. All IMG tags.
    2. IMG tags that have the "saucy" class.
    3. src="value" pairs.
    4. The quoted SRC value.

    NOTE: Since we are using the Java regular expression engine
    internal to the function, we are able to make use of powerful
    features like positive look-behinds.
--->
<cfset srcValues = reMultiMatch(
    "<img[^>]+>",
    "(?=.+?class\s*=\s*""saucy"").+",
    "src\s*=\s*""[^""]+""",
    "(?<="").+(?="")",
    demoText
    ) />

<!--- Output the list of IMG SRC values. --->
<cfdump
    var="#srcValues#"
    label="IMG SRC Values"
    />

When we run the above code, we get the following page output:

[Figure: reMultiMatch() allows multiple small regular expressions to be applied to a given string.]

As you can see, we have successfully extracted the SRC values of the two IMG tags that contained the "saucy" class attribute. In order to do this, we applied the following four regular expressions in succession:

"<img[^>]+>"
This extracted all of the individual IMG tags.

"(?=.+?class\s*=\s*""saucy"").+"
This used a positive look-ahead to make sure that collected IMG tags had the class="saucy" name-value pair.

"src\s*=\s*""[^""]+"""
This extracted the SRC name-value attribute from the given IMG tag.

"(?<="").+(?="")"
This used a positive look-ahead and look-behind to extract the quoted value from the SRC name-value pair.

If any of these regular expressions looks complex on its own, just imagine how insanely complex it would be to try and merge these four expressions into a single pattern.
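To make the narrowing concrete, here is a minimal sketch that replays the four patterns one at a time and keeps the intermediate results. It is written in plain Java rather than ColdFusion, on the assumption that java.util.regex behaves the same when called directly as it does from inside the function. The class name StageTrace and the helpers applyStage() and trace() are made up for this illustration; they are not part of reMultiMatch().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StageTrace {

    // The four patterns from the demo, with the quotes escaped Java-style.
    static final String[] PATTERNS = {
        "<img[^>]+>",
        "(?=.+?class\\s*=\\s*\"saucy\").+",
        "src\\s*=\\s*\"[^\"]+\"",
        "(?<=\").+(?=\")"
    };

    // Collect every match of one pattern across all of the given inputs.
    static List<String> applyStage(String regex, List<String> inputs) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matches = new ArrayList<>();
        for (String input : inputs) {
            Matcher matcher = pattern.matcher(input);
            while (matcher.find()) {
                matches.add(matcher.group());
            }
        }
        return matches;
    }

    // Return the surviving matches after each pattern, in order.
    static List<List<String>> trace(String target) {
        List<List<String>> stages = new ArrayList<>();
        List<String> current = List.of(target);
        for (String regex : PATTERNS) {
            current = applyStage(regex, current);
            stages.add(current);
        }
        return stages;
    }

    public static void main(String[] args) {
        // A condensed copy of the demo markup.
        String demoText =
            "<ul>\n"
            + "<li><img src=\"very-sexy-girl.jpg\" class=\"saucy\" /></li>\n"
            + "<li><img id=\"banner\" src=\"coldfusion-ad.jpg\" class=\"saucy\" /></li>\n"
            + "<li><img id=\"footerBanner\" src=\"fbanner.png\" class=\"footer\" /></li>\n"
            + "</ul>";

        for (List<String> stage : trace(demoText)) {
            System.out.println(stage.size() + " match(es): " + stage);
        }
    }
}
```

With the demo markup, the four stages should leave three IMG tags, then the two "saucy" tags, then two src="..." pairs, and finally the two bare file names.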

Now that you see how reMultiMatch() might be used, let's take a look at how this ColdFusion user defined function is actually built. Underneath the hood, it compiles each regular expression down into an instance of the Java Pattern class. This gives us a more robust regular expression feature set as well as a quicker execution than you'd find in the standard reMatch() function.

<cffunction
    name="reMultiMatch"
    access="public"
    returntype="array"
    output="false"
    hint="I return an array of regular expression matches defined by the first N-1 patterns applied in sequence to the given string.">

    <!--- Define arguments. --->
    <!---
        The first N-1 arguments will be regular expressions. The
        last argument will be the target string to which the regular
        expressions will be applied.
    --->

    <!--- Define the local scope. --->
    <cfset var local = {} />

    <!---
        Check to make sure at least two arguments were passed into
        the function. If not, we don't have at least one regular
        expression pattern to apply.
    --->
    <cfif (arrayLen( arguments ) lt 2)>

        <!--- Invalid argument list. --->
        <cfthrow
            type="InvalidArguments"
            message="This function expects at least 2 arguments."
            detail="This function expects (N GTE 1) regular expressions followed by the target string to which the regular expressions should be applied."
            />

    </cfif>

    <!---
        Create a Pattern class instance that we can use to compile
        our regular expression patterns.

        NOTE: We will be using the Java regular expression engine,
        not the POSIX engine; as such, we will have a much more
        robust set of regex capabilities (and speed).
    --->
    <cfset local.patternClass = createObject( "java", "java.util.regex.Pattern" ) />

    <!---
        Let's compile our regular expression patterns down to
        Pattern objects so that we can get matcher objects.
    --->
    <cfset local.patterns = [] />

    <!---
        The first N-1 arguments are the patterns - let's loop
        over each and append them to the array. Then, we can just
        apply the array of patterns in sequence.
    --->
    <cfloop
        index="local.argumentIndex"
        from="1"
        to="#(arrayLen( arguments ) - 1)#"
        step="1">

        <!--- Append the compiled pattern. --->
        <cfset arrayAppend(
            local.patterns,
            local.patternClass.compile(
                javaCast( "string", arguments[ local.argumentIndex ] )
                )
            ) />

    </cfloop>

    <!---
        Now that we have compiled our patterns, let's get a
        handle on our target string. This should be the last
        argument passed-in.

        NOTE: Because we are coming out of an arguments loop, we know
        that the current argument index will be pointing to the last
        argument index possible.
    --->
    <cfset local.targetString = arguments[ local.argumentIndex ] />

    <!---
        At this point, we need to start applying the pattern matching
        to the target string. Since this is going to use an iterative
        approach, we need to create an intermediary result set over
        which the patterns will be applied.

        NOTE: To allow the pattern application iteration to be
        applied in a more uniform manner, we are going to start the
        intermediary result set with the target string (as if the
        previous round matched it).
    --->
    <cfset local.currentMatches = [ local.targetString ] />

    <!---
        As we iterate, we will want to replace the current result
        set. As such, we need to create an intermediary set for the
        next iteration.
    --->
    <cfset local.nextMatches = [] />

    <!--- Loop over each pattern. --->
    <cfloop
        index="local.pattern"
        array="#local.patterns#">

        <!---
            For each pattern, we want to find matches in the entire
            set of current results. As such, we want to loop over the
            current results and apply a matcher individually.
        --->
        <cfloop
            index="local.match"
            array="#local.currentMatches#">

            <!---
                Apply the current pattern to this match and get a
                matcher which we can use to extract the matches.
            --->
            <cfset local.matcher = local.pattern.matcher(
                javaCast( "string", local.match )
                ) />

            <!---
                Loop over each match and add it to the set of next
                matches (used in the next iteration).
            --->
            <cfloop condition="local.matcher.find()">

                <!--- Append to the set of next matches. --->
                <cfset arrayAppend(
                    local.nextMatches,
                    local.matcher.group()
                    ) />

            </cfloop>

        </cfloop>

        <!---
            Now that we have applied this pattern to each match in
            the previously available match set, we need to swap the
            matches and allow the next iteration to continue.
        --->
        <cfset local.currentMatches = local.nextMatches />

        <!---
            Reset the next matches to allow a clean aggregation of
            matches in the next iteration.
        --->
        <cfset local.nextMatches = [] />

    </cfloop>

    <!---
        Return the matches aggregated by the last pattern
        application. This should be all the matches that made it
        through each matching iteration.
    --->
    <cfreturn local.currentMatches />
</cffunction>

The underlying code is not too bad; essentially, it's just hiding the grunt work of having to manually apply each regular expression pattern in succession.

When it comes to string parsing, regular expressions can feel both like a gift and a curse. Hopefully, with a function like reMultiMatch(), we can keep the complexity of our regular expressions lower while still experiencing the power that a more complex regular expression would provide. And of course, the more straightforward our patterns are, the easier they are to read. And if you've ever had to debug a complex regular expression, the readability of smaller patterns might be reason enough to try this approach.




Reader Comments

Hi Ben
Interesting function. Some observations:
* Wouldn't an array of regexes in the first argument be slightly more logical / predictable / tidy, than 1->n string arguments?

* Also, isn't it normal to have the required args first? IE: one always needs the target string, so it would be more natural to have that as an argument before the 2->n regexes (ie, obviously the first regex string is required too)? And accordingly, to keep things logically grouped, have the target string argument first, then the regex argument(s) after that? I realise you're trying to match the arguments for reMatch(), but I think what you've ended up with is a bit unnatural given your approach. It'd be less unnatural if you passed an array of regexes, not individual arguments, though.

Anyway, as with everything, opinions and mileage differ.

Good work.

--
Adam

Hmmm, this could be a nice shortcut function for progressively narrowing down, but I must object to your example using HTML!

Regex is a great tool, but it is for parsing text, *not* for parsing HTML.

Complexity aside, the problem is that HTML is very flexible - its text representation can change without the HTML itself changing, and the HTML can change in ways that might not matter - and these both cause problems for the relative explicitness of regex. For example:

<img src='very-sexy-girl.jpg' class="saucy girl" />

That causes problems, and requires much more complex regex, even with the reMultiMatch approach.

And that just seems like too much work when there's already a much better way of doing this:

jQuery('img.saucy').attr('src')

:)

@Adam,

I think absolutely an array of regular expressions would be much nicer. The reason I didn't go that way was because implicit array creation cannot be used directly in function calls:

fn( [ val ] )

... until ColdFusion 9 (finally added, woohooo!!!!). As such, people would need an intermediary array to hold the patterns:

p = [ val ]
fn( p )

... and I was just trying to keep this as streamlined as possible to keep the whole "single pass" feeling.

That said, variable-length arguments always make me feel a bit funny, so I think we're on the same page.

As far as the patterns being first, as you've concluded, the only reason I went that way was to try and keep this somewhat in step with the native reMatch() and reFind() functions, which both take the regular expression as the first argument and the target string as the latter argument. Of course, reReplace() takes the target string as the first argument, so perhaps consistency is a moot point.

All in all, I think the inline, implicit array of patterns would be the nicest approach.

@Peter,

HTML was the only example that I could think of :) Also, someone had sent me a question of a similar nature, which was how I happened to get this idea, so I was starting in a bit of a biased place.

@Peter
How can you use jQuery to parse a document server side? My understanding was it did client side work? The only other way I thought it could be done was to parse it as an XML document, but the same reasons you give against using regex to pattern match HTML apply to parsing HTML as an XML document. At least with regex it won't fail to validate the document and abort before you start; so you start out with a better chance of success.

@Ben
To account for double/single quotes, make this tiny mod to look for " or ': [""|']

@MikeG,

You can use Rhino to run jQuery on the server... but it's a beast of a task. Plus, you'd probably also have to use something like TagSoup to ensure you have valid XHTML (required by Rhino).

As far as the quote issues, yeah you could definitely do something like that. I tend to break those out into two different branches:

("[^"]*"|'[^']*')

But, same idea.
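The two-branch idea can be sketched roughly as follows, in plain Java since that is the engine reMultiMatch() uses underneath. The class name QuoteBranches and the srcValues() helper are invented for this example, and prefixing the alternation with src\s*=\s* is an assumption about how it would be wired into the third pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteBranches {

    // The src attribute followed by either a double-quoted or a
    // single-quoted value; each quote style gets its own branch.
    static final Pattern SRC_ATTRIBUTE =
        Pattern.compile("src\\s*=\\s*(\"[^\"]*\"|'[^']*')");

    // Hypothetical helper: return the src values with their quotes removed.
    static List<String> srcValues(String html) {
        List<String> values = new ArrayList<>();
        Matcher matcher = SRC_ATTRIBUTE.matcher(html);
        while (matcher.find()) {
            String quoted = matcher.group(1);
            // Drop the opening and closing quote characters.
            values.add(quoted.substring(1, quoted.length() - 1));
        }
        return values;
    }

    public static void main(String[] args) {
        String html =
            "<img src='very-sexy-girl.jpg' class=\"saucy girl\" />"
            + "<img id=\"banner\" src=\"coldfusion-ad.jpg\" class=\"saucy\" />";

        System.out.println(srcValues(html));
    }
}
```

This only covers the quote style of the src attribute. Unquoted values and multi-value class attributes such as class="saucy girl" would still need their own handling.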

Mike, jQuery is a JavaScript library, and JS generally runs inside a browser client, but that's a convention not a restriction. There's a server-side JS project called Rhino which I think can run jQuery.
Also, there are other (non-JS/jQuery) selector tools starting to come out that will work on the server-side, for example:
http://github.com/chrsan/css-selectors/tree

I'm not sure what you're saying with the XML comment. Yes, using regex against XML has similar problems as using regex against HTML.
Using XPath against a HTML DOM is possible (even if the original code isn't valid XHTML), but even so that never seems to work well.

Also important to note that in regex the | character indicates alternation in most cases, but not inside a character class, where it is literal. So ["|'] means " or | or ' rather than just the quotes.
And don't forget that quotes are optional in HTML in many places. :)
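The character-class point is easy to check directly against java.util.regex. The snippet below is only a small sketch; the class name PipeInCharacterClass and the matchesWhole() helper are made up for the illustration.

```java
import java.util.regex.Pattern;

class PipeInCharacterClass {

    // True when the whole input matches the pattern.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Inside [...] the pipe is just another literal character.
        System.out.println(matchesWhole("[\"|']", "|"));  // true
        System.out.println(matchesWhole("[\"|']", "'"));  // true

        // Without the pipe, the class matches only the two quote characters.
        System.out.println(matchesWhole("[\"']", "|"));   // false

        // Outside a class, the pipe means alternation.
        System.out.println(matchesWhole("(\"|')", "|"));  // false
        System.out.println(matchesWhole("(\"|')", "\"")); // true
    }
}
```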

@Peter
The point I was trying to make is that, given a choice between regex and XPath against the DOM for searching user-generated HTML, regex seems like the option that would meet with the best success.

Hi Ben
<cfset srcValues = reMultiMatch(listToArray("<img[^>]+>,(?=.+?class\s*=\s*""saucy"").+,src\s*=\s*""[^""]+"",(?<="").+(?="")"), demoText)>

listToArray() works fine for creating "implicit" arrays, if the array elements are strings.

--
Adam

@Adam,

Ah, very nice. Yeah that's a quality approach. I use that all the time in conjunction with the queryAddColumn() method which requires an array of default values.

This is coolness and very useful. I did something similar in JavaScript for the XRegExp.matchChain method ( http://xregexp.com/api/#matchChain ), with the added ability to pass forward a specific backreference to the next regex. I also used recursion to keep the code nice and lightweight (which of course is less imperative in CF-land than JS). Incidentally, our usage examples are also very similar. :)

@Steve,

Also, my idea to use a positive look-ahead to check for a sub-string before I matched something else is specifically taken out of your RegEx Cookbook :) When I saw that recipe, it kind of blew my mind.

@Steve,

Also, I love the idea of passing through a specific group to the next matching. That's brilliant. I had that as a passing thought but didn't even attempt it.
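For the curious, passing a specific group forward might look something like the sketch below. It is written in plain Java against java.util.regex and is only loosely in the spirit of the idea; it is not a port of XRegExp.matchChain, and the names GroupChain, Stage, and matchChain() are all hypothetical. Each stage names the capture group to hand to the next stage, with 0 meaning the whole match (which is what reMultiMatch() always does).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupChain {

    // One step in the chain: a compiled pattern plus the capture group
    // whose text is passed forward (0 = the entire match).
    static final class Stage {
        final Pattern pattern;
        final int group;

        Stage(String regex, int group) {
            this.pattern = Pattern.compile(regex);
            this.group = group;
        }
    }

    static List<String> matchChain(String target, List<Stage> stages) {
        List<String> current = List.of(target);
        for (Stage stage : stages) {
            List<String> next = new ArrayList<>();
            for (String input : current) {
                Matcher matcher = stage.pattern.matcher(input);
                while (matcher.find()) {
                    String value = matcher.group(stage.group);
                    // A group that did not take part in the match is null; skip it.
                    if (value != null) {
                        next.add(value);
                    }
                }
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        String html =
            "<li><img src=\"very-sexy-girl.jpg\" class=\"saucy\" /></li>\n"
            + "<li><img id=\"banner\" src=\"coldfusion-ad.jpg\" class=\"saucy\" /></li>\n"
            + "<li><img id=\"footerBanner\" src=\"fbanner.png\" class=\"footer\" /></li>";

        List<Stage> stages = List.of(
            new Stage("<img[^>]+>", 0),
            new Stage("(?=.+?class\\s*=\\s*\"saucy\").+", 0),
            // Group 1 captures the value, so no fourth pattern is needed.
            new Stage("src\\s*=\\s*\"([^\"]+)\"", 1)
        );

        System.out.println(matchChain(html, stages));
    }
}
```

With group passing, the last two patterns from the demo collapse into one, since the capture group does the job of the look-behind and look-ahead pair.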

Ben,

I stumbled on this while searching for an answer to a problem I'm working on currently.

Without too much detail, I'm trying to output, in a simple text form, all the alt="" params of all the images on an external page. I just want the text portion between the "" in the alt.

What you have above looks promising for this application, but unfortunately my skills aren't such that I can figure out how to implement it.

Any thoughts?

Thanks.

@Matt,

Actually, my example maps pretty closely to that. I am checking for a given class - you are checking for an external link. Then, I'm checking for the SRC, you are checking for ALT. We can probably rework the demo to match your situation:

reMultiMatch(
    "<img[^>]+>",
    "(?=.+?src\s*=\s*""http://(?!YOUR_DOMAIN_NAME)).+",
    "alt\s*=\s*""[^""]+""",
    "(?<="").+(?="")",
    demoText
    )

Maybe something like that? You'd have to replace the YOUR_DOMAIN_NAME with your local domain name. The negative look-ahead checks to make sure the src value doesn't start with your domain.

Of course, this could get complicated if your local source values don't have HTTP in them. Perhaps. It depends on what kind of data you're dealing with.
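As a rough check of that chain, here is a sketch in plain Java (the engine the function uses underneath). The class ExternalAltText, its helpers, and the example.com and other.test domains are all stand-ins invented for the illustration. Pattern.quote() is used so that the dots in the domain name are treated literally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExternalAltText {

    // Collect every match of one pattern across all of the given inputs.
    static List<String> applyStage(String regex, List<String> inputs) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matches = new ArrayList<>();
        for (String input : inputs) {
            Matcher matcher = pattern.matcher(input);
            while (matcher.find()) {
                matches.add(matcher.group());
            }
        }
        return matches;
    }

    // Hypothetical wrapper: ALT text of images whose src starts with
    // http:// but does not point at the given local domain.
    static List<String> externalAltText(String html, String localDomain) {
        String[] regexes = {
            "<img[^>]+>",
            "(?=.+?src\\s*=\\s*\"http://(?!" + Pattern.quote(localDomain) + ")).+",
            "alt\\s*=\\s*\"[^\"]+\"",
            "(?<=\").+(?=\")"
        };
        List<String> current = List.of(html);
        for (String regex : regexes) {
            current = applyStage(regex, current);
        }
        return current;
    }

    public static void main(String[] args) {
        String html =
            "<img src=\"http://example.com/logo.png\" alt=\"Our logo\" />\n"
            + "<img src=\"http://other.test/ad.png\" alt=\"Partner ad\" />\n"
            + "<img src=\"/local.png\" alt=\"Relative path\" />";

        System.out.println(externalAltText(html, "example.com"));
    }
}
```

Note that the relative src is dropped along with the local absolute one, because the second pattern insists on a leading http://. That is the complication mentioned above.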

Hiya Ben

Is it possible to do something like this? I keep getting errors when I modify the code.

Illegal repetition {doc:load[^>]+>

{doc:load type="component" pos="left"}

Extracting either attribute

I'm just really confused. I've tried multiple things and no luck.

Thanks

Jody

@Jody,

The "{" and the "}" are special regular expression pattern characters. If you are trying to match them, literally, you need to escape them with a back-slash:

"\{ ... \}"

I hope that helps.
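Here is a small sketch of the difference, in plain Java since the "Illegal repetition" message comes from java.util.regex itself. The class BraceEscaping, its helpers, and the three-step chain for pulling out one attribute are assumptions made up for the example; only the sample tag comes from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class BraceEscaping {

    // True when java.util.regex accepts the pattern.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Collect every match of one pattern across all of the given inputs.
    static List<String> applyStage(String regex, List<String> inputs) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matches = new ArrayList<>();
        for (String input : inputs) {
            Matcher matcher = pattern.matcher(input);
            while (matcher.find()) {
                matches.add(matcher.group());
            }
        }
        return matches;
    }

    // Hypothetical chain: the {doc:load ...} tag, then the named
    // attribute, then the text between its quotes.
    static List<String> attributeValues(String text, String attributeName) {
        String[] regexes = {
            "\\{doc:load[^}]+\\}",
            Pattern.quote(attributeName) + "\\s*=\\s*\"[^\"]+\"",
            "(?<=\").+(?=\")"
        };
        List<String> current = List.of(text);
        for (String regex : regexes) {
            current = applyStage(regex, current);
        }
        return current;
    }

    public static void main(String[] args) {
        String text = "{doc:load type=\"component\" pos=\"left\"}";

        System.out.println(compiles("{doc:load[^>]+>"));      // false: unescaped brace
        System.out.println(compiles("\\{doc:load[^}]+\\}"));  // true
        System.out.println(attributeValues(text, "type"));    // [component]
        System.out.println(attributeValues(text, "pos"));     // [left]
    }
}
```

In the ColdFusion call, the Java-style double backslashes would be single ones ("\{doc:load[^}]+\}"), as in the escaped form shown above.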