Learning ColdFusion 8: REMatch() For Regular Expression Matching

Posted June 13, 2007 at 8:21 AM

Tags: ColdFusion

ColdFusion has been awesome at finding regular expression matches in strings with REFind() and REFindNoCase(), and it has been easy to alter strings with regular expressions using REReplace() and REReplaceNoCase(). But, until now, ColdFusion has been extremely weak when it came to retrieving substrings from a chunk of text using regular expressions. Thankfully, ColdFusion 8 has introduced REMatch() and REMatchNoCase(). REMatch() searches through a string using a regular expression and returns all of the matching substrings in an array.

To demonstrate the power and ease of use of this new regular expression matching function, let's grab some search results from Google.com and then pull the titles and links out of that content using a regular expression:


    <!---
        Let's search for some google results for the phrase
        "Maria Bello is hot" (seriously!!). We are going to
        grab the first page in our CFHttp object.
    --->
    <cfhttp
        url="http://www.google.com/search?q=maria+bello+is+hot"
        method="GET"
        useragent="#CGI.http_user_agent#"
        result="objGet">

        <!---
            Tell Google that we just came from the Google.com
            home page. This is not necessary, really, but it is
            good practice when dealing with better security.
        --->
        <cfhttpparam
            type="CGI"
            name="referer"
            value="http://www.google.com"
            />

    </cfhttp>


    <!---
        Now, we want to grab all of the titles from our
        search results. Based on looking at the source of
        the results page, we can gather that there is a
        consistent pattern for the HTML. We can use this
        pattern to create a regular expression based on
        the H2 tags and A tags.
    --->
    <cfset arrTitles = REMatch(
        "(?i)<h2 class=r><a[^>]+>.+?</a></h2>",
        objGet.FileContent
        ) />


    <!---
        At this point, our arrTitles should be an array that
        holds zero or more (hopefully more) matches to our
        above regular expression.
    --->
    <cfdump
        var="#arrTitles#"
        label="Maria Bello Is Hot REMatch()"
        />

Here, we are using the combination of H2 and A tags to create our search results pattern. Running the above code, we get the following CFDump output:

[Image: ColdFusion REMatch() Regular Expression Matching]

How easy was that?!?

If you have not performed this action before, you might not even realize what kind of a gold mine this is. Let's compare the REMatch() function to how something like this would have been accomplished in pre-ColdFusion-8 days. One way would be to keep looping over REFind() using the returned Len and Pos arrays... but honestly, that is the clunkiest of all possible solutions and it is a shame that ColdFusion ever expected people to do that (no offense if that's how you do it). So, instead, I will compare this to the ColdFusion / Java equivalent.

The Java layer underneath ColdFusion gives us the java.util.regex Pattern and Matcher classes, which allow us to loop over pattern matches in a string:


    <!---
        Create a Java pattern that will match our H2 and A
        search results pattern.
    --->
    <cfset objPattern = CreateObject(
        "java",
        "java.util.regex.Pattern"
        ).Compile(
            "(?i)<h2 class=r><a[^>]+>.+?</a></h2>"
            ) />


    <!---
        Create a matcher that will apply this pattern to
        our Google search results (that we scraped via the
        CFHttp call above). This matcher will have the
        ability to loop over all pattern matches in the
        target string.
    --->
    <cfset objMatcher = objPattern.Matcher(
        objGet.FileContent
        ) />


    <!--- Create an array to hold our search results. --->
    <cfset arrTitles = [] />


    <!---
        Keep iterating over the matcher while there are
        still search results to be found.
    --->
    <cfloop condition="objMatcher.Find()">

        <!--- Add this match to our title array. --->
        <cfset ArrayAppend(
            arrTitles,
            objMatcher.Group()
            ) />

    </cfloop>


    <!---
        At this point, our arrTitles should be an array that
        holds zero or more (hopefully more) matches to our
        above regular expression.
    --->
    <cfdump
        var="#arrTitles#"
        label="Maria Bello Is Hot Java Pattern / Matcher"
        />

As you can see, this is not too complicated, but it is much more code than the above REMatch() method. And, just so you can see that these are doing the exact same thing, when you run the code above, you get the following CFDump output:

[Image: ColdFusion REMatch() Regular Expression Java Pattern Matching Equivalent]

If you are used to doing things like this (the above), then REMatch() is going to make your life a lot easier. Of course, REMatch() is no silver bullet. ColdFusion 8 and REMatch() still use the same regular expression engine that was used in earlier versions of ColdFusion. This means that the regular expressions used in REMatch() are still more limited than those used in the Java Pattern / Matcher solution.
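Since the Java Pattern / Matcher route remains the fallback for richer patterns, it may help to see the same loop as plain Java, outside of a ColdFusion page. This is only a minimal sketch: the class name TitleMatchLoop, the matchAll() helper, and the sample HTML are made up for illustration and merely stand in for objGet.FileContent (they are not real Google output).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleMatchLoop {

    // Same pattern as the ColdFusion examples above.
    static final Pattern TITLE_PATTERN = Pattern.compile(
        "(?i)<h2 class=r><a[^>]+>.+?</a></h2>"
    );

    // Hypothetical helper: collects every full match into a list,
    // in the order the matches appear in the content.
    static List<String> matchAll(String content) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = TITLE_PATTERN.matcher(content);

        // find() advances to the next match each time it is called.
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up sample markup standing in for objGet.FileContent.
        String html =
            "<h2 class=r><a href=\"http://example.com/a\">First</a></h2>"
            + "<p><a href=\"http://example.com/x\">Not a result</a></p>"
            + "<H2 class=r><a href=\"http://example.com/b\">Second</a></H2>";

        for (String match : matchAll(html)) {
            System.out.println(match);
        }
    }
}
```

The while loop over find() plays the same role as the CFLoop condition above, and group() with no arguments returns the whole match.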

In our above examples, we are grabbing the H2 tags as well as the links within them. We are doing this so that we don't just grab random links off the page (only the search results have that H2 tag with that class). But let's say we didn't want to retrieve the H2 tags, we just wanted to make sure they existed in our pattern. To achieve this, we could use positive look behinds and positive look aheads in our pattern:

(?i)(?<=<h2 class=r>)<a[^>]+>.+?</a>(?=</h2>)

Here, we are saying that the H2 tags have to start the pattern and end the pattern; however, we do NOT want to grab them. This is purely for assertive matching. This is a valid regular expression, but if you try to run it in the REMatch() function, it will throw a ColdFusion error about ill-formed regular expressions. ColdFusion 8, using the older regular expression engine, still cannot handle look-behinds.

If you are dead-set on using them, you can, of course, still use the Java Pattern / Matcher to grab the results. If we modified our Java Pattern / Matcher example above to use our new regular expression with look ahead/behinds, we would get the following CFDump output:

[Image: ColdFusion REMatch() Regular Expression Java Matching With Look Behinds]

Notice that, now, our search results only have the A tags. The H2 tags, used in the ahead/behind assertive matching, were not captured in our results group.
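The look-around behavior can be checked in isolation with a small Java sketch. The class name LookAroundMatch, the matchLinks() helper, and the sample markup are assumptions for illustration only; the pattern itself is the one shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookAroundMatch {

    // The H2 tags are asserted by the look-behind and look-ahead,
    // but they are not consumed, so they are not part of the match.
    static final Pattern LINK_PATTERN = Pattern.compile(
        "(?i)(?<=<h2 class=r>)<a[^>]+>.+?</a>(?=</h2>)"
    );

    // Hypothetical helper: returns only the A tags that sit directly
    // inside an <h2 class=r> wrapper.
    static List<String> matchLinks(String content) {
        List<String> links = new ArrayList<>();
        Matcher matcher = LINK_PATTERN.matcher(content);
        while (matcher.find()) {
            links.add(matcher.group());
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample markup, not real search output.
        String html =
            "<h2 class=r><a href=\"http://example.com/a\">First</a></h2>"
            + "<p><a href=\"http://example.com/x\">Not a result</a></p>";

        for (String link : matchLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Java only accepts look-behinds whose length is bounded, which is fine here because the asserted H2 opening tag is a fixed string.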

REMatchNoCase() does the same thing as REMatch(), except that it is not case-sensitive. However, you can build case-insensitivity directly into your regular expression using the flag (?i). Notice that all of my regular expressions above start with this flag - that makes them case-insensitive. Running:


    <cfset arrTitles = REMatch(
        "(?i)<h2 class=r><a[^>]+>.+?</a></h2>",
        objGet.FileContent
        ) />

... would be the same thing as running:


    <cfset arrTitles = REMatchNoCase(
        "<h2 class=r><a[^>]+>.+?</a></h2>",
        objGet.FileContent
        ) />

Notice that I have changed the method from REMatch() to REMatchNoCase() and removed the (?i) flag.
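On the Java side there is a similar pair of options: the inline (?i) flag, or the Pattern.CASE_INSENSITIVE compile flag. The sketch below is an analogy only and says nothing about how ColdFusion implements REMatchNoCase() internally; the class name CaseFlagComparison, its helper methods, and the sample input are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaseFlagComparison {

    // The pattern from the examples above, without the (?i) flag.
    static final String BASE = "<h2 class=r><a[^>]+>.+?</a></h2>";

    // Case-insensitivity built into the expression itself.
    static final Pattern INLINE_FLAG = Pattern.compile("(?i)" + BASE);

    // Case-insensitivity requested through a compile-time flag.
    static final Pattern COMPILE_FLAG =
        Pattern.compile(BASE, Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: gathers all full matches for a pattern.
    static List<String> matchAll(Pattern pattern, String content) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up sample markup in upper case.
        String html = "<H2 CLASS=R><A HREF=\"http://example.com\">Up</A></H2>";

        System.out.println(matchAll(INLINE_FLAG, html));
        System.out.println(matchAll(COMPILE_FLAG, html));
    }
}
```

Both calls should produce the same list for the same input, which mirrors the REMatch() with (?i) versus REMatchNoCase() comparison above.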

So, while REMatch() and REMatchNoCase() are not the cure-all for your regular expression matching, I am sure you can see that for less-complex patterns, these new ColdFusion 8 functions are going to make your life a lot more cozy. One thing that I think these functions lack is a scope. When you use REReplace(), you can tell it to replace ONE instance of the pattern or ALL instances of the pattern. It seems like REMatch() and REMatchNoCase() should have a similar optional third argument (defaulting to ALL). I can imagine situations where you know you might have multiple pattern matches but, for optimization, you only care about the first one and you want ColdFusion to stop matching after the first match. But oh well, maybe in ColdFusion 9 :)
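In the meantime, the Java Matcher can stand in for a "ONE" scope, because it only searches as far as it is asked to. Here is a minimal sketch, assuming a hypothetical firstMatch() helper and made-up sample markup: calling find() a single time stops at the first match and never scans the rest of the string.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstMatchOnly {

    // Same pattern as the examples above.
    static final Pattern TITLE_PATTERN = Pattern.compile(
        "(?i)<h2 class=r><a[^>]+>.+?</a></h2>"
    );

    // Hypothetical helper: returns the first match, if there is one.
    // find() is called exactly once, so matching stops right there.
    static Optional<String> firstMatch(String content) {
        Matcher matcher = TITLE_PATTERN.matcher(content);
        if (matcher.find()) {
            return Optional.of(matcher.group());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample markup with two results.
        String html =
            "<h2 class=r><a href=\"http://example.com/a\">First</a></h2>"
            + "<h2 class=r><a href=\"http://example.com/b\">Second</a></h2>";

        System.out.println(firstMatch(html).orElse("(no match)"));
    }
}
```

From ColdFusion, the same idea would amount to replacing the CFLoop in the Pattern / Matcher example with a single CFIf around objMatcher.Find().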


Reader Comments

Dustin
Jun 13, 2007 at 11:38 AM // reply »
42 Comments

Another thing that is not necessary, but is a good idea, would be to send a user-agent string along with your HTTP request so your script looks like a real browser.


Jun 13, 2007 at 11:44 AM // reply »
5,406 Comments

@Dustin,

Good point... I am actually sending the user agent as part of the CFHttp tag itself (#CGI.http_user_agent#), however this only works because I am testing in a browser. If this was launched via a scheduled task, I believe that shows up as "ColdFusion" user agent. Therefore, it is a good idea to actually put in an explicit user agent.

Additionally, this value can be done as a CFHttpParam:

<cfhttpparam
type="HEADER"
name="user-agent"
value="Mozilla......."
/>

Thanks for pointing that out.


Jun 13, 2007 at 12:37 PM // reply »
26 Comments

This function is fantastic-- thanks for blogging it, Ben! Just this Monday, as I was writing code using ReFindNoCase() and looping over the string to find each new match, I was wishing that CF would just have some function to return all of the matches across the whole string. Now I know it's there. Thank goodness-- it will save a lot of work!


Jun 13, 2007 at 12:47 PM // reply »
5,406 Comments

@Tom,

Yeah, looping through indexes is just so ganky and painful (I know because that's how I did it for a long, long time). If you are feeling adventurous, check out the Java-ish method above. It's a bit strange at first, but once you get going with it, it is AWESOME. Plus, the Matcher.Group() can take arguments to actually give you individual group matches. So, for example, Matcher.Group( 2 ) will return the second captured group.

Of course, if Java is not your thing, ColdFusion 8 should hopefully be here soon :D


Jun 14, 2007 at 3:34 PM // reply »
162 Comments

"these functions lack a scope"

What!? That's just retarded. It's extremely common to only want the first match. And the scope should default to "one", not "all", so that it would be like REReplace, as well as every other programming language's default regular expression matching construct that I'm aware of.


Jun 14, 2007 at 6:52 PM // reply »
5,406 Comments

@Steve,

I agree. If you only know you want the first match, or even more, if you know there is ONLY one match, you would be able to save a lot of processing by having it halt the moment it finds that match.

Check it out:

http://www.forta.com/blog/index.cfm/2007/5/4/Scorpio-Adds-Two-New-RegEx-Functions

Before Forta even had a chance to explain how the function worked, I took a guess... what's that third argument? That's right, a scope. Oh well, dare to dream :)


Jun 17, 2007 at 7:23 PM // reply »
63 Comments

On a different subject, I think that there may be something wrong with the page's syntax coloring (I'm viewing in IE7) - I can see all the text down the entry has gone navy, as though a tag is not closed...


Jun 17, 2007 at 7:25 PM // reply »
5,406 Comments

@Shuns,

Thanks for pointing that out. It looks like the REMatch() highlighting is getting messed up for some reason. Notice that the (?i) flag is a different color than the text next to it (which it should not be). I will check this first thing in the morning.


Jun 17, 2007 at 7:26 PM // reply »
5,406 Comments

I think attributes that have HTML tags in them (ex. "<h2>") confuse the algorithm.


Jun 17, 2007 at 7:30 PM // reply »
63 Comments

Yeah, that's what I was thinking :P


Jun 18, 2007 at 6:52 PM // reply »
5,406 Comments

@Shuns,

I have updated my color-coding algorithm. Can you take a quick look at this post and make sure the color coding is not breaking any more? It looks good to me, but I'd like a second pair of eyes.

Thanks.


Jun 18, 2007 at 6:58 PM // reply »
63 Comments

Yeah looks good to me, nice job.

BTW keep up the good work - a lot of very good info on your site :D


Jun 19, 2007 at 7:14 AM // reply »
5,406 Comments

Awesome dude, glad you enjoy it. Thanks for the heads up on the coloring issues. Hopefully that should be the last time.


Jul 3, 2008 at 4:49 PM // reply »
22 Comments

Ben,

Will this function work with re lookarounds?

Thanks,
Will


Jul 3, 2008 at 4:58 PM // reply »
5,406 Comments

@Will,

The underlying regular expression engine being used is still the same. So, I believe that it will handle POSITIVE look ahead/behinds. But, I think it still doesn't handle NEGATIVE look ahead/behinds. Definitely not negative look-behinds. Maybe nothing behind positive or negative. I've been using the Java regex for anything complicated, so I can't really remember.


Bard
Jul 16, 2008 at 12:42 PM // reply »
1 Comments

Ben,

you write: Let's compare the REMatch() function to how something like this would have been accomplished in pre-ColdFusion-8 days. In that sample you use the following code: <cfset arrTitles = [] />
which happens not to work in CF 7 - bad example! ;-)

Better use:
<cfset arrTitles = ArrayNew(1) />

Best wishes,
Bard


Jul 16, 2008 at 12:46 PM // reply »
5,406 Comments

@Bard,

Good catch ;)


Nov 5, 2008 at 11:29 AM // reply »
1 Comments

Fantastic post.

Saved me hours of work.

