Learning ColdFusion 8: REMatch() For Regular Expression Matching

Posted June 13, 2007 at 8:21 AM

Tags: ColdFusion

ColdFusion has been awesome at finding regular expression matches in strings with REFind() and REFindNoCase(), and it has been easy to alter strings with regular expressions using REReplace() and REReplaceNoCase(). But, until now, ColdFusion has been extremely weak when it came to retrieving substrings from a chunk of text using regular expressions. Thankfully, ColdFusion 8 has introduced REMatch() and REMatchNoCase(). REMatch() searches through a string using a regular expression and returns all matching substrings in an array.

To demonstrate the power and ease of use of this new regular expression matching function, let's grab some search results from Google.com and then from that content, let's grab out the titles and links using a regular expression:

    <!---
        Let's search for some google results for the phrase
        "Maria Bello is hot" (seriously!!). We are going to
        grab the first page in our CFHttp object.
    --->
    <cfhttp
        url="http://www.google.com/search?q=maria+bello+is+hot"
        method="GET"
        useragent="#CGI.http_user_agent#"
        result="objGet">

        <!---
            Tell Google that we just came from the Google.com
            home page. This is not necessary, really, but it is
            good practice when dealing with better security.
        --->
        <cfhttpparam
            type="CGI"
            name="referer"
            value="http://www.google.com"
            />

    </cfhttp>


    <!---
        Now, we want to grab all of the titles from our
        search results. Based on looking at the source of
        the results page, we can gather that there is a
        consistent pattern for the HTML. We can use this
        pattern to create a regular expression based on
        the H2 tags and A tags.
    --->
    <cfset arrTitles = REMatch(
        "(?i)<h2 class=r><a[^>]+>.+?</a></h2>",
        objGet.FileContent
        ) />


    <!---
        At this point, our arrTitles should be an array that
        holds zero or more (hopefully more) matches to our
        above regular expression.
    --->
    <cfdump
        var="#arrTitles#"
        label="Maria Bello Is Hot REMatch()"
        />

Here, we are using the combination of H2 and A tags to create our search results pattern. Running the above code, we get the following CFDump output:

[Image: ColdFusion REMatch() Regular Expression Matching]

How easy was that?!?

If you have not performed this action before, you might not even realize what kind of a gold mine this is. Let's compare the REMatch() function to how something like this would have been accomplished in pre-ColdFusion-8 days. One way would be to keep looping over REFind(), using the Len and Pos arrays it returns to pull out each match and advance the starting position... but honestly, that is the clunkiest of all possible solutions and it is shameful that ColdFusion even wanted people to do that (no offense if that's how you do it). So, instead, I will compare this to the ColdFusion / Java equivalent.

The Java layer underneath ColdFusion includes the java.util.regex Pattern and Matcher classes, which allow us to loop over pattern matches in a string:

    <!---
        Create a Java pattern that will match our H2 and A
        search results pattern.
    --->
    <cfset objPattern = CreateObject(
        "java",
        "java.util.regex.Pattern"
        ).Compile(
            "(?i)<h2 class=r><a[^>]+>.+?</a></h2>"
            ) />


    <!---
        Create a matcher that will apply this pattern to
        our Google search results (that we scraped via the
        CFHttp call above). This matcher will have the
        ability to loop over all pattern matches in the
        target string.
    --->
    <cfset objMatcher = objPattern.Matcher(
        objGet.FileContent
        ) />


    <!--- Create an array to hold our search results. --->
    <cfset arrTitles = [] />


    <!---
        Keep iterating over the matcher while there are
        still search results to be found.
    --->
    <cfloop condition="objMatcher.Find()">

        <!--- Add this match to our title array. --->
        <cfset ArrayAppend(
            arrTitles,
            objMatcher.Group()
            ) />

    </cfloop>


    <!---
        At this point, our arrTitles should be an array that
        holds zero or more (hopefully more) matches to our
        above regular expression.
    --->
    <cfdump
        var="#arrTitles#"
        label="Maria Bello Is Hot Java Pattern / Matcher"
        />

As you can see, this is not too complicated, but it is much more code than the above REMatch() method. And, just so you can see that these are doing the exact same thing, when you run the code above, you get the following CFDump output:

[Image: ColdFusion REMatch() Regular Expression Java Pattern Matching Equivalent]

If you are used to doing things like this (the above), then REMatch() is going to make your life a lot easier. Of course, REMatch() is no silver bullet. ColdFusion 8 and REMatch() still use the same regular expression engine that was used in earlier versions of ColdFusion. This means that the regular expressions used in REMatch() are still more limited than those used in the Java Pattern / Matcher solution.

In our above examples, we are grabbing the H2 tags as well as the links within them. We are doing this so that we don't just grab random links off the page (only the search results have that H2 tag with that class). But let's say we didn't want to retrieve the H2 tags, we just wanted to make sure they existed in our pattern. To achieve this, we could use positive look behinds and positive look aheads in our pattern:

(?i)(?<=<h2 class=r>)<a[^>]+>.+?</a>(?=</h2>)

Here, we are saying that the H2 tags have to start the pattern and end the pattern; however, we do NOT want to grab them. This is purely for assertive matching. This is a valid regular expression, but if you try to run it in the REMatch() function, it will throw a ColdFusion error about ill-formed regular expressions. ColdFusion 8, using the older regular expression engine, still cannot handle look-behinds.

If you are dead-set on using them, you can, of course, still use the Java Pattern / Matcher to grab the results. If we modified our Java Pattern / Matcher example above to use our new regular expression with look ahead/behinds, we would get the following CFDump output:

[Image: ColdFusion REMatch() Regular Expression Java Matching With Look Behinds]

Notice that, now, our search results only have the A tags. The H2 tags, used in the ahead/behind assertive matching, were not captured in our results group.
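For anyone who wants to try the look-behind pattern outside of ColdFusion, here is a rough sketch in plain Java. It uses the exact same regular expression, but the HTML string is a hard-coded stand-in for objGet.FileContent, and the LookaroundDemo class and findLinks() helper are names made up for this illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {

    // Same pattern as in the post: the H2 tags are asserted via
    // look-behind / look-ahead, but they are not part of the match.
    static final Pattern LINK_IN_H2 = Pattern.compile(
        "(?i)(?<=<h2 class=r>)<a[^>]+>.+?</a>(?=</h2>)"
    );

    // Hypothetical helper: collect every match of the pattern.
    static List<String> findLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = LINK_IN_H2.matcher(html);
        // Each call to find() advances to the next match.
        while (matcher.find()) {
            links.add(matcher.group());
        }
        return links;
    }

    public static void main(String[] args) {
        // Assumed sample markup standing in for the real Google response.
        String html =
            "<h2 class=r><a href=\"http://example.com/a\">First</a></h2>"
            + "<p><a href=\"http://example.com/x\">Not a result</a></p>"
            + "<H2 class=r><a href=\"http://example.com/b\">Second</a></H2>";

        for (String link : findLinks(html)) {
            System.out.println(link);
        }
    }
}
```

With this sample input, only the two links wrapped in the H2 tags are returned, without the H2 tags themselves; the link inside the paragraph is skipped because the look-behind fails.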

REMatchNoCase() does the same thing as REMatch(), except that it is case-insensitive. However, you can build case-insensitivity directly into your regular expression using the flag (?i). Notice that all of my regular expressions above start with this flag - that makes them case-insensitive. Running:

    <cfset arrTitles = REMatch(
        "(?i)<h2 class=r><a[^>]+>.+?</a></h2>",
        objGet.FileContent
        ) />

... would be the same thing as running:

    <cfset arrTitles = REMatchNoCase(
        "<h2 class=r><a[^>]+>.+?</a></h2>",
        objGet.FileContent
        ) />

Notice that I have changed the method from REMatch() to REMatchNoCase() and removed the (?i) flag.
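The Java Pattern / Matcher route offers a similar choice: the (?i) flag can live inside the pattern, or case-insensitivity can be requested when the pattern is compiled. The sketch below is only an illustration in plain Java; the CaseFlagDemo class, its method names, and the upper-case sample markup are all assumptions made for the example:

```java
import java.util.regex.Pattern;

class CaseFlagDemo {

    // The search-result pattern from the post, without any flag.
    static final String RESULT_PATTERN = "<h2 class=r><a[^>]+>.+?</a></h2>";

    // Inline flag: comparable to prefixing the REMatch() pattern with (?i).
    static boolean foundWithInlineFlag(String html) {
        return Pattern.compile("(?i)" + RESULT_PATTERN).matcher(html).find();
    }

    // Compile-time flag: comparable to switching to REMatchNoCase().
    static boolean foundWithCompileFlag(String html) {
        return Pattern
            .compile(RESULT_PATTERN, Pattern.CASE_INSENSITIVE)
            .matcher(html)
            .find();
    }

    // No flag at all: the match is case-sensitive.
    static boolean foundCaseSensitive(String html) {
        return Pattern.compile(RESULT_PATTERN).matcher(html).find();
    }

    public static void main(String[] args) {
        // Assumed sample: the same markup, but written in upper case.
        String html = "<H2 CLASS=R><A HREF=\"http://example.com\">Title</A></H2>";

        System.out.println("inline (?i) flag:  " + foundWithInlineFlag(html));
        System.out.println("CASE_INSENSITIVE:  " + foundWithCompileFlag(html));
        System.out.println("no flag:           " + foundCaseSensitive(html));
    }
}
```

Both flagged versions find the upper-case markup, while the unflagged pattern does not.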

So, while REMatch() and REMatchNoCase() are not the cure-all for your regular expression matching, I am sure you can see that for less-complex patterns, these new ColdFusion 8 functions are going to make your life a lot more cozy. One thing that I think these functions lack is a scope. When you use REReplace(), you can tell it to replace ONE instance of the pattern or ALL instances of the pattern. It seems like REMatch() and REMatchNoCase() should have an optional third argument for this (defaulting to ALL). I can imagine situations where you know you might have multiple pattern matches, but for optimization, you only care about the first one and you want ColdFusion to stop matching after the first match. But oh well, maybe in ColdFusion 9 :)
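In the meantime, the Java Matcher can already stop after the first hit, since each call to find() only advances as far as the next match. A minimal sketch in plain Java, assuming a hard-coded HTML string in place of objGet.FileContent; FirstMatchDemo and firstMatch() are hypothetical names used only here:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstMatchDemo {

    // Same search-result pattern used throughout the post.
    static final Pattern RESULT = Pattern.compile(
        "(?i)<h2 class=r><a[^>]+>.+?</a></h2>"
    );

    // Return only the first match, or an empty Optional if nothing matches.
    // find() is called once, so matching stops after the first hit.
    static Optional<String> firstMatch(String html) {
        Matcher matcher = RESULT.matcher(html);
        if (matcher.find()) {
            return Optional.of(matcher.group());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed sample markup with two search results.
        String html =
            "<h2 class=r><a href=\"http://example.com/a\">First</a></h2>"
            + "<h2 class=r><a href=\"http://example.com/b\">Second</a></h2>";

        System.out.println(firstMatch(html).orElse("(no match)"));
    }
}
```

This is roughly what a scope of ONE would do: the second result in the sample string is never touched.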

Reader Comments

Another not necessary but a good idea would be to send a user-agent string along with your http request so your script looks like a real browser.

Posted by Dustin on Jun 13, 2007 at 11:38 AM


@Dustin,

Good point... I am actually sending the user agent as part of the CFHttp tag itself (#CGI.http_user_agent#), however this only works because I am testing in a browser. If this was launched via a scheduled task, I believe that shows up as "ColdFusion" user agent. Therefore, it is a good idea to actually put in an explicit user agent.

Additionally, this value can be done as a CFHttpParam:

<cfhttpparam
type="HEADER"
name="user-agent"
value="Mozilla......."
/>

Thanks for pointing that out.

Posted by Ben Nadel on Jun 13, 2007 at 11:44 AM


This function is fantastic-- thanks for blogging it, Ben! Just this Monday, as I was writing code using ReFindNoCase() and looping over the string to find each new match, I was wishing that CF would just have some function to return all of the matches across the whole string. Now I know it's there. Thank goodness-- it will save a lot of work!

Posted by Tom Mollerus on Jun 13, 2007 at 12:37 PM


@Tom,

Yeah, looping through the indexes is just so ganky and painful (I know because that's how I did it for a long, long time). If you are feeling adventurous, check out the Java-ish method above. It's a bit strange at first, but once you get going with it, it is AWESOME. Plus, the Matcher.Group() can take arguments to actually give you individual group matches. So, for example, Matcher.Group( 2 ) will return the second captured group.

Of course, if Java is not your thing, ColdFusion 8 should hopefully be here soon :D

Posted by Ben Nadel on Jun 13, 2007 at 12:47 PM


"these functions lack a scope"

What!? That's just ridiculous. It's extremely common to only want the first match. And the scope should default to "one", not "all", so that it would be like REReplace, as well as every other programming language's default regular expression matching construct that I'm aware of.

Posted by Steve on Jun 14, 2007 at 3:34 PM


@Steve,

I agree. If you only know you want the first match, or even more, if you know there is ONLY one match, you would be able to save a lot of processing by having it halt the moment it finds that match.

Check it out:

http://www.forta.com/blog/index.cfm/2007/5/4/Scorpio-Adds-Two-New-RegEx-Functions

Before Forta even had a chance to explain how the function worked, I took a guess... what's that third argument? That's right, a scope. Oh well, dare to dream :)

Posted by Ben Nadel on Jun 14, 2007 at 6:52 PM


On a different subject, I think that there may be something wrong with the page's syntax coloring (I'm viewing in IE7) - I can see all the text down the entry has gone navy, as though a tag is not closed...

Posted by Shuns on Jun 17, 2007 at 7:23 PM


@Shuns,

Thanks for pointing that out. It looks like the REMatch() highlighting is getting messed up for some reason. Notice that the (?i) flag is a different color than the text next to it (which it should not be). I will check this first thing in the morning.

Posted by Ben Nadel on Jun 17, 2007 at 7:25 PM


I think attributes that have HTML tags in them (ex. "<h2>") confuse the algorithm.

Posted by Ben Nadel on Jun 17, 2007 at 7:26 PM


Yeah, that's what I was thinking :P

Posted by Shuns on Jun 17, 2007 at 7:30 PM


@Shuns,

I have updated my color-coding algorithm. Can you take a quick look at this post and make sure the color coding is not breaking any more. It looks good to me, but I'd like a second pair of eyes.

Thanks.

Posted by Ben Nadel on Jun 18, 2007 at 6:52 PM


Yeah looks good to me, nice job.

BTW keep up the good work - a lot of very good info on your site :D

Posted by Shuns on Jun 18, 2007 at 6:58 PM


Awesome dude, glad you enjoy it. Thanks for the heads up on the coloring issues. Hopefully that should be the last time.

Posted by Ben Nadel on Jun 19, 2007 at 7:14 AM

