Learning ColdFusion 8: REMatch() For Regular Expression Matching

Posted June 13, 2007 at 8:21 AM

Tags: ColdFusion

ColdFusion has been awesome at finding regular expression matches in strings with REFind() and REFindNoCase(), and it has been easy to alter strings with regular expressions using REReplace() and REReplaceNoCase(). But, until now, ColdFusion has been extremely weak when it came to retrieving substrings from a chunk of text using regular expressions. Thankfully, ColdFusion 8 has introduced REMatch() and REMatchNoCase(). REMatch() searches through a string using a regular expression and returns all of the matching substrings in an array.

To demonstrate the power and ease of use of this new regular expression matching function, let's grab some search results from Google.com and then from that content, let's grab out the titles and links using a regular expression:


    <!---
        Let's search for some google results for the phrase
        "Maria Bello is hot" (seriously!!). We are going to
        grab the first page in our CFHttp object.
    --->
    <cfhttp
        url="http://www.google.com/search?q=maria+bello+is+hot"
        method="GET"
        useragent="#CGI.http_user_agent#"
        result="objGet">

        <!---
            Tell Google that we just came from the Google.com
            home page. This is not necessary, really, but it is
            good practice when dealing with better security.
        --->
        <cfhttpparam
            type="CGI"
            name="referer"
            value="http://www.google.com"
            />

    </cfhttp>


    <!---
        Now, we want to grab all of the titles from our
        search results. Based on looking at the source of
        the results page, we can gather that there is a
        consistent pattern for the HTML. We can use this
        pattern to create a regular expression based on
        the H2 tags and A tags.
    --->
    <cfset arrTitles = REMatch(
        "(?i)<h2 class=r><a[^>]+>.+?</a></h2>",
        objGet.FileContent
        ) />


    <!---
        At this point, our arrTitles should be an array that
        holds zero or more (hopefully more) matches to our
        above regular expression.
    --->
    <cfdump
        var="#arrTitles#"
        label="Maria Bello Is Hot REMatch()"
        />

Here, we are using the combination of H2 and A tags to create our search results pattern. Running the above code, we get the following CFDump output:

[Image: ColdFusion REMatch() Regular Expression Matching]

How easy was that?!?

If you have not performed this action before, you might not even realize what kind of a gold mine this is. Let's compare the REMatch() function to how something like this would have been accomplished in pre-ColdFusion-8 days. One way would be to keep looping over REFind() using the Len and Pos arrays... but honestly, that is the most ghetto of all possible solutions and it is shameful that ColdFusion even wanted people to do that (no offense if that's how you do it). So, instead, I will compare this to the ColdFusion / Java equivalent.
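For context, the REFind() looping approach mentioned above might look roughly like the following. This is only a minimal sketch, run against a hard-coded sample string instead of the Google results; strText, its contents, and the simple [0-9]+ pattern are made up for illustration:

```cfm
<!--- Hypothetical sample text; any string would do. --->
<cfset strText = "Order 12 shipped, order 345 pending, order 6 lost." />

<!--- ArrayNew() rather than [] so that this also runs before ColdFusion 8. --->
<cfset arrMatches = ArrayNew( 1 ) />

<!--- Start searching at the first character. --->
<cfset intStart = 1 />

<cfloop condition="true">

    <!---
        Passing TRUE as the fourth argument makes REFind() return
        a struct holding Pos and Len arrays rather than an index.
    --->
    <cfset objFind = REFind( "[0-9]+", strText, intStart, true ) />

    <!--- A position of zero means there are no more matches. --->
    <cfif (objFind.Pos[ 1 ] EQ 0)>
        <cfbreak />
    </cfif>

    <!--- Pull out the matched substring using position and length. --->
    <cfset ArrayAppend(
        arrMatches,
        Mid( strText, objFind.Pos[ 1 ], objFind.Len[ 1 ] )
        ) />

    <!--- Resume the search just after the end of this match. --->
    <cfset intStart = (objFind.Pos[ 1 ] + objFind.Len[ 1 ]) />

</cfloop>

<!--- arrMatches now holds "12", "345", and "6". --->
<cfdump var="#arrMatches#" label="REFind() loop" />
```

Note that this sketch assumes the pattern can never match an empty string; a zero-length match would leave intStart unchanged and the loop would never end.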

ColdFusion runs on top of Java, and Java's regular expression classes (Pattern and Matcher) allow us to loop over the pattern matches in a string:


    <!---
        Create a Java pattern that will match our H2 and A
        search results pattern.
    --->
    <cfset objPattern = CreateObject(
        "java",
        "java.util.regex.Pattern"
        ).Compile(
            "(?i)<h2 class=r><a[^>]+>.+?</a></h2>"
            ) />


    <!---
        Create a matcher that will apply this pattern to
        our Google search results (that we scraped via the
        CFHttp call above). This matcher will have the
        ability to loop over all pattern matches in the
        target string.
    --->
    <cfset objMatcher = objPattern.Matcher(
        objGet.FileContent
        ) />


    <!---
        Create an array to hold our search results. Note that
        the [] literal is new in ColdFusion 8; on earlier
        versions, use ArrayNew( 1 ) instead.
    --->
    <cfset arrTitles = [] />


    <!---
        Keep iterating over the matcher while there are
        still search results to be found.
    --->
    <cfloop condition="objMatcher.Find()">

        <!--- Add this match to our title array. --->
        <cfset ArrayAppend(
            arrTitles,
            objMatcher.Group()
            ) />

    </cfloop>


    <!---
        At this point, our arrTitles should be an array that
        holds zero or more (hopefully more) matches to our
        above regular expression.
    --->
    <cfdump
        var="#arrTitles#"
        label="Maria Bello Is Hot Java Pattern / Matcher"
        />

As you can see, this is not too complicated, but it is much more code than the above REMatch() method. And, just so you can see that these are doing the exact same thing, when you run the code above, you get the following CFDump output:

[Image: ColdFusion REMatch() Regular Expression Java Pattern Matching Equivalent]

If you are used to doing things like this (the above), then REMatch() is going to make your life a lot easier. Of course, REMatch() is no silver bullet. ColdFusion 8 and REMatch() still use the same regular expression engine that was used in earlier versions of ColdFusion. This means that the regular expressions used in REMatch() are still more limited than those used in the Java Pattern / Matcher solution.

In our above examples, we are grabbing the H2 tags as well as the links within them. We are doing this so that we don't just grab random links off the page (only the search results have that H2 tag with that class). But let's say we didn't want to retrieve the H2 tags; we just wanted to make sure they existed in our pattern. To achieve this, we could use a positive look-behind and a positive look-ahead in our pattern:

(?i)(?<=<h2 class=r>)<a[^>]+>.+?</a>(?=</h2>)

Here, we are saying that the H2 tags have to start and end the pattern; however, we do NOT want to grab them. They are there purely as assertions. This is a valid regular expression, but if you try to run it in the REMatch() function, ColdFusion will throw an error about an ill-formed regular expression. ColdFusion 8, using the older regular expression engine, still cannot handle look-behinds.

If you are dead-set on using them, you can, of course, still use the Java Pattern / Matcher to grab the results. If we modified our Java Pattern / Matcher example above to use our new regular expression with the look-ahead / look-behind, we would get the following CFDump output:

[Image: ColdFusion REMatch() Regular Expression Java Matching With Look Behinds]

Notice that, now, our search results only have the A tags. The H2 tags, used in the ahead/behind assertive matching, were not captured in our results group.
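To make that modification concrete, here is a minimal sketch of the Pattern / Matcher code using the look-behind / look-ahead expression. So that it can run on its own, it searches a small hard-coded HTML fragment instead of objGet.FileContent; strHtml, arrLinks, and the example.com links are made up for illustration:

```cfm
<!--- Hypothetical stand-in for the scraped Google results. --->
<cfset strHtml = '<h2 class=r><a href="http://example.com/a">First</a></h2><p><a href="http://example.com/x">Not a result</a></p><h2 class=r><a href="http://example.com/b">Second</a></h2>' />

<!---
    Same pattern as above: the H2 tags must surround the link,
    but they are only asserted, not captured.
--->
<cfset objPattern = CreateObject(
    "java",
    "java.util.regex.Pattern"
    ).Compile(
        "(?i)(?<=<h2 class=r>)<a[^>]+>.+?</a>(?=</h2>)"
        ) />

<cfset objMatcher = objPattern.Matcher( strHtml ) />

<cfset arrLinks = ArrayNew( 1 ) />

<!--- Collect each match; only the A tags end up in the array. --->
<cfloop condition="objMatcher.Find()">

    <cfset ArrayAppend( arrLinks, objMatcher.Group() ) />

</cfloop>

<cfdump var="#arrLinks#" label="Look-behind / look-ahead matches" />
```

With this sample fragment, the array should hold the "First" and "Second" links only; the link inside the P tag is skipped because it is not preceded by the H2 tag.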

REMatchNoCase() does the same thing as REMatch(), except that it is not case sensitive. However, you can also build case-insensitivity directly into your regular expression using the (?i) flag. Notice that all of my regular expressions above start with this flag, which makes them case-insensitive. Running:


    <cfset arrTitles = REMatch(
        "(?i)<h2 class=r><a[^>]+>.+?</a></h2>",
        objGet.FileContent
        ) />

... would be the same thing as running:


    <cfset arrTitles = REMatchNoCase(
        "<h2 class=r><a[^>]+>.+?</a></h2>",
        objGet.FileContent
        ) />

Notice that I have changed the method from REMatch() to REMatchNoCase() and removed the (?i) flag.

So, while REMatch() and REMatchNoCase() are not the cure-all for your regular expression matching, I am sure you can see that for less-complex patterns, these new ColdFusion 8 functions are going to make your life a lot more cozy. One thing that I think these functions lack is a scope. When you use REReplace(), you can tell it to replace ONE instance of the pattern or ALL instances of the pattern. It seems like REMatch() and REMatchNoCase() should accept the same thing as an optional third argument (defaulting to ALL). I can imagine situations where you know you might have multiple pattern matches but, for optimization, you only care about the first one and you want ColdFusion to stop matching after the first match. But oh well, maybe in ColdFusion 9 :)
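In the meantime, when only the first match matters, one possible workaround is a single REFind() call with its fourth argument set to true, followed by Mid(). This is just a sketch under that assumption, not a replacement for a real scope argument; strText and strFirst are made-up names, and REFindNoCase() could be swapped in for case-insensitive matching:

```cfm
<!--- Hypothetical sample text. --->
<cfset strText = "Order 12 shipped, order 345 pending." />

<!--- One call only: REFind() stops at the first match it finds. --->
<cfset objFind = REFind( "[0-9]+", strText, 1, true ) />

<cfif objFind.Pos[ 1 ]>

    <!--- Extract the first match using its position and length. --->
    <cfset strFirst = Mid( strText, objFind.Pos[ 1 ], objFind.Len[ 1 ] ) />

<cfelse>

    <!--- No match was found. --->
    <cfset strFirst = "" />

</cfif>

<!--- Outputs: 12 --->
<cfoutput>#strFirst#</cfoutput>
```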


Reader Comments

Dustin
Jun 13, 2007 at 11:38 AM

Another not necessary but a good idea would be to send a user-agent string along with your http request so your script looks like a real browser.


Ben Nadel
Jun 13, 2007 at 11:44 AM

@Dustin,

Good point... I am actually sending the user agent as part of the CFHttp tag itself (#CGI.http_user_agent#), however this only works because I am testing in a browser. If this was launched via a scheduled task, I believe that shows up as "ColdFusion" user agent. Therefore, it is a good idea to actually put in an explicit user agent.

Additionally, this value can be done as a CFHttpParam:

<cfhttpparam
type="HEADER"
name="user-agent"
value="Mozilla......."
/>

Thanks for pointing that out.


Tom
Jun 13, 2007 at 12:37 PM

This function is fantastic-- thanks for blogging it, Ben! Just this Monday, as I was writing code using ReFindNoCase() and looping over the string to find each new match, I was wishing that CF would just have some function to return all of the matches across the whole string. Now I know it's there. Thank goodness-- it will save a lot of work!


Ben Nadel
Jun 13, 2007 at 12:47 PM

@Tom,

Yeah, looping through the indexes is just so ganky and painful (I know because that's how I did it for a long, long time). If you are feeling adventurous, check out the Java-ish method above. It's a bit strange at first, but once you get going with it, it is AWESOME. Plus, Matcher.Group() can take an argument to give you individual group matches. So, for example, Matcher.Group( 2 ) will return the second captured group.

Of course, if Java is not your thing, ColdFusion 8 should hopefully be here soon :D


Steve
Jun 14, 2007 at 3:34 PM

"these functions lack a scope"

What!? That's just retarded. It's extremely common to only want the first match. And the scope should default to "one", not "all", so that it would be like REReplace, as well as every other programming language's default regular expression matching construct that I'm aware of.


Ben Nadel
Jun 14, 2007 at 6:52 PM

@Steve,

I agree. If you only know you want the first match, or even more, if you know there is ONLY one match, you would be able to save a lot of processing by having it halt the moment it finds that match.

Check it out:

http://www.forta.com/blog/index.cfm/2007/5/4/Scorpio-Adds-Two-New-RegEx-Functions

Before Forta even had a chance to explain how the function worked, I took a guess... what's that third argument? That's right, a scope. Oh well, dare to dream :)


Shuns
Jun 17, 2007 at 7:23 PM

On a different subject, I think that there may be something wrong with the page's syntax coloring (I'm viewing in IE7) - I can see all the text down the entry has gone navy, as though a tag is not closed...


Ben Nadel
Jun 17, 2007 at 7:25 PM

@Shuns,

Thanks for pointing that out. It looks like the REMatch() highlighting is getting messed up for some reason. Notice that the (?i) flag is a different color than the text next to it (which it should not be). I will check this first thing in the morning.


Ben Nadel
Jun 17, 2007 at 7:26 PM

I think attributes that have HTML tags in them (e.g. "<h2>") confuse the algorithm.


Shuns
Jun 17, 2007 at 7:30 PM

Yeah, that's what I was thinking :P


Ben Nadel
Jun 18, 2007 at 6:52 PM

@Shuns,

I have updated my color-coding algorithm. Can you take a quick look at this post and make sure the color coding is not breaking any more? It looks good to me, but I'd like a second pair of eyes.

Thanks.


Shuns
Jun 18, 2007 at 6:58 PM

Yeah, looks good to me, nice job.

BTW keep up the good work - a lot of very good info on your site :D


Ben Nadel
Jun 19, 2007 at 7:14 AM

Awesome dude, glad you enjoy it. Thanks for the heads up on the coloring issues. Hopefully that should be the last time.


Will
Jul 3, 2008 at 4:49 PM

Ben,

Will this function work with re lookarounds?

Thanks,
Will


Ben Nadel
Jul 3, 2008 at 4:58 PM

@Will,

The underlying regular expression engine being used is still the same. So, I believe that it will handle POSITIVE look ahead/behinds. But, I think it still doesn't handle NEGATIVE look ahead/behinds. Definitely not negative look-behinds. Maybe nothing behind positive or negative. I've been using the Java regex for anything complicated, so I can't really remember.


Bard
Jul 16, 2008 at 12:42 PM

Ben,

you write: Let's compare the REMatch() function to how something like this would have been accomplished in pre-ColdFusion-8 days. In that sample you use the following code: <cfset arrTitles = [] />
which happens not to work in CF 7 - bad example! ;-)

Better use:
<cfset arrTitles = ArrayNew(1) />

Best wishes,
Bard


Ben Nadel
Jul 16, 2008 at 12:46 PM

@Bard,

Good catch ;)


Nov 5, 2008 at 11:29 AM

Fantastic post.

Saved me hours of work.


Wesley
Sep 24, 2009 at 9:52 AM

Hi Ben,

First a BIG thank you for your POI information you provided us.

I have a string for ex:

string 1 is: Mr X / Mrs P / Mster Y(1 years) / Ms POI(2 years)

string 2 is: Mrs X / Mr Y

I need to extract
1. the complete name
2. the status(Mrs or Mr or Mster or Ms)
3. the number of years if exist.

DO you have an idea how i can deal with that?

thanking you in advance,

Wesley


Ben Nadel
Sep 24, 2009 at 10:01 AM

@Wesley,

It might just be easier to split this string into an array using "/" as the list delimiter and then just checking the index values?


Wesley
Sep 24, 2009 at 12:49 PM

Yeah, that's a good approach that was really far from my mind... I shall investigate and let you know afterwards.

Thanks a lot Mr.Nadel

Wesley


Wesley
Sep 24, 2009 at 1:42 PM

Yeah,

I managed it in a way which I think is not the most optimized one, but anyway, the results are here.

There are about 50 lines of code. Do you think I need to post them here? Or send them to you some other way, if you want to have a look and see whether there is any optimization to be had? :p

thanks,

Wesley


Ben Nadel
Sep 29, 2009 at 9:34 AM

@Wesley,

You can always send stuff via my Ask Ben form. That one is best because it has a special form field for code (which will keep the formatting on my end).


Jon
Oct 20, 2009 at 10:43 AM

And here I am stuck on CF7 looping over this damned string... Hi Ben!


Jon
Oct 20, 2009 at 10:56 AM

"One way would be to keep looping over REFind() using the Len and Pos arrays... but honestly, that is the most ghetto of all possible solutions and it is shameful that ColdFusion even wanted people to do that (no offense if that's how you do it)."

Yeah, this is how I was trying to do it =(

However, the Java solution worked perfectly. Party time!


Ben Nadel
Oct 31, 2009 at 4:35 PM

@Jon,

Yeah, dipping into the Java is good times, especially if it works around having to use REFind().

