Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.
I am the chief technical officer at InVision App, Inc - a prototyping and collaboration platform for designers, built by designers. I also rock out in JavaScript and ColdFusion 24x7.

Learning ColdFusion 8: REMatch() For Regular Expression Matching

By Ben Nadel on
Tags: ColdFusion

ColdFusion has long been awesome at finding regular expression matches in strings with REFind() and REFindNoCase(), and it has been easy to alter strings with regular expressions using REReplace() and REReplaceNoCase(). But, until now, ColdFusion has been extremely weak when it came to retrieving substrings from a chunk of text using regular expressions. Thankfully, ColdFusion 8 has introduced REMatch() and REMatchNoCase(). REMatch() searches through a string using a regular expression and returns all of the matching substrings in an array.

To demonstrate the power and ease of use of this new regular expression matching function, let's grab some search results from Google.com and then, from that content, pull out the titles and links using a regular expression:

    <!---
        Let's search for some google results for the phrase
        "Maria Bello is hot" (seriously!!). We are going to
        grab the first page in our CFHttp object.
    --->
    <cfhttp
        url="http://www.google.com/search?q=maria+bello+is+hot"
        method="GET"
        useragent="#CGI.http_user_agent#"
        result="objGet">

        <!---
            Tell Google that we just came from the Google.com
            home page. This is not necessary, really, but it is
            good practice when dealing with better security.
        --->
        <cfhttpparam
            type="CGI"
            name="referer"
            value="http://www.google.com"
            />

    </cfhttp>


    <!---
        Now, we want to grab all of the titles from our
        search results. Based on looking at the source of
        the results page, we can gather that there is a
        consistent pattern for the HTML. We can use this
        pattern to create a regular expression based on
        the H2 tags and A tags.
    --->
    <cfset arrTitles = REMatch(
        "(?i)<h2 class=r><a[^>]+>.+?</a></h2>",
        objGet.FileContent
        ) />


    <!---
        At this point, our arrTitles should be an array that
        holds zero or more (hopefully more) matches to our
        above regular expression.
    --->
    <cfdump
        var="#arrTitles#"
        label="Maria Bello Is Hot REMatch()"
        />

Here, we are using the combination of H2 and A tags to create our search results pattern. Running the above code, we get the following CFDump output:


 
 
 

 
[Image: ColdFusion REMatch() Regular Expression Matching]
 
 
 

How easy was that?!?

If you have not performed this action before, you might not even realize what kind of a gold mine this is. Let's compare the REMatch() function to how something like this would have been accomplished in pre-ColdFusion-8 days. One way would be to keep looping over REFind() using the Len and Pos arrays... but honestly, that is the most ghetto of all possible solutions and it is shameful that ColdFusion even wanted people to do that (no offense if that's how you do it). So, instead, I will compare this to the ColdFusion / Java equivalent.

ColdFusion runs on top of Java, so we can reach into the java.util.regex library and use its Pattern and Matcher classes to loop over the pattern matches in a string:

    <!---
        Create a Java pattern that will match our H2 and A
        search results pattern.
    --->
    <cfset objPattern = CreateObject(
        "java",
        "java.util.regex.Pattern"
        ).Compile(
            "(?i)<h2 class=r><a[^>]+>.+?</a></h2>"
            ) />


    <!---
        Create a matcher that will apply this pattern to
        our Google search results (that we scraped via the
        CFHttp call above). This matcher will have the
        ability to loop over all pattern matches in the
        target string.
    --->
    <cfset objMatcher = objPattern.Matcher(
        objGet.FileContent
        ) />


    <!--- Create an array to hold our search results. --->
    <cfset arrTitles = [] />


    <!---
        Keep iterating over the matcher while there are
        still search results to be found.
    --->
    <cfloop condition="objMatcher.Find()">

        <!--- Add this match to our title array. --->
        <cfset ArrayAppend(
            arrTitles,
            objMatcher.Group()
            ) />

    </cfloop>


    <!---
        At this point, our arrTitles should be an array that
        holds zero or more (hopefully more) matches to our
        above regular expression.
    --->
    <cfdump
        var="#arrTitles#"
        label="Maria Bello Is Hot Java Pattern / Matcher"
        />

As you can see, this is not too complicated, but it is much more code than the above REMatch() method. And, just so you can see that these are doing the exact same thing, when you run the code above, you get the following CFDump output:


 
 
 

 
[Image: ColdFusion REMatch() Regular Expression Java Pattern Matching Equivalent]
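For anyone more comfortable reading plain Java than CreateObject() calls, here is a rough sketch of the same Pattern / Matcher loop outside of ColdFusion. The class name MatchAllSketch, the matchAll() helper, and the sample HTML are made up for illustration; only Pattern, Matcher, and the regular expression come from the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchAllSketch {

    // Collect every full match of the pattern, in order of appearance.
    // This is the role the cfloop / ArrayAppend() pair plays above.
    static List<String> matchAll(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for objGet.FileContent.
        String html =
            "<h2 class=r><a href=\"/one\">First</a></h2>"
            + "<p>noise</p>"
            + "<H2 class=r><A href=\"/two\">Second</A></H2>";

        for (String match : matchAll("(?i)<h2 class=r><a[^>]+>.+?</a></h2>", html)) {
            System.out.println(match);
        }
    }
}
```

The while loop over find() is the whole trick: each call advances the matcher to the next match, and group() with no arguments returns the full text of that match.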
 
 
 

If you are used to doing things like this (the above), then REMatch() is going to make your life a lot easier. Of course, REMatch() is no silver bullet. ColdFusion 8 and REMatch() still use the same regular expression engine that was used in earlier versions of ColdFusion. This means that the regular expressions used in REMatch() are still more limited than those used in the Java Pattern / Matcher solution.

In our above examples, we are grabbing the H2 tags as well as the links within them. We are doing this so that we don't just grab random links off the page (only the search results have that H2 tag with that class). But let's say we didn't want to retrieve the H2 tags, we just wanted to make sure they existed in our pattern. To achieve this, we could use positive look behinds and positive look aheads in our pattern:

(?i)(?<=<h2 class=r>)<a[^>]+>.+?</a>(?=</h2>)

Here, we are saying that the H2 tags have to start and end the pattern; however, we do NOT want to grab them. This is purely for assertive matching. This is a valid regular expression, but if you try to run it in the REMatch() function, ColdFusion will throw an error about ill-formed regular expressions. ColdFusion 8, using the older regular expression engine, still cannot handle look-behinds.

If you are dead-set on using them, you can, of course, still use the Java Pattern / Matcher to grab the results. If we modified our Java Pattern / Matcher example above to use our new regular expression with look-aheads and look-behinds, we would get the following CFDump output:


 
 
 

 
[Image: ColdFusion REMatch() Regular Expression Java Matching With Look Behinds]
 
 
 

Notice that, now, our search results only have the A tags. The H2 tags, used in the ahead/behind assertive matching, were not captured in our results group.
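To make the look-behind / look-ahead behavior concrete, here is a minimal plain-Java sketch that runs the pattern from above. The class name LookaroundSketch, the linksInsideH2() helper, and the sample HTML are assumptions for illustration, not part of the original example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundSketch {

    // The look-behind and look-ahead assert that the H2 wrapper is there
    // without making it part of the matched text.
    static final Pattern LINK_IN_H2 = Pattern.compile(
        "(?i)(?<=<h2 class=r>)<a[^>]+>.+?</a>(?=</h2>)"
    );

    static List<String> linksInsideH2(String html) {
        Matcher matcher = LINK_IN_H2.matcher(html);
        List<String> links = new ArrayList<>();
        while (matcher.find()) {
            links.add(matcher.group());
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical page: one search result link and one unrelated link.
        String html =
            "<h2 class=r><a href=\"/one\">First</a></h2>"
            + "<a href=\"/nav\">Nav</a>";

        System.out.println(linksInsideH2(html));
    }
}
```

Only the link wrapped in the H2 tag is returned, and the H2 tag itself is left out of the match, which is exactly the difference from the earlier pattern.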

REMatchNoCase() does the same thing as REMatch(), only without case sensitivity. However, you can build case-insensitivity directly into your regular expression using the flag (?i). Notice that all of my regular expressions above start with this flag - that makes them case-insensitive. Running:

    <cfset arrTitles = REMatch(
        "(?i)<h2 class=r><a[^>]+>.+?</a></h2>",
        objGet.FileContent
        ) />

... would be the same thing as running:

    <cfset arrTitles = REMatchNoCase(
        "<h2 class=r><a[^>]+>.+?</a></h2>",
        objGet.FileContent
        ) />

Notice that I have changed the method from REMatch() to REMatchNoCase() and removed the (?i) flag.
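The same trade-off exists on the Java side, where the closest analog to REMatchNoCase() is passing Pattern.CASE_INSENSITIVE at compile time instead of embedding (?i) in the pattern. The sketch below is only an analogy under that assumption; the class name CaseFlagSketch and its helper methods are hypothetical.

```java
import java.util.regex.Pattern;

final class CaseFlagSketch {

    // The search result pattern from above, without any flag.
    static final String RESULT_REGEX = "<h2 class=r><a[^>]+>.+?</a></h2>";

    // Case-insensitivity requested inside the pattern text, like REMatch() with (?i).
    static boolean findsWithInlineFlag(String html) {
        return Pattern.compile("(?i)" + RESULT_REGEX).matcher(html).find();
    }

    // Case-insensitivity requested by the caller, roughly the REMatchNoCase() role.
    static boolean findsWithCompileFlag(String html) {
        return Pattern.compile(RESULT_REGEX, Pattern.CASE_INSENSITIVE).matcher(html).find();
    }

    // No flag at all: the match is case-sensitive.
    static boolean findsCaseSensitive(String html) {
        return Pattern.compile(RESULT_REGEX).matcher(html).find();
    }

    public static void main(String[] args) {
        // Hypothetical upper-case markup.
        String html = "<H2 CLASS=R><A HREF=\"/x\">Title</A></H2>";

        System.out.println(findsWithInlineFlag(html));
        System.out.println(findsWithCompileFlag(html));
        System.out.println(findsCaseSensitive(html));
    }
}
```

Both flagged versions find the upper-case markup, while the unflagged one does not.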

So, while REMatch() and REMatchNoCase() are not the cure-all for your regular expression matching, I am sure you can see that for less-complex patterns, these new ColdFusion 8 functions are going to make your life a lot more cozy. One thing that I think these functions lack is a scope. When you use REReplace(), you can tell it to replace ONE instance of the pattern or ALL instances of the pattern. It seems like REMatch() and REMatchNoCase() should accept the same thing as an optional third argument (defaulting to ALL). I can imagine situations where you know you might have multiple pattern matches, but for optimization, you only care about the first one and you want ColdFusion to stop matching after the first match. But oh well, maybe in ColdFusion 9 :)
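Until something like that exists, the Java Matcher makes a scoped match easy to sketch, since nothing forces you to keep calling find(). The Scope enum, the match() helper, and the class name MatchScopeSketch below are hypothetical; they only mirror the ONE / ALL idea borrowed from REReplace().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchScopeSketch {

    // Hypothetical scope argument, named after REReplace()'s ONE / ALL.
    enum Scope { ONE, ALL }

    static List<String> match(String regex, String input, Scope scope) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
            if (scope == Scope.ONE) {
                break; // stop scanning as soon as the first match is found
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(match("\\d+", "a1 b22 c333", Scope.ONE));
        System.out.println(match("\\d+", "a1 b22 c333", Scope.ALL));
    }
}
```

With Scope.ONE the rest of the input is never scanned, which is the optimization described above.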




Reader Comments

Another not necessary but a good idea would be to send a user-agent string along with your http request so your script looks like a real browser.

Reply to this Comment

@Dustin,

Good point... I am actually sending the user agent as part of the CFHttp tag itself (#CGI.http_user_agent#), however this only works because I am testing in a browser. If this was launched via a scheduled task, I believe that shows up as "ColdFusion" user agent. Therefore, it is a good idea to actually put in an explicit user agent.

Additionally, this value can be done as a CFHttpParam:

<cfhttpparam
type="HEADER"
name="user-agent"
value="Mozilla......."
/>

Thanks for pointing that out.

Reply to this Comment

This function is fantastic-- thanks for blogging it, Ben! Just this Monday, as I was writing code using ReFindNoCase() and looping over the string to find each new match, I was wishing that CF would just have some function to return all of the matches across the whole string. Now I know it's there. Thank goodness-- it will save a lot of work!

Reply to this Comment

@Tom,

Yeah, looping through the indexes is just so ganky and painful (I know because that's how I did it for a long, long time). If you are feeling adventurous, check out the Java-ish method above. It's a bit strange at first, but once you get going with it, it is AWESOME. Plus, the Matcher.Group() can take arguments to actually give you individual group matches. So, for example, Matcher.Group( 2 ) will return the second captured group.

Of course, if Java is not your thing, ColdFusion 8 should hopefully be here soon :D
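As a small illustration of the numbered-group idea mentioned above, here is a plain-Java sketch. The simplified link pattern, the class name GroupSketch, and the firstLinkParts() helper are assumptions for the example and are not taken from the post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupSketch {

    // Group 1 captures the href value; group 2 captures the link text.
    static final Pattern LINK = Pattern.compile(
        "(?i)<a href=\"([^\"]+)\">(.+?)</a>"
    );

    // Returns {href, text} for the first link, or null when nothing matches.
    static String[] firstLinkParts(String html) {
        Matcher matcher = LINK.matcher(html);
        if (!matcher.find()) {
            return null;
        }
        return new String[] { matcher.group(1), matcher.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = firstLinkParts("<h2 class=r><a href=\"/one\">First</a></h2>");
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```

Calling group() with no argument (or with 0) still returns the whole match; the numbered calls only narrow it down to one set of parentheses.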

Reply to this Comment

"these functions lack a scope"

What!? That's just retarded. It's extremely common to only want the first match. And the scope should default to "one", not "all", so that it would be like REReplace, as well as every other programming language's default regular expression matching construct that I'm aware of.

Reply to this Comment

@Steve,

I agree. If you only know you want the first match, or even more, if you know there is ONLY one match, you would be able to save a lot of processing by having it halt the moment it finds that match.

Check it out:

http://www.forta.com/blog/index.cfm/2007/5/4/Scorpio-Adds-Two-New-RegEx-Functions

Before Forta even had a chance to explain how the function worked, I took a guess... what's that third argument? That's right, a scope. Oh well, dare to dream :)

Reply to this Comment

On a different subject, I think that there may be something wrong with the page's syntax coloring (I'm viewing in IE7) - I can see all the text down the entry has gone navy, as though a tag is not closed...

Reply to this Comment

@Shuns,

Thanks for pointing that out. It looks like the REMatch() highlighting is getting messed up for some reason. Notice that the (?i) flag is a different color than the text next to it (which it should not be). I will check this first thing in the morning.

Reply to this Comment

@Shuns,

I have updated my color-coding algorithm. Can you take a quick look at this post and make sure the color coding is not breaking any more. It looks good to me, but I'd like a second pair of eyes.

Thanks.

Reply to this Comment

Yeah looks good to me, nice job.

BTW keep up the good work - a lot of very good info on your site :D

Reply to this Comment

Awesome dude, glad you enjoy it. Thanks for the heads up on the coloring issues. Hopefully that should be the last time.

Reply to this Comment

@Will,

The underlying regular expression engine being used is still the same. So, I believe that it will handle POSITIVE look ahead/behinds. But, I think it still doesn't handle NEGATIVE look ahead/behinds. Definitely not negative look-behinds. Maybe nothing behind positive or negative. I've been using the Java regex for anything complicated, so I can't really remember.

Reply to this Comment

Ben,

you write: Let's compare the REMatch() function to how something like this would have been accomplished in pre-ColdFusion-8 days. In that sample you use the following code: <cfset arrTitles = [] />
which happens not to work in CF 7 - bad example! ;-)

Better use:
<cfset arrTitles = ArrayNew(1) />

Best wishes,
Bard

Reply to this Comment

Hi Ben,

First a BIG thank you for your POI information you provided us.

I have a string for ex:

string 1 is: Mr X / Mrs P / Mster Y(1 years) / Ms POI(2 years)

string 2 is: Mrs X / Mr Y

I need to extract
1. the complete name
2. the status(Mrs or Mr or Mster or Ms)
3. the number of years if exist.

DO you have an idea how i can deal with that?

thanking you in advance,

Wesley

Reply to this Comment

@Wesley,

It might just be easier to split this string into an array using "/" as the list delimiter and then just checking the index values?

Reply to this Comment

Yeah, that's a good approach that was really far from my mind... I shall investigate and let you know afterwards.

Thanks a lot Mr.Nadel

Wesley

Reply to this Comment

Yeah,

I managed it in some sort of way, which I think is not the optimal one, but anyway, the results are here.

There are about 50 lines of code. Do you think I need to post them here? Or should I send them to you some other way if you want to have a look and see whether there is any optimization out there? :p

thanks,

Wesley

Reply to this Comment

@Wesley,

You can always send stuff via my Ask Ben form. That one is best because it has a special form field for code (which will keep the formatting on my end).

Reply to this Comment

"One way would be to keep looping over REFind() using the Len and Pos arrays... but honestly, that is the most ghetto of all possible solutions and it is shameful that ColdFusion even wanted people to do that (no offense if that's how you do it)."

Yeah, this is how I was trying to do it =(

However, the Java solution worked perfectly. Party time!

Reply to this Comment

@Jon,

Yeah, dipping into the Java is good times, especially if it works around having to use REFind().

Reply to this Comment

I have been fighting with getting Regular Expressions to simplify a task, no luck, finally asking for some input... hoping it's not obvious...

Match this tel number to the longest iteration in a database of countrycodes and areacodes:

1808
180855
1808554
1808555

A match would be the last entry. I have been doing this with a loop.

Can I harness the power of ReMatch in some way to do this?

Reply to this Comment

@Ben,

I am not sure what you are asking exactly. You can definitely use reMatch() to extract numeric patterns. But, are you trying to extract? or match? You can use reFind() to check for pattern existence. But, if you are trying to do the matching in a database lookup, you can't use ColdFusion's RE functions.

Reply to this Comment

Ah my post is flawed...

I am given a phone number 18085551234. Need to match it to the longest match of area codes (for lack of a better term). So in the above list, although the beginning sequence of my number matches the 1, 2 & 4 entries, the correct match would be the last entry because it is longest.

The list of area codes is in a database but could be an array if required.

Reply to this Comment

@Ben,

By default, regular expressions are greedy. That is, they will try the longest possible match first, and then, only if that doesn't produce a match, start trying smaller matches. So, for example, the pattern:

1808\d*

(String "1808" followed by zero or more digits) will match "1808555" before it tries to match "18085".

That said, I am still not 100% sure what you are doing. Given the value, "18085551234", are you trying to extract the value, "1808555"?

Reply to this Comment

If I need to find everything that is between two curly braces?

For example if I have this string:

{aaaa|bbbb|cccc|dddd}

And I want:

aaaa|bbbb|cccc|dddd

What is the regex?

I have tried this, but don't work:

(?=\{)(.*?)(?=\})

Could you help me?

Reply to this Comment

@Francesco,

1. you should use : refindnocase
For the regular expression, if '{' is a special character so place the \ before it, else no need.
2. reg exp: {(.*)}

this should take everything inside the curly braces, thats what i would have done on my first try.

Didn't test it though. Sorry, hope this may help.

Reply to this Comment

Thanks for the reply. But unfortunately this does not work.

I've tried this:

<cfset ar="Ciao mi chiamo {Francesco|Carlo|Marino|Guido} e faccio l'artista ">

<cfset ag = ReMatch("(?=\{)(.*?)(?=\})","#ar#")>

The output is: {Francesco|Carlo|Marino|Guido

But I also want to eliminate the first bracket.

If I try this regex "(?<=\{)(.*?)(?=\})" on http://gskinner.com/RegExr/ it works perfectly, but ColdFusion gives me the error "Sequence (?<...) not recognized".

How can I solve it?

Reply to this Comment

@Francesco,

ColdFusion does not support look-behinds in the native regular expression functions. I would suggest looking at something like reMatchGroups(), which allows you to extract certain captured groups:

http://www.bennadel.com/blog/1132-REMatchGroup-UDF-To-Return-Only-Specified-Group-In-RegEx-Pattern.htm

... or a more robust version:

http://www.bennadel.com/blog/1040-REMatchGroups-ColdFusion-User-Defined-Function.htm

Also, I have a ColdFusion component that is basically a wrapper for the core Java regular expression functionality, which provides less structure but the most flexibility:

http://www.bennadel.com/blog/2097-PatternMatcher-cfc-A-ColdFusion-Component-Wrapper-For-The-Java-Regular-Expression-Engine.htm

The problem you are having is just one of syntax support. Hopefully those links should help.
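To show the captured-group idea without any look-behind, here is a minimal plain-Java sketch that uses the sample string from the question. The class name BraceSketch and the insideBraces() helper are made up for illustration, and this is not the code behind the UDFs linked above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BraceSketch {

    // Match the braces literally, but capture only what sits between them.
    static final Pattern BRACED = Pattern.compile("\\{(.*?)\\}");

    static List<String> insideBraces(String text) {
        Matcher matcher = BRACED.matcher(text);
        List<String> contents = new ArrayList<>();
        while (matcher.find()) {
            contents.add(matcher.group(1)); // group 1 excludes both braces
        }
        return contents;
    }

    public static void main(String[] args) {
        String text = "Ciao mi chiamo {Francesco|Carlo|Marino|Guido} e faccio l'artista ";
        System.out.println(insideBraces(text));
    }
}
```

Because the braces are consumed by the match but sit outside the parentheses, no look-around is needed at all.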

Reply to this Comment

Great post. I would totally date REMatch() if she were a hot chick.

The only thing that's slowing me down is my noobish brain when it comes to writing regular expressions from scratch. Can anyone recommend a good tutorial site that breaks down regular expression building for beginners on the subject? I've Googled around a bit but I haven't really found a site that breaks it down very well.

Reply to this Comment

I have a one string which contain three '/' I want to search 3rd '/' and then some pattern after that '/'.
please help

Reply to this Comment

I am using the ReMatch you described to find any number in a string of text such as: Final Score: University of Nowhere 14, State College 24

    <cfset arrScores = ReMatch("[\d]+",str)>

I can cfdump the array or use ArrayToList() easily enough to get 14,24, but what I really want is to highlight or bold whatever is found by the ReMatch within the string itself, so 14 and 24 would stand out to the reader. Is there a simple way I can compare what exists in the ReMatch array to the string and apply a style? Or will I need to loop.

I've done character highlighting before using RePlaceNoCase without looping like:

    <cfset wtf = 'word to find'>
    <cfoutput>
    #replaceNoCase(string,wtf,"<span style='background:yellow; font-weight:bold'>#wtf#</span>","all")#
    </cfoutput>

I want to somehow find any and all numbers in the string like ReMatch, but be able to use the array values like in the RePlaceNoCase.

Any ideas?

Reply to this Comment
