Using The Regular Expression Boundary Match \G To Find The End Of The Previous Match

Posted September 23, 2010 at 10:40 AM by Ben Nadel

Tags: ColdFusion

A couple of years ago, Steve Levithan - Regular Expression ninja and co-author of the Regular Expressions Cookbook - helped me with a regular expression pattern for parsing comma-separated values (CSV). In the pattern, he made use of a RegEx boundary matcher that I had not seen before: \G. This boundary matcher matches at the position where the previous match ended, which prevents the parsing engine from skipping characters between matches. While I mostly understood his explanation, I thought it was time to look into this RegEx construct myself.
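To make the "no skipped characters" idea concrete, here is a minimal sketch using java.util.regex directly. The class name, the helper method, and the sample input "12a34" are all made up for illustration; they are not part of the CSV pattern mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are for illustration only.
class ContiguousMatchDemo {

    // Collect every match that find() reports for the given pattern.
    static List<String> findAll(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "12a34";

        // Without \G, the engine skips over "a" and keeps finding digits.
        System.out.println(findAll("\\d", input));    // prints [1, 2, 3, 4]

        // With \G, each match must begin where the last one ended,
        // so matching stops as soon as the "a" is reached.
        System.out.println(findAll("\\G\\d", input)); // prints [1, 2]
    }
}
```

The second call stops at "a" because no later position coincides with the end of the previous match.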

At its core, ColdFusion provides access to two different regular expression engines: POSIX and Java. The POSIX regular expression engine is what powers the built-in RegEx functions like reFind(), reMatch(), and reReplace(). The Java regular expression engine is made available to us at the Java layer via createObject( "java", "java.util.regex.Pattern" ).

These two engines have a lot in common, but each also has features that the other lacks. The \G boundary match appears to be one of those things that is more or less unique to the Java engine. While the POSIX engine does have some basic support for \G, I have found it to be buggy and, in some cases, capable of causing an out-of-memory error.

To see the \G regular expression boundary match in action, I wanted to use it in a common use case - parsing the missingMethodName argument passed into the onMissingMethod() ColdFusion event handler. I will try this using both the POSIX and the Java regular expression engines:

    <!--- Mimic a missingMethodName argument. --->
    <cfset missingMethodName = "getPhoneNumber" />

    <!--- ----------------------------------------------------- --->
    <!--- ----------------------------------------------------- --->


    <!--- Define our regular expression pattern. --->
    <cfsavecontent variable="patternText">(?x)

    ## The \G here tells the pattern matching engine to ensure
    ## that the current match starts at the end of the last
    ## match - no characters can be skipped between matches.
    ##
    ## The very first match must be at the beginning of the string
    ## (which is at position zero).

    \G

    (
    ## Greater string starts with "get" or "set".

    ^( get | set )

    |

    ## Greater string ends with word characters.

    \w+$
    )

    </cfsavecontent>


    <!--- ----------------------------------------------------- --->
    <!--- ----------------------------------------------------- --->


    <!---
    First, we're going to try using this pattern with the POSIX
    regular expression engine - the one that powers core ColdFusion
    functions like reFind() and reMatch().
    --->

    <!--- Parse the missing method name. --->
    <cfset nameParts = reMatch( patternText, missingMethodName ) />

    <!--- Output matches. --->
    <cfoutput>

    POSIX: [ #arrayToList( nameParts, " ] [ " )# ]

    </cfoutput>


    <br />
    <!--- ----------------------------------------------------- --->
    <!--- ----------------------------------------------------- --->
    <br />


    <!---
    Next, we're going to try the Java regular expression engine
    using the Pattern class.
    --->

    <!--- Get a matcher for the compiled pattern. --->
    <cfset matcher = createObject( "java", "java.util.regex.Pattern" )
        .compile( javaCast( "string", patternText ) )
        .matcher( javaCast( "string", missingMethodName ) )
        />

    <cfoutput>

    JAVA:

    <!--- Find each match in the target string. --->
    <cfloop condition="matcher.find()">

    [ #matcher.group()# ]

    </cfloop>

    </cfoutput>

In this demo, I am using a Verbose regular expression (as defined by the (?x) mode flag). This mode allows us to add white space and comments to the pattern for clarity and readability. As you can see, my \G boundary is the very first part of my pattern. This ensures that each match starts at the end of the previous match (or at the beginning of the string - position zero - for the first match). The meat of the pattern then matches either the get/set action or (|) the component property being accessed (e.g. PhoneNumber).

When I run this code through the two different regular expression engines, I get the following output:

POSIX: [ get ]
JAVA: [ get ] [ PhoneNumber ]

As you can see, the POSIX engine (reMatch() in this case) didn't quite like the use of \G. The Java regular expression engine, on the other hand, had no problem with it at all.
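To see that the second Java match really does begin where the first one ended, here is a rough sketch that reports the start and end offsets of each match. It runs the same verbose pattern through java.util.regex directly rather than through ColdFusion. The class name, method name, and "start-end:text" output format are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the original demo.
class MethodNameSplitter {

    // Same structure as the verbose pattern in the post. The (?x) flag
    // lets the pattern carry white space and # comments.
    static final Pattern NAME_PARTS = Pattern.compile(
        "(?x)\n" +
        "\\G              # continue exactly where the last match ended\n" +
        "(\n" +
        "  ^( get | set ) # accessor prefix at the start of the string\n" +
        "  |\n" +
        "  \\w+$          # the rest of the name, up to the end\n" +
        ")"
    );

    // Describe each match as "start-end:text" so contiguity is visible.
    static List<String> describeMatches(String methodName) {
        Matcher matcher = NAME_PARTS.matcher(methodName);
        List<String> parts = new ArrayList<>();
        while (matcher.find()) {
            parts.add(matcher.start() + "-" + matcher.end() + ":" + matcher.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints [0-3:get, 3-14:PhoneNumber]
        System.out.println(describeMatches("getPhoneNumber"));
    }
}
```

The first match ends at offset 3 and the second starts at offset 3, which is exactly the contiguity that \G enforces.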

When I first started experimenting with the \G boundary match, I tried using this pattern:

    ^(get|set)|\G\w+$

Here, the \G appears in only one branch of the alternation, so it is a possible part of the match, not a definite part of the match. While this worked in the Java regular expression engine, it caused the following ColdFusion error in the POSIX engine:

ROOT CAUSE: java.lang.OutOfMemoryError: Java heap space

While the Java engine was able to handle this particular pattern, from what I have read, it seems to be a best practice to always put the \G boundary match at the beginning of the regular expression pattern (as I have done in my main demo).
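As a small sanity check on the Java side only, the sketch below runs both placements of \G against the same input. It says nothing about how reMatch() behaves. The class and helper names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison harness for the two pattern shapes.
class AnchorPlacementDemo {

    // Collect every match that find() reports for the given pattern.
    static List<String> findAll(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String name = "getPhoneNumber";

        // \G leads the whole pattern, as in the main demo.
        System.out.println(findAll("\\G(^(get|set)|\\w+$)", name)); // prints [get, PhoneNumber]

        // \G appears only inside the second alternative.
        System.out.println(findAll("^(get|set)|\\G\\w+$", name));   // prints [get, PhoneNumber]
    }
}
```

For this input, java.util.regex produces the same two matches either way, so leading with \G costs nothing here and keeps the intent obvious.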

Regular expressions are both awesome and powerful. Every time that I learn a little something new in the pattern matching world, it tends to open up more possibilities for problem solving. I'm glad that I finally took the time to look deeper into the \G boundary matcher.




Reader Comments

Sep 24, 2010 at 7:44 AM

Hi Ben,
Display idea "No one has left a comment is nice", lol


Sep 24, 2010 at 8:42 AM

@CFFan,

Thanks - I thought it was fun :)


Sep 24, 2010 at 11:04 AM

    <cfcomponent hint="Ben Class" extends="Superman Class">
        <cffunction name="init">
            <cfscript>
                benchPress( 400 );
                legPress( 800 );
                killBearWithBareHands( 10 );
                eatRawMeatForBreakfast();
                saveTheDay();
            </cfscript>
        </cffunction>

        <cffunction name="flirt" output="always" hint="The only hint you need is me flexing my pecks">
            <cfreturn "Hey baby, wanna wrestle?" />
        </cffunction>

        <cffunction name="flirtEnEspanol" output="¡si si!" hint="Yo Quiero Taco Bell">
            <cfreturn "Hola señorita. ¿Quieres luchar?" />
        </cffunction>
    </cfcomponent>


Sep 27, 2010 at 11:37 AM

I must admit I still do not understand the expression very well, though I reread the post a few times.
Regular expressions are a nightmare. Regarding the CSV file - it would take a few lines of Java or C++ code. I will try to read this post again with a fresh head. Anyway, thank you for the post.
PS: You have to be proud of yourself: I am a C++/Java programmer with more than 10 years of experience :)
Oleg

