Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.
I am the chief technical officer at InVision App, Inc - a prototyping and collaboration platform for designers, built by designers. I also rock out in JavaScript and ColdFusion 24x7.

Using The Regular Expression Boundary Match \G To Find The End Of The Previous Match

By Ben Nadel on
Tags: ColdFusion

A couple of years ago, Steve Levithan - Regular Expression ninja and co-author of the Regular Expression Cookbook - helped me with a regular expression pattern for parsing comma-separated values (CSV). In the pattern, he made use of a RegEx boundary matcher that I had not seen before: \G. This boundary matcher matches at the position where the previous match ended, preventing the parsing engine from skipping characters between matches. While I mostly understood his explanation, I thought it was time to look into this RegEx construct myself.
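
To make the "no skipped characters" behavior concrete, here is a minimal Java sketch (Java, since it is the engine behind java.util.regex.Pattern). The class name, the findAll() helper, and the sample input are hypothetical and exist only for illustration; the sketch simply compares a digit pattern with and without the \G anchor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original post.
class BoundaryMatchSketch {

    // Collect every match produced by repeated calls to find().
    static List<String> findAll(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "12a34";

        // Without \G, the engine is free to skip over the "a".
        System.out.println(findAll("\\d", input));

        // With \G, each match must begin where the previous one ended,
        // so matching stops at the first non-digit.
        System.out.println(findAll("\\G\\d", input));
    }
}
```

With this input, the unanchored pattern finds all four digits, while the \G version stops after "1" and "2" because the "a" breaks the chain of adjacent matches.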

At its core, ColdFusion provides access to two different regular expression engines: POSIX and Java. The POSIX regular expression engine is what powers the built-in RegEx functions like reFind(), reMatch(), and reReplace(). The Java regular expression engine is made available to us at the Java layer via createObject( "java", "java.util.regex.Pattern" ).

These two engines have a lot in common, but each also has features that the other lacks. The \G boundary match appears to be one of those features that is, more or less, specific to the Java engine. While the POSIX engine does have some basic support for \G, I have found it to be buggy and, in some cases, capable of failing the request with an out-of-memory error.

To see the \G regular expression boundary match in action, I wanted to use it in a common use case - parsing the missingMethodName argument passed into the onMissingMethod() ColdFusion event handler. I will try this using both the POSIX and the Java regular expression engines:

    <!--- Mimic a missingMethodName argument. --->
    <cfset missingMethodName = "getPhoneNumber" />

    <!--- ----------------------------------------------------- --->
    <!--- ----------------------------------------------------- --->

    <!--- Define our regular expression pattern. --->
    <cfsavecontent variable="patternText">(?x)

        ## The \G here tells the pattern matching engine to ensure
        ## that the current match starts at the end of the last
        ## match - no characters can be skipped between matches.
        ##
        ## The very first match must be at the beginning of the string
        ## (which is at position zero).

        \G

        (
            ## Greater string starts with "get" or "set".

            ^( get | set )

            |

            ## Greater string ends with word characters.

            \w+$
        )

    </cfsavecontent>

    <!--- ----------------------------------------------------- --->
    <!--- ----------------------------------------------------- --->

    <!---
        First, we're going to try using this pattern with the POSIX
        regular expression engine - the one that powers core ColdFusion
        functions like reFind() and reMatch().
    --->

    <!--- Parse the missing method name. --->
    <cfset nameParts = reMatch( patternText, missingMethodName ) />

    <!--- Output matches. --->
    <cfoutput>

        POSIX: [ #arrayToList( nameParts, " ] [ " )# ]

    </cfoutput>

    <br />
    <!--- ----------------------------------------------------- --->
    <!--- ----------------------------------------------------- --->
    <br />

    <!---
        Next, we're going to try the Java regular expression engine
        using the Pattern class.
    --->

    <!--- Get a matcher for the compiled pattern. --->
    <cfset matcher = createObject( "java", "java.util.regex.Pattern" )
        .compile( javaCast( "string", patternText ) )
        .matcher( javaCast( "string", missingMethodName ) )
        />

    <cfoutput>

        JAVA:

        <!--- Find each match in the target string. --->
        <cfloop condition="matcher.find()">

            [ #matcher.group()# ]

        </cfloop>

    </cfoutput>

In this demo, I am using a verbose regular expression (enabled by the (?x) mode flag). This mode allows us to add white space and comments to the pattern for clarity and readability. As you can see, my \G boundary is the very first part of my pattern. This ensures that each match starts at the end of the previous match (or at the beginning of the string - position zero - for the first match). The meat of the pattern then matches either the get/set action or (|) the component property being accessed (ex. PhoneNumber).

When I run this code through the two different regular expression engines, I get the following output:

POSIX: [ get ]
JAVA: [ get ] [ PhoneNumber ]

As you can see, the POSIX engine (reMatch() in this case) didn't quite like the use of \G. The Java regular expression engine, on the other hand, had no problem with it at all.
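
For anyone who wants to reproduce the Java half of this demo outside of ColdFusion, the following is a rough plain-Java sketch of the same verbose pattern. The class name, the NAME_PARTS constant, and the parse() helper are hypothetical names introduced for illustration; only Pattern, Matcher, find(), and group() come from the demo itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; a plain-Java take on the ColdFusion demo.
class MissingMethodNameSketch {

    // Verbose-mode (?x) pattern: white space is ignored and "#" starts
    // a comment that runs to the end of the line.
    static final Pattern NAME_PARTS = Pattern.compile(
        "(?x)                                              \n" +
        "\\G                 # start where the last match ended \n" +
        "(                                                 \n" +
        "    ^( get | set )  # leading action              \n" +
        "    |                                             \n" +
        "    \\w+$           # rest of the method name     \n" +
        ")"
    );

    // Split a method name such as "getPhoneNumber" into its parts.
    static List<String> parse(String methodName) {
        Matcher matcher = NAME_PARTS.matcher(methodName);
        List<String> parts = new ArrayList<>();
        while (matcher.find()) {
            parts.add(matcher.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        // Prints [get, PhoneNumber], matching the JAVA output above.
        System.out.println(parse("getPhoneNumber"));
    }
}
```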

When I first started experimenting with the \G boundary match, I tried using this pattern:

    ^(get|set)|\G\w+$

Here, the \G is a possible part of the match, not a definite part of the match. While this worked in the Java regular expression engine, it caused the following ColdFusion error in the POSIX engine:

ROOT CAUSE: java.lang.OutOfMemoryError: Java heap space

While the Java engine was able to handle this particular pattern, from what I have read, it seems to be a best practice to always put the \G boundary match at the beginning of the regular expression pattern (as I have done in my main demo).
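
As a rough illustration of how the placement of \G plays out in the Java engine, the sketch below runs three variations of the pattern against a name that contains a stray hyphen. The class name, the findAll() helper, and the "get-PhoneNumber" input are hypothetical and chosen only to expose the difference; this is not a statement about how the POSIX engine behaves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original post.
class BoundaryPlacementSketch {

    // Collect every match produced by repeated calls to find().
    static List<String> findAll(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "get-PhoneNumber";

        // \G at the very start of the pattern: the hyphen breaks the chain.
        System.out.println(findAll("\\G(^(get|set)|\\w+$)", input));

        // \G inside the second alternative only: same result in Java.
        System.out.println(findAll("^(get|set)|\\G\\w+$", input));

        // No \G at all: the engine skips the hyphen and keeps matching.
        System.out.println(findAll("^(get|set)|\\w+$", input));
    }
}
```

In this sketch, both \G variants return only "get", while the pattern without \G skips past the hyphen and also returns "PhoneNumber". Leading with \G keeps the intent ("every match must be adjacent to the last one") visible at the front of the pattern.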

Regular expressions are both awesome and powerful. Every time that I learn a little something new in the pattern matching world, it tends to open up more possibilities for problem solving. I'm glad that I finally took the time to look deeper into the \G boundary matcher.




Reader Comments


I must admit I still do not understand the expression very well, though I have reread the post a few times.
Regular expressions are a nightmare. Regarding the CSV file - it would take a few lines of Java or C++ code. I will try to read this post again with a fresh head. Anyway, thank you for the post.
PS: You have to be proud of yourself: I am a C++/Java programmer with more than 10 years of experience :)
Oleg

