Using The Regular Expression Boundary Match \G To Find The End Of The Previous Match

Posted September 23, 2010 at 10:40 AM by Ben Nadel

Tags: ColdFusion

A couple of years ago, Steve Levithan - Regular Expression ninja and co-author of the Regular Expressions Cookbook - helped me with a regular expression pattern for parsing comma-separated values (CSV). In the pattern, he made use of a RegEx boundary matcher that I had not seen before: \G. This boundary matcher matches the end of the previous match, preventing the parsing engine from skipping characters between matches. While I mostly understood his explanation, I thought it was time to look into this RegEx construct myself.

At its core, ColdFusion provides access to two different regular expression engines: POSIX and Java. The POSIX regular expression engine is what powers the built-in RegEx functions like reFind(), reMatch(), and reReplace(). The Java regular expression engine is made available to us at the Java layer via createObject( "java", "java.util.regex.Pattern" ).

These two engines have a lot in common, and each has features the other lacks. The \G boundary match appears to be one of those things that is more or less unique to the Java engine. While the POSIX engine does have some basic support for \G, I have found it to be buggy and, in some cases, capable of causing an out-of-memory error.
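As a quick illustration of the "no skipping" behavior in the Java engine, here is a minimal plain-Java sketch (not ColdFusion). The class name, the helper method, and the sample input are hypothetical and exist only for this illustration; the only library calls used are Pattern.compile(), Matcher.find(), and Matcher.group() from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and the sample input are illustrative only.
class ContiguousMatchDemo {

    // Collect every match that Matcher.find() reports for the given pattern.
    static List<String> findAll(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "123a45";

        // Without \G, the engine is free to skip over the "a" and keep going.
        System.out.println(findAll("\\d", input));    // [1, 2, 3, 4, 5]

        // With \G, each match must begin where the previous one ended, so
        // matching stops as soon as the "a" gets in the way.
        System.out.println(findAll("\\G\\d", input)); // [1, 2, 3]
    }
}
```

The second call never reaches "4" or "5": once a match attempt fails at the position where the previous match ended, there is no later position at which \G can succeed.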

To see the \G regular expression boundary match in action, I wanted to use it in a common use case - parsing the missingMethodName argument passed into the onMissingMethod() ColdFusion event handler. I will try this using both the POSIX and the Java regular expression engines:

  • <!--- Mimic a missingMethodName argument. --->
  • <cfset missingMethodName = "getPhoneNumber" />
  •  
  • <!--- ----------------------------------------------------- --->
  • <!--- ----------------------------------------------------- --->
  •  
  •  
  • <!--- Define our regular expression pattern. --->
  • <cfsavecontent variable="patternText">(?x)
  •  
  • ## The \G here tells the pattern matching engine to ensure
  • ## that the current match starts at the end of the last
  • ## match - no characters can be skipped between matches.
  • ##
  • ## The very first match must be at the beginning of the string
  • ## (which is at position zero).
  •  
  • \G
  •  
  • (
  • ## Greater string starts with "get" or "set".
  •  
  • ^( get | set )
  •  
  • |
  •  
  • ## Greater string ends with word characters.
  •  
  • \w+$
  • )
  •  
  • </cfsavecontent>
  •  
  •  
  • <!--- ----------------------------------------------------- --->
  • <!--- ----------------------------------------------------- --->
  •  
  •  
  • <!---
  • First, we're going to try using this pattern with the POSIX
  • regular expression engine - the one that powers core ColdFusion
  • functions like reFind() and reMatch().
  • --->
  •  
  • <!--- Parse the missing method name. --->
  • <cfset nameParts = reMatch( patternText, missingMethodName ) />
  •  
  • <!--- Output matches. --->
  • <cfoutput>
  •  
  • POSIX: [ #arrayToList( nameParts, " ] [ " )# ]
  •  
  • </cfoutput>
  •  
  •  
  • <br />
  • <!--- ----------------------------------------------------- --->
  • <!--- ----------------------------------------------------- --->
  • <br />
  •  
  •  
  • <!---
  • Next, we're going to try the Java regular expression engine
  • using the Pattern class.
  • --->
  •  
  • <!--- Get a matcher for the compiled pattern. --->
  • <cfset matcher = createObject( "java", "java.util.regex.Pattern" )
  • .compile( javaCast( "string", patternText ) )
  • .matcher( javaCast( "string", missingMethodName ) )
  • />
  •  
  • <cfoutput>
  •  
  • JAVA:
  •  
  • <!--- Find each match in the target string. --->
  • <cfloop condition="matcher.find()">
  •  
  • [ #matcher.group()# ]
  •  
  • </cfloop>
  •  
  • </cfoutput>

In this demo, I am using a verbose regular expression (as enabled by the (?x) mode flag). This mode allows us to add white space and comments to the pattern for clarity and readability. As you can see, my \G boundary is the very first part of my pattern. This ensures that each match starts at the end of the previous match (or at the beginning of the string - position zero - for the first match). The meat of the pattern then matches either the get/set action or (|) the component property being accessed (e.g. PhoneNumber).

When I run this code through the two different regular expression engines, I get the following output:

POSIX: [ get ]
JAVA: [ get ] [ PhoneNumber ]

As you can see, the POSIX engine (reMatch() in this case) didn't quite like the use of \G. The Java regular expression engine, on the other hand, had no problem with it at all.
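For readers who want to try the Java half of the demo outside of ColdFusion, the following is a rough plain-Java sketch of the same idea. The class name and the parse() helper are hypothetical, and the pattern is a condensed restatement of the verbose pattern from the demo above rather than a character-for-character copy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class mirroring the ColdFusion demo in plain Java.
class MissingMethodNameDemo {

    // Verbose-mode pattern: white space is ignored and "#" starts a comment.
    static final String PATTERN_TEXT =
        "(?x)\n" +
        "\\G              # each match starts where the previous one ended\n" +
        "(\n" +
        "  ^(get|set)     # the name starts with get or set\n" +
        "  |\n" +
        "  \\w+$          # the rest of the name, up to the end of the string\n" +
        ")";

    // Split a method name such as "getPhoneNumber" into its matched parts.
    static List<String> parse(String methodName) {
        Matcher matcher = Pattern.compile(PATTERN_TEXT).matcher(methodName);
        List<String> parts = new ArrayList<>();
        while (matcher.find()) {
            parts.add(matcher.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(parse("getPhoneNumber")); // [get, PhoneNumber]
    }
}
```

The first find() matches "get" at position zero. The second find() starts at position three, where \G succeeds, the ^ branch fails, and the \w+$ branch consumes the rest of the name.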

When I first started experimenting with the \G boundary match, I tried using this pattern:

  • ^(get|set)|\G\w+$

Here, the \G appears in only one branch of the alternation, so it is an optional part of the match rather than a required one. While this worked in the Java regular expression engine, it caused the following ColdFusion error in the POSIX engine:

ROOT CAUSE: java.lang.OutOfMemoryError: Java heap space

While the Java engine was able to handle this particular pattern, from what I have read, it seems to be a best practice to always put the \G boundary match at the beginning of the regular expression pattern (as I have done in my main demo).
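To see how the Java engine treats a \G that sits inside one branch of an alternation, here is a small plain-Java sketch. The class name, the helper, and the "get-PhoneNumber" input are hypothetical and chosen only to make the effect of \G visible; the sketch says nothing about how the POSIX engine behaves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; inputs are illustrative only.
class EmbeddedBoundaryDemo {

    // Collect every match that Matcher.find() reports for the given pattern.
    static List<String> findAll(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // \G appears only in the second branch of the alternation.
        String embedded = "^(get|set)|\\G\\w+$";

        // The name parts are contiguous, so both of them are found.
        System.out.println(findAll(embedded, "getPhoneNumber"));  // [get, PhoneNumber]

        // The "-" breaks contiguity, so the \G branch cannot match at all.
        System.out.println(findAll(embedded, "get-PhoneNumber")); // [get]

        // The same pattern without \G happily skips over the "-".
        System.out.println(findAll("^(get|set)|\\w+$", "get-PhoneNumber")); // [get, PhoneNumber]
    }
}
```

Even when it is buried in a branch, \G still ties that branch to the end of the previous match. Placing it at the front of the pattern simply makes that constraint apply to every branch and easier to spot.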

Regular expressions are both awesome and powerful. Every time that I learn a little something new in the pattern matching world, it tends to open up more possibilities for problem solving. I'm glad that I finally took the time to look deeper into the \G boundary matcher.




Reader Comments

Sep 24, 2010 at 7:44 AM
9 Comments

Hi Ben,
The display idea for "No one has left a comment" is nice, lol


Sep 24, 2010 at 8:42 AM
11,243 Comments

@CFFan,

Thanks - I thought it was fun :)


Sep 24, 2010 at 11:04 AM
44 Comments

  • <cfcomponent hint="Ben Class" extends="Superman Class">
  • <cffunction name="init">
  • <cfscript>
  • benchPress( 400 );
  • legPress( 800 );
  • killBearWithBareHands( 10 );
  • eatRawMeatForBreakfast();
  • saveTheDay();
  • </cfscript>
  • </cffunction>
  •  
  • <cffunction name="flirt" output="always" hint="The only hint you need is me flexing my pecks">
  • <cfreturn "Hey baby, wanna wrestle?" />
  • </cffunction>
  •  
  • <cffunction name="flirtEnEspanol" output="¡si si!" hint="Yo Quiero Taco Bell">
  • <cfreturn "Hola señorita. ¿Quieres luchar?" />
  • </cffunction>
  • </cfcomponent>


Sep 27, 2010 at 11:37 AM
1 Comments

I must admit I still do not understand the expression very well, though I reread the post a few times.
Regular expressions are a nightmare. Regarding the CSV file - it would take a few lines of Java or C++ code. I will try to read this post again with a fresh head. Anyway, thank you for the post.
PS: You have to be proud of yourself: I am C++/Java programmer with more than 10 years of experience :)
Oleg


Feb 13, 2013 at 4:56 PM
17 Comments

http://xkcd.com/1171/


Feb 13, 2013 at 10:01 PM
11,243 Comments

@Paul,

Ha ha, what can I say - I love me some RegEx :D

