Using The Regular Expression Boundary Match \G To Find The End Of The Previous Match

Posted September 23, 2010 at 10:40 AM by Ben Nadel

Tags: ColdFusion

A couple of years ago, Steve Levithan - Regular Expression ninja and co-author of the Regular Expressions Cookbook - helped me with a regular expression pattern for parsing comma-separated values (CSV). In the pattern, he made use of a RegEx boundary character that I had not seen before: \G. This boundary character matches the end of the previous match, preventing the parsing engine from skipping characters between matches. While I mostly understood his explanation, I thought it was time to look into this RegEx construct myself.
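To make the "no skipping" idea concrete, here is a minimal sketch in plain Java (rather than ColdFusion) that runs the same digit pattern with and without \G. The class name, the findAll() helper, and the sample input are all made up for illustration; only java.util.regex comes from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and the sample input are made up.
class ContiguousMatchDemo {

    // Collect every match that Matcher.find() reports for the pattern.
    static List<String> findAll(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "123a45";

        // Without \G, the engine skips over the "a" and keeps matching.
        System.out.println(findAll("\\d", input));    // [1, 2, 3, 4, 5]

        // With \G, each match must begin where the last one ended, so
        // matching stops for good once it reaches the "a".
        System.out.println(findAll("\\G\\d", input)); // [1, 2, 3]
    }
}
```

The only difference between the two calls is the leading \G, which is what turns "find the next digit anywhere" into "find the next digit right here or give up."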

At its core, ColdFusion provides access to two different regular expression engines: POSIX and Java. The POSIX regular expression engine is what powers the built-in RegEx functions like reFind(), reMatch(), and reReplace(). The Java regular expression engine is made available to us at the Java layer via createObject( "java", "java.util.regex.Pattern" ).

These two engines have a lot in common, but each also has features that the other lacks. The \G boundary match appears to be one of those things that is more or less unique to the Java engine. While the POSIX engine does have some basic support for \G, I have found it to be buggy and, in some cases, capable of crashing the request with an out-of-memory error.

To see the \G regular expression boundary match in action, I wanted to use it in a common use case - parsing the missingMethodName argument passed into the onMissingMethod() ColdFusion event handler. I will try this using both the POSIX and the Java regular expression engines:

  • <!--- Mimic a missingMethodName argument. --->
  • <cfset missingMethodName = "getPhoneNumber" />
  •  
  • <!--- ----------------------------------------------------- --->
  • <!--- ----------------------------------------------------- --->
  •  
  •  
  • <!--- Define our regular expression pattern. --->
  • <cfsavecontent variable="patternText">(?x)
  •  
  • ## The \G here tells the pattern matching engine to ensure
  • ## that the current match starts at the end of the last
  • ## match - no characters can be skipped between matches.
  • ##
  • ## The very first match must be at the beginning of the string
  • ## (which is at position zero).
  •  
  • \G
  •  
  • (
  • ## Greater string starts with "get" or "set".
  •  
  • ^( get | set )
  •  
  • |
  •  
  • ## Greater string ends with word characters.
  •  
  • \w+$
  • )
  •  
  • </cfsavecontent>
  •  
  •  
  • <!--- ----------------------------------------------------- --->
  • <!--- ----------------------------------------------------- --->
  •  
  •  
  • <!---
  • First, we're going to try using this pattern with the POSIX
  • regular expression engine - the one that powers core ColdFusion
  • functions like reFind() and reMatch().
  • --->
  •  
  • <!--- Parse the missing method name. --->
  • <cfset nameParts = reMatch( patternText, missingMethodName ) />
  •  
  • <!--- Output matches. --->
  • <cfoutput>
  •  
  • POSIX: [ #arrayToList( nameParts, " ] [ " )# ]
  •  
  • </cfoutput>
  •  
  •  
  • <br />
  • <!--- ----------------------------------------------------- --->
  • <!--- ----------------------------------------------------- --->
  • <br />
  •  
  •  
  • <!---
  • Next, we're going to try the Java regular expression engine
  • using the Pattern class.
  • --->
  •  
  • <!--- Get a matcher for the compiled pattern. --->
  • <cfset matcher = createObject( "java", "java.util.regex.Pattern" )
  • .compile( javaCast( "string", patternText ) )
  • .matcher( javaCast( "string", missingMethodName ) )
  • />
  •  
  • <cfoutput>
  •  
  • JAVA:
  •  
  • <!--- Find each match in the target string. --->
  • <cfloop condition="matcher.find()">
  •  
  • [ #matcher.group()# ]
  •  
  • </cfloop>
  •  
  • </cfoutput>

In this demo, I am using a verbose regular expression (as defined by the (?x) mode flag). This mode allows us to add white space and comments to the pattern for clarity and readability. As you can see, my \G boundary is the very first part of my pattern. This ensures that each match starts at the end of the previous match (or at the beginning of the string - position zero - for the first match). The meat of the pattern then matches either the get/set action or (|) the component property being accessed (ex. PhoneNumber).

When I run this code through the two different regular expression engines, I get the following output:

POSIX: [ get ]
JAVA: [ get ] [ PhoneNumber ]

As you can see, the POSIX engine (reMatch() in this case) didn't quite like the use of \G. The Java regular expression engine, on the other hand, had no problem with it at all.
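For anyone who wants to poke at the Java half of the demo outside of ColdFusion, here is a rough sketch of the same pattern in plain Java. The class name and the parse() helper are hypothetical, and I'm passing Pattern.COMMENTS to compile() in place of the inline (?x) flag; the second input ("get-PhoneNumber") is one I made up to show what \G rules out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it mirrors the pattern from the ColdFusion demo.
class MissingMethodNameDemo {

    // Pattern.COMMENTS is the flag form of (?x): white space is ignored
    // and "#" starts a comment that runs to the end of the line.
    private static final Pattern NAME_PARTS = Pattern.compile(
        "\\G            # each match must start where the last one ended\n" +
        "(\n" +
        "  ^(get|set)   # the name starts with get or set\n" +
        "  |\n" +
        "  \\w+$        # or the rest of the name, up to the end\n" +
        ")",
        Pattern.COMMENTS
    );

    // Split a method name into its matched parts.
    static List<String> parse(String methodName) {
        Matcher matcher = NAME_PARTS.matcher(methodName);
        List<String> parts = new ArrayList<>();
        while (matcher.find()) {
            parts.add(matcher.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        // Same input as the ColdFusion demo.
        System.out.println(parse("getPhoneNumber"));  // [get, PhoneNumber]

        // The "-" cannot be skipped, so the second match never happens.
        System.out.println(parse("get-PhoneNumber")); // [get]
    }
}
```

The second call is the interesting one: without the leading \G, the engine would be free to step over the "-" and still report PhoneNumber as a match.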

When I first started experimenting with the \G boundary match, I tried using this pattern:

  • ^(get|set)|\G\w+$

Here, the \G is a possible part of the match, not a definite part of the match. While this worked in the Java regular expression engine, it caused the following ColdFusion error in the POSIX engine:

ROOT CAUSE: java.lang.OutOfMemoryError: Java heap space

While the Java engine was able to handle this particular pattern, from what I have read, it seems to be a best practice to always put the \G boundary match at the beginning of the regular expression pattern (as I have done in my main demo).
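As a quick check of the Java side of that claim, here is a small sketch in plain Java that runs both placements of \G against the same input. The class name and the findAll() helper are hypothetical, and this only exercises java.util.regex; it says nothing about how the POSIX engine behaves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing two placements of \G.
class BoundaryPlacementDemo {

    // \G at the very start of the pattern (as in the main demo).
    static final String LEADING = "\\G(^(get|set)|\\w+$)";

    // \G inside the second alternative only (the first experiment).
    static final String INNER = "^(get|set)|\\G\\w+$";

    // Collect every match that Matcher.find() reports for the pattern.
    static List<String> findAll(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "getPhoneNumber";
        System.out.println(findAll(LEADING, input)); // [get, PhoneNumber]
        System.out.println(findAll(INNER, input));   // [get, PhoneNumber]
    }
}
```

For this input the two patterns produce the same matches in Java, which fits with the leading \G being a readability and portability habit rather than something the Java engine requires.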

Regular expressions are both awesome and powerful. Every time that I learn a little something new in the pattern matching world, it tends to open up more possibilities for problem solving. I'm glad that I finally took the time to look deeper into the \G boundary matcher.




Reader Comments

Sep 24, 2010 at 7:44 AM // reply »
9 Comments

Hi Ben,
The "No one has left a comment" display idea is nice, lol


Sep 24, 2010 at 8:42 AM // reply »
11,246 Comments

@CFFan,

Thanks - I thought it was fun :)


Sep 24, 2010 at 11:04 AM // reply »
44 Comments

  • <cfcomponent hint="Ben Class" extends="Superman Class">
  • <cffunction name="init">
  • <cfscript>
  • benchPress( 400 );
  • legPress( 800 );
  • killBearWithBareHands( 10 );
  • eatRawMeatForBreakfast();
  • saveTheDay();
  • </cfscript>
  • </cffunction>
  •  
  • <cffunction name="flirt" output="always" hint="The only hint you need is me flexing my pecks">
  • <cfreturn "Hey baby, wanna wrestle?" />
  • </cffunction>
  •  
  • <cffunction name="flirtEnEspanol" output="¡si si!" hint="Yo Quiero Taco Bell">
  • <cfreturn "Hola señorita. ¿Quieres luchar?" />
  • </cffunction>
  • </cfcomponent>


Sep 27, 2010 at 11:37 AM // reply »
1 Comments

I must admit I still do not understand the expression very well, though I reread the post a few times.
Regex expressions are a nightmare. Regarding the CSV file - it would take a few lines of Java or C++ code. I will try to read this post with a fresh head. Anyway, thank you for the post.
PS: You have to be proud of yourself: I am a C++/Java programmer with more than 10 years of experience :)
Oleg


Feb 13, 2013 at 4:56 PM // reply »
17 Comments

http://xkcd.com/1171/


Feb 13, 2013 at 10:01 PM // reply »
11,246 Comments

@Paul,

Ha ha, what can I say - I love me some RegEx :D

