Turning Modes On And Off Within A Regular Expression

Posted January 13, 2011 at 10:50 AM by Ben Nadel

Tags: ColdFusion

Most regular expression engines have at least some support for flags and modes within their patterns. Flags such as (x) Verbose, (i) Case-Insensitive, and (m) Multiline change the mode of the regular expression matching which, of course, changes the text that can be matched. I've used these flags plenty of times in the past; however, with my Scotch on the Rocks (SOTR) presentation approaching, I figured it was time to really see just how these flags can be used within a regular expression.

Up until now, I've only ever used these flags at the beginning of a regular expression in order to turn on a given mode for the entire pattern. For example, as you saw in my ColdFusion custom tag blog post yesterday, I was prepending every pattern with the (x) flag:

    <cfset pattern = ("(?x)" & pattern) />

... in order to turn on Verbose mode for the entire pattern matching operation.
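To make the effect of a leading (?x) concrete, here is a minimal sketch in Java (chosen because java.util.regex is the engine this post ends up using). The class name VerboseModeDemo, the helper matchesPhone(), and the sample pattern are all made up for illustration; the only thing being shown is that whitespace and #-comments become legal once verbose mode is on.

```java
import java.util.regex.Pattern;

final class VerboseModeDemo {

    // Hypothetical pattern: the whitespace and #-comments are only legal
    // because the leading (?x) turns on verbose (comments) mode.
    static final String VERBOSE_PATTERN =
        "(?x)      # turn on verbose mode for the whole pattern\n" +
        "\\d{3}    # three digits\n" +
        "-         # a literal dash\n" +
        "\\d{4}    # four digits";

    // True when the entire input matches the verbose pattern.
    static boolean matchesPhone(String input) {
        return Pattern.compile(VERBOSE_PATTERN).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesPhone("555-1234")); // true
        System.out.println(matchesPhone("555 1234")); // false
    }
}
```

Without the (?x) prefix, the spaces and comment text would be treated as literal characters to match.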

As it turns out, however, the modes determined by these flags don't just have to be toggled once. These flags can be used throughout the regular expression pattern in order to turn on and off the associated modes at will. So, for example, if you wanted to turn on the case-insensitive mode half-way through the pattern, you would just include the (?i) flag in the middle:

abc(?i)xyz

In this case, the literal "abc" would be matched by case; however, the latter half of the pattern, "xyz," would be matched without case. Each flag turns on the corresponding mode for the part of the pattern that follows it.
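As a quick check of that behavior, here is a small sketch that assumes an engine which honors mid-pattern flags (Java's java.util.regex is used here). The class name MidPatternFlagDemo and the helper found() are illustrative names, not part of any library.

```java
import java.util.regex.Pattern;

final class MidPatternFlagDemo {

    // True when the pattern can be found anywhere in the input.
    static boolean found(String pattern, String input) {
        return Pattern.compile(pattern).matcher(input).find();
    }

    public static void main(String[] args) {
        // "abc" is matched by case; everything after (?i) is not.
        String pattern = "abc(?i)xyz";

        System.out.println(found(pattern, "abcXYZ")); // true
        System.out.println(found(pattern, "abcxyz")); // true
        System.out.println(found(pattern, "ABCxyz")); // false
    }
}
```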

While everything I've looked at so far involved turning on modes, flags can also be used to turn off modes within a regular expression. To turn a mode off, simply precede the flag (or set of flags) with a minus sign:

(?-i)
(?i-m)
(?-ixm)
(?mx-i)

The above flags simply demonstrate a variety of ways in which the minus sign can be integrated within the flag construct. All the flags to the left of the minus sign are used to turn modes on; all flags to the right of the minus sign are used to turn modes off. So, for example, the pattern:

(?i-xm)

... is turning on case-insensitive mode (i), but turning off verbose mode (x) and multi-line mode (m).
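The on/off toggling can be sketched as follows, again assuming Java's java.util.regex as the engine. The class name FlagToggleDemo, the helper found(), and the sample inputs are hypothetical. The first pattern turns case-insensitivity off part-way through; the second turns multiline mode off while leaving case-insensitivity on.

```java
import java.util.regex.Pattern;

final class FlagToggleDemo {

    // True when the pattern can be found anywhere in the input.
    static boolean found(String pattern, String input) {
        return Pattern.compile(pattern).matcher(input).find();
    }

    public static void main(String[] args) {
        // (?i) applies to "ab"; (?-i) makes the trailing "c" case-sensitive.
        System.out.println(found("(?i)ab(?-i)c", "ABc")); // true
        System.out.println(found("(?i)ab(?-i)c", "ABC")); // false

        // ^ is compiled while multiline is on, so it matches at a line start.
        // $ is compiled after (?-m), so it only matches at the end of input.
        System.out.println(found("(?mi)^end(?-m)$", "x\nEND"));    // true
        System.out.println(found("(?mi)^end(?-m)$", "x\nEND\ny")); // false

        // With multiline left on, $ also matches before the inner line break.
        System.out.println(found("(?mi)^end$", "x\nEND\ny"));      // true
    }
}
```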

With these on/off toggles, we can now apply a given set of modes to a portion of a regular expression pattern. There is, however, an even more terse way to apply a set of modes to a single portion of a regular expression pattern: a non-capturing group.

In the past, I've only ever used a non-capturing group to define a group that doesn't get tracked as a back-reference:

(?:non captured group)

As it turns out, pattern flags can be applied in this context; and, when applied, they are only turned on or off for the duration of the non-capturing group:

(?i:non captured group)

In this example, we are turning on the case-insensitive mode (i), but only for the duration of the non-capturing group.
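Here is a minimal sketch of a flag scoped to a non-capturing group, assuming Java's java.util.regex as the engine. The class name ScopedFlagDemo and its helpers are illustrative names only.

```java
import java.util.regex.Pattern;

final class ScopedFlagDemo {

    // True when the pattern can be found anywhere in the input.
    static boolean found(String pattern, String input) {
        return Pattern.compile(pattern).matcher(input).find();
    }

    // Number of capturing groups the pattern defines.
    static int captureCount(String pattern) {
        return Pattern.compile(pattern).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // Case-insensitivity applies to "ab" only; "c" stays case-sensitive.
        System.out.println(found("(?i:ab)c", "ABc")); // true
        System.out.println(found("(?i:ab)c", "ABC")); // false

        // The flagged group is still non-capturing.
        System.out.println(captureCount("(?i:ab)c")); // 0
    }
}
```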

This is pretty awesome stuff!

Well, sort of. This level of support doesn't actually exist in all regular expression engines. In fact, in my testing, I discovered that this level of support doesn't even exist at the ColdFusion level. Just as with JavaScript, all flags used within a regular expression pattern get applied to the entire pattern, not just to the portion of the pattern that follows them. As such, you can't turn on a mode for only part of a pattern.

And, any attempt to turn off a mode within ColdFusion will throw an error like this:

Sequence (?-...) not recognized null

... and, any attempt to use a non-capturing group to apply a mode to a specific portion of a regular expression pattern will throw an error like this:

Sequence (?:...) not recognized null

The regular expression engine at the ColdFusion level is good for basic stuff; but, unfortunately, it sucks for almost everything else (including performance). Luckily, however, none of the constraints at the ColdFusion level apply to the regular expression engine at the Java level. So, let's dip down and get our hands dirty with the Java Pattern Matcher object.

    <cffunction
        name="jreMatch"
        access="public"
        returntype="array"
        output="false"
        hint="I gather all instances of the given pattern within the given string.">

        <!--- Define arguments. --->
        <cfargument
            name="pattern"
            type="string"
            required="true"
            hint="I am the regular expression pattern being matched within the input string."
            />

        <cfargument
            name="input"
            type="string"
            required="true"
            hint="I am the input in which the patterns are being matched."
            />

        <!--- Define the local scope. --->
        <cfset var local = {} />

        <!--- Get the matcher for the given regular expression. --->
        <cfset local.matcher =
            createObject( "java", "java.util.regex.Pattern" )
                .compile( javaCast( "string", arguments.pattern ) )
                .matcher( javaCast( "string", arguments.input ) )
            />

        <!--- Create an array to hold the matches. --->
        <cfset local.matches = [] />

        <!---
            Keep searching the input string while matches of our regular
            expression pattern can be found.
        --->
        <cfloop condition="local.matcher.find()">

            <!--- Add the current match to the collection. --->
            <cfset arrayAppend(
                local.matches,
                local.matcher.group()
                ) />

        </cfloop>

        <!--- Return the aggregated matches. --->
        <cfreturn local.matches />
    </cffunction>


    <!--- ----------------------------------------------------- --->
    <!--- ----------------------------------------------------- --->


    <!--- Define our input. --->
    <cfset input = "ABCxyz" />

    <cfoutput>

        INPUT == #input#<br />
        <br />

        abc:
        #arrayToList( jreMatch( "abc", input ) )#
        <br />

        (?i)abc ==
        #arrayToList( jreMatch( "(?i)abc", input ) )#
        <br />

        (?i)ab(?-i)c ==
        #arrayToList( jreMatch( "(?i)ab(?-i)c", input ) )#
        <br />

        (?i:ab)c ==
        #arrayToList( jreMatch( "(?i:ab)c", input ) )#
        <br />

        (?i:abc) ==
        #arrayToList( jreMatch( "(?i:abc)", input ) )#
        <br />

        (?i)(abc)(XYZ) ==
        #arrayToList( jreMatch( "(?i)(abc)(XYZ)", input ) )#
        <br />

        (?i)(ABC)(?-i)(XYZ) ==
        #arrayToList( jreMatch( "(?i)(ABC)(?-i)(XYZ)", input ) )#
        <br />

        (?i)(abc)(?-i:xyz) ==
        #arrayToList( jreMatch( "(?i)(abc)(?-i:xyz)", input ) )#
        <br />

        abc(?i)XYZ ==
        #arrayToList( jreMatch( "abc(?i)XYZ", input ) )#
        <br />

        ABC(?i)XYZ ==
        #arrayToList( jreMatch( "ABC(?i)XYZ", input ) )#
        <br />

    </cfoutput>

Since there is no reMatch()-style method in Java, we start out by creating a ColdFusion UDF that uses the Java Matcher object in order to compile a collection of pattern matches. Then, we go about trying various flags to turn on and off case-insensitive mode within various regular expressions. When we run the above code, we get the following output:

INPUT == ABCxyz

abc:
(?i)abc == ABC
(?i)ab(?-i)c ==
(?i:ab)c ==
(?i:abc) == ABC
(?i)(abc)(XYZ) == ABCxyz
(?i)(ABC)(?-i)(XYZ) ==
(?i)(abc)(?-i:xyz) == ABCxyz
abc(?i)XYZ ==
ABC(?i)XYZ == ABCxyz
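For readers who want to reproduce these results without ColdFusion, the same find-all loop can be sketched directly in Java. This is an illustrative equivalent, not the code used above: the class name JreMatchDemo and the method findAll() are made-up names, and only java.util.regex.Pattern and Matcher come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JreMatchDemo {

    // Gather every match of the pattern within the input, in order.
    static List<String> findAll(String pattern, String input) {
        Matcher matcher = Pattern.compile(pattern).matcher(input);
        List<String> matches = new ArrayList<>();

        // Keep searching while further matches can be found.
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "ABCxyz";
        String[] patterns = {
            "abc",
            "(?i)abc",
            "(?i)ab(?-i)c",
            "(?i:ab)c",
            "(?i:abc)",
            "(?i)(abc)(XYZ)",
            "(?i)(ABC)(?-i)(XYZ)",
            "(?i)(abc)(?-i:xyz)",
            "abc(?i)XYZ",
            "ABC(?i)XYZ"
        };

        // Print each pattern next to a comma-separated list of its matches.
        for (String pattern : patterns) {
            System.out.println(pattern + " == " + String.join(",", findAll(pattern, input)));
        }
    }
}
```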

So there you go; at the Java level, you can use flags to turn a given mode (or set of modes) on and off for very specific portions of a regular expression pattern. I'll tell you, though, if we ever get the ability to overwrite native methods in ColdFusion, the first thing I'm gonna do is overwrite all the Regular-Expression functions to use the Java pattern matching engine; the more I learn about regular expressions, the more I'm finding the ColdFusion regular expression engine to be quite limited.




Reader Comments

Apr 18, 2011 at 9:35 AM

It's really, really awesome stuff. You have exposed some hidden jewels in regular expressions. Thanks a lot.


Apr 18, 2011 at 10:59 AM

@Rakesh,

Glad you liked this - regular expressions are so awesome :)


Jul 1, 2012 at 12:13 AM

Great stuff Ben ...

One thing though ... I thought the mode modifier (?x) was to enable comment and whitespace support ...

Methods of the Java Pattern Class ...

"Permits whitespace and comments in the pattern. In this mode, whitespace is ignored, and embedded comments starting with # are ignored until the end of a line. Comments mode can also be enabled via the embedded flag expression (?x)."


Jul 1, 2012 at 8:44 AM

Uh ... yeah, maybe I should connect the dots ... {facepalm}

