Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.
I am the chief technical officer at InVision App, Inc - a prototyping and collaboration platform for designers, built by designers. I also rock out in JavaScript and ColdFusion 24x7.

Turning Modes On And Off Within A Regular Expression

By Ben Nadel on
Tags: ColdFusion

Most regular expression engines have at least some support for flags and modes within their patterns. Flags such as (x) Verbose, (i) Case-Insensitive, and (m) Multiline change the mode of the regular expression matching which, of course, changes the type of text that can be matched. I've used these flags plenty of times in the past; however, with my Scotch on the Rocks (SOTR) presentation approaching, I figured it was time to really see just how these flags can be used within a regular expression.

Up until now, I've only ever used these flags at the beginning of a regular expression in order to turn on a given mode for the entire pattern. For example, as you saw in my ColdFusion custom tag blog post yesterday, I was prepending every pattern with the (x) flag:

    <cfset pattern = ("(?x)" & pattern) />

... in order to turn on Verbose mode for the entire pattern matching operation.
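As a rough sketch of what Verbose mode buys you, here is the same prepend-the-flag idea against Java's java.util.regex engine, where (?x) makes the engine ignore whitespace and #-style comments inside the pattern. The class name, the compileVerbose() helper, and the phone-number pattern are all made up for illustration:

```java
import java.util.regex.Pattern;

class VerboseFlagDemo {

    // Hypothetical helper: prepend the inline (?x) flag to any pattern,
    // mirroring the <cfset> line above.
    static Pattern compileVerbose(String pattern) {
        return Pattern.compile("(?x)" + pattern);
    }

    public static void main(String[] args) {
        // In verbose mode, the spacing and the # comments are not part of
        // the text being matched.
        String pattern =
            "\\d{3}   # first three digits\n" +
            "-        # separator\n" +
            "\\d{4}   # last four digits\n";

        System.out.println(compileVerbose(pattern).matcher("555-1234").matches()); // true
    }
}
```

Because whitespace is ignored in this mode, a literal space in the input has to be matched explicitly (for example, with an escaped space or \s).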

As it turns out, however, the modes determined by these flags don't just have to be toggled once. These flags can be used throughout the regular expression pattern in order to turn on and off the associated modes at will. So, for example, if you wanted to turn on the case-insensitive mode half-way through the pattern, you would just include the (?i) flag in the middle:

abc(?i)xyz

In this case, the literal "abc" would be matched by case; however, the latter half of the pattern, "xyz," would be matched without case. Each flag turns on the corresponding mode for the part of the pattern that follows it.
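To make that concrete, here is a minimal sketch using Java's java.util.regex engine (one engine that supports mid-pattern flags). The class and method names are only illustrative:

```java
import java.util.regex.Pattern;

class MidPatternFlagDemo {

    // True if the entire input matches the pattern abc(?i)xyz.
    static boolean matches(String input) {
        return Pattern.compile("abc(?i)xyz").matcher(input).matches();
    }

    public static void main(String[] args) {
        // "abc" must match exactly; "xyz" may be in any case.
        System.out.println(matches("abcXYZ")); // true
        // The case-sensitive first half rejects "ABC".
        System.out.println(matches("ABCxyz")); // false
    }
}
```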

While everything I've looked at so far involved turning on modes, flags can also be used to turn off modes within a regular expression. To turn a mode off, simply precede the flag (or set of flags) with a minus sign:

(?-i)
(?i-m)
(?-ixm)
(?mx-i)

The above flags simply demonstrate a variety of ways in which the minus sign can be integrated within the flag construct. All the flags to the left of the minus sign are used to turn modes on; all flags to the right of the minus sign are used to turn modes off. So, for example, the pattern:

(?i-xm)

... is turning on case-insensitive mode (i), but turning off verbose mode (x) and multi-line mode (m).
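Here is a small sketch of the on/off syntax, again assuming Java's java.util.regex engine. The class name, helper, and sample patterns are my own and are only meant to illustrate the toggling:

```java
import java.util.regex.Pattern;

class FlagOffDemo {

    // True if the entire input matches the given regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Case-insensitive for "ab", then switched back off for "c".
        System.out.println(matches("(?i)ab(?-i)c", "ABc")); // true
        System.out.println(matches("(?i)ab(?-i)c", "ABC")); // false

        // Verbose mode for "a b" (the space is ignored); then (?i-x) turns
        // case-insensitivity on and verbose off, so the space before "C"
        // becomes a literal space.
        System.out.println(matches("(?x)a b(?i-x) C", "ab c")); // true
    }
}
```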

With these on/off toggles, we can now apply a given set of modes to a portion of a regular expression pattern. There is, however, an even more terse way to apply a set of modes to a single portion of a regular expression pattern: a non-capturing group.

In the past, I've only ever used a non-capturing group to define a group that doesn't get tracked as a back-reference:

(?:non captured group)

As it turns out, pattern flags can be applied in this context; and, when applied, they are only turned on or off for the duration of the non-capturing group:

(?i:non captured group)

In this example, we are turning on the case-insensitive mode (i), but only for the duration of the non-capturing group.
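A quick sketch of that scoping, assuming Java's java.util.regex engine (the class and helper names are hypothetical):

```java
import java.util.regex.Pattern;

class ScopedFlagDemo {

    // True if the entire input matches the given regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Only "ab" is case-insensitive; the trailing "c" is matched by case.
        System.out.println(matches("(?i:ab)c", "ABc")); // true
        System.out.println(matches("(?i:ab)c", "ABC")); // false
    }
}
```

The mode ends where the group's closing parenthesis ends, so there is no need for a separate (?-i) to switch it back off.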

This is pretty awesome stuff!

Well, sort of. This level of support doesn't actually exist in all regular expression engines. In fact, in my testing, I discovered that this level of support doesn't even exist at the ColdFusion level. Just as with JavaScript, all flags used within a regular expression pattern get applied to the entire pattern, not just to the portion of the pattern that follows them. As such, you can't turn on a mode for only part of a pattern.

And, any attempt to turn off a mode within ColdFusion will throw an error like this:

Sequence (?-...) not recognized null

... and, any attempt to use a non-capturing group to apply a mode to a specific portion of a regular expression pattern will throw an error like this:

Sequence (?:...) not recognized null

The regular expression engine at the ColdFusion level is good for basic stuff; but, unfortunately, it sucks for almost everything else (including performance). Luckily, however, none of the constraints at the ColdFusion level apply to the regular expression engine at the Java level. So, let's dip down and get our hands dirty with the Java Pattern Matcher object.

<cffunction
    name="jreMatch"
    access="public"
    returntype="array"
    output="false"
    hint="I gather all instances of the given pattern within the given string.">

    <!--- Define arguments. --->
    <cfargument
        name="pattern"
        type="string"
        required="true"
        hint="I am the regular expression pattern being matched within the input string."
        />

    <cfargument
        name="input"
        type="string"
        required="true"
        hint="I am the input in which the patterns are being matched."
        />

    <!--- Define the local scope. --->
    <cfset var local = {} />

    <!--- Get the matcher for the given regular expression. --->
    <cfset local.matcher =
        createObject( "java", "java.util.regex.Pattern" )
            .compile( javaCast( "string", arguments.pattern ) )
            .matcher( javaCast( "string", arguments.input ) )
        />

    <!--- Create an array to hold the matches. --->
    <cfset local.matches = [] />

    <!---
        Keep searching the input string while matches of our regular
        expression pattern can be found.
    --->
    <cfloop condition="local.matcher.find()">

        <!--- Add the current match to the collection. --->
        <cfset arrayAppend(
            local.matches,
            local.matcher.group()
            ) />

    </cfloop>

    <!--- Return the aggregated matches. --->
    <cfreturn local.matches />
</cffunction>


<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->


<!--- Define our input. --->
<cfset input = "ABCxyz" />

<cfoutput>

    INPUT == #input#<br />
    <br />

    abc:
    #arrayToList( jreMatch( "abc", input ) )#
    <br />

    (?i)abc ==
    #arrayToList( jreMatch( "(?i)abc", input ) )#
    <br />

    (?i)ab(?-i)c ==
    #arrayToList( jreMatch( "(?i)ab(?-i)c", input ) )#
    <br />

    (?i:ab)c ==
    #arrayToList( jreMatch( "(?i:ab)c", input ) )#
    <br />

    (?i:abc) ==
    #arrayToList( jreMatch( "(?i:abc)", input ) )#
    <br />

    (?i)(abc)(XYZ) ==
    #arrayToList( jreMatch( "(?i)(abc)(XYZ)", input ) )#
    <br />

    (?i)(ABC)(?-i)(XYZ) ==
    #arrayToList( jreMatch( "(?i)(ABC)(?-i)(XYZ)", input ) )#
    <br />

    (?i)(abc)(?-i:xyz) ==
    #arrayToList( jreMatch( "(?i)(abc)(?-i:xyz)", input ) )#
    <br />

    abc(?i)XYZ ==
    #arrayToList( jreMatch( "abc(?i)XYZ", input ) )#
    <br />

    ABC(?i)XYZ ==
    #arrayToList( jreMatch( "ABC(?i)XYZ", input ) )#
    <br />

</cfoutput>

Since there is no reMatch()-style method in Java, we start out by creating a ColdFusion UDF that uses the Java Matcher object in order to compile a collection of pattern matches. Then, we go about trying various flags to turn on and off case-insensitive mode within various regular expressions. When we run the above code, we get the following output:

INPUT == ABCxyz

abc:
(?i)abc == ABC
(?i)ab(?-i)c ==
(?i:ab)c ==
(?i:abc) == ABC
(?i)(abc)(XYZ) == ABCxyz
(?i)(ABC)(?-i)(XYZ) ==
(?i)(abc)(?-i:xyz) == ABCxyz
abc(?i)XYZ ==
ABC(?i)XYZ == ABCxyz
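For anyone who wants to poke at these patterns without a ColdFusion server handy, the same find() / group() loop can be sketched directly in Java. This is only an approximation of the UDF above; the class name is made up, and it returns a List rather than a ColdFusion array:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JreMatchSketch {

    // Gather all matches of the given pattern within the given input.
    static List<String> jreMatch(String pattern, String input) {
        Matcher matcher = Pattern.compile(pattern).matcher(input);
        List<String> matches = new ArrayList<>();

        // Keep searching while matches of the pattern can be found.
        while (matcher.find()) {
            matches.add(matcher.group());
        }

        return matches;
    }

    public static void main(String[] args) {
        String input = "ABCxyz";
        String[] patterns = {
            "abc",
            "(?i)abc",
            "(?i)ab(?-i)c",
            "(?i:ab)c",
            "(?i:abc)",
            "(?i)(abc)(?-i:xyz)",
            "abc(?i)XYZ",
            "ABC(?i)XYZ"
        };

        // Print each pattern followed by a comma-delimited list of matches.
        for (String pattern : patterns) {
            System.out.println(pattern + " == " + String.join(",", jreMatch(pattern, input)));
        }
    }
}
```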

So there you go; at the Java level, you can use flags to both turn on and off a given mode (or set of modes) for very specific durations of a regular expression pattern. I'll tell you, though, if we ever get the ability to overwrite native methods in ColdFusion, the first thing I'm gonna do is overwrite all the Regular-Expression functions to use the Java pattern matching engine; the more I learn about regular expressions, the more I'm finding the ColdFusion regular expression engine to be quite limited.




Reader Comments

Its really really awsome stuff. You have exposed some hidden jewels in regular expressions. Thanks a lot.


Great stuff Ben ...

One thing though ... I thought the mode modifier (?x) was to enable comment and whitespace support ...

Methods of the Java Pattern Class ...

"Permits whitespace and comments in the pattern. In this mode, whitespace is ignored, and embedded comments starting with # are ignored until the end of a line. Comments mode can also be enabled via the embedded flag expression (?x)."
