Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.
I am the chief technical officer at InVision App, Inc - a prototyping and collaboration platform for designers, built by designers. I also rock out in JavaScript and ColdFusion 24x7.

Verbose Regular Expressions In ColdFusion And Java

Posted by Ben Nadel
Tags: ColdFusion

As with much of the regular expression testing that I do, it doesn't work directly in ColdFusion as used with the REFind() and REReplace() type methods. However, when accessing the underlying Java String objects, we have access to all the Java regular expression power and flexibility.

My most recent testing has been with the Verbose flag. Regular expressions have several flags:

i = Case Insensitive
L = Locale dependent
m = Multiline (which I have already touched on)
s = Dot All equivalent
u = Unicode
x = Verbose

... which can be invoked using the (?X) notation, where X is one or more of the flags above. The verbose flag lets you spread a regular expression across multiple lines and document it with inline comments. The trade-off is that white space in the pattern is ignored, so any white space that is meant to be matched must be escaped.
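As a minimal sketch of the flag notation in plain Java (the class and method names here are hypothetical, not from the original post), the embedded (?ix) form can be compared against passing the equivalent constants to Pattern.compile(). Both patterns below should reduce to a case-insensitive match on "hello world":

```java
import java.util.regex.Pattern;

public class EmbeddedFlags {

    // Embedded form: the flags travel inside the pattern string itself,
    // which is what makes them usable from String.matches() or replaceAll().
    static boolean embedded(String input) {
        return input.matches("(?ix) hello \\  world  # trailing comment");
    }

    // Same pattern body, with the flags supplied as compile-time constants.
    static boolean compiled(String input) {
        Pattern p = Pattern.compile(
            " hello \\  world  # trailing comment",
            Pattern.CASE_INSENSITIVE | Pattern.COMMENTS
        );
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(embedded("HELLO world")); // prints true
        System.out.println(compiled("HELLO world")); // prints true
        System.out.println(embedded("helloworld"));  // prints false
    }
}
```

The only white space that takes part in the match is the escaped space between the two words; everything else, including the trailing comment, is dropped by the verbose flag.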

So, let's build a test target string. The following is a tab-delimited data file:

  • <!---
  • Store the target text with TAB field delimiters.
  • The white space pre and post data will be handled
  • by the regular expression.
  • --->
  • <cfsavecontent variable="strText">
  • 1 Cindy Cho Black Very
  • 2 Libby Smith Brunette Very
  • 3 Julia Niles Blonde Fairly
  • </cfsavecontent>

Please just assume that the spaces between those fields are indeed TABS... I know it doesn't come through too well on the web. Now, before we get into the verbose flag, let's take a look at what this would look like in a standard regular expression:

  • <!--- Reformat the data. --->
  • #strText.Trim().ReplaceAll(
  • <!--- The regular expression. --->
  • "(?im)^[\s]*?([0-9]+)[ ]+([^ ]+)[ ]+([^ ]+)[ ]+([^ ]+)[ ]+([^\s]+)[\s]*?$",
  •  
  • <!--- The target formatting. --->
  • "ID: $1 | FName: $2 | Hair: $4 | Cute: $5<br />"
  • )#

This gives us the output:

ID: 1 | FName: Cindy | Hair: Black | Cute: Very
ID: 2 | FName: Libby | Hair: Brunette | Cute: Very
ID: 3 | FName: Julia | Hair: Blonde | Cute: Fairly
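Since the literal tabs do not survive the trip to the web, here is a rough Java equivalent that spells them out with the \t token. This is a sketch under assumptions: the class name, the padded sample data, and the use of \t in place of literal tabs are all illustrative, and the trailing break tag is dropped from the replacement so the result is plain text.

```java
public class CompactReformat {

    // Tab-delimited sample rows, padded with stray white space per line.
    static final String TEXT =
        "  1\tCindy\tCho\tBlack\tVery  \n" +
        "  2\tLibby\tSmith\tBrunette\tVery  \n" +
        "  3\tJulia\tNiles\tBlonde\tFairly  ";

    // Same shape as the compact expression above: five captured fields
    // per line, with the last name (group 3) left out of the output.
    static String reformat(String text) {
        return text.trim().replaceAll(
            "(?im)^[\\s]*?([0-9]+)[\\t]+([^\\t]+)[\\t]+([^\\t]+)[\\t]+([^\\t]+)[\\t]+([^\\s]+)[\\s]*?$",
            "ID: $1 | FName: $2 | Hair: $4 | Cute: $5"
        );
    }

    public static void main(String[] args) {
        System.out.println(reformat(TEXT));
    }
}
```

Run as written, this should print the same three lines shown in the output above.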

OK, so now let's take a look at the verbose expression. As I explained before, to flag the regular expression as verbose, I have to start it with the (?x) flag. This must be the FIRST item in the expression; it cannot have any white space before it. Let's store the regular expression in a CFSaveContent tag:

  • <!--- Store the regular expression. --->
  • <cfsavecontent variable="strRegEx"
  • >(?ixm)
  • ## This regular expression has been defined as being
  • ## verbose. That allows us to use white space and
  • ## comments to make it more readable. Notice though,
  • ## that it had to be the VERY FIRST token in the
  • ## regular expression. Also notice the use of the
  • ## double hash sign. This is not required by regular
  • ## expressions, but is required by ColdFusion since I am
  • ## in a CFOutput tag (not in demo here).
  •  
  • ## The flags for this expression are:
  • ## i = Case Insensitive
  • ## x = Verbose
  • ## m = Multiline
  •  
  • ## Match the beginning of the line.
  • ^
  •  
  • ## Leading white space for the line.
  • [\s]*?
  •  
  • ## The first group will be the ID of the girl.
  • ([0-9]+)
  •  
  • ## Because this regular expression is verbose, the
  • ## expression evaluation is ignoring white space in our
  • ## expression. Therefore, we have to escape any
  • ## white space that we want to use, even those that are
  • ## in character sets. In this case, I am escaping
  • ## the TAB character. We do NOT have to do this if we
  • ## used the special tab character (\t). This is ONLY
  • ## for actual white space characters.
  • [\ ]+
  •  
  • ## The second group will be the girl's first name.
  • ([^\ ]+)
  •  
  • ## White space.
  • [\ ]+
  •  
  • ## The third group will be the girl's last name.
  • ([^\ ]+)
  •  
  • ## White space.
  • [\ ]+
  •  
  • ## The fourth group will be the girl's hair color.
  • ([^\ ]+)
  •  
  • ## White space.
  • [\ ]+
  •  
  • ## The fifth group will be the girl's cuteness factor.
  • ## We want to get this one until we hit the end of
  • ## the line.
  • ([^\s]+)
  •  
  • ## White space at the end of the line.
  • [\s]*?
  •  
  • ## Match the end of the line.
  • $
  • </cfsavecontent>

Passing this expression to the same ReplaceAll() call as before, with the same target formatting, gives us the following output:

ID: 1 | FName: Cindy | Hair: Black | Cute: Very
ID: 2 | FName: Libby | Hair: Brunette | Cute: Very
ID: 3 | FName: Julia | Hair: Blonde | Cute: Fairly
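For comparison outside of ColdFusion, the same verbose layout can be sketched in plain Java by joining the pattern lines with newlines. The class name and sample data are hypothetical. Note that no double hash is needed here, since that was only a CFOutput concern, and that the tab inside each character set is written as a backslash followed by a literal tab character ("\\\t" in Java source).

```java
public class VerboseReformat {

    // Tab-delimited sample rows, padded with stray white space per line.
    static final String TEXT =
        "  1\tCindy\tCho\tBlack\tVery  \n" +
        "  2\tLibby\tSmith\tBrunette\tVery  \n" +
        "  3\tJulia\tNiles\tBlonde\tFairly  ";

    // One pattern token per line; lines starting with # are regex comments.
    static final String VERBOSE = String.join("\n",
        "(?ixm)",
        "# Match the beginning of the line.",
        "^",
        "# Leading white space for the line.",
        "[\\s]*?",
        "# Group 1: the ID.",
        "([0-9]+)",
        "# A literal TAB has to be escaped in verbose mode.",
        "[\\\t]+",
        "# Group 2: first name.",
        "([^\\\t]+)",
        "[\\\t]+",
        "# Group 3: last name.",
        "([^\\\t]+)",
        "[\\\t]+",
        "# Group 4: hair color.",
        "([^\\\t]+)",
        "[\\\t]+",
        "# Group 5: cuteness factor, up to the end of the line.",
        "([^\\s]+)",
        "# Trailing white space, then the end of the line.",
        "[\\s]*?",
        "$"
    );

    static String reformat(String text) {
        return text.trim().replaceAll(
            VERBOSE,
            "ID: $1 | FName: $2 | Hair: $4 | Cute: $5"
        );
    }

    public static void main(String[] args) {
        System.out.println(reformat(TEXT));
    }
}
```

The line breaks and comments in VERBOSE are discarded when the pattern is compiled, so this should behave like the compact one-line expression.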

As you can see, we have taken the exact same regular expression as the first example and made it about 1000 times longer. However, the regular expression is now fully documented and perhaps much easier to understand (although I guess that is going to come down to personal preference).

Now, a few things to note. All of the unescaped white space in the expression has been ignored: the leading tabs on each line and the line breaks are not used as part of the matching expression. Also notice that I have to escape the tab character ("\ ") in the verbose expression. If you look at the first example, you will notice that no tab characters have been escaped. That is just a trade-off of verbose mode. And this needs to be done both in AND out of character sets (i.e. [a-z] type usage).
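The escaping rule can be checked in isolation with a few tiny patterns. This is a small Java sketch with a hypothetical helper name; it covers white space both outside and inside a character set, plus the \t token, which needs no extra escaping.

```java
public class VerboseWhitespace {

    // Thin wrapper so each case reads as (pattern, input).
    static boolean matches(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // Outside a character set: the unescaped space is ignored.
        System.out.println(matches("(?x) a b", "ab"));       // prints true
        System.out.println(matches("(?x) a b", "a b"));      // prints false

        // Escaping the space makes it part of the match again.
        System.out.println(matches("(?x) a \\  b", "a b"));  // prints true

        // Inside a character set, Java ignores the space as well.
        System.out.println(matches("(?x) [a ]+", "a a"));    // prints false
        System.out.println(matches("(?x) [a\\ ]+", "a a"));  // prints true

        // The \t token is not white space in the pattern, so it works as-is.
        System.out.println(matches("(?x) a \\t b", "a\tb")); // prints true
    }
}
```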

So that's the demo. Pretty cool, huh? This seems like something that is going to be EXTREMELY useful when it comes to writing out very large and complex regular expressions.




Reader Comments

Adam,

I did not mean to suggest that CF doesn't support Verbose regular expressions. I only meant to suggest that I had not tested it in CF. I tend to do most of my regular expressions directly in the Java string now.

Also, I am not sure what CF is doing with the regular expression, but I don't think that it passes it directly on to Java; otherwise it would support negative/positive lookbehinds, and it doesn't seem to support those.

Thanks for the link, though, lots of good information there.


Adam,

When I go back and read the paragraph, you are right, it does sound like I was saying it doesn't work in CF. Sorry, that is misleading. I was trying to say that since some RE stuff doesn't work in CF, I tend to do Java RE (which is the only place I tested).

Thanks for pointing that out.


From http://www.amk.ca/python/howto/regex/regex.html#SECTION000450000000000000000:

<blockquote>
L
LOCALE
Make \w, \W, \b, and \B, dependent on the current locale.

Locales are a feature of the C library intended to help in writing programs that take account of language differences. For example, if you're processing French text, you'd want to be able to write \w+ to match words, but \w only matches the character class [A-Za-z]; it won't match "é" or "ç". If your system is configured properly and a French locale is selected, certain C functions will tell the program that "é" should also be considered a letter. Setting the LOCALE flag when compiling a regular expression will cause the resulting compiled object to use these C functions for \w; this is slower, but also enables \w+ to match French words as you'd expect.
</blockquote>

Makes sense.

I can't find any reference to it in either CF or Java regexes, though.

--
Adam


Adam,

Great explanation. I did not know that. I simply listed it because I came across a list of flags for regular expressions. I have never actually used it. Most of the time, I don't think about internationalization of my code. That's good to know though.


That must be a C-specific thing, which would explain why I've never heard of it. With many regex libraries, \w will match foreign characters without having to use any L operator. In fact, the exact characters matched by \w vary significantly from library to library (see http://regular-expressions.info/screens/rxbcharclass.png for a screenshot of exactly what it matches in RegexBuddy). In all libraries, it will include [A-Za-z]. In most, the underscore and digits are also included. I tend to avoid \w unless I want to include foreign characters without using an all-character operator such as . or [\S\s].

