Dot-Character Matches In ColdFusion And Java Regular Expressions

Posted November 26, 2008 at 10:22 AM by Ben Nadel

Tags: ColdFusion

The other day on Twitter, Mark Mandel posted a tweet about regular expressions and multi-line mode. He was trying to match patterns on a per-line basis. After he posted this, I went and did a little experimentation with multi-line mode and I discovered a really interesting (read: frustrating) discrepancy between ColdFusion's regular expression engine and Java's regular expression engine, one that is absolutely crucial to the effectiveness of multi-line mode pattern matching. What I found was that the dot-character matches a different set of characters in each engine.

Traditionally, the dot (.) character matches all characters except the new line and carriage return characters. In essence, the dot pattern is the logical equivalent of the negated character class:

[^\n\r]

The only exception to this is single-line mode. In single-line mode, the dot also matches these new line characters. According to www.regular-expressions.info, this is a mode that has to be explicitly turned on in all modern regular expression engines.
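To make the two modes concrete, here is a minimal sketch in plain Java. The class name, the helper method, and the sample text are made up for illustration; only java.util.regex itself comes from the post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotModeDemo {

    // Returns the first match of the regex in the text, or null if there is none.
    public static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Illustrative two-line sample text.
        String text = "one\ntwo";

        // Default mode: the dot stops at the line break.
        System.out.println("one".equals(firstMatch(".+", text)));

        // Single-line mode, switched on with the inline (?s) flag:
        // the dot also consumes the line break.
        System.out.println("one\ntwo".equals(firstMatch("(?s).+", text)));
    }
}
```

Both checks should print true: without the flag the match ends at the first line, and with (?s) it runs through the line break.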

Ok, this seems like a really powerful pattern, so what's the problem? The problem is that it seems as though ColdFusion's regular expression engine operates in this "single-line" mode by default, whereas the underlying Java regular expression engine does not. Let's examine this more closely. First, I am gonna create a string that has like-patterns on each line:

<!--- Store text to parse using regular expressions. --->
<cfsavecontent variable="strText">

    Betty: Cutie
    Kit: Kinky
    Sarah: Stubby
    Julie: Happy

</cfsavecontent>

Now, I am gonna match each pattern per line using the multi-line flag:

(?m)^\s+\w+:.+$

Here, the left half of the pattern matches the name:

\s+\w+

... and the right half of the pattern selects the adjective by matching everything until the end of the line:

.+

First we are gonna do this with ColdFusion, then we are going to use this with Java's pattern matcher:

<!---
    Use ColdFusion regular expressions in multi-line mode
    to grab the matches.
--->
<cfset arrMatches = REMatch(
    "(?m)^\s+\w+:.+$",
    strText
    ) />

<!--- Dump out the ColdFusion REMatch() captures. --->
<cfdump
    var="#arrMatches#"
    label="REMatch() Multi-line Matches"
    />

When we run this code, we get the following output:

[Figure: REMatch() Multi-Line Matching Using Dot-Character Class.]

As you can see, the dot-character in ColdFusion matched EVERY character including the new line characters. And, since regular expressions are, by default, greedy, it kept matching until it hit the end of the entire string (not just the given line).
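Since the dot is traditionally just shorthand for [^\n\r], one way to sidestep the problem is to spell that class out, so the pattern no longer depends on how the engine treats the dot. The sketch below is plain Java, not ColdFusion: it uses the (?s) flag as a stand-in for the behavior seen in REMatch(), which is an assumption on my part, and the class name, helper, and tab-indented sample text are made up for illustration. I have not verified the same pattern inside REMatch() itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegatedClassDemo {

    // Collects every match of the regex in the text, in order.
    public static List<String> allMatches(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Illustrative text: each line starts with a tab so that \s+ has
        // leading whitespace to match.
        String text = "\tBetty: Cutie\n\tKit: Kinky\n\tSarah: Stubby\n\tJulie: Happy\n";

        // With the dot matching line breaks (assumed stand-in for what
        // REMatch() did), the greedy .+ swallows everything in one match.
        System.out.println(allMatches("(?ms)^\\s+\\w+:.+$", text).size());

        // With the explicit negated class, each line is matched on its own
        // even though (?s) is still switched on.
        System.out.println(allMatches("(?ms)^\\s+\\w+:[^\\n\\r]+$", text).size());
    }
}
```

The first count should be 1 and the second 4, which suggests that swapping the dot for the negated class keeps the matching on a single line regardless of the dot's mode.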

OK, now, let's take that same pattern and do the matching using Java's Pattern Matcher object:

<!--- Create a pattern to match our per-line patterns. --->
<cfset objPattern = CreateObject(
    "java",
    "java.util.regex.Pattern"
    ).Compile(
        JavaCast( "string", "(?m)^\s+\w+:.+$" )
        )
    />

<!--- Get a pattern matcher for our target text. --->
<cfset objMatcher = objPattern.Matcher(
    JavaCast( "string", strText )
    ) />

<!--- Create an array to store our matches. --->
<cfset arrMatches = [] />

<!--- Put all matches into our array. --->
<cfloop condition="objMatcher.Find()">

    <!--- Add match to array. --->
    <cfset ArrayAppend( arrMatches, objMatcher.Group() ) />

</cfloop>

<!--- Dump out Java Pattern captures. --->
<cfdump
    var="#arrMatches#"
    label="Pattern-Matcher Multi-line Matches"
    />

Granted, this code is much more involved (and faster, and more powerful), but when we run it, we get this output:

[Figure: Java Pattern Matcher Using Dot-Character Class.]

As you can see here, the dot-character did not match the new line characters and the matching successfully remained on any given line.
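Because the dot stays on its own line here, the two halves of the pattern (the name and the adjective) can also be pulled apart with capture groups. This is only a sketch in plain Java: the class name, the method name, the added groups, and the [ \t]* that skips the space after the colon are my own additions for illustration, not part of the original pattern.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureHalvesDemo {

    // Maps each name (left half) to its adjective (right half), one entry
    // per line, in the order the lines appear.
    public static Map<String, String> parseLines(String text) {
        // Same shape as the pattern in the post, with capture groups added.
        Pattern pattern = Pattern.compile("(?m)^\\s+(\\w+):[ \\t]*(.+)$");
        Matcher matcher = pattern.matcher(text);
        Map<String, String> result = new LinkedHashMap<>();
        while (matcher.find()) {
            result.put(matcher.group(1), matcher.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative text: each line is indented with a tab.
        String text = "\tBetty: Cutie\n\tKit: Kinky\n\tSarah: Stubby\n\tJulie: Happy\n";
        System.out.println(parseLines(text));
    }
}
```

This only works per line because the dot in group 2 stops at the line break; with single-line mode on, the first adjective would swallow the rest of the text.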

You can get the underlying Java regular expression engine to operate in single-line mode by explicitly flagging it to do so using (?s):

(?ms)^\s+\w+:.+$

Doing this will result in the same output as that produced by ColdFusion's REMatch() method.
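As far as I know, the inline (?ms) flags have constant equivalents that can be passed to Pattern.compile(): Pattern.MULTILINE and Pattern.DOTALL. Here is a small sketch comparing the two spellings; the class name, the counting helper, and the tab-indented sample text are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SingleLineFlagDemo {

    // Counts how many times the compiled pattern matches in the text.
    public static int countMatches(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Illustrative text: each line is indented with a tab.
        String text = "\tBetty: Cutie\n\tKit: Kinky\n\tSarah: Stubby\n\tJulie: Happy\n";
        String regex = "^\\s+\\w+:.+$";

        // Inline flags, as written in the post.
        Pattern inlineFlags = Pattern.compile("(?ms)" + regex);

        // The same two modes requested through compile-time constants.
        Pattern constantFlags = Pattern.compile(regex, Pattern.MULTILINE | Pattern.DOTALL);

        // Multi-line mode only, for comparison.
        Pattern multiLineOnly = Pattern.compile(regex, Pattern.MULTILINE);

        System.out.println(countMatches(inlineFlags, text));
        System.out.println(countMatches(constantFlags, text));
        System.out.println(countMatches(multiLineOnly, text));
    }
}
```

Both single-line variants should report one match that spans the whole text, while the multi-line-only pattern should report four.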

There's something about this that I really don't like. I can accept the fact that the Java regular expression engine is more powerful and can handle things like negative look behinds. But, I have to say that I am quite unsettled by the fact that there is this fundamental difference in behavior between the two engines. Something about this feels very wrong! I really hope that with ColdFusion 9 or future releases that Adobe finally gets rid of this silly POSIX-compliant regular expression engine and embraces the awesome power of the Java Pattern Matcher.




Reader Comments

Nov 26, 2008 at 11:42 AM
26 Comments

Maybe it's just me, but the more I use regular expressions in web applications, the more I realize that it would just be easier to use Java rather than CF. The last three or four times I started out using REFind, I ended up using a Java Pattern Matcher.


Nov 26, 2008 at 12:06 PM
11,238 Comments

@Matt,

I agree 100%!! I think part of the reason I didn't even realize this deficiency existed is because I am so used to just going to the pattern matcher by default.


Nov 26, 2008 at 3:28 PM
1 Comments

Under the hood CF doesn't use java.util.regex, but the ORO library:
http://jakarta.apache.org/oro/

Back when CF 6.0 was released, it had to run on Java 1.3, and java.util.regex was added in 1.4.

Being a newer implementation, java.util.regex offers some additional features, like inline modifiers and lookbehind. In order to make it easy to use them, I wrote a CFC that acts as a wrapper to java.util.regex. You can download it here:
http://www.massimocorner.com/coldfusion/cfc/tmt_java_regexp.zip

Hope it could help


Nov 28, 2008 at 5:14 PM
11,238 Comments

@Massimo,

Thanks for the insight. At least that makes sense as to why they couldn't use it. Of course, that doesn't mean they can't integrate it soon (hint hint Adobe). When I get some time, I'll take a look at your CFC. Thanks.


Dec 22, 2008 at 8:22 AM
49 Comments

How would you propose they implement this without (those in favor) breaking existing websites with the change of function? It seems that perhaps the solution would be to create an alternate REGEX function, as the article above clearly presents a use case where the functionality would change if the underlying engine changed. What do you guys think? reFind2()? reMatch2()?


Dec 22, 2008 at 8:31 AM
11,238 Comments

@John,

Backwards-compatibility is an interesting issue. I personally don't worry too much about that. In my mind, if you can make something better, sometimes that requires making it different.

The key that we have to remember is that backwards-compatibility *only* becomes an issue when someone wants to upgrade their server software. I think people hear the phrase "backwards-compatibility" and it immediately causes stress cause there's some emotional belief that "uh-oh! All my code will break!" But, the truth of the matter is, one release of ColdFusion doesn't affect the next release of ColdFusion unless you are planning to upgrade.

So, let's get rid of that emotional, illogical feeling.

Then, let's look at what an upgrade would entail - well, I assume you'd upgrade your local development boxes first; then test and debug locally. Then upgrade production and sync code. Is that a huge deal?

Well how many people use regular expressions? How many people use the dot-match? How long is it going to take people to actually upgrade that so it works with the new regular expression engine?

Obviously I cannot answer that for other people. What I can tell you about myself, however is that I would absolutely do it! I have no irrational fear of upgrading my code to work better than it does currently. In fact, I have a desire to make my code work better.

So, for me at least, I say, bring it on!


Dec 22, 2008 at 9:32 AM
49 Comments

Yes, you upgrade your code. Yet many people have hired outside ColdFusion programmers. You and I need to be mindful we are not the only people using ColdFusion. :)


Dec 22, 2008 at 9:36 AM
11,238 Comments

@John,

True - but, the people who are hiring outside developers - are they really the ones who are going to haphazardly upgrade their ColdFusion server? I don't think so.


Dec 22, 2008 at 10:15 AM
49 Comments

I hear you. We don't apparently serve the same customer base. That might explain why we disagree. Customers I serve usually upgrade for features and don't consider an upgrade something that will break features. That is like upgrading MS Word and it cannot read some of your old files. Industry wide this type of an upgrade is considered irresponsible. That doesn't make it irresponsible but Adobe and ColdFusion would get raked over the coals thinking like that. :) IMO != Yours... doesn't make either of us right. Adobe will have to choose.


Dec 22, 2008 at 10:27 AM
11,238 Comments

@John,

If we *serve* our customer base, then they wouldn't have to worry about upgrades, cause we would take care of that for them. I thought you were talking about people who didn't even consult a developer before upgrading their server software.

I guess we will just have to agree to disagree :) No worries. At the end of the day, I can't really come up with a good scenario where an issue with backwards compatibility would cause that much concern.... so long as it was KNOWN that there were issues to be known about.


Dec 22, 2008 at 10:32 AM
26 Comments

@Ben

Even for those of us who do our own development, backwards-compatibility can be a PITA. Even upgrades that are supposed to work (ie, 6.1 to 7) have issues that you probably won't notice until an application starts throwing errors. I remember an issue with <cfxml> working slightly differently between version 6.1 and 7.01, fer instance.

There's enough that can go wrong as it is. Yes, proper testing is the answer, but if you have a large number of applications (I don't know what you consider a large number of applications, but I'm pretty sure we're over a hundred), then this is not a trivial task. I'd rather Adobe didn't deliberately cause functions to break if you upgrade.


Dec 22, 2008 at 10:38 AM
11,238 Comments

@Matt,

I agree - there is enough that can go wrong with the upgrades as-is. However, stuff does go wrong, as you have pointed out; and, does anyone regret that they have continued to upgrade their ColdFusion server? I know that when I went from MX6 to MX7, I had to go through dozens of applications to remove the now built-in method IsUserInRole(). Was that a pain in the butt? Definitely! Was it worth it? Absolutely!

So, yes, I think things should be as backwards compatible as necessary; but, if there can be a marked improvement in something that is not backwards compatible, I am not against dealing with it as part of the upgrade migration.


Dec 22, 2008 at 10:43 AM
26 Comments

@Ben

I'm not against dealing with an upgrade at all. I'd just rather Adobe didn't deliberately make these things harder.

Especially, as in this case, when there's a perfectly workable solution using Java. One that, IMO, isn't noticeably more complicated than using CF.


Dec 22, 2008 at 10:49 AM
11,238 Comments

@Matt,

True - when there is an existing workaround, it does seem a bit silly; we kind of went off track into more of a theoretical "upgrade" discussion.

Personally, I wouldn't mind if they simply made alternate RE methods. Much like what John is suggesting. My only tweak would be to do something like:

JREFind()

... where the "J" in the front is for the "Java" version.
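Just to sketch what such a function might wrap underneath: JREFind() does not exist, so the class name, the method name, and the signature below are all hypothetical. The only thing borrowed from ColdFusion is REFind()'s convention of returning the 1-based position of the first match, or 0 when nothing matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JavaRegexFunctions {

    // Hypothetical JREFind(): returns the 1-based position of the first
    // match found by java.util.regex, or 0 when there is no match.
    public static int jreFind(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        // Matcher.start() is 0-based, so shift it by one.
        return matcher.find() ? matcher.start() + 1 : 0;
    }

    public static void main(String[] args) {
        // The dot stays on its own line here, so the match starts at "Kit".
        System.out.println(jreFind("(?m)^Kit:.+$", "Betty: Cutie\nKit: Kinky"));
    }
}
```

Keeping the existing functions untouched and adding a separate "J" variant like this would sidestep the backwards-compatibility worry, since nothing that already calls REFind() would change behavior.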


Sep 8, 2009 at 12:13 AM
1 Comments

Hi ben
Can i store url value in cf variable if yes please give some code.


