Dot-Character Matches In ColdFusion And Java Regular Expressions
Posted November 26, 2008 at 10:22 AM by Ben Nadel
The other day on Twitter, Mark Mandel posted a tweet about regular expressions and multi-line mode. He was trying to match patterns on a per-line basis. After he posted this, I went and did a little experimentation with multi-line mode and I discovered a really interesting (read: frustrating) discrepancy between ColdFusion's regular expression engine and Java's regular expression engine which is absolutely crucial to the effectiveness of multi-line mode pattern matching. What I found was that the dot-character matches a different set of characters.
Traditionally, the dot (.) character matches all characters except the new line and carriage return characters. In essence, the dot pattern is the logical equivalent of the negated character class: [^\r\n]
The only exception to this is single-line mode. In single-line mode, the dot also matches these new line characters. According to www.regular-expressions.info, this is a mode that has to be explicitly turned on in all modern regular expression engines.
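To make the two modes concrete, here is a minimal Java sketch of the dot with and without the single-line flag. The class and method names (DotModes, fullMatch) are made up for illustration. One caveat: in java.util.regex the default dot also excludes a few other line terminators (such as \u0085, \u2028 and \u2029), so [^\r\n] is only an approximation of the default dot there.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the post.
class DotModes {

    // Returns true when the ENTIRE input is matched by the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Default mode: the dot matches an ordinary character...
        System.out.println(fullMatch(".", "a"));
        // ...but not a new line or a carriage return.
        System.out.println(fullMatch(".", "\n"));
        System.out.println(fullMatch(".", "\r"));
        // Single-line mode (?s): the dot now matches the new line too.
        System.out.println(fullMatch("(?s).", "\n"));
        // The negated class ignores the mode and never matches a new line.
        System.out.println(fullMatch("(?s)[^\\r\\n]", "\n"));
    }
}
```

The last call is the useful one to keep in mind: a negated character class behaves the same way regardless of whether single-line mode is on.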
Ok, this seems like a really powerful pattern, so what's the problem? The problem is that it seems as though ColdFusion's regular expression engine operates in this "single-line" mode by default, whereas the underlying Java regular expression engine does not. Let's examine this more closely. First, I am gonna create a string that has like-patterns on each line:
- <!--- Store text to parse using regular expressions. --->
- <cfsavecontent variable="strText">
- Betty: Cutie
- Kit: Kinky
- Sarah: Stubby
- Julie: Happy
- </cfsavecontent>
Now, I am gonna match each pattern per line using the multi-line flag: (?m)^\s+\w+:.+$
Here, the left half of the pattern matches the name: ^\s+\w+:
... and the right half of the pattern selects the adjective by matching everything until the end of the line: .+$
First we are gonna do this with ColdFusion, then we are going to use this with Java's pattern matcher:
- <!---
- Use ColdFusion regular expressions in multi-line mode
- to grab the matches.
- --->
- <cfset arrMatches = REMatch(
- "(?m)^\s+\w+:.+$",
- strText
- ) />
- <!--- Dump out the ColdFusion REMatch() captures. --->
- <cfdump var="#arrMatches#" label="REMatch() Multi-line Matches" />
When we run this code, we get the following output:
[Output: CFDump of the arrMatches array returned by REMatch()]
As you can see, the dot-character in ColdFusion matched EVERY character including the new line characters. And, since regular expressions are, by default, greedy, it kept matching until it hit the end of the entire string (not just the given line).
OK, now, let's take that same pattern and do the matching using Java's Pattern Matcher object:
- <!--- Create a pattern to match our per-line patterns. --->
- <cfset objPattern = CreateObject(
- "java",
- "java.util.regex.Pattern"
- ).Compile(
- JavaCast( "string", "(?m)^\s+\w+:.+$" )
- ) />
- <!--- Get a pattern matcher for our target text. --->
- <cfset objMatcher = objPattern.Matcher(
- JavaCast( "string", strText )
- ) />
- <!--- Create an array to store our matches. --->
- <cfset arrMatches = [] />
- <!--- Put all matches into our array. --->
- <cfloop condition="objMatcher.Find()">
- <!--- Add match to array. --->
- <cfset ArrayAppend( arrMatches, objMatcher.Group() ) />
- </cfloop>
- <!--- Dump out Java Pattern captures. --->
- <cfdump var="#arrMatches#" label="Pattern-Matcher Multi-line Matches" />
Granted, this code is much more involved (and faster, and more powerful), but when we run it, we get this output:
[Output: CFDump of the arrMatches array populated by the Java Pattern Matcher]
As you can see here, the dot-character did not match the new line characters and the matching successfully remained on any given line.
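For anyone who wants to try the per-line behavior outside of ColdFusion, the following is a rough plain-Java equivalent of the loop above. The class name, the helper name and the tab-indented sample text are assumptions made for the sketch; the original string's exact white space is not shown in the post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative, not from the post.
class PerLineMatches {

    // Assumed sample text: each line is indented with a tab so that
    // the leading \s+ in the pattern has something to match.
    static final String TEXT =
        "\n\tBetty: Cutie\n\tKit: Kinky\n\tSarah: Stubby\n\tJulie: Happy\n";

    // Collects every match of the regex, like the CFLoop over Find().
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Multi-line mode only: the dot stops at the end of each line.
        for (String match : findAll("(?m)^\\s+\\w+:.+$", TEXT)) {
            // trim() drops the leading white space captured by \s+.
            System.out.println("[" + match.trim() + "]");
        }
    }
}
```

With this sample text the loop prints four bracketed entries, one per line. Note that \s+ can swallow a preceding line break, which is why the sketch trims each match before printing.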
You can get the underlying Java regular expression engine to operate in single-line mode by explicitly flagging it to do so using (?s): (?ms)^\s+\w+:.+$
Doing this will result in the same output as that produced by ColdFusion's REMatch() method.
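Here is a small Java sketch that counts matches under each combination of flags. It also tries one possible workaround, which is my suggestion rather than something from the experiment above: swapping the dot for the negated class [^\r\n], which stays on one line even when single-line mode is on. The class name, helper name and tab-indented sample text are assumptions, and the sketch has only been reasoned through against java.util.regex, not against ColdFusion's REMatch().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative, not from the post.
class SingleLineMode {

    // Assumed sample text with tab-indented lines.
    static final String TEXT =
        "\n\tBetty: Cutie\n\tKit: Kinky\n\tSarah: Stubby\n\tJulie: Happy\n";

    // Counts how many times the regex matches within the text.
    static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Multi-line only: one match per line.
        System.out.println(countMatches("(?m)^\\s+\\w+:.+$", TEXT));
        // Multi-line plus single-line: the greedy dot runs to the end.
        System.out.println(countMatches("(?ms)^\\s+\\w+:.+$", TEXT));
        // Negated class instead of the dot: per-line again, even with (?s).
        System.out.println(countMatches("(?ms)^\\s+\\w+:[^\\r\\n]+$", TEXT));
    }
}
```

This prints 4, 1 and 4. If you prefer flags to inline modifiers, Pattern.MULTILINE and Pattern.DOTALL are the constants that correspond to (?m) and (?s).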
There's something about this that I really don't like. I can accept the fact that the Java regular expression engine is more powerful and can handle things like negative look behinds. But, I have to say that I am quite unsettled by the fact that there is this fundamental difference in behavior between the two engines. Something about this feels very wrong! I really hope that, with ColdFusion 9 or a future release, Adobe finally gets rid of this silly POSIX-compliant regular expression engine and embraces the awesome power of the Java Pattern Matcher.
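As a quick illustration of the negative look behind mentioned above, here is a hedged Java sketch. The pattern, the class name and the sample text are all made up for this example. It grabs only the names by matching words that are not immediately preceded by a colon and a space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and pattern are illustrative only.
class LookbehindDemo {

    // Match whole words that are NOT immediately preceded by ": ".
    static final Pattern NAME = Pattern.compile("(?<!: )\\b\\w+");

    // Returns every word in the text that passes the look behind.
    static List<String> names(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = NAME.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The adjectives follow ": ", so only the names are returned.
        System.out.println(names("Betty: Cutie\nKit: Kinky"));
    }
}
```

For the two-line sample, this returns Betty and Kit and skips both adjectives.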
Maybe it's just me, but the more I use regular expressions in web applications, the more I realize that it would just be easier to use Java rather than CF. The last three or four times I started out using REFind, I ended up using a Java Pattern Matcher.
I agree 100%!! I think part of the reason I didn't even realize this deficiency existed is because I am so used to just going to the pattern matcher by default.
Under the hood CF doesn't use java.util.regex, but the ORO library:
Back when CF 6.0 was released, it had to run on Java 1.3, and java.util.regex was added in 1.4.
Being a newer implementation, java.util.regex offers some additional features, like inline modifiers and lookbehind. In order to make it easy to use them, I wrote a CFC that acts as a wrapper to java.util.regex. You can download it here:
Hope it could help
Thanks for the insight. At least that makes sense as to why they couldn't use it. Of course, that doesn't mean they can't integrate it soon (hint hint Adobe). When I get some time, I'll take a look at your CFC. Thanks.
How would you (those in favor) propose they implement this without breaking existing websites with the change of function? It seems that perhaps the solution would be to create an alternate REGEX function, as the article above clearly presents a use case where the functionality would change if the underlying engine changed. What do you guys think? reFind2()? reMatch2()?
Backwards-compatibility is an interesting issue. I personally don't worry too much about that. In my mind, if you can make something better, sometimes that requires making it different.
The key that we have to remember is that backwards-compatibility *only* becomes an issue when someone wants to upgrade their server software. I think people hear the phrase "backwards-compatibility" and it immediately causes stress cause there's some emotional belief that "uh-oh! All my code will break!" But, the truth of the matter is, one release of ColdFusion doesn't affect the next release of ColdFusion unless you are planning to upgrade.
So, let's get rid of that emotional, illogical feeling.
Then, let's look at what an upgrade would entail - well, I assume you'd upgrade your local development boxes first; then test and debug locally. Then upgrade production and sync code. Is that a huge deal?
Well how many people use regular expressions? How many people use the dot-match? How long is it going to take people to actually upgrade that so it works with the new regular expression engine?
Obviously I cannot answer that for other people. What I can tell you about myself, however, is that I would absolutely do it! I have no irrational fear of upgrading my code to work better than it does currently. In fact, I have a desire to make my code work better.
So, for me at least, I say, bring it on!
Yes, you upgrade your code. Yet many people have hired outside ColdFusion programmers. You and I need to be mindful we are not the only people using ColdFusion. :)
True - but, the people who are hiring outside developers - are they really the ones who are going to haphazardly upgrade their ColdFusion server? I don't think so.
I hear you. We don't apparently serve the same customer base. That might explain why we disagree. Customers I serve usually upgrade for features and don't consider an upgrade something that will break features. That is like upgrading MS Word and it cannot read some of your old files. Industry wide this type of an upgrade is considered irresponsible. That doesn't make it irresponsible but Adobe and ColdFusion would get raked over the coals thinking like that. :) IMO != Yours... doesn't make either of us right. Adobe will have to choose.
If we *serve* our customer base, then they wouldn't have to worry about upgrades, cause we would take care of that for them. I thought you were talking about people who didn't even consult a developer before upgrading their server software.
I guess we will just have to agree to disagree :) No worries. At the end of the day, I can't really come up with a good scenario where an issue with backwards compatibility would cause that much concern.... so long as it was KNOWN that there were issues to be known about.
Even for those of us who do our own development, backwards-compatibility can be a PITA. Even upgrades that are supposed to work (ie, 6.1 to 7) have issues that you probably won't notice until an application starts throwing errors. I remember an issue with <cfxml> working slightly differently between version 6.1 and 7.01, fer instance.
There's enough that can go wrong as it is. Yes, proper testing is the answer, but if you have a large number of applications (I don't know what you consider a large number of applications, but I'm pretty sure we're over a hundred), then this is not a trivial task. I'd rather Adobe didn't deliberately cause functions to break if you upgrade.
I agree - there is enough that can go wrong with the upgrades as-is. However, stuff does go wrong, as you have pointed out; and, does anyone regret that they have continued to upgrade their ColdFusion server? I know that when I went from MX6 to MX7, I had to go through dozens of applications to remove the now built-in method IsUserInRole(). Was that a pain in the butt? Definitely! Was it worth it? Absolutely!
So, yes, I think things should be as backwards compatible as necessary; but, if there can be a marked improvement in something that is not backwards compatible - I am not against dealing with it as part of the upgrade migration.
I'm not against dealing with an upgrade at all. I'd just rather Adobe didn't deliberately make these things harder.
Especially, as in this case, when there's a perfectly workable solution using Java. One that, IMO, isn't noticeably more complicated than using CF.
True - when there is an existing workaround, it does seem a bit silly; we kind of went off track into more of a theoretical "upgrade" discussion.
Personally, I wouldn't mind if they simply made alternate RE methods. Much like what John is suggesting. My only tweak would be to do something like:
... where the "J" in the front is for the "Java" version.
Can i store url value in cf variable if yes please give some code.