Regular Expression Quote Makes Finding Literals Easy In ColdFusion Search
I love regular expressions, and I am always finding new ways to leverage their insane black magic voodoo powers within ColdFusion. I just came across the regular expression "quote" construct, \Q. \Q lets us create a pattern whose characters are evaluated as character literals, not as possible regular expression constructs. It performs a literal match of everything to its right until it hits the end of the expression or the end construct, \E. Therefore, this pattern:
\Qben\E
... will search for the literal string "ben".
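As a minimal sketch in plain Java (the engine behind the Pattern and Matcher objects used later in this post), here is the difference between an unquoted and a quoted pattern. The class name QuoteDemo, the helper found(), and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Returns true if the regex matches anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Unquoted: "." is a wildcard, so "a.c" also matches "abc".
        System.out.println(found("a.c", "abc"));          // true
        // Quoted: \Q...\E turns "." into a literal period.
        System.out.println(found("\\Qa.c\\E", "abc"));    // false
        System.out.println(found("\\Qa.c\\E", "xa.cx"));  // true
    }
}
```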
While this might not seem very important, it can make searching text much easier. One of the things that ColdFusion sorely lacks is a way to easily iterate over matches in a string (literal or regular expression). Luckily, Java provides us with the Pattern and Matcher objects to make iterating over pattern matches easy. But what about literal strings? That's where this \Q-\E construct comes into play.
Using the regular expression quote constructs, we can leverage the power of the Java pattern matcher to iterate over literal string matches in a chunk of target text. I have thought about using the Java pattern matcher before, but because there are so many special characters in regular expressions, I have always been very hesitant to do this - most of the time, I cannot be sure that the user-entered search phrases don't have special regular expression characters (like "." or "\"). Now, this is no longer a concern; take a look at this demo:
<!---
Store some text that we want to search. We are going
to make sure that this text has characters that would
be considered special characters within a regular
expression.
--->
<cfsavecontent variable="strText">
Hey Maria, you better stop. I don't think it's a good
idea for you to change while I'm still in the room?!?!?
I mean, sure you're looking hella fine [sic]! But,
what would your parents think?!?!?
</cfsavecontent>
<!---
We are going to store the search phrases in variables.
This is just to demonstrate that the search phrases
could come from anywhere, including a search form
with user-entered criteria.
In our case, we are going to use one phrase that has
the ? which is a zero-or-one quantifier and the []
which creates a character set.
--->
<cfset strPhrase1 = "?!?!?" />
<cfset strPhrase2 = "[sic]" />
<!---
Now, let's create a Java pattern to find our search
phrase. Notice that we are putting the above search
phrase into our patterns and using the \Q ... \E
escape pattern. Using \Q and \E will match literal
values in between even if they contain special
regular expression characters.
--->
<cfset objPattern = CreateObject(
"java",
"java.util.regex.Pattern"
).Compile(
"(?i)(\Q#strPhrase1#\E|\Q#strPhrase2#\E)"
)
/>
<!---
Create a matcher for our pattern that will be able
to search the target string for our literal pattern.
--->
<cfset objMatcher = objPattern.Matcher( strText ) />
<!---
Keep looping over the matcher until we have run out
of matching patterns.
--->
<cfloop condition="objMatcher.Find()">
<cfoutput>
<p>
Found: #objMatcher.Group()#<br />
Found At: #objMatcher.Start()#
</p>
</cfoutput>
</cfloop>
If you look at the two search phrases we are searching on:
?!?!?
[sic]
... you will see that both of these phrases contain special regex characters: the ?, the [, and the ]. Normally, if we took these strings and just dynamically included them in a regular expression search, we would get very unexpected results. However, since we wrapped both of these phrases in \Q and \E within our pattern, running the above code gives us the following output:
Found: ?!?!?
Found At: 109
Found: [sic]
Found At: 156
Found: ?!?!?
Found At: 199
Notice that our phrases were matched as literals, not as "patterns" (they're still patterns, but you know what I mean).
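For comparison, here is a rough sketch of the same find loop in plain Java. The class name LiteralSearch, the helper findLiterals(), and the sample text are made up for illustration, and the sketch assumes that neither phrase contains \E.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralSearch {
    // Finds every case-insensitive, literal occurrence of either phrase and
    // reports it as "match@offset". Assumes neither phrase contains "\E".
    static List<String> findLiterals(String text, String phrase1, String phrase2) {
        Pattern pattern = Pattern.compile(
            "(?i)(\\Q" + phrase1 + "\\E|\\Q" + phrase2 + "\\E)");
        Matcher matcher = pattern.matcher(text);
        List<String> hits = new ArrayList<>();
        // find() advances to the next match each time it is called.
        while (matcher.find()) {
            hits.add(matcher.group() + "@" + matcher.start());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "What?!?!? Looking fine [sic]! Really?!?!?";
        System.out.println(findLiterals(text, "?!?!?", "[sic]"));
        // [?!?!?@4, [sic]@23, ?!?!?@36]
    }
}
```

The offsets are zero-based, just like the values returned by Start() in the ColdFusion version.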
Now, this doesn't put us 100% in the clear; we don't have to worry about 99% of the regular expression characters being in our string, since they are being matched as literals, but we still need to be careful of one: \E. The regular expression will match the "quote" starting at the \Q and ending with the \E. If someone entered a search phrase that has \E in it, then our regular expression will be malformed, having two \E instances and only one \Q instance. I have tried to find a way to escape the \E, but nothing I did seemed to work. Therefore, the one step we might have to take is to make sure the user doesn't enter \E in their search criteria. This is a little irritating, but heck, it's 100% better than having to worry about the entire set of special regular expression characters.
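One possible way around the \E problem, assuming the JVM underneath is Java 1.5 or newer, is the static Pattern.quote() method, which builds the quoted pattern for you and copes with a \E inside the input. The sketch below is plain Java; the class name SafeQuote, the helper containsLiteral(), and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

class SafeQuote {
    // Returns true if the phrase occurs literally in the text, even when
    // the phrase itself contains "\E".
    static boolean containsLiteral(String text, String phrase) {
        return Pattern.compile(Pattern.quote(phrase)).matcher(text).find();
    }

    public static void main(String[] args) {
        String phrase = "a\\Eb.c"; // the six characters: a \ E b . c
        // Prints the quoted form that Pattern.quote() generates.
        System.out.println(Pattern.quote(phrase));
        System.out.println(containsLiteral("xx a\\Eb.c xx", phrase)); // true
        System.out.println(containsLiteral("xx a\\EbXc xx", phrase)); // false
    }
}
```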
Reader Comments
Just to be clear, it's java.util.regex that supports \Q...\E literal text spans, not ColdFusion. If you try to use them in CF functions like ReFind, it won't work as you described (at least in MX 7 and lower). In any case, I've never found the need for \Q...\E literal text spans, because for one, it's easy to escape regular expression special characters manually, and secondly, I believe Java 1.4 and 1.5 have some bugs involving literal text spans which start within character classes. PCRE lower than v7.0 has a bug where it incorrectly handles \Q...\E as the start or end of a range in a character class, e.g.: [a-\Qz\E]. Finally, as you mentioned, if you use it to escape user input, you still have to worry about the \E metasequence.
For all these reasons, I would recommend avoiding the feature in most cases, and instead escaping special regex characters manually, when necessary. Here's a CF example:
<cfset escapedRe = ReReplace(input, "[.*+?^${}()|[\]/\\]", "\\\0", "ALL") />
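For anyone feeding the escaped string to the Java pattern matcher instead, a roughly equivalent sketch in plain Java might look like this. The class name ManualEscape and the helper escape() are made up for illustration. Note that in Java the [ inside the character class has to be escaped too, and the whole match is written as $0 in the replacement string.

```java
import java.util.regex.Pattern;

class ManualEscape {
    // Prefixes each listed regex metacharacter with a backslash.
    // "$0" in the replacement stands for the whole match.
    static String escape(String input) {
        return input.replaceAll("[.*+?^${}()|\\[\\]/\\\\]", "\\\\$0");
    }

    public static void main(String[] args) {
        String escaped = escape("[sic]?");
        System.out.println(escaped); // \[sic\]\?
        // The escaped string now matches only the literal text.
        System.out.println(
            Pattern.compile(escaped).matcher("fine [sic]?").find()); // true
    }
}
```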
@Steve,
Thanks for pointing that out (re: it working in CF directly). I only tried it in the Java regex because that is where I see it being the most useful. But you do raise a good point - if I am going to have to escape a single character, why not just escape all the special characters. Damn Steve! why you gotta kill my buzz :)
Also, I didn't know you could refer to \0 as the matched pattern. Very cool!
reReplace utilizes the POSIX regular expression standard.
It is quite straightforward to match literal strings using POSIX expressions and reReplace rather than jumping through all the Java hoops.
See:
http://www.dc.turkuamk.fi/docs/gnu/rx/rx_3.html
"reReplace utilizes the POSIX regular expression standard."
That is simply false.
Just because ColdFusion supports POSIX-style character classes does not mean that it uses a POSIX-compliant regular expression engine. POSIX-standard regular expression mechanics and syntax are very different from Traditional NFA regex engines like ColdFusion's. Look at the page you linked to.... with statements like "In every case [of alternation], the longer match is preferred." That is patently false with ColdFusion regexes.
Good point even though I've never had any problems applying any regular expression using POSIX syntax using reFind().
Besides, often it is more time (aka cost) effective to spend 10 minutes writing two CFIF statements than all afternoon getting a particular regular expression to work.
This also makes the code more legible for posterity.
"Damn Steve! why you gotta kill my buzz :)"
Sometimes I might come off as a bit abrasive (e.g., with my response to Peter just above), so I apologize in advance for this, and maybe I'll try to tone it down just a little. :)
"I didn't know you could refer to \0 as the matched pattern. Very cool!"
The entire match is backreference zero, so it makes a lot of sense when you think of it that way. It's also the only example I can think of offhand where ColdFusion uses a zero-based index (though a bit confusingly, backreference zero is returned as array index 1 by ReFind when using the returnSubExpressions argument ... which is of course understandable given that CF arrays don't have a [0] index).
In some languages (e.g., JavaScript), you can refer to the entire match in the replacement string by using "$&" (this notation comes from Perl). When the entire match is needed in the replacement string, FAR too many people use the wasteful approach of enclosing the entire match in capturing parentheses and then referring to $1 or \1.
@Steve,
Don't worry - you didn't come off as abrasive at all... I was just pouting because you made me realize my find was not as cool as I thought :)
As far as the \0 reference goes, I totally get it. I know that in Java, you can do Group( 0 ) to get the entire match. I just never thought of referring to it with a back reference. Again a very cool tip!
"Good point even though I've never had any problems applying any regular expression using POSIX syntax using reFind()." --Peter
That probably results from using only very simple regular expressions, or not understanding what the POSIX standard means. There are fundamental differences between NFA, DFA, POSIX, and Tcl-style hybrid regex engines. For example, the POSIX standard requires that if you have multiple possible matches that start at the same position, the one matching the most text *must* be the one returned. This is fundamentally (and very significantly) different from a traditional NFA, which is the type of regex engine most people are familiar with, and which is used by CF, Java, .NET, Perl, PCRE, JavaScript, etc.
...By the way, if by "POSIX syntax" you simply mean POSIX-style pre-defined character classes like [[:digit:]] or [^[:upper:]], that is a different matter. I am simply disagreeing with your claim that CF utilizes "the POSIX regular expression standard."
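A quick way to see the difference, sketched here in plain Java (one of the traditional NFA engines listed above), is an alternation where both branches can match at the same position. The class name AlternationOrder and the helper firstMatch() are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrder {
    // Returns the first match of the regex in the text, or null if none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Both alternatives can match at offset 0; the leftmost one wins.
        System.out.println(firstMatch("a|ab", "ab")); // a
        System.out.println(firstMatch("ab|a", "ab")); // ab
    }
}
```

An engine following the POSIX longest-match rule described above would return "ab" for both patterns.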
That's it Ben, I'm getting you a Maria Bello calendar for Christmas :)
Sweeeeet :D
I get confused when using regular expressions that require quotes in the pattern. For example, searching for the html tags:
So instead I use the following code to avoid the headache.
If I need to search for a pound character I use #chr(35)#
I just wish that the POSIX-style pre-defined character classes included [:quote:]. Unfortunately, they do not. Maybe Adobe should add it to ColdFusion?
@Dangle,
Yeah, having to escape the quotes in a ColdFusion string can definitely make things harder to read. I've come to love the Verbose regular expression which allows us to build regular expression patterns in a content buffer (CFSaveContent) such that we don't have to worry about escaping strings:
www.bennadel.com/blog/333-Verbose-Regular-Expressions-In-ColdFusion-And-Java.htm
Also, just as a note, if you use reMatchNoCase(), you don't have to worry about the possible character variations [aA][hH], etc.
Steven, your code:
did not quite work for me. The second "[" also needed to be escaped. Also, I wanted it for multiple languages, and easier to read, so for a safer more universal Regular Expression escape code, I changed it to: