Regular Expression Quote Makes Finding Literals Easy In ColdFusion Search

Posted June 21, 2007 at 8:41 AM by Ben Nadel

Tags: ColdFusion

I love regular expressions and all the time I am finding new ways to leverage their insane black magic voodoo powers within ColdFusion. I just came across the regular expression "quote" construct, \Q. \Q allows us to create a pattern of characters that are to be evaluated as character literals, not as possible regular expression constructs. \Q will perform a literal match of everything to the right of it until it hits the end of the expression or the end construct, \E. Therefore, this pattern:

\Qben\E

... will search for the literal string "ben".
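As a minimal sketch of what the quote construct does in java.util.regex, the following compares the same phrase compiled as an ordinary pattern and as a quoted literal. The class and method names are hypothetical (they are not from this post), and the quoted version assumes the phrase does not itself contain \E:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
final class QuoteDemo {

    // Returns true if the text contains the phrase as a literal string.
    // The phrase is wrapped in \Q ... \E; this assumes it contains no \E.
    static boolean containsLiteral(String text, String phrase) {
        return Pattern.compile("\\Q" + phrase + "\\E").matcher(text).find();
    }

    // Returns true if the text contains a match for the phrase when the
    // phrase is treated as an ordinary regular expression.
    static boolean containsPattern(String text, String phrase) {
        return Pattern.compile(phrase).matcher(text).find();
    }

    public static void main(String[] args) {
        // Unquoted, "." matches any character, so "b.n" finds "ben".
        System.out.println(containsPattern("ben", "b.n")); // true
        // Quoted, "." is a literal period, so "b.n" does not find "ben".
        System.out.println(containsLiteral("ben", "b.n")); // false
        // The quoted phrase still finds itself.
        System.out.println(containsLiteral("b.n", "b.n")); // true
    }
}
```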

While this might not seem very important, it can make searching text much easier. One of the things that ColdFusion sorely lacks is a way to easily iterate over matches in a string (literal or regular expression). Luckily, Java provides us with the Pattern and Matcher objects to make iterating over pattern matches easy. But what about literal strings? That's where this \Q-\E construct comes into play.

Using the regular expression quote constructs, we can leverage the power of the Java pattern matcher to iterate over literal string matches in a chunk of target text. I have thought about using the Java pattern matcher before, but because there are so many special characters in regular expressions, I have always been very hesitant to do this - most of the time, I cannot be sure that the user-entered search phrases don't have special regular expression characters (like "." or "\"). Now, this is no longer a concern; take a look at this demo:

  • <!---
  • Store some text that we want to search. We are going
  • to make sure that this text has characters that would
  • be considered special characters within a regular
  • expression.
  • --->
  • <cfsavecontent variable="strText">
  • Hey Maria, you better stop. I don't think it's a good
  • idea for you to change while I'm still in the room?!?!?
  • I mean, sure you're looking hella fine [sic]! But,
  • what would your parents think?!?!?
  • </cfsavecontent>
  •  
  •  
  • <!---
  • We are going to store the search phrases in variables.
  • This is just to demonstrate that the search phrase
  • could come from anywhere, including a search form
  • with user-entered criteria.
  •  
  • In our case, we are going to use one phrase that has
  • the ? which is a zero-or-one quantifier and one that
  • has the [] which creates a character class.
  • --->
  • <cfset strPhrase1 = "?!?!?" />
  • <cfset strPhrase2 = "[sic]" />
  •  
  •  
  • <!---
  • Now, let's create a Java pattern to find our search
  • phrase. Notice that we are putting the above search
  • phrase into our patterns and using the \Q ... \E
  • escape pattern. Using \Q and \E will match literal
  • values in between even if they contain special
  • regular expression characters.
  • --->
  • <cfset objPattern = CreateObject(
  • "java",
  • "java.util.regex.Pattern"
  • ).Compile(
  • "(?i)(\Q#strPhrase1#\E|\Q#strPhrase2#\E)"
  • )
  • />
  •  
  • <!---
  • Create a matcher for our pattern that will be able
  • to search the target string for our literal pattern.
  • --->
  • <cfset objMatcher = objPattern.Matcher( strText ) />
  •  
  •  
  • <!---
  • Keep looping over the matcher until we have run out
  • of matching patterns.
  • --->
  • <cfoutput>
  • <cfloop condition="objMatcher.Find()">
  •  
  • <p>
  • Found: #objMatcher.Group()#<br />
  • Found At: #objMatcher.Start()#
  • </p>
  •  
  • </cfloop>
  • </cfoutput>

If you look at the two search phrases we are searching on:

?!?!?
[sic]

... you will see that both of these phrases contain special regex characters: ?, [, and ]. Normally, if we took these strings and just dynamically included them in a regular expression search, we would get very unexpected results. However, since we wrapped both of these phrases in \Q and \E within our pattern, running the above code gives us the following output:

Found: ?!?!?
Found At: 109

Found: [sic]
Found At: 156

Found: ?!?!?
Found At: 199

Notice that our phrases were matched as literals, not as "patterns" (they're still patterns, but you know what I mean).

Now, this doesn't put us 100% in the clear. We no longer have to worry about 99% of the special regular expression characters in our string, since they are being matched as literals, but we still need to be careful of one sequence: \E. The regular expression will match the "quote" starting at the \Q and ending with the \E. If someone entered a search phrase that has \E in it, then our regular expression will be malformed, having two \E instances and only one \Q instance. I have tried to find a way to escape the \E, but nothing I did seemed to work. Therefore, the one step we might have to take is to make sure the user doesn't enter \E in their search criteria. This is a little irritating, but heck, it's 100% better than having to worry about the entire set of special regular expression characters.
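One possible way around the \E problem, assuming Java 1.5 or later, is the standard Pattern.quote() method, which builds the \Q ... \E wrapper for a string and also handles a phrase that itself contains \E. The sketch below is only an illustration under that assumption; the class and method names are hypothetical and not part of this post's demo:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names here are illustrative only.
final class LiteralSearch {

    // Returns the start offset of every case-insensitive occurrence of
    // the literal phrase in the text. Pattern.quote() does the quoting,
    // so the phrase may contain any characters, including \E.
    static List<Integer> findAll(String text, String phrase) {
        Pattern pattern = Pattern.compile(
            Pattern.quote(phrase),
            Pattern.CASE_INSENSITIVE
        );
        Matcher matcher = pattern.matcher(text);
        List<Integer> offsets = new ArrayList<>();
        while (matcher.find()) {
            offsets.add(matcher.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        // A phrase made of special regex characters.
        System.out.println(findAll("what?!?!? ok", "?!?!?")); // [4]
        // A phrase that contains \E itself (written \\E in Java source).
        System.out.println(findAll("a\\Eb and A\\EB", "a\\Eb")); // [0, 9]
    }
}
```

From ColdFusion, the same static method should be reachable through CreateObject( "java", "java.util.regex.Pattern" ), in the same way that Compile() is called in the demo above.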




Reader Comments

Jun 21, 2007 at 10:00 AM
168 Comments

Just to be clear, it's java.util.regex that supports \Q...\E literal text spans, not ColdFusion. If you try to use them in CF functions like ReFind, it won't work as you described (at least in MX 7 and lower). In any case, I've never found the need for \Q...\E literal text spans, because for one, it's easy to escape regular expression special characters manually, and secondly, I believe Java 1.4 and 1.5 have some bugs involving literal text spans which start within character classes. PCRE lower than v7.0 has a bug where it incorrectly handles \Q...\E as the start or end of a range in a character class, e.g.: [a-\Qz\E]. Finally, as you mentioned, if you use it to escape user input, you still have to worry about the \E metasequence.

For all these reasons, I would recommend avoiding the feature in most cases, and instead escaping special regex characters manually, when necessary. Here's a CF example:

<cfset escapedRe = ReReplace(input, "[.*+?^${}()|[\]/\\]", "\\\0", "ALL") />


Jun 21, 2007 at 10:09 AM
10,640 Comments

@Steve,

Thanks for pointing that out (re: it working in CF directly). I only tried it in the Java regex because that is where I see it being the most useful. But you do raise a good point - if I am going to have to escape a single character, why not just escape all the special characters. Damn Steve! why you gotta kill my buzz :)


Jun 21, 2007 at 10:10 AM
10,640 Comments

Also, I didn't know you could refer to \0 as the matched pattern. Very cool!


Jun 21, 2007 at 10:41 AM
4 Comments

reReplace utilizes the POSIX regular expression standard.

It is quite straightforward to match literal strings using POSIX expressions and reReplace rather than jumping through all the Java hoops.

See:
http://www.dc.turkuamk.fi/docs/gnu/rx/rx_3.html


Jun 21, 2007 at 10:54 AM
168 Comments

"reReplace utilizes the POSIX regular expression standard."

That is simply false.

Just because ColdFusion supports POSIX-style character classes does not mean that it uses a POSIX-compliant regular expression engine. POSIX-standard regular expression mechanics and syntax are very different from Traditional NFA regex engines like ColdFusion's. Look at the page you linked to.... with statements like "In every case [of alternation], the longer match is preferred." That is patently false with ColdFusion regexes.


Jun 21, 2007 at 11:31 AM
4 Comments

Good point even though I've never had any problems applying any regular expression using POSIX syntax using reFind().

Besides, it is often more time (aka cost) effective to spend 10 minutes writing two CFIF statements than all afternoon getting a particular regular expression to work.

This also makes the code more legible for posterity.


Jun 21, 2007 at 11:31 AM
168 Comments

"Damn Steve! why you gotta kill my buzz :)"

Sometimes I might come off as a bit abrasive (e.g., with my response to Peter just above), so I apologize in advance for this, and maybe I'll try to tone it down just a little. :)

"I didn't know you could refer to \0 as the matched pattern. Very cool!"

The entire match is backreference zero, so it makes a lot of sense when you think of it that way. It's also the only example I can think of offhand where ColdFusion uses a zero-based index (though a bit confusingly, backreference zero is returned as array index 1 by ReFind when using the returnSubExpressions argument ... which is of course understandable given that CF arrays don't have a [0] index).

In some languages (e.g., JavaScript), you can refer to the entire match in the replacement string by using "$&" (this notation comes from Perl). When the entire match is needed in the replacement string, FAR too many people use the wasteful approach of enclosing the entire match in capturing parentheses and then referring to $1 or \1.


Jun 21, 2007 at 11:43 AM
10,640 Comments

@Steve,

Don't worry - you didn't come off as abrasive at all... I was just pouting because you made me realize my find was not as cool as I thought :)

As far as the \0 reference goes, I totally get it. I know that in Java, you can do Group( 0 ) to get the entire match. I just never thought of referring to it with a back reference. Again a very cool tip!


Jun 21, 2007 at 12:46 PM
168 Comments

"Good point even though I've never had any problems applying any regular expression using POSIX syntax using reFind()." --Peter

That probably results from using only very simple regular expressions, or not understanding what the POSIX standard means. There are fundamental differences between NFA, DFA, POSIX, and Tcl-style hybrid regex engines. For example, the POSIX standard requires that if you have multiple possible matches that start at the same position, the one matching the most text *must* be the one returned. This is fundamentally (and very impactively) different from a traditional NFA, which is the type of regex engine most people are familiar with, and which is used by CF, Java, .NET, Perl, PCRE, JavaScript, etc.


Jun 21, 2007 at 1:19 PM
168 Comments

...By the way, if by "POSIX syntax" you simply mean POSIX-style pre-defined character classes like [[:digit:]] or [^[:upper:]], that is a different matter. I am simply disagreeing with your claim that CF utilizes "the POSIX regular expression standard."


Jun 21, 2007 at 7:42 PM
16 Comments

That's it Ben, I'm getting you a Maria Bello calendar for christmas :)


Jun 22, 2007 at 7:10 AM
10,640 Comments

Sweeeeet :D


Oct 24, 2010 at 12:07 AM
9 Comments

I get confused when using regular expressions that require quotes in the pattern. For example, searching for the html tags:

  • a href="../file.htm" title="mytitle"
  • img src="../file.jpg"

So instead I use the following code to avoid the headache.

  • <cfset regExStr="<\s*[aA]\s+[hH][rR][eE][fF]\s*[=]\s*[#chr(34)##chr(39)#]([^#chr(34)##chr(39)#])+[#chr(34)##chr(39)#]"
  • >
  •  
  • <cfset LOCAL.matchingTagArr = REMatch(regExStr, htmlToSearchStr)
  • >

If I need to search for a pound character I use #chr(35)#

I just wish that the POSIX-style pre-defined character classes included [:quote:]. Unfortunately, they do not. Maybe Adobe should add it to ColdFusion?


Oct 24, 2010 at 12:25 PM
10,640 Comments

@Dangle,

Yeah, having to escape the quotes in a ColdFusion string can definitely make things harder to read. I've come to love the Verbose regular expression which allows us to build regular expression patterns in a content buffer (CFSaveContent) such that we don't have to worry about escaping strings:

http://www.bennadel.com/blog/333-Verbose-Regular-Expressions-In-ColdFusion-And-Java.htm

Also, just as a note, if you use reMatchNoCase(), you don't have to worry about the possible character variations [aA][hH], etc.


