When To Use \N And $N As Regular Expression Back-References
In a regular expression, almost anything wrapped in parentheses is known as a captured group. There are some exceptions to this, and you can use a syntax that performs non-capturing grouping; but, for the most part, groups are captured and numbered from left to right by their opening parentheses. So, for example, in the following regular expression pattern:
... You would get the following captured groups:
- Group 1: (he(ll)o)
- Group 2: (ll)
- Group 3: (world)
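To make the numbering concrete, here is a minimal Java sketch. The pattern `(he(ll)o) (world)` and the input `hello world` are assumptions, chosen only because they produce groups like the ones listed above; the class name `GroupNumbering` and the helper `groups()` are hypothetical names for illustration. It also shows the non-capturing `(?:...)` syntax, which groups without claiming a number.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumbering {

    // Returns every group of the first match; index 0 is the whole match.
    static String[] groups(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (!matcher.find()) {
            return new String[0];
        }
        String[] result = new String[matcher.groupCount() + 1];
        for (int i = 0; i <= matcher.groupCount(); i++) {
            result[i] = matcher.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed pattern: groups are numbered by the position of their
        // opening parenthesis, left to right.
        String[] captured = groups("(he(ll)o) (world)", "hello world");
        for (int i = 1; i < captured.length; i++) {
            System.out.println("Group " + i + ": " + captured[i]);
        }

        // (?:...) groups without capturing, so (world) becomes group 1.
        String[] nonCaptured = groups("(?:he(?:ll)o) (world)", "hello world");
        System.out.println("Captured groups with (?:...): " + (nonCaptured.length - 1));
    }
}
```

With those assumed inputs, the first loop prints "hello", "ll", and "world" as groups 1 through 3, and the non-capturing variant reports a single captured group.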
These groups can be referenced using back-references - \N and $N - both when matching and replacing a given pattern. In either case, the "N" is the numerical digit, 1-9, that represents the index of the captured group. Which notation you use - \N or $N - depends both on the technology and the execution phase (matching vs. replacing) and is what I will be exploring below.
ColdFusion Regular Expressions
In ColdFusion, we can use the reFind() and reReplace() functions to find and replace regular expression matches, respectively. In the following script, I am going to use both functions to test the various back-reference approaches in the two phases of pattern execution:
<!--- Check to see if \N works within pattern. --->
<cfif reFind( "(ha) \1", "ha ha" )>
    Find using \N
<cfelse>
    No find using \N
</cfif>

<br />
<br />

<!--- Check to see if $N works within pattern. --->
<cfif reFind( "(ha) $1", "ha ha" )>
    Find using $N
<cfelse>
    No find using $N
</cfif>

<br />
<br />

<!--- Check to see if \N or $N works in replace. --->
<cfoutput>
    #reReplace( "ha ha", "(ha) (ha)", "\1-$2" )#
</cfoutput>
As you can see here, we are using the string, "ha ha" in all cases. This is a nice string because it is composed of a repeated pattern, "ha." When we run the above code, we get the following output:
Find using \N
No find using $N
To break down what is happening, here's the type of notation that you can use in the two phases of ColdFusion regular expression pattern execution:
Java Regular Expressions
ColdFusion is built on top of Java, but Java uses a different regular expression engine. Therefore, the pattern rules that apply to reFind() and reReplace() (POSIX) are not necessarily the same as the pattern rules that apply to instances of the Java class, java.util.regex.Pattern. In the following test, I am going to use the "undocumented" fact that ColdFusion strings are really Java strings and therefore provide access to the Java String's regular-expression-based methods:
<!--- Set string value. --->
<cfset value = "ha ha" />

<!--- Check to see if \N works within pattern. --->
<cfif value.matches( "(ha) \1" )>
    Find using \N
<cfelse>
    No find using \N
</cfif>

<br />
<br />

<!--- Check to see if $N works within pattern. --->
<cfif value.matches( "(ha) $1" )>
    Find using $N
<cfelse>
    No find using $N
</cfif>

<br />
<br />

<!--- Check to see if \N or $N works in replace. --->
<cfoutput>
    #value.replaceFirst( "(ha) (ha)", "\1-$2" )#
</cfoutput>
Again, we are using the string, "ha ha." But, this time, we are accessing the matches() and replaceFirst() methods directly on the value, "ha ha." When we run the above code, we get the following output:
Find using \N
No find using $N
To break down what is happening, here's the type of notation that you can use in the two phases of Java regular expression pattern execution:
NOTE: The reason we get the "1" in the replace string is that, in a Java replacement string, the syntax \X (where X is a non-special character) simply denotes the literal character X. So, "\1" produces a plain "1" rather than the contents of the first captured group. You'll also note that since we are executing Java through a ColdFusion context, we don't need to escape backslashes in strings.
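For comparison, here is roughly the same test written as plain Java rather than through ColdFusion. This is only a sketch: the class name `JavaBackReferences` and its method names are hypothetical. The one real difference from the ColdFusion version is that Java source code needs each backslash doubled inside a string literal, so the pattern `(ha) \1` is written as `"(ha) \\1"`.

```java
class JavaBackReferences {

    // \N inside the pattern refers back to captured group N.
    static boolean matchesWithBackslash(String value) {
        return value.matches("(ha) \\1");
    }

    // $N inside the pattern is not a back-reference: "$" is the
    // end-of-input anchor, so a "1" can never follow it.
    static boolean matchesWithDollar(String value) {
        return value.matches("(ha) $1");
    }

    // In the replacement string, $N inserts group N, while \1 is
    // just an escaped literal "1".
    static String replaceMixed(String value) {
        return value.replaceFirst("(ha) (ha)", "\\1-$2");
    }

    public static void main(String[] args) {
        String value = "ha ha";
        System.out.println(matchesWithBackslash(value)); // true
        System.out.println(matchesWithDollar(value));    // false
        System.out.println(replaceMixed(value));         // 1-ha
    }
}
```

The "1-ha" result is the behavior the note above describes: the escaped "1" comes through as a literal character, while $2 is swapped for the second captured group.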
No find using \N
Find using \$
So there you have it - three powerful languages providing three different flavors of regular expression execution. I know these languages are all running on different RegEx engines, but I am a bit curious as to why there is no standard on how back-references work. This seems like the kind of thing that would have been nailed down after Perl (or whoever) set the standard. In any case, I hope this helps. If you are a .NET or Ruby developer, I'd love to hear how they use back-references as well.
I was always annoyed by the difference between how Homesite+ implemented RegEx backreference for find/replace and how CF does it. Why would the (admittedly old) CF IDE use a different backreference than CF itself?
I know exactly what you mean. I happen to love HomeSite. In fact, HomeSite is where I learned RegEx for the first time, using the Find/Replace to clean data exports from clients. HomeSite has allll kinds of differences. It's like a sub-set of the POSIX functionality. Very frustrating when simple things like (\r\n) don't work.
Homesite+ was also my introduction to RegEx. Back then, the "extended" find/replace feature made it easy to include line breaks, even in your RegEx--as long as you put them in as literals! That's sort of contrary to RegEx and probably stunted my growth/understanding of RegEx overall.
I'm generally pleased with RegEx support in Eclipse/cfEclipse find/replace these days, and have switched entirely over to Eclipse for all my CF, HTML, JS development. I've added the non-paid version of Aptana to Eclipse for HTML, CSS, JS but haven't begun to take advantage of Aptana's JS library recognition--haven't figured out how to tell it that I'm using jQuery or even my own libraries with a certain CF page. But the standard JS intellisense, color coding and code formatting alone are enough to abandon Homesite.
The extended find/replace definitely made line breaks easier! In fact, that's part of why I love the big box so much after all these years. Of course, once I started learning more about regular expressions, I wanted to just use \r\n... but no such luck. Still, it's a great feature.