While working on my ColdFusion custom tag DSL for HTML emails, I ran into an interesting problem when performing a multiline RegExp replace on my generated email content. This is not the first time that I've tripped over issues with multiline (?m) Regular Expression patterns and line-breaks. Though, in this case, the issue was that my RegEx pattern was failing to match adjacent lines if the pattern ended with a line-break. Or rather, it was failing in POSIX (the default ColdFusion Regular Expression engine); but, it was succeeding in Java in Lucee CFML 184.108.40.206.
At the end of the ColdFusion custom tag DSL rendering, I attempt to strip out unnecessary whitespace. That means removing "blank lines"; that is, lines that contain nothing other than space and tab characters.
My first attempt at this operation used the native reReplace() ColdFusion function, which uses the POSIX RegEx engine under the hood. This seemed to replace the first match of the pattern; but, left all adjacent matches in place. I then tried switching over to the Java Regular Expression engine (using the lower-level String.replaceAll() method) with the same pattern text, and the operation succeeded.
To see this divergence in behavior, let's create some control content with a string of "blank lines" and then try to strip them out using both the POSIX and the Java RegEx engines - note that I'm using Verbose mode (aka Comments mode) so that I can add comments next to the pattern text:
<cfscript>

	// We're building content that has multiple "blank lines" next to each other.
	content = arrayToList(
		[
			"AAAAA",
			"BBBBB",
			"", // Blank line.
			" ", // Blank line.
			" #chr( 9 )# ", // Blank line.
			"", // Blank line.
			"CCCCC",
			" ", // Blank line.
			"", // Blank line.
			"DDDDD"
		],
		chr( 10 )
	);

	// In order to make the Regular Expression (RegEx) pattern easier to read, I am
	// running it in VERBOSE mode (?x). This ignores incidental whitespace and requires
	// all whitespace characters to be explicitly provided. As such, I am using the
	// following HEX codes:
	// --
	// \x20 => Space
	// \x09 => Tab
	// --
	// This Regular Expression pattern is attempting to match "blank lines" (ie, lines
	// that have nothing but whitespace) so that I can strip those lines out in the
	// replacement operation.
	// --
	// NOTE: The triple-backticks below are part of the code. They open and close a
	// Lucee "tag island", which allows CFML tags to be used inside of CFScript.
	```
	<cfsavecontent variable="patternText"
		>(?mx) <!--- Multi-Line + Verbose mode enabled. --->
		^ <!--- Match at START OF LINE. --->
		[\x20\x09]* <!--- Leading Space or Tab characters. --->
		\n <!--- Match line-break at end of line. --->
	</cfsavecontent>
	```

	// Note that we are using the SAME PATTERN TEXT to apply the changes using the
	// default ColdFusion Regular Expression engine (POSIX) and the lower-level Java
	// Regular Expression engine.
	cfResult = content.reReplace( patternText, "", "all" );
	javaResult = javaCast( "string", content ).replaceAll( patternText, "" );

	echo( "<h3> POSIX (CFML) Result - reReplace() </h3>" );
	echo( "<pre>#encodeForHtml( cfResult )#</pre>" );

	echo( "<h3> Java Result - .replaceAll() </h3>" );
	echo( "<pre>#encodeForHtml( javaResult )#</pre>" );

</cfscript>
As you can see, I'm using multiline mode to find lines that have nothing but a string of tabs and spaces followed by a newline character. And, when we run this ColdFusion code in Lucee CFML, the two engines do not agree.
In the POSIX output, the number of "blank lines" is cut in half, whereas in the Java output, the "blank lines" are removed entirely.
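To isolate the Java half of that comparison, here is a minimal sketch in plain Java rather than CFML, so it can be run without a ColdFusion server. The class and method names (BlankLineDemo, stripBlankLines) are made up for illustration; the pattern text and the sample content mirror the ones used above.

```java
import java.util.regex.Pattern;

class BlankLineDemo {

    // Same pattern as in the post: multiline + verbose flags, then start-of-line,
    // optional spaces / tabs, and the trailing line-break.
    static final Pattern BLANK_LINE = Pattern.compile("(?mx) ^ [\\x20\\x09]* \\n");

    // Hypothetical helper: removes every line that matches the pattern.
    static String stripBlankLines(String content) {
        return BLANK_LINE.matcher(content).replaceAll("");
    }

    public static void main(String[] args) {
        // Mirrors the CFML control content, joined with chr( 10 ).
        String content = String.join(
            "\n",
            "AAAAA", "BBBBB", "", " ", " \t ", "", "CCCCC", " ", "", "DDDDD"
        );

        System.out.println(stripBlankLines(content));
    }
}
```

With java.util.regex, each adjacent "blank line" is matched in turn, so only the four lettered lines remain. This says nothing about why the POSIX engine skips every other line; it only confirms what the Java engine does with the same pattern text.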
We can get the POSIX version (the native reReplace() function) to work by wrapping the pattern text in its own capture group and having it repeat:
<cfsavecontent variable="patternText"
	>(?mx)
	^
	<!---
		By wrapping the "blank line" in a repeating capture group, we use the
		repeating nature of the pattern to replace adjacent lines rather than
		leaning entirely on the "all" behavior of the reReplace() function.
	--->
	(
		[\x20\x09]*
		\n
	)+
</cfsavecontent>
This gets around the issue by leaning on the repeating nature of the RegEx pattern rather than relying on the "all" behavior of the reReplace() function.
I absolutely love Regular Expressions. But, they can be complex; and, tripping over the differences between the POSIX engine and the Java engine is never fun. But, hopefully this will stick to the back of my mind; and, I'll have it on hand as I continue to write sweet, sweet pattern matching Lucee CFML code.
Switching Away From the POSIX Engine
As of Adobe ColdFusion 2018, you can actually configure your ColdFusion application to use Java as the default RegEx engine by enabling the useJavaAsRegexEngine setting. I haven't tested this specifically for this example; but, I assume it means that both outputs would become identical.
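If an application does move to the Java engine, the repeating-group pattern from earlier should not need to be rolled back. The sketch below is plain Java and has not been run against the useJavaAsRegexEngine setting itself; it only suggests that, under java.util.regex, the single-line pattern and the repeating-group pattern produce the same text. The class and helper names (BlankRunComparison, strip, countMatches) are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BlankRunComparison {

    // Original pattern: one "blank line" per match.
    static final Pattern SINGLE = Pattern.compile("(?mx) ^ [\\x20\\x09]* \\n");

    // Workaround pattern: a whole run of adjacent "blank lines" per match.
    static final Pattern RUN = Pattern.compile("(?mx) ^ ( [\\x20\\x09]* \\n )+");

    // Hypothetical helper: removes everything the given pattern matches.
    static String strip(Pattern pattern, String content) {
        return pattern.matcher(content).replaceAll("");
    }

    // Hypothetical helper: counts how many separate matches the pattern finds.
    static int countMatches(Pattern pattern, String content) {
        Matcher matcher = pattern.matcher(content);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String content = String.join(
            "\n",
            "AAAAA", "BBBBB", "", " ", " \t ", "", "CCCCC", " ", "", "DDDDD"
        );

        // Both patterns should yield the same stripped text.
        System.out.println(strip(SINGLE, content).equals(strip(RUN, content)));

        // The difference is in how the work is divided up across matches.
        System.out.println(countMatches(SINGLE, content));
        System.out.println(countMatches(RUN, content));
    }
}
```

The resulting text is the same either way; the difference is that the repeating-group version consumes each run of "blank lines" in a single match instead of one match per line, which is exactly the property that sidesteps the adjacency problem in the POSIX engine.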