While working on my ColdFusion custom tag DSL for HTML emails, I ran into an interesting problem when performing a multiline RegExp replace on my generated email content. This is not the first time that I've tripped over issues with multiline (?m) Regular Expression patterns and line-breaks. Though, in this case, the issue was that my RegEx pattern was failing to match adjacent lines if the pattern ended with a line-break. Or rather, it was failing in POSIX (the default ColdFusion Regular Expression engine); but, it was succeeding in Java in Lucee CFML 126.96.36.199.
At the end of the ColdFusion custom tag DSL rendering, I attempt to strip-out unnecessary whitespace. Which means, removing "blank lines"; or, lines that contain nothing other than space and tab characters.
My first attempt at this operation used the native reReplace() ColdFusion function, which uses the POSIX RegEx engine under the hood. This seemed to replace the first match of the pattern; but, left all adjacent matches in place. I then tried switching over to the Java Regular Expression engine (using the lower-level String.replaceAll() method) with the same pattern text, and the operation succeeded.
To see this divergence in behavior, let's create some control content with a string of "blank lines" and then try to strip them out using both the POSIX and the Java RegEx engines - note that I'm using Verbose mode (aka Comments mode) so that I can add comments next to the pattern text:
<cfscript>

	// We're building content that has multiple "blank lines" next to each other.
	content = arrayToList(
		[
			"AAAAA",
			"BBBBB",
			"", // Blank line.
			" ", // Blank line.
			" #chr( 9 )# ", // Blank line.
			"", // Blank line.
			"CCCCC",
			" ", // Blank line.
			"", // Blank line.
			"DDDDD"
		],
		chr( 10 )
	);

	// In order to make the Regular Expression (RegEx) pattern easier to read, I am
	// running it in VERBOSE mode (?x). This ignores incidental whitespace and requires
	// all whitespace characters to be explicitly provided. As such, I am using the
	// following HEX codes:
	// --
	// \x20 => Space
	// \x09 => Tab
	// --
	// This Regular Expression pattern is attempting to match "blank lines" (ie, lines
	// that have nothing but whitespace) so that I can strip those lines out in the
	// replacement operation.
	```
	<cfsavecontent variable="patternText" >(?mx)  <!--- Multi-Line + Verbose mode enabled. --->
		^             <!--- Match at START OF LINE. --->
		[\x20\x09]*   <!--- Leading Space or Tab characters. --->
		\n            <!--- Match line-break at end of line. --->
	</cfsavecontent>
	```

	// Note that we are using the SAME PATTERN TEXT to apply the changes using the
	// default ColdFusion Regular Expression engine (POSIX) and the lower-level Java
	// Regular Expression engine.
	cfResult = content.reReplace( patternText, "", "all" );
	javaResult = javaCast( "string", content ).replaceAll( patternText, "" );

	echo( "<h3> POSIX (CFML) Result - reReplace() </h3>" );
	echo( "<pre>#encodeForHtml( cfResult )#</pre>" );

	echo( "<h3> Java Result - .replaceAll() </h3>" );
	echo( "<pre>#encodeForHtml( javaResult )#</pre>" );

</cfscript>
As you can see, I'm using multiline mode to find lines that have nothing but a string of tabs and spaces followed by a newline character. And, when we run this ColdFusion code in Lucee CFML, we get the following output:
As you can see, we get a different result when using the POSIX RegEx engine vs. using the Java RegEx engine. In the POSIX output, the number of "blank lines" is cut in half whereas in the Java output, the "blank lines" are removed entirely.
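For anyone who wants to poke at the Java half of this comparison outside of ColdFusion, here is a minimal standalone sketch. The class and method names (BlankLineDemo, stripBlankLines) are made up for illustration; only the pattern text comes from the example above, collapsed onto a single line. It exercises java.util.regex only, so it cannot reproduce the POSIX behavior.

```java
// Hypothetical demo class; only the pattern text comes from the post.
class BlankLineDemo {

    // (?mx) = multiline + verbose, so the spaces inside the pattern are ignored.
    static final String PATTERN_TEXT = "(?mx) ^ [\\x20\\x09]* \\n";

    // Removes every line that holds nothing but spaces and tabs.
    static String stripBlankLines(String content) {
        return content.replaceAll(PATTERN_TEXT, "");
    }

    public static void main(String[] args) {
        // Same control content as the ColdFusion example, joined with newlines.
        String content = String.join(
            "\n",
            "AAAAA", "BBBBB", "", " ", " \t ", "", "CCCCC", " ", "", "DDDDD"
        );
        System.out.println(stripBlankLines(content));
    }
}
```

Running main should print only the four lettered lines, which lines up with the Java result described above.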
We can get the POSIX version (the native reReplace() function) to work by wrapping the pattern text in its own capture group and having it repeat:
<cfsavecontent variable="patternText" >(?mx)
	^
	<!---
		By wrapping the "blank line" in a repeating capture group, we use the
		repeating nature of the pattern to replace adjacent lines rather than
		leaning entirely on the "all" behavior of the reReplace() function.
	--->
	(
		[\x20\x09]*
		\n
	)+
</cfsavecontent>
This gets around the issue by leaning on the repeating nature of the RegEx pattern rather than relying on the "all" behavior of the reReplace() function.
I absolutely love Regular Expressions. But, they can be complex; and, tripping over the differences between the POSIX engine and the Java engine is never fun. But, hopefully this will stick to the back of my mind; and, I'll have it on hand as I continue to write sweet, sweet pattern matching Lucee CFML code.
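As a footnote to the repeating-group workaround: on the Java side, the repeating variant should produce the same output as the single-line pattern, just with fewer matches, since each run of adjacent blank lines is consumed in one go. The sketch below is a rough way to see that; RepeatingGroupDemo and its members are hypothetical names, and it says nothing about why the POSIX engine behaves differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two pattern variants in java.util.regex.
class RepeatingGroupDemo {

    // One blank line per match.
    static final String SINGLE = "(?mx) ^ [\\x20\\x09]* \\n";

    // A whole run of adjacent blank lines per match.
    static final String REPEATING = "(?mx) ^ ( [\\x20\\x09]* \\n )+";

    // Counts how many separate matches the pattern finds in the content.
    static int countMatches(String patternText, String content) {
        Matcher matcher = Pattern.compile(patternText).matcher(content);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String content = "AAAAA\nBBBBB\n\n \n \t \n\nCCCCC\n \n\nDDDDD";
        System.out.println("single matches:    " + countMatches(SINGLE, content));
        System.out.println("repeating matches: " + countMatches(REPEATING, content));
        // Both replacements should leave the same text behind.
        System.out.println(
            content.replaceAll(SINGLE, "").equals(content.replaceAll(REPEATING, ""))
        );
    }
}
```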
Switching Away From the POSIX Engine
As of Adobe ColdFusion 2018, you can actually configure your ColdFusion application to use Java as the default RegEx engine by enabling the useJavaAsRegexEngine setting. I haven't tested this specifically for this example; but, I assume it means that both outputs would become identical.
Hi Ben. I always use:
REReplaceNoCase(string,"[\s]+"," ", "ALL");
When I need to strip out new lines, tabs & carriage returns.
Replacing with a single space doesn't tend to cause any harm.
The next Lucee release, 5.3.8, also supports using the Java regex engine.
That's not a bad idea. In my particular case, I wanted to keep the line-breaks in place because I was outputting HTML source code - and, I wanted to keep the "View Source" a bit more readable. But, yeah, I like your thinking there.
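To make the trade-off in this exchange concrete, here is a small sketch of the two approaches side by side, written against the Java engine. The class and method names are hypothetical, and the HTML snippet in main is made-up sample input: collapsing every whitespace run flattens the markup onto one line, while stripping only blank lines keeps the remaining line-breaks.

```java
// Hypothetical demo class contrasting the two whitespace clean-up strategies.
class WhitespaceCollapseDemo {

    // Collapses every run of whitespace (line-breaks included) into one space.
    static String collapseAll(String input) {
        return input.replaceAll("[\\s]+", " ");
    }

    // Removes only the lines that hold nothing but spaces and tabs.
    static String stripBlankLines(String input) {
        return input.replaceAll("(?m)^[ \\t]*\\n", "");
    }

    public static void main(String[] args) {
        String html = "<p>\n\n\t<b>Hi</b>\n \n</p>";
        System.out.println(collapseAll(html));
        System.out.println("--");
        System.out.println(stripBlankLines(html));
    }
}
```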
Nice nice nice! I just love Lucee :)
OK. I see. Yes. My method zaps all line breaks.
Dealing with regex over multiple lines can be a bit buggy, in my experience.
I must say, I never knew about:
I must use this setting sometime and see if I can finally apply regex over more than one line.
This has been a bugbear of mine for many years...
I also tried to run your code in TryCF.com and the Lucee CFML engine started to complain about a missing CFTRY tag?
In the end, I copied your code to cffiddle.org.
Now, cffiddle.org only allows us to choose the ACF CFML engine.
I had to change your code to the following before it worked:
<cfscript>

	content = arrayToList(
		[
			"AAAAA",
			"BBBBB",
			"",
			" ",
			" #chr( 9 )# ",
			"",
			"CCCCC",
			" ",
			"",
			"DDDDD"
		],
		chr( 10 )
	);

	patternText = "(?mx)^[\x20\x09]*\n";

	result = content.reReplace( patternText, "", "all" );
	javaResult = javaCast( "string", content ).replaceAll( patternText, "" );

	WriteOutput( "<h3> POSIX (CFML) Result - reReplace() </h3>" );
	WriteOutput( "<pre>#encodeForHtml( result )#</pre>" );

	WriteOutput( "<h3> Java Result - .replaceAll() </h3>" );
	WriteOutput( "<pre>#encodeForHtml( javaResult )#</pre>" );

</cfscript>
What's interesting about all of this is how much ACF has now diverged from Lucee!
Anyway, this has been a very useful exploration, and I am sure, at some point, I will need to strip out blank lines, using regex. So, thanks...
By the way, your results were emulated on cffiddle.org
To be clear, the (?) pattern is used to turn pattern flags on and off. So, in this case, (?mx) is actually turning on two different flags:
m - Multiline matching mode.
x - Verbose / comment mode.
You can also turn on:
i - Case insensitive mode.
Which means that reFindNoCase( "pattern" ) is the same as reFind( "(?i)pattern" ).
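The same inline-flag idea can be tried directly against the Java engine. This is a small sketch with hypothetical names (InlineFlagDemo, found); it shows flags being turned on with (?i) and (?m), and turned back off mid-pattern with (?-i).

```java
import java.util.regex.Pattern;

// Hypothetical demo class for inline pattern flags in java.util.regex.
class InlineFlagDemo {

    // Returns true if the pattern matches anywhere in the input.
    static boolean found(String patternText, String input) {
        return Pattern.compile(patternText).matcher(input).find();
    }

    public static void main(String[] args) {
        // (?i) turns on case-insensitive matching for the rest of the pattern.
        System.out.println(found("pattern", "My PATTERN"));
        System.out.println(found("(?i)pattern", "My PATTERN"));

        // (?-i) turns the flag back off, so only the "a" is case-insensitive.
        System.out.println(found("(?i)a(?-i)b", "Ab"));
        System.out.println(found("(?i)a(?-i)b", "AB"));

        // (?m) lets ^ and $ match at line boundaries, not just at the ends.
        System.out.println(found("^b$", "a\nb\nc"));
        System.out.println(found("(?m)^b$", "a\nb\nc"));
    }
}
```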
This is very cool.
const regex = new RegExp('[\\s]+', 'igm');
But I never found out how to do this in ColdFusion? I know, I feel like an idiot;)
So, now, in ColdFusion, all I have to do is:
REReplaceNoCase("string", "(?m)pattern","replacement","ALL" ); // equivalent to -> igm
REReplaceNoCase("string", "(?m)pattern","replacement","ONE" ); // equivalent to -> im
REReplace("string", "(?m)pattern","replacement","ALL" ); // equivalent to -> gm
REReplace("string", "(?m)pattern","replacement","ONE" ); // equivalent to -> m
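For comparison, a rough Java analogue of that mapping: the "g" part corresponds to choosing replaceAll() over replaceFirst(), while "i" and "m" travel inside the pattern as inline flags. The class and method names below are hypothetical, and the mapping is only an approximation of the JavaScript flags.

```java
// Hypothetical demo class: "g" becomes a method choice, "i"/"m" stay inline.
class GlobalFlagDemo {

    // global = true behaves like "ALL" / the g flag; false like "ONE".
    static String replace(String input, String patternText, String replacement, boolean global) {
        return global
            ? input.replaceAll(patternText, replacement)
            : input.replaceFirst(patternText, replacement);
    }

    public static void main(String[] args) {
        String input = "a\nA\na";
        // Roughly "igm": every line that is exactly "a" or "A" is replaced.
        System.out.println(replace(input, "(?im)^a$", "x", true));
        System.out.println("--");
        // Roughly "im": only the first such line is replaced.
        System.out.println(replace(input, "(?im)^a$", "x", false));
    }
}
```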
To be fair, I am not sure whether Adobe have any docs on regex flags? It would be great if you could publish a full list of ColdFusion regex flags in a reply.
This would be incredibly useful for future reference.
Here are the flags I found on regex101:
Dollar end only
When it comes to RegEx flags, you just have to be careful that they aren't universally supported. So, what works in the Java RegEx engine may not work in the POSIX RegEx engine. Always be sure to test!
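One cheap way to "always test" on the Java side is to try compiling the pattern and catch the syntax error. The sketch below uses hypothetical names (FlagSupportCheck, compiles); (?J) is used as an example of an inline flag that PCRE accepts but java.util.regex rejects. This only covers the Java engine; the POSIX engine would need its own check in ColdFusion.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper: reports whether java.util.regex accepts a pattern.
class FlagSupportCheck {

    static boolean compiles(String patternText) {
        try {
            Pattern.compile(patternText);
            return true;
        } catch (PatternSyntaxException e) {
            // Unknown inline flags surface as a syntax error at compile time.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("(?mx) ^ [\\x20\\x09]* \\n"));
        System.out.println(compiles("(?J)a"));
    }
}
```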