Application Setting "useJavaAsRegexEngine" Tells CFML To Use Java's RegEx Engine For Built-In Re-Functions In Adobe ColdFusion 2018

By Ben Nadel

Published 2020-05-17 in ColdFusion — Comments (10)

So, yesterday, when I was looking something up in the Adobe ColdFusion RegEx documentation, I saw something that blew my mind! Apparently, as of Update 5, there is now a ColdFusion 2018 Application setting - useJavaAsRegexEngine - that tells the CFML runtime to use Java's Regular Expression engine when executing built-in functions like reFind(), reMatch(), and reReplace(). While the default POSIX pattern matching engine is pretty great, it's less robust and less powerful than the Java pattern matching engine. As such, this will be a very welcome change for the Adobe ColdFusion community.

To see an example of this change, let's look at using a positive look-behind. In a Regular Expression pattern, a positive look-behind allows us to capture a value, but only if it immediately follows another pattern. And, it does so without us having to capture the "preceding part" of said pattern.

So, in the following code, we're going to try and capture the term pajamas; but, only if it immediately follows the term cat's:

  
	<cfscript>

		value = "You are the cat's pajamas!";

		// Normally, the POSIX-based Regular Expression engine that Adobe ColdFusion uses
		// under the hood doesn't support things like LOOK-BEHINDS. As such, the following
		// RegEx pattern - which is looking for the term "pajamas", but only if it comes
		// right after the term "cat's " - will throw an error:
		writeDump( value.reMatch( "(?<=cat's )pajamas" ) );

	</cfscript>

Here, we're using the built-in Adobe ColdFusion function reMatch(). And, if we run this in Adobe ColdFusion 2018 - without any custom settings - we get the following ColdFusion error:

Adobe ColdFusion 2018 is throwing an error about malformed RegEx patterns that contain positive look-behinds.

Malformed regular expression "(?<=cat's )pajamas".

Reason: Sequence (?<...) not recognized.

As you can see, the default POSIX RegEx engine that Adobe ColdFusion uses does not support look-behinds. But, if we create an Application.cfc ColdFusion framework component, we can switch our RegEx engine:

  
	component
		output = false
		hint = "I define the application settings and event-handlers."
		{

		this.name = "RegExTesting";

		// Tell Adobe ColdFusion 2018, Update 5, to use the Java Regular Expression engine
		// for all of its built-in re-Functions (like, reFind() and reMatch()).
		this.useJavaAsRegexEngine = true;

	}

Now, with the useJavaAsRegexEngine application setting enabled, if we re-run the same code from above, we get the following output:

Adobe ColdFusion 2018 supports positive look-behinds in RegEx patterns if the useJavaAsRegexEngine Application setting is enabled.

Oh sweet chickens! As you can see, with the useJavaAsRegexEngine ColdFusion application setting enabled, the look-behind works; and, we were able to locate and extract the phrase pajamas!
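For comparison, here is a minimal sketch of the same look-behind run directly against Java's java.util.regex classes, outside of CFML. It assumes that this is the engine the setting switches to. The class and method names (LookBehindDemo, matchAll) are hypothetical and exist only for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of ColdFusion or the original post.
final class LookBehindDemo {

    // Collects every substring of the input that matches the pattern,
    // which is roughly the array that reMatch() returns in CFML.
    static List<String> matchAll(String regex, String input) {
        List<String> results = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            results.add(matcher.group());
        }
        return results;
    }

    public static void main(String[] args) {
        String value = "You are the cat's pajamas!";
        // Positive look-behind: match "pajamas" only when preceded by "cat's ".
        System.out.println(matchAll("(?<=cat's )pajamas", value)); // [pajamas]
        // No match when the preceding text differs.
        System.out.println(matchAll("(?<=dog's )pajamas", value)); // []
    }
}
```

Only the term pajamas comes back; the "cat's " prefix is asserted by the look-behind but is not part of the match.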

At the time of this writing, Lucee CFML does not yet support this feature. However, I did locate a JIRA ticket - LDEV-2495 - which has this feature listed as a compatibility issue. As such, I fully expect this to become available in a future Lucee CFML release.

Short link: https://bennadel.com/3829

Reader Comments

Zachary Spitzer May 18, 2020 at 4:31 AM

New Lucee bug filed for this specific feature https://luceeserver.atlassian.net/browse/LDEV-2892

Ben Nadel May 18, 2020 at 7:23 AM

@Zac,

You guys are so fast. It's definitely a fascinating feature. It's not without its issues; but, I suspect that anything that can be set globally to an application has some degree of trade-offs. Personally, I just love the Java RegEx engine!

Tyler Clendenin May 18, 2020 at 9:10 AM

Any resources for the differences between the RegEx engines? I assume there are things that won't work with the Java RegEx that would work with the default engine.

Ben Nadel May 18, 2020 at 11:34 AM

@Tyler,

Great question! Yes and no. Yes, in that the Adobe site appears to have a comparison of the Java and Perl pattern engines in its documentation. However, no, in that the list in those docs doesn't appear to be accurate. For example, the list says that both engines support a "look-behind" of fixed length. However, as I demonstrated in this post, that is clearly not the case (as the POSIX engine can't even parse such a pattern). So, I am not sure what else is wrong in those docs.

James Moberg Nov 11, 2022 at 6:05 PM

There doesn't appear to be any way to determine if this flag is enabled or not. If third-party modules are written that depend upon newer syntax, I would hope that developers wouldn't blindly enable this feature server or application-wide as it could potentially cause problems with preexisting regex functions, right?

I would have much preferred a per-function flag for better granular control. I'm going to keep explicitly using java or use your jRegex library (which also explicitly uses java).

Peter Boughton Nov 11, 2022 at 6:05 PM

Just seen the following article but couldn't comment there without registering:
https://coldfusion.adobe.com/2022/11/switching-cf-to-use-java-regex-engine/

Followed this link and was further disappointed to see Ben calling it a POSIX Engine. :/

ColdFusion by default uses the Apache ORO regex engine, which is neither Perl nor POSIX.

Apache ORO is/was a Java library that aimed for Perl5 compatibility, but never achieved it.
(Also, Apache ORO is not PCRE, which is a different attempt at a Perl-compatible engine.)

POSIX currently defines two regex engines (BRE and ERE), and there's GNU extended versions of both, but none of those are the same as Apache ORO.

I'm assuming the "useJavaAsRegexEngine" switches to the JVM's built-in java.util.regex engine, which has all the features of Apache ORO, along with several that it doesn't have, but there is one prominent thing that's missing in Java's engine: the ability to change the case of replacement text with \u or \L..\E etc.

Oh yeah, and Java uses $n instead of \n for groups - based on the error in Adam Cameron's blog post, that conversion wasn't properly tested/implemented when the flag was added to Lucee.

Ben Nadel Nov 11, 2022 at 6:14 PM

@Peter,

I'll have to plead ignorance on the POSIX vs Apache ORO engine stuff. I honestly don't really even know what POSIX is exactly. But, I'm pretty sure the ColdFusion docs - maybe years ago - talked about the RegEx patterns as being POSIX... or maybe it was simply POSIX compatible - I don't really recall. It's 100% possible that I simply misunderstood what I had read and have been carrying an incorrect mental model in my head for years.

As far as what useJavaAsRegexEngine does, I believe you are correct - this is using Java's java.util.regex engine under the hood. Or, at least that's what I believe it means.

But, it's funny, one thing that I really miss in Java's patterns is the \u (uppercase) and \l (lowercase) variations. I'm pretty sure I used to use those for enforcing "title casing" in some scenarios.
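As a rough illustration of the usual workaround in plain Java, the case change can be done in code while building each replacement, since the replacement string itself has no \u or \l. This is a sketch only; the class and method names (CaseDemo, capitalizeWords) are hypothetical, and the simple pattern treats any lowercase letter at a word boundary as the start of a word:

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of ColdFusion or the original post.
final class CaseDemo {

    // Upper-cases the first letter of every word that starts with a-z.
    static String capitalizeWords(String input) {
        Matcher matcher = Pattern.compile("\\b([a-z])(\\w*)").matcher(input);
        StringBuffer output = new StringBuffer();
        while (matcher.find()) {
            // The case change happens in Java code, not in the replacement syntax.
            String replacement = matcher.group(1).toUpperCase(Locale.ROOT) + matcher.group(2);
            // quoteReplacement() keeps any "$" or "\" in the text literal.
            matcher.appendReplacement(output, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(output);
        return output.toString();
    }

    public static void main(String[] args) {
        System.out.println(capitalizeWords("hello wide world")); // Hello Wide World
    }
}
```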

Ultimately, though, I've never turned the useJavaAsRegexEngine property on for any ColdFusion application I've been working on - not even for this blog (which is 100% my own code). Just always feels like it's gonna set off a delayed bomb where some incompatibility will blow up at some point. I just keep using Java's Pattern and Matcher libraries explicitly. Fewer surprises.

Ben Nadel Nov 11, 2022 at 6:18 PM

@James,

I haven't tried this, but I'm wondering if it would show up in the payload from a getApplicationMetadata() call?

That said, there are a number of things I would love to have on a per-Component basis. Not least of which is something like localmode="modern" in Lucee CFML. Also, being able to turn on full null-support on a per-Component basis would be awesome.

Peter Boughton Nov 12, 2022 at 10:48 AM

Should have checked before posting - on my last point I made an incorrect assumption - there is no conversion performed for the replacement string. (At least in Lucee; I'm not installing CF to test.)

So if one toggles useJavaAsRegexEngine, they must know to go update existing replacement strings, escaping $ to \$ and changing back references (\0 to $0, \1 to $1, etc).

It's not an unreasonable approach - if it were documented - but neither CF nor Lucee docs mention this; it should be covered by both rereplace page and the Application variables page where useJavaAsRegexEngine is mentioned.
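The replacement-string rules described above can be sketched in plain Java. This shows java.util.regex behavior only; whether CF or Lucee pass replacement strings through unchanged is taken from the comment above, not verified here. The class and method names (ReplacementDemo and its methods) are hypothetical:

```java
import java.util.regex.Matcher;

// Hypothetical demo class; not part of ColdFusion, Lucee, or the original post.
final class ReplacementDemo {

    // Java back-references in the replacement are written $1, $2, and so on.
    static String swapWithDollar(String input) {
        return input.replaceAll("(\\w+) (\\w+)", "$2 $1");
    }

    // An old-style \2 \1 replacement is not a back-reference in Java:
    // the backslash just escapes the next character, leaving a literal digit.
    static String swapWithBackslash(String input) {
        return input.replaceAll("(\\w+) (\\w+)", "\\2 \\1");
    }

    // A literal dollar sign must be escaped as \$ in the replacement.
    static String toPrice(String input) {
        return input.replaceAll("(\\d+) dollars", "\\$$1");
    }

    // For arbitrary literal text, quoteReplacement() does the escaping.
    static String replaceLiteral(String input, String regex, String literal) {
        return input.replaceAll(regex, Matcher.quoteReplacement(literal));
    }

    public static void main(String[] args) {
        System.out.println(swapWithDollar("hello world"));    // world hello
        System.out.println(swapWithBackslash("hello world")); // 2 1
        System.out.println(toPrice("5 dollars"));             // $5
        System.out.println(replaceLiteral("cost: PRICE", "PRICE", "$5.00")); // cost: $5.00
    }
}
```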

Just seen your reply...

What Apache ORO supports is POSIX character classes, e.g. using [[:space:]] for whitespace - but I guess most people don't use those because \s is easier to type/read.

Apache ORO doesn't have full POSIX compatibility because Apache ORO only has \b for word boundary, whilst POSIX ERE also accepts \< and \> for start and end of words (and POSIX BRE only has \< and \>, with no support for \b). I have a feeling there's a few other minor differences between ORO/ERE, but can't recall for sure.

POSIX itself is a broad set of multiple standards to help maintain compatibility between operating systems - if an OS is POSIX compatible, it's easier to port software between. Most Unixes and Linuxes are (or can be made) POSIX-compatible. Regex is just one relatively small part of that picture.

And yeah, I have no intention of using this flag - safer to use Java functionality directly... well almost directly, since I wrote a wrapper library to reduce boilerplate. :)

Ben Nadel Nov 12, 2022 at 10:53 AM

@Peter,

The POSIX character class that I've definitely used in the past is [:punct:] for punctuation. That one really represents a lot of individual characters. Most of the other abbreviations seem like they are actually longer than the characters they represent 😆
