Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.

Building reMatchGroups() Using reFind() In Adobe ColdFusion 2018 And Lucee CFML 5.3.7.47

By Ben Nadel on
Tags: ColdFusion

The other day, in my post about parsing strings like "5mb" into a number of bytes, I was griping about the fact that the ColdFusion language still doesn't have an reMatchGroups() function. To this, Adam Cameron mentioned that the reFind() function has had a "scope" argument since Adobe ColdFusion 2016 that will cause the function to return all the matches in the input. I hadn't noticed this change. As such, I wanted to take a quick look at how reFind() can be used to build my reMatchGroups() function in Adobe ColdFusion 2018 and Lucee CFML 5.3.7.47.

Adobe ColdFusion 2018

Since its inception, the reFind() function has supported the concept of returning sub-expressions. When called in "sub-expression mode", the result of the function will not be the index of the first match. Instead, it will be a structure that includes the match, the position, and the length of the matching groups within the input string. This data could then be used to iterate over all the matches in the input by repeatedly calling reFind() with an increasing "start" value:

<cfscript>

	input = "Hello there, you magnificent bastard!";
	start = 1;

	while ( true ) {

		result = input.reFindNoCase( "(\b[a-z])(\w*)", start, true );

		// When returning the sub-expressions, a non-matching result will still have
		// the LEN, POS, and MATCH arrays. But, the position of the match will be zero.
		// If that's the case, we've found all the matches.
		if ( ! result.pos[ 1 ] ) {

			break;

		}

		writeOutput( result.match[ 1 ] & "<br />" );

		// On the next iteration, start at the end of the current match.
		start = ( result.pos[ 1 ] + result.len[ 1 ] );

	}

</cfscript>

As you can see, at the end of each loop we increment the start variable such that the next time we call reFind(), we start at a position just past the current find. And, when we run this ColdFusion code, we get the following output:

Hello
there
you
magnificent
bastard

This "works"; but, it's very manual and it has to do extra processing because the ColdFusion engine has to re-parse the Regular Expression and re-traverse the input string on every iteration.

Now, as per Adam's insights, Adobe ColdFusion 2016 added a "scope" argument to the reFind() function which gets the function to return all the matches instead of just a single match. This reduces the manual part of the iteration and makes it quite easy to build an reMatchGroups() function on the back of the reFind() function:

<cfscript>

	input = "Hello there, you magnificent bastard!";
	// Match any letter at the beginning of a word-boundary followed by N-number of
	// word-characters.
	regexPattern = "(\b[a-z])(\w*)";

	writeDump( reMatchGroupsNoCase( input, regexPattern ) );

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	/**
	* I return the captured-group matches of the given Regular Expression pattern within
	* the given input.
	* 
	* @input I am the string being parsed.
	* @inputPattern I am the RegEx pattern used to locate matches.
	*/
	public array function reMatchGroups(
		required string input,
		required string inputPattern
		) {

		var results = input
			.reFind( inputPattern, 1, true, "all" )
			.map(
				( item ) => {

					// If the item has no position, it means that we didn't match any
					// substrings in our input.
					if ( ! item.pos[ 1 ] ) {

						return;

					}

					var groups = [:];

					for ( var i = 1 ; i <= item.match.len() ; i++ ) {

						// The zeroth group in the series is always the full match of the
						// Regular Expression pattern. As such, we have to translate the
						// 1-based ColdFusion system into the 0-based group system.
						groups[ i - 1 ] = item.match[ i ];

					}

					return( groups );

				}
			)
		;

		// If we didn't match any substrings in our input, our results will contain one
		// entry that is UNDEFINED. In that case, return an empty results-set.
		if ( ! results.isDefined( 1 ) ) {

			return( [] );

		}

		return( results );

	}


	/**
	* I return the captured-group matches of the given Regular Expression pattern within
	* the given input using CASE-INSENSITIVE matching.
	* 
	* @input I am the string being parsed.
	* @inputPattern I am the RegEx pattern used to locate matches.
	*/
	public array function reMatchGroupsNoCase(
		required string input,
		required string inputPattern
		) {

		// Instead of re-creating all of the logic, we can just call the case-sensitive
		// function but prepend the CASE-INSENSITIVE FLAG to the Regular Expression
		// pattern. This will run a case-insensitive search using the case-sensitive
		// reFind() function.
		return( reMatchGroups( input, ( "(?i)" & inputPattern ) ) );

	}

</cfscript>

Here, by passing "all" as the scope in the reFind() function, we receive an Array that contains the pos, len, and match attributes for all the matches in the input string. Getting the captured groups then becomes a simple translation of the 1-based ColdFusion values onto the traditional 0-based group values used in the Regular Expression world.

And, when we run this Adobe ColdFusion code, we get the following output:

[Image: Captured groups being output via reFind() in Adobe ColdFusion 2018.]

As you can see, by using reFind() and the "scope" argument of "all", we can iterate over each match and pluck out the captured groups quite nicely.
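
To make the 1-based to 0-based translation a bit more concrete, here's a minimal sketch that reads the raw reFind() results directly, without the reMatchGroups() wrapper. The sample input and the output formatting are assumptions for illustration only; and, it assumes an engine that supports the "scope" argument and returns the correct group values (such as Adobe ColdFusion 2018):

<cfscript>

	// Illustrative input only (lowercase so that the case-sensitive reFind() matches).
	input = "hello there";

	// With a scope of "all", reFind() returns an ARRAY of structs - one per match -
	// each of which contains the parallel arrays: pos, len, and match.
	matches = reFind( "(\b[a-z])(\w*)", input, 1, true, "all" );

	for ( item in matches ) {

		// ColdFusion index 1 is the full match (RegEx group 0); index 2 is the first
		// captured group (RegEx group 1); index 3 is the second (RegEx group 2).
		writeOutput( "Full: #item.match[ 1 ]#, Group 1: #item.match[ 2 ]#, Group 2: #item.match[ 3 ]#<br />" );

	}

</cfscript>

For this input, the loop should output one line for "hello" (groups "h" and "ello") and one line for "there" (groups "t" and "here").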

Lucee CFML 5.3.7.47

Normally, I would write my demo in Lucee CFML only. However, it turns out that the .reFind() function is actually quite janky in Lucee CFML. The two big issues are:

  1. The .reFind() member method doesn't support the scope argument and complains about "too many arguments".

  2. The captured groups don't report the correct match values - each captured group contains exactly the same value for a given match.

The first issue is fine because the built-in, top-level reFind() function supports the "scope" argument. But, the second issue is quite unfortunate. You can see what I mean with a simple test:

<cfscript>

	dump(
		reFindNoCase(
			"(\b[a-z])(\w*)",
			"Hello there, you magnificent bastard!",
			1,
			true // Return sub-expressions.
		)
	);

</cfscript>

When we run this, we get the first full match, which contains:

[Image: Captured groups returned from reFind() are incorrect in Lucee CFML 5.3.7.47 - they are all the same.]

As you can see, every captured group within a given match contains the same value. This is incorrect - I'll try to find a filed bug on this and leave it in the comments.

That said, while the captured groups are incorrect, the len and pos values are correct. Which means, we can still use the reFind() function to build an reMatchGroups() function - we just have to use a little more elbow-grease:

<cfscript>

	input = "Hello there, you magnificent bastard!";
	// Match any letter at the beginning of a word-boundary followed by N-number of
	// word-characters.
	regexPattern = "(\b[a-z])(\w*)";

	dump( reMatchGroupsNoCase( input, regexPattern ) );

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	/**
	* I return the captured-group matches of the given Regular Expression pattern within
	* the given input.
	* 
	* @input I am the string being parsed.
	* @inputPattern I am the RegEx pattern used to locate matches.
	*/
	public array function reMatchGroups(
		required string input,
		required string inputPattern
		) {

		// NOTE: The member-method, .reFind(), does not seem to support "scope" at this
		// time and throws a "too many arguments" error. As such, I am using the top-
		// level built-in function for reFind().
		var results = reFind( inputPattern, input, 1, true, "all" )
			.map(
				( item ) => {

					// If the item has no position, it means that we didn't match any
					// substrings in our input.
					if ( ! item.pos[ 1 ] ) {

						return;

					}

					// The zeroth group in the series is always the full match of the
					// Regular Expression pattern. We don't have to parse it out.
					var groups = [
						0: item.match[ 1 ]
					];

					// For the 2..Nth group, we'll have to use the POSITION and LENGTH
					// matches to parse the captured group out of the input.
					for ( var i = 2 ; i <= item.match.len() ; i++ ) {

						var groupName = ( i - 1 );
						var groupStart = item.pos[ i ];
						var groupLength = item.len[ i ];

						// CAUTION: The "position" values that reFind() returns are
						// relative to the original input - NOT THE MATCHED VALUE. As
						// such, we have to re-parse the input to get the group.
						// --
						// NOTE: Optional capture groups have a zero-length if they were
						// not matched. In such cases, we'll just return an empty string.
						groups[ groupName ] = ( groupLength )
							? input.mid( groupStart, groupLength )
							: ""
						;

					}

					return( groups );

				}
			)
		;

		// If we didn't match any substrings in our input, our results will contain one
		// entry that is UNDEFINED. In that case, return an empty results-set.
		if ( ! results.isDefined( 1 ) ) {

			return( [] );

		}

		return( results );

	}


	/**
	* I return the captured-group matches of the given Regular Expression pattern within
	* the given input using CASE-INSENSITIVE matching.
	* 
	* @input I am the string being parsed.
	* @inputPattern I am the RegEx pattern used to locate matches.
	*/
	public array function reMatchGroupsNoCase(
		required string input,
		required string inputPattern
		) {

		// Instead of re-creating all of the logic, we can just call the case-sensitive
		// function but prepend the CASE-INSENSITIVE FLAG to the Regular Expression
		// pattern. This will run a case-insensitive search using the case-sensitive
		// reFind() function.
		return( reMatchGroups( input, ( "(?i)" & inputPattern ) ) );

	}

</cfscript>

As you can see, in the Lucee CFML version, with every match we have to go back to the input value and extract the captured group using the .mid() function. This is unfortunate as it adds unnecessary processing to the algorithm (especially when compared to Adobe ColdFusion, which provides the captured groups quite cleanly).

ASIDE: I won't bother showing the output here since it's exactly the same as the Adobe ColdFusion 2018 output above.
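
The CAUTION in the code comments above is worth a tiny illustration: the pos values are offsets into the original input, not into the matched substring. Here's a minimal sketch of that idea, using a made-up input string and pattern (both are assumptions for this example only):

<cfscript>

	// Illustrative input only - the full match ("hello") starts at character 5.
	input = "say hello";
	result = reFind( "(h)(\w*)", input, 1, true );

	// The second captured group lives at index 3 of the 1-based arrays (index 1 is
	// the full match, index 2 is the first captured group).
	groupStart = result.pos[ 3 ];
	groupLength = result.len[ 3 ];

	// Since the position is relative to the ORIGINAL input, we extract the captured
	// group from the input itself, not from the matched substring.
	writeOutput( input.mid( groupStart, groupLength ) );

</cfscript>

Here, the second group starts at position 6 of the input and is 4 characters long, so the .mid() call should output "ello".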

UPDATE: Lucee CFML 5.3.8-RC (Release Candidate)

As per Zac Spitzer's comment below, I checked this code in the new release candidate for Lucee CFML 5.3.8 and it is Adobe ColdFusion compliant. Meaning, the .reFind() member method works as expected; and, the captured groups all contain the right portion of the matching Regular Expression!

Sweet sweet sweet!

Thanks to Adam Cameron, I can now file the "scope" argument for reFind() in the back of my mind for next time. It's still not quite as elegant as a native reMatchGroups() function - but it can be used to build one.

Epilogue on the Java Pattern / Matcher

Even with the updates to reFind(), there's still nothing quite as elegant as dipping down into the Java layer to create an reMatchGroups() function with java.util.regex.Pattern:

<cfscript>

	input = "Hello there, you magnificent bastard!";
	// Match any letter at the beginning of a word-boundary followed by N-number of
	// word-characters.
	regexPattern = "(\b[a-z])(\w*)";

	dump( reMatchGroupsNoCase( input, regexPattern ) );

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	/**
	* I return the captured-group matches of the given Regular Expression pattern within
	* the given input.
	* 
	* @input I am the string being parsed.
	* @inputPattern I am the RegEx pattern used to locate matches.
	*/
	public array function reMatchGroups(
		required string input,
		required string inputPattern
		) {

		var matcher = createObject( "java", "java.util.regex.Pattern" )
			.compile( inputPattern )
			.matcher( input )
		;
		var results = [];

		while ( matcher.find() ) {

			var groups = [:];

			for ( var i = 0 ; i <= matcher.groupCount() ; i++ ) {

				groups[ i ] = matcher.group( i );

			}

			results.append( groups );

		}

		return( results );

	}


	/**
	* I return the captured-group matches of the given Regular Expression pattern within
	* the given input using CASE-INSENSITIVE matching.
	* 
	* @input I am the string being parsed.
	* @inputPattern I am the RegEx pattern used to locate matches.
	*/
	public array function reMatchGroupsNoCase(
		required string input,
		required string inputPattern
		) {

		// Instead of re-creating all of the logic, we can just call the case-sensitive
		// function but prepend the CASE-INSENSITIVE FLAG to the Regular Expression
		// pattern. This will run a case-insensitive search using the case-sensitive
		// reMatchGroups() function.
		return( reMatchGroups( input, ( "(?i)" & inputPattern ) ) );

	}

</cfscript>

That's just so pretty!
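
As a small variation on the Java approach, and only as a sketch: instead of prepending "(?i)" to the pattern, the case-insensitive flag could be handed to Pattern.compile() directly. The variable names and the sample input below are assumptions for illustration; and, I'm using javaCast() to be explicit about which Java method signatures get invoked:

<cfscript>

	// Illustrative input only.
	input = "Hello There";

	PatternClass = createObject( "java", "java.util.regex.Pattern" );

	// Pattern.compile( String, int ) accepts a bit-mask of flags. CASE_INSENSITIVE
	// is a static field on the Pattern class.
	matcher = PatternClass
		.compile( "(\b[a-z])(\w*)", javaCast( "int", PatternClass.CASE_INSENSITIVE ) )
		.matcher( input )
	;
	found = [];

	while ( matcher.find() ) {

		// Collect the first and second captured groups for each match.
		found.append( matcher.group( javaCast( "int", 1 ) ) & "+" & matcher.group( javaCast( "int", 2 ) ) );

	}

	writeOutput( found.toList( ", " ) );

</cfscript>

With the flag in place, the [a-z] character class should match the uppercase "H" and "T" as well, giving "H+ello, T+here" for this input.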



Reader Comments

@Zac,

Ah, very nice! I just confirmed that this does, indeed, work in 5.3.8-RC - it seems that the .reFind() member method issue (too many arguments) and the captured-group content were both fixed! Outstanding work.

Re: Java RegEx engine, I love love love this feature. But, I don't think I can enable it on any brown-field project. Too many patterns in the wild. But, going forward, this will definitely be the setting that I use.
