The 17th Annual Regular Expression Day - June 1st 2024

By Ben Nadel

Published 2024-06-01 in ColdFusion

Good morning my beautiful, beautiful friends and happy Regular Expression Day 2024! Isn't it comforting to know that even with so much uncertainty in the world, you can always depend on the awesome power of pattern matching to help make life a little bit better and a little bit more fulfilling? And, even though there's ample opportunity to use Regular Expressions in everyday life, there's always something new and exciting to try. In celebration of this joyous day, I want to try something new (to me) and exciting in ColdFusion: using a callback operator in Adobe ColdFusion's reReplace() function.

Note: At the time of this writing, this feature is not supported in Lucee CFML. That is to say, while Lucee has a reReplace() method, it does not currently support a callback as the replacement mechanism.

Historically, the reReplace() function has accepted a string replacement. This replacement can contain static values as well as back-references; it can even contain some transformations like \u and \l, which upper-case and lower-case the next character, respectively. But, this replacement has always been performed inside a black box.
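As a quick illustration of the traditional string-based replacement, here is a minimal sketch. The sample input and the variable names (swapped, capitalized) are my own assumptions and are not part of the demo later in this post:

```cfml
<cfscript>

	// Static text plus back-references: swap the two captured words.
	swapped = reReplace( "hello world", "(\w+) (\w+)", "\2 \1" );

	// The \u escape upper-cases the next character of the replacement, which here is
	// the first character of the back-reference.
	capitalized = reReplace( "hello world", "\b(\w)", "\u\1", "all" );

	writeDump( swapped );
	writeDump( capitalized );

</cfscript>
```

In both calls, the replacement logic lives entirely inside the replacement string; there is no hook for our own code to run per match.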

With the addition of a callback-based replacement, Adobe ColdFusion now gives us access to each replacement operation. Which means, not only do we get more granular control over how each individual replacement is performed, we can also use the reReplace() function to inspect a string and aggregate information.

Meaning, we can use the reReplace() function like an iterator, not just a transformer. If the passed-in callback is a Fat-arrow function, it will retain a binding to its lexical scoping. Which means, as our callback operator is executing, it can pass information back up into the lexical scope where we can store information about each individual match.

To demonstrate, I'm going to use a regular expression to capture words within a sentence that start with an uppercase character or a number and possibly end with a punctuation mark:

(\b[A-Z0-9])(\w*)([[:punct:]])?

As with all regular expression patterns, this one is easier to write than it is to read. So, let's reformat it using the verbose flag (?x). The verbose flag allows us to add comments and whitespace within the pattern without distorting the meaning of the pattern:

(?x) # The verbose flag.

# Looking for the start of a word (first capture group).

( \b [A-Z0-9] )

# Followed by any number of word characters (second capture group).

( \w* )

# And possibly ending in a punctuation mark (third capture group).

( [[:punct:]] )?
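To check that the verbose form really is equivalent, a small sketch like the following can compare the two patterns against the same input. The sample input and variable names are assumptions of mine. Note that inside a CFML string literal, the pound sign has to be escaped by doubling it, so the pattern comments start with two pound signs:

```cfml
<cfscript>

	input = "Happy Day!";

	compactPattern = "(\b[A-Z0-9])(\w*)([[:punct:]])?";

	// The doubled pound sign is the CFML string escape for a single pound sign.
	verbosePattern = "(?x)
		## Start of a word.
		( \b [A-Z0-9] )
		## Rest of the word.
		( \w* )
		## Optional trailing punctuation.
		( [[:punct:]] )?
	";

	// If the two patterns are equivalent, both dumps should show the same matches.
	writeDump( reMatch( compactPattern, input ) );
	writeDump( reMatch( verbosePattern, input ) );

</cfscript>
```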

This regular expression has three capture groups in addition to the implied match of the entire pattern. In other languages, this implied match is referred to as the 0 group. However, since ColdFusion uses 1-based arrays—and reports capture groups using an array—it reports its first capture group as index 2, not 1.
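To make the indexing concrete, here is a minimal sketch using reFind() with sub-expressions enabled. The sample input is an assumption of mine; the match array is the same one referenced later in this post:

```cfml
<cfscript>

	// Passing true as the fourth argument asks reFind() to return the sub-expressions.
	result = reFind( "(\b[A-Z0-9])(\w*)([[:punct:]])?", "Happy Day!", 1, true );

	// Index 1 holds the entire match; the capture groups start at index 2.
	writeDump( result.match[ 1 ] ); // Entire match.
	writeDump( result.match[ 2 ] ); // First capture group.
	writeDump( result.match[ 3 ] ); // Second capture group.

</cfscript>
```

So the first capture group lands at index 2 rather than index 1.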

This is unfortunate. But, with our callback operator, we can fix it. Instead of recording the capture groups as an array, our callback can record the capture groups as a struct. This allows us to use the 0 key to identify the implied match, followed by 1, 2, 3, etc. for each subsequent capture group.

In addition to the capture groups, I'm also going to record the transform argument that is passed to the callback operator:

<cfscript>

	message = "Have A Wonderful Regular Expression Day 2024!";

	// We're going to use the reReplace() method as a means of locating and aggregating
	// the matches within the above string. Each match will be stored in this array as a
	// collection of capture groups.
	matches = [];

	// Note: I'm using the verbose flag (?x) to allow the Regular Expression to contain
	// non-meaningful white-space and comments.
	transformedMessage = message.reReplace(
		"(?x)
			## Looking for the start of a word (first capture group).
			( \b [A-Z0-9] )

			## Followed by any number of word characters (second capture group).
			( \w* )

			## And possibly ending in a punctuation mark (third capture group).
			( [[:punct:]] )?
		",
		( transform, position, original, count ) => {

			var match = [
				// Store the original transform for reference.
				transform: transform
			];

			// The transform reports the capture groups as an ARRAY, which means that the
			// groups start at 1, not 0. This is problematic because in other language
			// contexts, the "0" group is the full match; and then, each subsequent
			// capture group is identified as "1", "2", "3", etc. As such, we need to map
			// the array-based notation onto a struct-key notation in order to normalize
			// the groups in our match.
			transform.match.each(
				( value, i ) => {

					// Mapping N -> ( N-1 ).
					match[ i - 1 ] = isNull( value )
						? ""
						: value
					;

				}
			);

			matches.append( match );

			// Wrap the full match (index 1 of the match array) in brackets so we can
			// more easily see the matches.
			return ( "[" & transform.match[ 1 ] & "]" );

		},
		"all"
	);

	writeDump( transformedMessage );
	writeDump( matches );

</cfscript>

This Adobe ColdFusion code is doing two things:

1. It's transforming the message and returning a new string.
2. It's capturing each match into the matches array.

When we run this CFML code, we end up with the transformed message:

[Have] [A] [Wonderful] [Regular] [Expression] [Day] [2024!]

Notice that each match has been replaced with the full matching string wrapped in brackets. Of course, there's nothing that says we have to reference the return value of our reReplace() invocation. If we're only using the function as an iterator, we can completely ignore the return value.
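As a rough sketch of that iterator-only usage, the following discards the return value and only collects the full matches. It assumes the same callback signature used in the demo above; the words array and the simplified pattern are my own additions:

```cfml
<cfscript>

	message = "Have A Wonderful Regular Expression Day 2024!";
	words = [];

	// The return value of reReplace() is intentionally discarded.
	message.reReplace(
		"\b[A-Z0-9]\w*",
		( transform, position, original, count ) => {

			// Index 1 of the match array is the full match.
			words.append( transform.match[ 1 ] );

			// Return the match unchanged since the result is never read.
			return transform.match[ 1 ];

		},
		"all"
	);

	writeDump( words );

</cfscript>
```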

And, in our case, we're mostly using it as an iterator that aggregates all of the matches contained within the target string. In the following writeDump() output, I've highlighted our capture groups (mapped from 1-based to 0-based) in yellow:

The captured groups in the reReplace() pseudo-iteration in Adobe ColdFusion 2023.

As you can see, the reReplace() function gave us access to the individual matches within the target string; which, in turn, allowed us to aggregate the values with great precision.

This isn't the only way to do this in ColdFusion. For example, we can use the reFind() function to iterate over matches. In fact, the transform argument passed to our reReplace() callback is the same structure that reFind() returns when it is asked to return sub-expressions.
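For comparison, here is a hedged sketch of the reFind() approach. It assumes the five-argument form of reFind() with a scope of "all", which, as far as I know, is available in recent Adobe ColdFusion releases and returns an array of match structures. The variable names are my own:

```cfml
<cfscript>

	message = "Have A Wonderful Regular Expression Day 2024!";

	// Arguments: pattern, input, start position, return sub-expressions, scope.
	results = reFind( "(\b[A-Z0-9])(\w*)([[:punct:]])?", message, 1, true, "all" );

	words = [];

	for ( result in results ) {

		// As with the callback, index 1 of the match array is the full match.
		words.append( result.match[ 1 ] );

	}

	writeDump( words );

</cfscript>
```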

We can even dip down into the Java layer and use the Pattern and Matcher classes to iterate over our strings. In fact, I have a ColdFusion component, jRegEx.cfc, that exposes this pattern matching magic using simple method calls.
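A minimal sketch of the Java approach might look like the following. This is not how jRegEx.cfc is implemented; it only uses the standard java.util.regex classes directly. One assumption to flag: Java's regular expression engine spells the punctuation class as \p{Punct} rather than [[:punct:]]:

```cfml
<cfscript>

	message = "Have A Wonderful Regular Expression Day 2024!";

	pattern = createObject( "java", "java.util.regex.Pattern" )
		.compile( javaCast( "string", "(\b[A-Z0-9])(\w*)(\p{Punct})?" ) )
	;
	matcher = pattern.matcher( javaCast( "string", message ) );

	matches = [];

	while ( matcher.find() ) {

		// Java numbers its groups from 0, with group 0 being the entire match. The
		// javaCast() calls make sure the int-based group() overload is selected.
		matches.append({
			"0": matcher.group( javaCast( "int", 0 ) ),
			"1": matcher.group( javaCast( "int", 1 ) ),
			"2": matcher.group( javaCast( "int", 2 ) )
		});

	}

	writeDump( matches );

</cfscript>
```

The third (optional) group is left out of the struct here because Java returns null when an optional group does not participate in a match, which would need extra handling.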

With that, I wish you all a Happy Regular Expression Day! And, if all this pattern matching has gotten you wanting to know more, please check out my video presentation: Regular Expressions, Extraordinary Power.

Short link: https://bennadel.com/4663

Reader Comments

Chris G Jun 1, 2024 at 1:48 PM

[Good] [Stuff] [Ben!] 👏
