REMatchGroup() UDF To Return Only Specified Group In RegEx Pattern

By Ben Nadel

Published 2008-01-29 in ColdFusion — Comments (12)

Earlier today, Matthew Abbott was asking me about REMatch(). He had a pattern that was throwing the following error:

Illegal pattern: Look-behind group does not have an obvious maximum length near index

The problem here is that the look behind portion of the regular expression pattern had a wild card (*). This meant that that part of the pattern could be zero to N characters where N could be any number of characters. Something like this:

(?<=\w+)ben

This means that we want to match the string literal, "ben", but only if it is preceeded by one or more word characters (this is not what Matt had, this is just a simple example). While this kind of thing would be nice to have, it really doesn't make sense from an implimentation standpoint; the regular expression engine would be extremely bogged down if it had to match against and infinitely large look-behind.

One of the reasons that the look behind and the look ahead is so powerful is beacuse it allows you to match patterns without capturing certain data. But, it has the limitation mentioned above. However, when you go to match a regular pattern (capturing pattern), you don't have to worry about any flexible size limit. So, this got me thinking - how can we get around this limitation and still get what we need.

Well, what are we really doing; Let's look at this pattern:

(?<=A)B(?=C)

What this is saying is that we want to capture "B", but only if it is proceeded by "A" and succeeded by "C". But what's another way to look at this? I think, from a practical standpoint, this is the same as capturing "A(B)C" and only returning "B". This accomplishes the same thing and would circumvent the look ahead/behind limitations.

This is how I got the idea for the ColdFusion user defined function, REMatchGroup( Regex, Text [, Group] ). This does exactly what ColdFusion 8's REMatch() does; however, it takes an optional third argument in which you can specify the captured group that you want to return. This will allow you to return only a sub-expression while keeping the contextual integrity of the larger regular expression. Here is the ColdFusion user defined function:

<cffunction
	name="REMatchGroup"
	access="public"
	returntype="array"
	output="false"
	hint="Gets the given group of a captured expression.">

	<!--- Define arguments. --->
	<cfargument
		name="Pattern"
		type="string"
		required="true"
		hint="The regular expression we want to match on."
		/>

	<cfargument
		name="Text"
		type="string"
		required="true"
		hint="The target text against which we will run the patterns."
		/>

	<cfargument
		name="Group"
		type="numeric"
		required="false"
		default="0"
		hint="The captured group we want to return."
		/>


	<!--- Define the local scope. --->
	<cfset var LOCAL = {} />

	<!---
		Create the results array. Here is where we will store
		our captured groups.
	--->
	<cfset LOCAL.Results = [] />

	<!---
		Compile our regular expression pattern. By using the
		underlying Java pattern class, we will have both faster
		pattern matching and more access to the matched groups.
	--->
	<cfset LOCAL.Pattern = CreateObject(
		"java",
		"java.util.regex.Pattern"
		).Compile(
			JavaCast( "string", ARGUMENTS.Pattern )
			)
		/>

	<!--- Get our pattern matcher for our target text. --->
	<cfset LOCAL.Matcher = LOCAL.Pattern.Matcher(
		JavaCast( "string", ARGUMENTS.Text )
		) />


	<!---
		Keep looping over the pattern matches until we hit the
		end of the target string.
	--->
	<cfloop condition="LOCAL.Matcher.Find()">

		<!---
			Append the target group to the results array. If the
			Group was not defined (defaulting to zero), then the
			entire pattern will be matched.
		--->
		<cfset ArrayAppend(
			LOCAL.Results,
			LOCAL.Matcher.Group(
				JavaCast( "int", ARGUMENTS.Group )
				)
			) />

	</cfloop>


	<!--- Return the results array. --->
	<cfreturn LOCAL.Results />
</cffunction>

To see this in action, we are going to set up a text buffer that contains HTML links. Then, we are going to capture each Anchor tag but, only return the actual URL contained within it. Don't get hung up on the regular expression in the example - this is just to demonstrate the functionality:

<!--- Save our target text. --->
<cfsavecontent variable="strText">

	Hey Deborah, you should totally check out the movie,
	<a
	href="http://www.imdb.com/title/tt0096320/"
	title="Twins staring Arnold Schwarzenegger."
	target="_blank"
	>Twins</a>, starring Arnold Schwarzenegger. Obviously,
	some of his classic work. However, if you really want to
	appreciate Arnold for his true greatness, you should
	probably check out <a
	href="http://www.imdb.com/title/tt0076578/"
	title="Pumping Iron staring Arnold Schwarzenegger."
	target="_blank"
	>Pumping Iron</a>. This movie is just crazy awesome!

</cfsavecontent>


<!---
	Extract all the link URLs from the text. We are putting
	the acutal link URL in the only capturing group, which
	will be group 1. This is important since out UDF will
	return only the given group.
--->
<cfset arrURLs = REMatchGroup(
	"(?i)<a[\s\S]+?href=""([^""]+)""[^>]*>",
	strText,
	1
	) />

<!--- Dump out the URLs. --->
<cfdump
	var="#arrURLs#"
	label="REMatchGroup() - Link URLs"
	/>

Running that, we get the following ColdFusion CFDump output:

REMatchGroup() Output Captures Specified Pattern Group

As you can see, even though we were capturing the entire Anchor tag pattern, only the first captured group (as denoted by our optional argument, 1) was returned in the results array. I am not sure how useful this is, but it might very well get you out of a pinch when you are struggling with some complicated look aheads and look behinds. And, since the underlying technology is the Java regular expression Pattern object, you can also rock some sweet-ass negative look aheads / behinds.

Want to use code from this post? Check out the license.

Short link: https://bennadel.com/1132

Reader Comments

Steven Levithan Jan 29, 2008 at 11:32 PM

172 Comments

There are four approaches to lookbehind:

- No lookbehind. The state of affairs in e.g. JavaScript, ColdFusion, and Ruby.
- Fixed length. This is what you get in Perl and Python. With fixed-length lookbehind, (?<=a|ab) is invalid, as is the use of any kind of quantifier within the lookbehind.
- Finite length. How things work with Java and PCRE. With finite-length lookbehind, only the *, +, and {n,} quantifiers are unsupported. Everything else works fine.
- Infinite length. True with the .NET and JGsoft regex engines. However, taking advantage of infinite-length lookbehind can have significant performance implications when working with long text or poorly written regexes.

Note that the java.util.regex package which you're using supports anything but infinite-length quantifiers, so something like {0,9999} should work fine. Some regex engines cap the upper bound of interval quantifiers (usually at around 65,000), but I'm not sure if that's true with Java.

Also, it's worth noting that the (?<=A)B(?=C) construct would not be equivalent to A(B)C with only the B returned in all cases, due to the impact of backtracking, the regex bump-along mechanism, and so on. However, your UDF could certainly mimic lookbehind nicely in a fair number of cases, and could offer convenience in some other scenarios as well.

Last thing... Because I'm pedantic, I'll note that in regex lingo * is called a Kleene star (named after mathematician Stephen Kleene, who invented regular expressions) rather than a wildcard. At least to me, "wildcard" implies different functionality than what a regex * produces.

Ben Nadel Jan 30, 2008 at 7:41 AM

16,256 Comments

@Steve,

I think in Java, the look behind limit for finite lengths is something like 1000. I can't be sure, but I feel like that is a number I was comfortable with when testing something.

Clearly, for any regex engine to allow infinite lookbehinds, it's clearly much smarter than I am. I can't even conceptualize that. Although, I suppose that if it did it something like the way my UDF does it, then it's easier to understand. I always imagined these things as just-in-time checking, but that's probably not even close to accurate.

While this may not entirely mimic the lookahead / lookbehind functionality, I think that it would make some expressions easier to read, if nothing else.

Good note about the Kleene star; I agree that the term wild-card does conjure up different thoughts.

Elliott Sprehn Jan 30, 2008 at 1:30 PM

132 Comments

There's actually no need for this UDF. When you only need the result of a single capture group you should be using non-capture groups for everything else which is more efficient for the regexp engine.

(?:A)B(?:C)

There's really no reason to use REMatchGroup() and match more than one group since you're just throwing away the result anyway.

Ben Nadel Jan 30, 2008 at 1:35 PM

16,256 Comments

@Elliott,

Unless I am way off base here, a non-capturing group is still returned in the pattern match, you just can't refer to it as a group (it has not back reference). However, it is still returned.

Elliott Sprehn Jan 30, 2008 at 3:40 PM

132 Comments

@Ben

Non-capturing groups don't show up at all in the result at all.

Using ruby:

"foo bar" =~ /(?:foo) (bar)/

$0 = "foo bar"
$1 = "bar"

There's no indication in the result of what matched in a non-capturing group at all. All it does is group it's contents. Because of this it's also not available in as a back reference in the pattern either.

"foo foo bar" =~ /(?:foo) \1 (bar)/ => Does not match.
"foo foo bar" =~ /(foo) \1 (bar)/ => Does match.

Steven Levithan Jan 30, 2008 at 4:05 PM

172 Comments

@Elliott, Ben was talking about the entire match (backreference 0), where the values certainly will show up. Backreference values are not exactly convenient to get at with ColdFusion. You have to mid() them out the original string using the result of reFind() with the returnSubExpressions argument, or use underlying Java methods. And even if you did have more convenient access to them, being able to easily generate an array of only the results of a user-specified capturing group could still be convenient at times.

Steven Levithan Jan 30, 2008 at 4:14 PM

172 Comments

And by the way, I would not recommend using stuff like /(?:foo) \1 (bar)/ even for illustrative purposes unless you know exactly what you're doing, because it's bound to cause confusion. The meaning varies between regex flavors and even specific implementations and versions of the same standards. For example, is it an error, octal character index 1, or a forward reference to a capturing group which hasn't participated yet? And if it's the latter, will it fail to match, or will it match the empty string and therefore always match successfully? The handling for something like /(\2one|(two))+/ gets even more hairy and flavor/implementation-specific.

Elliott Sprehn Jan 30, 2008 at 4:20 PM

132 Comments

@Steven

That's a good point. Yeah, I this function is useful like that.

I guess I should have said that if you planned to only return a single capture group, and all you're not using the back references in the result, it's better to just use non-capture groups than to capture more than you need and throw them out with this function.

Elliott Sprehn Jan 30, 2008 at 4:23 PM

132 Comments

As far as I know, that example means the same thing in all PCRE based engines, which includes CF and Java, and as I mentioned in my comment.. ruby.

Steven Levithan Jan 30, 2008 at 4:33 PM

172 Comments

What is a PCRE-based engine? The only engine I can think of which that accurately describes is the one included in JavaScriptCore, the JavaScript engine for WebKit implementations such as the Safari browser. If you're using "PCRE-based" to describe popular Perl-derivative regex engines, I can tell you for a fact that it doesn't mean the same thing everywhere (for example, the meaning is entirely opposite in ECMAScript). BTW, another obvious potential meaning for "\1" when it appears before the first capturing group is closed is a literal 1.

Elliott Sprehn Jan 30, 2008 at 4:40 PM

132 Comments

Actually, WebKit uses the actual pcre library, not a derivative of it. And as for syntax, seeing as this is a post about ColdFusion and my examples were fine for that (spare the syntax from ruby, which was clearly marked)...

Anyway, since you seem to intent on arguing this, what format would you suggest to use for examples?

Steven Levithan Jan 30, 2008 at 5:32 PM

172 Comments

It is derivative. A prime example is the very issue we're discussing. The latest versions of WebKit implement the correct ECMAScript handling for backreferences to non-participating groups (they always match), which is different from PCRE's native handling (they never match).

I'm not trying to suggest alternative example code. I'm arguing against describing non-portable syntax and behavior in black-and-white terms, partly because the differences here can be subtle and therefore introduce latent bugs, but mostly because I enjoy discussing regular expressions.

Oh my chickens, this post is old!

Hit me up on LinkedIn if you want to discuss it further.