Earlier today, Matthew Abbott was asking me about REMatch(). He had a pattern that was throwing the following error:
Illegal pattern: Look-behind group does not have an obvious maximum length near index
The problem here is that the look behind portion of the regular expression pattern had a wild card (*). This meant that that part of the pattern could be zero to N characters where N could be any number of characters. Something like this:
(?<=\w+)ben
This means that we want to match the string literal, "ben", but only if it is preceeded by one or more word characters (this is not what Matt had, this is just a simple example). While this kind of thing would be nice to have, it really doesn't make sense from an implimentation standpoint; the regular expression engine would be extremely bogged down if it had to match against and infinitely large look-behind.
One of the reasons that the look behind and the look ahead is so powerful is beacuse it allows you to match patterns without capturing certain data. But, it has the limitation mentioned above. However, when you go to match a regular pattern (capturing pattern), you don't have to worry about any flexible size limit. So, this got me thinking - how can we get around this limitation and still get what we need.
Well, what are we really doing; Let's look at this pattern:
(?<=A)B(?=C)
What this is saying is that we want to capture "B", but only if it is proceeded by "A" and succeeded by "C". But what's another way to look at this? I think, from a practical standpoint, this is the same as capturing "A(B)C" and only returning "B". This accomplishes the same thing and would circumvent the look ahead/behind limitations.
This is how I got the idea for the ColdFusion user defined function, REMatchGroup( Regex, Text [, Group] ). This does exactly what ColdFusion 8's REMatch() does; however, it takes an optional third argument in which you can specify the captured group that you want to return. This will allow you to return only a sub-expression while keeping the contextual integrity of the larger regular expression. Here is the ColdFusion user defined function:
Launch code in new window » Download code as text file »
To see this in action, we are going to set up a text buffer that contains HTML links. Then, we are going to capture each Anchor tag but, only return the actual URL contained within it. Don't get hung up on the regular expression in the example - this is just to demonstrate the functionality:
Launch code in new window » Download code as text file »
Running that, we get the following ColdFusion CFDump output:
| | | | ||
| | ![]() | | ||
| | | |
As you can see, even though we were capturing the entire Anchor tag pattern, only the first captured group (as denoted by our optional argument, 1) was returned in the results array. I am not sure how useful this is, but it might very well get you out of a pinch when you are struggling with some complicated look aheads and look behinds. And, since the underlying technology is the Java regular expression Pattern object, you can also rock some sweet-ass negative look aheads / behinds.
Download Code Snippet ZIP File
Comments (12) | Post Comment | Ask Ben | Permalink | Other Searches | Print Page
Ask Ben: Simple Data Caching In ColdFusion
Ask Ben: Ending ColdFusion Session When User Closes Browser
There are four approaches to lookbehind:
- No lookbehind. The state of affairs in e.g. JavaScript, ColdFusion, and Ruby.
- Fixed length. This is what you get in Perl and Python. With fixed-length lookbehind, (?<=a|ab) is invalid, as is the use of any kind of quantifier within the lookbehind.
- Finite length. How things work with Java and PCRE. With finite-length lookbehind, only the *, +, and {n,} quantifiers are unsupported. Everything else works fine.
- Infinite length. True with the .NET and JGsoft regex engines. However, taking advantage of infinite-length lookbehind can have significant performance implications when working with long text or poorly written regexes.
Note that the java.util.regex package which you're using supports anything but infinite-length quantifiers, so something like {0,9999} should work fine. Some regex engines cap the upper bound of interval quantifiers (usually at around 65,000), but I'm not sure if that's true with Java.
Also, it's worth noting that the (?<=A)B(?=C) construct would not be equivalent to A(B)C with only the B returned in all cases, due to the impact of backtracking, the regex bump-along mechanism, and so on. However, your UDF could certainly mimic lookbehind nicely in a fair number of cases, and could offer convenience in some other scenarios as well.
Last thing... Because I'm pedantic, I'll note that in regex lingo * is called a Kleene star (named after mathematician Stephen Kleene, who invented regular expressions) rather than a wildcard. At least to me, "wildcard" implies different functionality than what a regex * produces.
Posted by Steven Levithan on Jan 29, 2008 at 11:32 PM
@Steve,
I think in Java, the look behind limit for finite lengths is something like 1000. I can't be sure, but I feel like that is a number I was comfortable with when testing something.
Clearly, for any regex engine to allow infinite lookbehinds, it's clearly much smarter than I am. I can't even conceptualize that. Although, I suppose that if it did it something like the way my UDF does it, then it's easier to understand. I always imagined these things as just-in-time checking, but that's probably not even close to accurate.
While this may not entirely mimic the lookahead / lookbehind functionality, I think that it would make some expressions easier to read, if nothing else.
Good note about the Kleene star; I agree that the term wild-card does conjure up different thoughts.
Posted by Ben Nadel on Jan 30, 2008 at 7:41 AM
There's actually no need for this UDF. When you only need the result of a single capture group you should be using non-capture groups for everything else which is more efficient for the regexp engine.
(?:A)B(?:C)
There's really no reason to use REMatchGroup() and match more than one group since you're just throwing away the result anyway.
Posted by Elliott Sprehn on Jan 30, 2008 at 1:30 PM
@Elliott,
Unless I am way off base here, a non-capturing group is still returned in the pattern match, you just can't refer to it as a group (it has not back reference). However, it is still returned.
Posted by Ben Nadel on Jan 30, 2008 at 1:35 PM
@Ben
Non-capturing groups don't show up at all in the result at all.
Using ruby:
"foo bar" =~ /(?:foo) (bar)/
$0 = "foo bar"
$1 = "bar"
There's no indication in the result of what matched in a non-capturing group at all. All it does is group it's contents. Because of this it's also not available in as a back reference in the pattern either.
"foo foo bar" =~ /(?:foo) \1 (bar)/ => Does not match.
"foo foo bar" =~ /(foo) \1 (bar)/ => Does match.
Posted by Elliott Sprehn on Jan 30, 2008 at 3:40 PM
@Elliott, Ben was talking about the entire match (backreference 0), where the values certainly will show up. Backreference values are not exactly convenient to get at with ColdFusion. You have to mid() them out the original string using the result of reFind() with the returnSubExpressions argument, or use underlying Java methods. And even if you did have more convenient access to them, being able to easily generate an array of only the results of a user-specified capturing group could still be convenient at times.
Posted by Steven Levithan on Jan 30, 2008 at 4:05 PM
And by the way, I would not recommend using stuff like /(?:foo) \1 (bar)/ even for illustrative purposes unless you know exactly what you're doing, because it's bound to cause confusion. The meaning varies between regex flavors and even specific implementations and versions of the same standards. For example, is it an error, octal character index 1, or a forward reference to a capturing group which hasn't participated yet? And if it's the latter, will it fail to match, or will it match the empty string and therefore always match successfully? The handling for something like /(\2one|(two))+/ gets even more hairy and flavor/implementation-specific.
Posted by Steven Levithan on Jan 30, 2008 at 4:14 PM
@Steven
That's a good point. Yeah, I this function is useful like that.
I guess I should have said that if you planned to only return a single capture group, and all you're not using the back references in the result, it's better to just use non-capture groups than to capture more than you need and throw them out with this function.
Posted by Elliott Sprehn on Jan 30, 2008 at 4:20 PM
As far as I know, that example means the same thing in all PCRE based engines, which includes CF and Java, and as I mentioned in my comment.. ruby.
Posted by Elliott Sprehn on Jan 30, 2008 at 4:23 PM
What is a PCRE-based engine? The only engine I can think of which that accurately describes is the one included in JavaScriptCore, the JavaScript engine for WebKit implementations such as the Safari browser. If you're using "PCRE-based" to describe popular Perl-derivative regex engines, I can tell you for a fact that it doesn't mean the same thing everywhere (for example, the meaning is entirely opposite in ECMAScript). BTW, another obvious potential meaning for "\1" when it appears before the first capturing group is closed is a literal 1.
Posted by Steven Levithan on Jan 30, 2008 at 4:33 PM
Actually, WebKit uses the actual pcre library, not a derivative of it. And as for syntax, seeing as this is a post about ColdFusion and my examples were fine for that (spare the syntax from ruby, which was clearly marked)...
Anyway, since you seem to intent on arguing this, what format would you suggest to use for examples?
Posted by Elliott Sprehn on Jan 30, 2008 at 4:40 PM
It is derivative. A prime example is the very issue we're discussing. The latest versions of WebKit implement the correct ECMAScript handling for backreferences to non-participating groups (they always match), which is different from PCRE's native handling (they never match).
I'm not trying to suggest alternative example code. I'm arguing against describing non-portable syntax and behavior in black-and-white terms, partly because the differences here can be subtle and therefore introduce latent bugs, but mostly because I enjoy discussing regular expressions.
Posted by Steven Levithan on Jan 30, 2008 at 5:32 PM