Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.
I am the chief technical officer at InVision App, Inc - a prototyping and collaboration platform for designers, built by designers. I also rock out in JavaScript and ColdFusion 24x7.

Javascript Exec() Method For Regular Expression Matching

By Ben Nadel on

Earlier today, Steve of Flagrant Badassery introduced me to the Exec() method of the Javascript Regular Expression object (RegExp::exec()). I had never heard of this before and I love working with regular expressions, so naturally I had to dive right in and do a little experimentation. From my initial play time, it looks like the exec() method allows me to use the regular expression object somewhat like the Java Pattern Matcher that I love using so much in ColdFusion.

First, I had to look up the details of the method and how it works. The Mozilla Developer Center has some very clean, straightforward documentation. Once I had that, I set up this test page.

This test page takes some song lyrics and finds the pattern defined as words that modify other words (I had to think of something to test with). Once we have our target text and our pattern, we do a conditional loop until the RegExp::exec() method no longer returns a valid match array. The array that gets returned acts just like the Matcher::Group() method in Java, so this felt very natural:

  • <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  • <html>
  • <head>
  • <title>Javascript Regular Expression Exec()</title>
  •  
  • <script type="text/javascript">
  •  
  • // Define the text that we are going to search.
  • // This text was taken from Sir Mix-a-Lot's hit
  • // single, "Baby Got Back".
  • var strText = "\
  • I like big butts and I can not lie \
  • You other brothers can't deny \
  • That when a girl walks in with an itty bitty waist \
  • And a round thing in your face \
  • You get sprung \
  • Wanna pull up tough \
  • Cuz you notice that butt was stuffed \
  • Deep in the jeans she's wearing \
  • I'm hooked and I can't stop staring \
  • Oh, baby I wanna get with ya \
  • And take your picture \
  • ";
  •  
  •  
  • // Get a pattern that we want to search on. This
  • // defines certain modifiers and the words that
  • // they modify.
  • var rePattern = new RegExp(
  • "(big|round|bitty)(?:\\s+)([^\\s]+)",
  • "gi"
  • );
  •  
  • </script>
  • </head>
  • <body>
  •  
  • <script type="text/javascript">
  •  
  • // Define our match array. This will be populated for
  • // each iteration of the Exec() method.
  • var arrMatch = null;
  •  
  •  
  • // Keep looping over the target text while we can
  • // find matches. If no matches can be found,
  • // arrMatch is null and will end the while loop.
  • while (arrMatch = rePattern.exec( strText )){
  •  
  • document.write(
  • arrMatch[ 1 ].toUpperCase() +
  • " modifies " +
  • arrMatch[ 2 ].toUpperCase() +
  • "<br />"
  • );
  •  
  •  
  • // Explore the modified properties of both
  • // the returned array as well as the regular
  • // expression object. The returned array,
  • // unlike traditional Javascript arrays, is
  • // given pattern-matching related properties.
  •  
  •  
  • document.write(
  • "........ " +
  • "Phrase: " +
  • arrMatch[ 0 ] +
  • "<br />"
  • );
  •  
  • document.write(
  • "........ " +
  • "Start Index: " +
  • arrMatch.index +
  • "<br />"
  • );
  •  
  • document.write (
  • "........ " +
  • "End Index: " +
  • rePattern.lastIndex +
  • "<br />"
  • );
  •  
  • document.write (
  • "........ " +
  • "Index Substring: " +
  • arrMatch.input.substring(
  • arrMatch.index,
  • rePattern.lastIndex
  • ) +
  • "<br /><br />"
  • );
  • }
  •  
  • </script>
  •  
  • </body>
  • </html>

Running the above page, we get the following output:

BIG modifies BUTTS
........ Phrase: big butts
........ Start Index: 10
........ End Index: 19
........ Index Substring: big butts

BITTY modifies WAIST
........ Phrase: bitty waist
........ Start Index: 113
........ End Index: 124
........ Index Substring: bitty waist

ROUND modifies THING
........ Phrase: round thing
........ Start Index: 134
........ End Index: 145
........ Index Substring: round thing

What really surprised me was that the returned match array had additional properties that were related to the pattern matching itself. I am not used to seeing this as usually, when it comes to arrays, I am just accessing indexes and checking lengths. It does concern me a little bit that the entire target text is copied into a property of the array (Array.input). Since strings are copied by value, this means we have lots of copies of this text running around now. Of course, come on, we are talking about Javascript :) I don't think variable size/count or performance is really much of an issue.
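To make those extra properties concrete, here is a minimal sketch. The sample string and the variable names (sampleText, samplePattern, sampleMatch) are made up for illustration and are not part of the test page above.

```javascript
// Hypothetical sample text and pattern, used only for illustration.
var sampleText = "itty bitty waist";
var samplePattern = /(bitty)\s+(\S+)/g;

var sampleMatch = samplePattern.exec(sampleText);

// The result is a genuine array: the full match plus one slot per group.
console.log(Array.isArray(sampleMatch)); // true
console.log(sampleMatch.length);         // 3

// On top of that, it carries properties that describe the match.
console.log(sampleMatch.index);                // 5 (where the match starts)
console.log(sampleMatch.input === sampleText); // true (the searched string)

// The end position lives on the regular expression, not on the array.
console.log(samplePattern.lastIndex);          // 16
```

The start index comes from the array and the end index from the pattern object, which is why the test page reads arrMatch.index and rePattern.lastIndex side by side.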

Anyway, this is very cool stuff. I wish I had known about this earlier. This feels like a very elegant solution, even more so than harnessing the power of the String::replace() method to accomplish the same ends.

Thanks Steve!




Reader Comments

Hey, what the heck... my hash didn't link properly ! I will have to look into that.

Ok, so by clean and straightforward, I guess maybe I meant the actual formatting of the page. It just looked nice. But yeah, the last example has no place or purpose being there.

"What really surprised me was that the returned match array had additional properties that were related to the pattern matching itself."

If you think about it, it's not really any weirder than other types (e.g., functions) having properties or methods. At its core, pretty much every type in JavaScript is really an object.

For some fun with JavaScript types, run these in Firebug for possibly a few surprises, depending on one's understanding of JavaScript types and constructors:

console.log(typeof null); // object
console.log(typeof [1,2,3]); // object
console.log(typeof new Boolean()); // object
console.log(new String() instanceof Object); // true

console.log(typeof NaN); // number
console.log(NaN == NaN); // false
console.log(typeof /regex/); // object in IE; function in Firefox

console.log(new Boolean(false)); // false (probably expected)
console.log(new Boolean(false) == false); // true (probably expected)
console.log(false === false); // true (probably expected)
console.log(new Boolean(false) === false); // false (probably unexpected)

I don't mean to spam your comments, but here is a simple example of exec() in action that I used recently...

I needed to create an array containing the indices of each tab character within a string, so I used something like the following:

------------------------
var tabIndices = [];
var tabMatchInfo;
var tabRegex = /\t/g;

while (tabMatchInfo = tabRegex.exec(string)) {
tabIndices.push(tabMatchInfo.index);
}
------------------------

Nice and easy. To do this without exec, it would require something like the following:

------------------------
var tabIndices = [];
var thisTabIndex;

for (var i = 0; i < string.length; i++) {
thisTabIndex = string.indexOf(String.fromCharCode(9), i);
if (thisTabIndex === -1) {
break;
} else {
tabIndices.push(thisTabIndex);
i = thisTabIndex;
}
}
------------------------

Of course, there are cases where it would be much more difficult than this (or maybe even impossible) to reproduce functionality achieved using exec() with other methods.

One word of caution... although "while (arr = /\t/g.exec(string)) {}" works fine in Firefox, it would create an infinite loop in IE (and possibly other browsers) if there is even one tab character in the string, since IE recompiles the regex for each iteration of the loop, effectively resetting lastIndex and never advancing.

In case that last paragraph wasn't clear, the problem results from creating the regex within the loop. The fix is to simply create the regex before entering the loop.
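A small sketch of why that matters: lastIndex is stored on each regex object, so only an object that survives between iterations can advance. The variable names below (sampleString, hoisted, fresh) are hypothetical.

```javascript
// Hypothetical sample string: tabs sit at positions 1 and 3.
var sampleString = "a\tb\tc";

// A regex created once keeps its own lastIndex between calls to exec().
var hoisted = /\t/g;
hoisted.exec(sampleString);
var afterFirst = hoisted.lastIndex;  // 2, just past the first tab
hoisted.exec(sampleString);
var afterSecond = hoisted.lastIndex; // 4, just past the second tab

// A freshly constructed regex always starts over at position 0, which is
// what happens on every iteration when the regex is created inside the loop.
var fresh = new RegExp("\\t", "g");
var freshIndex = fresh.lastIndex;    // 0

console.log(afterFirst, afterSecond, freshIndex);
```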

NEVER use the code:

while (match = /\t/g.exec(string)) {
...// use match[] array...
}

which can effectively recompile the regexp at each iteration (IE treats the /\t/g syntax as the equivalent of instantiating a new RegExp with the string "\\t" and the flag "g", as if you had used: new RegExp("\\t", "g").exec(string)).

Instead, instantiate the RegExp only once (with the "g" flag, otherwise lastIndex never advances and the loop never ends), using the code:

var re = /\t/g;
while (match = re.exec(string)) {
...// use match[] array...
}

or a for loop like:

for (var re=/\t/g; match = re.exec(string);) {
...// use match[] array...
}

Note also that the compiled RegExp object is itself modified during the loop: it stores the read/write property "lastIndex", which is initialized to 0 when the RegExp is created and, after each successful match, holds the position just after that match in the string passed to exec().

This means that the same instantiated RegExp object cannot be shared to parse several strings in parallel (but you may duplicate the compiled and initialized RegExp object).
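As a rough sketch of that duplication, assuming an engine that supports the RegExp flags property (ES2015 or later): cloneRegExp, the sample strings, and the other names below are hypothetical helpers, not part of any library.

```javascript
// Hypothetical helper: builds an independent copy with its own lastIndex.
function cloneRegExp(re) {
  return new RegExp(re.source, re.flags);
}

var digits = /\d+/g;
var reA = cloneRegExp(digits);
var reB = cloneRegExp(digits);

var textA = "1 22 333";
var textB = "9 88";
var interleaved = [];

// Walk both strings in alternation; each copy tracks its own position.
var a = reA.exec(textA);
var b = reB.exec(textB);
while (a !== null || b !== null) {
  if (a !== null) {
    interleaved.push("A:" + a[0]);
    a = reA.exec(textA);
  }
  if (b !== null) {
    interleaved.push("B:" + b[0]);
    b = reB.exec(textB);
  }
}

console.log(interleaved.join(", ")); // A:1, B:9, A:22, B:88, A:333
```

With a single shared regex, the position reached in one string would be applied to the other string on the next call, and matches would be skipped.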

For me, modifying the compiled RegExp object during later calls to exec() is a design error of Javascript: the exec(string) call should itself take a second "startIndex" parameter indicating where parsing of the source string should start, because the lastIndex property stored in the RegExp instance is only valid for the next call of exec() if the string passed as parameter is exactly the same.

This means that you also have to keep the parsed string unmodified in a stable variable or property. So the actual code should be something like:

for (var re = /\t/g, hs = String(string), match; match = re.exec(hs); ) {
...//use match[] array
}

which copies the parsed string into a separate local variable (hs) that is not modified during processing, in case the original can change while the loop runs (for example the string content of some DOM element that will be modified by the code in the loop body)...
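On the missing "startIndex" parameter mentioned above: since lastIndex is writable, a thin wrapper can approximate it. This is only a sketch; execFrom is a hypothetical helper name, and it assumes the regex carries the "g" (or "y") flag, because exec() ignores lastIndex otherwise.

```javascript
// Hypothetical helper: run exec() starting at an explicit position.
function execFrom(re, text, startIndex) {
  if (!re.global && !re.sticky) {
    throw new Error("execFrom needs a regex with the g or y flag");
  }
  re.lastIndex = startIndex; // lastIndex is a plain read/write property
  return re.exec(text);
}

var numberPattern = /\d+/g;
var recipe = "Step 4032: add 500 to 600 pinches";

// Start searching after "Step 4032:", so the first number is skipped.
var fromTen = execFrom(numberPattern, recipe, 10);

console.log(fromTen[0]);    // 500
console.log(fromTen.index); // 15
```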

Note also: exec() returns at most one match per call, with or without the "g" flag. With the "g" flag, the regexp object is additionally modified to hold lastIndex, which is why the loop is needed to reach the later matches. The returned non-null array (match above) is not just an array, but an object containing two other named properties:
- the match.input property (or match["input"] if you prefer the indexed array syntax with a non-numeric array index) contains the source string, so you could reuse that string instead of keeping it in a separate variable.
- the match.index property (or match["index"]) contains the position where the returned match starts. The position just after the match is not stored on the array; it is the lastIndex property of the regexp object.

In fact, only one match is returned when the result is not null, so only index [0] contains the matched substring; higher indices are reserved for capturing groups (between unescaped parentheses in the regexp specification string). So match[0] corresponds to the $& placeholder in substitution rules: it is the full pattern match (without any text that was only checked by a lookahead), not the whole source string, which is available separately as match.input.
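A short illustration of that layout, using made-up patterns and strings (datePattern, dateMatch, amountPattern and amountMatch are hypothetical names):

```javascript
// Two capturing groups: match[1] and match[2] follow the full match.
var datePattern = /(\d{4})-(\d{2})/g;
var dateMatch = datePattern.exec("released 2010-07, updated 2011-03");

console.log(dateMatch[0]);     // 2010-07 (the full match)
console.log(dateMatch[1]);     // 2010    (first group)
console.log(dateMatch[2]);     // 07      (second group)
console.log(dateMatch.length); // 3

// Text that is only checked by a lookahead is not part of match[0].
var amountPattern = /\d+(?= pinches)/g;
var amountMatch = amountPattern.exec("add 500 to 600 pinches");

console.log(amountMatch[0]);    // 600
console.log(amountMatch.index); // 11
```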

Another way to write the search loop could be:

for var re = /\t/g, match = re.exec(string); match != null; match = re.exec(match.source)) {
...// use the match[0] substring which maps to match.source at position match.index
}

Example of Javascript (within HTML container) showing how this works:

<html><body>
<script language="javascript">
var re = /\d+/g;
var hs = "Step 4032: add 500 to 600 pinches of pepper to your delicious soup";
var m;
var info = "[\n";
while ((m = re.exec(hs)) != null) {
info += "{";
for (i in m) {
info += i + ": "
if (typeof(m[i]) == "string")
info += "\"" + m[i].replace("\\", "\\\\").replace("\n", "\\n").replace("\"", "\\\"").replace("&", "&amp;").replace("<","&lt;").replace(">","&gt;").replace("\"","&quot;") + "\"";
else
info += m[i];
info += "; ";
}
info += "},\n";
}
info += "]";
document.write("<pre>"+info+"</pre>");
</script></html>

If you run it, you'll read this text in your browser:

[
{0: "4032"; index: 5; input: "Step 4032: add 500 to 600 pinches of pepper to your delicious soup"; },
{0: "500"; index: 15; input: "Step 4032: add 500 to 600 pinches of pepper to your delicious soup"; },
{0: "600"; index: 22; input: "Step 4032: add 500 to 600 pinches of pepper to your delicious soup"; },
]

Sorry, replace:

for var re = /\t/g, match = re.exec(string); match != null; match = re.exec(match.source)) {
...// use the match[0] substring which maps to match.source at position match.index
}

by:

for (var re = /\t/g, match = re.exec(string); match != null; match = re.exec(match.input)) {
...// use the match[0] substring which maps to match.input at position match.index
}

Note that if you perform a substitution within the match.input string, you'll need to update re.lastIndex whenever the substitution string has a different length than the match; otherwise you'll miss matches. Updating re.lastIndex is also faster, and it avoids possible infinite loops when the substitution string contains the matched string at a later position. For example, when substituting every "a" with "ba", exec() leaves lastIndex just after the original "a", which after the substitution is the position just after the inserted "b", i.e. the start of the new "a". That "a" would be matched again and replaced again by your loop, without ending.

In other words, within your loop:
(1) compute the replacement string if it depends on the returned match[0] (or on captured submatches in match[1]..match[N] when the regexp contains capturing groups). Let's say this replacement string is in the variable named "replace".
(2) get the length of replace
(3) build the modified string by taking the first match.index characters of match.input, followed by the content of the computed replace string, followed by the final characters of match.input starting at position re.lastIndex
(4) update re.lastIndex by adding length(replace) - length(match[0]), so that it points just after the inserted replacement (that is, at match.index + length(replace))
(5) start the next iteration by calling re.exec() on the modified string, and test whether it returns null, which ends the loop.
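The steps above can be sketched roughly as follows. replaceEach and its parameter names are hypothetical; the sketch assumes a regex with the "g" flag and an engine that supports the RegExp flags property (ES2015 or later), and it works on a private copy of the regex so the caller's lastIndex is left alone.

```javascript
// Hypothetical helper implementing steps (1) to (5).
function replaceEach(text, pattern, computeReplacement) {
  if (!pattern.global) {
    throw new Error("replaceEach needs a regex with the g flag");
  }
  var re = new RegExp(pattern.source, pattern.flags); // private copy
  var current = text;
  var match;

  while ((match = re.exec(current)) !== null) {
    // (1) and (2): compute the replacement; its length is used below.
    var replacement = computeReplacement(match);

    // (3): text before the match + replacement + text after the match.
    current = current.substring(0, match.index) +
      replacement +
      current.substring(re.lastIndex);

    // (4): continue just after the inserted replacement.
    re.lastIndex = match.index + replacement.length;

    // Guard: an empty match would otherwise be found again at the same spot.
    if (match[0].length === 0) {
      re.lastIndex += 1;
    }
    // (5): the loop condition calls exec() on the modified string.
  }
  return current;
}

// Replacing every "a" with "ba" terminates because lastIndex skips the new "a".
console.log(replaceEach("banana", /a/g, function () { return "ba"; }));
// bbanbanba

// A replacement computed from the match itself.
console.log(replaceEach("one two", /\w+/g, function (m) {
  return m[0].toUpperCase();
}));
// ONE TWO
```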

With such code, you can compute arbitrary replacement strings that are impossible to write with a simple replacement pattern using $& (and possible capturing groups $1..$N), such as changing the case of some or all of the match, or performing some normalization that may change the length of the source string.
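For comparison, and as a different technique from the manual exec() loop described here, String.prototype.replace() also accepts a function as its second argument, which covers many computed-replacement cases without touching lastIndex. The function name emphasizeModifiers and the bracket formatting below are made up for illustration.

```javascript
// Hypothetical helper: wraps each modifier in brackets and upper-cases it,
// which changes the length of the string as it goes.
function emphasizeModifiers(text) {
  var pattern = /(big|round|bitty)\s+(\S+)/gi;

  // The callback receives the full match followed by each captured group.
  return text.replace(pattern, function (whole, modifier, word) {
    return "[" + modifier.toUpperCase() + "] " + word;
  });
}

console.log(emphasizeModifiers("I like big butts"));
// I like [BIG] butts
```

The exec() loop remains the more flexible option when the loop body needs the match positions or has to stop early.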

Note that the source string is not really copied into match.input by re.exec(string); match.input is just an additional reference to that string object. But that reference is lost after the last match, when match = re.exec(match.input) returns null, so you'll still need an external variable that references the processed source string passed to exec() once no more matches remain and you exit the processing loop.

An example on how to replace matching regexps with values computed:

  • var re = new RegExp( "([A-Z][a-z]+) ([a-z]+)", "gim" );
  • var oc = null;
  • var idx = 0;
  • var translated = "";
  • while ( ( oc = re.exec( selection.text ) ) !== null )
  • {
  • var repstr = oc[1].toUpperCase() + ":" + oc[2].toLowerCase() +";";
  • translated += oc.input.substring(idx,re.lastIndex - oc[0].length ) + repstr;
  • idx = re.lastIndex;
  • }
  • // Append the text that follows the last match.
  • translated += selection.text.substring( idx );
  • selection.text = translated;