Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.
I am the chief technical officer at InVision App, Inc - a prototyping and collaboration platform for designers, built by designers. I also rock out in JavaScript and ColdFusion 24x7.
Meanwhile on Twitter
Loading latest tweet...
Ben Nadel at CFUNITED 2010 (Landsdown, VA) with: Randy Brown

Steven Levithan Rocks Hardcore (Regular Expression Optimization Case Study)

By Ben Nadel on
Tags: ColdFusion

The other day, I put together a regular-expression-based CSV parsing algorithm in ColdFusion. I like to think that I have a pretty good handle on writing regular expressions, but I know that I am not a wizard like Steve Levithan. As such, I asked him to take a look at what I had written to see if it could be optimized. I started off with this:

("(?:[^"]|"")*"|[^",\r\n]*)(,|\r\n?|\n)?

Steve Levithan took a look at that and came back with this:

\G(,|\r?\n|\r|^)(?:"([^"]*+(?>""[^"]*+)*)"|([^",\r\n]*+))

Before I get into how this works (as best as I can even understand), let's talk numbers. When parsing a 10 megabyte file with 10,750,000 characters over 50,000 lines of data, my version of the regular expression ran in about 58 seconds. This is not bad considering that my previous algorithm was horrible.

I then took Steve's regular expression and implemented it and ran it on the same file. This ran, on average, in about 35 seconds. Holy mackerel! That's almost a 40% reduction in execution time thanks to his optimization.

The optimization that Steve put into the expression was not just tied to the regular expression itself, but also to the use of that regular expression in the rest of the code. Where as I grabbed the entire field value, regardless of qualification, Steve broke qualified and non-qualified field values into two different capturing groups. This is actually a huge improvement because it:

  1. Cut down on extraneous embedded quote replacement. Since we know explicitly when we matched a qualified field value, we know exactly when to check for embedded quotes and when we don't have to.
  2. He used possessive qualifiers and atomic grouping which does not allow the regular expression engine to backtrack on non-matches. This part of the regular expression world is still very new and fuzzy to me, but basically what it is way to write your expressions with the assertion that certain optional matches will never take place. I don't want to explain it too much as I don't want to give you any misinformation, but it makes things more efficient because it cuts out possible steps.

Also, while this it not necessarily an optimization, since his pattern does not have an optional delimiter, we can remove the use of the ColdFusion CFBreak tag that you may have seen in my previous implementation. Not having the need to use CFBreak is just more elegant, in my opinion.

Ok, so let's see how this 40% reduction in execution time is achieved. Here it the algorithm and example:

  • <!---
  • Save CSV data values. Here, we are creating a CSV data
  • value that has both qualified and non-qualified field
  • values, populated and empty field values, and embedded
  • field qualifiers and field/line delimiters.
  • --->
  • <cfsavecontent variable="strCSV">
  • "Name","Nick Name","Age","Hair Color"
  • Kim,"Kim ""Hot Legs"" Smith",24,"Brunette"
  • "Sarah Vivenz, II","Stubs",27,"Brunette"
  • "Kit Williams",Kitty,34,Blonde,,,
  • "Even
  • Values With
  • Embedded Line Breaks"
  • </cfsavecontent>
  •  
  •  
  • <!--- Trim data values. --->
  • <cfset strCSV = Trim( strCSV ) />
  •  
  •  
  • <!---
  • Get the regular expression to match the tokens. I have
  • put the delimiter value on the first line and field value
  • on the second line for easier reading.
  • --->
  • <cfset strRegEx = (
  • "\G(,|\r?\n|\r|^)" &
  • "(?:""([^""]*+(?>""""[^""]*+)*)""|([^"",\r\n]*+))"
  • ) />
  •  
  •  
  • <!---
  • Create a compiled Java regular expression pattern object
  • based on the pattern above.
  • --->
  • <cfset objPattern = CreateObject(
  • "java",
  • "java.util.regex.Pattern"
  • ).Compile(
  • JavaCast( "string", strRegEx )
  • )
  • />
  •  
  • <!---
  • Get the pattern matcher for our target text (the CSV data).
  • This will allows us to iterate over all the data fields.
  • --->
  • <cfset objMatcher = objPattern.Matcher(
  • JavaCast( "string", strCSV )
  • ) />
  •  
  •  
  • <!---
  • Create an array to hold the CSV data. We are going
  • to create an array of arrays in which each nested
  • array represents a row in the CSV data file.
  • --->
  • <cfset arrData = ArrayNew( 1 ) />
  •  
  • <!--- Start off with a new array for the new data. --->
  • <cfset ArrayAppend( arrData, ArrayNew( 1 ) ) />
  •  
  •  
  • <!---
  • Here's where the magic is taking place; we are going
  • to use the Java pattern matcher to iterate over each
  • of the CSV data fields using the regular expression
  • we defined above.
  •  
  • Each match will have at least the field value and
  • possibly an optional trailing delimiter.
  • --->
  • <cfloop condition="objMatcher.Find()">
  •  
  • <!---
  • Get the delimiter. We know that the delimiter will
  • always be matched, but in the case that it matched
  • the START expression, it will not have a length.
  • --->
  • <cfset REQUEST.Delimiter = objMatcher.Group(
  • JavaCast( "int", 1 )
  • ) />
  •  
  •  
  • <!---
  • Check for delimiter length and is not the field
  • delimiter. This is the only time we ever need to
  • perform an action (adding a new line array).
  • --->
  • <cfif (REQUEST.Delimiter NEQ ",")>
  •  
  • <!--- Start new row data array. --->
  • <cfset ArrayAppend(
  • arrData,
  • ArrayNew( 1 )
  • ) />
  •  
  • </cfif>
  •  
  •  
  • <!---
  • Get the field token value in group 2 (which may
  • not exist if the field value was not qualified.
  • --->
  • <cfset REQUEST.Value = objMatcher.Group(
  • JavaCast( "int", 2 )
  • ) />
  •  
  • <!---
  • Check to see if the value exists. If it doesn't exist,
  • then we want the non-qualified field. If it does exist,
  • then we want to replace any escaped embedded quotes.
  • --->
  • <cfif StructKeyExists( REQUEST, "Value" )>
  •  
  • <!---
  • Replace escpaed quotes with an unescaped double
  • quote. No need to perform regex for this.
  • --->
  • <cfset REQUEST.Value = Replace(
  • REQUEST.Value,
  • """""",
  • """",
  • "all"
  • ) />
  •  
  • <cfelse>
  •  
  • <!---
  • No qualified field value was found, so use group
  • 3 - the non-qualified alternative.
  • --->
  • <cfset REQUEST.Value = objMatcher.Group(
  • JavaCast( "int", 3 )
  • ) />
  •  
  • </cfif>
  •  
  •  
  • <!--- Add the field value to the row array. --->
  • <cfset ArrayAppend(
  • arrData[ ArrayLen( arrData ) ],
  • REQUEST.Value
  • ) />
  •  
  • </cfloop>
  •  
  •  
  • <!--- Dump out CSV data array. --->
  • <cfdump
  • var="#arrData#"
  • label="CSV File Data"
  • />

And, just so believe me that this actually works, here is the CFDump output that we get:


 
 
 

 
CSV Parsing In ColdFusion Using Steve Levithan's Regular Expression Optimization  
 
 
 

So, like I said, the whole idea of possessive qualifiers and atomic grouping is still very fuzzy to me. However, I did try to break Steve's regular expression apart and explain it. Please take this with a grain of salt as it might not be 100% accurate (or helpful):

  • <!--- Create verbose regular expression. --->
  • <cfsavecontent variable="strRegex">(?x)
  •  
  • ## Every pattern consists of a delimiter followed by a
  • ## field value. The delimiter will not have a length
  • ## for the first pattern matching since it can match the
  • ## start of a string which inheritly has no length.
  •  
  • ## To kick things off, we are going to put in an assertion
  • ## that this pattern match starts at the End of the last
  • ## matched string (optional, but assuring).
  •  
  • \G
  •  
  • ## Let's get the delimiter, which can be either a comma,
  • ## new line data, or the start of the string. These are
  • ## put in order of likelyhood (with the field delimiter
  • ## coming first. This group is NOT optional, but it might
  • ## result in the empty string.
  •  
  • (
  • ## Comma - the most likely to occur.
  •  
  • ,
  •  
  • ## OR, optional return-new line.
  •  
  • |\r?\n
  •  
  • ## OR just the carriage return.
  •  
  • |\r
  •  
  • ## OR the beginning of the string. This is will only
  • ## occur once in the entire string.
  •  
  • |^
  • )
  •  
  • ## Now that we have our delimiter, let's figure out what
  • ## kind of field value we are grabbing. There are two kinds
  • ## of field values - qualified (quoted) and not qualified.
  • ## Since it can be either, we are going to OR the two types
  • ## in a non-capturing group.
  •  
  • (?:
  • ## Let's figure out if we are getting a qualified field
  • ## value first. This must start and end which quotes and
  • ## can have escaped quotes within it.
  •  
  • " ## Start quoted value.
  •  
  • (
  • ## To start with we are going to match non-quote
  • ## characters in a possesive way (meaning that we
  • ## are not going to allow backtracking once we hit
  • ## a quote value. This will match zero or more times.
  •  
  • [^"]*+
  •  
  • ## Then, assuming the above part hits a quote
  • ## character, we are going to see if is an escaped
  • ## quote followed by zero or more non-quote
  • ## characters in a posessive fashion (non back-
  • ## tracking).
  •  
  • (?>""[^"]*+)*
  •  
  • )
  •  
  • " ## End quoted value.
  •  
  • ## If we did not find the above quoted field value, then
  • ## we can match a non-quoted value.
  •  
  • | ## OR
  •  
  • ([^",\r\n]*+)
  •  
  • )
  •  
  • ## End non-capturing field value group.
  •  
  • </cfsavecontent>

Thanks Steve! You rock.




Reader Comments

Heh... Thanks for the regex luv.

As you noted, although there is significant optimization in the regex itself (through possessive quantifiers/atomic groups and Jeffrey Friedl's "unrolling the loop" pattern, which avoids the inefficiency of something like "(?:[^"]|"")*"), much of the speedup comes from the changes which allow simplified post-processing.

By the way, in your original post you mentioned in the code comments that you weren't sure if it was necessary to allow a single line-feed character (\n) for line breaks. That's a Unix\Linux style line break. Mac uses \r, and Windows uses \r\n.

Reply to this Comment

@Steve,

Thanks for all the tips. I have to look further into this Unrolling The Loop technique as the whole non-backtracking stuff is still a bit fuzzy, especially when it comes to seeing where it should be applied. Do you know of any good explanations of this? Or should I just Google it?

Reply to this Comment

regular-expressions.info has pretty good information on atomic groups and possessive quantifiers at http://www.regular-expressions.info/possessive.html and http://www.regular-expressions.info/atomic.html

The general pattern for unrolling the loop is:

opening normal* (special normal*)* closing

...although there are a few additional rules, such as that the start of special and normal must never intersect, and special must not match an empty string.

You can see it in the above pattern as:

"([^"]*+(?>""[^"]*+)*)"

Where the leading and trailing double-quote are opening and closing, [^"] is normal, and "" is special.

There is a little more information at http://www.softec.st/en/OpenSource/DevelopersCorner/RegularExpressions/UnrollingTheLoop.html

Reply to this Comment

Check out this article on regex optimization:

http://www.javaworld.com/javaworld/jw-09-2007/jw-09-optimizingregex.html

It is, in part, Java specific, and makes a few debatable claims, but on the whole it is a very good article for the basics of regex optimization (a topic which isn't written about often enough, or at least not by people who know what they're talking about). One day I hope to write up an in-depth article on JavaScript regex optimization.

Reply to this Comment

@Steve,

Yeah, I love http://www.regular-expressions.info. I think it has a ton of great information. Thanks for the link on unrolling the loop. I think I get it (or at least getting there). I'll have to check out that article on Javascript regular expression optimization.

Reply to this Comment

How would you go about this if the delimeter were a TAB character. I tried adding "|\t" in the line where you have "\G(,|\r?\n|\r|^)", but this didn't seem to work.

Also, can you incorporate this into a function like you did in your original blog post, so the data, delimiter, file, and qualifier can be passed into CSVToArray(). Thanks.

Reply to this Comment

@Troy,

I will certainly wrap this up in a UDF. The \t probably didn't work because you need to reference later on in the code as well as in the Regular Expression.

I'll try to post tomorrow.

Reply to this Comment

Thanks Ben, looking forward to the new and improved UDF. The speed is incredibly faster and is going to make my client very happy. And if you can make it support the TAB delimiter, all the better.

In the old UDF, a 184KB CSV file with 1000 rows took 35.59 sec to parse.
in the new UDF, the same file is parsed in only 1.57 sec.

Reply to this Comment

@Ben,

Okay, so I've got this thing working, or so I think. The csv file has 18k rows and 11 columns (3mb file/ 3,083,493 chars with spaces). The process takes about 30 seconds per 1000 records which is nowhere near the performance levels you presented in the beginning of this thread and it is only 1/3 of the size of data you were working with.

Any ideas as to what I'm doing wrong?

Frustrated

Reply to this Comment

@DLackey,

Sorry, I am not sure. Perhaps my data was not as large as yours? Or there's not as much running on my machine? Outside the regex parsing, there's not a whole lot going on, so there's not a ton of room for performance optimization.

Reply to this Comment

Post A Comment

You — Get Out Of My Dreams, Get Into My Comments
Live in the Now
Oops!
Comment Etiquette: Please do not post spam. Please keep the comments on-topic. Please do not post unrelated questions or large chunks of code. And, above all, please be nice to each other - we're trying to have a good conversation here.