Steven Levithan Rocks Hardcore (Regular Expression Optimization Case Study)

Posted September 29, 2007 at 5:58 PM

Tags: ColdFusion

The other day, I put together a regular-expression-based CSV parsing algorithm in ColdFusion. I like to think that I have a pretty good handle on writing regular expressions, but I know that I am not a wizard like Steve Levithan. As such, I asked him to take a look at what I had written to see if it could be optimized. I started off with this:

("(?:[^"]|"")*"|[^",\r\n]*)(,|\r\n?|\n)?

Steve Levithan took a look at that and came back with this:

\G(,|\r?\n|\r|^)(?:"([^"]*+(?>""[^"]*+)*)"|([^",\r\n]*+))

Before I get into how this works (as best as I can even understand), let's talk numbers. When parsing a 10 megabyte file with 10,750,000 characters over 50,000 lines of data, my version of the regular expression ran in about 58 seconds. This is not bad considering that my previous algorithm was horrible.

I then took Steve's regular expression and implemented it and ran it on the same file. This ran, on average, in about 35 seconds. Holy mackerel! That's almost a 40% reduction in execution time thanks to his optimization.

The optimization that Steve put into the expression was not just tied to the regular expression itself, but also to the use of that regular expression in the rest of the code. Where as I grabbed the entire field value, regardless of qualification, Steve broke qualified and non-qualified field values into two different capturing groups. This is actually a huge improvement because it:

  1. Cut down on extraneous embedded quote replacement. Since we know explicitly when we matched a qualified field value, we know exactly when to check for embedded quotes and when we don't have to.
  2. He used possessive qualifiers and atomic grouping which does not allow the regular expression engine to backtrack on non-matches. This part of the regular expression world is still very new and fuzzy to me, but basically what it is way to write your expressions with the assertion that certain optional matches will never take place. I don't want to explain it too much as I don't want to give you any misinformation, but it makes things more efficient because it cuts out possible steps.

Also, while this it not necessarily an optimization, since his pattern does not have an optional delimiter, we can remove the use of the ColdFusion CFBreak tag that you may have seen in my previous implementation. Not having the need to use CFBreak is just more elegant, in my opinion.

Ok, so let's see how this 40% reduction in execution time is achieved. Here it the algorithm and example:

 Launch code in new window » Download code as text file »

  • <!---
  • Save CSV data values. Here, we are creating a CSV data
  • value that has both qualified and non-qualified field
  • values, populated and empty field values, and embedded
  • field qualifiers and field/line delimiters.
  • --->
  • <cfsavecontent variable="strCSV">
  • "Name","Nick Name","Age","Hair Color"
  • Kim,"Kim ""Hot Legs"" Smith",24,"Brunette"
  • "Sarah Vivenz, II","Stubs",27,"Brunette"
  • "Kit Williams",Kitty,34,Blonde,,,
  • "Even
  • Values With
  • Embedded Line Breaks"
  • </cfsavecontent>
  •  
  •  
  • <!--- Trim data values. --->
  • <cfset strCSV = Trim( strCSV ) />
  •  
  •  
  • <!---
  • Get the regular expression to match the tokens. I have
  • put the delimiter value on the first line and field value
  • on the second line for easier reading.
  • --->
  • <cfset strRegEx = (
  • "\G(,|\r?\n|\r|^)" &
  • "(?:""([^""]*+(?>""""[^""]*+)*)""|([^"",\r\n]*+))"
  • ) />
  •  
  •  
  • <!---
  • Create a compiled Java regular expression pattern object
  • based on the pattern above.
  • --->
  • <cfset objPattern = CreateObject(
  • "java",
  • "java.util.regex.Pattern"
  • ).Compile(
  • JavaCast( "string", strRegEx )
  • )
  • />
  •  
  • <!---
  • Get the pattern matcher for our target text (the CSV data).
  • This will allows us to iterate over all the data fields.
  • --->
  • <cfset objMatcher = objPattern.Matcher(
  • JavaCast( "string", strCSV )
  • ) />
  •  
  •  
  • <!---
  • Create an array to hold the CSV data. We are going
  • to create an array of arrays in which each nested
  • array represents a row in the CSV data file.
  • --->
  • <cfset arrData = ArrayNew( 1 ) />
  •  
  • <!--- Start off with a new array for the new data. --->
  • <cfset ArrayAppend( arrData, ArrayNew( 1 ) ) />
  •  
  •  
  • <!---
  • Here's where the magic is taking place; we are going
  • to use the Java pattern matcher to iterate over each
  • of the CSV data fields using the regular expression
  • we defined above.
  •  
  • Each match will have at least the field value and
  • possibly an optional trailing delimiter.
  • --->
  • <cfloop condition="objMatcher.Find()">
  •  
  • <!---
  • Get the delimiter. We know that the delimiter will
  • always be matched, but in the case that it matched
  • the START expression, it will not have a length.
  • --->
  • <cfset REQUEST.Delimiter = objMatcher.Group(
  • JavaCast( "int", 1 )
  • ) />
  •  
  •  
  • <!---
  • Check for delimiter length and is not the field
  • delimiter. This is the only time we ever need to
  • perform an action (adding a new line array).
  • --->
  • <cfif (REQUEST.Delimiter NEQ ",")>
  •  
  • <!--- Start new row data array. --->
  • <cfset ArrayAppend(
  • arrData,
  • ArrayNew( 1 )
  • ) />
  •  
  • </cfif>
  •  
  •  
  • <!---
  • Get the field token value in group 2 (which may
  • not exist if the field value was not qualified.
  • --->
  • <cfset REQUEST.Value = objMatcher.Group(
  • JavaCast( "int", 2 )
  • ) />
  •  
  • <!---
  • Check to see if the value exists. If it doesn't exist,
  • then we want the non-qualified field. If it does exist,
  • then we want to replace any escaped embedded quotes.
  • --->
  • <cfif StructKeyExists( REQUEST, "Value" )>
  •  
  • <!---
  • Replace escpaed quotes with an unescaped double
  • quote. No need to perform regex for this.
  • --->
  • <cfset REQUEST.Value = Replace(
  • REQUEST.Value,
  • """""",
  • """",
  • "all"
  • ) />
  •  
  • <cfelse>
  •  
  • <!---
  • No qualified field value was found, so use group
  • 3 - the non-qualified alternative.
  • --->
  • <cfset REQUEST.Value = objMatcher.Group(
  • JavaCast( "int", 3 )
  • ) />
  •  
  • </cfif>
  •  
  •  
  • <!--- Add the field value to the row array. --->
  • <cfset ArrayAppend(
  • arrData[ ArrayLen( arrData ) ],
  • REQUEST.Value
  • ) />
  •  
  • </cfloop>
  •  
  •  
  • <!--- Dump out CSV data array. --->
  • <cfdump
  • var="#arrData#"
  • label="CSV File Data"
  • />

And, just so believe me that this actually works, here is the CFDump output that we get:


 
 
 

 
CSV Parsing In ColdFusion Using Steve Levithan's Regular Expression Optimization  
 
 
 

So, like I said, the whole idea of possessive qualifiers and atomic grouping is still very fuzzy to me. However, I did try to break Steve's regular expression apart and explain it. Please take this with a grain of salt as it might not be 100% accurate (or helpful):

 Launch code in new window » Download code as text file »

  • <!--- Create verbose regular expression. --->
  • <cfsavecontent variable="strRegex">(?x)
  •  
  • ## Every pattern consists of a delimiter followed by a
  • ## field value. The delimiter will not have a length
  • ## for the first pattern matching since it can match the
  • ## start of a string which inheritly has no length.
  •  
  • ## To kick things off, we are going to put in an assertion
  • ## that this pattern match starts at the End of the last
  • ## matched string (optional, but assuring).
  •  
  • \G
  •  
  • ## Let's get the delimiter, which can be either a comma,
  • ## new line data, or the start of the string. These are
  • ## put in order of likelyhood (with the field delimiter
  • ## coming first. This group is NOT optional, but it might
  • ## result in the empty string.
  •  
  • (
  • ## Comma - the most likely to occur.
  •  
  • ,
  •  
  • ## OR, optional return-new line.
  •  
  • |\r?\n
  •  
  • ## OR just the carriage return.
  •  
  • |\r
  •  
  • ## OR the beginning of the string. This is will only
  • ## occur once in the entire string.
  •  
  • |^
  • )
  •  
  • ## Now that we have our delimiter, let's figure out what
  • ## kind of field value we are grabbing. There are two kinds
  • ## of field values - qualified (quoted) and not qualified.
  • ## Since it can be either, we are going to OR the two types
  • ## in a non-capturing group.
  •  
  • (?:
  • ## Let's figure out if we are getting a qualified field
  • ## value first. This must start and end which quotes and
  • ## can have escaped quotes within it.
  •  
  • " ## Start quoted value.
  •  
  • (
  • ## To start with we are going to match non-quote
  • ## characters in a possesive way (meaning that we
  • ## are not going to allow backtracking once we hit
  • ## a quote value. This will match zero or more times.
  •  
  • [^"]*+
  •  
  • ## Then, assuming the above part hits a quote
  • ## character, we are going to see if is an escaped
  • ## quote followed by zero or more non-quote
  • ## characters in a posessive fashion (non back-
  • ## tracking).
  •  
  • (?>""[^"]*+)*
  •  
  • )
  •  
  • " ## End quoted value.
  •  
  • ## If we did not find the above quoted field value, then
  • ## we can match a non-quoted value.
  •  
  • | ## OR
  •  
  • ([^",\r\n]*+)
  •  
  • )
  •  
  • ## End non-capturing field value group.
  •  
  • </cfsavecontent>

Thanks Steve! You rock.

Download Code Snippet ZIP File

Comments (9)  |  Post Comment  |  Ask Ben  |  Permalink  |  Other Searches  |  Print Page




Adobe ColdFusion 8.0.1 Update - Helping Programmers To Be Signifanctly Less Girlie - Download ColdFusion 8 Update 8.0.1 Now.

Reader Comments

Heh... Thanks for the regex luv.

As you noted, although there is significant optimization in the regex itself (through possessive quantifiers/atomic groups and Jeffrey Friedl's "unrolling the loop" pattern, which avoids the inefficiency of something like "(?:[^"]|"")*"), much of the speedup comes from the changes which allow simplified post-processing.

By the way, in your original post you mentioned in the code comments that you weren't sure if it was necessary to allow a single line-feed character (\n) for line breaks. That's a Unix\Linux style line break. Mac uses \r, and Windows uses \r\n.

Posted by Steve on Oct 1, 2007 at 4:58 PM


@Steve,

Thanks for all the tips. I have to look further into this Unrolling The Loop technique as the whole non-backtracking stuff is still a bit fuzzy, especially when it comes to seeing where it should be applied. Do you know of any good explanations of this? Or should I just Google it?

Posted by Ben Nadel on Oct 1, 2007 at 5:07 PM


regular-expressions.info has pretty good information on atomic groups and possessive quantifiers at http://www.regular-expressions.info/possessive.html and http://www.regular-expressions.info/atomic.html

The general pattern for unrolling the loop is:

opening normal* (special normal*)* closing

...although there are a few additional rules, such as that the start of special and normal must never intersect, and special must not match an empty string.

You can see it in the above pattern as:

"([^"]*+(?>""[^"]*+)*)"

Where the leading and trailing double-quote are opening and closing, [^"] is normal, and "" is special.

There is a little more information at http://www.softec.st/en/OpenSource/DevelopersCorner/RegularExpressions/UnrollingTheLoop.html

Posted by Steve on Oct 1, 2007 at 6:47 PM


Check out this article on regex optimization:

http://www.javaworld.com/javaworld/jw-09-2007/jw-09-optimizingregex.html

It is, in part, Java specific, and makes a few debatable claims, but on the whole it is a very good article for the basics of regex optimization (a topic which isn't written about often enough, or at least not by people who know what they're talking about). One day I hope to write up an in-depth article on JavaScript regex optimization.

Posted by Steve on Oct 1, 2007 at 8:02 PM


@Steve,

Yeah, I love http://www.regular-expressions.info. I think it has a ton of great information. Thanks for the link on unrolling the loop. I think I get it (or at least getting there). I'll have to check out that article on Javascript regular expression optimization.

Posted by Ben Nadel on Oct 2, 2007 at 7:32 AM


How would you go about this if the delimeter were a TAB character. I tried adding "|\t" in the line where you have "\G(,|\r?\n|\r|^)", but this didn't seem to work.

Also, can you incorporate this into a function like you did in your original blog post, so the data, delimiter, file, and qualifier can be passed into CSVToArray(). Thanks.

Posted by Troy on Oct 11, 2007 at 1:41 PM


@Troy,

I will certainly wrap this up in a UDF. The \t probably didn't work because you need to reference later on in the code as well as in the Regular Expression.

I'll try to post tomorrow.

Posted by Ben Nadel on Oct 11, 2007 at 2:03 PM


Thanks Ben, looking forward to the new and improved UDF. The speed is incredibly faster and is going to make my client very happy. And if you can make it support the TAB delimiter, all the better.

In the old UDF, a 184KB CSV file with 1000 rows took 35.59 sec to parse.
in the new UDF, the same file is parsed in only 1.57 sec.

Posted by Troy on Oct 11, 2007 at 3:20 PM


@Troy,

The UDF has been posted:

http://www.bennadel.com/index.cfm?dax=blog:991.view

That's awesome to hear that the new algorithm is so much faster. Thanks for the feedback.

Posted by Ben Nadel on Oct 12, 2007 at 9:02 AM


Post Comment  |  Ask Ben


Home   |   Web Log   |   ColdFusion   |   Projects   |   Resume   |   Job Form   |   Search   |   Contact
Epicenter Consulting - Custom Software Solutions for Business Evolution HostMySite.com - The Leader In ColdFusion Hosting