Regular Expressions Make CSV Parsing In ColdFusion So Much Easier (And Faster)

Posted September 28, 2007 at 7:29 AM

Tags: ColdFusion

Today, I was working on some data parsing when it hit me like a bolt of lightening! Why not just use Regular Expressions to parse CSV (comma separated values) data files? For those of you who have seen my previous attempts at parsing CSV data files using ColdFusion, you will see that having to handle embedded field qualifiers and data delimiters made it a character by character parsing problem since you can no longer treat it like a list of lists (both comma a new line delimited). It was a huge amount of code to do a huge amount of work.

But then, as I was working today, suddenly I realized that working with embedded field qualifiers was exactly like working with embedded quotes in HTML tag attribute value parsing; it's just a pattern. So, what is the pattern of the CSV file. At it's highest level, it's a field value followed by an optional delimiter. I say "optional" delimiter because the last data field in the file will not be succeeded by a delimiter. The field value is then an optional qualified value containing zero or more characters. The allowable characters in the field are determined be the qualification of the data.

The regular expression I came up with was so short it made me giggle:

("(?:[^"]|"")*"|[^",\r\n]*)(,|\r\n?|\n)?

This ColdFusion compatible regular expression captures two groups; the first group is the field value (zero or more characters) and the second group is the optional succeeding delimiter. If this is totally not readable (as most regular expressions are not), I have converted this to a verbose regular expression (?x) with my typical amount of commenting so that you can really see how the pattern of the data field can be matched:

 Launch code in new window » Download code as text file »

  • <!--- Create verbose regular expression. --->
  • <cfsavecontent variable="strRegex">(?x)
  •  
  • ## Every CSV "Group" consists of a field value
  • ## followed by an optional delimiter. This delimiter,
  • ## if it exists, will be either a field delimiler or
  • ## a line delimiter. If no delimiter exists, then we
  • ## are at the end of the file.
  •  
  • ## Let's get the field value. We need to consider two cases
  • ## in field values. Either the value is qualified or it is
  • ## not. According to standards, values that have an
  • ## embedded field or line delimiter or qualifier, MUST BE
  • ## contained in a qualified field.
  •  
  • (
  • ## Because the qualified fields are the exception case,
  • ## we want to check for those first. A qualified field
  • ## can contain non quotes AND escaped quotes.
  •  
  • "(?:[^"]|"")*"
  •  
  • | ## OR
  •  
  • ## If we did not match the qualified field token, then
  • ## we need to check for the non-qualified field token
  • ## which can be zero or more characters NOT consisting
  • ## of any qualifier or delimiter.
  •  
  • [^",\r\n]*
  •  
  • )
  •  
  • ## Now that we have captured the zero+ length field value,
  • ## we can get the token. Remember, if we are at the end
  • ## of the file, the token will not exist and therefore,
  • ## this group must be optional (?).
  •  
  • (
  • ## Comma
  •  
  • ,
  •  
  • ## OR return with optional newline.
  •  
  • |\r\n?
  •  
  • ## OR just new line (I include this option for safer
  • ## text values across operating systems (but this might
  • ## not be necessary).
  •  
  • |\n
  • )?
  •  
  • </cfsavecontent>

So now that you see how simple the pattern is, let's use it to start parsing our CSV data in ColdFusion. Luckily, from within ColdFusion, we have access to the Java Pattern and the Java Pattern Matcher which make our lives almost too easy when it comes to iterating over a string. As we iterate over the string, we are going to store our CSV values in an array of arrays in which each sub array will represent a row from the CSV data file.

 Launch code in new window » Download code as text file »

  • <!---
  • Save CSV data values. Here, we are creating a CSV data
  • value that has both qualified and non-qualified field
  • values, populated and empty field values, and embedded
  • field qualifiers and field/line delimiters.
  • --->
  • <cfsavecontent variable="strCSV">
  • "Name","Nick Name","Age","Hair Color"
  • Kim,"Kim ""Hot Legs"" Smith",24,"Brunette"
  • "Sarah Vivenz, II","Stubs",27,"Brunette"
  • "Kit Williams",Kitty,34,Blonde,,,
  • "Even
  • Values With
  • Embedded Line Breaks"
  • </cfsavecontent>
  •  
  •  
  • <!--- Trim data values. --->
  • <cfset strCSV = Trim( strCSV ) />
  •  
  •  
  • <!---
  • Get the regular expression to match the tokens. I have
  • put the field value on the first line and delimiters on
  • the second line for easier reading.
  • --->
  • <cfset strRegEx = (
  • "(""(?:[^""]|"""")*""|[^"",\r\n]*)" &
  • "(,|\r\n?|\n)?"
  • )/>
  •  
  •  
  • <!---
  • Create a compiled Java regular expression pattern object
  • based on the pattern above.
  • --->
  • <cfset objPattern = CreateObject(
  • "java",
  • "java.util.regex.Pattern"
  • ).Compile(
  • JavaCast( "string", strRegEx )
  • )
  • />
  •  
  • <!---
  • Get the pattern matcher for our target text (the CSV data).
  • This will allows us to iterate over all the data fields.
  • --->
  • <cfset objMatcher = objPattern.Matcher(
  • JavaCast( "string", strCSV )
  • ) />
  •  
  •  
  • <!---
  • Create an array to hold the CSV data. We are going
  • to create an array of arrays in which each nested
  • array represents a row in the CSV data file.
  • --->
  • <cfset arrData = ArrayNew( 1 ) />
  •  
  • <!--- Start off with a new array for the new data. --->
  • <cfset ArrayAppend( arrData, ArrayNew( 1 ) ) />
  •  
  •  
  • <!---
  • Here's where the magic is taking place; we are going
  • to use the Java pattern matcher to iterate over each
  • of the CSV data fields using the regular expression
  • we defined above.
  •  
  • Each match will have at least the field value and
  • possibly an optional trailing delimiter.
  • --->
  • <cfloop condition="objMatcher.Find()">
  •  
  • <!--- Get the field token value. --->
  • <cfset REQUEST.Value = objMatcher.Group(
  • JavaCast( "int", 1 )
  • ) />
  •  
  • <!--- Remove the field qualifiers (if any). --->
  • <cfset REQUEST.Value = REQUEST.Value.ReplaceAll(
  • JavaCast( "string", "^""|""$" ),
  • JavaCast( "string", "" )
  • ) />
  •  
  • <!--- Unesacepe embedded qualifiers (if any). --->
  • <cfset REQUEST.Value = REQUEST.Value.ReplaceAll(
  • JavaCast( "string", "(""){2}" ),
  • JavaCast( "string", "$1" )
  • ) />
  •  
  • <!--- Add the field value to the row array. --->
  • <cfset ArrayAppend(
  • arrData[ ArrayLen( arrData ) ],
  • REQUEST.Value
  • ) />
  •  
  •  
  • <!---
  • Get the delimiter. If no delimiter group was matched,
  • this will destroy the variable in the REQUEST scope.
  • --->
  • <cfset REQUEST.Delimiter = objMatcher.Group(
  • JavaCast( "int", 2 )
  • ) />
  •  
  •  
  • <!--- Check for delimiter. --->
  • <cfif StructKeyExists( REQUEST, "Delimiter" )>
  •  
  • <!---
  • Check to see if we need to start a new array to
  • hold the next row of data. We need to do this if the
  • delimiter we just found is NOT a field delimiter.
  • --->
  • <cfif (REQUEST.Delimiter NEQ ",")>
  •  
  • <!--- Start new row data array. --->
  • <cfset ArrayAppend(
  • arrData,
  • ArrayNew( 1 )
  • ) />
  •  
  • </cfif>
  •  
  • <cfelse>
  •  
  • <!---
  • If there is no delimiter, then we are done parsing
  • the CSV file data. Break out rather than just ending
  • the loop to make sure we don't get any extra data.
  • --->
  • <cfbreak />
  •  
  • </cfif>
  •  
  • </cfloop>
  •  
  •  
  • <!--- Dump out CSV data array. --->
  • <cfdump
  • var="#arrData#"
  • label="CSV File Data"
  • />

There's a lot of commenting going on there (typical Me!), but if you filter all that out, this is only a few lines of code. That's how easy regular expressions make our lives! Compare that with any of my previous ColdFusion CSV parsing solutions to see that this one is a small fraction of the size. And, best part is, it works! Running the above code, we get the following CFDump output:


 
 
 

 
Parsing CSV Values In ColdFusion Using Regular Expressions  
 
 
 

It works on a small file, sure piece of cake, but what about performance? Sometimes we (ok, maybe just me) get sucked so hard into how sexy Regular Expressions are, that we lose sight of the big picture and we end up going down a path that is less performant in the long run. To test this, I am going to create a beefy CSV file with this code:

 Launch code in new window » Download code as text file »

  • <!--- Create data file path. --->
  • <cfset strCSVPath = ExpandPath( "data.csv" ) />
  •  
  • <!---
  • Create sample data for a web site. Make this data beefy
  • to simulate what kind of stuff might actually be coming
  • out of a system-generated CSV file.
  • --->
  • <cfsavecontent variable="strCSV">"Kinky Solutions: A Student's Perspective","http://www.bennadel.com","Ben J. Nadel","ColdFusion,Web Development,CSS,Javascript,jQuery,SQL,SEO Optimization",,,09/21/2007 07:31:15,true,true,false,"5,340","34",,true</cfsavecontent>
  •  
  • <!--- Repeat string 10,000 times. --->
  • <cfset strCSV = RepeatString(
  • strCSV & Chr( 13 ) & Chr( 10 ),
  • (10 * 1000)
  • ) />
  •  
  • <!--- Write data to file. --->
  • <cffile
  • action="write"
  • file="#strCSVPath#"
  • output="#strCSV#"
  • />

This creates a 10,000 line CSV file with over 2,150,000 characters. This file comes out to be just over 2 megabytes in size. Now, to make sure I am not messing with the time trials, I am running this script separately than the parsing code.

Now for the test! I ran the current version that employs regular expressions to do the parsing against my previous version which checks characters.

Previous CSV Parsing Algorithm

Average parse time: 100,000 ms (100 seconds)

Regular Expression CSV Parsing Algorithm

Average parse time: 11,000 ms (11 seconds)

Holy Cow! The regular expression solution is about 10 times faster than the previous solution! And, this was no rinky-dinky example; this was a seriously large CSV file (at least in my world).

Just for fun, I regenerated the above test CSV file, except this time, I created a 50,000 record CSV file. This file has 10,750,000 characters in it and comes out at well over 10 megabytes. I didn't even want to bother running this in the old solution. However, I am very pleased to say that all 50,000 records were successfully parsed in just 57,000 ms (57 seconds)! That's pretty sweet-ass-sweet.

Let's stop for a second and just look at the performance scaling. The first run went 10,000 records in 11,000 ms. That comes out to be 909 records per second. The second run went 50,000 records in 57,000 ms. That comes out to be 877 records per second. As the data set grows, this thing is just gonna EAT UP MEMORY (creating this massive array in RAM), but as you can see, the efficiency of the algorithm is fairly constant. This is some awesome news.


 
 
 

 
Is It Hot In Here? Or Is It Just Your Use Of Regular Expressions?  
 
 
 

Download Code Snippet ZIP File

Comments (13)  |  Post Comment  |  Ask Ben  |  Permalink  |  Other Searches  |  Print Page




Adobe ColdFusion 8.0.1 Update - Helping Programmers To Be Signifanctly Less Girlie - Download ColdFusion 8 Update 8.0.1 Now.

Reader Comments

Holy Pseudo-Random Strings Of Characters, Batman! :)

Firstoff, very cool.

Secondly, I have a question that will wither mean I'm starting to understand Regular Expressions or... well, not.

It looks to me like you've got two groupings that create unused backreferences, namely ("(?:[^"]|"")*"|[^",\r\n]*) and (,|\r\n?|\n) If I understand this correctly, wouldn't (?:"(?:[^"]|"")*"|[^",\r\n]*)(?:,|\r\n?|\n) give better performance because you're not creating unused backreferences?

Or am I missing something?

Posted by Matt Osbun on Sep 28, 2007 at 8:42 AM


@Matt,

True, the groups do create back references that can use within the current regular expression. However, it makes the matched pattern much more convenient when grabbing the matched text when you capture the values.

If none of the groups were capturing, I could still grab the whole matching using:

objMatcher.Group()

However, it would not be as easy to figure out where the field value ends and the !optional! delimiter begins. By allowing the field and the delimiter to be captured via a group, then it makes it much more convenient to use:

objMatcher.Group( 1 )

... to get the field value, and

objMatcher.Group( 2 )

... to get the optional delimiter.

So yes, it might be faster in the pattern to not capture, but overall, capturing the groups make the algorithm easier to code and to understand.

Posted by Ben Nadel on Sep 28, 2007 at 9:04 AM


http://xkcd.com/208/

:)

Posted by tony petruzzi on Sep 28, 2007 at 9:06 AM


@Ben

Gotcha. Thanks.

Posted by Matt Osbun on Sep 28, 2007 at 9:19 AM


@Tony,

I love those comics. I need to get that RegEx t-shirt: http://xkcd.com/store/

Posted by Ben Nadel on Sep 28, 2007 at 10:41 AM


Here, we use the java Ostermiller CSV parser
http://ostermiller.org/utils/CSV.html

On my test server, your code running at around 4000 - 5000ms
Ostermiller's is under 300ms

We do A LOT of data importing and exporting, so I've spent tons and tons of time looking for the best way to parse files.

The only CF I have to deal with is to call a function:

<cfscript>
function parseCSV(csvPath) {
jFIS = createObject( "java", "java.io.FileInputStream" );
jFIS.init("#ARGUMENTS.csvPath#");

csvParser = createObject( "java", "com.Ostermiller.util.CSVParser" );
csvParser.init( jFIS );
csvAsArray = csvParser.getAllValues();
csvParser.close();

jFIS.close();

return csvAsArray;
}
</cfscript>

<cfset csvArray = parseCSV(strCSVPath)>

Posted by Jason Rushton on Sep 28, 2007 at 10:57 AM


@Jason,

Yeah, a real Java package is probably going to outperform anything that I write. But keep in mind that mine runs basic ColdFusion without any additions. I assume that for the CSVParser to work, I would have to add a Java package to the ColdFusion install. This might not be a problem for most people, but I do not have access to do that sort of stuff.

But regardless, thanks for testing it up against the "big boys". Good to see where I fall in line :)

Posted by Ben Nadel on Sep 28, 2007 at 11:15 AM


Email me your size and address and it will be ordered tonight :)

Posted by tony petruzzi on Sep 28, 2007 at 8:23 PM


Hello Chaps,

I do a fair amount, well, in fact, I do a large amount of CSV data parsing through a bunch of web services and things like that, and this is a very interesting topic for me.

Untill now I've pretty much cut CF out of the loop entirely as I dont really need to do any real 'manipulation' on the data, I just want to get it parsed into my database, so in the past I've always used a DTS package for SQL Server to do this for me, and more recently now I'm running 2k5 just opted for a stored proc that recieves a string for the file path of the CSV, it then does a BULK LOAD on the file, which seems to work very nicely.

Am I still handling this in the best fasion? or would one of these JAVA classes mixed up with an insert type query be my best bet?

I'm still betting that SQL handles this stuff infinatly faster than CF.

Rob

Posted by Rob Rawlins on Sep 29, 2007 at 1:58 PM


Ben, since at http://www.bennadel.com/index.cfm?dax=blog:410.view you asked me to see if I had any suggestions for your CSV parser here, in Java I would use something like the following:

\G(,|\r?\n|\r|^)(?:"([^"]*+(?>""[^"]*+)*)"|([^",\r\n]*+))

While the above is no longer ColdFusion-compatible, I assume that's not a big deal since you're using Java's regex engine anyway. The somewhat more advanced features it uses include possessive quantifiers, an atomic group, Jeffrey Friedl's "unrolling the loop" pattern, and Java's "\G" (just as an extra assurance that the regex's bump-along mechanism always puts us where we want to be). I've also used two different sets of parentheses to avoid having to remove the outer quotes from quoted values in post-processing. Instead, we can just check if a value was captured to backreference 2, and if so, use it, otherwise use backreference 3.

One other thing I would recommend after a brief lookover is changing the replacement of (""){2} with $1 to simply replacing "" with ". Also, if you're going to use a regex to do that, compile the regex outside the loop.

Posted by Steven Levithan on Sep 29, 2007 at 3:28 PM


@Steve,

You magnificent bastard - your optimizations rock. It saves 40% execution time! Unbelievable!

http://www.bennadel.com/index.cfm?dax=blog:978.view

Thanks for taking your time to help me out.

Posted by Ben Nadel on Sep 29, 2007 at 6:06 PM


Dude - You're a star.. Lovin' this code. I added The following code to convert the array to a query:-

<!--- Convert to a query --->
<cfset columnList = arrayToList(arrData[1])>
<cfset qryNew = queryNew(columnList)>
<cfloop from="2" to="#arrayLen(arrData)#" index="ptr">
<cfset itemArray = arrData[ptr]>
<cfset queryAddRow(qryNew)>
<cfloop from="1" to="#arrayLen(itemArray)#" index="ptr2">
<cfset querySetCell(qryNew, "#listGetAt(columnList, ptr2)#", itemArray[ptr2], qryNew.recordcount)>
</cfloop>
</cfloop>

Posted by Martin parry on Oct 15, 2007 at 6:51 AM


@Martin,

Glad you like it. Just watch out with your array to query algorithm as the array is not guaranteed to have all the sub-arrays the same length. Just be careful of the query dimensions.

Posted by Ben Nadel on Oct 15, 2007 at 7:12 AM


Post Comment  |  Ask Ben


Home   |   Web Log   |   ColdFusion   |   Projects   |   Resume   |   Job Form   |   Search   |   Contact
Epicenter Consulting - Custom Software Solutions for Business Evolution HostMySite.com - The Leader In ColdFusion Hosting