Regular Expressions Make CSV Parsing In ColdFusion So Much Easier (And Faster)
Posted September 28, 2007 at 7:29 AM
Today, I was working on some data parsing when it hit me like a bolt of lightening! Why not just use Regular Expressions to parse CSV (comma separated values) data files? For those of you who have seen my previous attempts at parsing CSV data files using ColdFusion, you will see that having to handle embedded field qualifiers and data delimiters made it a character by character parsing problem since you can no longer treat it like a list of lists (both comma a new line delimited). It was a huge amount of code to do a huge amount of work.
But then, as I was working today, suddenly I realized that working with embedded field qualifiers was exactly like working with embedded quotes in HTML tag attribute value parsing; it's just a pattern. So, what is the pattern of the CSV file. At it's highest level, it's a field value followed by an optional delimiter. I say "optional" delimiter because the last data field in the file will not be succeeded by a delimiter. The field value is then an optional qualified value containing zero or more characters. The allowable characters in the field are determined be the qualification of the data.
The regular expression I came up with was so short it made me giggle:
("(?:[^"]|"")*"|[^",\r\n]*)(,|\r\n?|\n)?
This ColdFusion compatible regular expression captures two groups; the first group is the field value (zero or more characters) and the second group is the optional succeeding delimiter. If this is totally not readable (as most regular expressions are not), I have converted this to a verbose regular expression (?x) with my typical amount of commenting so that you can really see how the pattern of the data field can be matched:
Launch code in new window » Download code as text file »
- <!--- Create verbose regular expression. --->
- <cfsavecontent variable="strRegex">(?x)
-
- ## Every CSV "Group" consists of a field value
- ## followed by an optional delimiter. This delimiter,
- ## if it exists, will be either a field delimiler or
- ## a line delimiter. If no delimiter exists, then we
- ## are at the end of the file.
-
- ## Let's get the field value. We need to consider two cases
- ## in field values. Either the value is qualified or it is
- ## not. According to standards, values that have an
- ## embedded field or line delimiter or qualifier, MUST BE
- ## contained in a qualified field.
-
- (
- ## Because the qualified fields are the exception case,
- ## we want to check for those first. A qualified field
- ## can contain non quotes AND escaped quotes.
-
- "(?:[^"]|"")*"
-
- | ## OR
-
- ## If we did not match the qualified field token, then
- ## we need to check for the non-qualified field token
- ## which can be zero or more characters NOT consisting
- ## of any qualifier or delimiter.
-
- [^",\r\n]*
-
- )
-
- ## Now that we have captured the zero+ length field value,
- ## we can get the token. Remember, if we are at the end
- ## of the file, the token will not exist and therefore,
- ## this group must be optional (?).
-
- (
- ## Comma
-
- ,
-
- ## OR return with optional newline.
-
- |\r\n?
-
- ## OR just new line (I include this option for safer
- ## text values across operating systems (but this might
- ## not be necessary).
-
- |\n
- )?
-
- </cfsavecontent>
So now that you see how simple the pattern is, let's use it to start parsing our CSV data in ColdFusion. Luckily, from within ColdFusion, we have access to the Java Pattern and the Java Pattern Matcher which make our lives almost too easy when it comes to iterating over a string. As we iterate over the string, we are going to store our CSV values in an array of arrays in which each sub array will represent a row from the CSV data file.
Launch code in new window » Download code as text file »
- <!---
- Save CSV data values. Here, we are creating a CSV data
- value that has both qualified and non-qualified field
- values, populated and empty field values, and embedded
- field qualifiers and field/line delimiters.
- --->
- <cfsavecontent variable="strCSV">
- "Name","Nick Name","Age","Hair Color"
- Kim,"Kim ""Hot Legs"" Smith",24,"Brunette"
- "Sarah Vivenz, II","Stubs",27,"Brunette"
- "Kit Williams",Kitty,34,Blonde,,,
- "Even
- Values With
- Embedded Line Breaks"
- </cfsavecontent>
-
-
- <!--- Trim data values. --->
- <cfset strCSV = Trim( strCSV ) />
-
-
- <!---
- Get the regular expression to match the tokens. I have
- put the field value on the first line and delimiters on
- the second line for easier reading.
- --->
- <cfset strRegEx = (
- "(""(?:[^""]|"""")*""|[^"",\r\n]*)" &
- "(,|\r\n?|\n)?"
- )/>
-
-
- <!---
- Create a compiled Java regular expression pattern object
- based on the pattern above.
- --->
- <cfset objPattern = CreateObject(
- "java",
- "java.util.regex.Pattern"
- ).Compile(
- JavaCast( "string", strRegEx )
- )
- />
-
- <!---
- Get the pattern matcher for our target text (the CSV data).
- This will allows us to iterate over all the data fields.
- --->
- <cfset objMatcher = objPattern.Matcher(
- JavaCast( "string", strCSV )
- ) />
-
-
- <!---
- Create an array to hold the CSV data. We are going
- to create an array of arrays in which each nested
- array represents a row in the CSV data file.
- --->
- <cfset arrData = ArrayNew( 1 ) />
-
- <!--- Start off with a new array for the new data. --->
- <cfset ArrayAppend( arrData, ArrayNew( 1 ) ) />
-
-
- <!---
- Here's where the magic is taking place; we are going
- to use the Java pattern matcher to iterate over each
- of the CSV data fields using the regular expression
- we defined above.
-
- Each match will have at least the field value and
- possibly an optional trailing delimiter.
- --->
- <cfloop condition="objMatcher.Find()">
-
- <!--- Get the field token value. --->
- <cfset REQUEST.Value = objMatcher.Group(
- JavaCast( "int", 1 )
- ) />
-
- <!--- Remove the field qualifiers (if any). --->
- <cfset REQUEST.Value = REQUEST.Value.ReplaceAll(
- JavaCast( "string", "^""|""$" ),
- JavaCast( "string", "" )
- ) />
-
- <!--- Unesacepe embedded qualifiers (if any). --->
- <cfset REQUEST.Value = REQUEST.Value.ReplaceAll(
- JavaCast( "string", "(""){2}" ),
- JavaCast( "string", "$1" )
- ) />
-
- <!--- Add the field value to the row array. --->
- <cfset ArrayAppend(
- arrData[ ArrayLen( arrData ) ],
- REQUEST.Value
- ) />
-
-
- <!---
- Get the delimiter. If no delimiter group was matched,
- this will destroy the variable in the REQUEST scope.
- --->
- <cfset REQUEST.Delimiter = objMatcher.Group(
- JavaCast( "int", 2 )
- ) />
-
-
- <!--- Check for delimiter. --->
- <cfif StructKeyExists( REQUEST, "Delimiter" )>
-
- <!---
- Check to see if we need to start a new array to
- hold the next row of data. We need to do this if the
- delimiter we just found is NOT a field delimiter.
- --->
- <cfif (REQUEST.Delimiter NEQ ",")>
-
- <!--- Start new row data array. --->
- <cfset ArrayAppend(
- arrData,
- ArrayNew( 1 )
- ) />
-
- </cfif>
-
- <cfelse>
-
- <!---
- If there is no delimiter, then we are done parsing
- the CSV file data. Break out rather than just ending
- the loop to make sure we don't get any extra data.
- --->
- <cfbreak />
-
- </cfif>
-
- </cfloop>
-
-
- <!--- Dump out CSV data array. --->
- <cfdump
- var="#arrData#"
- label="CSV File Data"
- />
There's a lot of commenting going on there (typical Me!), but if you filter all that out, this is only a few lines of code. That's how easy regular expressions make our lives! Compare that with any of my previous ColdFusion CSV parsing solutions to see that this one is a small fraction of the size. And, best part is, it works! Running the above code, we get the following CFDump output:
| | | | ||
| | ![]() | | ||
| | | |
It works on a small file, sure piece of cake, but what about performance? Sometimes we (ok, maybe just me) get sucked so hard into how sexy Regular Expressions are, that we lose sight of the big picture and we end up going down a path that is less performant in the long run. To test this, I am going to create a beefy CSV file with this code:
Launch code in new window » Download code as text file »
- <!--- Create data file path. --->
- <cfset strCSVPath = ExpandPath( "data.csv" ) />
-
- <!---
- Create sample data for a web site. Make this data beefy
- to simulate what kind of stuff might actually be coming
- out of a system-generated CSV file.
- --->
- <cfsavecontent variable="strCSV">"Kinky Solutions: A Student's Perspective","http://www.bennadel.com","Ben J. Nadel","ColdFusion,Web Development,CSS,Javascript,jQuery,SQL,SEO Optimization",,,09/21/2007 07:31:15,true,true,false,"5,340","34",,true</cfsavecontent>
-
- <!--- Repeat string 10,000 times. --->
- <cfset strCSV = RepeatString(
- strCSV & Chr( 13 ) & Chr( 10 ),
- (10 * 1000)
- ) />
-
- <!--- Write data to file. --->
- <cffile
- action="write"
- file="#strCSVPath#"
- output="#strCSV#"
- />
This creates a 10,000 line CSV file with over 2,150,000 characters. This file comes out to be just over 2 megabytes in size. Now, to make sure I am not messing with the time trials, I am running this script separately than the parsing code.
Now for the test! I ran the current version that employs regular expressions to do the parsing against my previous version which checks characters.
Previous CSV Parsing Algorithm
Average parse time: 100,000 ms (100 seconds)
Regular Expression CSV Parsing Algorithm
Average parse time: 11,000 ms (11 seconds)
Holy Cow! The regular expression solution is about 10 times faster than the previous solution! And, this was no rinky-dinky example; this was a seriously large CSV file (at least in my world).
Just for fun, I regenerated the above test CSV file, except this time, I created a 50,000 record CSV file. This file has 10,750,000 characters in it and comes out at well over 10 megabytes. I didn't even want to bother running this in the old solution. However, I am very pleased to say that all 50,000 records were successfully parsed in just 57,000 ms (57 seconds)! That's pretty sweet-ass-sweet.
Let's stop for a second and just look at the performance scaling. The first run went 10,000 records in 11,000 ms. That comes out to be 909 records per second. The second run went 50,000 records in 57,000 ms. That comes out to be 877 records per second. As the data set grows, this thing is just gonna EAT UP MEMORY (creating this massive array in RAM), but as you can see, the efficiency of the algorithm is fairly constant. This is some awesome news.
| | | | ||
| | ![]() | | ||
| | | |
Download Code Snippet ZIP File
Post Comment | Ask Ben | Other Searches | Print Page
Newer Post
ColdFusion 8 ImageDrawTextArea() (Inspired By Barney Boisvert!)
Older Post
Java String Buffer Treated As String In ColdFusion (When Needed)
Reader Comments
Holy Pseudo-Random Strings Of Characters, Batman! :)
Firstoff, very cool.
Secondly, I have a question that will wither mean I'm starting to understand Regular Expressions or... well, not.
It looks to me like you've got two groupings that create unused backreferences, namely ("(?:[^"]|"")*"|[^",\r\n]*) and (,|\r\n?|\n) If I understand this correctly, wouldn't (?:"(?:[^"]|"")*"|[^",\r\n]*)(?:,|\r\n?|\n) give better performance because you're not creating unused backreferences?
Or am I missing something?
@Matt,
True, the groups do create back references that can use within the current regular expression. However, it makes the matched pattern much more convenient when grabbing the matched text when you capture the values.
If none of the groups were capturing, I could still grab the whole matching using:
objMatcher.Group()
However, it would not be as easy to figure out where the field value ends and the !optional! delimiter begins. By allowing the field and the delimiter to be captured via a group, then it makes it much more convenient to use:
objMatcher.Group( 1 )
... to get the field value, and
objMatcher.Group( 2 )
... to get the optional delimiter.
So yes, it might be faster in the pattern to not capture, but overall, capturing the groups make the algorithm easier to code and to understand.
:)
@Ben
Gotcha. Thanks.
@Tony,
I love those comics. I need to get that RegEx t-shirt: http://xkcd.com/store/
Here, we use the java Ostermiller CSV parser
http://ostermiller.org/utils/CSV.html
On my test server, your code running at around 4000 - 5000ms
Ostermiller's is under 300ms
We do A LOT of data importing and exporting, so I've spent tons and tons of time looking for the best way to parse files.
The only CF I have to deal with is to call a function:
<cfscript>
function parseCSV(csvPath) {
jFIS = createObject( "java", "java.io.FileInputStream" );
jFIS.init("#ARGUMENTS.csvPath#");
csvParser = createObject( "java", "com.Ostermiller.util.CSVParser" );
csvParser.init( jFIS );
csvAsArray = csvParser.getAllValues();
csvParser.close();
jFIS.close();
return csvAsArray;
}
</cfscript>
<cfset csvArray = parseCSV(strCSVPath)>
@Jason,
Yeah, a real Java package is probably going to outperform anything that I write. But keep in mind that mine runs basic ColdFusion without any additions. I assume that for the CSVParser to work, I would have to add a Java package to the ColdFusion install. This might not be a problem for most people, but I do not have access to do that sort of stuff.
But regardless, thanks for testing it up against the "big boys". Good to see where I fall in line :)
Email me your size and address and it will be ordered tonight :)
Hello Chaps,
I do a fair amount, well, in fact, I do a large amount of CSV data parsing through a bunch of web services and things like that, and this is a very interesting topic for me.
Untill now I've pretty much cut CF out of the loop entirely as I dont really need to do any real 'manipulation' on the data, I just want to get it parsed into my database, so in the past I've always used a DTS package for SQL Server to do this for me, and more recently now I'm running 2k5 just opted for a stored proc that recieves a string for the file path of the CSV, it then does a BULK LOAD on the file, which seems to work very nicely.
Am I still handling this in the best fasion? or would one of these JAVA classes mixed up with an insert type query be my best bet?
I'm still betting that SQL handles this stuff infinatly faster than CF.
Rob
Ben, since at http://www.bennadel.com/index.cfm?dax=blog:410.view you asked me to see if I had any suggestions for your CSV parser here, in Java I would use something like the following:
\G(,|\r?\n|\r|^)(?:"([^"]*+(?>""[^"]*+)*)"|([^",\r\n]*+))
While the above is no longer ColdFusion-compatible, I assume that's not a big deal since you're using Java's regex engine anyway. The somewhat more advanced features it uses include possessive quantifiers, an atomic group, Jeffrey Friedl's "unrolling the loop" pattern, and Java's "\G" (just as an extra assurance that the regex's bump-along mechanism always puts us where we want to be). I've also used two different sets of parentheses to avoid having to remove the outer quotes from quoted values in post-processing. Instead, we can just check if a value was captured to backreference 2, and if so, use it, otherwise use backreference 3.
One other thing I would recommend after a brief lookover is changing the replacement of (""){2} with $1 to simply replacing "" with ". Also, if you're going to use a regex to do that, compile the regex outside the loop.
@Steve,
You magnificent bastard - your optimizations rock. It saves 40% execution time! Unbelievable!
http://www.bennadel.com/index.cfm?dax=blog:978.view
Thanks for taking your time to help me out.
Dude - You're a star.. Lovin' this code. I added The following code to convert the array to a query:-
<!--- Convert to a query --->
<cfset columnList = arrayToList(arrData[1])>
<cfset qryNew = queryNew(columnList)>
<cfloop from="2" to="#arrayLen(arrData)#" index="ptr">
<cfset itemArray = arrData[ptr]>
<cfset queryAddRow(qryNew)>
<cfloop from="1" to="#arrayLen(itemArray)#" index="ptr2">
<cfset querySetCell(qryNew, "#listGetAt(columnList, ptr2)#", itemArray[ptr2], qryNew.recordcount)>
</cfloop>
</cfloop>
@Martin,
Glad you like it. Just watch out with your array to query algorithm as the array is not guaranteed to have all the sub-arrays the same length. Just be careful of the query dimensions.
Ben,
Thank you for forwarding me to this article. Works even faster then the last and with less code!!!
one quick question, why can't i assign vars to each row in the cfloop output? Usually the following works...
Can you throw in an example how i can set a var to each row so that i can insert into the db? Thanks in advance
This isn't working:
<cfloop from="1" to="#arrayLen(arrData)#" index="i">
<cfoutput>#arrData[i]#</cfoutput> <br />
</cfloop>
@Rob,
Not to keep bumping you around, but I actually took this algorithm and wrapped it up in a user defined function (UDF):
http://www.bennadel.com/index.cfm?dax=blog:991.view
As far as your question, I am not really sure what you mean. Can you explain further? Maybe using the UDF above will help you out a bit.
Ben,
I downloaded the UDF and the array dump of my CSV is working great. I am just having a problem setting variables to each column in each row in the csv.
For my particular CSV file, I have 19 columns and each column is shown in the dump but what var name do i use to output that data in a loop?
see below. this is the copy/paste of the dump. I'm sure it's easy but how do what syntax do i use to set a var to each row in the array? The normal methods i use are not working right... thanks.
array
1 501
2 9938
3 2002
4 BMW
5 128i
6 Sedan
7 1B7HL48X62S557215
8 0
9 24449
10 22000
11 Red
12 Tan, blue
13 Automatic
14 [empty string]
15 Description 1
16 Car
17 V6
18 Front Wheel Drive
19 Gasoline
@Rob,
The result of the UDF is an array of arrays. Therefore, you need to refer to both indexes:
#arrData[ intRowIndex ][ intFieldIndex ]#
I think the problem is that you are only referring to the row index. That gives you another array which requires a second index.
Yep! That's it. Thanks, i'm used to using structures within arrays and that seemed tricky for me at first
<cfoutput>#arrData[1][1]#</cfoutput>
etc...
Great! I can slee well tonight now :)
You da man.
Great piece of code Ben. I searched and experimented with all kinds of methods found online; this was a godsend and by for the easiest solution!
@Rob,
Glad you like - be sure to checkout the faster, updated UDF:
http://www.bennadel.com/blog/991-CSVToArray-ColdFusion-UDF-For-Parsing-CSV-Data-Files.htm
I needed a fast solution to import a csv list of countries and found your UDF. But did not manage to implement it.
Maybe it´s very simple to call a UDF. I placed it in the same folder from where I called it but no luck. Searched Adobe Knowledge base to found out exactly where to put it, but run out of time so I skiped it and instead used: http://sourceforge.net/projects/csvtosql, then I also had a sql dump file for mySQL server.
I agree that regular expressions can be realy handy sometimes!
@Göran,
I'm glad you found something that worked out.
Thanks Ben,
This bit of code saved me a lot of work while battling with a 68000 line CSV file. I did modify it slightly as I was reading in the file line by line, so only needed the one dimensional array that would be replaced come the next line of the file. Certainly saved on memory usage.
I blogged about using it while battling a memory leak issue relating to processing the data from that CSV.
http://misterdai.wordpress.com/2009/06/24/when-not-to-use-cfqueryparam/
Cheers!
Dave (Mister Dai)
@David,
Glad to help out. You should check out Steve Levithan's modification to this work in a later post:
http://www.bennadel.com/blog/991-CSVToArray-ColdFusion-UDF-For-Parsing-CSV-Data-Files.htm
How do you handle/detect bad data, or where different rows, may not have the right number of fields?
Or does it just assume all incoming data is correct, after parsing it through regex?
@Craig,
That's an interesting question. CSV is a "standard". If someone where to deviate from the standard, I am not sure what to do. The regex might just not match, or find things in the incorrect rows. Really, this depends on valid data (just as your compiler depends on valid code).
The reason being, is that many csv projects, have where you can't control the validity of the output, which is why we have to do some validity checking...
I am working on a project of importing csv files, and i am trying to discover ways, to help clarify what is correct and incorrect rows of data.
Because sadly you can't always assume that the incoming data is pristine and so i am just trying to take precautions...
I am also worried about memory leaks, and performance issues, and so forth...
Has there been any studies done on both the different ways to parse csv, and what methods more speedily import the data?
I am most likely going to be doing most of my cleaning and validating of data, and breaking the files into smaller chunks using your really nice code...
But once i have valid and clean and pristine csv data, i will most likely have to use sql server stored procedure to automate that part...
Was hoping to hear more best practices from other cfers that do lots of csv importing.
As well as your own experiences...
Thank You.
@Craig,
I know there are some very powerful database-oriented ways to import the CSV data, but I don't have much experience with that. I use this kind of method and it seems to work well, but I do more exporting than importing.
Sorry I don't have more "best practice" type advice to offer.
@Craig,
You may want to have a look at my blog post on my battles with extremely large CSV file and memory leakage (all due to compnents, cfquery and cfqueryparam usage, not Ben's CSV parsing routine). http://bit.ly/ZknIx
As far as validation is concerned, I have to take an all or nothing approach. Everything was wrapped in a cftransaction / cftry with a rollback if there were any errors. Any data which wasn't valid (e.g. date, barcode) would throw an error.
For minimum memory usage, use a stored procedure, read the CSV line-by-line (you can easily modify the CSV regex to do so), avoid CFQueryParam if possible, use CFQuery for calling the stored proc, also try to buffer your SQL into an array and execute it every 50 CSV rows (tweak this value for best performance).
As for best practices, I'm not sure of many but I'd say you're better off using the database to do the CSV importing if possible. Maybe handling just validation in CF and then passing it over to the Database for the actual import process.
@David,
How does a database-csv-import work? Do you basically set up a datasource to a given file name? And then have CF place the new CSV in that file... which is how I think using an Excel file as a datasource works (although I haven't done that in years).
@Ben,
Well for MS SQL Server 2000+ there is "BULK INSERT". To use it you'd have to validate the CSV file with ColdFusion first, copy it to a drive on the database server and then via CF just use the CFQuery tag to execute the bulk insert SQL.
http://msdn.microsoft.com/en-us/library/aa225968%28SQL.80%29.aspx
The only problem I can see with this is that it doesn't comply to CSV standards and only works with a single delimiter, no support for double quotes and extra commas. So it might need the CSV pre-processed to replace the comma delimiter with a control character and remove the double quotes where applicable.
Although I have seen something mentioned about a tool called bcp and providing it with file format information. But that would have to use a CFExecute :(
Hope this makes sense :)
How would you modify this code if your delimiter was the hat (^) character? I ran the CSV function defining the delimiter and it worked great. I would like to try this one as well.
Thanks for all your hard work and sharing.
@Dwayne,
You would need to add the "^" character to the list of delimiters. The only thing to be careful of is that "^" is a special character in Regular Expressions. As such, you would need to escape it:
... In character groups (since it is, by default, a "negation" selector):
[\^]
... In standard patterns (since it is, by default, a "string start" delimiter):
\^






