Parsing CSV Values Using A Standard File Format

Posted January 25, 2007 at 9:28 AM by Ben Nadel

Tags: ColdFusion

I have been working on parsing CSV values in ColdFusion and it has given me such a headache. The hardest part is trying to figure out if you are dealing with a field qualifier (ex. "ben") or dealing with an escaped qualifier (ex. "Ben is the ""bomb"""). I came up with a hacky solution that iterates over each character in the line. This is not performant and also does not handle all the qualifier situations correctly.

Tony Petruzzi suggested that I try using a Tokenizer. I had originally tried this in my first attempt but ran into similar problems involving field qualifiers. Working on it again this morning, I realized that my biggest problem was that I didn't even know what the CSV standard file format was! How can I solve a problem when I don't even know what my problem domain is?!?

After doing some quick Googling, I found this page, http://www.edoceo.com/utilis/csv-file-format.php, which listed the CSV file format standard as:

  1. Each record is one line - Line separator may be LF (0x0A) or CRLF (0x0D0A), a line seperator may also be embedded in the data (making a record more than one line but still acceptable).
  2. Fields are separated with commas. - Duh.
  3. Leading and trailing whitespace is ignored - Unless the field is delimited with double-quotes in that case the whitespace is preserved.
  4. Embedded commas - Field must be delimited with double-quotes.
  5. Embedded double-quotes - Embedded double-quote characters must be doubled, and the field must be delimited with double-quotes.
  6. Embedded line-breaks - Fields must be surrounded by double-quotes.
  7. Always Delimiting - Fields may always be delimited with double quotes, the delimiters will be parsed and discarded by the reading applications.

This makes things SOOO much easier. Knowing that an embedded delimiter or qualifier MUST be in a field that is fully qualified simplifies my life so much. Now, if I come across a token like this:

""""

I don't have to think about wether or not it's two escaped qualifiers or one escaped qualifier in a fully qualified field. Based on the standard I know that embedded qualifiers MUST be in a qualified field and hence, this is a single escaped qualifier in a qualified field (NOT two escaped qualifiers).

Based on this new information, I should be able to have the String Token version of the ColdFusion CSV parser up and running soon. But let this be a lesson - if you are trying to solve a problem, be sure you truly understand what the problem is :)

Thinking about it now though, why bother using the Tokenizer? That involves function calls. Why not just convert the row to a list using a split method. Looping over an Array has got to be faster than using a Tokenizer.




Reader Comments

Jan 25, 2007 at 9:51 AM // reply »
13 Comments

I've used the following two free tools when I needed to read CSV values. They should be helpful for anyone who wants code that does it for you or for anyone who is writing their own code and wants to see an example.

1. CSV Library (open source, CFC)

http://labs.redbd.net/projects/csvLibrary.cfm

2. cfx_text2Query (free, CFX)

http://www.emerle.net/programming/display.cfm/t/cfx_text2query


Jan 25, 2007 at 9:57 AM // reply »
11,238 Comments

Gabriel,

That looks pretty cool. Giving the algorithm a once over, it appears to be going through each character at a time. I would like to accomplish this without having to examine each character, but that may just be the best way to go.

Thanks for posting these links. I am sure they are going to be very useful. I can't use the CFX one (no install permissions), but it is good to know it is there.


Jan 25, 2007 at 10:07 AM // reply »
6 Comments

I've started doing everything on the database server side using DTS packages- makes life a whole lot easier. Although if you are using something other than SQL Server, I guess you have to go that route.


Jan 25, 2007 at 10:16 AM // reply »
11,238 Comments

Bob, good point... but if for no other reason, it is fun to figure this stuff out.


Jan 25, 2007 at 12:09 PM // reply »
11 Comments

Dealing with dirty data can always throw a kink into things. If it's a one off data import, sometimes it's just easier to open the file first in Excel, expand out the columns and look for the dirty row.

For continual usage, using the list functions are great, and when you have a known column count, you can use listlen for a partial validity check and then jump to some cleanup functions if the length doesn't match the expected result. Just make sure that you don't have any empty columns.
Try <cfset myList = ",,foo,bar">
<cfoutput>#ListLen(myList)#</cfoutput>
and you'll see what I mean.


Jan 26, 2007 at 7:39 AM // reply »
11,238 Comments

Chris,

Yeah, dirty data is something I fear, but I suppose, this, as with anything else, if you don't use it properly, it's ok if it breaks (it's not wrong to break). As far as interface type stuff, I am hoping to add this to my POI Utility.

Blank list items can be pain, especially in ColdFusion. Luckily, it seems that when you have a string and use the underlying java method String::Split( regEx ), the regular expression "[,]{1}" will split ",,," into four empty values. This works quite nicely!


Jan 27, 2007 at 10:30 AM // reply »
19 Comments

I have a home-grown CFC that I use for parsing CSV values (though I turn them into a 2D array rather than a query), and I recently retooled it to use the string.Split(chr(10)) and string.split(",") methods, i.e.

(psuedo-code)
<cfset var i = 0>
<cfset var j = 0>
<cfset destArray = inText.split(chr(10))>
<cfloop from="1" to="#arrayLen( destArray )#" index="i">
<cfset destArray[i] = destArray[i].split(",")>
<cfloop from="1" to="#arrayLen( destArray[i] )#" index="j">
<cfif left(destArray[i][j],1) eq """">
<cfif j lt arrayLen( destArray[i] )>
<cfset destArray[i][j] = listAppend( destArray[i][j], destArray[i][j+1]>
<cfelse>
<cfset arrayAppendArray( destArray[i], destArray[i+1] )>
</cfif>
</cfif>
</cfloop>
</cfloop>

An interesting thing I noticed though, is that Sring.split() at least in CFMX6.1 doesn't return a real ColdFusion array, so all of the "ArrayAPpend" type functions don't work on it. I had to write a custom function to convert the Java Array into a ColdFusion array.

However, net result: About a 90% performance savings over the old character-by-character method I'd been using before. Now if I could just find a way to speed up my bulk inserts...


Jan 27, 2007 at 10:31 AM // reply »
19 Comments

I have a home-grown CFC that I use for parsing CSV values (though I turn them into a 2D array rather than a query), and I recently retooled it to use the string.Split(chr(10)) and string.split(",") methods, i.e.

(incomplete code)
<cfset var i = 0>
<cfset var j = 0>
<cfset destArray = inText.split(chr(10))>
<cfloop from="1" to="#arrayLen( destArray )#" index="i">
<cfset destArray[i] = destArray[i].split(",")>
<cfloop from="1" to="#arrayLen( destArray[i] )#" index="j">
<cfif left(destArray[i][j],1) eq """">
<cfif j lt arrayLen( destArray[i] )>
<cfset destArray[i][j] = listAppend( destArray[i][j], destArray[i][j+1]>
<cfelse>
<cfset arrayAppendArray( destArray[i], destArray[i+1] )>
</cfif>
</cfif>
</cfloop>
</cfloop>

An interesting thing I noticed though, is that Sring.split() at least in CFMX6.1 doesn't return a real ColdFusion array, so all of the "ArrayAPpend" type functions don't work on it. I had to write a custom function to convert the Java Array into a ColdFusion array.

However, net result: About a 90% performance savings over the old character-by-character method I'd been using before. Now if I could just find a way to speed up my bulk inserts...


Jan 27, 2007 at 11:01 AM // reply »
13 Comments

Assuming you're doing bulk inserts into a SQL server database, I've been able to speed that process up by inserting 100 rows using one database call in the format below.

INSERT INTO memberTable(membername,memberLastName)
SELECT 'John','Smith'
UNION ALL
SELECT 'Julia','Roberts'
UNION ALL
SELECT 'Bruce','Wayne'

100 was a trivial number for me. At 1000, the inserts took longer than if I did them 1 at a time. But at 100, the process took significantly less time than if I inserted them 1 at a time.

I wasn't looking for the optimal number. So 100 shortened the time it took to an acceptable time for me.

Gabriel


Jan 28, 2007 at 11:03 PM // reply »
11,238 Comments

Adam,

Yes, I also ran into the Split() not returning a true ColdFusion array in this sort of algorithm. I just finished an algorithm for handling text based only on delimiters. My next step is have CSV parsing that handles delimiters and line breaks. The tricky thing is that line breaks might be escaped... actually, now that I am writing this, it is occurring to me that these are EXACTLY the same problems. I can use the same string tokenizer to break it up by lines and then by delimiters in two passes.


Ron
Feb 24, 2009 at 6:18 PM // reply »
3 Comments

Hello!

I came across an issue with cfx_text2Query and I wonder if anybody has an idea for my problem. It's about the character set because I need to use german umlauts. ;./

On my windows machine everything works fine importing the csv with the umlauts (the file is ISO-8859-1 as Firefox tells me)
(config: win2k, CF7, and the head of the .cfm reads:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<cfprocessingdirective pageencoding="iso-8859-1">
<HTML>
<HEAD>
...

Now to the problem: When using the same script on the production machine (wich runs under Linux) I get "?" instead of any umlaut. :(

Bloody linux?! Any ideas? :-)

Ronald


Feb 24, 2009 at 6:21 PM // reply »
11,238 Comments

@Ronald,

I am always lacking experience with extended characters. Sorry :(


Apr 27, 2010 at 3:16 AM // reply »
1 Comments

@Ronald

I had the same problem with german umlauts from a csv import file. I set the charset of <cffile> tag to "iso-8859-1" and it worked.

<cffile action="read" file="/usr/local/httpd/htdocs/yoursite/umlauts.csv" variable="csvfile" charset="iso-8859-1">

<cfloop index="index" list="#csvfile#" delimiters="#chr(10)##chr(13)#">

#listgetAt('#index#',1,';')# #listgetAt('#index#',2,';')# #listgetAt('#index#',3,';')#
<br />

</cfloop>



Post A Comment

Comment Etiquette: Please do not post spam. Please keep the comments on-topic. Please do not post unrelated questions or large chunks of code. And, above all, please be nice to each other - we're trying to have a good conversation here.

Please review the following issues:

Author Name:


Author Email:

Author Website:

Comment:

Supported HTML tags for formatting: <strong>bold</strong>   <em>italic</em>   <code>code</code>







  • Help Wanted - Find Your Next ColdFusion Job
Ben Nadel's Company - Epicenter Consulting Recent Blog Comments
May 20, 2013 at 11:45 AM
Using jQuery's Animate() Step Callback Function To Create Custom Animations
This is really useful. I found out that you don't actually have to use a dummy css property (surprisingly). To animate a property in a linear-gradient for instance I did this this.css('someLinearGra ... read »
May 20, 2013 at 10:51 AM
Using A Dynamic Column Name With ValueList() In ColdFusion
@Josh, Oh snap! You're totally right! I'm not sure I've ever tried that. I did know that you can call a number of other array-methods on ColdFusion query columns: http://www.bennadel.com/blog/167 ... read »
May 20, 2013 at 10:45 AM
Using A Dynamic Column Name With ValueList() In ColdFusion
@Ben - I believe you can achieve the same functionality with ColdFusion's built in ArrayToList() function. ArrayToList( users[ "id" ] ); ... read »
May 20, 2013 at 10:21 AM
My Experience With AngularJS - The Super-heroic JavaScript MVW Framework
Is there any error logging and handling framework in angularjs, if not then in what way I can do this. ... read »
May 19, 2013 at 2:31 PM
My Experience With AngularJS - The Super-heroic JavaScript MVW Framework
It's funny really just how well that image describes the way I would imagine most people that go with angular for some project is. I have had a similar roller-coaster ride with it as well, but not qu ... read »
May 17, 2013 at 7:42 PM
HashKeyCopier - An AngularJS Utility Class For Merging Cached And Live Data
Ben - thanks so much for posting these Angular articles and findings, they've been a huge help towards learning one of the more 'complex' JavaScript frameworks out there (IMO). I have been using Angu ... read »
May 16, 2013 at 5:01 PM
UPDATE: Parsing CSV Data Files In ColdFusion With csvToArray()
Your code was the closest thing I've found to obtaining some direction for converting ISO fields to values that CF can translate properly. Thank you for posting! ... read »
May 15, 2013 at 6:07 PM
Making SOAP Web Service Requests With ColdFusion And CFHTTP
Ben, you once again saved my bacon at work. Thank you, thank you, thank you! ... read »
InVision App - Prototyping Made Beautiful With Prototyping Tools