Parsing CSV Data Using ColdFusion
Posted January 22, 2007 at 4:03 PM
As part of my exploration of writing, reading, and creating Microsoft Excel documents using ColdFusion, I have come across the need to parse comma-separated-value (CSV) data files. While this seems at first like a relatively simple task, I soon found out that it was ANYTHING but simple. It's one thing to worry about lists (for which ColdFusion is the bomb-diggity), but it's an entirely different thing to worry about lists that have field qualifiers, escaped qualifiers, escaped qualifiers that might be empty fields, and non-qualified field values all rolled into one.
I tried looking it up in Google but could not find any satisfactory algorithms (translates to: code that I could understand). Everything on CSV seems to be in Java and none of the stuff on CFLib.org seems to comply with the range of CSV values (especially qualified fields). So, in typical blood-and-guts fashion, I sat down and tried to write my own algorithm. This proved to be easy at first until I found out that my approach was highly flawed. I went through about three different implementations of the algorithm over the weekend before I came up with something that seemed to work satisfactorily.
It has to evaluate the data one character at a time, which probably won't scale or perform nicely. I would have liked to harness the power of CFHttp to convert CSV files to queries, but I could not get CFHttp to work on the LOCAL file system (i.e. a URL that begins with "file:"). If anyone knows of a great way to do this, please let me know. I suppose that I could have written a temporary file to a public folder, performed a CFHttp to it, and then deleted it, but that just felt a bit "hacky." However, in the end that might just prove to be the way to go.
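For what it's worth, the "hacky" temporary-file idea might look something like the sketch below. This is untested and rests on several assumptions: the web root is writable, the server can reach itself over plain HTTP at CGI.SERVER_NAME and CGI.SERVER_PORT, the CSV has a regular shape (the same number of fields in every row, with header values that are valid column names), and strCSVData, strTempName, strTempPath, and qCSV are all made-up names.

```cfml
<!---
    Assumption: strCSVData holds the raw CSV text and the web
    root is both writable and served over HTTP.
--->
<cfset strTempName = "csv_#CreateUUID()#.txt" />
<cfset strTempPath = ExpandPath( "/#strTempName#" ) />

<!--- Write the CSV data to a temporary, publicly reachable file. --->
<cffile
    action="write"
    file="#strTempPath#"
    output="#strCSVData#"
    addnewline="false"
    />

<cftry>

    <!--- Let CFHttp parse the delimited response into a query. --->
    <cfhttp
        method="get"
        url="http://#CGI.SERVER_NAME#:#CGI.SERVER_PORT#/#strTempName#"
        name="qCSV"
        delimiter=","
        textqualifier=""""
        firstrowasheaders="true"
        />

    <cfcatch>
        <!--- Clean up the temp file before passing the error along. --->
        <cffile action="delete" file="#strTempPath#" />
        <cfrethrow />
    </cfcatch>

</cftry>

<!--- Remove the temporary file and inspect the resulting query. --->
<cffile action="delete" file="#strTempPath#" />
<cfdump var="#qCSV#" label="CSV via CFHttp" />
```

Ragged data like the sample used later in this post would probably not survive this route, which is part of why I went with my own parser.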
So anyway, this is what I have come up with. It is a function that takes either a chunk of CSV data or a file path to a CSV data file (text file) and converts it to an array of arrays. It assumes that each record is separated by a return character followed optionally by a new line. Not sure if that is cross system compliant, but heck, this is my first attempt:
<cffunction
    name="CSVToArray"
    access="public"
    returntype="array"
    output="false"
    hint="Takes a delimited text data file or chunk of delimited data and converts it to an array of arrays.">

    <!--- Define the arguments. --->
    <cfargument
        name="CSVData"
        type="string"
        required="false"
        default=""
        hint="This is the raw CSV data. This can be used instead of a file path."
        />

    <cfargument
        name="CSVFilePath"
        type="string"
        required="false"
        default=""
        hint="This is the file path to a CSV data file. This can be used instead of a text data blob."
        />

    <cfargument
        name="Delimiter"
        type="string"
        required="false"
        default=","
        hint="The character that separates fields in the CSV."
        />

    <cfargument
        name="Qualifier"
        type="string"
        required="false"
        default=""""
        hint="The field qualifier used in conjunction with fields that have delimiters (not used as delimiters ex: 1,344,343.00 where [,] is the delimiter)."
        />


    <!--- Define the local scope. --->
    <cfset var LOCAL = StructNew() />

    <!---
        Check to see if we are dealing with a file. If we are,
        then we will use the data from the file to overwrite
        any csv data blob that was passed in.
    --->
    <cfif (
        Len( ARGUMENTS.CSVFilePath ) AND
        FileExists( ARGUMENTS.CSVFilePath )
        )>

        <!---
            Read the data file directly into the arguments scope
            where it can override the blob data.
        --->
        <cffile
            action="READ"
            file="#ARGUMENTS.CSVFilePath#"
            variable="ARGUMENTS.CSVData"
            />

    </cfif>


    <!---
        ASSERT: At this point, whether we got the CSV data
        passed in as a data blob or we read it in from a
        file on the server, we now have our raw CSV data in
        the ARGUMENTS.CSVData variable.
    --->


    <!---
        Make sure that we only have a one character delimiter.
        I am not going traditional ColdFusion style here and
        allowing multiple delimiters. I am trying to keep
        it simple.
    --->
    <cfif NOT Len( ARGUMENTS.Delimiter )>

        <!---
            Since no delimiter was passed in, use the default
            delimiter which is the comma.
        --->
        <cfset ARGUMENTS.Delimiter = "," />

    <cfelseif (Len( ARGUMENTS.Delimiter ) GT 1)>

        <!---
            Since a multi-character delimiter was passed, just
            grab the first character as the true delimiter.
        --->
        <cfset ARGUMENTS.Delimiter = Left(
            ARGUMENTS.Delimiter,
            1
            ) />

    </cfif>


    <!---
        Make sure that we only have a one character qualifier.
        I am not going traditional ColdFusion style here and
        allowing multiple qualifiers. I am trying to keep
        it simple.
    --->
    <cfif NOT Len( ARGUMENTS.Qualifier )>

        <!---
            Since no qualifier was passed in, use the default
            qualifier which is the quote.
        --->
        <cfset ARGUMENTS.Qualifier = """" />

    <cfelseif (Len( ARGUMENTS.Qualifier ) GT 1)>

        <!---
            Since a multi-character qualifier was passed, just
            grab the first character as the true qualifier.
        --->
        <cfset ARGUMENTS.Qualifier = Left(
            ARGUMENTS.Qualifier,
            1
            ) />

    </cfif>


    <!--- Create an array to hold the rows of data. --->
    <cfset LOCAL.Rows = ArrayNew( 1 ) />

    <!---
        Split the CSV data into rows of raw data. We are going
        to assume that each row is delimited by a return
        character, optionally followed by a new line character.
    --->
    <cfset LOCAL.RawRows = ARGUMENTS.CSVData.Split(
        "\r\n?"
        ) />


    <!--- Loop over the raw rows to parse out the data. --->
    <cfloop
        index="LOCAL.RowIndex"
        from="1"
        to="#ArrayLen( LOCAL.RawRows )#"
        step="1">


        <!--- Create a new array for this row of data. --->
        <cfset ArrayAppend( LOCAL.Rows, ArrayNew( 1 ) ) />


        <!--- Get the raw data for this row. --->
        <cfset LOCAL.RowData = LOCAL.RawRows[ LOCAL.RowIndex ] />


        <!---
            Replace out the double qualifiers. Two qualifiers in
            a row act as a qualifier literal (OR an empty
            field). Replace these with a single character to
            make them easier to deal with. This is risky, but I
            figure that Chr( 1000 ) is something that no one
            is going to use (or is it????).
        --->
        <cfset LOCAL.RowData = LOCAL.RowData.ReplaceAll(
            "[\#ARGUMENTS.Qualifier#]{2}",
            Chr( 1000 )
            ) />

        <!--- Create a new string buffer to hold the value. --->
        <cfset LOCAL.Value = CreateObject(
            "java",
            "java.lang.StringBuffer"
            ).Init()
            />


        <!---
            Set an initial flag to determine if we are in the
            middle of building a value that is contained within
            quotes. This will alter the way we handle
            delimiters - as delimiters or just character
            literals.
        --->
        <cfset LOCAL.IsInField = false />


        <!--- Loop over all the characters in this row. --->
        <cfloop
            index="LOCAL.CharIndex"
            from="1"
            to="#LOCAL.RowData.Length()#"
            step="1">


            <!---
                Get the current character. Remember, since Java
                is zero-based, we have to subtract one from our
                index when getting the character at a
                given position.
            --->
            <cfset LOCAL.ThisChar = LOCAL.RowData.CharAt(
                JavaCast( "int", (LOCAL.CharIndex - 1))
                ) />


            <!---
                Check to see what character we are dealing with.
                We are interested in special characters. If we
                are not dealing with special characters, then we
                just want to add the char data to the ongoing
                value buffer.
            --->
            <cfif (LOCAL.ThisChar EQ ARGUMENTS.Delimiter)>

                <!---
                    Check to see if we are in the middle of
                    building a value. If we are, then this is a
                    character literal, not an actual delimiter.
                    If we are NOT building a value, then this
                    denotes the end of a value.
                --->
                <cfif LOCAL.IsInField>

                    <!--- Append char to current value. --->
                    <cfset LOCAL.Value.Append(
                        LOCAL.ThisChar.ToString()
                        ) />


                <!---
                    Check to see if we are dealing with an
                    empty field. We will know this if the value
                    in the field is equal to our "escaped"
                    double field qualifier (see above).
                --->
                <cfelseif (
                    (LOCAL.Value.Length() EQ 1) AND
                    (LOCAL.Value.ToString() EQ Chr( 1000 ))
                    )>

                    <!---
                        We are dealing with an empty field so
                        just append an empty string directly to
                        this row data.
                    --->
                    <cfset ArrayAppend(
                        LOCAL.Rows[ LOCAL.RowIndex ],
                        ""
                        ) />


                    <!---
                        Start new value buffer for the next
                        row value.
                    --->
                    <cfset LOCAL.Value = CreateObject(
                        "java",
                        "java.lang.StringBuffer"
                        ).Init()
                        />

                <cfelse>

                    <!---
                        Since we are not in the middle of
                        building a value, we have reached the
                        end of the field. Add the current value
                        to row array and start a new value.

                        Be careful that when we add the new
                        value, we replace out any "escaped"
                        qualifiers with an actual qualifier
                        character.
                    --->
                    <cfset ArrayAppend(
                        LOCAL.Rows[ LOCAL.RowIndex ],
                        LOCAL.Value.ToString().ReplaceAll(
                            "#Chr( 1000 )#{1}",
                            ARGUMENTS.Qualifier
                            )
                        ) />


                    <!---
                        Start new value buffer for the next
                        row value.
                    --->
                    <cfset LOCAL.Value = CreateObject(
                        "java",
                        "java.lang.StringBuffer"
                        ).Init()
                        />

                </cfif>


            <!---
                Check to see if we are dealing with a field
                qualifier being used as a literal character.
                We just have to be careful that this is NOT
                an empty field (double qualifier).
            --->
            <cfelseif (LOCAL.ThisChar EQ ARGUMENTS.Qualifier)>

                <!---
                    Toggle the field flag. This will signal that
                    future characters are part of a single value
                    despite any delimiters that might show up.
                --->
                <cfset LOCAL.IsInField = (NOT LOCAL.IsInField) />


            <!---
                We just have a non-special character. Add it
                to the current value buffer.
            --->
            <cfelse>

                <cfset LOCAL.Value.Append(
                    LOCAL.ThisChar.ToString()
                    ) />

            </cfif>


            <!---
                If we have no more characters left then we can't
                ignore the current value. We need to add this
                value to the row array.
            --->
            <cfif (LOCAL.CharIndex EQ LOCAL.RowData.Length())>

                <!---
                    Check to see if the current value is equal
                    to the empty field. If so, then we just
                    want to add an empty string to the row.
                --->
                <cfif (
                    (LOCAL.Value.Length() EQ 1) AND
                    (LOCAL.Value.ToString() EQ Chr( 1000 ))
                    )>

                    <!---
                        We are dealing with an empty field.
                        Just add the empty string.
                    --->
                    <cfset ArrayAppend(
                        LOCAL.Rows[ LOCAL.RowIndex ],
                        ""
                        ) />

                <cfelse>

                    <!---
                        Nothing special about the value. Just
                        add it to the row data.
                    --->
                    <cfset ArrayAppend(
                        LOCAL.Rows[ LOCAL.RowIndex ],
                        LOCAL.Value.ToString().ReplaceAll(
                            "#Chr( 1000 )#{1}",
                            ARGUMENTS.Qualifier
                            )
                        ) />

                </cfif>

            </cfif>

        </cfloop>

    </cfloop>

    <!--- Return the row data. --->
    <cfreturn LOCAL.Rows />

</cffunction>
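Because the function only splits records on a return character (optionally followed by a new line), data that uses bare new lines as record separators would come back as a single row. One possible workaround, sketched here with the assumption that strCSVData is a variable holding the raw CSV text, is to normalize the line endings before handing the data over:

```cfml
<!---
    Assumption: strCSVData holds the raw CSV text. Convert any
    CRLF, lone CR, or lone LF into a CRLF pair using the
    underlying Java string's ReplaceAll() method.
--->
<cfset strNormalized = strCSVData.ReplaceAll(
    "\r\n|\r|\n",
    Chr( 13 ) & Chr( 10 )
    ) />

<!--- Parse the normalized data with the default delimiter and qualifier. --->
<cfset arrCSV = CSVToArray( CSVData = strNormalized ) />
```

Note that this treats every line break as a record break, including any that sit inside a qualified field, which the function above does not support anyway.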
I have chosen to convert the CSV to an array of arrays as I was not sure that you could depend on a constant number of fields per row. Plus, I figure that going from an array to a query (after this step) would be rather easy. And, since Excel data is not perfectly square (cols vs. rows), I figured this was more in line with where I want to go with it (including it in my ColdFusion POI Utility component).
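As a rough idea of what that array-to-query step could look like, here is a minimal sketch. CSVArrayToQuery and the generic COLUMN_n names are hypothetical and not part of the POI Utility; short rows simply leave their trailing cells empty.

```cfml
<cffunction
    name="CSVArrayToQuery"
    access="public"
    returntype="query"
    output="false"
    hint="Hypothetical helper: converts an array of arrays to a query.">

    <cfargument name="Rows" type="array" required="true" />

    <cfset var LOCAL = StructNew() />

    <!--- Find the widest row since rows may differ in length. --->
    <cfset LOCAL.MaxCols = 0 />
    <cfloop
        index="LOCAL.RowIndex"
        from="1"
        to="#ArrayLen( ARGUMENTS.Rows )#"
        step="1">
        <cfset LOCAL.MaxCols = Max(
            LOCAL.MaxCols,
            ArrayLen( ARGUMENTS.Rows[ LOCAL.RowIndex ] )
            ) />
    </cfloop>

    <!--- Build generic column names: COLUMN_1, COLUMN_2, etc. --->
    <cfset LOCAL.Columns = "" />
    <cfloop
        index="LOCAL.ColIndex"
        from="1"
        to="#LOCAL.MaxCols#"
        step="1">
        <cfset LOCAL.Columns = ListAppend(
            LOCAL.Columns,
            "COLUMN_#LOCAL.ColIndex#"
            ) />
    </cfloop>

    <cfset LOCAL.Query = QueryNew( LOCAL.Columns ) />

    <!--- Add one query row per array row, then fill in its cells. --->
    <cfloop
        index="LOCAL.RowIndex"
        from="1"
        to="#ArrayLen( ARGUMENTS.Rows )#"
        step="1">

        <cfset QueryAddRow( LOCAL.Query ) />

        <cfloop
            index="LOCAL.ColIndex"
            from="1"
            to="#ArrayLen( ARGUMENTS.Rows[ LOCAL.RowIndex ] )#"
            step="1">
            <!--- With no row number given, the last row is set. --->
            <cfset QuerySetCell(
                LOCAL.Query,
                "COLUMN_#LOCAL.ColIndex#",
                ARGUMENTS.Rows[ LOCAL.RowIndex ][ LOCAL.ColIndex ]
                ) />
        </cfloop>

    </cfloop>

    <cfreturn LOCAL.Query />

</cffunction>
```

Calling CSVArrayToQuery( CSVToArray( CSVData = strCSVData ) ) would then give back a query with as many columns as the longest record.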
If I create a variable containing this CSV data:
last name,first name,salary,dream salary,happiness
Jones,Mike,"$35,500.00","$73,000.00"
Hopkins,Paul,"$55,234.00","$250,000.00",3.0
Hawkings,Katie,,,
,
Smith,Betty,"$57,010.00","$60,000.00",10.0
... and pass it into the CSVToArray ColdFusion user defined function:
<!--- Convert the CSV to an array of arrays. --->
<cfset arrCSV = CSVToArray(
    CSVData = strCSVData,
    Delimiter = ",",
    Qualifier = """"
    ) />

<!--- Dump out array. --->
<cfdump var="#arrCSV#" label="CSV Data" />
I get this output:
[Image: CFDump of the resulting array of arrays]
As you can see, the CSVToArray() ColdFusion function handles mixed length records, empty field values, and qualified fields. It even handles escaped qualifiers (ex. "" becomes ") but this was not demonstrated. While this is not perfect, at least it provides me with a CSV conversion interface that I can use in my POI Utility ColdFusion component. Further down the road, I will be able to swap this out for a better implementation.
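To give a feel for the escaped qualifier case, here is a small sketch. The variable names and sample data are made up for illustration; the string is built with single quotes so that the double quotes are literal, and with an explicit return / new line pair so that the record break matches what the function expects.

```cfml
<!---
    Build two records. The second field of the second record
    is qualified and contains escaped (doubled) qualifiers.
--->
<cfset strQuoted = (
    'name,quote' &
    Chr( 13 ) & Chr( 10 ) &
    'Ben,"He said ""hello"" to me"'
    ) />

<!--- Parse using the default delimiter and qualifier. --->
<cfset arrQuoted = CSVToArray( CSVData = strQuoted ) />

<!---
    The second record should come back with two fields, the
    last one being: He said "hello" to me
--->
<cfdump var="#arrQuoted#" label="Escaped Qualifiers" />
```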
Reader Comments
Ben,
I haven't thought through this, so forgive me if it's a stupid question, but did you consider using regular expressions? If so, what caused you to decide against using them?
@Ben,
When dealing with lists, use the GetToken() function. It won't ignore empty list elements. This will significantly speed up your function and replace the loop that you are doing. Also, Sammy hit the nail on the head with using RegEx to strip out the text between the qualifiers.
Another trick you can use to speed things up is to use GetToken() to populate the empty cells and then use ListToArray() for the conversion. It's a lot quicker than creating a Java object on each call.
Hopefully this helps you out some.
@Sammy,
I did think of regular expressions, 'cause they are cool, but I wasn't sure how to apply them. Plus I don't think my skills with them would be good enough to handle all the different options that come with CSV formatting. Take for example:
ben,was,here
That is three fields. But this:
"ben,was,here"
is one field. But this:
""ben,was,here""
is three fields; the first starts with a quote literal, and the last field ends with a quote literal. And then this:
""ben,"was,here"""
has two fields.... you get the point? It was just too much for me to wrap my head around. I am sure that regular expressions would rock somehow, I just can't figure it out.
Tony,
It's funny you mention that because my first attempt actually did use a Tokenizer. In my experience, though, it does skip empty fields:
<cfset Tokenizer = CreateObject(
"java",
"java.util.StringTokenizer"
).Init(
JavaCast( "string", "a,b,,,,c,d,e,f" ),
JavaCast( "string", "," )
) />
<cfloop condition="Tokenizer.HasMoreTokens()">
[#Tokenizer.NextToken()#]<br />
</cfloop>
... outputs:
[a]
[b]
[c]
[d]
[e]
[f]
... it skips right over the empty fields. However, in my current implementation I do add a leading space to all fields which then gets stripped out later.
I did learn some things in iteration three that I didn't know in iteration one, so I could probably go back and apply that to the String Tokenizer. In fact, maybe I will do that.
Comma-separated data is a good idea with ColdFusion because it removes some of the difficult queries and irregularities, while it is still easy to retrieve the information at the client end.
It is being used at www.compglobe.com, where you can compose a comment and the comment is transferred to a CSV file at the server level.
www.compglobe.com also uses the CSV format to upload phone numbers if you want to send information to the handset of the recipient to whom you want the material delivered. www.compglobe.com has various things like a message composer and an online radio too.
Doing something similar, I just grabbed http://opencsv.sourceforge.net/ and then did this:
<cfparam name="filename">
<cfscript>
fileReader = createobject("java","java.io.FileReader");
fileReader.init(filename);
csvReader = createObject("java","au.com.bytecode.opencsv.CSVReader");
csvReader.init(fileReader);
</cfscript>
<cfdump var="#csvReader.readAll()#">
Java and ColdFusion play SO nice together *smile*
Thanks for the code. This was very helpful since I'm just learning CF. I know from other experiences that parsing CSV files can be a real pain to get working right.
Thanks for the code and tutorial Ben - I was grappling with exactly the same issue relating to converting CSV with encapsulating quotes and your post was a lifesaver!!
Always a pleasure to help out!
@Stephen:
Have you gotten this to work with opencsv's CSVWriter?
Tim
This is perhaps similar to what I need to achieve (I think).
My client has a list of products. (Product ID, Product Name, description) are the column headers for the product table.
Well, the description field data... is a CSV.
For example, the data in the description field is:
OD(+/-1.2mm), Wall Thickness = 5.0mm (+/- .4mm), Inside Diameter = 65.0mm, Approximate pieces per case = 4, Approximate weight per case = 32.34 lbs
But I need to take the data in that one field and create more columns to display these attributes rather than this text blob.
Am I on the right track?
@JKS,
You can use CSV parsing to get those values; however, if those are the only values in the field, you can simply treat the data as if it were a comma-delimited list. Then, you can either split the list into an array with ListToArray(), or even use things like ListGetAt() and ListLen() to loop over the elements of the list and examine each individually.
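A rough sketch of the list-based approach, using the sample description from the question. The variable names are made up, and I am assuming that each attribute looks like "name = value" and that no value contains a comma; anything without an equals sign (such as the leading OD item) is skipped here.

```cfml
<!--- Sample description value taken from the question above. --->
<cfset strDescription = "OD(+/-1.2mm), Wall Thickness = 5.0mm (+/- .4mm), Inside Diameter = 65.0mm, Approximate pieces per case = 4, Approximate weight per case = 32.34 lbs" />

<!--- Split the comma-delimited list into an array of items. --->
<cfset arrItems = ListToArray( strDescription, "," ) />

<!--- Collect the name / value pairs in a struct. --->
<cfset objAttributes = StructNew() />

<cfloop
    index="intIndex"
    from="1"
    to="#ArrayLen( arrItems )#"
    step="1">

    <cfset strItem = Trim( arrItems[ intIndex ] ) />

    <!--- Only keep the items that have a name and a value. --->
    <cfif (ListLen( strItem, "=" ) GTE 2)>

        <cfset objAttributes[ Trim( ListFirst( strItem, "=" ) ) ] = Trim(
            ListRest( strItem, "=" )
            ) />

    </cfif>

</cfloop>

<cfdump var="#objAttributes#" label="Product Attributes" />
```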
Ben,
AWESOME JOB!!! I can't believe this was so difficult to find. You definitely saved HOURS of time and helped meet my deadline. This works great. People like you are what make the net an awesome place for research and learning. Thanks!!
@Rob,
Glad to help. Check out a more updated post on this type of thing:
http://www.bennadel.com/index.cfm?dax=blog:976.view




