CSVToArray() ColdFusion UDF For Parsing CSV Data / Files

By Ben Nadel

Published 2007-10-12 in ColdFusion — Comments (74)

NOTE: This UDF had a bug in it when the first value in the CSV data was blank. The updated version has been posted here.

NOTE: This UDF was modified on 10/15/2007 to include a trimming argument.

Yesterday, Troy asked me to wrap my new CSV parsing algorithm up into a ColdFusion user defined function for ease of use. So, here it is:

<cffunction
	name="CSVToArray"
	access="public"
	returntype="array"
	output="false"
	hint="Takes a CSV file or CSV data value and converts it to an array of arrays based on the given field delimiter. Line delimiter is assumed to be new line / carriage return related.">

	<!--- Define arguments. --->
	<cfargument
		name="File"
		type="string"
		required="false"
		default=""
		hint="The optional file containing the CSV data."
		/>

	<cfargument
		name="CSV"
		type="string"
		required="false"
		default=""
		hint="The CSV text data (if the file was not used)."
		/>

	<cfargument
		name="Delimiter"
		type="string"
		required="false"
		default=","
		hint="The data field delimiter."
		/>

	<cfargument
		name="Trim"
		type="boolean"
		required="false"
		default="true"
		hint="Flags whether or not to trim the END of the file for line breaks and carriage returns."
		/>


	<!--- Define the local scope. --->
	<cfset var LOCAL = StructNew() />


	<!---
		Check to see if we are using a CSV File. If so,
		then all we want to do is move the file data into
		the CSV variable. That way, the rest of the algorithm
		can be uniform.
	--->
	<cfif Len( ARGUMENTS.File )>

		<!--- Read the file into Data. --->
		<cffile
			action="read"
			file="#ARGUMENTS.File#"
			variable="ARGUMENTS.CSV"
			/>

	</cfif>


	<!---
		ASSERT: At this point, no matter how the data was
		passed in, we now have it in the CSV variable.
	--->


	<!---
		Check to see if we need to trim the data. Be default,
		we are going to pull off any new line and carraige
		returns that are at the end of the file (we do NOT want
		to strip spaces or tabs).
	--->
	<cfif ARGUMENTS.Trim>

		<!--- Remove trailing returns. --->
		<cfset ARGUMENTS.CSV = REReplace(
			ARGUMENTS.CSV,
			"[\r\n]+$",
			"",
			"ALL"
			) />

	</cfif>


	<!--- Make sure the delimiter is just one character. --->
	<cfif (Len( ARGUMENTS.Delimiter ) NEQ 1)>

		<!--- Set the default delimiter value. --->
		<cfset ARGUMENTS.Delimiter = "," />

	</cfif>


	<!---
		Create a compiled Java regular expression pattern object
		for the experssion that will be needed to parse the
		CSV tokens including the field values as well as any
		delimiters along the way.
	--->
	<cfset LOCAL.Pattern = CreateObject(
		"java",
		"java.util.regex.Pattern"
		).Compile(
			JavaCast(
				"string",

				<!--- Delimiter. --->
				"\G(\#ARGUMENTS.Delimiter#|\r?\n|\r|^)" &

				<!--- Quoted field value. --->
				"(?:""([^""]*+(?>""""[^""]*+)*)""|" &

				<!--- Standard field value --->
				"([^""\#ARGUMENTS.Delimiter#\r\n]*+))"
				)
			)
		/>

	<!---
		Get the pattern matcher for our target text (the
		CSV data). This will allows us to iterate over all the
		tokens in the CSV data for individual evaluation.
	--->
	<cfset LOCAL.Matcher = LOCAL.Pattern.Matcher(
		JavaCast( "string", ARGUMENTS.CSV )
		) />


	<!---
		Create an array to hold the CSV data. We are going
		to create an array of arrays in which each nested
		array represents a row in the CSV data file.
	--->
	<cfset LOCAL.Data = ArrayNew( 1 ) />

	<!--- Start off with a new array for the new data. --->
	<cfset ArrayAppend( LOCAL.Data, ArrayNew( 1 ) ) />


	<!---
		Here's where the magic is taking place; we are going
		to use the Java pattern matcher to iterate over each
		of the CSV data fields using the regular expression
		we defined above.

		Each match will have at least the field value and
		possibly an optional trailing delimiter.
	--->
	<cfloop condition="LOCAL.Matcher.Find()">

		<!---
			Get the delimiter. We know that the delimiter will
			always be matched, but in the case that it matched
			the START expression, it will not have a length.
		--->
		<cfset LOCAL.Delimiter = LOCAL.Matcher.Group(
			JavaCast( "int", 1 )
			) />


		<!---
			Check for delimiter length and is not the field
			delimiter. This is the only time we ever need to
			perform an action (adding a new line array). We
			need to check the length because it might be the
			START STRING match which is empty.
		--->
		<cfif (
			Len( LOCAL.Delimiter ) AND
			(LOCAL.Delimiter NEQ ARGUMENTS.Delimiter)
			)>

			<!--- Start new row data array. --->
			<cfset ArrayAppend(
				LOCAL.Data,
				ArrayNew( 1 )
				) />

		</cfif>


		<!---
			Get the field token value in group 2 (which may
			not exist if the field value was not qualified.
		--->
		<cfset LOCAL.Value = LOCAL.Matcher.Group(
			JavaCast( "int", 2 )
			) />

		<!---
			Check to see if the value exists. If it doesn't
			exist, then we want the non-qualified field. If
			it does exist, then we want to replace any escaped
			embedded quotes.
		--->
		<cfif StructKeyExists( LOCAL, "Value" )>

			<!---
				Replace escpaed quotes with an unescaped double
				quote. No need to perform regex for this.
			--->
			<cfset LOCAL.Value = Replace(
				LOCAL.Value,
				"""""",
				"""",
				"all"
				) />

		<cfelse>

			<!---
				No qualified field value was found, so use group
				3 - the non-qualified alternative.
			--->
			<cfset LOCAL.Value = LOCAL.Matcher.Group(
				JavaCast( "int", 3 )
				) />

		</cfif>


		<!--- Add the field value to the row array. --->
		<cfset ArrayAppend(
			LOCAL.Data[ ArrayLen( LOCAL.Data ) ],
			LOCAL.Value
			) />

	</cfloop>


	<!---
		At this point, our array should contain the parsed
		contents of the CSV value. Return the array.
	--->
	<cfreturn LOCAL.Data />
</cffunction>

It can take either a CSV data value or a CSV file path. However, since it can take either value or file path, you must used Named Parameters, not ordered parameters for it to work correctly. Additionally, I fixed a small bug in the algorithm which forgot to check for the length of the delimiter (required for the start-of-string match).

Here, you can see a sample of the function being called with both a CSV data value and with a File path (output not shown):

<!--- Load CSV file. Use the default comma delimiter. --->
<cfset arrData = CSVToArray(
	File = ExpandPath( "./application.log" )
	) />


<!---
	Load CSV data. Instead of using default comma
	delimiter, pass in overriding TAB delimiter.
--->
<cfset arrData = CSVToArray(
	CSV = strCSVTab,
	Delimiter = " "
	) />

Want to use code from this post? Check out the license.

Short link: https://bennadel.com/991

Reader Comments

Troy Oct 12, 2007 at 11:04 AM

9 Comments

In the previous UDF, I would pass in "Chr(9)" for a TAB. This is not working now, nor is passing in "\t". What is needed in this new regex UDF to represent TAB? It looks like the example you show for overriding TAB delimiter just has a space " ".

Also, I'm seeing an extra array element which is empty on each of my comma separated test files. Have you tested this on a real file and not just your manually generated data of Kim, Sarah, and Kit? Wondering if there is something quirky in my files, or this is a real bug?

So a file with 592 rows gives me a cfdump with array element 593 which only contains:
593 array
1 [empty string]

Ben Nadel Oct 12, 2007 at 11:18 AM

16,238 Comments

@Troy,

Try passing in #Char( 9 )#" (make sure you are using the hash signs). Also, I tested it on a Application.log file for fun and it seemed to work, although, I might not have caught an empty row at the bottom. Let me do some more testing.

Troy Oct 12, 2007 at 11:40 AM

9 Comments

Ok, I see there is a carraige return at the end of the final row of data in my CSV files and then a final line that is blank. After I removed the blank line, all was well. However, your old UDF seemed to account for this case and handled it gracefully, not putting the blank line into the array of arrays.

Also, I tried the TAB like this, but didn't work.
<cfset Variables.del = "##Char(9)##">

Ben Nadel Oct 12, 2007 at 11:52 AM

16,238 Comments

@Troy,

I will take a look at the old UDF to see what it was doing. Also, don't use double-hash:

Troy Oct 12, 2007 at 12:07 PM

9 Comments

I was using the double hash to escape them. I thought you wanted the single hashes passed into the UDF. Using your way, CF throws this error: Variable CHAR is undefined.

Ben Nadel Oct 12, 2007 at 12:08 PM

16,238 Comments

Ooops :) Chr(), not CHAR()... my bad.

Troy Oct 12, 2007 at 1:07 PM

9 Comments

That's what I noted I tried in the first comment I made "Chr(9)". Maybe you need to translate that inside the UDF to "\t" in the places needed in your regex lines of code?

Ben Nadel Oct 12, 2007 at 1:32 PM

16,238 Comments

@Troy,

I think there is some confusion on when to use the Hash signs. If we get rid of the quotes, you don't even need to use them. I just ran this and it works:

Notice that I am just passing in Chr( 9 ) as the delimiter value.

Michael Long Oct 12, 2007 at 5:05 PM

1 Comments

Does this correctly handle cases where there might be carriage returns within quoted values?

Ben Nadel Oct 12, 2007 at 5:14 PM

16,238 Comments

Absolutely. As long as the value is within a quoted field value, you are all good.

Matthew Abbott Oct 13, 2007 at 5:18 PM

46 Comments

I have easily taken Ben's function and parsed 500-1000 line csv that is tab delimited. I used the Chr(9) as the delimeter. I did bring it to his attention about having an extra line at the end.. but It was only because there was a carriage return in my txt file. It wasnt to ask him to change his function.

Also I just accounted for the blank lines and didnt show them, since I get the files in the same format everytime. I account for additional formats as well, so it didnt matter if it really had extra lines, as my app ignores them anyway.

Anyway, good job Ben. Keep up the good work. Ill probably integrate this into my next version of my csv parsing application, as it seems to be a little faster. Got to test it some to make sure.

Have a good day Ben and everyone.

-Matthew

Troy Oct 13, 2007 at 8:37 PM

9 Comments

I found a dumb mistake in my code and now my TAB parsing works with Chr(9). I would like the final blank line to be ignored if it's a simple tweak Ben can make. But not a showstopper, and I'm grateful to Ben for all his help, good stuff!

Ben Nadel Oct 14, 2007 at 1:21 PM

16,238 Comments

Do you guys think it would make sense to just strip off the final new lines and carriage returns? Basically, like doing an RTrim() that would stop if it hit a TAB character or space or something? At first, I was worried that this might erase needed data, but I don't think a number of empty lines at the end is necessarily lost... unless the number or rows is meaningful.

I think I will add this and make an optional boolean argument that can turn it OFF.

Ben Nadel Oct 15, 2007 at 7:54 AM

16,238 Comments

I have updated the UDF in this post to include some quasi RTrim() functionality. It will not strip spaces or tabs as these might be valid data characters. It will, however, by default, strip off trailing new line and carriage returns characters from the file. This can be turned off with a fourth, optional argument.

Steve Jan 21, 2008 at 5:52 PM

17 Comments

Since Excel does not use the backslash as an escape character, but instead escapes quote literals with two quotes,
which does your regex use?

I presume it is "" based on some of your writings, but I am no regex expert.

Ben Nadel Jan 23, 2008 at 9:51 AM

16,238 Comments

@Steve,

Yeah, I am using "" (double-double quotes) in order to escape quotes within a qualified field. And remember, all "special characters" MUST be in qualified fields.

Chris May 30, 2008 at 9:37 AM

9 Comments

Ben,

What happens if you have something like this?

Field1,Field2,Field3
Data1,Data2,Data3
Data1
Data1,,,

Where the commas aren't always there? I know exporting from Excel to CSV sometimes the commas get ignored it the cell is empty.

Have you ever come across this?

Ben Nadel May 30, 2008 at 9:40 AM

16,238 Comments

@Chris,

That should be fine. If you look at the example on which this UDF is based:

www.bennadel.com/index.cfm?dax=blog:978.view

... you will see that it handles empty spaces quite nicely.

Steve May 30, 2008 at 4:33 PM

17 Comments

@Chris
The array will contain the number of fields that exist in that row,
Some rows may be 3, some 1 , some 4 (1 and 3empty)
What I do is when I am iterating over the array is I check that the array row has the number of fields I am expecting, then I do any data validation I expect on each field (numeric, date, etc)

Also I check for a row of all empty fields (and exit my import).
Excel is also notorious for
field1, field2, field3
d1,d2,d3
d1,d2,d3
,,,
,,,
,,,
(and so on for a few hundred or thousand rows)
,,,
<eof>

Chris Jun 2, 2008 at 11:08 AM

9 Comments

Thanks Ben.

@Steve, I want to take my CSV file and insert it into a database table. However, since some commas are missing, the "Column count doesn't match value count at row". Any ideas how I can fix that? It is due to the fact when exporting to a csv file from Excel it leaves off commas. I add the commas, it works fine. I want to validate the csv before input.

Thanks,
Chris

Steve Jun 2, 2008 at 11:18 AM

17 Comments

@Chris
You need to perform all your validation before attempting database insert.
Say I want 5 columns...
<cfif ArrayLen(arrData[i]) LT 5>

<cfelseif NOT isnumeric(arrData[i][1])>

<cfelse>
Do dtabase work.
</cfif>

Chris Jun 2, 2008 at 11:23 AM

9 Comments

Thanks Steve.

If I want 5 columns, but one of my rows only has 4, is there away to add another column so tthat the insert doesn't have to fail?

Thanks,
Chris

Rob Oct 1, 2008 at 10:25 AM

7 Comments

Ben,

I got this UDF working and it is very helpful. I just have a quick question maybe you can answer. I am using this as an include in a page which is calling it 2 times as a cfinclude.

It works when i call it one time but when i call it the second time i get this:
-----------------------------------------------------
Routines cannot be declared more than once.
The routine "CSVToArray" has been declared twice in different templates.

ColdFusion cannot determine the line of the template that caused this error. This is often caused by an error in the exception handling subsystem.
-----------------------------------------------------

Is there any way around this? Thanks in advance

Ben Nadel Oct 1, 2008 at 10:57 AM

16,238 Comments

@Rob,

You can't include a UDF more than once per page request. To get around this, I would include it once in your Application.cfm or something. You need to find a way to only include it once.

tom Jan 12, 2009 at 11:15 AM

1 Comments

amazing use of regular expressions.. this has made a simple load of ~2,500 lines that much easier!

Ben Nadel Jan 12, 2009 at 11:18 AM

16,238 Comments

@Tom,

Glad to help. Regular expressions are quite amazing!

Christiann Jan 12, 2009 at 10:56 PM

1 Comments

I tried this function out today, and it rocks! Thank you!

Fred Feb 25, 2009 at 6:29 AM

1 Comments

Can anyone tell me if this code will work with tabbed CSVs where there are multiple tabs to represent empty fields? The problem I chave is that i am given a CSV that has some column values missing, and those missing values are just represented as multiple tabs.

Ben Nadel Feb 25, 2009 at 6:43 PM

16,238 Comments

@Fred,

Yeah, it should handle empty fiends just fine.

Dave Apr 16, 2009 at 5:01 PM

1 Comments

I tried the function and it worked until I ran into the following error:

"The element at position 20 of dimension 1, of array variable "COLS," cannot be found."

Any idea what may be causing this. It was able to insery 98% of all the records in my [tab delimited] text file.

Ben Nadel May 3, 2009 at 3:13 PM

16,238 Comments

@Dave,

COLS is not a variable in the UDF above (that I can see). Is COLS something in your calling code?

Tom May 6, 2009 at 1:29 PM

15 Comments

@Ben,

I love it when I see your site show up in Google Search results as I know I'm heading in the right direction. Dave (Apr 16, 2009 at 5:01 PM) issue is what brought me here, I believe, which utlimately displayed your CSVToArray function. To Dave, I suspect you are looping over your Array Elements and inserting that value into a table b/c that is what I'm attempting to do and I'm getting the a similar error. I believe what is generating the error in my array is the element located at CSVArray[x][8] (an empty value - supposed to contain a date). So, my question is, being inside of an element, what do you recommend checking for a valid date? I'm currently using <cfif IsDate(CSVArray[x][8])> ... </cfif> inside of my sql insert statement. Is there something better because this doesn't seem to be working... or there is another problem going on.

Ben Nadel May 7, 2009 at 8:29 AM

16,238 Comments

@DLackey,

Have you tried CFDump'ing out the first X items in array to give it a visual check? Sometimes the data is not as clean as you might think.

Tom May 7, 2009 at 9:02 AM

15 Comments

@Ben Nadel,

Yes. Based on my understanding of the way arrays work, if the array is empty it doesn't really exists (e.g.

/start
Array[1][1] = "Adam"
Array[1][2] = "Microsoft Server 2003"
Array[1][3] = [empty string]
Array[1][4] = "Module 1"
...
Array[1][8] = [empty string]
Array[1][9] = [empty string]
/end

What I'm needing to do is insert the values of the array into my database and if the array doesn't exists, the insert fails. I found another posting discussing this and I found a UDF (http://www.cflib.org/udf/isDefinedValueMX) that I thought would do the trick but either I'm implementing it incorrectly or I heading down the wrong path.

The value for Array[1][3] doesnt' seem to cause any problems, it is only for the dates for [1][8] and [1][9].

My query is as follows:

INSERT INTO table
( UID,
COURSE_NAME,
COURSE_DURATION,
MODULE,
SCORE,
MINUTES_SPENT,
LESSON_STATUS,
<cfif isDefinedValueMX("arrData[x][8]")>DATE_STARTED,</cfif>
<cfif isDefinedValueMX("arrDate[x][9]")>DATE_COMPLETED,</cfif>
UPDATED,
UPDATED_BY )
VALUES
(
<cfqueryparam value="#arrData[x][1]#" cfsqltype="CF_SQL_VARCHAR">,
<cfqueryparam value="#arrData[x][2]#" cfsqltype="CF_SQL_VARCHAR">,
<cfqueryparam value="#arrData[x][3]#" cfsqltype="CF_SQL_VARCHAR">,
<cfqueryparam value="#arrData[x][4]#" cfsqltype="CF_SQL_VARCHAR">,
<cfqueryparam value="#arrData[x][5]#" cfsqltype="CF_SQL_VARCHAR">,
<cfqueryparam value="#arrData[x][6]#" cfsqltype="CF_SQL_VARCHAR">,
<cfqueryparam value="#arrData[x][7]#" cfsqltype="CF_SQL_VARCHAR">,
<cfif isDefinedValueMX("arrData[x][8]")>to_date(<cfqueryparam value="#dateformat(arrData[x][8],'mm/dd/yyyy')# #timeformat(arrData[x][8],'HH:MM:SS')#" cfsqltype="CF_SQL_VARCHAR">,'mm/dd/yyyy hh24:mi:ss'),</cfif>
<cfif isDefinedValueMX("arrData[x][9]")>to_date(<cfqueryparam value="#dateformat(arrData[x][9],'mm/dd/yyyy')# #timeformat(arrData[x][9],'HH:MM:SS')#" cfsqltype="CF_SQL_VARCHAR">,'mm/dd/yyyy hh24:mi:ss'),</cfif>
to_date(<cfqueryparam value="#dateformat(now(),'mm/dd/yyyy')# #timeformat(now(),'HH:MM:SS')#" cfsqltype="CF_SQL_VARCHAR">,'mm/dd/yyyy hh24:mi:ss'),
<cfqueryparam cfsqltype="cf_sql_varchar" value="#lcase(cookie.admin_id)#"> )

Any advice on how to handle null values in the array?

Ben Nadel May 21, 2009 at 8:55 AM

16,238 Comments

@DLackey,

If I have an array with unknown index lengths, I will just CFParam the values with something I can easily check:

Then, in the insert, I can do something like this:

The null="" attribute allows me to NULL'ify the date/time column for that row if the passed-in date is not actually a date.

BYJ Aug 5, 2009 at 1:59 PM

4 Comments

Ben,
I experiences 2 things when applying this code to my tab delimited text file.
1.
I used your code to parse less than 6000 records and It took me forever to just see the cfdump. Finally I have to stop and killed my browser.
When I deleted those records and left only 4 records (lines) to test again, The code works!
So I'm not sure what do you mean when you wrote that using regex will speed up the process.

2.
I call the function this way:
Either of this one is working with small number of records:
(The diff is only on #chr(9)# or chr(9))

BUT this one showed error:
<cfset arrData = CSVToArray(File="#FeedPath##FileName#", CSV = strCSVTab,Delimiter = Chr( 9 )) />

The error says:
Variable STRCSVTAB is undefined.

Tom Aug 5, 2009 at 2:08 PM

15 Comments

@Ben,

Sorry Ben, I didn't see your reply until BYJ just replied. I ended up using the cfparam approach and it worked (removing the isDefinedValueMX udf).

BYJ Aug 5, 2009 at 2:56 PM

4 Comments

Ben,
To add on my previous comment. I have just finished parsing 375 records(lines) in tab delimited text file.

Total Time Avg Time Count
--------- -------- -----
130394 ms 130394 ms 1
131722 ms TOTAL EXECUTION TIME

This is just to create the array, I think it took too much time.
May be I did something wrong? all I did was copy and pasted your function and then write the following to call the function:

Ben Nadel Aug 5, 2009 at 7:03 PM

16,238 Comments

@BYJ,

You can't pass in both a File and a CSV value. It's one or the other. As far as the test you ran, that is a LONG time to parse just a few hundred records. Are you doing anything else other than the parsing?

BYJ Aug 6, 2009 at 9:58 AM

4 Comments

Ben,
As far far as my codes, other than this function, I only have CFSET to get the file location as: #FeedPath##FileName#
Then I do:
<cfset arrData = CSVToArray(File="#FeedPath##FileName#",Delimiter = "#chr(9)#") />
To call the function, and I do cfdump:
<cfdump var="#arrData#">

I need to pass the File and delimiter to the function, it is working but not with the speed I'm expecting.

I'm not too sure what is this argument for and when will I need to pass this argument to the function?
I thought I just have to pass the file location and delimiter.

BYJ Aug 6, 2009 at 10:17 AM

4 Comments

Hi Ben,
I found another bug on the function.
I used a different tab delimited file, one of the row is an empty space row:
AAAAAAA-tab-BBBB-tab-1234-tab-CCCCCCCC
--empty new row---
XXXXXXX-tab-YYYY-tab-5456-tab-ZZZZZZZZ
I got this result:
array 1 has an array elements AAAAAA, BBBB, etc. but
array 2 is an empty array.

I thought this code will catch this kind of problem but it is not. The result that I got was like this:

I realized that this file is a bad file, I could have emailed the client to re-send the file. But the function has removing training returns ability, which I personaly think, is a very good idea to have except it is not working.
Have you ran into this kind of file while testing?

Jonathan Slattery Aug 10, 2009 at 6:01 AM

1 Comments

@BYJ: The "TRIM/Remove trailing returns" check is only applying that check to the VERY END of the CSV file contents because the $ in the Regular Expression means 'match only at the end of the string'.
In this case, the end of the string is the end of the file content. (the file is one big long string).
It was not intended to remove blank lines in the middle.

However, you could probably add an additional boolean parameter to the UDF to called stripBlankLines or something analagous, and run it just before the trim check to strip out extra lines in the middle...

# <cfif ARGUMENTS.stripBlankLines>
#
#

# <cfset ARGUMENTS.CSV = REReplace(
# ARGUMENTS.CSV,
# "(\r?\n)(?:\r?\n)+",
# "$1",
# "ALL"
# ) />
#
# </cfif>

be careful with that though, as it can mess up quote-qualified string fields that contain carriage returns/line feeds, etc., but something like that should work as long as you're not worried about such outlier cases.

@Ben or any Regex Experts, please double-check my work on that... It's late and I don't have time to verify with any testing right now...

@Ben, one question on my mind: You didn't explicitly discredit the CSV-reader Query Datasource implementation method, but i assume there was a reason for creating this UDF... How well does this beat the query version? That's usually done the trick for me faster than attempting to parse the CSV file, unless I needed to do some initial massaging of the file, etc.
Although, i haven't ever dropped down to Java level code like this in CF, as I don't have any Java experience. Or is this mainly aimed for non-windows platorms?

Brian Lang Aug 31, 2009 at 6:18 PM

41 Comments

Going to try this out. Initial look seems very promising. I've got some very dirty data however... will have to proceed cautiously.

Ben Nadel Sep 2, 2009 at 8:47 AM

16,238 Comments

@BYJ,

As Jonathan said, the trimming is only for the end of the file (so people don't get random records at the end). Empty lines, mid-file are acceptable and just treated as empty records.

@Jonathan,

I have nothing against the datasource CSV file read - I just have never tried it before, that's all. I would love to see if there was a way to set up a data source for a single file, then be able to save each CSV to the same location such that you don't have to set up a datasource more than once.

@Brian,

Good luck! As long as the data fits CSV formatting standards, it should be ok.

Dave Phillips Jan 23, 2010 at 1:14 PM

1 Comments

Ben - thanks yet again for providing a wonderful solution for an absolute need. You just saved me hours more of coding. I just wish I had come across this post 2 days ago. This worked great with an direct CSV export of Outlook Contacts without any special handling at all. Thanks again!

Ben Nadel Jan 24, 2010 at 10:16 PM

16,238 Comments

@Dave,

Sweet! Glad to be able to contribute.

Jeff Lucido Mar 30, 2010 at 5:02 PM

1 Comments

Quite often I find nice little nuggets of information on your website that help me in tremendous ways. While I have been using OpenBlueDragon pretty my exclusively for a while, I need a rock solid CSV manipulation function for a CF8 server. Once again you had an excellent solution.

I really needed a query resultset output for easy display purposes, etc. I took the liberty to "add" to your UDF to allow for query output and what not (not to mention handle the first row as headers). I also did a brute force rough determination of each line of data being the same, in terms of elements, as the master row of data (the first row).

Just in case others might be looking for a query output from your wonderful CSV parser, I am adding my solution to the mix. Thanks again for such wonderful, well thought out articles. You have saved me hours of head banging over the years.

<cffunction name="readCSV" access="public" returntype="any" output="false" hint="Takes a CSV file or CSV data value and converts it to an array of arrays based on the given field delimiter. Line delimiter is assumed to be new line / carriage return related.">
<cfargument name="csvFile" type="string" required="false" default="" hint="The optional file containing the CSV data." />
<cfargument name="csvData" type="string" required="false" default="" hint="The CSV text data (if the file was not used)." />
<cfargument name="delimiter" type="string" required="false" default="," hint="The data field delimiter." />
<cfargument name="trimData" type="boolean" required="false" default="true" hint="Flags whether or not to trim the END of the file for line breaks and carriage returns." />
<cfargument name="isFirstRowHeader" type="boolean" required="false" default="true" hint="Flag to indicated whether the first row of the CSV data contains column header descriptors (if true, then the first row is deleted from the resultant)" />
<cfargument name="dataAsQuery" type="boolean" required="false" default="false" hint="Flag to indicated the return value of the data, default is an array (false). If return format is query and isFirstRowHeader true then the query column names are set as query column headings." />
<cfargument name="flaqSuspectQueryRow" type="boolean" required="false" default="false" hint="If a row of data has fewer records than the first row(master) and you want a query resultset, then an additional column 'isRowSuspect' is added to the column with a bit value representing the row of data.">

<cfscript>
var i = 0;
var r = 0;
var stProcess = {qryResult = ''};
var arrColumnNames = [];

// the CSV source is a file on disk so read the file to serve as our CSV source
if (arguments.csvFile != '') {
arguments.csvData = fileRead(arguments.csvFile);
}
// remove trailing returns from the end of the CSV data
if (arguments.trimData) {
arguments.csvData = REReplace(arguments.csvData, "[\r\n]+$", "", "ALL");
}
// make sure the delimiter is just one character
if (len(arguments.delimiter) != 1) {
arguments.delimiter = ','; // the delimiter must be a single character, therefore set the value to a comma
}

// PROCESSING: get to the point already and do some processing of the CSV data
// create a compiled Java regular expression pattern object for the experssion that will be needed to parse the CSV tokens including the field values as well as any delimiters along the way
stProcess.pattern = createObject('java', 'java.util.regex.Pattern').compile(javaCast('string','\G(\#arguments.delimiter#|\r?\n|\r|^)' & '(?:"([^"]*+(?>""[^"]*+)*)"|' & '([^"\#arguments.delimiter#\r\n]*+))'));
// get the pattern matcher for our target text (the CSV data). This will allows us to iterate over all the tokens in the CSV data for individual evaluation
stProcess.matcher = stProcess.pattern.matcher(javaCast('string', arguments.csvData));
// create an array to hold the CSV data (we are going to create an array of arrays in which each nested array represents a row in the CSV data file)
stProcess.arrResult = [];
arrayAppend(stProcess.arrResult, arrayNew(1));

while (stProcess.matcher.find()) {
stProcess.Delimiter = stProcess.matcher.group(javaCast('int', 1)); // the delimiter will always be matched, but in the case that it matched the START expression, it will not have a length
// check for delimiter length and is not the field delimiter. This is the only time we ever need to perform an action (adding a new line array). We need to check the length because it might be the START STRING match which is empty
if ((len(stProcess.delimiter) == 1) && (compareNoCase(stProcess.delimiter, arguments.delimiter) != 0)) {
arrayAppend(stProcess.arrResult, arrayNew(1)); // start new row data array
}
// get the field token value in group 2 (which may not exist if the field value was not qualified
stProcess.value = stProcess.matcher.group(javaCast('int', 2));
// check to see if the value exists. If it doesn't exist, then we want the non-qualified field. If it does exist, then we want to replace any escaped embedded quotes.
if (structKeyExists(stProcess, 'value')) {
stProcess.value = Replace(stProcess.Value, '""', '"', 'ALL'); // replace escaped quotes with an unescaped doublequote
} else {
stProcess.value = stProcess.matcher.group(javaCast('int', 3)); // no qualified field value was found, so use group 3 - the non-qualified alternative
}
// add the field value to the row array
arrayAppend(stProcess.arrResult[arrayLen(stProcess.arrResult)], stProcess.value);
}

// we now have an array of arrays which represents the CSV data we have processed, now do a little housekeeping
if (arguments.isFirstRowHeader) {
arrColumnNames = stProcess.arrResult[1]; // save the column names from the first row of the CSV (array index 1)
arrayDeleteAt(stProcess.arrResult, 1); // remove the column names from the result array
}

// if the desired result is a query, build that from the result array or arrays
if (arguments.dataAsQuery) {
// first, decided if the query column headers are either taken from the CSV header row or are strictly numerical (assumption: all arrays representing the CSV row data are the same length as the first record (array node 1), if not, blank strings will be added in their place)
if ((arguments.isFirstRowHeader) && !arrayIsEmpty(arrColumnNames)) {
// we should use the first row information as query column names, however we must 'clean' the column names since we cannot have spaces or weird characters in query column names
for (i = 1; i <= arrayLen(arrColumnNames); i++) {
arrColumnNames[i] = REReplace(arrColumnNames[i], '[^a-zA-Z0-9_]', '', 'ALL');
}
} else {
// use generic header column names for the query
for (i = 1; i <= arrayLen(stProcess.arrResult[1]); i++) {
arrayAppend(arrColumnNames, 'COLUMN' & i);
}
}
if (arguments.flaqSuspectQueryRow) {
// in the event a 'row' of CSV data does not contain the same number of elements as the first row (again the assumption is the first row represents ALL data rows of the CSV), keep track of which row may not match on element length
arrayAppend(arrColumnNames, 'isRowSuspect');
}
// initialize our query result with the column names
stProcess.qryResult = queryNew(arrayToList(arrColumnNames));

// build the query resultset from the result array of arrays
for (i = 1; i <= arrayLen(stProcess.arrResult); i++) {
queryAddRow(stProcess.qryResult);
// loop over the 'row' data and add it to the result query
for (r = 1; r <= arrayLen(arrColumnNames); r++) {
if (compareNoCase(arrColumnNames[r], 'isRowSuspect') != 0) {
// we are dealing with true column data, make sure the current processing 'row' (array) contains enough elements as compared to the master row
if (r <= arrayLen(stProcess.arrResult[i])) {
querySetCell(stProcess.qryResult, arrColumnNames[r], stProcess.arrResult[i][r]);
} else {
querySetCell(stProcess.qryResult, arrColumnNames[r], ''); // there is no data in this array element, set a default value of an empty string (catch up with the times CF and support 'NULL' values!!)
}
} else {
// determine if this 'row' of data had the same number of elements as the 'master' row
if ((r-1) == arrayLen(stProcess.arrResult[i])) {
querySetCell(stProcess.qryResult, 'isRowSuspect', '0'); // this 'row' contains the same number of elements, therefore nothing appears to be wrong with the data
} else {
querySetCell(stProcess.qryResult, 'isRowSuspect', '1'); // this 'row' contains a different number of elements, therefore flag it as suspect
}
}
}
// was this row of information at least valid in terms of number of elements processed?
if (r == arrayLen(stProcess.arrResult[i])) {
querySetCell(stProcess.qryResult, '', ''); // there is no data in this array element, set a default value of nothing
}
}
// the CSV data is now represented by a query data set, ship it
return stProcess.qryResult;
} else {
// the desired return data format is an array, ship it
return stProcess.arrResult;
}
</cfscript>
</cffunction>

Sna4x8 Apr 20, 2010 at 12:46 PM

2 Comments

Thanks man, works well.

Chris Apr 21, 2010 at 10:20 AM

4 Comments

What would be the easiest way to convert this into a query?

Thanks,
Chris

Chris Apr 21, 2010 at 10:29 AM

4 Comments

Nevermind, didn't see the one comment. I will give that a try. :)

Chris Apr 21, 2010 at 10:53 AM

4 Comments

@Jeff,

Your query solution doesn't seem to work for me. I tried it with my CSV which gives me the error: column [] does not exist

And I also tried it with this:
<cfsavecontent variable="strCSV">
"Name","Nick Name","Age","Hair Color"
Kim,"Kim ""Hot Legs"" Smith",24,"Brunette"
"Sarah Vivenz, II","Stubs",27,"Brunette"
"Kit Williams",Kitty,34,Blonde,,,
"Even
Values With
Embedded Line Breaks"
</cfsavecontent>

But when I dump that, all are inserted as query columns.

Chris Apr 21, 2010 at 11:03 AM

4 Comments

Nevermind, I needed to add flaqSuspectQueryRow :)

Jyothsna C May 14, 2010 at 2:43 AM

1 Comments

Hi,

Please help me in reading a check box in csv file. I tried using above function it is giving me a empty string. Please suggest.

Thanks,
Jo

SuperAlly Jul 19, 2010 at 8:23 PM

8 Comments

Ben

Thanks kindly for another great post. This was very helpful.

Cheers.

Sandi from Chicago Aug 6, 2010 at 1:19 PM

2 Comments

Hey Ben! We just met at CFUnited. I'm trying to use this code to read a CSV files and it is working. I can dump out the array and see it. However, and I'm sure this is a really stupid problem, I don't seem to be able to interate the array. For a test, I'm just trying to break the outer array into the inner arrays and dump them (keep it simple first right...):

And I'm getting an error:

• Type: class java.lang.String
• Message: Object of type class java.lang.String cannot be used as an array

Any help you can offer?

Thanks!

Sandi

Sandi from Chicago Aug 6, 2010 at 2:22 PM

2 Comments

I guess you can't use the "array" option in CFLoop for this kind of array. I'm using an old fasion index now and that seems to work fine.

Ben Nadel Aug 8, 2010 at 6:13 PM

16,238 Comments

@Sandi,

It was nice to meet you my man; sorry we didn't get to talk more. As for your issue in the code, you simply need to add hash tags around the array variable name:

... becomes:

... that should do the trick.

David Johnson Aug 16, 2010 at 10:30 AM

1 Comments

The Microsoft ODBC Text Driver was dropping leading zeroes and even removing alpha characters from alphanumeric fields. I switched my code over from being ODBC datasource driven to using your array solution in about 35 minutes and was able to keep all my previous error-checking in tact.

Thank you for providing such a clean and concise solution!

Ben Nadel Aug 16, 2010 at 9:26 PM

16,238 Comments

@David,

Hey man, that's awesome! Glad that it worked out so nicely! I hope that it is adequately performant. I know that it can't possible be as fast as a core drive; but, hopefully you're not taking too much of a hit.

Paul Cormier Sep 7, 2010 at 3:09 PM

5 Comments

I've been using a slightly modified version of this routine successfully for a couple of years, and on rare occasions I would get an error. I finally tracked down the cause of the errors: When the first field of the first row of the delimited file you are importing is empty, the import fails. Subsequent rows can have an empty first field and it works fine, just the first row can't.

I'll be working on a solution and will post my findings, but I'm not sure when I'll be able to get to it. So if someone wants to find the solution and post here, it would be greatly appreciated.

Paul

Paul Cormier Sep 7, 2010 at 4:23 PM

5 Comments

Ha! Of course it ended up being my own code. I added some extra code to handle the first row being column headers or not, and I had an error in there. There should be nothing wrong with the code as posted originally.

Paul

P.S. Using java.io.FileReader and java.io.LineNumberReader significantly speeds up performance when reading large delimited files rather than trying to load the entire file contents into a variable for processing.

Ben Nadel Sep 9, 2010 at 2:11 PM

16,238 Comments

@Paul,

I am not sure your code is entirely to blame. I believe someone else has alluded to an empty-first-cell problem with the CSV parser. The regular expression is a bit complicated (I didn't fully write it).

Paul Cormier Sep 9, 2010 at 9:43 PM

5 Comments

@Ben,

I'm glad I wasn't the only one having difficulty following the logic of that regular expression! The \G is an odd bird that behaves differently in different Regex parsers.

To clear things up a bit, there is a problem with your routine when the first field of the first row is blank. I added code to detect that special case and reinsert an empty value. My problem was due to my customized firstrowlabels=[true|false] I actually needed to add the special case in two places but only had it in one.

If you email me your email address, I can send you my modified version of your routine which:

a) Is actually CSV2Query()
b) Can optionally use first row labels as column names or you pass in column names.
c) Uses java.io.FileReader and java.io.LineNumberReader to allow efficient parsing of some pretty large files.

Unfortunately, unlike your beautifully commented code, mine is dense and entirely CFScript.

Paul

Ben Nadel Sep 10, 2010 at 5:44 PM

16,238 Comments

@Paul,

Ha ha, to each their own (tags vs. cfscript). I always get iffy about allowing people to use first rows as header values only because the clients I've had tend to NOT stick to naming convensions which means that they are bound to enter a value that is not a valid query column name at some point. Of course, this seems to be one of the features most people request... perhaps I am too paranoid.

This is the only time I've ever seen the \G being used; it has to do with matching the end of the previous match (or the beginning of the string). I didn't know about this until Steve Levithan helped me with this pattern. I should actually do a bit more exploration of it - I know a good bit about Regular Expressions, but there still a LOT that I don't know.

Stacey Abshire Oct 22, 2010 at 11:03 AM

7 Comments

@Paul @Ben I am having the same issue with the first row have a first cell that is empty. In that case, it is skipped, and I am pretty certain it isn't the regex, but rather has something to do with the assumption to use the 3rd group if the quoted group isn't matched. So in my case, I am just pre-pending a delimiter to the beginning if the first character is a delimiter, and I'm good. It's a hack, but works. LOL! If you find a fix for the parser, just let me know. Thanks!

Ben Nadel Oct 22, 2010 at 11:05 AM

16,238 Comments

@Stacey,

Yeah, people keep getting this error :( I am gonna see if I can fix this after work today. It's about time!

Stacey Abshire Oct 22, 2010 at 11:07 AM

7 Comments

@Ben, I guess I am the straw that broke the camel's back. LOL!

Ben Nadel Oct 22, 2010 at 11:11 AM

16,238 Comments

@Stacey,

Ha ha ha, well hopefully I'll be able to fix it :)

Stacey Abshire Oct 22, 2010 at 11:15 AM

7 Comments

@Ben,

My hack fix was this change in the code:

		<!---
				ASSERT: At this point, no matter how the data was
				passed in, we now have it in the CSV variable.
		--->
 
 
		<!---
				Stacey Abshire:
				Added a hack to handle a bug in the parse that
				results in the first column of the first row
				being skipped if that column is empty
		--->
		<cfif Left(ARGUMENTS.CSV, 1) eq ARGUMENTS.Delimiter>
				<cfset ARGUMENTS.CSV = ARGUMENTS.Delimiter & ARGUMENTS.CSV>
		</cfif>

Definitely not ideal, but works for now.

Ben Nadel Oct 22, 2010 at 5:44 PM

16,238 Comments

@All,

I finally ironed out the bug with the empty-first-field problem:

www.bennadel.com/blog/2041-UPDATE-Parsing-CSV-Data-Files-In-ColdFusion-With-csvToArray-.htm

Hope that helps. I'll put a note in the top of this blog post about it.

Timothy Farrar Jan 11, 2011 at 1:31 PM

3 Comments

Ben, used this today in a project, and it worked great! :) I was looking for someone who had worked out the REGEX etc. here, and was very happy to find you had! :) thanks man!

Mike Sep 14, 2011 at 1:53 PM

1 Comments

We are using the CSVToArray function, however our default separator is sometimes a ";" instead of a "," ( french region setting ). I tried to change the default separator in your function to a ; put it didn't seem to care. Am I missing something ?

My second option and this would make me even happier I think, is to quickly parse the incoming CSV file and replace all ";" with "," before feeding it to the CSVToArray function. If someone has an elegant way of doing that I would be grateful. I inherited this project and am quite new to ColdFusion so I would appreciate any help.

cheers
Mike

Brook Oct 11, 2011 at 11:25 AM

8 Comments

I made a couple of modifications to this to read and parse a range of rows so that I could process large files in chunks. This only applies when a file argument is supplied. I added startrows and maxrows arguments:

And then updated this code:

to:

Ivan Arambula Jul 30, 2013 at 7:00 PM

1 Comments

Has anyone tried this??

<cfloop list="#CSVstring#" index="item" delimiters=",">
 
<cfoutput>#item#</cfoutput>
 
</cfloop>

This will separate each value for you and you can append them to an array or you can do whatever you want to it.

Oh my chickens, this post is old!

Hit me up on LinkedIn if you want to discuss it further.