NOTE: This UDF was modified on 10/15/2007 to include a trimming argument.
Yesterday, Troy asked me to wrap my new CSV parsing algorithm up into a ColdFusion user defined function for ease of use. So, here it is:
Launch code in new window » Download code as text file »
It can take either a CSV data value or a CSV file path. However, since it can take either value or file path, you must used Named Parameters, not ordered parameters for it to work correctly. Additionally, I fixed a small bug in the algorithm which forgot to check for the length of the delimiter (required for the start-of-string match).
Here, you can see a sample of the function being called with both a CSV data value and with a File path (output not shown):
Launch code in new window » Download code as text file »
Download Code Snippet ZIP File
Comments (24) | Post Comment | Ask Ben | Permalink | Other Searches | Print Page
Why Do I Protect People Who Want To Break My Site?
Paginating Record Sets In ColdFusion With One SQL Server Call
In the previous UDF, I would pass in "Chr(9)" for a TAB. This is not working now, nor is passing in "\t". What is needed in this new regex UDF to represent TAB? It looks like the example you show for overriding TAB delimiter just has a space " ".
Also, I'm seeing an extra array element which is empty on each of my comma separated test files. Have you tested this on a real file and not just your manually generated data of Kim, Sarah, and Kit? Wondering if there is something quirky in my files, or this is a real bug?
So a file with 592 rows gives me a cfdump with array element 593 which only contains:
593 array
1 [empty string]
Posted by Troy on Oct 12, 2007 at 11:04 AM
@Troy,
Try passing in #Char( 9 )#" (make sure you are using the hash signs). Also, I tested it on a Application.log file for fun and it seemed to work, although, I might not have caught an empty row at the bottom. Let me do some more testing.
Posted by Ben Nadel on Oct 12, 2007 at 11:18 AM
Ok, I see there is a carraige return at the end of the final row of data in my CSV files and then a final line that is blank. After I removed the blank line, all was well. However, your old UDF seemed to account for this case and handled it gracefully, not putting the blank line into the array of arrays.
Also, I tried the TAB like this, but didn't work.
<cfset Variables.del = "##Char(9)##">
Posted by Troy on Oct 12, 2007 at 11:40 AM
@Troy,
I will take a look at the old UDF to see what it was doing. Also, don't use double-hash:
<cfset Variables.del = "#Char( 9 )#">
Posted by Ben Nadel on Oct 12, 2007 at 11:52 AM
I was using the double hash to escape them. I thought you wanted the single hashes passed into the UDF. Using your way, CF throws this error: Variable CHAR is undefined.
Posted by Troy on Oct 12, 2007 at 12:07 PM
Ooops :) Chr(), not CHAR()... my bad.
Posted by Ben Nadel on Oct 12, 2007 at 12:08 PM
That's what I noted I tried in the first comment I made "Chr(9)". Maybe you need to translate that inside the UDF to "\t" in the places needed in your regex lines of code?
Posted by Troy on Oct 12, 2007 at 1:07 PM
@Troy,
I think there is some confusion on when to use the Hash signs. If we get rid of the quotes, you don't even need to use them. I just ran this and it works:
<cfset arrData = CSVToArray(
CSV = strCSVTab,
Delimiter = Chr( 9 )
) />
Notice that I am just passing in Chr( 9 ) as the delimiter value.
Posted by Ben Nadel on Oct 12, 2007 at 1:32 PM
Does this correctly handle cases where there might be carriage returns within quoted values?
Posted by Michael Long on Oct 12, 2007 at 5:05 PM
Absolutely. As long as the value is within a quoted field value, you are all good.
Posted by Ben Nadel on Oct 12, 2007 at 5:14 PM
I have easily taken Ben's function and parsed 500-1000 line csv that is tab delimited. I used the Chr(9) as the delimeter. I did bring it to his attention about having an extra line at the end.. but It was only because there was a carriage return in my txt file. It wasnt to ask him to change his function.
Also I just accounted for the blank lines and didnt show them, since I get the files in the same format everytime. I account for additional formats as well, so it didnt matter if it really had extra lines, as my app ignores them anyway.
Anyway, good job Ben. Keep up the good work. Ill probably integrate this into my next version of my csv parsing application, as it seems to be a little faster. Got to test it some to make sure.
Have a good day Ben and everyone.
-Matthew
Posted by Matthew on Oct 13, 2007 at 5:18 PM
I found a dumb mistake in my code and now my TAB parsing works with Chr(9). I would like the final blank line to be ignored if it's a simple tweak Ben can make. But not a showstopper, and I'm grateful to Ben for all his help, good stuff!
Posted by Troy on Oct 13, 2007 at 8:37 PM
Do you guys think it would make sense to just strip off the final new lines and carriage returns? Basically, like doing an RTrim() that would stop if it hit a TAB character or space or something? At first, I was worried that this might erase needed data, but I don't think a number of empty lines at the end is necessarily lost... unless the number or rows is meaningful.
I think I will add this and make an optional boolean argument that can turn it OFF.
Posted by Ben Nadel on Oct 14, 2007 at 1:21 PM
I have updated the UDF in this post to include some quasi RTrim() functionality. It will not strip spaces or tabs as these might be valid data characters. It will, however, by default, strip off trailing new line and carriage returns characters from the file. This can be turned off with a fourth, optional argument.
Posted by Ben Nadel on Oct 15, 2007 at 7:54 AM
Since Excel does not use the backslash as an escape character, but instead escapes quote literals with two quotes,
which does your regex use?
I presume it is "" based on some of your writings, but I am no regex expert.
Posted by Steve on Jan 21, 2008 at 5:52 PM
@Steve,
Yeah, I am using "" (double-double quotes) in order to escape quotes within a qualified field. And remember, all "special characters" MUST be in qualified fields.
Posted by Ben Nadel on Jan 23, 2008 at 9:51 AM
Ben,
What happens if you have something like this?
Field1,Field2,Field3
Data1,Data2,Data3
Data1
Data1,,,
Where the commas aren't always there? I know exporting from Excel to CSV sometimes the commas get ignored it the cell is empty.
Have you ever come across this?
Posted by Chris on May 30, 2008 at 9:37 AM
@Chris,
That should be fine. If you look at the example on which this UDF is based:
http://www.bennadel.com/index.cfm?dax=blog:978.view
... you will see that it handles empty spaces quite nicely.
Posted by Ben Nadel on May 30, 2008 at 9:40 AM
@Chris
The array will contain the number of fields that exist in that row,
Some rows may be 3, some 1 , some 4 (1 and 3empty)
What I do is when I am iterating over the array is I check that the array row has the number of fields I am expecting, then I do any data validation I expect on each field (numeric, date, etc)
Also I check for a row of all empty fields (and exit my import).
Excel is also notorious for
field1, field2, field3
d1,d2,d3
d1,d2,d3
,,,
,,,
,,,
(and so on for a few hundred or thousand rows)
,,,
<eof>
Posted by Steve on May 30, 2008 at 4:33 PM
Thanks Ben.
@Steve, I want to take my CSV file and insert it into a database table. However, since some commas are missing, the "Column count doesn't match value count at row". Any ideas how I can fix that? It is due to the fact when exporting to a csv file from Excel it leaves off commas. I add the commas, it works fine. I want to validate the csv before input.
Thanks,
Chris
Posted by Chris on Jun 2, 2008 at 11:08 AM
@Chris
You need to perform all your validation before attempting database insert.
Say I want 5 columns...
<cfif ArrayLen(arrData[i]) LT 5>
<!--- Do Nothing, Exel empty line --->
<cfelseif NOT isnumeric(arrData[i][1])>
<!--- error field ivalid data --->
<cfelse>
Do dtabase work.
</cfif>
Posted by Steve on Jun 2, 2008 at 11:18 AM
Thanks Steve.
If I want 5 columns, but one of my rows only has 4, is there away to add another column so tthat the insert doesn't have to fail?
Thanks,
Chris
Posted by Chris on Jun 2, 2008 at 11:23 AM
Ben,
I got this UDF working and it is very helpful. I just have a quick question maybe you can answer. I am using this as an include in a page which is calling it 2 times as a cfinclude.
It works when i call it one time but when i call it the second time i get this:
-----------------------------------------------------
Routines cannot be declared more than once.
The routine "CSVToArray" has been declared twice in different templates.
ColdFusion cannot determine the line of the template that caused this error. This is often caused by an error in the exception handling subsystem.
-----------------------------------------------------
Is there any way around this? Thanks in advance
Posted by Rob G. on Oct 1, 2008 at 10:25 AM
@Rob,
You can't include a UDF more than once per page request. To get around this, I would include it once in your Application.cfm or something. You need to find a way to only include it once.
Posted by Ben Nadel on Oct 1, 2008 at 10:57 AM