I have been working on parsing CSV values in ColdFusion and it has given me such a headache. The hardest part is trying to figure out if you are dealing with a field qualifier (ex. "ben") or dealing with an escaped qualifier (ex. "Ben is the ""bomb"""). I came up with a hacky solution that iterates over each character in the line. This is not performant and also does not handle all the qualifier situations correctly.
Tony Petruzzi suggested that I try using a Tokenizer. I had originally tried this in my first attempt but ran into similar problems involving field qualifiers. Working on it again this morning, I realized that my biggest problem was that I didn't even know what the CSV standard file format was! How can I solve a problem when I don't even know what my problem domain is?!?
After doing some quick Googling, I found this page, http://www.edoceo.com/utilis/csv-file-format.php, which listed the CSV file format standard as:
- Each record is one line - Line separator may be LF (0x0A) or CRLF (0x0D0A). A line separator may also be embedded in the data (making a record span more than one line, which is still acceptable).
- Fields are separated with commas. - Duh.
- Leading and trailing whitespace is ignored - Unless the field is delimited with double-quotes, in which case the whitespace is preserved.
- Embedded commas - Field must be delimited with double-quotes.
- Embedded double-quotes - Embedded double-quote characters must be doubled, and the field must be delimited with double-quotes.
- Embedded line-breaks - Fields must be surrounded by double-quotes.
- Always Delimiting - Fields may always be delimited with double quotes; the delimiters will be parsed and discarded by the reading application.
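To make those rules concrete, here is a rough Java sketch (ColdFusion sits on top of Java, so the same logic could be called from or ported to CFML). The class name `CsvRecordSketch` and the method `parseRecord` are made up for illustration. It assumes the whole record has already been read into one string, and it simply skips any stray characters between a closing qualifier and the next comma. Treat it as one possible reading of the rules above, not a finished parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: walks one CSV record according to the rules listed above.
final class CsvRecordSketch {

    static List<String> parseRecord(String record) {
        List<String> fields = new ArrayList<>();
        int i = 0;
        int n = record.length();

        while (true) {
            // Leading whitespace outside of qualifiers is ignored.
            while (i < n && (record.charAt(i) == ' ' || record.charAt(i) == '\t')) {
                i++;
            }

            if (i < n && record.charAt(i) == '"') {
                // Qualified field: keep everything, including commas,
                // whitespace, and line breaks, until the closing qualifier.
                StringBuilder field = new StringBuilder();
                i++; // skip the opening qualifier
                while (i < n) {
                    char c = record.charAt(i);
                    if (c == '"' && i + 1 < n && record.charAt(i + 1) == '"') {
                        field.append('"'); // doubled qualifier = one literal quote
                        i += 2;
                    } else if (c == '"') {
                        i++; // closing qualifier, discarded
                        break;
                    } else {
                        field.append(c);
                        i++;
                    }
                }
                // Assumption: anything between the closing qualifier and the
                // next comma (normally just whitespace) is dropped.
                while (i < n && record.charAt(i) != ',') {
                    i++;
                }
                fields.add(field.toString());
            } else {
                // Unqualified field: runs to the next comma, whitespace trimmed.
                int start = i;
                while (i < n && record.charAt(i) != ',') {
                    i++;
                }
                fields.add(record.substring(start, i).trim());
            }

            if (i >= n) {
                break; // end of record
            }
            i++; // consume the comma and move on to the next field
        }
        return fields;
    }
}
```

For example, passing the record `ben, "Ben is the ""bomb""" ," a,b ",` should give four fields: `ben`, `Ben is the "bomb"`, ` a,b ` (whitespace preserved because the field is qualified), and an empty field after the trailing comma.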
This makes things SOOO much easier. Knowing that an embedded delimiter or qualifier MUST be in a field that is fully qualified simplifies my life so much. Now, if I come across a token like this:
I don't have to think about whether it's two escaped qualifiers or one escaped qualifier in a fully qualified field. Based on the standard, I know that embedded qualifiers MUST be in a qualified field and hence, this is a single escaped qualifier in a qualified field (NOT two escaped qualifiers).
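That reasoning can be sketched in a few lines of Java. The names `QualifiedTokenSketch` and `fieldValue` are hypothetical, and the token `""""` used in the note below is only an assumed example of this kind of ambiguity. The sketch also assumes the token has already been isolated as one complete field with no whitespace outside the qualifiers.

```java
// Hypothetical helper: turns one complete field token into its value.
final class QualifiedTokenSketch {

    static String fieldValue(String token) {
        boolean qualified = token.length() >= 2
                && token.startsWith("\"")
                && token.endsWith("\"");

        if (!qualified) {
            // Unqualified fields cannot contain qualifiers, so just trim.
            return token.trim();
        }

        // Drop the outer qualifiers, then collapse each doubled qualifier
        // into the single literal quote it stands for.
        String inner = token.substring(1, token.length() - 1);
        return inner.replace("\"\"", "\"");
    }
}
```

With this, a token of four double-quote characters comes out as a single literal quote, and `"Ben is the ""bomb"""` comes out as `Ben is the "bomb"`.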
Based on this new information, I should be able to have the String Token version of the ColdFusion CSV parser up and running soon. But let this be a lesson - if you are trying to solve a problem, be sure you truly understand what the problem is :)
Thinking about it now though, why bother using the Tokenizer? That involves function calls. Why not just convert the row to a list using a split method? Looping over an Array has got to be faster than using a Tokenizer.
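Here is a hedged sketch of what the split approach might look like in Java. One catch is that a plain split on commas also cuts through qualified fields that contain commas, so the pieces have to be glued back together. The sketch does that by counting qualifiers: a piece with an odd number of double quotes is still "open". The names `SplitRejoinSketch`, `rawFields`, and `hasOpenQualifier` are invented for illustration, and the fields it returns are still raw (qualifiers and doubled quotes are left in place).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: split on commas, then rejoin pieces of qualified fields.
final class SplitRejoinSketch {

    static List<String> rawFields(String line) {
        // The -1 limit keeps trailing empty fields that split() would otherwise drop.
        String[] pieces = line.split(",", -1);
        List<String> fields = new ArrayList<>();
        StringBuilder pending = null; // a qualified field still being reassembled

        for (String piece : pieces) {
            if (pending == null) {
                if (hasOpenQualifier(piece)) {
                    pending = new StringBuilder(piece);
                } else {
                    fields.add(piece);
                }
            } else {
                // Put back the comma that split() removed from inside the field.
                pending.append(',').append(piece);
                if (!hasOpenQualifier(pending)) {
                    fields.add(pending.toString());
                    pending = null;
                }
            }
        }
        if (pending != null) {
            // Unterminated qualifier: keep what we have rather than lose data.
            fields.add(pending.toString());
        }
        return fields;
    }

    // An odd number of double quotes means a qualifier has not been closed yet.
    static boolean hasOpenQualifier(CharSequence text) {
        int quotes = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '"') {
                quotes++;
            }
        }
        return quotes % 2 == 1;
    }
}
```

One difference worth knowing when choosing between the two: Java's `StringTokenizer`, used with just a delimiter string, does not return empty tokens between consecutive delimiters, so `a,,b` yields two tokens, whereas `split(",", -1)` keeps the empty middle field. Whether the array loop is actually faster is something to measure rather than assume.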