Skip to main content
Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.

Modernizing My CSV (Comma Separated Value) Parser In Lucee CFML 5.3.7.47

By Ben Nadel on
Tags: ColdFusion

In the past week, I've written a few posts about generating CSV (Comma Separated Value) data in ColdFusion, including some experimentation with "lazy" queries and streaming CSV files in Lucee CFML. Just coincidentally, Adam Cameron asked me about a very old post that I wrote for parsing CSV data in ColdFusion. I had taken that post down due to some questionable content; so, I thought this was a perfect opportunity to rewrite my CSV parser using modern ColdFusion syntax in Lucee CFML 5.3.7.47.

All the CSV parsing that I perform is based on a non-back-tracking Regular Expression (RegEx) that I wrote with the help of Steven Levithan. It uses a technique called "Unrolling the Loop", which allows for "special" substrings (such as escaped quotes) without leading to catastrophic back-tracking and never-ending patterns.

Regular Expressions - especially complex ones - are never easy to read, let alone write. To help maintain some clarity and maintainability, I am breaking out the RegEx pattern generation into its own User Defined Function (UDF) with copious amounts of comments. I'm also using verbose mode to allow for white-space that won't impact the mechanics of the pattern itself:

<cfscript>

	/**
	* I return a Java Pattern Matcher for the given CSV input and field delimiter.
	* 
	* @input I am the CSV data being parsed.
	* @delimiter I am the field delimiter.
	*/
	private any function getPatternMatcher(
		required string input,
		required string delimiter
		) {

		var escapedDelimiter = delimiter.reFind( "\W" )
			? "\#delimiter#"
			: delimiter
		;

		var patternParts = [
			// VERBOSE FLAGE: So that we can add white-space characters in the various 
			// parts (for readability) and they won't affect the RegEx pattern mechanics.
			"(?x)",

			// Make sure that the next match picks up exactly where the last match left
			// off. In other words, we only want matches that consume a continuous set of
			// characters.
			"\G",

			// As we match tokens within the CSV input, we're always going to match a
			// FIELD value followed by a DELIMITER (or the end of the file).
			// --
			// NOTE: The "*+" notation is a "Possessive Qualifier", which means that it
			// does not allow the pattern matcher to backtrack if it doesn't find a
			// match. This forces all patterns to be a continuous set of characters (and
			// has a performance benefit as the matcher does not have to remember any
			// backtracking positions).
			"(?:",

				// CAPTURE GROUP 1: A quoted field value.
				'"( [^"]*+ (?>""[^"]*+)* )"',

				"|",

				// CAPTURE GROUP 2: An unquoted field value.
				// --
				// NOTE: I originally had double-quotes in the following character-set;
				// but, I am not sure if I really need it there. Removing it makes the
				// matching ever-so-slightly more flexible.
				'( [^#escapedDelimiter#\r\n]*+ )',

			")",

			// CAPTURE GROUP 3: The delimiter.
			"(
				#escapedDelimiter# |
				\r\n? |
				\n |
				$
			)"
		];

		var patternText = patternParts.toList( "" );

		var matcher = createObject( "java", "java.util.regex.Pattern" )
			.compile( javaCast( "string", patternText ) )
			.matcher( javaCast( "string", input ) )
		;

		return( matcher );

	}

</cfscript>

I know that there are a lot of people in this world that don't like comments; and, who think that by leaving comments I have somehow fundamentally failed to express the intend of my code. Well, I'm pretty sure that those people have never written Regular Expressions!

This Regular Expression pattern works by capturing quoted (qualified) and non-quoted field values using two different groups. As we match the pattern against the input CSV data, we can therefore vary our parsing logic based on which capture group is null / populated.

ASIDE: Normally, I would not include the javaCast() calls for my interactions with the lower-level Java constructs; Lucee CFML generally doesn't need this as it uses the proper Types under the hood. However, to make this more compatible with the Adobe ColdFusion runtime, I'm including them in this implementation.

To modernize this whole approach, I've wrapped it in a ColdFusion Component that exposes two methods: one that parses raw CSV input; and, on that parses a CSV file:

component
	output = false
	hint = "I provide methods for parsing CSV (Comma Separated Values) inputs and files."
	{

	this.COMMA = ",";
	this.TAB = chr( 9 );
	// If this delimiter is passed-in, the parser will examine a portion of the input and
	// try to determine, roughly, which delimiter is being used in the data.
	this.GUESS_DELIMITER = "";

	// ---
	// PUBLIC METHODS.
	// ---

	/**
	* I parse the given CSV input value using the given delimiter. The results are
	* returned as a set of nested arrays. The rows are ASSUMED to be delimited by new
	* line and / or row return characters.
	* 
	* @input I am the CSV data being parsed.
	* @delimiter I am the field delimiter.
	*/
	public array function parseCsv(
		required string input,
		string delimiter = this.GUESS_DELIMITER
		) {

		input = removeTrailingLines( input );

		// Short-Circuit: It's not really possible to differentiate an empty input from
		// an input that has an empty row. As such, we're just going to make a judgment
		// call that an empty input has no records.
		if ( ! input.len() ) {

			return( [] );

		}

		if ( delimiter == this.GUESS_DELIMITER ) {

			delimiter = getBestDelimiter( input );

		}

		testDelimiter( delimiter );

		var matcher = getPatternMatcher( input, delimiter );
		var QUOTED_FIELD_GROUP = javaCast( "int", 1 );
		var UNQUOTED_FIELD_GROUP = javaCast( "int", 2 );
		var DELIMITER_GROUP = javaCast( "int", 3 );
		var ESCAPED_QUOTE = """""";
		var UNESCAPTED_QUOTE = """";

		var rows = [];
		var row = [];

		while ( matcher.find() ) {

			// Our pattern matches quoted and unquoted fields in different capturing
			// groups. As such, the only way we can determine which type of field was
			// captured in this match is to see which one of the groups is NULL.
			var quotedField = matcher.group( QUOTED_FIELD_GROUP );

			// If the quotedField variable exists, it means that we captured a quoted
			// field. And, if it's null / undefined, it means that we captured an
			// unquoted field.
			if ( local.keyExists( "quotedField" ) ) {

				row.append( quotedField.replace( ESCAPED_QUOTE, UNESCAPTED_QUOTE, "all" ) );

			} else {

				row.append( matcher.group( UNQUOTED_FIELD_GROUP ) );

			}

			var capturedDelimiter = matcher.group( DELIMITER_GROUP );

			// If our captured delimiter has a length, it was either a field delimiter
			// or a row delimiter.
			if ( capturedDelimiter.len() ) {

				if ( capturedDelimiter == delimiter ) {

					// In the case of a field delimiter, there's nothing to do - the
					// matcher will just move onto the next field.

				} else {

					// In the case of a row delimiter, we need to gather up the current
					// row in the results and then start the next row.
					rows.append( row );
					row = [];

				}

			// If our captured delimiter has no length, it means that it matched the end
			// of the CSV input, which is also the end of the current row. We need to 
			// gather up the current row in the results; but, we don't need to bother
			// starting a new row - there will be no more (meaningful) matches.
			} else {

				rows.append( row );
				// In order to prevent an extra empty row from being appended to the
				// results, we have to explicitly break out of the loop.
				break;

			}

		} // END: While.

		// The CSV input is expected to be in the format of FIELD followed by DELIMITER.
		// However, if the user passed-in a delimiter that does not match the delimiter
		// in the actual data, the pattern matcher will never match. In such a case, we
		// need to let the user know that this was unexpected.
		if ( ! rows.len() ) {

			throw(
				type = "CsvParser.UnexpectedEmptyResults",
				message = "Results should not be empty.",
				detail = "This can happen if you use an incompatible delimiter."
			);

		}

		return( rows );

	}


	/**
	* I parse the given CSV files using the given delimiter. The results are returned as
	* a set of nested arrays. The rows are ASSUMED to be delimited by new line and / or
	* row return characters.
	* 
	* @inputFile I am the CSV file being parsed.
	* @delimiter I am the field delimiter.
	*/
	public array function parseCsvFile(
		required string inputFile,
		string delimiter = this.GUESS_DELIMITER
		) {

		return( parseCsv( fileRead( inputFile ), delimiter ) );

	}

	// ---
	// PRIVATE METHODS.
	// ---

	/**
	* I examine the given CSV input and determine which delimiter (Comma or Tab) to use.
	* This algorithm is naive and simply looks at which one MAY create more fields.
	* 
	* @input I am the CSV input being examined.
	*/
	private string function getBestDelimiter( required string input ) {

		var prefix = input.left( 1000 );
		var commaCount = prefix.listLen( this.COMMA );
		var tabCount = prefix.listLen( this.TAB );

		if ( commaCount >= tabCount ) {

			return( this.COMMA );

		} else {

			return( this.TAB );

		}

	}


	/**
	* I return a Java Pattern Matcher for the given CSV input and field delimiter.
	* 
	* @input I am the CSV data being parsed.
	* @delimiter I am the field delimiter.
	*/
	private any function getPatternMatcher(
		required string input,
		required string delimiter
		) {

		var escapedDelimiter = delimiter.reFind( "\W" )
			? "\#delimiter#"
			: delimiter
		;

		var patternParts = [
			// VERBOSE FLAGE: So that we can add white-space characters in the various 
			// parts (for readability) and they won't affect the RegEx pattern mechanics.
			"(?x)",

			// Make sure that the next match picks up exactly where the last match left
			// off. In other words, we only want matches that consume a continuous set of
			// characters.
			"\G",

			// As we match tokens within the CSV input, we're always going to match a
			// FIELD value followed by a DELIMITER (or the end of the file).
			// --
			// NOTE: The "*+" notation is a "Possessive Qualifier", which means that it
			// does not allow the pattern matcher to backtrack if it doesn't find a
			// match. This forces all patterns to be a continuous set of characters (and
			// has a performance benefit as the matcher does not have to remember any
			// backtracking positions).
			"(?:",

				// CAPTURE GROUP 1: A quoted field value.
				'"( [^"]*+ (?>""[^"]*+)* )"',

				"|",

				// CAPTURE GROUP 2: An unquoted field value.
				// --
				// NOTE: I originally had double-quotes in the following character-set;
				// but, I am not sure if I really need it there. Removing it makes the
				// matching ever-so-slightly more flexible.
				'( [^#escapedDelimiter#\r\n]*+ )',

			")",

			// CAPTURE GROUP 3: The delimiter.
			"(
				#escapedDelimiter# |
				\r\n? |
				\n |
				$
			)"
		];

		var patternText = patternParts.toList( "" );

		var matcher = createObject( "java", "java.util.regex.Pattern" )
			.compile( javaCast( "string", patternText ) )
			.matcher( javaCast( "string", input ) )
		;

		return( matcher );

	}


	/**
	* I remove any trailing empty lines.
	* 
	* @input I am the CSV data being parsed.
	*/
	private string function removeTrailingLines( required string input ) {

		return( input.reReplace( "[\r\n]+$", "" ) );

	}


	/**
	* I assert that the given delimiter can be used by this parser. If it can, this
	* method exits quietly; if not, it throws an error.
	* 
	* @delimiter I am the delimiter being tested.
	*/
	private void function testDelimiter( required string delimiter ) {

		if ( delimiter.len() != 1 ) {

			throw(
				type = "CsvParser.InvalidDelimiter.Length",
				message = "Field delimiter must be one character.",
				detail = "The field delimiter [#delimiter#] is not supported."
			);

		}

		if (
			( delimiter == chr( 13 ) ) ||
			( delimiter == chr( 10 ) )
			) {

			throw(
				type = "CsvParser.InvalidDelimiter.RowDelimiter",
				message = "Field delimiter matches row delimiter.",
				detail = "The field delimiter [#delimiter#] cannot be the same as the implicit row delimiter."
			);

		}

	}

}

Once you get past the complexity of the Regular Expression pattern itself, there's really not all that much going on here. All we're doing it creating a Java Pattern Matcher and then looping over the CSV input looking at the three captured groups.

To test this out, I created a sample CSV file that has various quirks:

ID,Value,Feature Being Tested
1,hello,A raw value
2,"world",A qualified value
3,"jib,jab","A qualified value with "","" embedded"
4,ben "jamin" nadel,A raw field with embedded quotes
5,,An empty raw field
6,"",An empty qualified field

7,after empty,Making sure the empty line above worked

Now, all we have to do is instantiate our CsvParser.cfc ColdFusion component and try to read this file:

<cfscript>

	parser = new CsvParser();

	dump( parser.parseCsvFile( "./sample.csv" ) );

</cfscript>

And, when we run this ColdFusion code, we get the following output:

CSV data parsed into a multi-dimensional Array in Lucee CFML

As you can see, we were able to parse the sample CSV file into a multi-dimensional array of results. But, CSV data can be somewhat ambiguous. Take the 8th element in this Array - it was an empty row in my CSV file; but, does that mean it's an entirely blank row? Or, is it a row with a single, empty field value? Probably, I could update my logic to ignore rows that end-up with a single, empty value.

As much as I celebrated the ease with which we can generate CSV files in ColdFusion, it's also not that difficult to consume CSV files in our ColdFusion applications. That is, once you get past the Regular Expression! CSV data - and text-data like JSON (JavaScript Object Notation) - ends-up being a rather solid choice when it comes to the interoperability of systems.



Reader Comments

Impressive how you've clearly documented what would otherwise be a nightmare to read after the fact.

Reply to this Comment

@Chris,

Thanks! RegEx is one of those technologies that makes way more sense in the mind of author vs. the mind of the reader. Have you ever seen one of those patterns that people use to validate email addresses 🤣 The only tool we have is copious amounts of comments.

Reply to this Comment

@ Ben

The verbose flag is a game changer! I hadn't heard of it before. My best form of documenting a REGEX expression to date has been simply to assign it to a thoughtful variable name. At least then the reader would know what it's purpose was (intended) to be. The Unrolling the Loop is new for me too. REGEX never ceases to amaze me...such a clever, cryptic, capable language.

Reply to this Comment

@Chris,

The verbose flag is pretty cool. It's harder to see it in action in a CFScript block since there's not as much white-space to be had. But, you can see it in the last part of the pattern where I have all the delimiters on a new-line inside the double-quoted, multi-line string.

Inside a verbose flag, you should be able to pull in an actual white-space character by escaping it.... I think... like \ (where there is a space after that slash.

Reply to this Comment

Post A Comment

You — Get Out Of My Dreams, Get Into My Blog
Live in the Now
Oops!
NEW: Some basic markdown formatting is now supported: bold, italic, blockquotes, lists, fenced code-blocks. Read more about markdown syntax »
Comment Etiquette: Please do not post spam. Please keep the comments on-topic. Please do not post unrelated questions or large chunks of code. And, above all, please be nice to each other - we're trying to have a good conversation here.