Skip to main content
Ben Nadel at Node.js Training with Nodejitsu (Apr. 2011) with: Paolo Fragomeni and Charlie Robbins and Marak Squires
Ben Nadel at Node.js Training with Nodejitsu (Apr. 2011) with: Paolo Fragomeni ( @hij1nx ) Charlie Robbins ( @indexzero ) Marak Squires ( @maraksquires )

Parsing CSV Data With An Input Stream And A Finite State Machine

By on
Tags:

I've talked about parsing CSV (Comma Separated Value) data on this blog a number of times, both in ColdFusion and in JavaScript. Every now and then, someone brings up the fact that ColdFusion may run out of memory when parsing ginormous files. As such, I thought it would be a fun challenge to try and parse a CSV file using a Java Buffered Input Stream and a Finite State Machine approach to processing. This way, the CSV parser could read-in portions of the target file, examine one character at a time, and then publish various events along the way without tying up too much of the computer's memory.

To start with, I tried to draw a state diagram:

A Finite State Machine state diagram for parsing CSV (Comma Separated Value) data.

This was fun, but not useful. The diagram quickly got way out of hand. And, I realized that naming states "S1", S2", etc., was not useful. While a Finite State Machine is simply in theory, it is stil a lot to model mentally. As such, I sat back and started to think about what kind of "meaningful" states I wanted to use and what kind of character inputs were valid.

After some reflection, I came up with the following states that could exist in my Finite State Machine:

  • Pre-Data - This state would be used only once to determine if the document had any valid data.
  • Between Fields - This state would be used between fields of a single row.
  • Non-Quoted Value - This state would indicate the collection of a non-quoted value.
  • Quoted Value - This state would indicate the collection of a quoted value.
  • Escaped Value - This state would be used to check a quote embedded within a quoted value (ie. was it the end of the field or an embedded quote).
  • Carriage Return - This state would handle carriage returns.
  • New Line - This state would handle new lines.

NOTE: Since I am on a mac, I didn't actually test the \r to \n transition.

As the Finite State Machine transitioned from one state to another, it needed to raise events that some CSV data handler would be listening for. While not all of these events would be useful, this is the list of transitional events that I came up with:

  • startDocument
  • startRow
  • startField
  • endField (has accompanying data)
  • endRow
  • endDocument

Only the "endField" event would publish with an accompanying event data item (containing the value of the field that was just parsed).

At this point, I should have created a table that listed states and events and how they map to transitions (as I did in this post); however, due to lack of time, I just sort of winged it. I simply made a list of "relevant" input characters and then integrated them into my code:

  • \r - Carriage return.
  • \n - New line.
  • , - Comma - field delimiter.
  • \x04 - End-of-Transmission (end of file indicator).
  • " - Double quote - for quoted values.
  • [^\r\n,"\x04] - Everything else (valid field characters).

Ok, let's start looking at some code (I only have 10 mins to finish writing this post). Since this CSV parser is evented, I needed to create a ColdFusion component that would listen for events and aggregate the CSV data. For our purposes, this only needed to have one method - handleEvent() - which is the method to which the CSV parser will publish events.

Handler.cfc

<cfcomponent
	output="false"
	hint="I listen for CSV parser events and compile a array of arrays.">


	<cffunction
		name="init"
		access="public"
		returntype="any"
		output="false"
		hint="I initialize this component.">

		<!--- Set up the data. --->
		<cfset variables.csvData = [] />

		<!--- Return this object reference. --->
		<cfreturn this />
	</cffunction>


	<cffunction
		name="getData"
		access="public"
		returntype="array"
		output="false"
		hint="I return the current data collection.">

		<cfreturn duplicate( variables.csvData ) />
	</cffunction>


	<cffunction
		name="handleEvent"
		access="public"
		returntype="any"
		output="false"
		hint="I listen for and then response to events published by a CSV parser.">

		<!--- Define arguments. --->
		<cfargument
			name="eventType"
			type="string"
			required="true"
			hint="I am the type of event being raised."
			/>

		<cfargument
			name="eventData"
			type="string"
			required="false"
			default=""
			hint="I am the (optional) data being published along with the CSV parsing event."
			/>

		<!---
		<cffile
			action="append"
			file="#expandPath( './log.txt' )#"
			output="#arguments.eventType# [#arguments.eventData#]"
			addnewline="true"
			/>
		--->

		<!--- Check to see what kind of event we have. --->
		<cfif (arguments.eventType eq "startRow")>

			<!--- Push a new row onto the data. --->
			<cfset arrayAppend(
				variables.csvData,
				arrayNew( 1 )
				) />

		<cfelseif (arguments.eventType eq "endField")>

			<!--- Push this field onto the latest row. --->
			<cfset arrayAppend(
				variables.csvData[ arrayLen( variables.csvData ) ],
				arguments.eventData
				) />

		</cfif>

		<!--- Return this object reference. --->
		<cfreturn this />
	</cffunction>

</cfcomponent>

As you can see, this component builds an internal array based on two key events - "startRow" and "endField". When the "startRow" event is fired, a new array is pushed onto the data cache. When the "endField" event is fired, a new field value is pushed onto the last row.

Before we dive into the actual parser, let's see how this works in action. First, we need to define our CSV (Comma Separated Value) file:

ID,Name,Nick Name
1,Bob Smith,"Bob ""The Horse"" Smith"
2,"Darren",D-Rock
3,"Tony","Tony ""2 Legs"""
4,Vito Caprilo,Vito the Sink
"5","Frank","Frank the Tank"
6,Pete,""
7,,

As you can see, this data uses a mixture of quoted values, non-quoted values, and incomplete lines.

The test code that runs the parsing looks like this:

<!---
	Get the file path to the CSV data file that we will be reading
	in with an Input Stream (so as not to have to read the whole
	file at one time).
--->
<cfset filePath = expandPath( "./widgets.csv" ) />

<!---
	Create our handler. This must have one method - handleEvent() -
	which can respond to events published by the CSV parser.
--->
<cfset handler = createObject( "component", "Handler" ).init() />

<!--- Create our CSV data evented parser. --->
<cfset parser = createObject( "component", "CSVParser" ).init(
	filePath,
	handler
	) />

<!--- Output the result. --->
<cfdump
	var="#handler.getData()#"
	label="CSV Data"
	/>

As you can see, we create an instance of our parser, passing to it, the handler instance and the path to the data file. Internally, the CSV parser creates a Buffered Input Stream and starts to parse the CSV data, one character at a time. When it is done, we can retrieve the aggregated data from the Handler.cfc instance:

CSV data as returned by an evented CSV parser.

Worked like a charm!

Ok, let's take a look at the internals of the CSVParser.cfc ColdFusion component. As I said before, this component creates a Buffered Input Stream so as to be able to read the CSV data file in a chunk at time without exhausting the computer's memory. Then, it examines each character individually, trying to build up field values.

Each individual character of data gets passed to a particular parsing method. These parsing methods represent the current state of the parser. As each state responds to the given character input, it must return a reference to the next state (ie. the next parser method). As such, each state is responsible for the transition to the next valid state.

CSVParser.cfc

<cfcomponent
	output="false"
	hint="I parse a CSV file using a buffered input reader. Rather than parsing the entire file at one time, events are published as aspects of the file are read.">


	<!---
		This finite state machine is used to parse Comma Serparated
		Values. The following states available are:

		- Pre-Data (first state - only used once)
		- Between Fields
		- Non-Quoted Value
		- Quoted Value
		- Escaped Value
		- Carriage Return
		- New Line
	--->


	<cffunction
		name="init"
		access="public"
		returntype="any"
		output="false"
		hint="I initialize this component instance.">

		<!--- Define arguments. --->
		<cfargument
			name="filePath"
			type="string"
			required="true"
			hint="I am the file path to the CSV data."
			/>

		<cfargument
			name="handler"
			type="any"
			required="true"
			hint="I am the object that listens for CSV parsing events. Only one method is requires: handleEvent()."
			/>

		<!--- Store the file path. --->
		<cfset variables.filePath = arguments.filePath />

		<!--- Store the handler for the parsing events. --->
		<cfset variables.handler = arguments.handler />

		<!---
			Store a buffered input stream to the given file path.
			This will allow us to optimize the input process while,
			at the same time, not having to parse the entier file
			in memory at any given time.
		--->
		<cfset variables.inputStream = createObject( "java", "java.io.BufferedInputStream" ).init(
			createObject( "java", "java.io.FileInputStream" ).init(
				javaCast( "string", variables.filePath )
				)
			) />

		<!---
			I am the current value buffer. As we are building field
			values up, a character at a time, we will need a place
			to hold them before we publish a field event.
		--->
		<cfset variables.fieldBuffer = [] />

		<!---
			I am the current state. In our case, a state is
			represented by a parser that can take one character
			at a time. To begin with, we will put the state into a
			pre-data state (this is the only time that it will be
			used in order to see if any data is in the document).
		--->
		<cfset variables.state = this.inPreData />

		<!--- Start the actual CSV input stream parsing. --->
		<cfset this.parse() />

		<!--- Return this object reference. --->
		<cfreturn this />
	</cffunction>


	<cffunction
		name="parse"
		access="public"
		returntype="any"
		output="false"
		hint="I perform the actual parsing of the CSV input stream.">

		<!--- Define the local scope. --->
		<cfset var local = {} />

		<!---
			Even if the document has no data, we will, at the very
			least, start and end the document.
		--->
		<cfset this.publish( "startDocument" ) />

		<!--- Read the first character in the CSV input stream. --->
		<cfset local.nextByte = variables.inputStream.read() />

		<!---
			The input stream will be providing a single byte at a
			time. It will continue doing this until it hits the end
			of the stream, at which point, it will return -1.
		--->
		<cfloop condition="(local.nextByte neq -1)">

			<!--- Get the character version of the byte. --->
			<cfset local.nextCharacter = chr( local.nextByte ) />

			<!---
				Pass the character off to the current state. When the
				state looks at the character, it will (potentially)
				announce events and then return the state to which we
				should transition.
			--->
			<cfset variables.state = variables.state( local.nextCharacter ) />

			<!--- Read the next byte. --->
			<cfset local.nextByte = variables.inputStream.read() />

		</cfloop>

		<!---
			Now that the document has ended, we need to pass it onto
			the current state so that it can wrap it up appropriately
			(or fail if the End-of-File is in an inappropriate place).
			At this point, we don't care about storing the resultant
			state since we are done parsing.

			NOTE: We are using EOT (end of transmission) to denote
			the "End of File" since we can use RegEx to find that.
		--->
		<cfset variables.state( chr( 4 ) ) />

		<!--- End the document. --->
		<cfset this.publish( "endDocument" ) />

		<!--- Return this object reference. --->
		<cfreturn this />
	</cffunction>


	<cffunction
		name="publish"
		access="public"
		returntype="any"
		output="false"
		hint="I publish the given event with the given data.">

		<!--- Define arguments. --->
		<cfargument
			name="eventType"
			type="string"
			required="true"
			hint="I am the even type. Possible types are: startDocument, startRow, startField, endField, endRow, endDocument."
			/>

		<cfargument
			name="eventData"
			type="string"
			required="false"
			hint="I am the optional data to announce with the event."
			/>

		<!---
			For our purposes, we'll just pass the invocation
			arguments along to the event handler.
		--->
		<cfset variables.handler.handleEvent(
			argumentCollection = arguments
			) />

		<!--- Return this object reference for method chaining. --->
		<cfreturn this />
	</cffunction>


	<cffunction
		name="inBetweenFields"
		access="public"
		returntype="any"
		output="false"
		hint="">

		<!--- Define arguments. --->
		<cfargument
			name="nextCharacter"
			type="string"
			required="true"
			hint="I am the next character in the input stream."
			/>

		<!--- Define the local scope. --->
		<cfset var local = {} />

		<!--- Field character. --->
		<cfif reFind( "[^\r\n,""\x04]", arguments.nextCharacter )>

			<!--- Start the new field. --->
			<cfset this.publish( "startField" ) />

			<!---
				Add the current character to the field buffer (as we
				being to build up the field value).
			--->
			<cfset arrayAppend(
				variables.fieldBuffer,
				arguments.nextCharacter
				) />

			<!--- Move to the non-quoted value. --->
			<cfreturn this.inNonQuotedValue />

		<!--- Comma. --->
		<cfelseif (arguments.nextCharacter eq ",")>

			<!--- Start and end an empty field. --->
			<cfset this.publish( "startField" ) />
			<cfset this.publish( "endField", "" ) />

			<!--- Move to in between fields. --->
			<cfreturn this.inBetweenFields />

		<!--- Carriage return. --->
		<cfelseif reFind( "\r", arguments.nextCharacter )>

			<!--- Start and end an empty field. --->
			<cfset this.publish( "startField" ) />
			<cfset this.publish( "endField", "" ) />

			<!--- End the row. --->
			<cfset this.publish( "endRow" ) />

			<!--- Move the carriage return. --->
			<cfreturn this.inCarriageReturn />

		<!--- New line. --->
		<cfelseif reFind( "\n", arguments.nextCharacter )>

			<!--- Start and end an empty field. --->
			<cfset this.publish( "startField" ) />
			<cfset this.publish( "endField", "" ) />

			<!--- End the row. --->
			<cfset this.publish( "endRow" ) />

			<!--- Move to the new line. --->
			<cfreturn this.inNewLine />

		<!--- Double Quote. --->
		<cfelseif (arguments.nextCharacter eq """")>

			<!--- Start the new field. --->
			<cfset this.publish( "startField" ) />

			<!--- Move to the quoted value. --->
			<cfreturn this.inQuotedValue />

		<!--- End of Transmission. --->
		<cfelseif reFind( "\x04", arguments.nextCharacter )>

			<!--- Start and end an empty field. --->
			<cfset this.publish( "startField" ) />
			<cfset this.publish( "endField", "" ) />

			<!--- End the row. --->
			<cfset this.publish( "endRow" ) />

		<cfelse>

			<!---
				If we made it this far, this state has been put
				into an invalid state / transition.
			--->
			<cfthrow
				type="InvalidStateTransition"
				message="inBetweenFields[#arguments.nextCharacter#]"
				/>

		</cfif>
	</cffunction>


	<cffunction
		name="inCarriageReturn"
		access="public"
		returntype="any"
		output="false"
		hint="">

		<!--- Define arguments. --->
		<cfargument
			name="nextCharacter"
			type="string"
			required="true"
			hint="I am the next character in the input stream."
			/>

		<!--- Define the local scope. --->
		<cfset var local = {} />

		<!--- New line. --->
		<cfif reFind( "\n", arguments.nextCharacter )>

			<!--- Move to the new line. --->
			<cfreturn this.inNewLine />

		<!--- Carriage return. --->
		<cfelseif reFind( "\r", arguments.nextCharacter )>

			<!--- Start and end an empty row. --->
			<cfset this.publish( "startRow" ) />
			<cfset this.publish( "endRow" ) />

			<!--- Move the carriage return. --->
			<cfreturn this.inCarriageReturn />

		<!--- Field character. --->
		<cfelseif reFind( "[^\r\n,""\x04]", arguments.nextCharacter )>

			<!--- Start the next row. --->
			<cfset this.publish( "startRow" ) />

			<!--- Start the next field. --->
			<cfset this.publish( "startField" ) />

			<!--- Add the current character to the field buffer. --->
			<cfset arrayAppend(
				variables.fieldBuffer,
				arguments.nextCharacter
				) />

			<!--- Move to the non-quoted value. --->
			<cfreturn this.inNonQuotedValue />

		<!--- Comma. --->
		<cfelseif (arguments.nextCharacter eq ",")>

			<!--- Start the new row. --->
			<cfset this.publish( "startRow" ) />

			<!--- Start and end the empty field. --->
			<cfset this.publish( "startField" ) />
			<cfset this.publish( "endField", "" ) />

			<!--- Move to in between fields. --->
			<cfreturn this.inBetweenFields />

		<!--- End of Transmission. --->
		<cfelseif reFind( "\x04", arguments.nextCharacter )>

			<!--- Already ended the row - nothing to publish. --->

		<cfelse>

			<!---
				If we made it this far, this state has been put
				into an invalid state / transition.
			--->
			<cfthrow
				type="InvalidStateTransition"
				message="inCarriageReturn[#arguments.nextCharacter#]"
				/>

		</cfif>
	</cffunction>


	<cffunction
		name="inEscapedValue"
		access="public"
		returntype="any"
		output="false"
		hint="">

		<!--- Define arguments. --->
		<cfargument
			name="nextCharacter"
			type="string"
			required="true"
			hint="I am the next character in the input stream."
			/>

		<!--- Define the local scope. --->
		<cfset var local = {} />

		<!--- Double-escaped quote. --->
		<cfif (arguments.nextCharacter eq """")>

			<!---
				This is just an embedded quote. Add it to the
				field buffer. We don't have to worry about the
				previous double-quote as it was only used to
				escape this one.
			--->
			<cfset arrayAppend(
				variables.fieldBuffer,
				arguments.nextCharacter
				) />

			<!--- Return back to the quoted value. --->
			<cfreturn this.inQuotedValue />

		<!--- Comma. --->
		<cfelseif (arguments.nextCharacter eq ",")>

			<!---
				The previous quote was actually the end of the
				previous field. End the previous field.
			--->
			<cfset this.publish(
				"endField",
				arrayToList( variables.fieldBuffer, "" )
				) />

			<!--- Clear the field buffer. --->
			<cfset variables.fieldBuffer = [] />

			<!--- Move to in between fields. --->
			<cfreturn this.inBetweenFields />

		<!--- Carriage return. --->
		<cfelseif reFind( "\r", arguments.nextCharacter )>

			<!---
				The previous quote was actually the end of the
				previous field. End the current field.
			--->
			<cfset this.publish(
				"endField",
				arrayToList( variables.fieldBuffer, "" )
				) />

			<!--- Clear the field buffer. --->
			<cfset variables.fieldBuffer = [] />

			<!--- End the row. --->
			<cfset this.publish( "endRow" ) />

			<!--- Move the carriage return. --->
			<cfreturn this.inCarriageReturn />

		<!--- New line. --->
		<cfelseif reFind( "\n", arguments.nextCharacter )>

			<!---
				The previous quote was actually the end of the
				previous field. End the current field.
			--->
			<cfset this.publish(
				"endField",
				arrayToList( variables.fieldBuffer, "" )
				) />

			<!--- Clear the field buffer. --->
			<cfset variables.fieldBuffer = [] />

			<!--- End the row. --->
			<cfset this.publish( "endRow" ) />

			<!--- Move to the new line. --->
			<cfreturn this.inNewLine />

		<!--- End of Transmission. --->
		<cfelseif reFind( "\x04", arguments.nextCharacter )>

			<!---
				The previous quote was actually the end of the
				previous field. End the current field.
			--->
			<cfset this.publish(
				"endField",
				arrayToList( variables.fieldBuffer, "" )
				) />

			<!--- Clear the field buffer. --->
			<cfset variables.fieldBuffer = [] />

			<!--- End the row. --->
			<cfset this.publish( "endRow" ) />

		<cfelse>

			<!---
				If we made it this far, this state has been put
				into an invalid state / transition.
			--->
			<cfthrow
				type="InvalidStateTransition"
				message="inEscapedValue[#arguments.nextCharacter#]"
				/>

		</cfif>
	</cffunction>


	<cffunction
		name="inNewLine"
		access="public"
		returntype="any"
		output="false"
		hint="">

		<!--- Define arguments. --->
		<cfargument
			name="nextCharacter"
			type="string"
			required="true"
			hint="I am the next character in the input stream."
			/>

		<!--- Define the local scope. --->
		<cfset var local = {} />

		<!--- Carriage return. --->
		<cfif reFind( "\r", arguments.nextCharacter )>

			<!--- End the row. --->
			<cfset this.publish( "endRow" ) />

			<!--- Move the carriage return. --->
			<cfreturn this.inCarriageReturn />

		<!--- New line. --->
		<cfelseif reFind( "\n", arguments.nextCharacter )>

			<!--- Start and end an empty row. --->
			<cfset this.publish( "startRow" ) />
			<cfset this.publish( "endRow" ) />

			<!--- Move to the new line. --->
			<cfreturn this.inNewLine />

		<!--- Field character. --->
		<cfelseif reFind( "[^\r\n,""\x04]", arguments.nextCharacter )>

			<!--- Start the next row. --->
			<cfset this.publish( "startRow" ) />

			<!--- Start the next field. --->
			<cfset this.publish( "startField" ) />

			<!--- Add the current character to the field buffer. --->
			<cfset arrayAppend(
				variables.fieldBuffer,
				arguments.nextCharacter
				) />

			<!--- Move to the non-quoted value. --->
			<cfreturn this.inNonQuotedValue />

		<!--- Comma. --->
		<cfelseif (arguments.nextCharacter eq ",")>

			<!--- Start the new row. --->
			<cfset this.publish( "startRow" ) />

			<!--- Start and end the empty field. --->
			<cfset this.publish( "startField" ) />
			<cfset this.publish( "endField", "" ) />

			<!--- Move to in between fields. --->
			<cfreturn this.inBetweenFields />

		<!--- Double-quote. --->
		<cfelseif (arguments.nextCharacter eq """")>

			<!--- Start the new row. --->
			<cfset this.publish( "startRow" ) />

			<!--- Start the next field. --->
			<cfset this.publish( "startField" ) />

			<!--- Move to quoted value. --->
			<cfreturn this.inQuotedValue />

		<!--- End of Transmission. --->
		<cfelseif reFind( "\x04", arguments.nextCharacter )>

			<!--- Already ended the row, nothing left to do. --->

		<cfelse>

			<!---
				If we made it this far, this state has been put
				into an invalid state / transition.
			--->
			<cfthrow
				type="InvalidStateTransition"
				message="inNewLine[#arguments.nextCharacter#]"
				/>

		</cfif>
	</cffunction>


	<cffunction
		name="inNonQuotedValue"
		access="public"
		returntype="any"
		output="false"
		hint="">

		<!--- Define arguments. --->
		<cfargument
			name="nextCharacter"
			type="string"
			required="true"
			hint="I am the next character in the input stream."
			/>

		<!--- Define the local scope. --->
		<cfset var local = {} />

		<!--- Field character. --->
		<cfif reFind( "[^\r\n,""\x04]", arguments.nextCharacter )>

			<!--- Add the current character to the field buffer. --->
			<cfset arrayAppend(
				variables.fieldBuffer,
				arguments.nextCharacter
				) />

			<!--- Move to the non-quoted value. --->
			<cfreturn this.inNonQuotedValue />

		<!--- Comma. --->
		<cfelseif (arguments.nextCharacter eq ",")>

			<!--- End the current field. --->
			<cfset this.publish(
				"endField",
				arrayToList( variables.fieldBuffer, "" )
				) />

			<!--- Clear the field buffer. --->
			<cfset variables.fieldBuffer = [] />

			<!--- Move to in between fields. --->
			<cfreturn this.inBetweenFields />

		<!--- Carriage return. --->
		<cfelseif reFind( "\r", arguments.nextCharacter )>

			<!--- End the current field. --->
			<cfset this.publish(
				"endField",
				arrayToList( variables.fieldBuffer, "" )
				) />

			<!--- Clear the field buffer. --->
			<cfset variables.fieldBuffer = [] />

			<!--- End the row. --->
			<cfset this.publish( "endRow" ) />

			<!--- Move the carriage return. --->
			<cfreturn this.inCarriageReturn />

		<!--- New line. --->
		<cfelseif reFind( "\n", arguments.nextCharacter )>

			<!--- End the current field. --->
			<cfset this.publish(
				"endField",
				arrayToList( variables.fieldBuffer, "" )
				) />

			<!--- Clear the field buffer. --->
			<cfset variables.fieldBuffer = [] />

			<!--- End the row. --->
			<cfset this.publish( "endRow" ) />

			<!--- Move to the new line. --->
			<cfreturn this.inNewLine />

		<!--- End of Transmission. --->
		<cfelseif reFind( "\x04", arguments.nextCharacter )>

			<!--- End the current field. --->
			<cfset this.publish(
				"endField",
				arrayToList( variables.fieldBuffer, "" )
				) />

			<!--- Clear the field buffer. --->
			<cfset variables.fieldBuffer = [] />

			<!--- End the row. --->
			<cfset this.publish( "endRow" ) />

		<cfelse>

			<!---
				If we made it this far, this state has been put
				into an invalid state / transition.
			--->
			<cfthrow
				type="InvalidStateTransition"
				message="inNonQuotedValue[#arguments.nextCharacter#]"
				/>

		</cfif>
	</cffunction>


	<cffunction
		name="inPreData"
		access="public"
		returntype="any"
		output="false"
		hint="">

		<!--- Define arguments. --->
		<cfargument
			name="nextCharacter"
			type="string"
			required="true"
			hint="I am the next character in the input stream."
			/>

		<!--- Define the local scope. --->
		<cfset var local = {} />

		<!--- Comma. --->
		<cfif (arguments.nextCharacter eq ",")>

			<!--- Start the current row. --->
			<cfset this.publish( "startRow" ) />

			<!--- Start and end an empty field. --->
			<cfset this.publish( "startField" ) />
			<cfset this.publish( "endField", "" ) />

			<!--- Move to in between fields. --->
			<cfreturn this.inBetweenFields />

		<!--- Carriage return. --->
		<cfelseif reFind( "\r", arguments.nextCharacter )>

			<!--- Start and end the row. --->
			<cfset this.publish( "startRow" ) />
			<cfset this.publish( "endRow" ) />

			<!--- Move the carriage return. --->
			<cfreturn this.inCarriageReturn />

		<!--- New line. --->
		<cfelseif reFind( "\n", arguments.nextCharacter )>

			<!--- Start and end the row. --->
			<cfset this.publish( "startRow" ) />
			<cfset this.publish( "endRow" ) />

			<!--- Move to the new line. --->
			<cfreturn this.inNewLine />

		<!--- Double Quote. --->
		<cfelseif (arguments.nextCharacter eq """")>

			<!--- Start the first row. --->
			<cfset this.publish( "startRow" ) />

			<!--- Start the new field. --->
			<cfset this.publish( "startField" ) />

			<!--- Move to the quoted value. --->
			<cfreturn this.inQuotedValue />

		<!--- Field character. --->
		<cfelseif reFind( "[^\r\n,""\x04]", arguments.nextCharacter )>

			<!--- Start the first row. --->
			<cfset this.publish( "startRow" ) />

			<!--- Start the new field. --->
			<cfset this.publish( "startField" ) />

			<!---
				Add the current character to the field buffer (as we
				being to build up the field value).
			--->
			<cfset arrayAppend(
				variables.fieldBuffer,
				arguments.nextCharacter
				) />

			<!--- Move to the non-quoted value. --->
			<cfreturn this.inNonQuotedValue />

		<!--- End of Transmission. --->
		<cfelseif reFind( "\x04", arguments.nextCharacter )>

			<!--- This file had no data, nothing left to do. --->

		<cfelse>

			<!---
				If we made it this far, this state has been put
				into an invalid state / transition.
			--->
			<cfthrow
				type="InvalidStateTransition"
				message="inBetweenRows[#arguments.nextCharacter#]"
				/>

		</cfif>
	</cffunction>


	<cffunction
		name="inQuotedValue"
		access="public"
		returntype="any"
		output="false"
		hint="">

		<!--- Define arguments. --->
		<cfargument
			name="nextCharacter"
			type="string"
			required="true"
			hint="I am the next character in the input stream."
			/>

		<!--- Define the local scope. --->
		<cfset var local = {} />

		<!--- Non-double-quote. --->
		<cfif (arguments.nextCharacter neq """")>

			<!--- Add the current character to the field buffer. --->
			<cfset arrayAppend(
				variables.fieldBuffer,
				arguments.nextCharacter
				) />

			<!--- Move to the quoted value. --->
			<cfreturn this.inQuotedValue />

		<!--- Double quote. --->
		<cfelseif (arguments.nextCharacter eq """")>

			<!---
				Not sure if this quote is an escaped quote or is the
				end of this quoted field. Move to the escaped state
				for further testing.
			--->
			<cfreturn this.inEscapedValue />

		<cfelse>

			<!---
				If we made it this far, this state has been put
				into an invalid state / transition.
			--->
			<cfthrow
				type="InvalidStateTransition"
				message="inQuotedValue[#arguments.nextCharacter#]"
				/>

		</cfif>
	</cffunction>

</cfcomponent>

There's a lot of code here, most of which determines how to transition from one state to the next. The actual parsing, however, is very straightforward. The parser just keeps reading in a character at a time, passing it off to the current state method.

I think there is a lot of power in the Finite State Machine model of problem solving. The more complicated my JavaScript applications get, the more and more I an finding it necessary to think and plan in terms of states of being. I still feel very shaky when it comes to transitions; but, if I can be more diligent about state diagrams and event tables, I think I'll be able to get a lot out of this kind of approach.

Want to use code from this post? Check out the license.

Reader Comments

35 Comments

A few years ago, I took a class on requirements engineering, and one of the things they showed us was the SCR Toolset, an application that you can use to do requirements testing ... and as a side effect, you can use it to set up states for your application. (Note that this is self-contained, so from what I remember, it's not going to export something to Java that you can use to build the app itself.)

You can definitely use it to set up complex situations, but the problem is that it is not an easy tool to learn. (To give you an idea, the #1 hit in Google for "SCR Toolset" is a page on the US Navy's site.) They basically walked us through an example during the course of a few weeks in class, and I understood how it worked in that context, but I had a devil of a time using it for a different application.

55 Comments

"ColdFusion may run out of memory when parsing ginormous files" ? You mean the file IO part? You can loop a file by line in CF8+ to avoid running out of memory.

Or do you mean you're running out of ram constructing the data structure? Then FSM or not you're going to run out of memory.

15,674 Comments

@Dave,

That sounds pretty interesting. While that might be too complicated for me, it would be cool to have some sort of tool that just helps you map out states and transitions. Though, I suppose I could do that with a spreadsheet.

@Henry,

To be fair, I've never had an "out of memory" problem with CSV parsing. However, I've never dealt with files that were too huge (maybe a few thousand rows max).

As far as using CFLoop, you have to be careful. While you can loop over rows in a text file, the CSV format can handle embedded line breaks in a field value. As such, you can't simply break on line breaks as they may NOT be valid row delimiters based on context.

However, you can use CFLoop to loop over chunks of a file (which probably uses a File Input Stream under the hood). At that point, you could probably replace the Java buffer with the CF-based input buffer.

And, you make a good point regarding I/O memory vs. aggregate data memory. I suppose, if you are in a place where you are having memory problems, part of the beauty of this approach is that you never actually have a "complete" file in memory. So, you wouldn't do what I am doing in the demo (building up an array of arrays). Rather, you'd probably treat each row individually, acting on the fields / rows as they become available (ex. inserting into a database, making a web service call, etc.).

35 Comments

@Ben,

It was a pretty neat tool ... but I suspect there is a catch. In fact, looking at that website, here it is:

"NOTE: The SCR toolset is a research product, and should not be viewed as having industrial strength quality. Requests for the software will only be granted to those in U.S. Federal government agencies."

My instructor must have had permission to distribute it to us, but I suspect that didn't carry on to me. (I'm sure you wouldn't want us just posting links to random jar files anyway!)

1 Comments

As long as the SCR toolset isn't applied to mass scale, there are no catches or glitches, at least that i could find. We just have to remember that it is a research tool, nothing else.

5 Comments

Ben, good post. I'll have to see if it works on my ginormous files. The conditions I'm seeing relate to CF not giving up heap space during request process, so no matter how I read it, I still run out. Sure would be nice to have a Lex/Yacc for CF when it comes to parsing files. While creating FSM by hand is fun, the rules get complex and are worse when you have to add additional states.

15,674 Comments

@Thomas,

While I haven't had a good situation where I can test heap space stuff in a while (when I try to run out of space purposefully, it never seems to happen!), I have been told in passing that Sleeping the thread for 10ms should help with the garbage collection. Not sure if that can help anywhere in this particular scenario as the parsing is so encapsulated.

5 Comments

@Ben,

I've tried a number of things, and the best I could come up with was breaking the array of lines down and using a seperate thread for each chunk. I've tried even interacting with GC to encourage collection (I don't like doing this) but I'm not sure CF releases memory until after the request is done. However, after a thread completes it does (and perhaps after a sleep).

I believe in love. I believe in compassion. I believe in human rights. I believe that we can afford to give more of these gifts to the world around us because it costs us nothing to be decent and kind and understanding. And, I want you to know that when you land on this site, you are accepted for who you are, no matter how you identify, what truths you live, or whatever kind of goofy shit makes you feel alive! Rock on with your bad self!
Ben Nadel