Using A State Machine To Parse Tokenized Data In ColdFusion

By Ben Nadel on March 1, 2010

I have always been fascinated by the concept of parsing. I know nothing about it, which is probably why it seems to mystical to me. I've taken stabs at the parsing game before, using various approaches to try and color-code my blog code samples. But, nothing has ever really seemed completely satisfactory. After reading Ray Camden's blog post on translating roman numerals a few weeks ago, I was again tempted to play around with parsing data into meaningful interpretations.

Before I got to the parsing algorithm, I set up the following test page - I needed to know where I was going before I could figure out how to get there:

<!---
	Store the character data in a buffer. This is just some
	made-up data that looks like it has "tokenizable" data
	contained within it.
--->
<cfsavecontent variable="data">
print "hello there"
# This is a fake comment.
pause 10 # Another comment here.
exit
</cfsavecontent>


<!--- Create a new instance of our tokenizer. --->
<cfset tokenizer = createObject( "component", "Tokenizer" ).init() />

<!--- Get the tokens. --->
<cfset tokens = tokenizer.getTokens( trim( data ) ) />


<!--- Output the parsed tokens. --->
<cfdump
	var="#tokens#"
	label="Data Tokens"
	/>

The idea here is to take the raw data stored in the "Data" string buffer and parse it into an array of meaningful tokens. "Meaningful" is probably a very subjective term; but, for my learning purposes, I wanted to keep it simple - I wanted to parse the above data into "commands," "command arguments," and "comments."

Once I had my goals defined, I started to think about how I wanted to attack the problem. This turned into me just staring at HomeSite for a good 25 minutes with a blank expression on my face. I kept thinking about how I might want to approach the parsing using regular expressions. This is how I color-code my blog code snippets; but, I just couldn't think it through properly.

Then, I started to think about State Machines. Something about a state machine makes sense since we are going to be compiling multiple data tokens into a single, meaningful token. As such, you can think of parsing as collecting data points while in a given state. In a state machine, not only do we need to know about the current state, we need to know how to transition from state to state. In my previous experiment with state machines, each state machine event handler would return the next state. In a parsing scenario, I figured I could use this same approach; but, rather than returning the next state, I return the next tokenizer.

In the following approach, I have my main method, getTokens(), which loops over each raw data token in the data string. It then passes each token off to the currently selected tokenizer method. This tokenizer represents the state of the parser; it understands both how to utilize the given data token and how to transition to the next state (by returning the next tokenizer method reference).

If a token needs to be built-up across several "data token events", the tokenizer methods use a tokenBuffer array to store partial token data. Then, when a tokenizer method switches states (to a different tokenizer), it will first flush the existing token buffer to the token collection.

I am sure there is a lot I am missing here; but hey, this was my first attempt at this kind of thing.

Tokenizer.cfc

<cfcomponent
	output="false"
	hint="I parse character data into tokens.">


	<cffunction
		name="init"
		access="public"
		returntype="any"
		output="false"
		hint="I return an initialized component.">

		<!--- I am the token collection. --->
		<cfset this.tokens = [] />

		<!---
			I am the token buffer - each token fragment might
			be managed here to build up a single token.
		--->
		<cfset this.tokenBuffer = [] />

		<!--- Return this object reference. --->
		<cfreturn this />
	</cffunction>


	<cffunction
		name="getTokens"
		access="public"
		returntype="array"
		output="false"
		hint="I tokenize the given data and return the token collection.">

		<!--- Define arguments. --->
		<cfargument
			name="data"
			type="string"
			required="true"
			hint="I am the raw data."
			/>

		<!--- Setup the local socpe. --->
		<cfset var local = {} />

		<!--- Clear the current tokens. --->
		<cfset this.tokens = [] />

		<!--- Clear the current token buffer. --->
		<cfset this.tokenBuffer = [] />

		<!---
			Split the data on spaces (but not on line-breaks -
			they will be required during the parsing processing
			to know where tokens end).
		--->
		<cfset local.dataTokens = reMatch(
			"[^ \r\n]+|[\r\n]+",
			arguments.data
			) />

		<!--- Set the default tokenizer. --->
		<cfset local.tokenizer = this.getNextCommand />

		<!---
			Loop over all of the raw data tokens. Each token
			will be processed by one of the parsers.
		--->
		<cfloop
			index="local.dataToken"
			array="#local.dataTokens#">

			<!---
				Pass the data token to the tokenizer and get the
				next tokenizer. The tokenizer will use the next
				data item to update the token buffer and token
				collection and then return the next appropriate
				tokenizer to be used.
			--->
			<cfset local.tokenizer = local.tokenizer(
				local.dataToken
				) />

		</cfloop>

		<!---
			If there is anything in the token buffer, flush it
			to the tokens.
		--->
		<cfif arrayLen( this.tokenBuffer )>

			<!--- Flush buffer to tokens. --->
			<cfset arrayAppend(
				this.tokens,
				arrayToList( this.tokenBuffer, " " )
				) />

		</cfif>

		<!--- Return the prased tokens. --->
		<cfreturn this.tokens />
	</cffunction>


	<cffunction
		name="getNextCommand"
		access="public"
		returntype="any"
		output="false"
		hint="I get the next command item.">

		<!--- Define arguments. --->
		<cfargument name="dataToken" type="string" />

		<!--- Check to see if this is a standard command. --->
		<cfif listFind( "print,pause,exit", arguments.dataToken )>

			<!--- Add this command to the tokens. --->
			<cfset arrayAppend( this.tokens, arguments.dataToken ) />

			<!--- Return the next parser. --->
			<cfreturn this.getNextCommandArgument />

		<cfelseif reFind( "^##", arguments.dataToken )>

			<!---
				We are starting a comment. Add this to the
				token buffer as we will need to start building
				up a comment.
			--->
			<cfset arrayAppend( this.tokenBuffer, arguments.dataToken ) />

			<!--- Return the next parser. --->
			<cfreturn this.getNextComment />

		</cfif>
	</cffunction>


	<cffunction
		name="getNextCommandArgument"
		access="public"
		returntype="any"
		output="false"
		hint="I get the next command argument.">

		<!--- Define arguments. --->
		<cfargument name="dataToken" type="string" />

		<!--- Check to see if we hit a comment. --->
		<cfif reFind( "^##", arguments.dataToken )>

			<!---
				We have gotten to end of the command argument.
				Use the token buffer to build the next token.
			--->
			<cfset arrayAppend(
				this.tokens,
				arrayToList( this.tokenBuffer, " " )
				) />

			<!---
				Reset the token buffer with the opening of the
				next comment.
			--->
			<cfset this.tokenBuffer = [ arguments.dataToken ] />

			<!--- Return the next tokenizer. --->
			<cfreturn this.getNextComment />

		<cfelseif reFind( "^[\r\n]", arguments.dataToken )>

			<!---
				We have gotten to end of the command argument.
				Use the token buffer to build the next token.
			--->
			<cfset arrayAppend(
				this.tokens,
				arrayToList( this.tokenBuffer, " " )
				) />

			<!--- Reset the token buffer. --->
			<cfset this.tokenBuffer = [] />

			<!--- Return the next tokenizer. --->
			<cfreturn this.getNextCommand />

		<cfelse>

			<!---
				We are still building the command argument. Add
				the data point to the buffer and return the same
				tokenizer.
			--->
			<cfset arrayAppend( this.tokenBuffer, arguments.dataToken ) />

			<!--- Return this tokenizer. --->
			<cfreturn this.getNextCommandArgument />

		</cfif>
	</cffunction>


	<cffunction
		name="getNextComment"
		access="public"
		returntype="any"
		output="false"
		hint="I get the next comment item.">

		<!--- Define arguments. --->
		<cfargument name="dataToken" type="string" />

		<!--- Check to see if we hit the end of the comment. --->
		<cfif reFind( "^[\r\n]", arguments.dataToken )>

			<!---
				We have gotten to end of the comment. Use the
				token buffer to build the next token.
			--->
			<cfset arrayAppend(
				this.tokens,
				arrayToList( this.tokenBuffer, " " )
				) />

			<!--- Reset the token buffer. --->
			<cfset this.tokenBuffer = [] />

			<!--- Return the next tokenizer. --->
			<cfreturn this.getNextCommand />

		<cfelse>

			<!---
				We are still building the comment. Add the data
				point to the buffer and return the same tokenizer.
			--->
			<cfset arrayAppend( this.tokenBuffer, arguments.dataToken ) />

			<!--- Return this tokenizer. --->
			<cfreturn this.getNextComment />

		</cfif>

	</cffunction>

</cfcomponent>

As you can see, there are three tokenizer methods in this ColdFusion component: getNextCommand(), getNextCommandArgument(), and getNextComment(). Each of these is passed a single data token that it then must append to the token collection or the intermediate token buffer before returning a reference to the next tokenizer method. Running this ColdFusion component against my test page, we get the following CFDump output:

Data Tokens As Parsed By The ColdFusion Tokenizer State Machine.

As you can see, the tokenizer methods successfully parsed the raw data into meaningful tokens. Now, again, I say "meaningful" here in the loosest sense of the word; I have no idea if having the tokens in this format makes them any easier to work with. Mostly, I just wanted to take a stab at a parsing algorithm before I resorted to looking something up. I figured this would make the final solution much more memorable.

Want to use code from this post? Check out the license.

Short link: https://bennadel.com/go/1863

Reader Comments

Jon Dowdle Mar 1, 2010 at 12:15 PM

5 Comments

@Ben Have you looked at flex (as in free lex, not Adobe's)? The way flex works might give you some more ideas on how to parse.

http://en.wikipedia.org/wiki/Flex_lexical_analyser

Mark Mandel Mar 1, 2010 at 4:47 PM

26 Comments

Ben,

If you are really interested in parsing, I really suggest you check out ANTLR http://www.antlr.org/. Its an amazing parser generator, that allows you to do some really great stuff in short order.

The book Terrance (the author) also has available is really good reference to parsing in general, and also is a great reference to ANTLR as well.

Ben Nadel Mar 1, 2010 at 6:42 PM

15,688 Comments

@Jon, @Mark,

Both those things looks very interesting; but they are perhaps much more complicated that I need to get (or perhaps I am just not thinking about it). Of course, when I look at the "brushes" in syntax highlighters, they are also really complex - so it's probably a much bigger job than I realize :)

James White Mar 2, 2010 at 10:35 AM

7 Comments

This was a very helpful post. I am actually looking for a tokenizer type function to help parse long titles so that they fit on a calendar without expanding the size of the calendar, thanks for sharing this.

Ben Nadel Mar 2, 2010 at 11:10 AM

15,688 Comments

@James,

What are you trying to do? Select the first N characters of a title or something?

James White Mar 2, 2010 at 11:20 AM

7 Comments

@Ben,

I'm trying to take a long title (e.g. The University of Illinois Urbana-Champaign Nuclear Research Reactor) and break it down into pieces that will actually fit on a calendar (maybe display 2-3 words per line). I have to be able to display the entire name for printing purposes. Currently I am using Java's String split method to only display the first two to three words of the name and ellipses behind that to show there are more words, this ok for test purposes, but I need to display the whole name without stretching the calendar out of shape.

Ben Nadel Mar 2, 2010 at 11:23 AM

15,688 Comments

@James,

Gotcha; well, if you get stuck, let us know.

Michael Mar 3, 2010 at 9:50 AM

3 Comments

@James,
I was working on something like this a few months back and my solution was to enforce a line length upper limit (something that will fit perfectly in the allotted calendar space, for example) and then split the string on the last space character to occur before the upper limit.
Another option would be to add the feature to avoid the whole space capture and just use a hyphen for the line-split.

Happy hacking!

James White Mar 3, 2010 at 11:08 AM

7 Comments

@Micheal

Yes that's exactly what I am going for. If you have some code that you would not mind sharing, I'd be happy to use it. Currently my code stops at the maximum line character at the first break that comes before the max and displays ellipses.

James White Mar 5, 2010 at 1:06 PM

7 Comments

I actually did figure something out that seems to work well enough. I am using the Java SE 6 version of the StringTokenizer class along with the Java Array classes copyOfRange() method to do what I need it to do. It probably is not the best way, but it does work. If someone has a better way, I'd be happy to optimize my code.

Oh my chickens, this post is old!

Hit me up on Twitter if you want to discuss it further.