Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.
I am the chief technical officer at InVision App, Inc - a prototyping and collaboration platform for designers, built by designers. I also rock out in JavaScript and ColdFusion 24x7.
Meanwhile on Twitter
Loading latest tweet...
Ben Nadel at RIA Unleashed (Nov. 2010) with:

Using A State Machine To Parse Tokenized Data In ColdFusion

Posted by Ben Nadel
Tags: ColdFusion

I have always been fascinated by the concept of parsing. I know nothing about it, which is probably why it seems to mystical to me. I've taken stabs at the parsing game before, using various approaches to try and color-code my blog code samples. But, nothing has ever really seemed completely satisfactory. After reading Ray Camden's blog post on translating roman numerals a few weeks ago, I was again tempted to play around with parsing data into meaningful interpretations.

Before I got to the parsing algorithm, I set up the following test page - I needed to know where I was going before I could figure out how to get there:

  • <!---
  • Store the character data in a buffer. This is just some
  • made-up data that looks like it has "tokenizable" data
  • contained within it.
  • --->
  • <cfsavecontent variable="data">
  • print "hello there"
  • # This is a fake comment.
  • pause 10 # Another comment here.
  • exit
  • </cfsavecontent>
  •  
  •  
  • <!--- Create a new instance of our tokenizer. --->
  • <cfset tokenizer = createObject( "component", "Tokenizer" ).init() />
  •  
  • <!--- Get the tokens. --->
  • <cfset tokens = tokenizer.getTokens( trim( data ) ) />
  •  
  •  
  • <!--- Output the parsed tokens. --->
  • <cfdump
  • var="#tokens#"
  • label="Data Tokens"
  • />

The idea here is to take the raw data stored in the "Data" string buffer and parse it into an array of meaningful tokens. "Meaningful" is probably a very subjective term; but, for my learning purposes, I wanted to keep it simple - I wanted to parse the above data into "commands," "command arguments," and "comments."

Once I had my goals defined, I started to think about how I wanted to attack the problem. This turned into me just staring at HomeSite for a good 25 minutes with a blank expression on my face. I kept thinking about how I might want to approach the parsing using regular expressions. This is how I color-code my blog code snippets; but, I just couldn't think it through properly.

Then, I started to think about State Machines. Something about a state machine makes sense since we are going to be compiling multiple data tokens into a single, meaningful token. As such, you can think of parsing as collecting data points while in a given state. In a state machine, not only do we need to know about the current state, we need to know how to transition from state to state. In my previous experiment with state machines, each state machine event handler would return the next state. In a parsing scenario, I figured I could use this same approach; but, rather than returning the next state, I return the next tokenizer.

In the following approach, I have my main method, getTokens(), which loops over each raw data token in the data string. It then passes each token off to the currently selected tokenizer method. This tokenizer represents the state of the parser; it understands both how to utilize the given data token and how to transition to the next state (by returning the next tokenizer method reference).

If a token needs to be built-up across several "data token events", the tokenizer methods use a tokenBuffer array to store partial token data. Then, when a tokenizer method switches states (to a different tokenizer), it will first flush the existing token buffer to the token collection.

I am sure there is a lot I am missing here; but hey, this was my first attempt at this kind of thing.

Tokenizer.cfc

  • <cfcomponent
  • output="false"
  • hint="I parse character data into tokens.">
  •  
  •  
  • <cffunction
  • name="init"
  • access="public"
  • returntype="any"
  • output="false"
  • hint="I return an initialized component.">
  •  
  • <!--- I am the token collection. --->
  • <cfset this.tokens = [] />
  •  
  • <!---
  • I am the token buffer - each token fragment might
  • be managed here to build up a single token.
  • --->
  • <cfset this.tokenBuffer = [] />
  •  
  • <!--- Return this object reference. --->
  • <cfreturn this />
  • </cffunction>
  •  
  •  
  • <cffunction
  • name="getTokens"
  • access="public"
  • returntype="array"
  • output="false"
  • hint="I tokenize the given data and return the token collection.">
  •  
  • <!--- Define arguments. --->
  • <cfargument
  • name="data"
  • type="string"
  • required="true"
  • hint="I am the raw data."
  • />
  •  
  • <!--- Setup the local socpe. --->
  • <cfset var local = {} />
  •  
  • <!--- Clear the current tokens. --->
  • <cfset this.tokens = [] />
  •  
  • <!--- Clear the current token buffer. --->
  • <cfset this.tokenBuffer = [] />
  •  
  • <!---
  • Split the data on spaces (but not on line-breaks -
  • they will be required during the parsing processing
  • to know where tokens end).
  • --->
  • <cfset local.dataTokens = reMatch(
  • "[^ \r\n]+|[\r\n]+",
  • arguments.data
  • ) />
  •  
  • <!--- Set the default tokenizer. --->
  • <cfset local.tokenizer = this.getNextCommand />
  •  
  • <!---
  • Loop over all of the raw data tokens. Each token
  • will be processed by one of the parsers.
  • --->
  • <cfloop
  • index="local.dataToken"
  • array="#local.dataTokens#">
  •  
  • <!---
  • Pass the data token to the tokenizer and get the
  • next tokenizer. The tokenizer will use the next
  • data item to update the token buffer and token
  • collection and then return the next appropriate
  • tokenizer to be used.
  • --->
  • <cfset local.tokenizer = local.tokenizer(
  • local.dataToken
  • ) />
  •  
  • </cfloop>
  •  
  • <!---
  • If there is anything in the token buffer, flush it
  • to the tokens.
  • --->
  • <cfif arrayLen( this.tokenBuffer )>
  •  
  • <!--- Flush buffer to tokens. --->
  • <cfset arrayAppend(
  • this.tokens,
  • arrayToList( this.tokenBuffer, " " )
  • ) />
  •  
  • </cfif>
  •  
  • <!--- Return the prased tokens. --->
  • <cfreturn this.tokens />
  • </cffunction>
  •  
  •  
  • <cffunction
  • name="getNextCommand"
  • access="public"
  • returntype="any"
  • output="false"
  • hint="I get the next command item.">
  •  
  • <!--- Define arguments. --->
  • <cfargument name="dataToken" type="string" />
  •  
  • <!--- Check to see if this is a standard command. --->
  • <cfif listFind( "print,pause,exit", arguments.dataToken )>
  •  
  • <!--- Add this command to the tokens. --->
  • <cfset arrayAppend( this.tokens, arguments.dataToken ) />
  •  
  • <!--- Return the next parser. --->
  • <cfreturn this.getNextCommandArgument />
  •  
  • <cfelseif reFind( "^##", arguments.dataToken )>
  •  
  • <!---
  • We are starting a comment. Add this to the
  • token buffer as we will need to start building
  • up a comment.
  • --->
  • <cfset arrayAppend( this.tokenBuffer, arguments.dataToken ) />
  •  
  • <!--- Return the next parser. --->
  • <cfreturn this.getNextComment />
  •  
  • </cfif>
  • </cffunction>
  •  
  •  
  • <cffunction
  • name="getNextCommandArgument"
  • access="public"
  • returntype="any"
  • output="false"
  • hint="I get the next command argument.">
  •  
  • <!--- Define arguments. --->
  • <cfargument name="dataToken" type="string" />
  •  
  • <!--- Check to see if we hit a comment. --->
  • <cfif reFind( "^##", arguments.dataToken )>
  •  
  • <!---
  • We have gotten to end of the command argument.
  • Use the token buffer to build the next token.
  • --->
  • <cfset arrayAppend(
  • this.tokens,
  • arrayToList( this.tokenBuffer, " " )
  • ) />
  •  
  • <!---
  • Reset the token buffer with the opening of the
  • next comment.
  • --->
  • <cfset this.tokenBuffer = [ arguments.dataToken ] />
  •  
  • <!--- Return the next tokenizer. --->
  • <cfreturn this.getNextComment />
  •  
  • <cfelseif reFind( "^[\r\n]", arguments.dataToken )>
  •  
  • <!---
  • We have gotten to end of the command argument.
  • Use the token buffer to build the next token.
  • --->
  • <cfset arrayAppend(
  • this.tokens,
  • arrayToList( this.tokenBuffer, " " )
  • ) />
  •  
  • <!--- Reset the token buffer. --->
  • <cfset this.tokenBuffer = [] />
  •  
  • <!--- Return the next tokenizer. --->
  • <cfreturn this.getNextCommand />
  •  
  • <cfelse>
  •  
  • <!---
  • We are still building the command argument. Add
  • the data point to the buffer and return the same
  • tokenizer.
  • --->
  • <cfset arrayAppend( this.tokenBuffer, arguments.dataToken ) />
  •  
  • <!--- Return this tokenizer. --->
  • <cfreturn this.getNextCommandArgument />
  •  
  • </cfif>
  • </cffunction>
  •  
  •  
  • <cffunction
  • name="getNextComment"
  • access="public"
  • returntype="any"
  • output="false"
  • hint="I get the next comment item.">
  •  
  • <!--- Define arguments. --->
  • <cfargument name="dataToken" type="string" />
  •  
  • <!--- Check to see if we hit the end of the comment. --->
  • <cfif reFind( "^[\r\n]", arguments.dataToken )>
  •  
  • <!---
  • We have gotten to end of the comment. Use the
  • token buffer to build the next token.
  • --->
  • <cfset arrayAppend(
  • this.tokens,
  • arrayToList( this.tokenBuffer, " " )
  • ) />
  •  
  • <!--- Reset the token buffer. --->
  • <cfset this.tokenBuffer = [] />
  •  
  • <!--- Return the next tokenizer. --->
  • <cfreturn this.getNextCommand />
  •  
  • <cfelse>
  •  
  • <!---
  • We are still building the comment. Add the data
  • point to the buffer and return the same tokenizer.
  • --->
  • <cfset arrayAppend( this.tokenBuffer, arguments.dataToken ) />
  •  
  • <!--- Return this tokenizer. --->
  • <cfreturn this.getNextComment />
  •  
  • </cfif>
  •  
  • </cffunction>
  •  
  • </cfcomponent>

As you can see, there are three tokenizer methods in this ColdFusion component: getNextCommand(), getNextCommandArgument(), and getNextComment(). Each of these is passed a single data token that it then must append to the token collection or the intermediate token buffer before returning a reference to the next tokenizer method. Running this ColdFusion component against my test page, we get the following CFDump output:

 
 
 
 
 
 
Data Tokens As Parsed By The ColdFusion Tokenizer State Machine. 
 
 
 

As you can see, the tokenizer methods successfully parsed the raw data into meaningful tokens. Now, again, I say "meaningful" here in the loosest sense of the word; I have no idea if having the tokens in this format makes them any easier to work with. Mostly, I just wanted to take a stab at a parsing algorithm before I resorted to looking something up. I figured this would make the final solution much more memorable.




Reader Comments

Ben,

If you are really interested in parsing, I really suggest you check out ANTLR http://www.antlr.org/. Its an amazing parser generator, that allows you to do some really great stuff in short order.

The book Terrance (the author) also has available is really good reference to parsing in general, and also is a great reference to ANTLR as well.

Reply to this Comment

@Jon, @Mark,

Both those things looks very interesting; but they are perhaps much more complicated that I need to get (or perhaps I am just not thinking about it). Of course, when I look at the "brushes" in syntax highlighters, they are also really complex - so it's probably a much bigger job than I realize :)

Reply to this Comment

This was a very helpful post. I am actually looking for a tokenizer type function to help parse long titles so that they fit on a calendar without expanding the size of the calendar, thanks for sharing this.

Reply to this Comment

@Ben,

I'm trying to take a long title (e.g. The University of Illinois Urbana-Champaign Nuclear Research Reactor) and break it down into pieces that will actually fit on a calendar (maybe display 2-3 words per line). I have to be able to display the entire name for printing purposes. Currently I am using Java's String split method to only display the first two to three words of the name and ellipses behind that to show there are more words, this ok for test purposes, but I need to display the whole name without stretching the calendar out of shape.

Reply to this Comment

@James,
I was working on something like this a few months back and my solution was to enforce a line length upper limit (something that will fit perfectly in the allotted calendar space, for example) and then split the string on the last space character to occur before the upper limit.
Another option would be to add the feature to avoid the whole space capture and just use a hyphen for the line-split.

Happy hacking!

Reply to this Comment

@Micheal

Yes that's exactly what I am going for. If you have some code that you would not mind sharing, I'd be happy to use it. Currently my code stops at the maximum line character at the first break that comes before the max and displays ellipses.

Reply to this Comment

I actually did figure something out that seems to work well enough. I am using the Java SE 6 version of the StringTokenizer class along with the Java Array classes copyOfRange() method to do what I need it to do. It probably is not the best way, but it does work. If someone has a better way, I'd be happy to optimize my code.

Reply to this Comment

Post A Comment

?
You — Get Out Of My Dreams, Get Into My Comments
Live in the Now
Oops!
Comment Etiquette: Please do not post spam. Please keep the comments on-topic. Please do not post unrelated questions or large chunks of code. And, above all, please be nice to each other - we're trying to have a good conversation here.