Ben Nadel

String Tokenizer ColdFusion Component That Can Handle Qualified Fields

By Ben Nadel on
Tags: ColdFusion

I know this has been done out there already, but in order to fully understand the way something works, I like to build it for myself (or at least attempt to) from the ground up. In creating the next version of my POI Utility for reading and writing Excel files using ColdFusion, I want to add CSV parsing abilities. The problem with CSV is that the field values can become tricky when the fields are qualified and have embedded delimiters and qualifiers.

To help understand this part of the CSV parsing algorithm, I have created a ColdFusion component called StringTokenizer. This is modeled after the Java StringTokenizer and, in fact, has the same interface. The difference with mine is that the StringTokenizer can take a field Qualifier in the Init() method. Also, this StringTokenizer can be re-initialized without calling CreateObject() again. Therefore, it can be re-used with minimal overhead.
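
For comparison, here is a minimal sketch of driving the Java class directly from ColdFusion. The variable name objJavaTokenizer and the sample string are only for illustration. Keep in mind that java.util.StringTokenizer treats consecutive delimiters as a single delimiter, so the empty field in the sample string is skipped:

```cfml
<!--- Create the Java tokenizer with a sample string and a comma delimiter. --->
<cfset objJavaTokenizer = CreateObject(
	"java",
	"java.util.StringTokenizer"
	).Init(
		"a,b,,c",
		","
		) />

<!--- Iterate with the same method names that the component exposes. --->
<cfloop condition="objJavaTokenizer.HasMoreTokens()">

	<cfoutput>[#objJavaTokenizer.NextToken()#]<br /></cfoutput>

</cfloop>
```

This outputs [a], [b], and [c]; the empty value between the two adjacent commas never shows up as a token.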

So, in case you have no idea what a String Tokenizer is, it's an object that will iterate over the tokens in a given string. Here is my ColdFusion component implementation of the String Tokenizer:

  • <cfcomponent
  • displayname="StringTokenizer"
  • output="false"
  • hint="Iterates over the tokens of a given string based on string delimiters and token qualifiers.">
  •  
  • <!--- Run the pseudo constructor. --->
  • <cfscript>
  •  
  • // Set up an instance structure to hold instance data.
  • VARIABLES.Instance = StructNew();
  •  
  • // This will hold the original string passed in by the user.
  • VARIABLES.Instance.OriginalString = "";
  •  
  • // Set the default delimiter and qualifiers.
  • VARIABLES.Instance.Delimiter = ",";
  • VARIABLES.Instance.Qualifier = "";
  •  
  • // This will hold the index of the previously returned token.
  • VARIABLES.Instance.TokenIndex = 0;
  •  
  • // This will hold the data for the raw tokens. These are related to
  • // the tokens returned, but not exactly the same thing.
  • VARIABLES.Instance.RawTokens = "";
  •  
  • // This will keep track of where we are in the raw tokens.
  • VARIABLES.Instance.RawTokenIndex = 0;
  •  
  • </cfscript>
  •  
  •  
  • <cffunction
  • name="Init"
  • access="public"
  • returntype="any"
  • output="false"
  • hint="Returns an initialized String Tokenizer instance.">
  •  
  • <!--- Define arguments. --->
  • <cfargument
  • name="String"
  • type="string"
  • required="true"
  • hint="This is the string that will be broken up into tokens."
  • />
  •  
  • <cfargument
  • name="Delimiter"
  • type="string"
  • required="false"
  • default=","
  • hint="This is the delimiter that will separate the tokens."
  • />
  •  
  • <cfargument
  • name="Qualifier"
  • type="string"
  • required="false"
  • default=""""
  • hint="This is the qualifier that will wrap around fields that have special characters embedded."
  • />
  •  
  • <!---
  • When storing the delimiter, we only want to accept the first character
  • returned. This is different than standard ColdFusion, but I am trying
  • to make this as easy as possible.
  • --->
  • <cfset VARIABLES.Instance.Delimiter = Left( ARGUMENTS.Delimiter, 1 ) />
  •  
  • <!---
  • When storing the qualifier, we only want to accept the first character
  • returned. It is possible that there is no qualifier being used. In that
  • case, we can just store the empty string.
  • --->
  • <cfif Len( ARGUMENTS.Qualifier )>
  •  
  • <cfset VARIABLES.Instance.Qualifier = Left( ARGUMENTS.Qualifier, 1 ) />
  •  
  • <cfelse>
  •  
  • <cfset VARIABLES.Instance.Qualifier = "" />
  •  
  • </cfif>
  •  
  • <!--- Store the original string. --->
  • <cfset VARIABLES.Instance.OriginalString = ARGUMENTS.String />
  •  
  • <!---
  • Break the original string up into raw tokens. Going forward, some of
  • these tokens may be merged, but doing it this way will help us
  • iterate over them. When splitting the string, add a space to each
  • token first to ensure that the split works properly.
  •  
  • BE CAREFUL! Splitting a string into an array using the Split
  • notation does not create a COLDFUSION ARRAY. You cannot alter this
  • array once it has been created. It can merely be referenced.
  • --->
  • <cfset VARIABLES.Instance.RawTokens = ToString(
  • " " &
  • ARGUMENTS.String
  • ).ReplaceAll(
  • "([\#VARIABLES.Instance.Delimiter#]{1})",
  • "$1 "
  • ).Split( "[\#VARIABLES.Instance.Delimiter#]{1}" )
  • />
  •  
  •  
  • <!--- Set the default indexes. --->
  • <cfset VARIABLES.Instance.TokenIndex = 0 />
  • <cfset VARIABLES.Instance.RawTokenIndex = 0 />
  •  
  •  
  • <!--- Return This reference. --->
  • <cfreturn THIS />
  •  
  • </cffunction>
  •  
  •  
  • <cffunction
  • name="CountTokens"
  • access="public"
  • returntype="numeric"
  • output="false"
  • hint="Returns the number of tokens over which the tokenizer has iterated.">
  •  
  • <!---
  • Return the number of tokens that we have returned. This should be
  • equal to the token index (seeing as this value is incremented for
  • each call to NextElement()).
  • --->
  • <cfreturn VARIABLES.Instance.TokenIndex />
  • </cffunction>
  •  
  •  
  • <cffunction
  • name="HasMoreElements"
  • access="public"
  • returntype="boolean"
  • output="false"
  • hint="Checks to see if there are more elements to be returned.">
  •  
  • <!---
  • We know that we have more elements if the current raw token index
  • is still less than the number of raw tokens we have.
  • --->
  • <cfreturn (VARIABLES.Instance.RawTokenIndex LT ArrayLen( VARIABLES.Instance.RawTokens )) />
  • </cffunction>
  •  
  •  
  • <cffunction
  • name="HasMoreTokens"
  • access="public"
  • returntype="boolean"
  • output="false"
  • hint="Checks to see if there are more elements to be returned (this just wraps around HasMoreElements()).">
  •  
  • <cfreturn THIS.HasMoreElements() />
  • </cffunction>
  •  
  •  
  • <cffunction
  • name="NextElement"
  • access="public"
  • returntype="string"
  • output="false"
  • hint="Returns the next element.">
  •  
  • <!--- Define the local scope. --->
  • <cfset var LOCAL = StructNew() />
  •  
  • <!--- Set the default value for the returned token. --->
  • <cfset LOCAL.Value = "" />
  •  
  • <!---
  • Set the default flag for whether or not we are in the middle
  • of building a value across raw tokens.
  • --->
  • <cfset LOCAL.IsInValue = false />
  •  
  •  
  • <!---
  • Check to see if we have a field qualifier. If we do, then we might
  • have to build the value across multiple fields. If we do not, then
  • the raw tokens should line up perfectly with the real tokens.
  • --->
  • <cfif Len( VARIABLES.Instance.Qualifier )>
  •  
  •  
  • <!---
  • Since we are using a field qualifier, we might have to build a value
  • across several raw tokens. Remember, for this, all fields containing
  • embedded qualifiers and/or delimiters MUST be in qualified field values.
  • --->
  •  
  • <!--- Increment raw token index. --->
  • <cfset VARIABLES.Instance.RawTokenIndex = (VARIABLES.Instance.RawTokenIndex + 1) />
  •  
  • <!--- Set the value to the current raw token. --->
  • <cfset LOCAL.Value = VARIABLES.Instance.RawTokens[ VARIABLES.Instance.RawTokenIndex ] />
  •  
  • <!--- Remove the leading white space from the raw token. --->
  • <cfset LOCAL.Value = LOCAL.Value.ReplaceFirst( "^.{1}", "" ) />
  •  
  •  
  • <!--- Now, we have to check to see what kind of token we are dealing with. --->
  • <cfif (LOCAL.Value EQ (VARIABLES.Instance.Qualifier & VARIABLES.Instance.Qualifier))>
  •  
  • <!---
  • This field is just a fully qualified empty field. Set the
  • current value to be empty.
  • --->
  • <cfset LOCAL.Value = "" />
  •  
  •  
  • <!---
  • Check to see if we are dealing with a qualified field. If we are,
  • then we MIGHT have to build the value across tokens.
  • --->
  • <cfelseif (Left( LOCAL.Value, 1 ) EQ VARIABLES.Instance.Qualifier)>
  •  
  • <!--- Strip out the first qualifier. --->
  • <cfset LOCAL.Value = LOCAL.Value.ReplaceFirst( "^.{1}", "" ) />
  •  
  • <!---
  • Replace any escaped qualifiers (double-instance) with text
  • that cannot be confused.
  • --->
  • <cfset LOCAL.Value = LOCAL.Value.ReplaceAll(
  • "\#VARIABLES.Instance.Qualifier#{2}",
  • "[[QUALIFIER]]"
  • ) />
  •  
  • <!---
  • Now, check to see if this value ends with a quote. If it does,
  • then we know that we are dealing with a single qualified field.
  • If it does NOT, then that is when we have to build across tokens.
  • --->
  • <cfif (Right( LOCAL.Value, 1 ) EQ VARIABLES.Instance.Qualifier)>
  •  
  • <!---
  • We are dealing with a single field here. Just remove the
  • last character of the value.
  • --->
  • <cfset LOCAL.Value = LOCAL.Value.ReplaceFirst( ".{1}$", "" ) />
  •  
  • <cfelse>
  •  
  • <!---
  • We have just started a value that is incomplete. Now, we
  • must loop over the tokens to find the rest of the value.
  • --->
  • <cfloop
  • index="VARIABLES.Instance.RawTokenIndex"
  • from="#(VARIABLES.Instance.RawTokenIndex + 1)#"
  • to="#ArrayLen( VARIABLES.Instance.RawTokens )#"
  • step="1">
  •  
  • <!--- Grab the next token value. --->
  • <cfset LOCAL.TempValue = VARIABLES.Instance.RawTokens[ VARIABLES.Instance.RawTokenIndex ] />
  •  
  • <!--- Remove the leading white space from the raw token. --->
  • <cfset LOCAL.TempValue = LOCAL.TempValue.ReplaceFirst( "^.{1}", "" ) />
  •  
  • <!---
  • Replace any escaped qualifiers (double-instance) with text
  • that cannot be confused.
  • --->
  • <cfset LOCAL.TempValue = LOCAL.TempValue.ReplaceAll(
  • "\#VARIABLES.Instance.Qualifier#{2}",
  • "[[QUALIFIER]]"
  • ) />
  •  
  • <!---
  • Check to see if this token ends with a qualifier. If it does,
  • then we have reached the end of the true value.
  • --->
  • <cfif (Right( LOCAL.TempValue, 1 ) EQ VARIABLES.Instance.Qualifier)>
  •  
  • <!---
  • Add this temp value to the value we are building. Remember
  • to add the delimiter to the last value and to remove the
  • trailing qualifier.
  • --->
  • <cfset LOCAL.Value = (
  • LOCAL.Value &
  • VARIABLES.Instance.Delimiter &
  • LOCAL.TempValue.ReplaceFirst( ".{1}$", "" )
  • ) />
  •  
  • <!---
  • Since we have reached the end of the value we are building,
  • break out of this FOR loop.
  • --->
  • <cfbreak />
  •  
  • <cfelse>
  •  
  • <!---
  • Since we have NOT finished building this value, just add the
  • temp value to the value we are building.
  • --->
  • <cfset LOCAL.Value = (
  • LOCAL.Value &
  • VARIABLES.Instance.Delimiter &
  • LOCAL.TempValue
  • ) />
  •  
  • </cfif>
  •  
  • </cfloop>
  •  
  • </cfif>
  •  
  •  
  • <!--- Replace any escape qualifiers with actual qualifiers. --->
  • <cfset LOCAL.Value = LOCAL.Value.ReplaceAll(
  • "\[\[QUALIFIER\]\]",
  • VARIABLES.Instance.Qualifier
  • ) />
  •  
  • </cfif>
  •  
  •  
  • <!---
  • ASSERT: At this point, whether we built the value across raw tokens
  • or just grabbed a single token, we now have a complete value to return.
  • --->
  •  
  •  
  • <!--- Increment the token index. --->
  • <cfset VARIABLES.Instance.TokenIndex = (VARIABLES.Instance.TokenIndex + 1) />
  •  
  •  
  • <cfelse>
  •  
  •  
  • <!---
  • Since we don't have a qualifier, just return the next raw token
  • as we don't have to worry about building values.
  • --->
  •  
  • <!--- Increment raw token index. --->
  • <cfset VARIABLES.Instance.RawTokenIndex = (VARIABLES.Instance.RawTokenIndex + 1) />
  •  
  • <!---
  • Set the token index equal to the raw token index as they should
  • both be the same value when a qualifier is not used.
  • --->
  • <cfset VARIABLES.Instance.TokenIndex = VARIABLES.Instance.RawTokenIndex />
  •  
  • <!--- Set the value to the current raw token. --->
  • <cfset LOCAL.Value = VARIABLES.Instance.RawTokens[ VARIABLES.Instance.RawTokenIndex ] />
  •  
  • <!--- Remove the leading white space from the raw token. --->
  • <cfset LOCAL.Value = LOCAL.Value.ReplaceFirst( "^.{1}", "" ) />
  •  
  •  
  • </cfif>
  •  
  •  
  • <!--- Return the value. --->
  • <cfreturn LOCAL.Value />
  •  
  • </cffunction>
  •  
  •  
  • <cffunction
  • name="NextToken"
  • access="public"
  • returntype="string"
  • output="false"
  • hint="Returns the next element (this just wraps around NextElement()).">
  •  
  • <cfreturn THIS.NextElement() />
  • </cffunction>
  •  
  • </cfcomponent>
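
The space-padding step in Init() is easier to see in isolation. The following is a minimal sketch, assuming a hard-coded comma delimiter rather than the interpolated pattern the component builds; the names strList, arrNaive, and arrPadded are hypothetical. It relies on the fact that Java's String.split() discards trailing empty strings, which would otherwise make trailing empty fields disappear:

```cfml
<!--- Sample list with one empty field in the middle and two at the end. --->
<cfset strList = "a,,b,," />

<!--- Naive split: the trailing empty strings are dropped by Java. --->
<cfset arrNaive = ToString( strList ).Split( "," ) />

<!---
	Padded split, mirroring Init(): lead with a space and add a space
	after every delimiter so that no raw token is ever empty.
--->
<cfset arrPadded = ToString(
	" " &
	strList
	).ReplaceAll(
		"(,)",
		"$1 "
		).Split( "," )
	/>

<cfoutput>
	Naive: #ArrayLen( arrNaive )# raw tokens<br />
	Padded: #ArrayLen( arrPadded )# raw tokens<br />
</cfoutput>
```

The naive split yields 3 raw tokens while the padded split yields all 5. This is why NextElement() strips one leading character from every raw token before using it.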

Now, to use the String Tokenizer, all we need to do is pass it some sort of delimited value and then iterate over it:

  • <!---
  • First we have to set up the delimited value that we are
  • going to pass in. In this case, I am going to use the
  • comma as the delimiter and the quote as the field
  • qualifier. Notice that the third value has embedded
  • delimiters and qualifiers.
  • --->
  • <cfsavecontent variable="strCSV">
  • a,b,"cat,kitten,""mog"",puppy",d,e,"""",f
  • </cfsavecontent>
  •  
  •  
  • <!---
  • Now, let's create the ColdFusion String Tokenizer. We are
  • going to pass in only the CSV value. We do not have to
  • pass in the field delimiter or the qualifier as they
  • default to comma and quote respectively.
  • --->
  • <cfset objTokenizer = CreateObject(
  • "component",
  • "StringTokenizer"
  • ).Init(
  • String = strCSV.Trim()
  • ) />
  •  
  •  
  • <!---
  • Now that we have the String Tokenizer, we can loop over it
  • until it has no more elements / tokens to return. Here I am
  • demonstrating the "Elements" method call, but there is also
  • a short-hand HasMoreTokens() method call that does the same
  • thing. I only use Elements here because I feel it is more
  • common to the Iterator interface.
  • --->
  • <cfloop condition="objTokenizer.HasMoreElements()">
  •  
  • <!---
  • Get the next token and output it. We are using the
  • brackets to help clarify where certain values are blank.
  • --->
  • [#objTokenizer.NextElement()#]<br />
  •  
  • </cfloop>
  •  
  •  
  • <!---
  • Now, in order to demonstrate that this Tokenizer can be
  • re-used without calling CreateObject(), we are going to just
  • re-Init() it and pass in new values... well actually, the
  • same CSV, but this time, the field qualifier is being sent
  • in as the empty string.
  • --->
  • <cfset objTokenizer.Init(
  • String = strCSV.Trim(),
  • Qualifier = ""
  • ) />
  •  
  •  
  • <!--- Now, as we did before, loop over the tokens. --->
  • <cfloop condition="objTokenizer.HasMoreElements()">
  •  
  • [#objTokenizer.NextElement()#]<br />
  •  
  • </cfloop>

The above code gives us:

[a]
[b]
[cat,kitten,"mog",puppy]
[d]
[e]
["]
[f]

... on the first CFLoop. Then, on the second CFLoop, where we have no field qualifier, notice that the quotes in the original string are treated as literal characters (and that the doubled quotes [""] are not unescaped):

[a]
[b]
["cat]
[kitten]
[""mog""]
[puppy"]
[d]
[e]
[""""]
[f]
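
Since Init() accepts both a Delimiter and a Qualifier argument, the same component should also handle other single-character formats. Here is a minimal sketch, assuming StringTokenizer.cfc can be resolved from the calling template; the pipe-delimited sample string and the variable name objPipeTokenizer are made up for illustration:

```cfml
<!---
	Pipe-delimited data that uses the single quote as the field qualifier.
	The second field has an embedded delimiter and the third field has
	an escaped (doubled) qualifier.
--->
<cfset strPipeData = "x|'y|z'|'it''s'" />

<cfset objPipeTokenizer = CreateObject(
	"component",
	"StringTokenizer"
	).Init(
		String = strPipeData,
		Delimiter = "|",
		Qualifier = "'"
		) />

<cfloop condition="objPipeTokenizer.HasMoreElements()">

	<cfoutput>[#objPipeTokenizer.NextElement()#]<br /></cfoutput>

</cfloop>
```

Tracing this through NextElement(), the output should be [x], [y|z], and [it's].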

Now, keep in mind that this only handles tokens in a string with a single delimiter. Line breaks can also be escaped in a CSV field value AND act as one of the delimiters in a CSV file... but I am not quite there yet.
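
Until line breaks are handled, one workaround is to split the data into rows first and re-Init() a single tokenizer for each row. This is only a sketch and it assumes that no field value contains a line break; the sample data and the names strData, arrRows, and arrFields are hypothetical:

```cfml
<!--- Two-row sample CSV with no line breaks inside any field. --->
<cfset strData = "a,""b,c"",d" & Chr( 10 ) & "e,f,""g""""h""" />

<!--- Create the tokenizer once; it gets re-initialized for every row. --->
<cfset objTokenizer = CreateObject( "component", "StringTokenizer" ) />

<!--- This native ColdFusion array will hold one array of fields per row. --->
<cfset arrRows = ArrayNew( 1 ) />

<!--- Loop over the rows (note that CFLoop skips any empty rows). --->
<cfloop
	index="strRow"
	list="#strData#"
	delimiters="#Chr( 13 )##Chr( 10 )#">

	<!--- Re-use the same tokenizer instance for this row. --->
	<cfset objTokenizer.Init( String = strRow ) />

	<!--- Collect the tokens of this row into a ColdFusion array. --->
	<cfset arrFields = ArrayNew( 1 ) />

	<cfloop condition="objTokenizer.HasMoreElements()">
		<cfset ArrayAppend( arrFields, objTokenizer.NextElement() ) />
	</cfloop>

	<cfset ArrayAppend( arrRows, arrFields ) />

</cfloop>

<cfdump var="#arrRows#" />
```

Collecting the tokens with ArrayAppend() also gives back true ColdFusion arrays, which can be altered later, unlike the array returned by Split().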




Reader Comments

Ben,

If there's anything that I can count on with your blog, it's that you usually confuse the hell out of me just with your post titles! This is not a criticism, but rather a compliment. Your posts are often way over my head, but I know if I ever need to do some of the weird stuff you get into, I'll have an excellent resource here to dig through. :)

Jacob,

Thanks for the compliment ;) This is something I normally wouldn't really work on (as ColdFusion has a ton of outstanding list functionality)... but I want to build a bigger and better Excel component and some more complicated list parsing is required. Hopefully that should come soon.

Thanks for checking in!

Oops... accidentally posted my comment to "Ask Ben." Sorry about that!

Anyway, I just said great going and I'll probably be using this sometime soon to clean up my workarounds from the past.

Or something to that effect =)

Sammny,

Always glad to help :) Let me know if you find any bugs in it or any ways that it can be improved.

Thanks!

Wow, this is exactly what I was trying to figure out how to do in ColdFusion. Great work on the code -- also, great comments. Thanks for the post.
