Ben Nadel at CFUNITED 2010 (Landsdown, VA) with: Katie Maher

String Tokenizer ColdFusion Component That Can Handle Qualified Fields


I know this has been done out there already, but in order to fully understand the way something works, I like to build it for myself (or at least attempt to) from the ground up. In creating the next version of my POI Utility for reading and writing Excel files using ColdFusion, I want to add CSV parsing abilities. The problem with CSV is that the field values can become tricky when the fields are qualified and have embedded delimiters and qualifiers.

To help understand this part of the CSV parsing algorithm, I have created a ColdFusion component called StringTokenizer. This is modeled after the Java StringTokenizer and, in fact, has the same interface. The difference with mine is that the StringTokenizer can take a field Qualifier in the Init() method. Also, this StringTokenizer can be re-initialized without calling CreateObject() again. Therefore, it can be re-used with minimal overhead.
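For comparison, here is a minimal sketch of what Java's own java.util.StringTokenizer does with a qualified field when it is called from ColdFusion. It has no notion of a field qualifier, so the quoted value gets split on its embedded comma. The variable name and sample value are mine, for illustration only:

```cfml
<!--- The raw value being tokenized is: a,"b,c",d --->
<cfset objJavaTokenizer = CreateObject( "java", "java.util.StringTokenizer" ).Init(
	JavaCast( "string", "a,""b,c"",d" ),
	JavaCast( "string", "," )
	) />

<cfoutput>
	<!--- Java knows nothing about qualifiers, so it splits on every comma. --->
	<cfloop condition="objJavaTokenizer.hasMoreTokens()">
		[#objJavaTokenizer.nextToken()#]<br />
	</cfloop>
</cfoutput>
```

I would expect this to output four tokens, with the qualified field broken in two: [a], ["b], [c"], [d]. That is the gap the Qualifier argument is meant to close.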

So, in case you have no idea what a String Tokenizer is, it's an object that will iterate over the tokens in a given string. Here is my ColdFusion component implementation of the String Tokenizer:

<cfcomponent
	displayname="StringTokenizer"
	output="false"
	hint="Iterates over the tokens of a given string based on string delimiters and token qualifiers.">

	<!--- Run the pseudo constructor. --->
	<cfscript>

		// Set up an instance structure to hold instance data.
		VARIABLES.Instance = StructNew();

		// This will hold the original string passed in by the user.
		VARIABLES.Instance.OriginalString = "";

		// Set the default delimiter and qualifiers.
		VARIABLES.Instance.Delimiter = ",";
		VARIABLES.Instance.Qualifier = "";

		// This will hold the index of the previously returned token.
		VARIABLES.Instance.TokenIndex = 0;

		// This will hold the data for the raw tokens. These are related to
		// the tokens returned, but not exactly the same thing.
		VARIABLES.Instance.RawTokens = "";

		// This will keep track of where we are in the raw tokens.
		VARIABLES.Instance.RawTokenIndex = 0;

	</cfscript>


	<cffunction
		name="Init"
		access="public"
		returntype="any"
		output="false"
		hint="Returns an initialized String Tokenizer instance.">

		<!--- Define arguments. --->
		<cfargument
			name="String"
			type="string"
			required="true"
			hint="This is the string that will be broken up into tokens."
			/>

		<cfargument
			name="Delimiter"
			type="string"
			required="false"
			default=","
			hint="This is the delimiter that will separate the tokens."
			/>

		<cfargument
			name="Qualifier"
			type="string"
			required="false"
			default=""""
			hint="This is the qualifier that will wrap around fields that have special characters embedded."
			/>

		<!---
			When storing the delimiter, we only want to accept the first character
			returned. This is different than standard ColdFusion, but I am trying
			to make this as easy as possible.
		--->
		<cfset VARIABLES.Instance.Delimiter = Left( ARGUMENTS.Delimiter, 1 ) />

		<!---
			When storing the qualifier, we only want to accept the first character
			returned. It is possible that there is no qualifier being used. In that
			case, we can just store the empty string.
		--->
		<cfif Len( ARGUMENTS.Qualifier )>

			<cfset VARIABLES.Instance.Qualifier = Left( ARGUMENTS.Qualifier, 1 ) />

		<cfelse>

			<cfset VARIABLES.Instance.Qualifier = "" />

		</cfif>

		<!--- Store the original string. --->
		<cfset VARIABLES.Instance.OriginalString = ARGUMENTS.String />

		<!---
			Break the original string up into raw tokens. Going forward, some of
			these tokens may be merged, but doing it this way will help us
			iterate over them. When splitting the string, add a space to each
			token first to ensure that the split works properly.

			BE CAREFUL! Splitting a string into an array using the Split
			notation does not create a COLDFUSION ARRAY. You cannot alter this
			array once it has been created. It can merely be referenced.
		--->
		<cfset VARIABLES.Instance.RawTokens = ToString(
			" " &
			ARGUMENTS.String
			).ReplaceAll(
				"([\#VARIABLES.Instance.Delimiter#]{1})",
				"$1 "
				).Split( "[\#VARIABLES.Instance.Delimiter#]{1}" )
			/>


		<!--- Set the default indexes. --->
		<cfset VARIABLES.Instance.TokenIndex = 0 />
		<cfset VARIABLES.Instance.RawTokenIndex = 0 />


		<!--- Return This reference. --->
		<cfreturn THIS />

	</cffunction>


	<cffunction
		name="CountTokens"
		access="public"
		returntype="numeric"
		output="false"
		hint="Returns the number of tokens over which the tokenizer has iterated.">

		<!---
			Return the number of tokens that we have returned. This should be
			equal to the token index (seeing as this value is incremented for
			each call to NextElement()).
		--->
		<cfreturn VARIABLES.Instance.TokenIndex />
	</cffunction>


	<cffunction
		name="HasMoreElements"
		access="public"
		returntype="boolean"
		output="false"
		hint="Checks to see if there are more elements to be returned.">

		<!---
			We know that we have more elements if the current raw token index
			is still less than the number of raw tokens we have.
		--->
		<cfreturn (VARIABLES.Instance.RawTokenIndex LT ArrayLen( VARIABLES.Instance.RawTokens )) />
	</cffunction>


	<cffunction
		name="HasMoreTokens"
		access="public"
		returntype="boolean"
		output="false"
		hint="Checks to see if there are more elements to be returned (this just wraps around HasMoreElements()).">

		<cfreturn THIS.HasMoreElements() />
	</cffunction>


	<cffunction
		name="NextElement"
		access="public"
		returntype="string"
		output="false"
		hint="Returns the next element.">

		<!--- Define the local scope. --->
		<cfset var LOCAL = StructNew() />

		<!--- Set the default value for the returned token. --->
		<cfset LOCAL.Value = "" />

		<!---
			Set the default flag for whether or not we are in the middle
			of building a value across raw tokens.
		--->
		<cfset LOCAL.IsInValue = false />


		<!---
			Check to see if we have a field qualifier. If we do, then we might
			have to build the value across multiple fields. If we do not, then
			the raw tokens should line up perfectly with the real tokens.
		--->
		<cfif Len( VARIABLES.Instance.Qualifier )>


			<!---
				Since we are using a field qualifier, we might have to build a value
				across several raw tokens. Remember, for this, all fields containing
				embedded qualifiers and/or delimiters MUST be in qualified field values.
			--->

			<!--- Increment raw token index. --->
			<cfset VARIABLES.Instance.RawTokenIndex = (VARIABLES.Instance.RawTokenIndex + 1) />

			<!--- Set the value to the current raw token. --->
			<cfset LOCAL.Value = VARIABLES.Instance.RawTokens[ VARIABLES.Instance.RawTokenIndex ] />

			<!--- Remove the leading white space from the raw token. --->
			<cfset LOCAL.Value = LOCAL.Value.ReplaceFirst( "^.{1}", "" ) />


			<!--- Now, we have to check to see what kind of token we are dealing with. --->
			<cfif (LOCAL.Value EQ (VARIABLES.Instance.Qualifier & VARIABLES.Instance.Qualifier))>

				<!---
					This field is just a fully qualified empty field. Set the
					current value to be empty.
				--->
				<cfset LOCAL.Value = "" />


			<!---
				Check to see if we are dealing with a qualified field. If we are,
				then we MIGHT have to build the value across tokens.
			--->
			<cfelseif (Left( LOCAL.Value, 1 ) EQ VARIABLES.Instance.Qualifier)>

				<!--- Strip out the first qualifier. --->
				<cfset LOCAL.Value = LOCAL.Value.ReplaceFirst( "^.{1}", "" ) />

				<!---
					Replace any escaped qualifiers (double-instance) with text
					that cannot be confused.
				--->
				<cfset LOCAL.Value = LOCAL.Value.ReplaceAll(
					"\#VARIABLES.Instance.Qualifier#{2}",
					"[[QUALIFIER]]"
					) />

				<!---
					Now, check to see if this value ends with a quote. If it does,
					then we know that we are dealing with a single qualified field.
					If it does NOT, then that is when we have to build across tokens.
				--->
				<cfif (Right( LOCAL.Value, 1 ) EQ VARIABLES.Instance.Qualifier)>

					<!---
						We are dealing with a single field here. Just remove the
						last character of the value.
					--->
					<cfset LOCAL.Value = LOCAL.Value.ReplaceFirst( ".{1}$", "" ) />

				<cfelse>

					<!---
						We have just started a value that is incomplete. Now, we
						must loop over the tokens to find the rest of the value.
					--->
					<cfloop
						index="VARIABLES.Instance.RawTokenIndex"
						from="#(VARIABLES.Instance.RawTokenIndex + 1)#"
						to="#ArrayLen( VARIABLES.Instance.RawTokens )#"
						step="1">

						<!--- Grab the next token value. --->
						<cfset LOCAL.TempValue = VARIABLES.Instance.RawTokens[ VARIABLES.Instance.RawTokenIndex ] />

						<!--- Remove the leading white space from the raw token. --->
						<cfset LOCAL.TempValue = LOCAL.TempValue.ReplaceFirst( "^.{1}", "" ) />

						<!---
							Replace any escaped qualifiers (double-instance) with text
							that cannot be confused.
						--->
						<cfset LOCAL.TempValue = LOCAL.TempValue.ReplaceAll(
							"\#VARIABLES.Instance.Qualifier#{2}",
							"[[QUALIFIER]]"
							) />

						<!---
							Check to see if this token ends with a qualifier. If it does,
							then we have reached the end of the true value.
						--->
						<cfif (Right( LOCAL.TempValue, 1 ) EQ VARIABLES.Instance.Qualifier)>

							<!---
								Add this temp value to the value we are building. Remember
								to add the delimiter to the last value and to remove the
								trailing qualifier.
							--->
							<cfset LOCAL.Value = (
								LOCAL.Value &
								VARIABLES.Instance.Delimiter &
								LOCAL.TempValue.ReplaceFirst( ".{1}$", "" )
								) />

							<!---
								Since we have reached the end of the value we are building,
								break out of this FOR loop.
							--->
							<cfbreak />

						<cfelse>

							<!---
								Since we have NOT finished building this value, just add the
								temp value to the value we are building.
							--->
							<cfset LOCAL.Value = (
								LOCAL.Value &
								VARIABLES.Instance.Delimiter &
								LOCAL.TempValue
								) />

						</cfif>

					</cfloop>

				</cfif>


				<!--- Replace any escape qualifiers with actual qualifiers. --->
				<cfset LOCAL.Value = LOCAL.Value.ReplaceAll(
					"\[\[QUALIFIER\]\]",
					VARIABLES.Instance.Qualifier
					) />

			</cfif>


			<!---
				ASSERT: At this point, whether we built the value across raw tokens
				or just grabbed a single token, we now have a complete value to return.
			--->


			<!--- Increment the token index. --->
			<cfset VARIABLES.Instance.TokenIndex = (VARIABLES.Instance.TokenIndex + 1) />


		<cfelse>


			<!---
				Since we don't have a qualifier, just return the next raw token
				as we don't have to worry about building values.
			--->

			<!--- Increment raw token index. --->
			<cfset VARIABLES.Instance.RawTokenIndex = (VARIABLES.Instance.RawTokenIndex + 1) />

			<!---
				Set the token index equal to the raw token index as they should
				both be the same value when a qualifier is not used.
			--->
			<cfset VARIABLES.Instance.TokenIndex = VARIABLES.Instance.RawTokenIndex />

			<!--- Set the value to the current raw token. --->
			<cfset LOCAL.Value = VARIABLES.Instance.RawTokens[ VARIABLES.Instance.RawTokenIndex ] />

			<!--- Remove the leading white space from the raw token. --->
			<cfset LOCAL.Value = LOCAL.Value.ReplaceFirst( "^.{1}", "" ) />


		</cfif>


		<!--- Return the value. --->
		<cfreturn LOCAL.Value />

	</cffunction>


	<cffunction
		name="NextToken"
		access="public"
		returntype="string"
		output="false"
		hint="Returns the next element (this just wraps around NextElement()).">

		<cfreturn THIS.NextElement() />
	</cffunction>

</cfcomponent>
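
The leading-space padding in Init() deserves a quick illustration. Java's String.split() discards trailing empty strings, so without the padding any empty fields at the end of the list would silently disappear. Here is a small sketch of the difference, using a hypothetical strList value that is not part of the component:

```cfml
<!--- A list with one empty field in the middle and two at the end. --->
<cfset strList = "a,,b,," />

<!--- Plain split: Java drops the trailing empty strings. --->
<cfset arrPlain = ToString( strList ).Split( "," ) />

<!---
	Padded split: prefix every field with a space (as Init() does)
	so that no field is ever truly empty.
--->
<cfset arrPadded = ToString( " " & strList ).ReplaceAll( "(,)", "$1 " ).Split( "," ) />

<cfoutput>
	Plain: #ArrayLen( arrPlain )#<br />
	Padded: #ArrayLen( arrPadded )#<br />
</cfoutput>
```

The plain split should report 3 items and the padded split 5, which is why NextElement() strips one leading character from every raw token before using it.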

Now, to use the String Tokenizer, all we need to do is pass it some sort of delimited value and then iterate over it:

<!---
	First we have to set up the delimited value that we are
	going to pass in. In this case, I am going to use the
	comma as the delimiter and the quote as the field
	qualifier. Notice that the third value has embedded
	delimiters and qualifiers.
--->
<cfsavecontent variable="strCSV">
a,b,"cat,kitten,""mog"",puppy",d,e,"""",f
</cfsavecontent>


<!---
	Now, let's create the ColdFusion String Tokenizer. We are
	going to pass in only the CSV value. We do not have to pass
	in the field delimiter or the qualifier as they default to
	comma and quote respectively.
--->
<cfset objTokenizer = CreateObject(
	"component",
	"StringTokenizer"
	).Init(
		String = strCSV.Trim()
	) />


<!---
	Now that we have the String Tokenizer, we can loop over it
	until it has no more elements / tokens to return. Here I am
	demonstrating the "Elements" method call, but there is also
	a short-hand HasMoreTokens() method call that does the same
	thing. I only use Elements here because I feel it is more
	common to the Iterator interface.
--->
<cfloop condition="objTokenizer.HasMoreElements()">

	<!---
		Get the next token and output it. We are using the
		brackets to help clarify where certain values are blank.
	--->
	[#objTokenizer.NextElement()#]<br />

</cfloop>


<!---
	Now, in order to demonstrate that this Tokenizer can be
	re-used without calling CreateObject(), we are going to just
	re-Init() it and pass in new values... well actually, the
	same CSV, but this time, the field qualifier is being sent
	in as the empty string.
--->
<cfset objTokenizer.Init(
	String = strCSV.Trim(),
	Qualifier = ""
	) />


<!--- Now, as we did before, loop over the tokens. --->
<cfloop condition="objTokenizer.HasMoreElements()">

	[#objTokenizer.NextElement()#]<br />

</cfloop>

The above code gives us:

[a]
[b]
[cat,kitten,"mog",puppy]
[d]
[e]
["]
[f]

... on the first CFLoop. Then, on the second CFLoop, where we have no field qualifier, notice that the quotes in the original string are used as literal characters (and that the double quotes [""] are not unescaped):

[a]
[b]
["cat]
[kitten]
[""mog""]
[puppy"]
[d]
[e]
[""""]
[f]
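
Since Init() takes both a Delimiter and a Qualifier, the same loop should work for other single-character combinations. Here is a sketch that uses a pipe delimiter and a single-quote qualifier, along with the HasMoreTokens() / NextToken() aliases. The sample string is made up for the demo:

```cfml
<!--- Re-use the component with a pipe delimiter and a single-quote qualifier. --->
<cfset objTokenizer = CreateObject(
	"component",
	"StringTokenizer"
	).Init(
		String = "x|'y|z'|''|w",
		Delimiter = "|",
		Qualifier = "'"
	) />

<cfoutput>
	<!--- The second field is qualified and contains an embedded pipe. --->
	<cfloop condition="objTokenizer.HasMoreTokens()">
		[#objTokenizer.NextToken()#]<br />
	</cfloop>

	<!--- CountTokens() reports how many tokens have been returned so far. --->
	Tokens returned: #objTokenizer.CountTokens()#
</cfoutput>
```

Walking through the component by hand, I would expect [x], [y|z], [] and [w], followed by a count of 4. Note that CountTokens() here counts the tokens already returned, whereas Java's countTokens() reports how many tokens remain.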

Now, keep in mind that this only handles tokens in a string with a single delimiter. Line breaks can also be escaped in a CSV field value AND act as one of the delimiters in a CSV file... but I am not quite there yet.


Reader Comments

44 Comments

Ben,

If there's anything that I can count on with your blog, it's that you usually confuse the hell out of me just with your post titles! This is not a criticism, but rather a compliment. Your posts are often way over my head, but I know if I ever need to do some of the weird stuff you get into, I'll have an excellent resource here to dig through. :)

15,880 Comments

Jacob,

Thanks for the compliment ;) This is something I normally wouldn't really work on (as ColdFusion has a ton of outstanding list functionality)... but I want to build a bigger and better Excel component and some more complicated list parsing is required. Hopefully that should come soon.

Thanks for checking in!

79 Comments

Oops... accidentally posted my comment to "Ask Ben." Sorry about that!

Anyway, I just said great going and I'll probably be using this sometime soon to clean up my workarounds from the past.

Or something to that effect =)

1 Comments

Wow, this is exactly what I was trying to figure out how to do in coldfusion. Great work on the code -- also, great comments. Thanks for the post.
