Using ColdFusion Structures To Remove Duplicate List Values
Posted December 13, 2006 at 7:37 AM
The other week, I posted a solution for Ray Camden's Friday Puzzler for Santa's list. It involved using the ColdFusion structure (struct) as a utility to clean lists. I thought it worked quite nicely and thought that I would talk about it here quickly. The beauty of the ColdFusion struct is that it is an object that maps keys to values. This mapping feature forces the struct to only have unique keys as there would be no way to map two duplicate keys to two different values. This mapping can be leveraged to clean the list of duplicate values.
To demonstrate this, let's build a list that has duplicate values:
Launch code in new window » Download code as text file »
- <!--- Create a group values that have duplicate values. --->
- <cfsavecontent variable="strListData">
- Ghost Busters
- What About Bob
- Broken Flowers
- What About Bob
- Lost In Translation
- Stripes
- Scrooge
- What About Bob
- Lost In Translation
- </cfsavecontent>
-
-
- <!--- Create a list based on the list data. --->
- <cfset lstMovies = strListData.Trim().ReplaceAll(
- "[\r\n\t]+",
- ","
- ) />
This is a bit of a round-about way to create a list, but I am doing this for readability (so you can really see the list). If we output the list above:
Launch code in new window » Download code as text file »
- <!--- Break list on to different lines. --->
- #lstMovies.ReplaceAll( ",", "<br />" )#
... we get:
Ghost Busters
What About Bob
Broken Flowers
What About Bob
Lost In Translation
Stripes
Scrooge
What About Bob
Lost In Translation
Now, we can loop over the list and add them to the ColdFusion struct. Remember, due to the mapping nature of the struct, duplicate keys will override each other, thereby removing duplicates:
Launch code in new window » Download code as text file »
- <!--- Create a structure to hold the movie titles. --->
- <cfset objMovies = StructNew() />
-
- <!--- Loop over the list to add mappings to the struct. --->
- <cfloop index="strMovie" list="#lstMovies#" delimiters=",">
-
- <!---
- Store key/value pair. In this case, we don't
- really care about the value (hence storing ""
- as value). We only care about the key.
- --->
- <cfset objMovies[ strMovie ] = "" />
-
- </cfloop>
-
- <!--- Convert the now unique key values into a list. --->
- <cfset lstMovies = StructKeyList( objMovies ) />
Now, outputting that list:
Launch code in new window » Download code as text file »
- <!--- Break list on to different lines. --->
- #lstMovies.ReplaceAll( ",", "<br />" )#
... we get:
Lost In Translation
What About Bob
Stripes
Scrooge
Ghost Busters
Broken Flowers
Notice that this time, the duplicates have been removed. Notice also that the order of the list has been changed. That usually won't matter as I don't think you should require proper ordering in a list that is required to remove duplicate values.
I am not sure how this compares (performance-wise) to the loop/check strategy that constantly checks for duplicates. I almost want to say that this would be faster as there really is no logic involved. There's no CFIF statements, no conditionals; all we do is set values and then join them together. It's that easy.
One final note: this is not case sensitive. The structure, like most all ColdFusion, treats mixed cases as the same value. If you want to do this type of algorithm with case sensitivity, you can create a Java 2 Hashtable and use it in similar way. I won't go into that though, as that is not something any of us really need to do (or otherwise, we probably shouldn't be using lists perhaps?).
Download Code Snippet ZIP File
Post Comment | Ask Ben | Permalink | Other Searches | Print Page
Newer Post
Clark Valberg On Interface Driven Architecture Methodology
Older Post
Import Lesson: Always Codify Change Requests From Clients
Reader Comments
Very Clean!
There are other ways to remove duplicates in a list, but none as elegant as this.
Good Show!
Dan Wilson
Dan,
Glad you like. Always happy to share.
Thanks! I've been looking for a way to speed up the creation of a unique 5 character id list I create for emails tracking in one of our products. The best way I could find to keep the list unique was just using string contains on a list but your struct alternative is MUCH faster. I'm still checking my code but here are some figures:
2,000 IDs
- List/String = 1797ms
- Struct = 171ms
10,000 IDs
- List/String = 41750ms
- Struct = 937ms
Daniel,
That's awesome. I had a gut feeling tell me it would be faster, but it's nice to see some numbers on it.
100,000 IDs
- List/String (not going to do it now, tried in the past and its like 20 minutes)
- Struct = 50266ms (WOW!)
I have it flushing counts every 100 ID attempts and it pauses around 36800 and 74900 every time so there is some type of conversion going on around those sizes. I noticed similar pauses occurring when using strings.
The algorithm is close to linear instead of growing in time based on the size of the list. Don't remember all that Big O notation stuff to explain that further.
I'm no regexp ninja but would you not be better just using regular expressions?
nothing at hand at the moment to show for code, but thought i'd chuck that in the ring
DC,
Regular expressions are totally awesome, but they require a lot of parsing and processing. And, the larger the string, the longer it takes in what seems a non-linear fashion... that's not from official testing, that's just what I have noticed.
That, and I wouldn't even know how to start writing that expression.
Hey Ben,
VERY cool! Very cool indeed!
Pete,
Glad you like. One thing to note as well is that you can do StructKeyArray() as well if you need to clean an array rather than a list. This will just return the struct-keys into a one-dimensional array.
Now how about an easy way to tell *which* elements of the list are duplicated?
The regular expression idea interests me. I'm curious about the efficiency of checking for duplicates across a list in a single regular expression search. Not as efficient as structs for deduping but if you have to find dupes it may be faster than looping for every item in list.
First you would need to sort the list, then run a regular expression such as the following "(,|$)([^,]+),\2(,|^)".
Just remembered though that REFind only returns the first match so that would require some looping to find the next match. hmmm
@Dan,
Regular expression are totally awesome... but you want to avoid having to parse a string too many times. I would guess that any kind of looping in conjunction with regex is gonna be slower than using arrays and structs.
I'm not actually advocating a loop through each piece of data, rather an REFind on an entire list string using the delimiter in regular expression to detect dupes. The benefit with a regular expression in this instance over an array loop is that you are only looping as many times as there are matches and you are able to start where you left off.
For example with the regular expression, if there are no duplicates you would do a single REFind which would find no duplicates and you would be done. If it found one you would start another REFind from where you found the first dupe to look for the next.
A loop over each item in the list or array on the other hand would require an action on ever item in the list to compare to every other item in the list. That quickly looses efficiency because processing increases exponentially, which was the need for the struct work around of course.
Anyways, something to think about and I might play with when I have time.
@Dan,
I this is interesting. I would certainly love to see what you come up with if you have time play around with the idea. I am always looking to learn new techniques.
Though the RE method may help us find duplicates be it tell us they exist or tell us which are the duplicates the original purpose of this post was to REMOVE duplicates not to find them.
For it's purpose this works well enough in comparison to the alternatives out there which serve ONLY this purpose and no extra.
As a thought should you find your data in a Query object (say you pulled a list from you database and you only want the unique entries for a particular column where the db does not have this constraint imposed) a QofQ may work and scale far better than this method. Where the method here is linear (O(n)) for those that know the notation it is possible the QoQ method may be < O(n) for larger datasets.
Absolutley love this, have used the technique a few times!
How about a way to count the occurrences of each value in the list, before or during the conversion to a struct key list?
This is a great approach! Thanks for laying out the logic so easily and cleanly. I remember relying on a UDF from cflib.org that also removed duplicates. This one would loop through each list item and add it to a second list only if it hadn't been seen before. As the second list gets larger then searching gets slower. The method shown here would definitely be faster as others have already pointed out. Shoot, Adobe might as well take the code here and make a built-in function out of it. I can dream, can't I?
I wrote this to count the occurrences of each item when looping a populated list with redundancies into a struct. Worked pretty well. Thanks for the original idea!! :-)
<cfloop index="tag" list="#tagList#" delimiters=",">
<cfset tag1 = trim(#tag#)>
<cfif tag1 neq "">
<cfif structKeyExists(tagStruct,"#tag1#")>
<cfset tagStruct[tag1] = #tagStruct[tag1]# + 1>
<cfelse>
<cfset tagStruct[tag1] = "1" />
</cfif>
</cfif>
</cfloop>
@Rich,
Looks good - always glad to provide any inspiration. As a quick note, you don't need to use the # sign so much. It is only needed when a variable is being evaluated *inside* a string.
Ben,
Great post! Used it on a couple different projects in the past week! Thanks!!




