Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.
I am the chief technical officer at InVision App, Inc - a prototyping and collaboration platform for designers, built by designers. I also rock out in JavaScript and ColdFusion 24x7.
Meanwhile on Twitter
Loading latest tweet...
Ben Nadel at cf.Objective() 2013 (Bloomington, MN) with:

Using ColdFusion Structures To Remove Duplicate List Values

Posted by Ben Nadel
Tags: ColdFusion

The other week, I posted a solution for Ray Camden's Friday Puzzler for Santa's list. It involved using the ColdFusion structure (struct) as a utility to clean lists. I thought it worked quite nicely and thought that I would talk about it here quickly. The beauty of the ColdFusion struct is that it is an object that maps keys to values. This mapping feature forces the struct to only have unique keys as there would be no way to map two duplicate keys to two different values. This mapping can be leveraged to clean the list of duplicate values.

To demonstrate this, let's build a list that has duplicate values:

  • <!--- Create a group values that have duplicate values. --->
  • <cfsavecontent variable="strListData">
  • Ghost Busters
  • What About Bob
  • Broken Flowers
  • What About Bob
  • Lost In Translation
  • Stripes
  • Scrooge
  • What About Bob
  • Lost In Translation
  • </cfsavecontent>
  •  
  •  
  • <!--- Create a list based on the list data. --->
  • <cfset lstMovies = strListData.Trim().ReplaceAll(
  • "[\r\n\t]+",
  • ","
  • ) />

This is a bit of a round-about way to create a list, but I am doing this for readability (so you can really see the list). If we output the list above:

  • <!--- Break list on to different lines. --->
  • #lstMovies.ReplaceAll( ",", "<br />" )#

... we get:

Ghost Busters
What About Bob
Broken Flowers
What About Bob
Lost In Translation
Stripes
Scrooge
What About Bob
Lost In Translation

Now, we can loop over the list and add them to the ColdFusion struct. Remember, due to the mapping nature of the struct, duplicate keys will override each other, thereby removing duplicates:

  • <!--- Create a structure to hold the movie titles. --->
  • <cfset objMovies = StructNew() />
  •  
  • <!--- Loop over the list to add mappings to the struct. --->
  • <cfloop index="strMovie" list="#lstMovies#" delimiters=",">
  •  
  • <!---
  • Store key/value pair. In this case, we don't
  • really care about the value (hence storing ""
  • as value). We only care about the key.
  • --->
  • <cfset objMovies[ strMovie ] = "" />
  •  
  • </cfloop>
  •  
  • <!--- Convert the now unique key values into a list. --->
  • <cfset lstMovies = StructKeyList( objMovies ) />

Now, outputting that list:

  • <!--- Break list on to different lines. --->
  • #lstMovies.ReplaceAll( ",", "<br />" )#

... we get:

Lost In Translation
What About Bob
Stripes
Scrooge
Ghost Busters
Broken Flowers

Notice that this time, the duplicates have been removed. Notice also that the order of the list has been changed. That usually won't matter as I don't think you should require proper ordering in a list that is required to remove duplicate values.

I am not sure how this compares (performance-wise) to the loop/check strategy that constantly checks for duplicates. I almost want to say that this would be faster as there really is no logic involved. There's no CFIF statements, no conditionals; all we do is set values and then join them together. It's that easy.

One final note: this is not case sensitive. The structure, like most all ColdFusion, treats mixed cases as the same value. If you want to do this type of algorithm with case sensitivity, you can create a Java 2 Hashtable and use it in similar way. I won't go into that though, as that is not something any of us really need to do (or otherwise, we probably shouldn't be using lists perhaps?).




Reader Comments

Very Clean!

There are other ways to remove duplicates in a list, but none as elegant as this.

Good Show!

Dan Wilson

Reply to this Comment

Thanks! I've been looking for a way to speed up the creation of a unique 5 character id list I create for emails tracking in one of our products. The best way I could find to keep the list unique was just using string contains on a list but your struct alternative is MUCH faster. I'm still checking my code but here are some figures:

2,000 IDs
- List/String = 1797ms
- Struct = 171ms

10,000 IDs
- List/String = 41750ms
- Struct = 937ms

Reply to this Comment

Daniel,

That's awesome. I had a gut feeling tell me it would be faster, but it's nice to see some numbers on it.

Reply to this Comment

100,000 IDs
- List/String (not going to do it now, tried in the past and its like 20 minutes)
- Struct = 50266ms (WOW!)

I have it flushing counts every 100 ID attempts and it pauses around 36800 and 74900 every time so there is some type of conversion going on around those sizes. I noticed similar pauses occurring when using strings.

The algorithm is close to linear instead of growing in time based on the size of the list. Don't remember all that Big O notation stuff to explain that further.

Reply to this Comment

I'm no regexp ninja but would you not be better just using regular expressions?

nothing at hand at the moment to show for code, but thought i'd chuck that in the ring

Reply to this Comment

DC,

Regular expressions are totally awesome, but they require a lot of parsing and processing. And, the larger the string, the longer it takes in what seems a non-linear fashion... that's not from official testing, that's just what I have noticed.

That, and I wouldn't even know how to start writing that expression.

Reply to this Comment

Pete,

Glad you like. One thing to note as well is that you can do StructKeyArray() as well if you need to clean an array rather than a list. This will just return the struct-keys into a one-dimensional array.

Reply to this Comment

The regular expression idea interests me. I'm curious about the efficiency of checking for duplicates across a list in a single regular expression search. Not as efficient as structs for deduping but if you have to find dupes it may be faster than looping for every item in list.

First you would need to sort the list, then run a regular expression such as the following "(,|$)([^,]+),\2(,|^)".

Just remembered though that REFind only returns the first match so that would require some looping to find the next match. hmmm

Reply to this Comment

@Dan,

Regular expression are totally awesome... but you want to avoid having to parse a string too many times. I would guess that any kind of looping in conjunction with regex is gonna be slower than using arrays and structs.

Reply to this Comment

I'm not actually advocating a loop through each piece of data, rather an REFind on an entire list string using the delimiter in regular expression to detect dupes. The benefit with a regular expression in this instance over an array loop is that you are only looping as many times as there are matches and you are able to start where you left off.

For example with the regular expression, if there are no duplicates you would do a single REFind which would find no duplicates and you would be done. If it found one you would start another REFind from where you found the first dupe to look for the next.

A loop over each item in the list or array on the other hand would require an action on ever item in the list to compare to every other item in the list. That quickly looses efficiency because processing increases exponentially, which was the need for the struct work around of course.

Anyways, something to think about and I might play with when I have time.

Reply to this Comment

@Dan,

I this is interesting. I would certainly love to see what you come up with if you have time play around with the idea. I am always looking to learn new techniques.

Reply to this Comment

Though the RE method may help us find duplicates be it tell us they exist or tell us which are the duplicates the original purpose of this post was to REMOVE duplicates not to find them.

For it's purpose this works well enough in comparison to the alternatives out there which serve ONLY this purpose and no extra.

As a thought should you find your data in a Query object (say you pulled a list from you database and you only want the unique entries for a particular column where the db does not have this constraint imposed) a QofQ may work and scale far better than this method. Where the method here is linear (O(n)) for those that know the notation it is possible the QoQ method may be < O(n) for larger datasets.

Reply to this Comment

How about a way to count the occurrences of each value in the list, before or during the conversion to a struct key list?

Reply to this Comment

This is a great approach! Thanks for laying out the logic so easily and cleanly. I remember relying on a UDF from cflib.org that also removed duplicates. This one would loop through each list item and add it to a second list only if it hadn't been seen before. As the second list gets larger then searching gets slower. The method shown here would definitely be faster as others have already pointed out. Shoot, Adobe might as well take the code here and make a built-in function out of it. I can dream, can't I?

Reply to this Comment

I wrote this to count the occurrences of each item when looping a populated list with redundancies into a struct. Worked pretty well. Thanks for the original idea!! :-)

<cfloop index="tag" list="#tagList#" delimiters=",">
<cfset tag1 = trim(#tag#)>
<cfif tag1 neq "">
<cfif structKeyExists(tagStruct,"#tag1#")>
<cfset tagStruct[tag1] = #tagStruct[tag1]# + 1>
<cfelse>
<cfset tagStruct[tag1] = "1" />
</cfif>
</cfif>
</cfloop>

Reply to this Comment

@Rich,

Looks good - always glad to provide any inspiration. As a quick note, you don't need to use the # sign so much. It is only needed when a variable is being evaluated *inside* a string.

Reply to this Comment

This is something that is making me pull my hair out, and still is.

I would like to order the results based on the number of occurrences, if a result or particular ID shows up more than once it would have a higher position in the array or list. Then the ones that have less would follow behind.

This is really driving me bonkers.

Reply to this Comment

@Jody

To the best of my knowledge, structures don't really have a notion of ordering, so you would have to convert the structure to some other data type such as 2D array or a query.

If you wanted to go the query route, you first convert your structure to a query and then run a query of queries to run your sort.

<!--- convert struct to query --->
<cfscript>
struct = { one = 1, two = 2, three = 3, four = 4, five = 5 };
query = queryNew( "key,num" );
for ( key in struct ){
queryAddRow( query );
querySetCell( query, "key", key );
querySetCell( query, "num", struct[key] );
}
</cfscript>

<!--- query of queries --->
<cfquery name="qSort" dbtype="query">
SELECT *
FROM query
ORDER BY num desc
</cfquery>

<cfdump var="#struct#" />
<cfdump var="#query#" />
<cfdump var="#qSort#" />

I got my inspiration for this workaround from the following tutorials.

http://www.coldfusioncookbook.com/entry/23/How-do-you-loop-over-the-values-in-a-structure?

http://www.coldfusioncookbook.com/entry/87/How-do-I-sort-a-2-dimensional-array?

If you wanted to go the 2D array route, you could use the code from the latter of the two links above.

Reply to this Comment

I don't think that's what Jody was exactly looking for but at least it's close.

If there's an array with duplicates I believe Jody wants them ordered from most to least? Obviously the easiest way is likely:

1. Create the 'unique key' structure defined above with value = 0
2. Loop the array incrementing a counter each time a value is seen
3. Iterate the new structure creating again ANOTHER structure where the keys are the numerical counters stored in struct1. When you have 2 keys with the same counter make the value a list or an array of the appropriate text keys.

You can of course use a QofQ method:

<cfscript>
array = ListToArray( "one,two,three,one,one,three,two,four,two,two,four,four,four" );
counter = ArrayNew(1);
ArraySet( counter, 1, ArrayLen( array ), 1 );
query = QueryNew( '' );
QueryAddColumn( query, 'key', array );
QueryAddColumn( query, 'counter', counter );
</cfscript>

<cfquery name="sorted" dbtype="query">
SELECT key, sum(counter) AS total FROM query
GROUP BY key ORDER BY total DESC
</cfquery>

You now have a query of with the keys and the number of occurrences for each key. Do with this as you like.

Reply to this Comment

@Matt

My solution was based off an earlier comment where Rich showed some code on how to create the unique key struct and maintain a count of how many times each duplicate was seen. Either way, looks like there's a couple of ways to get the same result.

Reply to this Comment

Thanks Guys!

After hundreds of hairs being pulled out I finally figured it out.

I just have a few more things to do and my project will be 95.5 percent done.

Once again thanks for all the help here was my solution...

<!--- CONVERT THE SCRUB URL TO A LIST --->
<cfset convertResults_toList = ArrayToList(results_scrub_url) >

<!--- CREATE THE LIST FOR THE RESULTS || THIS LIST CONTAINS ONLY THE POSITION OF
THE ARRAY SO WE CAN THEN LOOP OVER THE LIST FOR APPRIPORATE RESULTS --->
<cfset finalResults_List = "">

<!--- NOW THAT THE SCRUBED URLS ARE CONVERTED TO A LIST WE CAN NOW DO A LOOP OVER THEM AND NOW COMPARE THEM --->
<!--- THE LOOP WILL PREPEND ( ADD TO THE END OF ) THE LIST WHICH THERITICALLY MEANS
THAT IT WILL ADD THE FIRST ONE TO THE START ETC... --->

<cfloop from="1" to="#arraylen(results_scrub_url)#" index="curRes">

<!--- THIS LOOP ALLOWS US TO LOOP OVER THE REMAINING RESULTS TO SEE
IF WE HAVE A MATCH IF WE HAVE A MATCH WE ARE GOING TO PREPEND THE RESULT TO THE LIST --->
<cfloop from="1" to="#arraylen(results_scrub_url)#" index="checkRes">

<!--- CHECKING CURRES (CURRENT RESULT) VS THE CHECKRES --->
<cfif results_scrub_url[curRes] eq results_scrub_url[checkRes] AND NOT curRes eq checkRes >

<cfif NOT listContains(finalResults_List, #curRes#) >
<!--- APPEND WHAT WE FOUND TO THE LIST --->
<cfset finalResults_List = ListAppend(finalResults_List, "#curRes#") >
</cfif>

</cfif>

</cfloop>

</cfloop>

Reply to this Comment

Well Never Mind!

-- Ugh, I thought I finished it but it turns out its still kicking duplicates. It's going to be a simple fix. I will let you guys know.

Reply to this Comment

@Jody

I can't seem to get your code sample to work. If you are still having problems, try this code out and see if it gets you what you wanted.

<!--- Comma delimited list with various duplicates --->
<cfset listWithDups = "second,second,second,third,third,first,first,first,first,fourth" />

<!--- Convert list to 1 column query var --->
<cfset query = queryNew("val") />
<cfloop list="#listWithDups#" index="i">
<cfset queryAddRow(query) />
<cfset querySetCell(query, "val", i) />
</cfloop>

<!--- Remove duplicates and sort results using query of query and "group by" --->
<cfquery name="sortedQuery" dbtype="query">
SELECT val, count(*) as numOcurrences
FROM query
GROUP BY val
ORDER BY numOcurrences desc, val asc
</cfquery>

<!--- Display sorted query --->
<cfdump var="#sortedQuery#" />

<!--- Convert sorted values to list --->
<cfset finalList = valueList(sortedQuery.val) />

<!--- Display final list --->
<cfdump var="#finalList#" />

Reply to this Comment

Hello Ben.

No good :(
Gave me a "500 server internal error" on CF8 when playing this trick with list consisting of 1 element [37077015].
In all other cases it worked like a charm.
Think I'll have to try Java method instead :(

Reply to this Comment

@Dmitry,

Hmm, I am not sure why this would error out with only one list item. I can't see anything that would cause this require two or more list items. What was the exception that is actually raised?

@Paul,

Using the hash set in Java seems like a really cool approach. I believe that under the hood, ColdFusion's struct is actually extending a Map class (FastHashMap if I remember). Looks like the Set class tree is the more appropriate approach. Thanks for the link!

Reply to this Comment

I believe the reason for the exception was not the list length but it's items. 37077015 is a big numeric value. Making a structure key from it might cause some problems.
Exception was 500 internal server error and then many encoded characters like "‹ ÅZ{sâ8 ÿ;SµßAÇ" :(

Reply to this Comment

Here I am again, 4 years after you created this blog entry and it just helped me today! Thanks for all of your work for the Coldfusion community.

Reply to this Comment

Ben, I used your method of removing duplicates again. I did it using CFSCRIPT this time. I figured I'd share it.

SecurityStruct = structNew();
for (i = 1; i lte listLen(SecurityList); i++) {
SecurityStruct[listGetAt(SecurityList, i)] = "";
}
SecurityList = StructKeyList(SecurityStruct);

Reply to this Comment

Ben

This is just awesome! Great thinking outside of the box. Its powerful stuff like this that kicks the poo out of .Net

Thanks again!

Reply to this Comment

Late to the party but figured I would add my two cents anyway.

How about a way to count the occurrences of each value in the list, before or during the conversion to a struct key list?

I would imagine that one would use listValueCount() (or listValueCountNoCase()), e.g.:

  • <cfset objMovies[ strMovie ] = listValueCount(lstMovies, strMovie) />

Not having actually tested it I'm not sure what it would add to the processing time.

Reply to this Comment

Hi Ben. I ran across this post trying to find a solution to my problem of cleaning duplicates from a 2D array. I can't convert the array to a list because it is not 1D. Do you have any suggestions on how to do this for 2D arrays?
Thanks.

Reply to this Comment

@Nina,

That's certainly an interesting one to solve. Does your 2D array have a consistent width for each row? What happens to the duplicate values? Do you set the cells to null?

Reply to this Comment

@Jose,

No, the widths are not consistent for each row. Right now I am not doing anything with the duplicate values, they're just sitting there being duplicates (lol). I don't know how to get rid of them in the array is the reason.
The array is getting duplicate values because I have 2 different queries that are populating the array, each of which may be returning the same value, thus the duplicate.
I think I am just going to have to figure out a way to combine the queries, or maybe a way to make it a 1D array, get rid of dups, then go get the other data I need using the array value.

Reply to this Comment

@Nina,

I'm wondering if something like this would work. What do you think?

If this is too inefficient maybe there's a way of doing a UNION on the queries and a GROUP BY on top of that UNION.

  • <cfscript>
  • var duplicateStruct = {};
  • var key = "";
  •  
  • // Loop through outer array
  • for (rowNum = 1; rowNum <= arrayLen(twoDArray); rowNum++) {
  • // Loop through inner array
  • for (colNum = 1; colNum <= arrayLen(twoDArray[rowNum]); colNum++) {
  • key = twoDArray[rowNum][colNum];
  •  
  • if (!structKeyExists(duplicateStruct, key)) {
  • // First time seeing this key
  • duplicateStruct[key] = 1;
  • } else {
  • // Keep count of how many times duplicate appears
  • duplicateStruct[key]++;
  •  
  • // Set array cell to null/undefined
  • twoDArray[rowNum][colNum] = javaCast("null", 0);
  • }
  • }
  • }
  • </cfscript>

Reply to this Comment

For those who end up here using Google and scroll all the way down, if you are using CF10, you can now use listRemoveDuplicates() to get a list without duplicates.

Reply to this Comment

Post A Comment

?
You — Get Out Of My Dreams, Get Into My Comments
Live in the Now
Oops!
Comment Etiquette: Please do not post spam. Please keep the comments on-topic. Please do not post unrelated questions or large chunks of code. And, above all, please be nice to each other - we're trying to have a good conversation here.