Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.
I am the chief technical officer at InVision App, Inc - a prototyping and collaboration platform for designers, built by designers. I also rock out in JavaScript and ColdFusion 24x7.
Sorting Really Big Files (Inspired By Sammy Larbi)

By Ben Nadel on
Tags: ColdFusion

Earlier today, Sammy Larbi posted about how he implemented an algorithm in Java for sorting a really BIG file: break it up into multiple small files, sort each one, and then use an "external merge sort" to join those files into a final sorted file. This sounded cool, more for the high-level thinking than anything else, and it seemed like fun to try to duplicate this kind of functionality in ColdFusion.

First off, why would we do this? Simple - memory usage. Sammy's scenario was that you had to read in and sort a massive file that could not be read into the server's memory in its entirety without causing a crash. ColdFusion automatically reads in the complete file data when you use the CFFile tag, so we cannot use that. Instead, we are going to harness the buffered input and output power of Java's I/O library.

As it turns out, there is a really great convenience class in Java for reading in file data one line at a time: java.io.LineNumberReader, which I have experimented with before. As for writing files to disk, if the file is MASSIVE, then we want to be careful about how often we go to the file system (file I/O is a very costly operation). Therefore, we can use one of Java's buffered output writers to manage how efficiently we write the file (as opposed to calling CFFile Action="Append" for each line of data).
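In plain Java, that same read-one-line / write-buffered pairing looks roughly like this (a minimal sketch; the class name and file names are my own, not from the post):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.LineNumberReader;

public class BufferedCopy {

	public static void main( String[] args ) throws IOException {
		// Read one line at a time so the whole file never has to
		// sit in memory at once.
		try (
			LineNumberReader reader = new LineNumberReader(
				new BufferedReader( new FileReader( "unsorted.txt" ) ) );
			// Buffer the writes so we don't hit the file system
			// once per line.
			BufferedWriter writer = new BufferedWriter(
				new FileWriter( "copy.txt" ) ) ) {

			String line;
			while ( (line = reader.readLine()) != null ) {
				writer.write( line );
				writer.newLine();
			}
		}
	}

}
```

The ColdFusion code below builds exactly these object chains via CreateObject(); readLine() returning null is the end-of-file signal in both worlds.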

This is a proof of concept; we are not writing massive files - we are just using 30 rows of data. I have no idea how this scales, but it works on a small level. I was not sure how best to decide how many lines of data should go into each small file, so I set a variable that can be altered to change the chunk size. This variable, together with the size of the original file, determines how many small files will be created.
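The number of sub-files falls out of simple ceiling division. A tiny sketch with the demo's values (30 total rows, 8 rows per chunk file; the method name is my own):

```java
public class ChunkCount {

	// How many sub-sort files a run will produce, given the total
	// row count and the max rows allowed per chunk file.
	static int chunkCount( int totalRows, int rowsPerFile ) {
		// Integer ceiling division.
		return ( (totalRows + rowsPerFile - 1) / rowsPerFile );
	}

	public static void main( String[] args ) {
		// 30 rows at 8 rows per file -> 4 files (8 + 8 + 8 + 6).
		System.out.println( chunkCount( 30, 8 ) );
	}

}
```

That matches the demo below, which ends up with four sub-sorted files, the last one holding only six rows.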

Without any further delay, here is the ColdFusion code that pulls this off:

<!---
	First, we have to create a file that has random values.
	Since this is a proof of concept, we don't need a bagillion
	lines - just enough that we can divide them up into
	multiple files.
--->
<cfset intTotalRows = 30 />

<!---
	Set a value for the MAX number of lines that we are going
	to use in each individual sub-sort file.
--->
<cfset intFileRowSize = 8 />


<!---
	Create an array to hold the values that we are going to
	use to create each row value. This array will be shuffled
	and converted to a string to create randomly ordered row
	values.
--->
<cfset arrChars = ListToArray(
	"A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z"
	) />

<!--- Create a constant for line feeds. --->
<cfset strNL = (Chr( 13 ) & Chr( 10 )) />

<!---
	Create a collections utility object that will be used to
	shuffle the arrays by reference.
--->
<cfset objCollection = CreateObject(
	"java",
	"java.util.Collections"
	) />

<!---
	Create a string buffer to build the individual values
	before we store them in the unsorted file.
--->
<cfset sbUnsorted = CreateObject(
	"java",
	"java.lang.StringBuffer"
	).Init() />

<!--- Create random values for each row. --->
<cfloop
	index="intRow"
	from="1"
	to="#intTotalRows#"
	step="1">

	<!--- Shuffle the character array. --->
	<cfset objCollection.Shuffle( arrChars ) />

	<!---
		Convert the array to a string and then add it as an
		individual line to the output buffer.
	--->
	<cfset sbUnsorted.Append(
		JavaCast(
			"string",
			(ArrayToList( arrChars, "" ) & strNL)
			)
		) />

</cfloop>

<!--- Write the unsorted values string buffer to disk. --->
<cffile
	action="WRITE"
	file="#ExpandPath( './unsorted.txt' )#"
	output="#sbUnsorted.ToString()#"
	addnewline="false"
	/>


<!---
	ASSERT: At this point, we have a completely unsorted text
	file, unsorted.txt. Our goal is to create a sorted file,
	sorted.txt.
--->

<!---
	Keep track of how many files we end up using for our
	sub-sorting of the larger file.
--->
<cfset intFileCount = 0 />

<!---
	Create a buffered file line reader so that we can start
	reading in the unsorted file one line at a time. By doing
	this, we can create the individual files without totally
	chewing up the server's memory.
--->
<cfset objLineReader = CreateObject( "java", "java.io.LineNumberReader" ).Init(
	<!---
		Create a buffered reader to feed our line number
		reader. This will help optimize the file access.
	--->
	CreateObject( "java", "java.io.BufferedReader" ).Init(
		<!--- Create a file reader to buffer. --->
		CreateObject( "java", "java.io.FileReader" ).Init(
			JavaCast(
				"string",
				ExpandPath( "./unsorted.txt" )
				)
			)
		)
	) />

<!---
	Create an array to keep track of the lines for our
	sub-sort file.
--->
<cfset arrLines = ArrayNew( 1 ) />

<!--- Get the first value from the line reader. --->
<cfset strLine = objLineReader.ReadLine() />

<!---
	Keep looping over the line reader until it fails to return
	a value. We will know that it returned a NULL value when
	our variable no longer exists in the page scope.
--->
<cfloop condition="StructKeyExists( VARIABLES, 'strLine' )">

	<!--- Add the current line of data to our array. --->
	<cfset ArrayAppend(
		arrLines,
		strLine
		) />

	<!---
		Check to see if we have reached the max row size we
		wanted to use for the sub files.
	--->
	<cfif (ArrayLen( arrLines ) EQ intFileRowSize)>

		<!---
			Sort the array. Each individual file must contain
			sorted values.
		--->
		<cfset ArraySort( arrLines, "text", "ASC" ) />

		<!--- Increment the file counter. --->
		<cfset intFileCount = (intFileCount + 1) />

		<!--- Write these lines to the sub-file. --->
		<cffile
			action="WRITE"
			file="#ExpandPath( './sorted_#intFileCount#.txt' )#"
			output="#ArrayToList( arrLines, strNL )#"
			addnewline="false"
			/>

		<!---
			Create a new array so that we can keep reading in
			lines for the next sub-sorted file.
		--->
		<cfset arrLines = ArrayNew( 1 ) />

	</cfif>

	<!--- Read in the next line of data. --->
	<cfset strLine = objLineReader.ReadLine() />

</cfloop>

<!---
	At this point, we may have written some sub-sort files to
	disk. However, we might still have data left in our lines
	array. Check to see if we need to write one final file.
--->
<cfif ArrayLen( arrLines )>

	<!---
		Sort the array. Each individual file must contain
		sorted values.
	--->
	<cfset ArraySort( arrLines, "text", "ASC" ) />

	<!--- Increment the file counter. --->
	<cfset intFileCount = (intFileCount + 1) />

	<!--- Write these lines to the sub-file. --->
	<cffile
		action="WRITE"
		file="#ExpandPath( './sorted_#intFileCount#.txt' )#"
		output="#ArrayToList( arrLines, strNL )#"
		addnewline="false"
		/>

</cfif>

<!--- Close the file reader. --->
<cfset objLineReader.Close() />

<!---
	ASSERT: At this point, we have split our unsorted file up
	into many smaller, sorted files. Now, here's where it gets
	really exciting. We have to combine each of those sorted
	files into a single sorted file.
--->

<!---
	Now, we are gonna get out the latex and do some really
	kinky stuff with the language. We are going to create a
	query object that has two columns: one for the line file
	reader for each file and one for the smallest row of data
	from that file.
--->
<cfset qReader = QueryNew( "reader, value" ) />

<!---
	Loop over the file count, add a row to the query, and
	populate the value with the first record.
--->
<cfloop
	index="intFileIndex"
	from="1"
	to="#intFileCount#"
	step="1">

	<!--- Add a row to the query. --->
	<cfset QueryAddRow( qReader ) />

	<!---
		Create a file reader for this record and store that
		file reader in the query. Notice that we are not
		casting it to any Java type. This is sooooooo not the
		proper way to use a query :D
	--->
	<cfset qReader[ "reader" ][ intFileIndex ] = CreateObject( "java", "java.io.LineNumberReader" ).Init(
		CreateObject( "java", "java.io.BufferedReader" ).Init(
			CreateObject( "java", "java.io.FileReader" ).Init(
				JavaCast( "string", ExpandPath( "./sorted_#intFileIndex#.txt" ) )
				)
			)
		) />

	<!---
		Read the first row from that file. Since a file only
		gets written if it has data, we don't have to worry
		about checking row validity at this point. Be sure to
		cast this as a string as we will need to be able to
		sort on it properly.
	--->
	<cfset qReader[ "value" ][ intFileIndex ] = JavaCast(
		"string",
		qReader[ "reader" ][ intFileIndex ].ReadLine()
		) />

</cfloop>


<!---
	Create a buffered writer to create the final sorted file.
	This will allow us to optimize file writes.
--->
<cfset objOutput = CreateObject( "java", "java.io.BufferedWriter" ).Init(
	<!---
		Create a file writer to feed our buffered writer. By
		doing this, we can let the writer write the data to
		the file system as it sees fit.
	--->
	CreateObject( "java", "java.io.FileWriter" ).Init(
		<!--- Create a file name to store the sorted data. --->
		JavaCast(
			"string",
			ExpandPath( "./sorted.txt" )
			)
		)
	) />


<!--- Keep looping until we break out of the loop. --->
<cfloop condition="true">

	<!---
		The first thing we want to do is sort the query by
		value so that the line reader with the smallest value
		is currently the first record. Use a query of queries
		to store it right back into itself.
	--->
	<cfquery name="qReader" dbtype="query">
		SELECT
			*
		FROM
			qReader
		ORDER BY
			[value] ASC
	</cfquery>

	<!--- Write the smallest value to the file. --->
	<cfset objOutput.Write(
		JavaCast( "string", qReader.value )
		) />

	<!--- Add a new line. --->
	<cfset objOutput.NewLine() />

	<!---
		Get a reference to the reader. We cannot refer to the
		reader directly in the query column as there is some
		sort of strange cast issue that happens. By getting a
		reference to it first, it will properly cast to the
		LineNumberReader object.
	--->
	<cfset objReader = qReader.reader />

	<!---
		Get the next value from the first line reader. We
		cannot store that directly into the query as we want
		to check to see if a NULL value was returned.
	--->
	<cfset strValue = objReader.ReadLine() />

	<!---
		Check to see if we have a valid value returned. If a
		NULL value was returned, strValue will no longer exist
		in the page scope.
	--->
	<cfif StructKeyExists( VARIABLES, "strValue" )>

		<!---
			We have a good value. Store it in the value column
			of the current line reader.
		--->
		<cfset qReader[ "value" ][ 1 ] = JavaCast(
			"string",
			strValue
			) />

	<cfelse>

		<!---
			A valid value was NOT returned from the line
			reader. That means this line reader is of no more
			use to us. Close the file input stream.
		--->
		<cfset objReader.Close() />

		<!--- Delete this row from the query. --->
		<cfset qReader.RemoveRows(
			JavaCast( "int", 0 ),
			JavaCast( "int", 1 )
			) />

	</cfif>

	<!---
		Check to see if we have any records left in our query.
		If we do not, then we have read in all of the
		sub-sorted files and written them out to our buffered
		output.
	--->
	<cfif NOT qReader.RecordCount>

		<!--- Close the current output. --->
		<cfset objOutput.Close() />

		<!--- Break out of the loop. --->
		<cfbreak />

	</cfif>

</cfloop>

The first part of the above process actually generates a random data file. This is the content of that file:

GTUJXMEHPKICBQYZDAWLRSVFON
WHMTQVKZFULDOYGCRXEAJNSPIB
SMDPXWKYTNZFCOHALIJBRGEUVQ
VYWLUIFGBRXNKECTAPMSZJHDOQ
NODBUJHCZQAPIMKRELXGTSVWYF
TINGSFBJPXHDWZMCAEOQLRVUYK
WDBJIETYRHXGOSVZUFLCKAMPQN
PWBEKMJOGAYISVTFXDRZUQHNCL
ZSPAEILTWVXYMOQBJDNCKGHFUR
KWDMCPZBTYAEOQVGNJUFRIHSXL
DFIXZQUGLPCVEYSBONTHWKARJM
TBPJZWYMAKQLRXUDNCSEGOHVIF
ICLJWBTPFVQUZOGYRESXHKMNDA
GDKIBVWJTYRCPZEOUALFHSQNXM
AGLFKOQRNTZPBWUVEHYMIXJSCD
EHWMBZJAKICFSQLTRNYDXOPVUG
MNWUALEQIRPGTKHBFVYOJDZXCS
CXASBFOEJQGZWKHUTRVMDLNPYI
YXMGQIHSOKDPVTLWJBAUFRZCNE
RGWBEMOHQYZFSCVTAUNKPJXLDI
JLYCGZHKDSWRBPUIOTEFMXQVAN
EZXVPSDIQCUTRNKLAYOJHWGMBF
PRWBMJSETZUYNAFQGLOKXIHVCD
FMKPYCGANZQXEUROSVDIJTWBLH
IRNKWEXCUBYMVOQLHZSGDJPFAT
VPTNQYLBEWZGCAXJKOMSHFUIDR
HUTRAJPVBXCDGYSIZWLFOQEKNM
HRFEGUDQANMTXYKPCZSWVLIBOJ
ESKXZUWALQBYCPJVDINRTMGOHF
NBVGXZUISFCHKPELMWJDAYROTQ

As the page starts to process, it splits the unsorted file into several smaller files. Each of these files is sorted before it is written to disk. Here are the 4 sub-sorted files (notice that they don't all have to have the same number of data rows):

sorted_1.txt

GTUJXMEHPKICBQYZDAWLRSVFON
NODBUJHCZQAPIMKRELXGTSVWYF
PWBEKMJOGAYISVTFXDRZUQHNCL
SMDPXWKYTNZFCOHALIJBRGEUVQ
TINGSFBJPXHDWZMCAEOQLRVUYK
VYWLUIFGBRXNKECTAPMSZJHDOQ
WDBJIETYRHXGOSVZUFLCKAMPQN
WHMTQVKZFULDOYGCRXEAJNSPIB

sorted_2.txt

AGLFKOQRNTZPBWUVEHYMIXJSCD
DFIXZQUGLPCVEYSBONTHWKARJM
EHWMBZJAKICFSQLTRNYDXOPVUG
GDKIBVWJTYRCPZEOUALFHSQNXM
ICLJWBTPFVQUZOGYRESXHKMNDA
KWDMCPZBTYAEOQVGNJUFRIHSXL
TBPJZWYMAKQLRXUDNCSEGOHVIF
ZSPAEILTWVXYMOQBJDNCKGHFUR

sorted_3.txt

CXASBFOEJQGZWKHUTRVMDLNPYI
EZXVPSDIQCUTRNKLAYOJHWGMBF
FMKPYCGANZQXEUROSVDIJTWBLH
JLYCGZHKDSWRBPUIOTEFMXQVAN
MNWUALEQIRPGTKHBFVYOJDZXCS
PRWBMJSETZUYNAFQGLOKXIHVCD
RGWBEMOHQYZFSCVTAUNKPJXLDI
YXMGQIHSOKDPVTLWJBAUFRZCNE

sorted_4.txt

ESKXZUWALQBYCPJVDINRTMGOHF
HRFEGUDQANMTXYKPCZSWVLIBOJ
HUTRAJPVBXCDGYSIZWLFOQEKNM
IRNKWEXCUBYMVOQLHZSGDJPFAT
NBVGXZUISFCHKPELMWJDAYROTQ
VPTNQYLBEWZGCAXJKOMSHFUIDR

Once we have each of those files sorted and written out on their own, we read them back in, one line at a time. On each pass, the smallest of the top rows is written to the buffered output (resultant) file stream. The final, sorted data was:
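The query-of-queries re-sort is really playing the role of a priority queue: every pass pulls the reader whose current line is the smallest. In plain Java, you would typically reach for java.util.PriorityQueue to do the same k-way merge. Here is a sketch over in-memory lists rather than file readers, so the merge logic stands on its own (class and method names are my own):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {

	// Merge several already-sorted lists into one sorted list by
	// always taking the smallest head element across all lists -
	// the same idea as re-sorting the reader query on its "value"
	// column on every pass of the loop.
	static List<String> merge( List<List<String>> sortedChunks ) {
		// Each heap entry is { chunkIndex, positionWithinChunk },
		// ordered by the value currently at that position.
		PriorityQueue<int[]> heap = new PriorityQueue<>(
			Comparator.comparing(
				( int[] entry ) -> sortedChunks.get( entry[ 0 ] ).get( entry[ 1 ] )
				)
			);

		// Seed the heap with the head of every non-empty chunk.
		for ( int i = 0 ; i < sortedChunks.size() ; i++ ) {
			if ( ! sortedChunks.get( i ).isEmpty() ) {
				heap.add( new int[] { i, 0 } );
			}
		}

		List<String> result = new ArrayList<>();

		while ( ! heap.isEmpty() ) {
			int[] head = heap.poll();
			List<String> chunk = sortedChunks.get( head[ 0 ] );
			result.add( chunk.get( head[ 1 ] ) );

			// Advance this "reader" to its next line, if any -
			// the analog of ReadLine() returning non-NULL.
			if ( (head[ 1 ] + 1) < chunk.size() ) {
				heap.add( new int[] { head[ 0 ], head[ 1 ] + 1 } );
			}
		}

		return ( result );
	}

}
```

With k sub-files, each step costs O(log k) with a heap, versus re-sorting the whole query every pass; for the handful of files in this demo, the difference is of course academic.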

AGLFKOQRNTZPBWUVEHYMIXJSCD
CXASBFOEJQGZWKHUTRVMDLNPYI
DFIXZQUGLPCVEYSBONTHWKARJM
EHWMBZJAKICFSQLTRNYDXOPVUG
ESKXZUWALQBYCPJVDINRTMGOHF
EZXVPSDIQCUTRNKLAYOJHWGMBF
FMKPYCGANZQXEUROSVDIJTWBLH
GDKIBVWJTYRCPZEOUALFHSQNXM
GTUJXMEHPKICBQYZDAWLRSVFON
HRFEGUDQANMTXYKPCZSWVLIBOJ
HUTRAJPVBXCDGYSIZWLFOQEKNM
ICLJWBTPFVQUZOGYRESXHKMNDA
IRNKWEXCUBYMVOQLHZSGDJPFAT
JLYCGZHKDSWRBPUIOTEFMXQVAN
KWDMCPZBTYAEOQVGNJUFRIHSXL
MNWUALEQIRPGTKHBFVYOJDZXCS
NBVGXZUISFCHKPELMWJDAYROTQ
NODBUJHCZQAPIMKRELXGTSVWYF
PRWBMJSETZUYNAFQGLOKXIHVCD
PWBEKMJOGAYISVTFXDRZUQHNCL
RGWBEMOHQYZFSCVTAUNKPJXLDI
SMDPXWKYTNZFCOHALIJBRGEUVQ
TBPJZWYMAKQLRXUDNCSEGOHVIF
TINGSFBJPXHDWZMCAEOQLRVUYK
VPTNQYLBEWZGCAXJKOMSHFUIDR
VYWLUIFGBRXNKECTAPMSZJHDOQ
WDBJIETYRHXGOSVZUFLCKAMPQN
WHMTQVKZFULDOYGCRXEAJNSPIB
YXMGQIHSOKDPVTLWJBAUFRZCNE
ZSPAEILTWVXYMOQBJDNCKGHFUR

Now, while it may be hard to look at that and say, "Oh yeah, that's totally sorted," I pasted it into Excel and did a data sort. Indeed, none of the rows shifted, meaning this bad boy has been sorted.

This was a very fun exercise to do. Now, I know the code is a bit long, and you probably didn't read through it, but you should really take a look at how I used the ColdFusion query object - it's just bananas.




Reader Comments

Good stuff Ben, and thanks for fleshing it out! I should point out that I wouldn't really say I "created" the algorithm. I'm sure it's well known enough, but I didn't find it as easily as I thought I would have on the internet (what I did find didn't seem to work for me). In any case, I didn't want to implicitly take credit for it by not saying anything =).

What you did basically does follow what I did ... and I DO like how you used the CF query to store the file reader information. I had a lot of boilerplate code to deal with it in a couple of Java arrays or ArrayLists (can't remember at the moment), and having a query object like you used would have really cut down on the crap. =)

The only thing I would have done differently is when you said "Check to see if we need to write one final file." I would have put that into a function so we didn't need 2 blocks of code that perform the same basic operation (DRY, you know). However, I know this is only an example, so I shouldn't get nitpicky =).
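Sammy's point is that the sort / increment / write steps appear twice: once inside the read loop and once more for the final partial chunk. Extracted into a single helper, the duplicated block might look like this in Java (a sketch; the class and method names are my own):

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Collections;
import java.util.List;

public class ChunkWriter {

	private int fileCount = 0;

	// Sort the buffered lines and write them out as the next
	// numbered sub-file, returning the file that was written.
	// One call site inside the read loop, one after it for the
	// final partial chunk - no duplicated block.
	File writeSortedChunk( List<String> lines, File directory ) throws IOException {
		Collections.sort( lines );
		fileCount++;

		File out = new File( directory, ( "sorted_" + fileCount + ".txt" ) );

		try ( BufferedWriter writer = new BufferedWriter( new FileWriter( out ) ) ) {
			for ( String line : lines ) {
				writer.write( line );
				writer.newLine();
			}
		}

		return ( out );
	}

}
```

In the ColdFusion version, the same extraction would be a UDF taking the lines array and the running file counter.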


@Sam,

Yeah, I know that you didn't invent the algorithm, you merely put it into practice. All this stuff was created long ago it seems. I took an algorithms course in college that covered an enormous book of stuff just like this... merge sorts, binary sorts, bubble sorts, tree traversal, big O notation, orders of magnitude.... Naturally I have long since forgotten everything from that class (except for the D+ on my first test) :)

But the matter still stands - it was fun to do in ColdFusion as I have never done anything like this before.

As far as the UDF stuff for DRY programming, I honestly tend to avoid doing that in examples. The reason for it is that as you are reading down through the demo, if you hit a UDF reference, then you have to scroll away from it to see what it does. By putting the code inline (most of the time), I figure it is easier to follow.

And, ideally, this entire algorithm would probably be wrapped up in some UDF or CFC: ExternalMergeSort( strFilePathIn, strFilePathOut ).

@Shuns,

Thanks... nothing like a little insanity to get me through the week.


Ben, just had another idea:

Replace the "Launch code in new window »" link with one that turns the code block into a textbox (with the code included) so people can copy the code without going to a new page. Not sure how easy it is to do, but it might be worth doing. Either that, or check out lightWindow to just pop up an overlay window with the code without going to a new page.


@Boyan,

That could be pretty cool. I don't think it would be that difficult, but I will evaluate. Consider it on the "List." Thanks.


Hey...
I also need to sort a big file, but the sorting should be done based only on the first word of each line.
Any suggestion on how to do that? Thanks in advance.

