Someone over on CF-Talk was having a problem with parsing a very large CSV file. Another poster suggested taking that file and splitting it down into smaller files and then parsing each file individually. So, for the poster's benefit, I have come up with one way to do this (you TOO can ask for demos to be built).
This method uses the java.io.LineNumberReader to read in one line at a time from a potentially huge CSV file. For every X number of lines (where X is an arbitrary number), the algorithm dumps the already-read lines into a smaller CSV file. The idea is that you would break up the huge CSV file into smaller files and then parse each smaller file individually.
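Since the approach leans entirely on java.io classes anyway, the same idea can be sketched in plain Java. This is a rough illustration, not the code from this post: the class and method names (CsvSplitter, splitFile) are mine, and error handling is minimal.

```java
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.Writer;

public class CsvSplitter {

    // Read the source file one line at a time with a LineNumberReader and
    // flush the buffer to a numbered file every maxLines lines (the last
    // file may contain fewer). Returns the number of files written.
    public static int splitFile(File source, File outDir, int maxLines) throws IOException {
        int filesWritten = 0;
        try (LineNumberReader reader = new LineNumberReader(new FileReader(source))) {
            StringBuilder buffer = new StringBuilder();
            String line = reader.readLine();
            while (line != null) {
                int lineNumber = reader.getLineNumber();
                // Re-append the CRLF that readLine() strips off.
                buffer.append(line).append("\r\n");
                line = reader.readLine();
                // Flush every maxLines lines, or when the input is exhausted.
                if (lineNumber % maxLines == 0 || line == null) {
                    // File name is the zero-padded last line number read.
                    File part = new File(outDir, String.format("%010d.txt", lineNumber));
                    try (Writer out = new FileWriter(part)) {
                        out.write(buffer.toString());
                    }
                    buffer.setLength(0);
                    filesWritten++;
                }
            }
        }
        return filesWritten;
    }
}
```

Note that getLineNumber() already tracks how many lines have been read, which is what makes the "flush every X lines" check and the zero-padded file naming so convenient.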
The benefits of doing it this way are:
The original CSV file does not have to be read in as a whole. This might be slower than reading it all in at once, but it is going to be MUCH nicer on the RAM of your server (which will, in turn, free up RAM for faster processing of the smaller CSV files).
The line number reader provides a very convenient way to split the original CSV into organized, smaller CSV files.
OK, so now, on to the code. This code assumes that you have a data file "data.txt" and that the smaller files are going to be stored in the subdirectory "small". These values, of course, would have to be tweaked for your situation.
<!--- Set the number of lines to use per smaller data file. This is the max number of lines per file; the last file created may have fewer than this number. --->
<cfset REQUEST.MaxLines = 1000 />

<!--- Get the path to the HUGE CSV data file. --->
<cfset REQUEST.DataFilePath = ExpandPath( "./data.txt" ) />

<!--- Create the line number reader. As you can see, as per usual Java stylings, we get this by wrapping the file several times within utility classes. --->
<cfset REQUEST.LineReader = CreateObject( "java", "java.io.LineNumberReader" ).Init(
	CreateObject( "java", "java.io.FileReader" ).Init(
		CreateObject( "java", "java.io.File" ).Init( REQUEST.DataFilePath )
		)
	) />

<!--- This is a string buffer for building the smaller CSV data files. The string buffer allows us to append strings efficiently (instead of creating a new string for every concatenation). --->
<cfset REQUEST.CSVData = CreateObject( "java", "java.lang.StringBuffer" ).Init() />

<!--- Read the first line of data. --->
<cfset REQUEST.LineData = REQUEST.LineReader.ReadLine() />

<!--- Continue while we still have lines to read. Once we run out of lines to read, the LineReader will return null. That will cause the key, "LineData," to be removed from its parent scope, REQUEST. --->
<cfloop condition="StructKeyExists( REQUEST, 'LineData' )">

	<!--- Get the line number for this iteration. --->
	<cfset REQUEST.LineNumber = REQUEST.LineReader.GetLineNumber() />

	<!--- Add this line of data to the string buffer. Be sure to add the new line, as the line reader strips out the new line / carriage return characters. --->
	<cfset REQUEST.CSVData.Append( REQUEST.LineData & Chr( 13 ) & Chr( 10 ) ) />

	<!--- Read the next line. --->
	<cfset REQUEST.LineData = REQUEST.LineReader.ReadLine() />

	<!--- Check to see if our buffer is big enough. For this demo, we will be creating files of 1000 lines or fewer. At this point, we might have 100 lines, or we might not have ANY lines left to read. If we do not have any lines left, then the LineData variable will no longer exist. --->
	<cfif (
		(NOT (REQUEST.LineNumber MOD REQUEST.MaxLines)) OR
		(NOT StructKeyExists( REQUEST, "LineData" ))
		)>

		<!--- Get the CSV file name for this smaller file. The file name is based on the last line number that was read in at the end of the previous iteration. --->
		<cfset REQUEST.SmallFilePath = ExpandPath( "./small/" & NumberFormat( REQUEST.LineNumber, '0000000000' ) & ".txt" ) />

		<!--- We have 1000 lines (or the last of the data). Output the CSV buffer to the smaller data file. For ease of use, the file name is based on the line numbers (see just above). --->
		<cffile action="WRITE" file="#REQUEST.SmallFilePath#" output="#REQUEST.CSVData.ToString()#" addnewline="false" fixnewline="true" />

		<!--- Create a new string buffer to be used with the next CSV file. --->
		<cfset REQUEST.CSVData = CreateObject( "java", "java.lang.StringBuffer" ).Init() />

	</cfif>

</cfloop>
At this point, there should be many smaller files created based on the single large file. The data in each of these files is in the same format. This, of course, assumes that the CSV file does NOT have a header row. For such large files, I do not think this is a bad assumption. Once this process is completed, you could have it CFLocation to another page that starts to parse the smaller CSV files one at a time (per request??).
I played around with using a BufferedReader (around the FileReader), but that didn't seem to affect performance in any way. Of course, I don't really know much about Java, so who knows; other people can tweak this much better than I can. What I do know is that on my dev machine, the above example split an 11 megabyte file (230,000+ lines of data) into 232 smaller files in 16 seconds (on average). Not bad at all.
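For what it's worth, the buffered variant just slots a java.io.BufferedReader between the FileReader and the LineNumberReader. Here is a minimal Java sketch of that reader chain (the class and method names are mine, chosen for illustration):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;

public class BufferedLineCount {

    // Count the lines in a file using the same reader chain as the post,
    // with a BufferedReader inserted between FileReader and LineNumberReader.
    public static int countLines(File file) throws IOException {
        try (LineNumberReader reader = new LineNumberReader(
                new BufferedReader(new FileReader(file)))) {
            while (reader.readLine() != null) {
                // Consume lines; getLineNumber() tracks how many were read.
            }
            return reader.getLineNumber();
        }
    }
}
```

One plausible reason the buffering made no visible difference: readLine() on the outer reader already pulls data in chunks, so the extra buffer mostly helps when reading character by character.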
Want to use code from this post? Check out the license.