Breaking Enormous CSV Files Into Smaller CSV Files

Posted December 15, 2006 at 7:15 PM

Tags: ColdFusion

Someone over on CF-Talk was having a problem with parsing a very large CSV file. Someone suggested taking that file and splitting it down into smaller files and then parsing each file individually. So, for the poster's benefit, I have come up with one way to do this (you TOO can ask for demos to be built).

This method uses the java.io.LineNumberReader to read in one line at a time from a potentially huge CSV file. For every X number of lines (where X is an arbitrary number), the algorithm dumps the already-read lines into a smaller CSV file. The idea is that you would break up the huge CSV file into smaller files and then parse each smaller file individually.

The benefits of doing it this way are:

  1. The originally CSV file does not have to be read in as a whole. This might be slower than reading it all in at once, but it is going to be MUCH nicer on the RAM of your server (which will, in return, free up RAM for faster processing of smaller CSV files).
  2. The line number reader provides a very convenient way to split the original CSV into organized, smaller CSV files.

Ok, so now, on to the code. This code assumes that you have a data file "data.txt" and that the smaller files are going to be stored into the sub directory "small". These values, of course, would have to be tweaked for your situation.

 Launch code in new window » Download code as text file »

  • <!---
  • Set the number of lines to use per smaller data file. This
  • is the max number of lines per file; there is potential
  • for the last file created to have less than this number.
  • --->
  • <cfset REQUEST.MaxLines = 1000 />
  •  
  • <!--- Get path to HUGE CSV data file.. --->
  • <cfset REQUEST.DataFilePath = ExpandPath( "./data.txt" ) />
  •  
  • <!---
  • Create line number reader. As you can see, as per usual
  • Java stylings, we get this by wrapping the file several
  • times within utility classes.
  • --->
  • <cfset REQUEST.LineReader = CreateObject( "java", "java.io.LineNumberReader" ).Init(
  • CreateObject( "java", "java.io.FileReader" ).Init(
  • CreateObject( "java", "java.io.File" ).Init(
  • REQUEST.DataFilePath
  • )
  • )
  • ) />
  •  
  •  
  • <!---
  • This is a string buffer for building the smaller CSV data
  • files. The string buffer allows us to append strings at
  • one time (instead of for every string concatenation).
  • --->
  • <cfset REQUEST.CSVData = CreateObject(
  • "java",
  • "java.lang.StringBuffer"
  • ).Init() />
  •  
  •  
  • <!--- Read the first line of data. --->
  • <cfset REQUEST.LineData = REQUEST.LineReader.ReadLine() />
  •  
  •  
  • <!---
  • Continue while we still have lines to read. Once we
  • run out of lines to read, the LineReader will return
  • null. That will cause the key, "LineData," to be
  • removed from its parent scope, REQUEST.
  • --->
  • <cfloop condition="StructKeyExists( REQUEST, 'LineData' )">
  •  
  • <!--- Get the line number for this iteration. --->
  • <cfset REQUEST.LineNumber = REQUEST.LineReader.GetLineNumber() />
  •  
  • <!---
  • Add this line of data to the string buffer. Be sure
  • to add new lines as the line reader strips out the
  • new line / carriage return data.
  • --->
  • <cfset REQUEST.CSVData.Append(
  • REQUEST.LineData & Chr( 13 ) & Chr( 10 )
  • ) />
  •  
  •  
  • <!--- Read the next line. --->
  • <cfset REQUEST.LineData = REQUEST.LineReader.ReadLine() />
  •  
  •  
  • <!---
  • Check to see if our buffer is big enough. For this demo,
  • we will be creating files that are 1000 lines or less.
  • At this point, we might have 100 lines, or, we might not
  • have ANY lines left to read. If we do not have any lines
  • left, then the LineData variable will no longer exist.
  • --->
  • <cfif (
  • (NOT (REQUEST.LineNumber MOD REQUEST.MaxLines)) OR
  • (NOT StructKeyExists( REQUEST, "LineData" ))
  • )>
  •  
  • <!---
  • Get the CSV file name for this smaller file. The
  • file name is based on the last line number that
  • was read in at the end of the previous iteration.
  • --->
  • <cfset REQUEST.SmallFilePath = ExpandPath(
  • "./small/" &
  • NumberFormat( REQUEST.LineNumber, '0000000000' ) &
  • ".txt"
  • ) />
  •  
  • <!---
  • We have a 1000 lines. Output the CSV buffer to the
  • smaller data file. For ease of use, the file name
  • will be based on the line numbers (see just above).
  • --->
  • <cffile
  • action="WRITE"
  • file="#REQUEST.SmallFilePath#"
  • output="#REQUEST.CSVData.ToString()#"
  • addnewline="false"
  • fixnewline="true"
  • />
  •  
  • <!---
  • Create a new string buffer to be used with the
  • next CSV file.
  • --->
  • <cfset REQUEST.CSVData = CreateObject(
  • "java",
  • "java.lang.StringBuffer"
  • ).Init() />
  •  
  • </cfif>
  •  
  • </cfloop>

At this point, there should be many smaller files created based on the single large file. The data in each of these files is in the similar format. This, of course, assumes that the CSV file does NOT have a header row. Of course, for such large files, I do not think this is a bad assumption. Once this process is completed, you could have it CFLocation to another page that starts to parse the smaller CSV files one at a time (per request??).

I played around with using a Buffered Reader (around the file reader) but that didn't seem to affect performance in any way. Of course, I don't really know much about Java, so who knows. Other people can tweak this much better than I can. What I do know is that on my Dev machine, the above example split an 11 MegaByte file (230,000+ lines of data) in 232 smaller files in 16 seconds (on average). Not bad at all.

Download Code Snippet ZIP File

Comments (4)  |  Post Comment  |  Ask Ben  |  Permalink  |  Other Searches  |  Print Page




Adobe ColdFusion 8.0.1 Update - Helping Programmers To Be Signifanctly Less Girlie - Download ColdFusion 8 Update 8.0.1 Now.

Reader Comments

Is it possible you could have a huge CSV file with no carraige return / new line characters? This might report only one line, and just create a copy of the entire file?

Perhaps there would be a way to factor in data size or number of values or something?

Of course this might not be an issue, but I know from reading your blog you like to explore all avenues for learnings sake and it crossed my mind as I read your post.

Posted by Ryan Stewart on Dec 19, 2006 at 4:08 AM


Ryan,

That is an interesting point. I do not think that you would ever have that many columns for a single line that would create memory problems. If it took over 200,000 records with several columns to create 11 Megs of CSV file, then it would take a single record with probably over 800,000 Columns to create a file of the same size.

But as you say, I like to explore... but I wouldn't even know how to with this one. The ReadLine() doesn't know there are no line break until it reads a line. I suppose you could stream in the file through a standard buffered input stream to find an occurrence of the new line. If you didn't find it, you could assume a columns-only approach, and if you did, you could assume a multi-records approach. Hmmmm.

Posted by Ben Nadel on Dec 19, 2006 at 6:59 AM


Having never worked with a CSV file, I never really gave much though as to the point of the carraige return. I asked a friend who said his understanding of it was that the carraige return was indicative of a new record, explaining to me that each line represents a record.

That makes much more sense now, and it's easy to see why you didn't consider no carraige returns. But hey, if nothing else my seemingly daft question has giving me an understanding of CSV files.

Cheers,

Ryan.

Posted by Ryan Stewart on Dec 20, 2006 at 10:27 AM


Ryan,

Heck yeah! There are no stupid questions as long as we are all learning something right? Always glad to help.

Posted by Ben Nadel on Dec 20, 2006 at 10:32 AM


Post Comment  |  Ask Ben


Home   |   Web Log   |   ColdFusion   |   Projects   |   Resume   |   Job Form   |   Search   |   Contact
Epicenter Consulting - Custom Software Solutions for Business Evolution HostMySite.com - The Leader In ColdFusion Hosting