Breaking Enormous CSV Files Into Smaller CSV Files
Someone over on CF-Talk was having a problem with parsing a very large CSV file. Someone suggested taking that file and splitting it down into smaller files and then parsing each file individually. So, for the poster's benefit, I have come up with one way to do this (you TOO can ask for demos to be built).
This method uses the java.io.LineNumberReader to read in one line at a time from a potentially huge CSV file. For every X number of lines (where X is an arbitrary number), the algorithm dumps the already-read lines into a smaller CSV file. The idea is that you would break up the huge CSV file into smaller files and then parse each smaller file individually.
The benefits of doing it this way are:
The originally CSV file does not have to be read in as a whole. This might be slower than reading it all in at once, but it is going to be MUCH nicer on the RAM of your server (which will, in return, free up RAM for faster processing of smaller CSV files).
The line number reader provides a very convenient way to split the original CSV into organized, smaller CSV files.
Ok, so now, on to the code. This code assumes that you have a data file "data.txt" and that the smaller files are going to be stored into the sub directory "small". These values, of course, would have to be tweaked for your situation.
<!--- Set the number of lines to use per smaller data file. This is the max number of lines per file; there is potential for the last file created to have less than this number. ---> <cfset REQUEST.MaxLines = 1000 /> <!--- Get path to HUGE CSV data file.. ---> <cfset REQUEST.DataFilePath = ExpandPath( "./data.txt" ) /> <!--- Create line number reader. As you can see, as per usual Java stylings, we get this by wrapping the file several times within utility classes. ---> <cfset REQUEST.LineReader = CreateObject( "java", "java.io.LineNumberReader" ).Init( CreateObject( "java", "java.io.FileReader" ).Init( CreateObject( "java", "java.io.File" ).Init( REQUEST.DataFilePath ) ) ) /> <!--- This is a string buffer for building the smaller CSV data files. The string buffer allows us to append strings at one time (instead of for every string concatenation). ---> <cfset REQUEST.CSVData = CreateObject( "java", "java.lang.StringBuffer" ).Init() /> <!--- Read the first line of data. ---> <cfset REQUEST.LineData = REQUEST.LineReader.ReadLine() /> <!--- Continue while we still have lines to read. Once we run out of lines to read, the LineReader will return null. That will cause the key, "LineData," to be removed from its parent scope, REQUEST. ---> <cfloop condition="StructKeyExists( REQUEST, 'LineData' )"> <!--- Get the line number for this iteration. ---> <cfset REQUEST.LineNumber = REQUEST.LineReader.GetLineNumber() /> <!--- Add this line of data to the string buffer. Be sure to add new lines as the line reader strips out the new line / carriage return data. ---> <cfset REQUEST.CSVData.Append( REQUEST.LineData & Chr( 13 ) & Chr( 10 ) ) /> <!--- Read the next line. ---> <cfset REQUEST.LineData = REQUEST.LineReader.ReadLine() /> <!--- Check to see if our buffer is big enough. For this demo, we will be creating files that are 1000 lines or less. At this point, we might have 100 lines, or, we might not have ANY lines left to read. If we do not have any lines left, then the LineData variable will no longer exist. ---> <cfif ( (NOT (REQUEST.LineNumber MOD REQUEST.MaxLines)) OR (NOT StructKeyExists( REQUEST, "LineData" )) )> <!--- Get the CSV file name for this smaller file. The file name is based on the last line number that was read in at the end of the previous iteration. ---> <cfset REQUEST.SmallFilePath = ExpandPath( "./small/" & NumberFormat( REQUEST.LineNumber, '0000000000' ) & ".txt" ) /> <!--- We have a 1000 lines. Output the CSV buffer to the smaller data file. For ease of use, the file name will be based on the line numbers (see just above). ---> <cffile action="WRITE" file="#REQUEST.SmallFilePath#" output="#REQUEST.CSVData.ToString()#" addnewline="false" fixnewline="true" /> <!--- Create a new string buffer to be used with the next CSV file. ---> <cfset REQUEST.CSVData = CreateObject( "java", "java.lang.StringBuffer" ).Init() /> </cfif> </cfloop>
At this point, there should be many smaller files created based on the single large file. The data in each of these files is in the similar format. This, of course, assumes that the CSV file does NOT have a header row. Of course, for such large files, I do not think this is a bad assumption. Once this process is completed, you could have it CFLocation to another page that starts to parse the smaller CSV files one at a time (per request??).
I played around with using a Buffered Reader (around the file reader) but that didn't seem to affect performance in any way. Of course, I don't really know much about Java, so who knows. Other people can tweak this much better than I can. What I do know is that on my Dev machine, the above example split an 11 MegaByte file (230,000+ lines of data) in 232 smaller files in 16 seconds (on average). Not bad at all.
Want to use code from this post? Check out the license.
Is it possible you could have a huge CSV file with no carraige return / new line characters? This might report only one line, and just create a copy of the entire file?
Perhaps there would be a way to factor in data size or number of values or something?
Of course this might not be an issue, but I know from reading your blog you like to explore all avenues for learnings sake and it crossed my mind as I read your post.
That is an interesting point. I do not think that you would ever have that many columns for a single line that would create memory problems. If it took over 200,000 records with several columns to create 11 Megs of CSV file, then it would take a single record with probably over 800,000 Columns to create a file of the same size.
But as you say, I like to explore... but I wouldn't even know how to with this one. The ReadLine() doesn't know there are no line break until it reads a line. I suppose you could stream in the file through a standard buffered input stream to find an occurrence of the new line. If you didn't find it, you could assume a columns-only approach, and if you did, you could assume a multi-records approach. Hmmmm.
Having never worked with a CSV file, I never really gave much though as to the point of the carraige return. I asked a friend who said his understanding of it was that the carraige return was indicative of a new record, explaining to me that each line represents a record.
That makes much more sense now, and it's easy to see why you didn't consider no carraige returns. But hey, if nothing else my seemingly daft question has giving me an understanding of CSV files.
Heck yeah! There are no stupid questions as long as we are all learning something right? Always glad to help.
Hi there Ben,
Great post which I have just adapted to try and address a memory problem I have with a CSV import script.
Basically Coldfusion is running out of heap space when processing a 31mb CSV file.
To get around this I assumed that if I split up the CSV into smaller parts and processed each one in turn that would solve the issue. However, after doing the work I am finding out it isn't - at least not with how I am implementing it.
Can I just check something. At the moment I split the file into smaller parts and then loop over each part. I have a CSV CFC that takes each part and returns a query for processing.
The problem I think is occuring here, is that even though I am reusing the same variable when calling the CFC the memory from the previous CSV file is not being freed. I assume that Java will only free up the memory used when the template ends?
I'm doing something like this, in psuedo code:
loop csv files
objCSV = Call CSV CFC passing in the current file
process the query in objCSV
Is this right? Will Java keep on allocating the memory for each file part and then not freeing it up until the whole template ends?
If this is the case I don't know what to do as this template runs as a scheduled task so no way to change location.. Although just thinking about it I could always setup a loop which calls CFHTTP.. hmmm..
Any advise would be greatly received and thanks again for this great piece of code (how fast!).
Right, I've answered my own question. :)
I went the CFHTTP route and broke the main processing logic into a seperate template. I then used CFHTTP to call this template within the main template and return the statistics in WDDX. The caller decodes the WDDX and add's the data to the stats array.
I ran the CF8 server monitor and watched the new system running and the memory usage is much much better.. Job done. :)
It all runs *alot* faster now as well, compared to when it all ran in one template..
This is definetely a system I'll be using in future.
Thanks again for this very useful post and the splitting code.
Yeah, in my experience, ColdFusion has problems releasing memory while a single request is running, no matter how you break it up. As such, you have to find someway to make several requests to process HUGE amounts of data (as you have figured out).
It was a great learning experience. Not only did I try something a little different but I also got to try out the CF8 server monitor which was interesting.
I've never looking at heap usage or even tuning the JVM but this whole experience has forced me to learn about it which has been great.
The more we learn on the job the better it is..
Great attitude :) If you are using CF8, I think you can start launching things in CFThread to alleviate some of the garbage collection concerns.
Aha, damn I totally forgot about cfthread.
Another tag I haven't played with yet.. :)
Yeah that would have been the perfect alternative to this approach.
The only thing with this approach is wouldn't there be the same memory problems with running the CSV processing in seperate concurrent threads as each one would take up a fair amount of memory.
Can you run a task in a seperate thread and wait for thread completion before continuing with the outer template?
You can certainly run one thread an wait for it to finish before starting the next one. Since they are running as unique threads, I am not 100% how the memory would work. I assume it would be smart enough to clear it out... but not sure.
I am trying to take this and put it in a cfc, that i can use to help in my various csv importing processes.
However, i have 2 questions...
1. Why is it in the request scope?
2. How can you make sure your object is instantiated?
Also what kind of error checking can you do to protect against errors?
There's no particular reason that I put this in the request scope. I probably was just going through a small, "I don't like the Variables scope" moment :)
Any call to the component will instantiate it. There is no way to call it without it getting instantiated; the trick, however, might be to instantiate, cache the CFC, and then make all subsequent calls to that already-instantiated one (the singleton design pattern).
As far as protecting against errors, that's a tough call. You are doing file read/writing, so that could cause errors; but, at that point, is that an error you can recover from?
Thanks for your comment...
I am wondering about the addnewline, fixnewline, i just red cf 8 docs, that says that is yes/no not true/false..
Only reason, is that at one point i have a csv file with an additional line break every other line, and the only thing i thought of was, some error in the add new line..
Hope your doing well..
-Old CF Purist :)
The true/false data type was only added in CF7 maybe? I can't remember. So, it used to be that you had to have yes/no. But now, you can use true/false in most of those cases.
I was happy when I see this blog until my test result in an error.
I'm currently in a situation where I have to find a way to correct my CF app. to accomodate huge file upload.
I was writing your codes up until the third line when I tried to cfdump REQUEST.LineReader to see the object and got this error.
I don't have a lot of Java knowledge so I'm not sure what this error says and what I need to fix in order to use your code in my app.
My file contains about 50.000 records and each record has about 250 columns. I'm in desperate need of this solution to work. Do you think you could helpa little?
I put your file in a function and my application is checking if the file is larger than 260Kb (approx. 1000 records) then it calls this funtion to split the file.
<CFARGUMENT name="UploadedFilePath" type="String" required="TRUE">
<cfset REQUEST.MaxLines = 1000 />
<cfset REQUEST.DataFilePath "#arguments.UploadedFilePath#" />
<cfset REQUEST.LineReader = CreateObject( "java", "java.io.LineNumberReader" ).Init(CreateObject( "java", "java.io.FileReader" ).Init(CreateObject( "java", "java.io.File" ).Init(REQUEST.DataFilePath))) />
The error says:
Object Instantiation Exception.
An exception occurred when instantiating a Java object. The class must not be an interface or an abstract class. Error: ''.
C:\Inetpub\wwwroot\test\NewLoad\load.cfm: line 131
11 :<cfset REQUEST.DataFilePath = "#arguments.UploadedFilePath#" />
12 :<cfset REQUEST.LineReader = CreateObject( "java", "java.io.LineNumberReader" ).Init(CreateObject( "java", "java.io.FileReader" ).Init(CreateObject( "java", "java.io.File" ).Init(REQUEST.DataFilePath))) />
14: <CFDUMP var="#REQUEST.LineReader#">
is this a dead forum?
This is not a forum - it's my personal web site; as such, I get a bit bogged down (especially around conference season) and my ability to respond to the comments gets a bit backed up.
Your error is probably that your file path is not fully expanded. It probably cannot create the Java File object using the value in REQUEST.DataFilePath. Make sure the path is fully qualified.
Just wanted to thank you for this script. I've been needing something like this for quite some time!
She worked like an absolute charm.
Thanks again, and keep up the great work with this site. I've found many solutions to problems I've had right here. I also really enjoy your presentation, with enough comments to make up for the rest of the world that doesn't use enough!
Thanks a lot my man :) I appreciate the comment.
I tried your code on a client's server. However, I received the following error:
Security: The requested template has been denied access to createobject(java)
I tried looking for another route to take. I ran into the following:
[Macromedia][SQLServer JDBC Driver][SQLServer]You do not have permission to use the bulk load statement.
However, I had the problem above.. go figure lol
The CSV file itself is 0.26 GB large.
I sent requests to the hosting service to see if they would lift one of the restrictions, but I doubt they will.
I guess an alternative would be to setup some type of service on my server. I hate relying on another server though in my code.