Someone over on CF-Talk was having a problem with parsing a very large CSV file. Someone suggested taking that file and splitting it down into smaller files and then parsing each file individually. So, for the poster's benefit, I have come up with one way to do this (you TOO can ask for demos to be built).
This method uses the java.io.LineNumberReader to read in one line at a time from a potentially huge CSV file. For every X number of lines (where X is an arbitrary number), the algorithm dumps the already-read lines into a smaller CSV file. The idea is that you would break up the huge CSV file into smaller files and then parse each smaller file individually.
The benefits of doing it this way are:
Ok, so now, on to the code. This code assumes that you have a data file "data.txt" and that the smaller files are going to be stored into the sub directory "small". These values, of course, would have to be tweaked for your situation.
Launch code in new window » Download code as text file »
At this point, there should be many smaller files created based on the single large file. The data in each of these files is in the similar format. This, of course, assumes that the CSV file does NOT have a header row. Of course, for such large files, I do not think this is a bad assumption. Once this process is completed, you could have it CFLocation to another page that starts to parse the smaller CSV files one at a time (per request??).
I played around with using a Buffered Reader (around the file reader) but that didn't seem to affect performance in any way. Of course, I don't really know much about Java, so who knows. Other people can tweak this much better than I can. What I do know is that on my Dev machine, the above example split an 11 MegaByte file (230,000+ lines of data) in 232 smaller files in 16 seconds (on average). Not bad at all.
Download Code Snippet ZIP File
Comments (4) | Post Comment | Ask Ben | Permalink | Other Searches | Print Page
Christopher Walken On ColdFusion
Creating And Streaming Simple Microsoft Excel Files With ColdFusion
Is it possible you could have a huge CSV file with no carraige return / new line characters? This might report only one line, and just create a copy of the entire file?
Perhaps there would be a way to factor in data size or number of values or something?
Of course this might not be an issue, but I know from reading your blog you like to explore all avenues for learnings sake and it crossed my mind as I read your post.
Posted by Ryan Stewart on Dec 19, 2006 at 4:08 AM
Ryan,
That is an interesting point. I do not think that you would ever have that many columns for a single line that would create memory problems. If it took over 200,000 records with several columns to create 11 Megs of CSV file, then it would take a single record with probably over 800,000 Columns to create a file of the same size.
But as you say, I like to explore... but I wouldn't even know how to with this one. The ReadLine() doesn't know there are no line break until it reads a line. I suppose you could stream in the file through a standard buffered input stream to find an occurrence of the new line. If you didn't find it, you could assume a columns-only approach, and if you did, you could assume a multi-records approach. Hmmmm.
Posted by Ben Nadel on Dec 19, 2006 at 6:59 AM
Having never worked with a CSV file, I never really gave much though as to the point of the carraige return. I asked a friend who said his understanding of it was that the carraige return was indicative of a new record, explaining to me that each line represents a record.
That makes much more sense now, and it's easy to see why you didn't consider no carraige returns. But hey, if nothing else my seemingly daft question has giving me an understanding of CSV files.
Cheers,
Ryan.
Posted by Ryan Stewart on Dec 20, 2006 at 10:27 AM
Ryan,
Heck yeah! There are no stupid questions as long as we are all learning something right? Always glad to help.
Posted by Ben Nadel on Dec 20, 2006 at 10:32 AM