Writing Enormous Files Based On Massive Record Sets In ColdFusion
Someone over on CF-Talk was asking about writing massive data files that are the result of several million records coming out of a database. One of the concerns expressed was that the problem was not only with the writing of the file, it was with the amount of memory that the massive record set would take up. Here is my stab at a solution that uses small CFQuery record sets as well as a buffered output stream that should ease the use of ColdFusion at both ends.
Not sure if this helps any one, it was really only for the dude on CF-Talk who has not done any Java. In this example, I pulling blog entries out of a database and writing them in quasi-XML format to a txt file. This isn't really anything that I do, I just needed something to test on.
Also, while I am not doing it in the example, you can wrap the whole thing in a CFSilent tag to stop output and whitespace from being sent to the browser. The ClearBuffer() line should take care of any white space in the ColdFusion buffer??? Not sure.
<!--- Get the number of records to read at one time. This will limit the amount of data that ColdFusion has to pull down from the SQL Server and store in it's memory. ---> <cfset intReadCount = 50 /> <!--- Get the offset for the read. ---> <cfset intOffset = 0 /> <!--- Set up the buffered output stream. The buffered output stream requires a file output stream which requires a file object. By using a buffered output stream we will limit the amount of data that ColdFusion has to store in memory AND we will moderate the number of write commands that need to take place. ---> <cfset objOutput = CreateObject( "java", "java.io.BufferedOutputStream" ).Init( <!--- File Output Stream. ---> CreateObject( "java", "java.io.FileOutputStream" ).Init( <!--- File object. ---> CreateObject( "java", "java.io.File" ).Init( <!--- File Path. ---> JavaCast( "string", ExpandPath( "./output.txt" ) ) ) ) ) /> <!--- Keep looping while we have records to read. This loop will only be broken manually once we have no more records getting returned from the database. ---> <cfloop condition="true"> <!--- Query for TOP ## records. ---> <cfquery name="qBlog" datasource="#REQUEST.DSN.Source#" username="#REQUEST.DSN.Username#" password="#REQUEST.DSN.Password#"> SELECT TOP #intReadCount# b.id, b.name, b.description, b.date_posted FROM blog_entry b WHERE b.id > <cfqueryparam value="#intOffset#" cfsqltype="CF_SQL_INTEGER" /> ORDER BY b.id ASC </cfquery> <!--- Check to see if we have a record count. If we do not, then we are out of data. Break out of the loop. ---> <cfif qBlog.RecordCount> <!--- We have returned records from the database. ---> <!--- Loop over the records. ---> <cfloop query="qBlog"> <!--- Create a string buffer to hold row output. ---> <cfset objRowBuffer = CreateObject( "java", "java.lang.StringBuffer" ).Init() /> <!--- Add ID. ---> <cfset objRowBuffer.Append( "<id>" & qBlog.id & "</id>" ) /> <!--- Add name. ---> <cfset objRowBuffer.Append( "<name>" & XmlFormat( qBlog.name ) & "</name>" ) /> <!--- Add description. ---> <cfset objRowBuffer.Append( "<description>" & XmlFormat( qBlog.description ) & "</description>" ) /> <!--- Add date posted. ---> <cfset objRowBuffer.Append( "<posted>" & XmlFormat( qBlog.date_posted ) & "</posted>" ) /> <!--- Add new line (for next record). ---> <cfset objRowBuffer.Append( Chr( 13 ) & Chr( 10 ) ) /> <!--- Convert the string buffer to a string (concatenate all the fields) and then get the character byte array from the resultant string. ---> <cfset arrRowBytes = objRowBuffer.ToString().GetBytes() /> <!--- Write the byte array to the output stream. ---> <cfset objOutput.Write( arrRowBytes, JavaCast( "int", 0 ), JavaCast( "int", ArrayLen( arrRowBytes ) ) ) /> </cfloop> <!--- Increment the offset. ---> <cfset intOffset = (intOffset + intReadCount) /> <!--- Reset the buffer. This should kill the white space that is building up in the ColdFusion memory space. ---> <cfset GetPageContext().GetOut().ClearBuffer() /> <cfelse> <!--- No more records. Stop looping. ---> <cfbreak /> </cfif> </cfloop> <!--- Flush the buffered output stream to make sure there is no straggling buffer data. ---> <cfset objOutput.Flush() />
Want to use code from this post? Check out the license.
I can't help but think you could have achieved this more easily without horsing around with Java and just using <cffile action="append"> (which will be doing all your stuff "under the hood" anyhow).
The problem from the bloke on CF-TALK is that CF is very ham-fisted when it comes to dealing with large amounts of data (recordsets, file contents, etc). It tries to load it ALL into memory, whether you need it or not (the answer being "not" if you're just spooling the recordset to a file).
My conceptual knowledge of Java is far better than my syntactical one, but I would try to do this by streaming the recordset back from the DB, and streaming THAT to the file. So only the chunk-size you're streaming every takes up memory, not the whole result/file.
Now... you're quite good with your Java... I'm sure you could cook something up that does that?
Adam, you could use the CFFile 'append' method, but that eats up even more memory, mostly from a processing perspective. The company I work for ran into this problem also, as we would build large CSV files from data gained from most of the tables our main application. I compared the Java method you talk to here Ben with the ArrayAppend/ArrayToList method I describe at http://www.philduba.com/index.cfm/2006/10/20/Improving-Performance-Concatenating-Large-Strings.
They came out equivalent in time up to 10 million records being written (at least for our data). If someone doesn't have the Java knowledge, the CF solution might be easier to understand.
I like your idea, but I would have no idea about streaming a record set from the Database. I am not even sure if that is possible as it is not even a ColdFusion recordset until the data is completely returned??? Perhaps if you did the query itself in Java then something like this would be possible, but it is way above my skill set.
Cool idea, especially for people who are unfamiliar with the Java techniques.
No, I meant foresake CF completely, and do it all in Java. The only time it becomes useful for the data to be in a CF-recordset-format is if you want to use CF query functions on it. If you're just spooling it to a file: no need.
Perhaps the only time I'd involve CF would be if maybe this would be a common enough task to wrap up in a <cf_query2file> tag or something.
Adam, I think doing the query and file writing in Java would be a good solution, if the knowledge is there to do that. In what I've had to work on that led me to the ArrayAppend idea, the query is being used for other things in the page, not just CSV or file creation. I looked at that thread today and the guy seems to be CF only.
this might sound like a silly comment but when I try to post the code, I get pound signs that I have to delete before I can try to execute the code. Is that something you could fix or am I making some sort of mistake?
If you are copy / pasting the code directly from this page, then you will get # signs because it is actually an unordered list (and those # are the list item bullets).
The better way to do this would be either to click on the "Download" link above the code you want, or the "open in new window" link (which has a copy-n-paste box with proper formatting).
I've got a similar problem and need to create a 100+MB text file. Is it possible to stream the query results using java?
Using some sort of file output stream should be just fine for that.
Thanks Ben - had this issue myself this will be useful to have a go.
On a side, we are also experiencing some real slow down when adding large numbers of images to a zip file. Do you think this would work on this too?
I don't think this particular strategy would work with ZIP files, but I am sure that if you write a zip file to disk and then iteratively add image files to the zip, you'd be in better shape.
thanks ben - fyi - implemented the above method and it worked well for me the memory for jrun & sql was very stable. Only thing was it took "a lot" longer to run through. I'm going to test out adding in blocks of rows with cfsavecontent in one go, to see if this works quicker for me.
I keep getting java.lang.OutOfMemoryError: GC overhead limit exceeded
for even 1000 records.
Hmm, that's odd - I would be surprised for 1000 records to be causing memory problems at all, with any approach. Are you returning a query with an enormouns number of columns or something?
Have you tried restarting ColdFusion - I've run into problems where CF just stops working properly and I get memory problems; then, I restart the service and it seems to come back OK.
Also, at the end of each loop, you might try doing this:
<cfthread action="sleep" duration="10" />
I've been told that adding a very short sleep allows ColdFusion enough time to perform the necessary garbage collection.
Thank you so much Ben! This really helped me out today. I just have one question. After the entire process is done and I try to move the file and or delete (outside of cf) it says that the file is open by another program. It lloks like java is not letting go of the file. Is there something I am missing?
Just wanted to share a change I did as I needed to write the file set as UTF-8. Below is a simple example of how to do this.
// file to write
fileToWrite = 'c:\testfile.txt';
// create the file
theFile = createObject("java","java.io.File").init(javaCast("string",fileToWrite));
// create my java file writer
fileWriter = createObject("java","java.io.BufferedWriter").init(createObject("java","java.io.OutputStreamWriter").init(createObject("java","java.io.FileOutputStream").init(theFile),"UTF8"));
// create string object
stringToWrite = createObject("java","java.lang.StringBuffer").Init();
// write line
// Flush the buffered output stream to make sure there is no straggling buffer data
Sorry the text did't wrap so here is the cut off piece.
fileWriter = createObject("java","java.io.BufferedWriter").
So, how would this approach need to change if I need to *return* a CSV file to a user? That is, no external file is currently written. It goes like this:
I need to find a way of stopping running out of heap for a large CSV report rather urgently. Thanks in advance for any ideas.
I like this post. A lot of really good, useful information here. When I saw the title, I couldn't help but click on it, because it sounds exactly like the kind of random problem I would have. Good job at finding a solution.
Adding to my previous comment, should I just <cfflush> my CSV output and all will be good, at least on the output side? The DB side is a separate issue.
at the top of the page solved my problem and provided nice download progress feedback for users.
However it has one nasty side effect. When IE prompts and I select to save .csv, everything works great. When IE prompts and I select to Open with Excel, it is downloaded with the wrong extension: .csv.XLS, and, as a result, Excel does not recognize CSV format.
OK, next attempt at a solution is putting <cfflush> after the headers. This time Open file works. However, that change introduced 2 problem with report file name:
File name suddenly started to contain underscores rather than spaces.
Open file generates the file which is shown as reportAction (which is the .CFM name). Weird.
The code now looks like this:
Brilliant! Exactly what I needed. Thanks so much.
On our dev team we always joke about how your blog always shows up in the top five results for our Google searches when we're stuck on something. Thanks for all the help you don't know you've already provided!