ColdFusion Optimizing Case Study: Writing Orders To A CSV File
The other day, someone contacted me needed help to speed up one of their files. They were writing eCommerce order data to a CSV file and the page kept throwing the error:
The request has exceeded the allowable time limit Tag: cffile
This is a case study on that piece of code and how it can be optimized.
PLEASE NOTE: Because I do not have this database, I cannot run this code. That means that it might be buggy. Just follow the concepts, not the exact syntax.
Here is the original code that was throwing ColdFusion timeout exceptions:
<cfquery datasource="XXXXXX" name="theOrders"> SELECT * FROM Orders LEFT OUTER JOIN Loger ON Loger.UserID = Orders.CartID </cfquery> <CFSET tempfile = "#expandpath('allOrders')##DateFormat(Now(), 'yyyy_mm_dd')#.csv" /> <!--- The "spaces" in the text below are TAB characters. Do not change them to spaces otherwise the Excel export will not work. ---> <CFSET output = ""> <CFSET output = output & "WebOrderID CartID Accountnumber Shipping SubTotal Total Order-Hold" /> <CFFILE ACTION="WRITE" FILE="#tempfile#" OUTPUT="#output#" /> <CFLOOP QUERY="theOrders"> <CFSET output = ""> <CFSET output = output & "#theOrders.WebOrderID# #theOrders.CartID# #theOrders.Accountnumber# #theOrders.Shipping1# #theOrders.Subtotal# #theOrders.Total# #theOrders.Hold#"> <CFFILE ACTION="APPEND" FILE="#tempfile#" OUTPUT="#output#" /> <cfquery datasource="XXXXXX" name="theRealCart"> SELECT DISTINCT (WebOrderID), cartID, Location, Delivery, Quantity, ItemNumber, Item, Price, Vendor, Attention, Price FROM realcart Where WebOrderID = '#theOrders.WebOrderID#' </cfquery> <cfloop query="theRealCart"> <CFSET output = ""> <CFSET output = output & "#theRealCart.cartID# #theRealCart.Vendor# #theRealCart.Delivery# #theRealCart.Item# #theRealCart.ItemNumber# #theRealCart.Quantity# #theRealCart.Price# #theRealCart.Attention#"> <CFFILE ACTION="APPEND" FILE="#tempfile#" OUTPUT="#output#" /> </cfloop> </CFLOOP>
Looking at this code, there were several things that popped out at me immediately. The first thing was all the ColdFusion CFFile tags. In this case, a CFFile Append action was being called for every single row of the target CSV file. Writing data to the file system is a very costly procedure (when compared to how fast calculations will process). This is not something that we want to constantly be doing. I would suggest minimizing the number of times that we actually go to the file system (within reason).
Instead of using so many ColdFusion CFFile tags, I would recommend using a Java StringBuffer to build the entire data file output before writing it to disk. The beauty of the StringBuffer is that is can build enormous strings without the overhead cost of string concatenation. It does this by not actually building a single string, but instead, keeping track of the many parts of the constructed string. The final string is only built once at the end when the ToString() method is called (on the StringBuffer).
The second thing that drew my attention was the repeated database calls. The way it was constructed, ColdFusion had to query the database for every single order. This could be a huge cost depending on how many orders there are. Instead of doing it this way, I would recommend creating a single query that joins the order information to the order detail information. This is a bit of a tradeoff; on one hand, you are pulling down duplicate order headers, but on the other hand, you are going to save a lot in terms of database communication costs.
In my sample code below, I am using an INNER JOIN to join the orders and the cart information. I am assuming that this is a valid relationship (one requires the other), but if that is not, you would need to change the INNER JOIN to a LEFT OUTER JOIN and update the CFLoop logic a bit. Also, please note that I am using SELECT * only because I do not know the underlying database structure. As part of the optimization process, only pull down the data columns that are required.
Taking those two red flags into account, here is my suggested optimization (again, keep in mind that I could NOT run this to validate syntax and ideas):
<!--- The first optimization is to combine the two queries (orders and order items) into one query. This will make the following CFLoop a bit more complicated but will only require going to the database once vs. N-times where N is the number of the orders. I am ONLY using SELECT * because I am not sure of the database strucutre. Otherwise, I would NEVER use SELECT * as more often than not, you don't need every column (and for readability reasons). ---> <cfquery name="theOrders" datasource="XXXXXX"> SELECT * FROM <!--- Create the table from the first query. I am doing this for two reasons: 1. I am not sure of the DB structure. 2. We can use a LEFT OUTER Join in this query and still use an INNER join on the next without any sort of confusion. ---> ( SELECT * FROM Orders LEFT OUTER JOIN Loger ON Loger.UserID = Orders.CartID ) AS o INNER JOIN realcart rc ON rc.WebOrderID = o.WebOrderID </cfquery> <!--- Get the name of the temp file that we are going to write our tab delimited data to. ---> <cfset strTempFile = ( ExpandPath( "allOrders" ) & DateFormat( Now(), "yyyy_mm_dd" ) & ".csv" ) /> <!--- Create a string buffer to hold our output. By writing to a string buffer rather than doing CFFile actions for every loop iteration we are going to increase our speed greatly. Writing to the file system is one of the most costly actions you can take in programming. Try to minimize it when possible (within reason). ---> <cfset sbOutput = CreateObject( "java", "java.lang.StringBuffer" ).Init() /> <!--- Start off, instead of writing our column headers to the file system (costly), let's just append them to our string buffer. Notice that we are adding a line break at the end of the data line. This is because there is nothing inherent to the String Buffer about line breaks. In the other method, we were Appending data using CFFile. CFFile adds a new line character by default. ---> <cfset sbOutput.Append( "WebOrderID CartID Accountnumber Shipping SubTotal Total Order-Hold" & Chr( 13 ) & Chr( 10 ) ) /> <!--- Get the initial order ID. We are going to need to keep track of the order ID so that we know when to output a new order row (when that ID changes). To start with, set it to a value we know cannot possibly be valid. ---> <cfset intOrderID = -1 /> <!--- Loop over the order/cart detail query. ---> <cfloop query="theOrders"> <!--- Check to see if we have a new order ID. If we do, then we need to output a mid-header row for this particular order. ---> <cfif (theOrders.WebOrderID NEQ intOrderID)> <!--- Store the order ID so that we don't recreate this mid-header row. ---> <cfset intOrderID = theOrders.WebOrderID /> <!--- Instead of outputing the mid-header row to a file, let's add it to our string buffer. ---> <cfset sbOutput.Append( "#theOrders.WebOrderID# #theOrders.CartID# #theOrders.Accountnumber# #theOrders.Shipping1# #theOrders.Subtotal# #theOrders.Total# #theOrders.Hold#" & Chr( 13 ) & Chr( 10 ) ) /> </cfif> <!--- ASSERT: At this point, we know that we are IN an order detail loop and that our header row has already been taken care of. That means that for the rest of the loop body, all we need to care about are the order details. ---> <cfset sbOutput.Append( "#theOrders.cartID# #theOrders.Vendor# #theOrders.Delivery# #theOrders.Item# #theOrders.ItemNumber# #theOrders.Quantity# #theOrders.Price# #theOrders.Attention#" & Chr( 13 ) & Chr( 10 ) ) /> </cfloop> <!--- Now that we have all of our tab delimited data in the Java StringBuffer, we can write it to the file system. This is a bigger file write that before, but since we are only doing one, rather than ONCE per data row, it will be crazy faster. When writing the string buffer to the file, we have to convert it to a single string using ToString(). ---> <cffile action="WRITE" file="#strTempFile#" output="#sbOutput.ToString()#" />
In the final code, we have ONE database call and ONE CFFile tag. I would be shocked if this method was not faster. Hopefully, I can get some feedback (if this advice is taken). Also, on a final note, increasing the CFSetting RequestTimeOut did solve this problem; but, I don't think this would be necessary if the code itself was optimized.
Want to use code from this post? Check out the license.
The string buffer makes a huge difference. Years ago I optimized some ColdFusion code as much as possible in create a csv string and the time just grew relative with the number of records. Hundreds of records took an acceptable amount of time but something like 10k records would take a few minutes, which wasn't acceptable. Luckily CFDJ had an article about the string buffer right around that same time. That cut down large csv creation from minutes down to seconds.
a generic function along the same lines ;)
I did something similar at my previous job but I used the ArrayAppend() method. I did a blog post a while back: http://www.philduba.com/index.cfm/2006/10/20/Improving-Performance-Concatenating-Large-Strings. The string buffer method is good to. In my testing there wasn't a discernible difference between the two methods and my manager was more willing to do a pure CF use than invoke Java components. Outside of the horrible queries in the reports, this helped reduce the report executing times by 30% and sometimes more.
ArrayAppend()... very interesting, I would not have thought of that. And then, what, you just did like an ArrayToList() at the end (with no delimiter)? That's cool that there is little difference.
It would be interesting to know what the StringBuffer is actually doing underneath the covers.
I'd put #Chr(13)##Chr(10)# as the delimiter to get it on separate lines.
Oh right :) I totally wasn't even thinking straight.
"It would be interesting to know what the StringBuffer is actually doing underneath the covers."
What I've read back when I first started using string buffer is that CF is creating a new variable for every cfset. That means if you are doing 10k concatenations CF is actually recreating a variable 10,000 times to perform the operations. The string buffer does the concatenation on single variable without the overhead of recreating the variable every time. Ok, that doesn't totally answer your question.
This runs ten times faster....
So you think it is still just storing it in a string variable? I am interested in the data structure underneath - is it a string, is it some sort of dynamic length collection?
Sweeeeet! Glad I could help in some way :)
One additional suggestion which will speed things up even more is to write the file progressively using <cffile action="append"...>.
The benefit of this is that CF doesn't retain the entire generated output in memory and thus doesn't need to manipulate that entire string object when it writes a new line. It is significantly faster for large queries.
Adding 'fixnewline="true"' removes the need to specifiy the #chr(13)# or #chr(10)# as the new line delimiter since CF adds the line break on the append.
Sorry. I'm a dork. I was only on my first cup of coffee and didn't read the case for using the string buffer correctly. :-) Your solution is better. My bad.
I think there might be a case for both situations. While File writing is a costly operation, you don't want to build massive amounts of data in the server's RAM as that will choke the machine. What might be good to use is a combination of BufferedFileOutput and StringBuffer. This level of optimization goes beyond my understanding of how all the file decorators works, but I suspect it would be the best of both worlds.
There's a CFX tag CFX_QueryWrite which does a similar thing. I have used it since CF5 days. You can find it at Adobe Exchange.
And its very fast - takes the query name and a delimiter as a parameter.
Please correct me if I'm wrong but ColdFusion already presents the string buffer capability through cfsavecontent? Thats what I have used to solve a similar problem with the exact same results: a huge increase in speed.
I am not sure how CFSaveContent is implemented under the covers. I am not sure how it would StringBuffer. I think it is just writing to an alternate content buffer. However, you will probably see speed improvements over standard string concatenation because the alternate content buffer probably acts in similar way for optimization???
I've used a similar (**coughs** identical) method to create a csv file, but have ran afoul of Excel's 'helpful' cell formatting - I have columns from a dataset in the form of number-hyphen-number, eg. 12-1. When opening the CSV in Excel, it decides to format this to 12-Jan.
The CSV file itself is ok and the data remains 12-1. Any pointers as to how to force Excel to regard these cells as text as opposed to dates?
I am pretty sure Excel has a Data Import Wizard. In that, you can select the columns to import and I believe format the data. Of course, this has to be done manually since this is a feature / issue with Excel, not with the CSV. The whole point of a CSV is that is has very little formatting.
Yeh, that's what I was thinking. The page was producing an Excel worksheet using MS' Open XML nonsense, which allows you to format columns/rows/cells til your hearts' content. But the bl**dy client's decided (after it's finished) that they want it in CSV.
If that's what they want, then that's what they want. Of course, that has implications that they have to deal with.
Really neat-o stuff...
Yes I said neat-o :)
But on the serious side, what about performance issues or concerns, like running csv importing as a scheduled task...
In my testing steps i keep getting jrun servlet errors, which concerns me greatly once i made this as a scheduled task, and how it's performance would affect the whole servers...
I've been told that if performance is really a big concern, running a CSV import through the bulk insert feature of the database would be the most performant (but somewhat less flexible).
I have implemented this solution to create a csv which I am feeding to google AdWords Editor. Works like a champ! Once I build the csv file in memory, instead of writing to disk, I dump the results to the user and the user is prompted to save the file.
However, once I get to around 250,000 lines, I am getting the following error:
javax.servlet.ServletException: ROOT CAUSE:
java.lang.OutOfMemoryError: Java heap space
I checked CF server and its allocating 512MB for CF JVM.
I start off with the following.
<cfset ColumnNames = "Campaign Name,Ad Group,Keyword,Keyword Type,Max CPC,Destination URL,Status">
<cfset KeywordList = ColumnNames & crlf>
"crlf is just a var for 'chr(13) &chr(10)'"
<cfset sbOutput = CreateObject(
<cfset OutputKeywordString = CampaignName & delim & nextvar1 & delim & nextvar2 ....>
<cfset sbOutput.Append(OutputKeywordString &Chr( 13 ) & Chr( 10 )) />
Finally, I present the csv to the user:
<!--- output to file --->
<cfheader name="Content-Disposition" value="attachment; filename=#network#_#type#_#content#_#DateFormat(Now())#_#TimeFormat(Now())#.csv">
Now, I have tried to set the initial size of the stringbuffer (I am not certain of the syntax, guessing here), but to no avail.
For me, a 250,000 record csv file should be about 35MB.
I have also checked used memory using the following:
<cfset runtime = CreateObject("java","java.lang.Runtime").getRuntime()>
<cfset totalmem = runtime.totalMemory() / 1024 / 1024>
<cfset freemem = runtime.freeMemory() / 1024 / 1024>
<cfset memoryUsed = (runtime.totalMemory() - runtime.freeMemory()) / 1024 / 1024>
Any ideas how to get around the "out of memory" problem?
Much thanks in advance!!
It appears this is the correct format to set the SB initial size.
<cfset sbOutput = CreateObject(
However, this does not seem to help my problem.
Thanks again in advance!!
It's hard to say how much memory that something should use. You can't just take the character bytes as the direct memory usage. Remember, there are Java class instances in play here. I don't know how StringBuilder works behind the scenes; I am sure it is more complicated than I imagine (considering it can do things like indexOf(), replace(), and lastIndexOf() - clearly it is a complex data container.
You might want to try writing it to disk every X rows, and then serving up that temporary file using CFContent/File.
Thanks for the quick reply! In the end, I may have to write to disk as you suggested. It's just so sad because the java stringbuffer performed so well.
Then again, I can still use it to wrap up 20k or so records at a time, write to disk, repeat until done.
Setting the initial SB size was just a stab in the dark :)
I have tried
each give me the same result, once I hit xx records from my query.
Each "line" i generate has 8 values, one of which is a URL string (about 100 to 150 characters in length). I found removing the URL from the output allows me to break my 250,000 record barrier. I am curious if any special characters in the url are a factor or if its just the length of the url itself taking up more memory (probably the latter).
Thank you for your time and suggestion!
No problem. And, as you are saying, you can get a best of both worlds kind of approach - use the StringBuffer for X rows, then append it to a file on the disk (and reset the buffer).
At that point, you just have to hope that ColdFusion is good about garbage collecting mid-request... something which can be troublesome.
If it wouldn't be such a waste of a damn-fine enlisted man I'd recommend you for OCS! You are gonna be a general someday!... (little forest gump reference in case ya didnt catch it)
Seems balance is the way to go. I just generated a 1.25 million record csv file in about 55 seconds. No more OutOfMemoryError :)
I am just creating the stringbuffer and setting its initial size to the chunk size I want to work with (20MB chunks). As I append to the SB, I check the length of the SB and when it hits the 20MB ceiling, I dump to file and clear the SB.
something like ....
<cfset sbOutput = CreateObject(
<!--- Create file and append header --->
<!--- append to sbOutput and stuff --->
<cfif sbOutput.length() gte 20971520>
<!--- any remaining need to be added to csv --->
<cfif (qry_get_keywords.currentrow eq qry_get_keywords.recordcount)>
Thanks again for the tip and having this blog (I must have solved over a dozen problems in the past visiting your site! Add 1 more!).
Looks good - I think that's the right approach. I can't believe you are dealing with millions of rows - that's badass. Glad we could collaborate on this.
Well that is just what I wanted!
Such a logical way of improving performance on this. Now writing large csv's in a matter of seconds.
Thanks for this guidance - much appreciated for sharing.
Glad to help. The StringBuffer object can definitely be a great improvement; although, I have to tell you that I've also seen great speed improvements with building CSV data using arrays and then the arrayToList() function to join them. In either case, however, it sounds like you're already seeing significant improvement!
Thank you, thank you!
I have been becoming a big fan of your blog, and thought I'd finally respond to something, especially something that saved my butt!
I have to add a Thank you too! This saved me tons of headache, I was reaching the time limit writing 4500 lines to an xml file. "subscribing"
Maybe this was already brought up with using the string buffer (i didn't read all the comments; i know lazy....)
My simple method turned a 10 minute write to a 10 second write by just taking each record and saving it to a variable and the appending the next record on to it. After 10 records have been processed, i dump it to the file with an append.