ColdFusion Optimizing Case Study: Writing Orders To A CSV File
Posted May 4, 2007 at 8:46 AM
The other day, someone contacted me needed help to speed up one of their files. They were writing eCommerce order data to a CSV file and the page kept throwing the error:
The request has exceeded the allowable time limit Tag: cffile
This is a case study on that piece of code and how it can be optimized.
PLEASE NOTE: Because I do not have this database, I cannot run this code. That means that it might be buggy. Just follow the concepts, not the exact syntax.
Here is the original code that was throwing ColdFusion timeout exceptions:
Launch code in new window » Download code as text file »
- <cfquery datasource="XXXXXX" name="theOrders">
- SELECT
- *
- FROM
- Orders
- LEFT OUTER JOIN
- Loger
- ON
- Loger.UserID = Orders.CartID
- </cfquery>
-
-
- <CFSET tempfile = "#expandpath('allOrders')##DateFormat(Now(), 'yyyy_mm_dd')#.csv" />
-
- <!---
- The "spaces" in the text below are TAB characters.
- Do not change them to spaces otherwise the Excel
- export will not work.
- --->
-
- <CFSET output = "">
- <CFSET output = output & "WebOrderID CartID Accountnumber Shipping SubTotal Total Order-Hold" />
-
- <CFFILE
- ACTION="WRITE"
- FILE="#tempfile#"
- OUTPUT="#output#"
- />
-
- <CFLOOP QUERY="theOrders">
-
- <CFSET output = "">
- <CFSET output = output & "#theOrders.WebOrderID# #theOrders.CartID# #theOrders.Accountnumber# #theOrders.Shipping1# #theOrders.Subtotal# #theOrders.Total# #theOrders.Hold#">
-
- <CFFILE
- ACTION="APPEND"
- FILE="#tempfile#"
- OUTPUT="#output#"
- />
-
- <cfquery datasource="XXXXXX" name="theRealCart">
- SELECT DISTINCT
- (WebOrderID),
- cartID,
- Location,
- Delivery,
- Quantity,
- ItemNumber,
- Item,
- Price,
- Vendor,
- Attention,
- Price
- FROM
- realcart
- Where
- WebOrderID = '#theOrders.WebOrderID#'
- </cfquery>
-
- <cfloop query="theRealCart">
-
- <CFSET output = "">
- <CFSET output = output & "#theRealCart.cartID# #theRealCart.Vendor# #theRealCart.Delivery# #theRealCart.Item# #theRealCart.ItemNumber# #theRealCart.Quantity# #theRealCart.Price# #theRealCart.Attention#">
-
- <CFFILE
- ACTION="APPEND"
- FILE="#tempfile#"
- OUTPUT="#output#"
- />
-
- </cfloop>
-
- </CFLOOP>
Looking at this code, there were several things that popped out at me immediately. The first thing was all the ColdFusion CFFile tags. In this case, a CFFile Append action was being called for every single row of the target CSV file. Writing data to the file system is a very costly procedure (when compared to how fast calculations will process). This is not something that we want to constantly be doing. I would suggest minimizing the number of times that we actually go to the file system (within reason).
Instead of using so many ColdFusion CFFile tags, I would recommend using a Java StringBuffer to build the entire data file output before writing it to disk. The beauty of the StringBuffer is that is can build enormous strings without the overhead cost of string concatenation. It does this by not actually building a single string, but instead, keeping track of the many parts of the constructed string. The final string is only built once at the end when the ToString() method is called (on the StringBuffer).
The second thing that drew my attention was the repeated database calls. The way it was constructed, ColdFusion had to query the database for every single order. This could be a huge cost depending on how many orders there are. Instead of doing it this way, I would recommend creating a single query that joins the order information to the order detail information. This is a bit of a tradeoff; on one hand, you are pulling down duplicate order headers, but on the other hand, you are going to save a lot in terms of database communication costs.
In my sample code below, I am using an INNER JOIN to join the orders and the cart information. I am assuming that this is a valid relationship (one requires the other), but if that is not, you would need to change the INNER JOIN to a LEFT OUTER JOIN and update the CFLoop logic a bit. Also, please note that I am using SELECT * only because I do not know the underlying database structure. As part of the optimization process, only pull down the data columns that are required.
Taking those two red flags into account, here is my suggested optimization (again, keep in mind that I could NOT run this to validate syntax and ideas):
Launch code in new window » Download code as text file »
- <!---
- The first optimization is to combine the two queries
- (orders and order items) into one query. This will make
- the following CFLoop a bit more complicated but will only
- require going to the database once vs. N-times where N is
- the number of the orders.
-
- I am ONLY using SELECT * because I am not sure of the
- database strucutre. Otherwise, I would NEVER use SELECT *
- as more often than not, you don't need every column (and
- for readability reasons).
- --->
- <cfquery name="theOrders" datasource="XXXXXX">
- SELECT
- *
- FROM
- <!---
- Create the table from the first query. I am doing
- this for two reasons:
- 1. I am not sure of the DB structure.
- 2. We can use a LEFT OUTER Join in this query and
- still use an INNER join on the next without any
- sort of confusion.
- --->
- (
-
- SELECT
- *
- FROM
- Orders
- LEFT OUTER JOIN
- Loger
- ON
- Loger.UserID = Orders.CartID
-
- ) AS o
- INNER JOIN
- realcart rc
- ON
- rc.WebOrderID = o.WebOrderID
- </cfquery>
-
-
-
- <!---
- Get the name of the temp file that we are going
- to write our tab delimited data to.
- --->
- <cfset strTempFile = (
- ExpandPath( "allOrders" ) &
- DateFormat( Now(), "yyyy_mm_dd" ) &
- ".csv"
- ) />
-
-
- <!---
- Create a string buffer to hold our output. By writing to
- a string buffer rather than doing CFFile actions for every
- loop iteration we are going to increase our speed greatly.
- Writing to the file system is one of the most costly actions
- you can take in programming. Try to minimize it when
- possible (within reason).
- --->
- <cfset sbOutput = CreateObject(
- "java",
- "java.lang.StringBuffer"
- ).Init() />
-
-
- <!---
- Start off, instead of writing our column headers to
- the file system (costly), let's just append them to
- our string buffer.
-
- Notice that we are adding a line break at the end of the
- data line. This is because there is nothing inherent to
- the String Buffer about line breaks. In the other method,
- we were Appending data using CFFile. CFFile adds a new
- line character by default.
- --->
- <cfset sbOutput.Append(
- "WebOrderID CartID Accountnumber Shipping SubTotal Total Order-Hold" &
- Chr( 13 ) & Chr( 10 )
- ) />
-
-
- <!---
- Get the initial order ID. We are going to need to keep track
- of the order ID so that we know when to output a new order
- row (when that ID changes). To start with, set it to a value
- we know cannot possibly be valid.
- --->
- <cfset intOrderID = -1 />
-
-
- <!--- Loop over the order/cart detail query. --->
- <cfloop query="theOrders">
-
- <!---
- Check to see if we have a new order ID. If we do,
- then we need to output a mid-header row for this
- particular order.
- --->
- <cfif (theOrders.WebOrderID NEQ intOrderID)>
-
- <!---
- Store the order ID so that we don't recreate this
- mid-header row.
- --->
- <cfset intOrderID = theOrders.WebOrderID />
-
-
- <!---
- Instead of outputing the mid-header row to a file,
- let's add it to our string buffer.
- --->
- <cfset sbOutput.Append(
- "#theOrders.WebOrderID# #theOrders.CartID# #theOrders.Accountnumber# #theOrders.Shipping1# #theOrders.Subtotal# #theOrders.Total# #theOrders.Hold#" &
- Chr( 13 ) & Chr( 10 )
- ) />
-
- </cfif>
-
-
- <!---
- ASSERT: At this point, we know that we are IN an order
- detail loop and that our header row has already been
- taken care of. That means that for the rest of the loop
- body, all we need to care about are the order details.
- --->
-
-
- <cfset sbOutput.Append(
- "#theOrders.cartID# #theOrders.Vendor# #theOrders.Delivery# #theOrders.Item# #theOrders.ItemNumber# #theOrders.Quantity# #theOrders.Price# #theOrders.Attention#" &
- Chr( 13 ) & Chr( 10 )
- ) />
-
- </cfloop>
-
-
- <!---
- Now that we have all of our tab delimited data in the
- Java StringBuffer, we can write it to the file system.
- This is a bigger file write that before, but since we
- are only doing one, rather than ONCE per data row, it
- will be crazy faster.
-
- When writing the string buffer to the file, we have
- to convert it to a single string using ToString().
- --->
- <cffile
- action="WRITE"
- file="#strTempFile#"
- output="#sbOutput.ToString()#"
- />
In the final code, we have ONE database call and ONE CFFile tag. I would be shocked if this method was not faster. Hopefully, I can get some feedback (if this advice is taken). Also, on a final note, increasing the CFSetting RequestTimeOut did solve this problem; but, I don't think this would be necessary if the code itself was optimized.
Download Code Snippet ZIP File
Post Comment | Ask Ben | Permalink | Other Searches | Print Page
Newer Post
Defining A CSS Selector That Requires A Multi-Class Union
Older Post
Testing String Equality Of Any Length Happens Instantly In ColdFusion
Reader Comments
The string buffer makes a huge difference. Years ago I optimized some ColdFusion code as much as possible in create a csv string and the time just grew relative with the number of records. Hundreds of records took an acceptable amount of time but something like 10k records would take a few minutes, which wasn't acceptable. Luckily CFDJ had an article about the string buffer right around that same time. That cut down large csv creation from minutes down to seconds.
a generic function along the same lines ;)
http://cfzen.instantspot.com/blog/index.cfm/2007/4/18/queryToCsv2-util-function-updated
I did something similar at my previous job but I used the ArrayAppend() method. I did a blog post a while back: http://www.philduba.com/index.cfm/2006/10/20/Improving-Performance-Concatenating-Large-Strings. The string buffer method is good to. In my testing there wasn't a discernible difference between the two methods and my manager was more willing to do a pure CF use than invoke Java components. Outside of the horrible queries in the reports, this helped reduce the report executing times by 30% and sometimes more.
ArrayAppend()... very interesting, I would not have thought of that. And then, what, you just did like an ArrayToList() at the end (with no delimiter)? That's cool that there is little difference.
It would be interesting to know what the StringBuffer is actually doing underneath the covers.
I'd put #Chr(13)##Chr(10)# as the delimiter to get it on separate lines.
Oh right :) I totally wasn't even thinking straight.
"It would be interesting to know what the StringBuffer is actually doing underneath the covers."
What I've read back when I first started using string buffer is that CF is creating a new variable for every cfset. That means if you are doing 10k concatenations CF is actually recreating a variable 10,000 times to perform the operations. The string buffer does the concatenation on single variable without the overhead of recreating the variable every time. Ok, that doesn't totally answer your question.
Thanks Ben,
This runs ten times faster....
@Dan,
So you think it is still just storing it in a string variable? I am interested in the data structure underneath - is it a string, is it some sort of dynamic length collection?
@Bruce,
Sweeeeet! Glad I could help in some way :)
One additional suggestion which will speed things up even more is to write the file progressively using <cffile action="append"...>.
The benefit of this is that CF doesn't retain the entire generated output in memory and thus doesn't need to manipulate that entire string object when it writes a new line. It is significantly faster for large queries.
Adding 'fixnewline="true"' removes the need to specifiy the #chr(13)# or #chr(10)# as the new line delimiter since CF adds the line break on the append.
-J
Sorry. I'm a dork. I was only on my first cup of coffee and didn't read the case for using the string buffer correctly. :-) Your solution is better. My bad.
-J
@Jon,
I think there might be a case for both situations. While File writing is a costly operation, you don't want to build massive amounts of data in the server's RAM as that will choke the machine. What might be good to use is a combination of BufferedFileOutput and StringBuffer. This level of optimization goes beyond my understanding of how all the file decorators works, but I suspect it would be the best of both worlds.
There's a CFX tag CFX_QueryWrite which does a similar thing. I have used it since CF5 days. You can find it at Adobe Exchange.
And its very fast - takes the query name and a delimiter as a parameter.
Please correct me if I'm wrong but ColdFusion already presents the string buffer capability through cfsavecontent? Thats what I have used to solve a similar problem with the exact same results: a huge increase in speed.
@Dave,
I am not sure how CFSaveContent is implemented under the covers. I am not sure how it would StringBuffer. I think it is just writing to an alternate content buffer. However, you will probably see speed improvements over standard string concatenation because the alternate content buffer probably acts in similar way for optimization???
Hi there,
I've used a similar (**coughs** identical) method to create a csv file, but have ran afoul of Excel's 'helpful' cell formatting - I have columns from a dataset in the form of number-hyphen-number, eg. 12-1. When opening the CSV in Excel, it decides to format this to 12-Jan.
The CSV file itself is ok and the data remains 12-1. Any pointers as to how to force Excel to regard these cells as text as opposed to dates?
@Paul,
I am pretty sure Excel has a Data Import Wizard. In that, you can select the columns to import and I believe format the data. Of course, this has to be done manually since this is a feature / issue with Excel, not with the CSV. The whole point of a CSV is that is has very little formatting.
Yeh, that's what I was thinking. The page was producing an Excel worksheet using MS' Open XML nonsense, which allows you to format columns/rows/cells til your hearts' content. But the bl**dy client's decided (after it's finished) that they want it in CSV.
@Paul,
If that's what they want, then that's what they want. Of course, that has implications that they have to deal with.
Really neat-o stuff...
Yes I said neat-o :)
But on the serious side, what about performance issues or concerns, like running csv importing as a scheduled task...
In my testing steps i keep getting jrun servlet errors, which concerns me greatly once i made this as a scheduled task, and how it's performance would affect the whole servers...
Any thoughts?
@Craig,
I've been told that if performance is really a big concern, running a CSV import through the bulk insert feature of the database would be the most performant (but somewhat less flexible).



