ColdFusion Optimizing Case Study: Writing Orders To A CSV File

By Ben Nadel on May 4, 2007

The other day, someone contacted me needed help to speed up one of their files. They were writing eCommerce order data to a CSV file and the page kept throwing the error:

The request has exceeded the allowable time limit Tag: cffile

This is a case study on that piece of code and how it can be optimized.

PLEASE NOTE: Because I do not have this database, I cannot run this code. That means that it might be buggy. Just follow the concepts, not the exact syntax.

Here is the original code that was throwing ColdFusion timeout exceptions:

<cfquery datasource="XXXXXX" name="theOrders">
	SELECT
		*
	FROM
		Orders
	LEFT OUTER JOIN
		Loger
	ON
		Loger.UserID = Orders.CartID
</cfquery>


<CFSET tempfile = "#expandpath('allOrders')##DateFormat(Now(), 'yyyy_mm_dd')#.csv" />

<!---
	The "spaces" in the text below are TAB characters.
	Do not change them to spaces otherwise the Excel
	export will not work.
--->

<CFSET output = "">
<CFSET output = output & "WebOrderID CartID Accountnumber Shipping SubTotal Total Order-Hold" />

<CFFILE
	ACTION="WRITE"
	FILE="#tempfile#"
	OUTPUT="#output#"
	/>

<CFLOOP QUERY="theOrders">

	<CFSET output = "">
	<CFSET output = output & "#theOrders.WebOrderID# #theOrders.CartID# #theOrders.Accountnumber# #theOrders.Shipping1# #theOrders.Subtotal# #theOrders.Total# #theOrders.Hold#">

	<CFFILE
		ACTION="APPEND"
		FILE="#tempfile#"
		OUTPUT="#output#"
		/>

	<cfquery datasource="XXXXXX" name="theRealCart">
		SELECT DISTINCT
			(WebOrderID),
			cartID,
			Location,
			Delivery,
			Quantity,
			ItemNumber,
			Item,
			Price,
			Vendor,
			Attention,
			Price
		FROM
			realcart
		Where
			WebOrderID = '#theOrders.WebOrderID#'
	</cfquery>

	<cfloop query="theRealCart">

		<CFSET output = "">
		<CFSET output = output & "#theRealCart.cartID# #theRealCart.Vendor# #theRealCart.Delivery# #theRealCart.Item# #theRealCart.ItemNumber# #theRealCart.Quantity# #theRealCart.Price# #theRealCart.Attention#">

		<CFFILE
			ACTION="APPEND"
			FILE="#tempfile#"
			OUTPUT="#output#"
			/>

	</cfloop>

</CFLOOP>

Looking at this code, there were several things that popped out at me immediately. The first thing was all the ColdFusion CFFile tags. In this case, a CFFile Append action was being called for every single row of the target CSV file. Writing data to the file system is a very costly procedure (when compared to how fast calculations will process). This is not something that we want to constantly be doing. I would suggest minimizing the number of times that we actually go to the file system (within reason).

Instead of using so many ColdFusion CFFile tags, I would recommend using a Java StringBuffer to build the entire data file output before writing it to disk. The beauty of the StringBuffer is that is can build enormous strings without the overhead cost of string concatenation. It does this by not actually building a single string, but instead, keeping track of the many parts of the constructed string. The final string is only built once at the end when the ToString() method is called (on the StringBuffer).

The second thing that drew my attention was the repeated database calls. The way it was constructed, ColdFusion had to query the database for every single order. This could be a huge cost depending on how many orders there are. Instead of doing it this way, I would recommend creating a single query that joins the order information to the order detail information. This is a bit of a tradeoff; on one hand, you are pulling down duplicate order headers, but on the other hand, you are going to save a lot in terms of database communication costs.

In my sample code below, I am using an INNER JOIN to join the orders and the cart information. I am assuming that this is a valid relationship (one requires the other), but if that is not, you would need to change the INNER JOIN to a LEFT OUTER JOIN and update the CFLoop logic a bit. Also, please note that I am using SELECT * only because I do not know the underlying database structure. As part of the optimization process, only pull down the data columns that are required.

Taking those two red flags into account, here is my suggested optimization (again, keep in mind that I could NOT run this to validate syntax and ideas):

<!---
	The first optimization is to combine the two queries
	(orders and order items) into one query. This will make
	the following CFLoop a bit more complicated but will only
	require going to the database once vs. N-times where N is
	the number of the orders.

	I am ONLY using SELECT * because I am not sure of the
	database strucutre. Otherwise, I would NEVER use SELECT *
	as more often than not, you don't need every column (and
	for readability reasons).
--->
<cfquery name="theOrders" datasource="XXXXXX">
	SELECT
		*
	FROM
		<!---
			Create the table from the first query. I am doing
			this for two reasons:
			1. I am not sure of the DB structure.
			2. We can use a LEFT OUTER Join in this query and
			still use an INNER join on the next without any
			sort of confusion.
		--->
		(

			SELECT
				*
			FROM
				Orders
			LEFT OUTER JOIN
				Loger
			ON
				Loger.UserID = Orders.CartID

		) AS o
	INNER JOIN
		realcart rc
	ON
		rc.WebOrderID = o.WebOrderID
</cfquery>



<!---
	Get the name of the temp file that we are going
	to write our tab delimited data to.
--->
<cfset strTempFile = (
	ExpandPath( "allOrders" ) &
	DateFormat( Now(), "yyyy_mm_dd" ) &
	".csv"
	) />


<!---
	Create a string buffer to hold our output. By writing to
	a string buffer rather than doing CFFile actions for every
	loop iteration we are going to increase our speed greatly.
	Writing to the file system is one of the most costly actions
	you can take in programming. Try to minimize it when
	possible (within reason).
--->
<cfset sbOutput = CreateObject(
	"java",
	"java.lang.StringBuffer"
	).Init() />


<!---
	Start off, instead of writing our column headers to
	the file system (costly), let's just append them to
	our string buffer.

	Notice that we are adding a line break at the end of the
	data line. This is because there is nothing inherent to
	the String Buffer about line breaks. In the other method,
	we were Appending data using CFFile. CFFile adds a new
	line character by default.
--->
<cfset sbOutput.Append(
	"WebOrderID CartID Accountnumber Shipping SubTotal Total Order-Hold" &
	Chr( 13 ) & Chr( 10 )
	) />


<!---
	Get the initial order ID. We are going to need to keep track
	of the order ID so that we know when to output a new order
	row (when that ID changes). To start with, set it to a value
	we know cannot possibly be valid.
--->
<cfset intOrderID = -1 />


<!--- Loop over the order/cart detail query. --->
<cfloop query="theOrders">

	<!---
		Check to see if we have a new order ID. If we do,
		then we need to output a mid-header row for this
		particular order.
	--->
	<cfif (theOrders.WebOrderID NEQ intOrderID)>

		<!---
			Store the order ID so that we don't recreate this
			mid-header row.
		--->
		<cfset intOrderID = theOrders.WebOrderID />


		<!---
			Instead of outputing the mid-header row to a file,
			let's add it to our string buffer.
		--->
		<cfset sbOutput.Append(
			"#theOrders.WebOrderID# #theOrders.CartID# #theOrders.Accountnumber# #theOrders.Shipping1# #theOrders.Subtotal# #theOrders.Total# #theOrders.Hold#" &
			Chr( 13 ) & Chr( 10 )
			) />

	</cfif>


	<!---
		ASSERT: At this point, we know that we are IN an order
		detail loop and that our header row has already been
		taken care of. That means that for the rest of the loop
		body, all we need to care about are the order details.
	--->


	<cfset sbOutput.Append(
		"#theOrders.cartID# #theOrders.Vendor# #theOrders.Delivery# #theOrders.Item# #theOrders.ItemNumber# #theOrders.Quantity# #theOrders.Price# #theOrders.Attention#" &
		Chr( 13 ) & Chr( 10 )
		) />

</cfloop>


<!---
	Now that we have all of our tab delimited data in the
	Java StringBuffer, we can write it to the file system.
	This is a bigger file write that before, but since we
	are only doing one, rather than ONCE per data row, it
	will be crazy faster.

	When writing the string buffer to the file, we have
	to convert it to a single string using ToString().
--->
<cffile
	action="WRITE"
	file="#strTempFile#"
	output="#sbOutput.ToString()#"
	/>

In the final code, we have ONE database call and ONE CFFile tag. I would be shocked if this method was not faster. Hopefully, I can get some feedback (if this advice is taken). Also, on a final note, increasing the CFSetting RequestTimeOut did solve this problem; but, I don't think this would be necessary if the code itself was optimized.

Want to use code from this post? Check out the license.

Short link: https://bennadel.com/679

Reader Comments

Dan Roberts May 4, 2007 at 9:14 AM

27 Comments

The string buffer makes a huge difference. Years ago I optimized some ColdFusion code as much as possible in create a csv string and the time just grew relative with the number of records. Hundreds of records took an acceptable amount of time but something like 10k records would take a few minutes, which wasn't acceptable. Luckily CFDJ had an article about the string buffer right around that same time. That cut down large csv creation from minutes down to seconds.

Aaron Longnion May 4, 2007 at 9:30 AM

39 Comments

a generic function along the same lines ;)

http://cfzen.instantspot.com/blog/index.cfm/2007/4/18/queryToCsv2-util-function-updated

Phil Duba May 4, 2007 at 9:33 AM

8 Comments

I did something similar at my previous job but I used the ArrayAppend() method. I did a blog post a while back: http://www.philduba.com/index.cfm/2006/10/20/Improving-Performance-Concatenating-Large-Strings. The string buffer method is good to. In my testing there wasn't a discernible difference between the two methods and my manager was more willing to do a pure CF use than invoke Java components. Outside of the horrible queries in the reports, this helped reduce the report executing times by 30% and sometimes more.

Ben Nadel May 4, 2007 at 10:55 AM

15,688 Comments

ArrayAppend()... very interesting, I would not have thought of that. And then, what, you just did like an ArrayToList() at the end (with no delimiter)? That's cool that there is little difference.

It would be interesting to know what the StringBuffer is actually doing underneath the covers.

Phil Duba May 4, 2007 at 11:01 AM

8 Comments

I'd put #Chr(13)##Chr(10)# as the delimiter to get it on separate lines.

Ben Nadel May 4, 2007 at 11:13 AM

15,688 Comments

Oh right :) I totally wasn't even thinking straight.

Dan Roberts May 4, 2007 at 11:19 AM

27 Comments

"It would be interesting to know what the StringBuffer is actually doing underneath the covers."

What I've read back when I first started using string buffer is that CF is creating a new variable for every cfset. That means if you are doing 10k concatenations CF is actually recreating a variable 10,000 times to perform the operations. The string buffer does the concatenation on single variable without the overhead of recreating the variable every time. Ok, that doesn't totally answer your question.

wrighterb May 4, 2007 at 11:55 AM

8 Comments

Thanks Ben,

This runs ten times faster....

Ben Nadel May 4, 2007 at 12:31 PM

15,688 Comments

@Dan,

So you think it is still just storing it in a string variable? I am interested in the data structure underneath - is it a string, is it some sort of dynamic length collection?

@Bruce,

Sweeeeet! Glad I could help in some way :)

Jon Clausen May 7, 2007 at 10:29 AM

2 Comments

One additional suggestion which will speed things up even more is to write the file progressively using <cffile action="append"...>.

The benefit of this is that CF doesn't retain the entire generated output in memory and thus doesn't need to manipulate that entire string object when it writes a new line. It is significantly faster for large queries.

Adding 'fixnewline="true"' removes the need to specifiy the #chr(13)# or #chr(10)# as the new line delimiter since CF adds the line break on the append.

-J

Jon Clausen May 7, 2007 at 11:00 AM

2 Comments

Sorry. I'm a dork. I was only on my first cup of coffee and didn't read the case for using the string buffer correctly. :-) Your solution is better. My bad.
-J

Ben Nadel May 7, 2007 at 11:06 AM

15,688 Comments

@Jon,

I think there might be a case for both situations. While File writing is a costly operation, you don't want to build massive amounts of data in the server's RAM as that will choke the machine. What might be good to use is a combination of BufferedFileOutput and StringBuffer. This level of optimization goes beyond my understanding of how all the file decorators works, but I suspect it would be the best of both worlds.

DK May 9, 2007 at 6:37 PM

1 Comments

There's a CFX tag CFX_QueryWrite which does a similar thing. I have used it since CF5 days. You can find it at Adobe Exchange.

And its very fast - takes the query name and a delimiter as a parameter.

Dave Nellis Jul 25, 2007 at 6:50 PM

1 Comments

Please correct me if I'm wrong but ColdFusion already presents the string buffer capability through cfsavecontent? Thats what I have used to solve a similar problem with the exact same results: a huge increase in speed.

Ben Nadel Jul 26, 2007 at 7:00 AM

15,688 Comments

@Dave,

I am not sure how CFSaveContent is implemented under the covers. I am not sure how it would StringBuffer. I think it is just writing to an alternate content buffer. However, you will probably see speed improvements over standard string concatenation because the alternate content buffer probably acts in similar way for optimization???

Paul McCombie Mar 12, 2008 at 11:10 AM

4 Comments

Hi there,

I've used a similar (**coughs** identical) method to create a csv file, but have ran afoul of Excel's 'helpful' cell formatting - I have columns from a dataset in the form of number-hyphen-number, eg. 12-1. When opening the CSV in Excel, it decides to format this to 12-Jan.

The CSV file itself is ok and the data remains 12-1. Any pointers as to how to force Excel to regard these cells as text as opposed to dates?

Ben Nadel Mar 12, 2008 at 11:20 AM

15,688 Comments

@Paul,

I am pretty sure Excel has a Data Import Wizard. In that, you can select the columns to import and I believe format the data. Of course, this has to be done manually since this is a feature / issue with Excel, not with the CSV. The whole point of a CSV is that is has very little formatting.

Paul McCombie Mar 12, 2008 at 11:27 AM

4 Comments

Yeh, that's what I was thinking. The page was producing an Excel worksheet using MS' Open XML nonsense, which allows you to format columns/rows/cells til your hearts' content. But the bl**dy client's decided (after it's finished) that they want it in CSV.

Ben Nadel Mar 12, 2008 at 11:32 AM

15,688 Comments

@Paul,

If that's what they want, then that's what they want. Of course, that has implications that they have to deal with.

Craig M. Rosenblum Oct 5, 2009 at 5:32 PM

23 Comments

Really neat-o stuff...

Yes I said neat-o :)

But on the serious side, what about performance issues or concerns, like running csv importing as a scheduled task...

In my testing steps i keep getting jrun servlet errors, which concerns me greatly once i made this as a scheduled task, and how it's performance would affect the whole servers...

Any thoughts?

Ben Nadel Nov 1, 2009 at 3:38 PM

15,688 Comments

@Craig,

I've been told that if performance is really a big concern, running a CSV import through the bulk insert feature of the database would be the most performant (but somewhat less flexible).

Pete N. Jan 4, 2010 at 8:07 PM

4 Comments

Ben,

I have implemented this solution to create a csv which I am feeding to google AdWords Editor. Works like a champ! Once I build the csv file in memory, instead of writing to disk, I dump the results to the user and the user is prompted to save the file.

However, once I get to around 250,000 lines, I am getting the following error:
javax.servlet.ServletException: ROOT CAUSE:
java.lang.OutOfMemoryError: Java heap space

I checked CF server and its allocating 512MB for CF JVM.

I start off with the following.

"crlf is just a var for 'chr(13) &chr(10)'"

</cfloop>

Finally, I present the csv to the user:

<cfheader name="Content-Disposition" value="attachment; filename=#network#_#type#_#content#_#DateFormat(Now())#_#TimeFormat(Now())#.csv">
<cfcontent type="application/vnd.ms-excel">
<cfoutput>#sbOutput.ToString()#</cfoutput>
<cfabort>

Now, I have tried to set the initial size of the stringbuffer (I am not certain of the syntax, guessing here), but to no avail.

For me, a 250,000 record csv file should be about 35MB.

I have also checked used memory using the following:

total: 387.3125
avail: 302.785392761
used: 84.5271072388

Any ideas how to get around the "out of memory" problem?

Much thanks in advance!!

-Pete

Pete N. Jan 4, 2010 at 10:00 PM

4 Comments

Ben,

It appears this is the correct format to set the SB initial size.

However, this does not seem to help my problem.

Thanks again in advance!!

Ben Nadel Jan 5, 2010 at 8:37 AM

15,688 Comments

@Pete,

It's hard to say how much memory that something should use. You can't just take the character bytes as the direct memory usage. Remember, there are Java class instances in play here. I don't know how StringBuilder works behind the scenes; I am sure it is more complicated than I imagine (considering it can do things like indexOf(), replace(), and lastIndexOf() - clearly it is a complex data container.

You might want to try writing it to disk every X rows, and then serving up that temporary file using CFContent/File.

Pete N. Jan 5, 2010 at 1:34 PM

4 Comments

Ben,

Thanks for the quick reply! In the end, I may have to write to disk as you suggested. It's just so sad because the java stringbuffer performed so well.

Then again, I can still use it to wrap up 20k or so records at a time, write to disk, repeat until done.

Setting the initial SB size was just a stab in the dark :)

I have tried

java.lang.StringBuffer
ArrayNew(1)
a struct
and Cfsavecontent

each give me the same result, once I hit xx records from my query.

Each "line" i generate has 8 values, one of which is a URL string (about 100 to 150 characters in length). I found removing the URL from the output allows me to break my 250,000 record barrier. I am curious if any special characters in the url are a factor or if its just the length of the url itself taking up more memory (probably the latter).

Thank you for your time and suggestion!

-Pete

Ben Nadel Jan 5, 2010 at 6:17 PM

15,688 Comments

@Pete,

No problem. And, as you are saying, you can get a best of both worlds kind of approach - use the StringBuffer for X rows, then append it to a file on the disk (and reset the buffer).

At that point, you just have to hope that ColdFusion is good about garbage collecting mid-request... something which can be troublesome.

Pete N. Jan 5, 2010 at 9:40 PM

4 Comments

@Ben,

If it wouldn't be such a waste of a damn-fine enlisted man I'd recommend you for OCS! You are gonna be a general someday!... (little forest gump reference in case ya didnt catch it)

Seems balance is the way to go. I just generated a 1.25 million record csv file in about 55 seconds. No more OutOfMemoryError :)

I am just creating the stringbuffer and setting its initial size to the chunk size I want to work with (20MB chunks). As I append to the SB, I check the length of the SB and when it hits the 20MB ceiling, I dump to file and clear the SB.

something like ....

</cfif>

<cfif (qry_get_keywords.currentrow eq qry_get_keywords.recordcount)>
<cffile
action="append"
file="#strTempFile#"
output="#sbOutput.ToString()#"/>
</cfif>
</cfoutput>

Thanks again for the tip and having this blog (I must have solved over a dozen problems in the past visiting your site! Add 1 more!).

Cheers,

Pete

Ben Nadel Jan 9, 2010 at 10:11 PM

15,688 Comments

@Pete,

Looks good - I think that's the right approach. I can't believe you are dealing with millions of rows - that's badass. Glad we could collaborate on this.

Peter Doyle Jul 10, 2010 at 6:20 AM

1 Comments

Well that is just what I wanted!

Such a logical way of improving performance on this. Now writing large csv's in a matter of seconds.

Thanks for this guidance - much appreciated for sharing.

Peter

Ben Nadel Jul 12, 2010 at 10:45 AM

15,688 Comments

@Peter,

Glad to help. The StringBuffer object can definitely be a great improvement; although, I have to tell you that I've also seen great speed improvements with building CSV data using arrays and then the arrayToList() function to join them. In either case, however, it sounds like you're already seeing significant improvement!

Mike Maddaloni - @thehotiron Jun 7, 2011 at 1:30 AM

3 Comments

Thank you, thank you!

I have been becoming a big fan of your blog, and thought I'd finally respond to something, especially something that saved my butt!

mp/m

Chris Johansson Dec 6, 2011 at 10:14 AM

1 Comments

I have to add a Thank you too! This saved me tons of headache, I was reaching the time limit writing 4500 lines to an xml file. "subscribing"

Thanks!

Craig Ebbott Jan 26, 2012 at 10:48 AM

1 Comments

Maybe this was already brought up with using the string buffer (i didn't read all the comments; i know lazy....)

My simple method turned a 10 minute write to a 10 second write by just taking each record and saving it to a variable and the appending the next record on to it. After 10 records have been processed, i dump it to the file with an append.

Oh my chickens, this post is old!

Hit me up on Twitter if you want to discuss it further.