XML Building / Parsing / Traversing Speed In ColdFusion
After my little discovery yesterday about the relative speed of XML versus ColdFusion custom tags for structured data collection, I decided to do a little more investigation into XML performance in ColdFusion 8. In the following demonstration, I am testing three different aspects of ColdFusion:
Building an XML string using CFSaveContent. I decided against testing CFXML because I think it is less flexible: having an XML string first allows us to do more than just create an XML document. Plus, I think CFXML is probably just using a data buffer and then parsing the resulting string afterwards (just a guess).
Parsing an XML string using XmlParse(). Basically, taking the above string and parsing it into a ColdFusion XML document.
Traversing an XML document. Taking the resultant ColdFusion XML document from above and walking over each node, getting the value of the leaf-nodes.
The following code does all three of these in a row, each test building on the results of the previous one. I tried with a variety of data set sizes, which I will review afterwards:
```cfm
<!--- Increase page running time. --->
<cfsetting requesttimeout="200" />

<!--- Create a blank ColdFusion query object. --->
<cfset qData = QueryNew( "" ) />

<!---
	Create an array of values. We are going to use this array
	to populate N columns in the query object. Sure, the values
	will all be uniform, but this is the easiest method.
--->
<cfset arrColumnData = ListToArray(
	RepeatString( "v#RandRange( 1111, 9999 )#,", 10000 )
	) />

<!--- Add columns to this query. --->
<cfloop index="intColumn" from="1" to="50" step="1">

	<!---
		Add the data array, from above, as the default data for
		this new column (this will add a row for each array index).
	--->
	<cfset QueryAddColumn(
		qData,
		"column#intColumn#",
		"cf_sql_varchar",
		arrColumnData
		) />

</cfloop>

<!---
	Get the column list as an array. This will allow us to loop
	over it faster, which will cut down on XML creation time
	amortized over all the rows.
--->
<cfset arrColumns = ListToArray( qData.ColumnList ) />


<!--- Keep track of how long it takes to build XML string. --->
<cftimer label="Building XML String" type="outline">

	<!--- Create an XML string for the given query. --->
	<cfsavecontent variable="strQueryAsXML">
		<cfoutput>

			<query>
				<!--- Create a row for each record. --->
				<cfloop query="qData">

					<row index="#qData.CurrentRow#">
						<!--- Loop over the columns. --->
						<cfloop index="strColumnName" array="#arrColumns#">
							<column name="#strColumnName#">#qData[ strColumnName ][ qData.CurrentRow ]#</column>
						</cfloop>
					</row>

				</cfloop>
			</query>

		</cfoutput>
	</cfsavecontent>

	<p>
		Done building XML string.
	</p>

</cftimer>


<!--- Keep track of how long it takes to parse XML. --->
<cftimer label="Parsing XML String" type="outline">

	<!--- Parse into a ColdFusion XML document. --->
	<cfset xmlQuery = XmlParse( strQueryAsXML ) />

	<p>
		Done parsing XML string.
	</p>

</cftimer>


<!---
	Keep track of how long it takes to traverse the XML document
	(assuming that it is the above query format).
--->
<cftimer label="Traversing XML Document" type="outline">

	<!---
		Kill any output that is caused from the xml document
		traversal. Killing the white space has a HUGE impact on
		performance because it ignores any buffering updates
		(I assume).
	--->
	<cfsilent>

		<cfloop
			index="intRow"
			from="1"
			to="#ArrayLen( xmlQuery.query.XmlChildren )#"
			step="1">

			<!--- Get a pointer to the current row. --->
			<cfset xmlRow = xmlQuery.query.XmlChildren[ intRow ] />

			<!--- Loop over each column. --->
			<cfloop
				index="intChild"
				from="1"
				to="#ArrayLen( xmlRow.XmlChildren )#"
				step="1">

				<!--- Get a pointer to current child. --->
				<cfset xmlChild = xmlRow.XmlChildren[ intChild ] />

				<!--- Get node value. --->
				<cfset strValue = xmlChild.XmlText />

				<!---
					We are not going to output any value at this
					point since that will only slow things down
					unnecessarily.
				--->

			</cfloop>

		</cfloop>

	</cfsilent>

	<p>
		Done traversing XML document.
	</p>

</cftimer>
```
As you can see, the test runs off of a manually created ColdFusion query object. While the code in the demo creates a query with a set height (row count) and width (column count), I ran it many times with different dimensions. Here are the results that I saw:
1,000 Rows x 50 Cells (50,000 Values)
Building XML String: 289.25ms on average.
Parsing XML String: 836ms on average.
Traversing XML Document: 519.5ms on average.
5,000 Rows x 50 Cells (250,000 Values)
Building XML String: 1,082.88ms on average.
Parsing XML String: 4,465ms on average.
Traversing XML Document: 2,617.25ms on average.
10,000 Rows x 50 Cells (500,000 Values)
Building XML String: 2,835.75ms on average.
Parsing XML String: 9,000ms on average.
Traversing XML Document: 7,668.25ms on average.
20,000 Rows x 50 Cells (1,000,000 Values)
Java heap space null
60,000 Rows x 10 Cells (600,000 Values)
Java heap space null
60,000 Rows x 5 Cells (300,000 Values)
Building XML String: 6,375ms on average.
Parsing XML String: 7,179.75ms on average.
Traversing XML Document: 173,242.25ms on average.
There are a couple of things to make note of here. For starters, ColdFusion 8 can parse XML really fast. A 50,000 leaf-node XML tree parses in under a second and can be fully traversed in just over half a second. That's pretty awesome!
Of course, XML parsing does have its limits - as you can see, somewhere past 500,000 values (the 600,000-value and 1,000,000-value tests both failed), ColdFusion simply does not have enough memory to do the XML parsing (and yes, it is the XmlParse() line that causes the heap space error).
Here's the really interesting thing, though - the height (row count) and width (column count) of the XML tree each have a different impact on traversal performance. If you look at our third test with 10,000 rows and 500,000 leaf nodes, the tree can be fully traversed in less than 8 seconds (about 0.015ms per leaf node). However, if we have 60,000 rows with only 300,000 leaf nodes (200,000 fewer than the previous example), it takes 173 seconds to traverse (about 0.58ms per leaf node, or roughly 38 times slower per node)! So, it looks like the 60,000 rows have a greater impact than the 50 columns amortized over the rows.
And finally, while you can't see this in the demo because it was too slow to use, the named pseudo-arrays that ColdFusion allows when dealing with XML documents are extremely slow! Using the actual XmlChildren arrays was orders of magnitude faster. Moral of the story - use the XmlChildren array.
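To make that distinction concrete, here is a minimal sketch of the two access styles, run against a tiny document in the same query / row / column shape as the demo. The variable names (xmlDemo, intRow, xmlRow) and the sample data are made up for illustration, and a two-row document is far too small to show the timing gap; it only shows which syntax is which.

```cfm
<!--- Tiny stand-in for the demo document (hypothetical sample data). --->
<cfset xmlDemo = XmlParse(
	'<query><row><column name="a">v1</column></row><row><column name="a">v2</column></row></query>'
	) />

<!--- Named pseudo-array: child nodes are addressed by their tag name. --->
<cfloop
	index="intRow"
	from="1"
	to="#ArrayLen( xmlDemo.query.row )#"
	step="1">

	<cfset xmlRow = xmlDemo.query.row[ intRow ] />
	<cfset strValue = xmlRow.column[ 1 ].XmlText />

</cfloop>

<!--- XmlChildren: child nodes are addressed by position only. --->
<cfloop
	index="intRow"
	from="1"
	to="#ArrayLen( xmlDemo.query.XmlChildren )#"
	step="1">

	<cfset xmlRow = xmlDemo.query.XmlChildren[ intRow ] />
	<cfset strValue = xmlRow.XmlChildren[ 1 ].XmlText />

</cfloop>
```

Both loops read the same values here. The second form is the one used in the timed traversal above.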
Ok, I'm exhausted so that's all I'm gonna review for the moment. Have a great weekend.
One thing I've observed when building large XML documents is that using CF's XML functions like xmlElemNew() and its ilk was much less mean on memory than creating the thing using CFXML (which is just CFSAVECONTENT with an implicit xmlParse() added in, as far as I can tell).
I suppose it's obvious (?) that CFXML requires the whole string to be complete before converting it to XML, compared to the function-based approach, which constructs the document ad hoc.
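For readers who have not used the function-based approach, the following is a rough sketch of what it might look like for the query / row / column format from the post. It uses XmlNew() and XmlElemNew() directly instead of building a string first. The small query (qSmall) and its values are hypothetical stand-ins for qData, and this was not part of the timed tests, so no memory or speed claim is implied.

```cfm
<!--- Small hypothetical query standing in for qData. --->
<cfset qSmall = QueryNew( "" ) />
<cfset QueryAddColumn( qSmall, "column1", "cf_sql_varchar", ListToArray( "v1,v2,v3" ) ) />
<cfset QueryAddColumn( qSmall, "column2", "cf_sql_varchar", ListToArray( "w1,w2,w3" ) ) />
<cfset arrColumns = ListToArray( qSmall.ColumnList ) />

<!--- Create an empty XML document and give it a root node. --->
<cfset xmlDoc = XmlNew() />
<cfset xmlDoc.XmlRoot = XmlElemNew( xmlDoc, "query" ) />

<cfloop query="qSmall">

	<!--- Create the row node for this record. --->
	<cfset xmlRow = XmlElemNew( xmlDoc, "row" ) />
	<cfset xmlRow.XmlAttributes[ "index" ] = qSmall.CurrentRow />

	<cfloop index="strColumnName" array="#arrColumns#">

		<!--- Create the column node and attach it to the row. --->
		<cfset xmlColumn = XmlElemNew( xmlDoc, "column" ) />
		<cfset xmlColumn.XmlAttributes[ "name" ] = strColumnName />
		<cfset xmlColumn.XmlText = qSmall[ strColumnName ][ qSmall.CurrentRow ] />
		<cfset ArrayAppend( xmlRow.XmlChildren, xmlColumn ) />

	</cfloop>

	<!--- Attach the finished row to the root node. --->
	<cfset ArrayAppend( xmlDoc.XmlRoot.XmlChildren, xmlRow ) />

</cfloop>
```

Because the document is assembled node by node, the full XML string never has to exist in memory alongside the parsed tree.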
Another thing that could be worth measuring is the performance of accessing an XML doc via xpath, rather than "brute force" CF constructs.
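As a starting point for that kind of measurement, here is a small sketch that collects the leaf nodes with XmlSearch() and times the walk with GetTickCount(). The document and variable names are hypothetical, and a document this small will not produce meaningful timings; the point is only the shape of the test.

```cfm
<!--- Tiny hypothetical document in the same format as the post. --->
<cfset xmlDemo = XmlParse(
	'<query><row><column name="a">v1</column><column name="b">v2</column></row></query>'
	) />

<cfset intStart = GetTickCount() />

<!--- Gather every column node with a single XPath query. --->
<cfset arrColumnNodes = XmlSearch( xmlDemo, "/query/row/column" ) />

<!--- Read the value of each matching leaf node. --->
<cfloop
	index="intNode"
	from="1"
	to="#ArrayLen( arrColumnNodes )#"
	step="1">

	<cfset strValue = arrColumnNodes[ intNode ].XmlText />

</cfloop>

<cfset intElapsed = (GetTickCount() - intStart) />

<cfoutput>
	<p>
		XPath traversal: #intElapsed#ms
	</p>
</cfoutput>
```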
CF XML processing is slow and memory intensive because it uses a DOM processor (Xerces), which has to represent the entire document in memory to work with it. My preferred alternative is XOM (http://www.xom.nu/), which provides a very nice API atop a SAX processor instead, which is considerably faster and uses less memory to perform the same operations, such as building, parsing, traversing and querying XML documents.
I'll have to try that out, although I think it would make sense that building the DOM manually would be faster; as you have pointed out, CFXML / XmlParse() needs to have the entire XML string in memory before it does anything.
As far as XPath, in my recent experience, that has proved to be extremely slow. In fact, in one of my previous posts, I found that XPath slowed down my testing by like 13 seconds. It was extremely slow.
I'll have to take a look at these types of solutions one day. I think I tried once to get a SAX parser to work, but was having trouble building my event listener as a CFC.
Using XOM, you don't deal with the SAX API. Instead, you use XOM's API, which is similar to how ColdFusion works with XML but has the advantage of being backed by a SAX parser, which is more memory efficient than the DOM parser CF is using.
XOM looks interesting. I tried looking through some of their sample code. I found some of it hard to follow, but I didn't really give it that much looking into. When I have some more time, I'll check it out.
So when it breaks due to being too big, is there any way of splitting it up into smaller chunks?
I'm thinking of taking the raw data before xmlParsing it and seeing if I can't break it up. But if somebody has experience with this already it would save me the trouble. :)
I've played around with a couple of approaches to dealing with XML documents that are too large to be parsed in one shot. In one approach, I use regular expressions to try to parse one tag at a time:
In another approach, similar to that one, I tried to implement a ColdFusion XML listener like the SAX parser:
That was more of an experiment, though.
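As a rough illustration of the tag-at-a-time idea (this is a sketch written for this thread, not the code from the posts mentioned above), one could use a Java regular expression to pull out each row element and parse the rows individually. It assumes that row elements are never nested or self-closing, and that the raw string itself still fits in memory; only the parsed DOM stays small. The strXml value is hypothetical sample data.

```cfm
<!--- Hypothetical raw XML; in practice this would be the oversized string. --->
<cfset strXml = '<query><row index="1"><column name="a">v1</column></row><row index="2"><column name="a">v2</column></row></query>' />

<!--- Compile a pattern that lazily matches one complete row element. --->
<cfset objPattern = CreateObject( "java", "java.util.regex.Pattern" ).Compile(
	JavaCast( "string", "(?s)<row\b.*?</row>" )
	) />

<cfset objMatcher = objPattern.Matcher( JavaCast( "string", strXml ) ) />

<!--- Parse and process one row at a time. --->
<cfloop condition="objMatcher.Find()">

	<!--- Each match is a small, standalone XML document. --->
	<cfset xmlRow = XmlParse( objMatcher.Group() ) />

	<!--- Read the first column value from this row. --->
	<cfset strValue = xmlRow.XmlRoot.XmlChildren[ 1 ].XmlText />

</cfloop>
```

Each pass through the loop builds a DOM for a single row only, so the parser never has to hold the whole tree at once.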
Ben, your comments about named pseudo-arrays sped up a parsing process I had by about 95%. If you're ever in Virginia Beach, I owe you a beer.
That's awesome my man! 95% is beasty! Always glad I could help out.
Have you had problems with this in CF9? I'm trying to build an XML string using cfsavecontent and it's a mare.
Basically, I've lifted the code that works fine on my CF8 box but it's decided it doesn't want to play on the CF9.
My hair's going grey (UK ;o))
The performance difference between XmlChildren and named pseudo-arrays is surprising. I have an XML feed that has grown from 100 KB to 8 MB over the years. It had become time consuming to run, and I figured XmlParse was the bottleneck. Turns out that the named pseudo-arrays were the problem. Switching to XmlChildren improved the script run time by about 20 minutes. Wow.
Great post! We were using cffile action="read" and XmlParse(), but then our XML file got too big to read. The method you wrote about solved our problems for a while, but now we are getting a new Java error, "Stream closed":
java.io.IOException: Stream closed at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:145)
Any idea how to fix this? Thanks so much for all of your great posts!