Ask Ben: Parsing Very Large XML Documents In ColdFusion

By Ben Nadel

Published 2008-09-08 in Ask Ben, ColdFusion — Comments (37)

Hello Ben, your website has come up numerous times in Google for my search to an answer that I cannot seem to find anywhere! You do however have related posts to my question - which is [drum roll]:

How do I read and parse large XML files in CF8!? I have multiple xml files of up to 135MB(!) each that I need to parse and INSERT into SQL. The problem appears to be XMLParse. I can read the XML file in via CFFILE no problems, however the XMLParse seems to max out the CF heap space (even after increasing it to 1024MB). From the reading I have done, it appears that because CF8 uses a DOM based approach, it must read in and parse the entire XML file into memory first - which is OK for small XML files, but absolutely kills the server on a 135MB file. People seem to suggest either:

Using SAX(?)

Changing the default XML parser within CF8 (which I fail to see how this would work as wouldn't it still need to read it into RAM?)

Anyway, I am hoping that you may already solved this in the past? Any help would be greatly appreciated!

This is perhaps the biggest problem with the way ColdFusion parses XML documents; it needs to able to load and parse the entire document in memory before it can return a result to you. I don't think there's anything that you can tweak in the settings to get around this - it's a property of the underlying Java library they use (Xerces I think).

Sure, you could use the SAX XML library, but then you have to start dealing with much more complicated parsing techniques. Plus, you know me - I like to build everything that I can in ColdFusion. It might not always be as fast as the pure Java solution, but when it comes to readability and maintainability, I don't think you can beat a single-technology solution.

So what can we do to get around the parsing limitation of large XML files in ColdFusion? To me, the most obvious solution is to rely on the fact that XML documents follow patterns; XML isn't just a random collection of data - it's structured data with extremely strict rules regarding formatting. And as always, when I think about patterns in text, I think of our very sexy friend, the regular expression.

What if, instead of parsing out an entire document as if it were XML, we looked for sub nodes of that document using XML patterns and then parsed those substrings into XML nodes. Sure, we wouldn't be able to pass around the XML document as a whole, but chances are, especially with extremely large documents, we don't need the information as a whole - we need it piece-wise anyway.

To develop and test our solution, first we need to create our massive XML document:

  
          <!--- Create a very large XML file. --->
        
          <cfsavecontent variable="strXML">
        
          	<cfoutput>
        
          		<order>
        
          			<!--- Order properties. --->
        
          			<properties
        
          				date="September 8, 2008"
        
          				time="13:42"
        
          				vendor="Kinky Solutions"
        
          				/>
        
          			<!--- Properties in order. --->
        
          			<products>
        
          				<!---
        
          					Loop over a large number of "products" to
        
          					create a long XML file.
        
          				--->
        
          				<cfloop
        
          					index="intI"
        
          					from="1"
        
          					to="10000"
        
          					step="1">
        
          					<product>
        
          						<sku>SKU#intI#</sku>
        
          						<name>Product #intI#</name>
        
          						<price>#RandRange( 1, 99 )#.99</price>
        
          						<quantity>#RandRange( 1, 5 )#</quantity>
        
          					</product>
        
          				</cfloop>
        
          			</products>
        
          		</order>
        
          	</cfoutput>
        
          </cfsavecontent>
        
          <!--- Write the XML data to the file. --->
        
          <cffile
        
          	action="write"
        
          	file="#ExpandPath( './products.xml' )#"
        
          	output="#strXML#"
        
          	/>

view raw code-1.cfm hosted with ❤ by GitHub

Here, we are creating an ORDER XML document that starts off with a properties node and is followed by 10,000 PRODUCT nodes. I don't even know if this scenario necessarily makes sense, but it creates a large document, and that's really all that I need.

To make our solution more usable, we are going to wrap it up in a ColdFusion component, SubNodeXmlParser.cfc. However, before we get into how that component works, let's take a look at how we will be using it. Remember, we can't parse the entire XML document at once, so we need to attack it a node at a time:

  
          <!---
        
          	Create the Sub XML node parser. We are going to have this
        
          	parser look for both the PROPERTIES and the PRODUCT nodes
        
          	(by passing in a comma delimited list of node names).
        
          --->
        
          <cfset objParser = CreateObject(
        
          	"component",
        
          	"SubNodeXmlParser"
        
          	).Init(
        
          		"properties, product",
        
          		ExpandPath( "./products.xml" )
        
          		)
        
          	/>
        
          <!---
        
          	Output the names of all the nodes found. We need to use a
        
          	conditional loop since we don't know how many nodes there
        
          	will be.
        
          --->
        
          <cfloop condition="true">
        
          	<!--- Get the next node. --->
        
          	<cfset VARIABLES.Node = objParser.GetNextNode() />
        
          	<!---
        
          		Check to see if the node was found. If not, then the
        
          		variable, Node, will have been destroyed and will no
        
          		longer exist in its parent scope.
        
          	--->
        
          	<cfif StructKeyExistS( VARIABLES, "Node" )>
        
          		<!--- Output name of node. --->
        
          		#VARIABLES.Node.XmlName#<br />
        
          	<cfelse>
        
          		<!--- We are done finding nodes so break out. --->
        
          		<cfbreak />
        
          	</cfif>
        
          </cfloop>

view raw code-2.cfm hosted with ❤ by GitHub

Notice first that our initialization method takes a comma delimited list of node names. This allows us skip over large parts of the XML document, concentrating purely on the nodes for which we have an interest. To get at these nodes, we use the GetNextNode() method. This will scan the XML file as a text document and look for the next XML node pattern. Finding it, it will parse it into a small ColdFusion XML document and return the XML node.

Running the above code, we get the following output:

properties
product
product
product
product
product
.... a few thousand more times ....

As you can see, it found the Properties node as well as all of the Product nodes. When running this code, we have to run in a Conditional loop since we have no idea how large the XML document will be. Essentially, we have to keep asking the parser for more data until it run out (and returns a VOID response).

So again, we do lose something with not being able to see the entire XML document in one view, but since you need to be inserting the data into a database, I am guessing that the piece-wise fashion will suite you just fine.

Ok, so now let's take a look at the ColdFusion component that makes this possible:

  
          <cfcomponent
        
          	output="false"
        
          	hint="I help to parse large XML files by matching patterns and then only parsing sub-nodes of the document.">
        
          	<cffunction
        
          		name="Init"
        
          		access="public"
        
          		returntype="any"
        
          		output="false"
        
          		hint="I return an intialized object.">
        
          		<!--- Define arguments. --->
        
          		<cfargument
        
          			name="Nodes"
        
          			type="string"
        
          			required="true"
        
          			hint="I am the list of node names that will be parsed using regular expressions."
        
          			/>
        
          		<cfargument
        
          			name="XmlFilePath"
        
          			type="string"
        
          			required="true"
        
          			hint="I am the file path for the large XML file to be parsed."
        
          			/>
        
          		<cfargument
        
          			name="BufferSize"
        
          			type="numeric"
        
          			required="false"
        
          			default="#(1024 * 1024 * 5)#"
        
          			hint="I am the size of the buffer which will be used to make reads to the input stream."
        
          			/>
        
          		<!--- Define the local scope. --->
        
          		<cfset var LOCAL = {} />
        
          		<!---
        
          			Create the regular expression pattern based on the
        
          			node list. We have to match both standard nodes and
        
          			self-closing nodes. The first thing we have to do is
        
          			clean up the node list.
        
          		--->
        
          		<cfset LOCAL.Nodes = ListChangeDelims(
        
          			ARGUMENTS.Nodes,
        
          			"|",
        
          			", "
        
          			) />
        
          		<!--- Define the pattern. --->
        
          		<cfset LOCAL.Pattern = (
        
          			"(?i)" &
        
          			"<(#LOCAL.Nodes#)\b[^>]*(?<=/)>|" &
        
          			"<(#LOCAL.Nodes#)\b[^>]*>[\w\W]*?</\2>"
        
          			) />
        
          		<!--- Set up the instance variables. --->
        
          		<cfset VARIABLES.Instance = {
        
          			<!---
        
          				This the compiled version of our regular
        
          				expression pattern. By compiling the pattern,
        
          				it allows us to access the Matcher functionality
        
          				later on.
        
          			--->
        
          			Pattern = CreateObject(
        
          				"java",
        
          				"java.util.regex.Pattern"
        
          				).Compile(
        
          					JavaCast( "string", LOCAL.Pattern )
        
          					),
        
          			<!---
        
          				This is the data buffer that will hold our
        
          				partial XML file data.
        
          			--->
        
          			DataBuffer = "",
        
          			<!---
        
          				The transfer buffer is what we will use to
        
          				transfer data from the input file stream into
        
          				our data buffer. It is this buffer that will
        
          				determine the size of each file read.
        
          			--->
        
          			TransferBuffer = RepeatString( " ", ARGUMENTS.BufferSize ).GetBytes(),
        
          			<!---
        
          				This will be our buffered file input stream
        
          				which let us read in the large XML file a
        
          				chunk at a time.
        
          			--->
        
          			InputStream = ""
        
          			} />
        
          		<!---
        
          			Setup the file intput stream. This buffere input
        
          			stream will all us to read in the XML file in
        
          			chunks rather than as a whole.
        
          		--->
        
          		<cfset VARIABLES.Instance.InputStream = CreateObject(
        
          			"java",
        
          			"java.io.BufferedInputStream"
        
          			).Init(
        
          				CreateObject(
        
          					"java",
        
          					"java.io.FileInputStream"
        
          					).Init(
        
          						JavaCast(
        
          							"string",
        
          							ARGUMENTS.XmlFilePath
        
          							)
        
          						)
        
          				)
        
          			/>
        
          		<!--- Return an intialized object. --->
        
          		<cfreturn THIS />
        
          	</cffunction>
        
          	<cffunction
        
          		name="Close"
        
          		access="public"
        
          		returntype="void"
        
          		output="false"
        
          		hint="This closes the input file stream. It is recommended that you call this if you finish before all nodes have been matched.">
        
          		<!--- Close the file input stream. --->
        
          		<cfset VARIABLES.Instance.InputStream.Close() />
        
          		<!--- Return out. --->
        
          		<cfreturn />
        
          	</cffunction>
        
          	<cffunction
        
          		name="GetNextNode"
        
          		access="public"
        
          		returntype="any"
        
          		output="false"
        
          		hint="I return the next node in the XML document. If no node can be found, I return VOID.">
        
          		<!--- Define the local scope. --->
        
          		<cfset var LOCAL = {} />
        
          		<!--- Create a matcher for our current buffer. --->
        
          		<cfset LOCAL.Matcher = VARIABLES.Instance.Pattern.Matcher(
        
          			JavaCast( "string", VARIABLES.Instance.DataBuffer )
        
          			) />
        
          		<!--- Try to find the next node. --->
        
          		<cfif LOCAL.Matcher.Find()>
        
          			<!---
        
          				The matcher found a pattern match. Let's pull out
        
          				the matching XML.
        
          			--->
        
          			<cfset LOCAL.XMLData = LOCAL.Matcher.Group() />
        
          			<!---
        
          				Now that we have the pattern matched, we need to
        
          				figure out how many characters to leave in our
        
          				buffer.
        
          			--->
        
          			<cfset LOCAL.CharsToLeave = (
        
          				Len( VARIABLES.Instance.DataBuffer ) -
        
          				(LOCAL.Matcher.Start() + Len( LOCAL.XMLData ))
        
          				) />
        
          			<!---
        
          				Check to see if we have any characters to leave
        
          				in the buffer after this match.
        
          			--->
        
          			<cfif LOCAL.CharsToLeave>
        
          				<!--- Trim the buffer. --->
        
          				<cfset VARIABLES.Instance.DataBuffer = Right(
        
          					VARIABLES.Instance.DataBuffer,
        
          					LOCAL.CharsToLeave
        
          					) />
        
          			<cfelse>
        
          				<!---
        
          					No character data should be left in the
        
          					buffer. Just set it to empyt string.
        
          				--->
        
          				<cfset VARIABLES.Instance.DataBuffer = "" />
        
          			</cfif>
        
          			<!---
        
          				Now that we have the buffer updated, parse the
        
          				XML data and return the root element.
        
          			--->
        
          			<cfreturn
        
          				XmlParse( Trim( LOCAL.XMLData ) )
        
          					.XmlRoot
        
          				/>
        
          		<cfelse>
        
          			<!---
        
          				The pattern matcher could not find the next node.
        
          				This might be because our buffer does contain
        
          				enough information. Let's try to read more of our
        
          				XML file into the buffer.
        
          			--->
        
          			<!--- Read input stream into local buffer. --->
        
          			<cfset LOCAL.BytesRead = VARIABLES.Instance.InputStream.Read(
        
          				VARIABLES.Instance.TransferBuffer,
        
          				JavaCast( "int", 0 ),
        
          				JavaCast( "int", ArrayLen( VARIABLES.Instance.TransferBuffer ) )
        
          				) />
        
          			<!---
        
          				Check to see if we read any bytes. If we didn't
        
          				then we have run out of data to read and cannot
        
          				possibly match any more node patterns; just
        
          				return void.
        
          			--->
        
          			<cfif (LOCAL.BytesRead EQ -1)>
        
          				<!--- Release the file input stream. --->
        
          				<cfset THIS.Close() />
        
          				<!--- No more data to be matched. --->
        
          				<cfreturn />
        
          			<cfelse>
        
          				<!---
        
          					We have read data in from the buffered file
        
          					input stream. Now, let's append that to our
        
          					internal buffer. Be sure to only move over
        
          					the bytes that were read - this might not
        
          					include the whole buffer contents.
        
          				--->
        
          				<cfset VARIABLES.Instance.DataBuffer &= Left(
        
          					ToString( VARIABLES.Instance.TransferBuffer ),
        
          					LOCAL.BytesRead
        
          					) />
        
          			</cfif>
        
          			<!---
        
          				Now that we have updated our buffer, we want to
        
          				give the pattern matcher another change to find
        
          				the node pattern.
        
          			--->
        
          			<cfreturn GetNextNode() />
        
          		</cfif>
        
          	</cffunction>
        
          </cfcomponent>

view raw code-3.cfm hosted with ❤ by GitHub

The code for this is quite small and straightforward. The ColdFusion component basically opens up the file as a buffered input streams and makes repeated reads to the stream until it can match a node pattern. Once it matches the node pattern, it parses it out into an XML document and returns the root node (the target node). It then goes back to the buffered input stream for more data. When it has no more data to read and no more pattern matches to make, it simply returns VOID signaling the end of the search.

This solution may not be exactly what you were looking for, but at the least, I hope that it has given you some ideas.

Want to use code from this post? Check out the license.

Short link: https://bennadel.com/1345

Reader Comments

O?uz Demirkap? Sep 8, 2008 at 7:41 PM

8 Comments

Hi Ben,

Great example and tutorial, thanks!

I also just want to let you know that we had same kind of issue with CF7 for a client in D.C. area 2 years ago and we preferred SAX with Java libraries and as I remember it was not as complicated as expected.

Just my 2 cent. :)

Kurt Wiersma Sep 8, 2008 at 10:35 PM

3 Comments

I have had success using Apache Digester (which comes with CF 7+) in CF to parse a 300MB+ file. Apache Digester is an easy to use SAX parser.

Adam Cameron Sep 9, 2008 at 3:46 AM

67 Comments

Hi Ben
Question. You state "I like to build everything that I can in ColdFusion", but you use Java to do your regexes and file ops, both of which could have been done in CF, and more simply to boot. Any reason for that?

--
Adam

Paul McCombie Sep 9, 2008 at 6:14 AM

4 Comments

re Adam:

I'm guessing that Coldfusion's string handling is particularly slow when it comes to large streams such as the one being discussed. Using the Java string handling expidites the process dramatically.

Or I could be talking nonsense :(

Paul McCombie Sep 9, 2008 at 6:46 AM

4 Comments

Strewth; my spelling's bad.

Ben Nadel Sep 9, 2008 at 8:34 AM

16,058 Comments

@Oguz, Kurt,

I will get around to trying those libraries one of these days. My biggest gripe, and this may be totally unfounded, is that I am not sure that they can work using ColdFusion listeners. Again, I am not talking from experience, but from what I have read, these types of libraries use the event listener model; and, since I believe it is quite difficult to invoke a ColdFusion component method from within a Java object, I assume that the listeners passed in have to be actual Java objects, not CFCs. If that is not the case, then it would make trying this much easier.

@Adam,

I am using a buffered input stream so that I don't have to read in the entire file into memory at one time. I guess I could have used a FileRead() action to do this, but frankly, I forgot that ColdFusion has that newer functionality.

As far as the regular expression parsing, using Java's Pattern and Matcher objects is actually much faster and easier to use than a pure CF solution. In ColdFusion, the regular expression find only returns the position and length of the match, and you have to manually keep looping over it with an explicit start value to get at all the matched patterns; using the Java regex utilities, looping over and getting access to all the patterns is extremely straightforward. I have to disagree with you from lots of experience that this would be easier in a pure CF solution.

I would say that using FileRead() might have been a bit easier, though.

Adam Cameron Sep 9, 2008 at 6:36 PM

67 Comments

Yeah, it was the fileRead() thing instead of <cffile> I meant there.

In regards to the regex side of things, I realise the Java implementation offers much more power, but how you're using it here doesn't seem to be any different (except in a more convoluted way) than using a single reFind(). You're not doing multiple find() calls on the Matcher, so the benefit of having the Matcher keep track of how far down the string the find() is at is irrelevant.

For a lot of things, the Java implementation seems like a lot of unnecessary horsing around to me, but I suppose if one is using 'em all the time, it becomes second nature. I guess I need some practise!

Still, I converted your code to use native CF and the Java really is an awful lot faster (35sec to 200sec, averaged out!). Also I note that you're using a zero-width positive look-behind (*) in your regex, which CF won't accept. It seems strange that CF isn't just passing the regex straight to the underlying Java regex processor. I wonder why it sticks its beak in? Oh well.

All interesting stuff.

--
Adam

(*) I have no idea what one of those is: I just looked it up when CF errored. I was able to simplify the regex a bit so it worked with reFind()...

Ben Nadel Sep 9, 2008 at 6:43 PM

16,058 Comments

@Adam,

Yeah, that's true - in this scenario, I am not really taking full advantage of the pattern matcher. However, as you point out, Java regular expressions are simply faster and I am using the positive look behind which.... (?<=/) simply means that the character "/" must exist just before this "point" in the pattern.

I agree that it does seem silly that ColdFusion doesn't just off the regular expression stuff to Java. Not sure why.

At the end of the day, this could have been done other ways, but I suppose I am so used to the Java pattern matcher that it just pops into my head as the first tool to try.

Crania Jul 18, 2009 at 3:47 AM

2 Comments

I was able to parse approx 5,000,000 "rows/node" from a 13 GB XML file using <cfloop file="file.xml" index="currentLine"> XML parse and such </cfloop> in CF 8. My understanding is that this uses Java file streaming to get to the meat.

In this instance, the 100+ MB file sound simple. Java burps at about 5 million with an out of memory error, however. .NET has the same problem at about the same place so not sure if there is some funky line or the lack of memory management (always an issue in CF) control is creating the problem.

Ben Nadel Jul 18, 2009 at 12:19 PM

16,058 Comments

@Crania,

Sounds like it might be a garbage collection problem. On requests that handle a large amount of information within a current request, even in small increments, I have found that ColdFusion has some trouble with garbage clean up. It seems to require a new page request to clean up some of the memory used up in the previous request. As such, I generally have to break mammoth tasks up across various page requests.

Crania Jul 21, 2009 at 9:45 PM

2 Comments

Thanks for the reply.

I am still battling the file. I thought I would pit .Net against cf but both seem to quit at just over 5 million rows in.

Perhaps there is a utility out there which will split the file up based on newline chars.

Will post back here when I find a solution. A 1 mil line XML file worked great with the cfloop. It could be that line 5 million has a flaw which blow up heap size. :)

Ben Nadel Jul 22, 2009 at 8:09 AM

16,058 Comments

@Crania,

Good luck! If you attack this in multiple pages requests, I am sure you can get something working.

bill Aug 1, 2009 at 1:55 PM

2 Comments

Given a Sql Server back end, seems like it would be simpler and faster to load the xml doc into Sql Server and then parse it out to an 'edge table' with openXml(). Sql Server can quickly parse very large xml docs, and edge tables, like flat files, are easy to work with.

Since you can easily import xml docs into Sql Server from the query pane, I assume that it can also be done through cfquery, so grabbing the xml doc & parsing it in Sql Server is managed through cfquery with a few lines of t-sql.

Ben Nadel Aug 5, 2009 at 9:28 AM

16,058 Comments

@Bill,

That sounds pretty cool. I have not used SQL server to parse XML documents before. Thanks for the tip.

Crania Aug 6, 2009 at 3:36 PM

1 Comments

Some great ideas. I actually tried to break the doc up initially and found something stupid - after row 5,000,000 (ish) the file was full of [spaces] - so the "line" that bot .NET and CF were blowing up on was one 5GB line of spaces. Once I split the file up and opened up some of the smaller files in a text editor it was easy to spot.

I'm quite happy with how the cfloop performed on a large file - was pretty skeptical, but it made it through about 5GB of data without a hiccup.

Sonya Oct 12, 2009 at 3:17 PM

1 Comments

Crania, Do you have any examples of this. The XML feed that I am being given does not contain any line breaks and so it does not seem to want to loop around it correctly.

Martin Sep 28, 2010 at 10:19 AM

5 Comments

@Crania - that's great but it's not actually handling any of the data just spitting out the individual lines. Both CF8 and Railo can do this this very efficiently like this:

<cfscript>
myfile = FileOpen(ExpandPath( "./myFile.xml" ));
while(NOT FileIsEOF(myfile))
	{
	WriteOutput(FileReadLine(myfile)&"<br>");
	}
FileClose(myfile);
</cfscript>

@Ben is there a way of doing getPrevious() in your code?

If I can read one xml "node" at a time, it would be nice to find a way to paginate back and forth between nodes.

I was wondering about finding a way to "index" a large XML file and then retrieve a specific node. A bit like a Master/Detail view, using XML (not by choice I must add) as my data set.

Ben Nadel Oct 3, 2010 at 10:37 PM

16,058 Comments

@Martin,

I suppose you could keep a prevNode reference after every parsing; it might be simple, or it might get complicated depending on your needs.

Thanks for pointing out those file-based methods, though. I am not sure that I have used those ones before.

Martin Oct 4, 2010 at 10:54 AM

5 Comments

@Ben,

I found some new file functions that were added in CF8: http://livedocs.adobe.com/coldfusion/8/htmldocs/functions-pt0_20.html#1100017

It appears reading individual XML lines (or any other "BIG" file) with CF is really quick and easy HOWEVER getting ColdFusion to recognise those lines as XML nodes without parsing the whole document is much harder.
Your Java regex search pattern is obviously the way to go Ben - very smart indeed.

Ben Nadel Oct 4, 2010 at 9:31 PM

16,058 Comments

@Martin,

CF8 definitely had some sweet updates. That kind of file looping can now, also being done using the CFLoop tag. I assume these functions were simply the script-based equivalent to what the CFLoop tag is doing. Good to know them.

Ben Nadel Oct 6, 2010 at 10:42 AM

16,058 Comments

@Martin,

Thanks again for the reminder - I took it and ran with it:

www.bennadel.com/blog/2027-Reading-In-File-Data-Using-ColdFusion-8-s-New-File-Functions.htm

It's nice to finally use of those functions; although, I have to say that I think the tag-based equivalents are a little bit easier to use.

Paolo Groppo Oct 29, 2010 at 6:03 AM

3 Comments

Simply genial!
My previous solution to parse large xml file was to develop a dll in visual basic calling Microsoft SAXXMLReader60. It works, but Windows dipendent and extremely difficult to manage. In the previous scenario every time I had to apply a modify, I had to open the vb project, recompile the dll, deinstall old dll, install the new dll and reload coldfusion! Your solution is a bit slower, but really, Really, REALLY easy to manage.

Thanks

P.S. free beers for you if you come in Italy someday!

Paolo

Ben Nadel Nov 1, 2010 at 9:17 PM

16,058 Comments

@Paolo,

Yeah, I've been told that using Java or some compiled DLL is going to be faster; but, I am glad that you are finding this to be more management as it is in the native language! Awesome.

Ted Daniels Jan 15, 2011 at 10:13 PM

6 Comments

Ben:

I have tried your code example and always get an error.

With all the code, the error is 'Invalid CFML Construct.CFML was looking at the following text...<' on the line where <cfcomponent...> starts, indicating an error in the earlier code.

If I remove the code from <cfcomponent...> to the end, then the error is

"Could not find the ColdFusion Component or Interface SubNodeXmlParser"

Any suggestions as to what is causing this?

Ted

Rox Productions Apr 15, 2011 at 11:07 AM

1 Comments

I tried adding to the DB during the loop like follows... but it only adds a few products in the loop... Seems like it's not getting enough time to write to DB before the loop "move on". Is that possible?

<cfloop condition="true">
<cfset VARIABLES.Node = objParser.GetNextNode() />
<cfif StructKeyExistS( VARIABLES, "Node" )>

<cfset product = Node.XmlChildren>
<cfset ppVar = "#product[10].xmlText#">


<cfif ppVar GTE 1>

<cfscript>
//writeDump(product);
//writeDump(product[1].xmlText);
item = new products();
//item.setextproductid(product[7].xmlText);
item.setname(product[1].xmlText);
item.setdescription(product[4].xmlText);
item.setprice(product[5].xmlText);
item.setpreviousprice(product[10].xmlText);
item.setproducturl(product[2].xmlText);
entitySave(item);
writeOutput(item.getname() & " saved...<br/>");
</cfscript>

<cfelse>
</cfif>
<cfelse>
<cfbreak />
</cfif>

Pat Barbano Jul 20, 2011 at 4:19 PM

5 Comments

Thanks Ben, I was looking into a using a SAX parser, but this is solution works for me. I was able to parse a 7400 node XML file, build and populate a db schema in under 10 seconds using your method.

Fritz Jun 20, 2012 at 9:35 AM

1 Comments

Ben,

Let me add my voice to the chorus of people who found this solution immensely helpful. I lifted your regular expression from the cfc and applied the Java Pattern Matcher technique over a number of iterations to break large XML files into database-sized chunks.

It's been working great for me in 99% of cases, but I've got a corner case that has me stumped.
I'm using the following expressions for the Matcher:

and, for list nodes returned, I then do

This works great for pulling most of the list items and their associated headers and commentary, except in one recursive case where list items contain other lists that also have items. The first expression doesn't see the list node at all in this case.

An abbreviated version of the offending section of XML follows. When I parse it using the first Pattern, only the head node is returned by the Matcher. Can you suggest a way to get the Matcher to find the list and its items? I'm OK with not parsing the inner lists and just returning the outer list items, but the recursive structure makes the outer list unmatchable.

Ultimately I'm planning to put these into a simple database table that, beyond primary and foreign keys, just a varchar field to hold the text content of head, p, and item tags, an increment to keep lists separate, a line number to keep the list order, and a bit field that tells me if I'm in a sub-list and need to indent deeper.

The XML:

<odd type="index">
<head>Richard Aldington Collection--Index of Works</head>
<list type="simple">
<item>
<title render="italic" linktype="simple">Enquiry</title>
<list type="simple">
<item> Review of
<title render="italic" linktype="simple"> T.E. Lawrence: A Biographical Enquiry </title>
by Richard Aldington--2.9 </item>
<item> Summary of Wills and Settlements made by various members of the Chapman family--2.9 </item>
</list>
</item>
<item> Patmore, Derek
<list type="simple">
<item>
<title render="doublequote" linktype="simple">The Poetry and Prose of Richard
Aldington</title> (radio broadcast)--2.9 </item>
</list>
</item>
</list>
</odd>

Thanks again for your time and help,

Laura Oct 22, 2012 at 11:27 AM

1 Comments

I also had problems with parsing large XML files. I tried Expresso XML Parser. It can parse files up to 35GB and it's really fast. It's really easy to use. You set up parsing rules on a website and test your file online and then use their client code to access your parsing rules from java or javascript. They have a free developer version at www.sxml.com.au

Dan Mather Apr 9, 2013 at 5:32 PM

1 Comments

Hi Ben,
I find your blog stimulating and informative. It seems most of the time when I am looking for an idea on how approach a particular problem --you have already tackled it.

I am working on parsing a rather large XML document so I naturally found this page. I have one small wrinkle to add to the problem. In my case the XML may contain sub nodes of the same type. E.g.

<value type="object">
<options>
	<list>
		<value type="object">
			<vendor>Google</vendor>
			<id type="int">101</id>
		</value>
		</list>
</options>
</value>

When I try parsing with your code here, I bomb out when I hit the embedded value tag. The code tries to pass the partial node to XmlParse which complains that the start and end tags don't match.

Do you see any way to handle this sort of parsing chore on large XML documents?

Thanks for your thoughts

Sebastiaan May 6, 2013 at 12:40 PM

61 Comments

Ben 2 the rescue! Grrrreat! This works like a charm on a 190MB large XML-file ;-) Thanx a million Ben for this insightful article and completely different take on parsing XML-documents. Now I just have to figure out how to extract the data into the database without doing the cumbersome VARIABLES.Node.tagname.XmlText all the time ;-)

Ted Daniels May 21, 2013 at 11:51 AM

6 Comments

Looking at my first ever XML document that I have to parse and put into MS SQL 2000 with CF8.

I get it to list the desired Field name, many times over, and have a long list of this field name displayed, like you show using your SubNodeXMLParser (once I added a <CFOUTPUT> tag.

Can you explain where I go from here to get the actual data out of the XML file into the database in, I presume, some sort of a loop process.

Jonathan van Zuijlekom Jun 5, 2013 at 6:01 PM

4 Comments

I had a problem with reading an UTF-8 XML file. This was only within a chthread with Railo 4, but I'd still like to share what I did to fix it and speed up the XML reading in the process.

I used BufferedReader instead of BufferedInputStream. This main difference is it takes chars instead of raw bytes. This way we can specify which charset should be used.

I added this argument:
<cfargument name="CharSet" type="string" required="false" default="UTF-8" hint="chatset of the xml file">

Changed this:
TransferBuffer = RepeatString(" ",ARGUMENTS.BufferSize).toCharArray(),

And this:
<cfset VARIABLES.Instance.InputStream = CreateObject("java","java.io.BufferedReader").Init(
CreateObject("java","java.io.InputStreamReader").Init(
CreateObject("java","java.io.FileInputStream").Init(JavaCast("string",ARGUMENTS.XmlFilePath)),JavaCast("string",ARGUMENTS.CharSet)),ARGUMENTS.BufferSize)>

Robert Aug 2, 2013 at 12:24 PM

2 Comments

Having a bit of an issue. Everything runs fine until it gets to the last entry in the XML file and then I get a 'Stream closed' error.

Any ideas?

Thanks everyone!

Jonathan van Zuijlekom Aug 6, 2013 at 8:30 AM

4 Comments

@Robert, have you tried my modifications? It uses the java BufferedReader instead of the BufferedInputStream.

Robert Aug 6, 2013 at 2:41 PM

2 Comments

Thanks for the suggestion.

Doesn't look like it's an issue of XML size. I ran the code using an XML file with 10 entries and received the error.

'Stream closed'

Jonathan van Zuijlekom Aug 8, 2013 at 2:55 AM

4 Comments

@Robert, I'm not saying it is an issue with the XML size. I'm saying I'm using a different way of reading the XML file, and maybe that different way (BufferedReader) helps you with your "stream closed" issue.

Derek Versteegen Aug 8, 2013 at 11:08 AM

1 Comments

I noticed in all the examples the XML document is on accessed locally. My problem is that that I am accessing a 3MB XML document from a URL. So it would seem my long processing time is coming from cfhttp downloading the file into memory before it writes it locally where I then use the before mentioned code to process the file.

Is it possible to "stream" from a URL as it has been demonstrated from a physical file?

Oh my chickens, this post is old!

Hit me up on Twitter if you want to discuss it further.

	<!--- Create a very large XML file. --->
	<cfsavecontent variable="strXML">
	<cfoutput>
	<order>

	<!--- Order properties. --->
	<properties
	date="September 8, 2008"
	time="13:42"
	vendor="Kinky Solutions"
	/>

	<!--- Properties in order. --->
	<products>

	<!---
	Loop over a large number of "products" to
	create a long XML file.
	--->
	<cfloop
	index="intI"
	from="1"
	to="10000"
	step="1">

	<product>
	<sku>SKU#intI#</sku>
	<name>Product #intI#</name>
	<price>#RandRange( 1, 99 )#.99</price>
	<quantity>#RandRange( 1, 5 )#</quantity>
	</product>

	</cfloop>

	</products>

	</order>
	</cfoutput>
	</cfsavecontent>


	<!--- Write the XML data to the file. --->
	<cffile
	action="write"
	file="#ExpandPath( './products.xml' )#"
	output="#strXML#"
	/>

	<!---
	Create the Sub XML node parser. We are going to have this
	parser look for both the PROPERTIES and the PRODUCT nodes
	(by passing in a comma delimited list of node names).
	--->
	<cfset objParser = CreateObject(
	"component",
	"SubNodeXmlParser"
	).Init(
	"properties, product",
	ExpandPath( "./products.xml" )
	)
	/>


	<!---
	Output the names of all the nodes found. We need to use a
	conditional loop since we don't know how many nodes there
	will be.
	--->
	<cfloop condition="true">

	<!--- Get the next node. --->
	<cfset VARIABLES.Node = objParser.GetNextNode() />

	<!---
	Check to see if the node was found. If not, then the
	variable, Node, will have been destroyed and will no
	longer exist in its parent scope.
	--->
	<cfif StructKeyExistS( VARIABLES, "Node" )>

	<!--- Output name of node. --->
	#VARIABLES.Node.XmlName#<br />

	<cfelse>

	<!--- We are done finding nodes so break out. --->
	<cfbreak />

	</cfif>

	</cfloop>

	<cfcomponent
	output="false"
	hint="I help to parse large XML files by matching patterns and then only parsing sub-nodes of the document.">


	<cffunction
	name="Init"
	access="public"
	returntype="any"
	output="false"
	hint="I return an intialized object.">

	<!--- Define arguments. --->
	<cfargument
	name="Nodes"
	type="string"
	required="true"
	hint="I am the list of node names that will be parsed using regular expressions."
	/>

	<cfargument
	name="XmlFilePath"
	type="string"
	required="true"
	hint="I am the file path for the large XML file to be parsed."
	/>

	<cfargument
	name="BufferSize"
	type="numeric"
	required="false"
	default="#(1024 * 1024 * 5)#"
	hint="I am the size of the buffer which will be used to make reads to the input stream."
	/>

	<!--- Define the local scope. --->
	<cfset var LOCAL = {} />

	<!---
	Create the regular expression pattern based on the
	node list. We have to match both standard nodes and
	self-closing nodes. The first thing we have to do is
	clean up the node list.
	--->
	<cfset LOCAL.Nodes = ListChangeDelims(
	ARGUMENTS.Nodes,
	"\|",
	", "
	) />

	<!--- Define the pattern. --->
	<cfset LOCAL.Pattern = (
	"(?i)" &
	"<(#LOCAL.Nodes#)\b[^>]*(?<=/)>\|" &
	"<(#LOCAL.Nodes#)\b[^>]>[\w\W]?</\2>"
	) />

	<!--- Set up the instance variables. --->
	<cfset VARIABLES.Instance = {

	<!---
	This the compiled version of our regular
	expression pattern. By compiling the pattern,
	it allows us to access the Matcher functionality
	later on.
	--->
	Pattern = CreateObject(
	"java",
	"java.util.regex.Pattern"
	).Compile(
	JavaCast( "string", LOCAL.Pattern )
	),

	<!---
	This is the data buffer that will hold our
	partial XML file data.
	--->
	DataBuffer = "",

	<!---
	The transfer buffer is what we will use to
	transfer data from the input file stream into
	our data buffer. It is this buffer that will
	determine the size of each file read.
	--->
	TransferBuffer = RepeatString( " ", ARGUMENTS.BufferSize ).GetBytes(),

	<!---
	This will be our buffered file input stream
	which let us read in the large XML file a
	chunk at a time.
	--->
	InputStream = ""

	} />

	<!---
	Setup the file intput stream. This buffere input
	stream will all us to read in the XML file in
	chunks rather than as a whole.
	--->
	<cfset VARIABLES.Instance.InputStream = CreateObject(
	"java",
	"java.io.BufferedInputStream"
	).Init(
	CreateObject(
	"java",
	"java.io.FileInputStream"
	).Init(
	JavaCast(
	"string",
	ARGUMENTS.XmlFilePath
	)
	)
	)
	/>

	<!--- Return an intialized object. --->
	<cfreturn THIS />
	</cffunction>


	<cffunction
	name="Close"
	access="public"
	returntype="void"
	output="false"
	hint="This closes the input file stream. It is recommended that you call this if you finish before all nodes have been matched.">

	<!--- Close the file input stream. --->
	<cfset VARIABLES.Instance.InputStream.Close() />

	<!--- Return out. --->
	<cfreturn />
	</cffunction>


	<cffunction
	name="GetNextNode"
	access="public"
	returntype="any"
	output="false"
	hint="I return the next node in the XML document. If no node can be found, I return VOID.">

	<!--- Define the local scope. --->
	<cfset var LOCAL = {} />

	<!--- Create a matcher for our current buffer. --->
	<cfset LOCAL.Matcher = VARIABLES.Instance.Pattern.Matcher(
	JavaCast( "string", VARIABLES.Instance.DataBuffer )
	) />


	<!--- Try to find the next node. --->
	<cfif LOCAL.Matcher.Find()>

	<!---
	The matcher found a pattern match. Let's pull out
	the matching XML.
	--->
	<cfset LOCAL.XMLData = LOCAL.Matcher.Group() />

	<!---
	Now that we have the pattern matched, we need to
	figure out how many characters to leave in our
	buffer.
	--->
	<cfset LOCAL.CharsToLeave = (
	Len( VARIABLES.Instance.DataBuffer ) -
	(LOCAL.Matcher.Start() + Len( LOCAL.XMLData ))
	) />

	<!---
	Check to see if we have any characters to leave
	in the buffer after this match.
	--->
	<cfif LOCAL.CharsToLeave>

	<!--- Trim the buffer. --->
	<cfset VARIABLES.Instance.DataBuffer = Right(
	VARIABLES.Instance.DataBuffer,
	LOCAL.CharsToLeave
	) />

	<cfelse>

	<!---
	No character data should be left in the
	buffer. Just set it to empyt string.
	--->
	<cfset VARIABLES.Instance.DataBuffer = "" />

	</cfif>

	<!---
	Now that we have the buffer updated, parse the
	XML data and return the root element.
	--->
	<cfreturn
	XmlParse( Trim( LOCAL.XMLData ) )
	.XmlRoot
	/>

	<cfelse>

	<!---
	The pattern matcher could not find the next node.
	This might be because our buffer does contain
	enough information. Let's try to read more of our
	XML file into the buffer.
	--->

	<!--- Read input stream into local buffer. --->
	<cfset LOCAL.BytesRead = VARIABLES.Instance.InputStream.Read(
	VARIABLES.Instance.TransferBuffer,
	JavaCast( "int", 0 ),
	JavaCast( "int", ArrayLen( VARIABLES.Instance.TransferBuffer ) )
	) />

	<!---
	Check to see if we read any bytes. If we didn't
	then we have run out of data to read and cannot
	possibly match any more node patterns; just
	return void.
	--->
	<cfif (LOCAL.BytesRead EQ -1)>

	<!--- Release the file input stream. --->
	<cfset THIS.Close() />

	<!--- No more data to be matched. --->
	<cfreturn />

	<cfelse>

	<!---
	We have read data in from the buffered file
	input stream. Now, let's append that to our
	internal buffer. Be sure to only move over
	the bytes that were read - this might not
	include the whole buffer contents.
	--->
	<cfset VARIABLES.Instance.DataBuffer &= Left(
	ToString( VARIABLES.Instance.TransferBuffer ),
	LOCAL.BytesRead
	) />

	</cfif>


	<!---
	Now that we have updated our buffer, we want to
	give the pattern matcher another change to find
	the node pattern.
	--->
	<cfreturn GetNextNode() />

	</cfif>
	</cffunction>

	</cfcomponent>