Ben Nadel

Ask Ben: Parsing Very Large XML Documents In ColdFusion

By Ben Nadel

Hello Ben, your website has come up numerous times in Google for my search to an answer that I cannot seem to find anywhere! You do however have related posts to my question - which is [drum roll]:

How do I read and parse large XML files in CF8!? I have multiple xml files of up to 135MB(!) each that I need to parse and INSERT into SQL. The problem appears to be XMLParse. I can read the XML file in via CFFILE no problems, however the XMLParse seems to max out the CF heap space (even after increasing it to 1024MB). From the reading I have done, it appears that because CF8 uses a DOM based approach, it must read in and parse the entire XML file into memory first - which is OK for small XML files, but absolutely kills the server on a 135MB file. People seem to suggest either:

1. Using SAX(?)
2. Changing the default XML parser within CF8 (which I fail to see how this would work as wouldn't it still need to read it into RAM?)

Anyway, I am hoping that you may have already solved this in the past? Any help would be greatly appreciated!

This is perhaps the biggest problem with the way ColdFusion parses XML documents: it needs to be able to load and parse the entire document in memory before it can return a result to you. I don't think there's anything that you can tweak in the settings to get around this - it's a property of the underlying Java library they use (Xerces, I think).

Sure, you could use the SAX XML library, but then you have to start dealing with much more complicated parsing techniques. Plus, you know me - I like to build everything that I can in ColdFusion. It might not always be as fast as the pure Java solution, but when it comes to readability and maintainability, I don't think you can beat a single-technology solution.
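To give a rough sense of what the SAX route involves, here is a minimal sketch in plain Java (the language the SAX libraries live in). It assumes a document shaped like the test file built later in this post and pulls out only the text of each sku element. The class and method names (SaxSkuSketch, readSkus) are made up for illustration; the callbacks are the standard ones from org.xml.sax.helpers.DefaultHandler.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxSkuSketch {

    // Stream through the document and collect the text of every <sku> element.
    // Only the current element's text is held in memory, never the whole tree.
    static List<String> readSkus(InputStream in) throws Exception {
        final List<String> skus = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private boolean inSku = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals("sku")) {
                    inSku = true;
                    text.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver one text node in several pieces, so accumulate.
                if (inSku) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("sku")) {
                    skus.add(text.toString());
                    inSku = false;
                }
            }
        };

        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return skus;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<order><products>"
            + "<product><sku>SKU1</sku></product>"
            + "<product><sku>SKU2</sku></product>"
            + "</products></order>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(readSkus(in));
    }
}
```

Even for this single field, the handler has to track its own state across three callbacks, which is the extra bookkeeping referred to above.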

So what can we do to get around the parsing limitation of large XML files in ColdFusion? To me, the most obvious solution is to rely on the fact that XML documents follow patterns; XML isn't just a random collection of data - it's structured data with extremely strict rules regarding formatting. And as always, when I think about patterns in text, I think of our very sexy friend, the regular expression.

What if, instead of parsing out an entire document as if it were XML, we looked for sub-nodes of that document using XML patterns and then parsed those substrings into XML nodes? Sure, we wouldn't be able to pass around the XML document as a whole, but chances are, especially with extremely large documents, we don't need the information as a whole - we need it piece-wise anyway.
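As a rough illustration of that idea, here is a small sketch in plain Java (the regular expression engine used by the component later in this post is Java's anyway). It runs a node pattern over a string and parses each match as its own tiny XML document. The class and method names (SubNodeMatchSketch, nodePattern, extract) are hypothetical, and the whole string is held in memory here purely to keep the example short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class SubNodeMatchSketch {

    // names is a pipe-delimited list such as "properties|product".
    // First alternative: self-closing nodes. Second: open/close pairs,
    // where \2 refers back to the node name captured in that alternative.
    static Pattern nodePattern(String names) {
        return Pattern.compile(
            "(?i)<(" + names + ")\\b[^>]*(?<=/)>|"
            + "<(" + names + ")\\b[^>]*>[\\w\\W]*?</\\2>");
    }

    // Find every matching fragment and parse each one on its own.
    static List<Element> extract(String xml, String names) throws Exception {
        List<Element> nodes = new ArrayList<>();
        Matcher matcher = nodePattern(names).matcher(xml);
        while (matcher.find()) {
            nodes.add(
                DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(matcher.group())))
                    .getDocumentElement());
        }
        return nodes;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<order><properties vendor=\"Kinky Solutions\" />"
            + "<products><product><sku>SKU1</sku></product>"
            + "<product><sku>SKU2</sku></product></products></order>";
        for (Element node : extract(xml, "properties|product")) {
            System.out.println(node.getTagName());
        }
    }
}
```

Note that the word boundary (\b) after the node name is what keeps "product" from matching the surrounding "products" container.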

To develop and test our solution, first we need to create our massive XML document:

<!--- Create a very large XML file. --->
<cfsavecontent variable="strXML">
    <cfoutput>
        <order>

            <!--- Order properties. --->
            <properties
                date="September 8, 2008"
                time="13:42"
                vendor="Kinky Solutions"
                />

            <!--- Products in order. --->
            <products>

                <!---
                    Loop over a large number of "products" to
                    create a long XML file.
                --->
                <cfloop
                    index="intI"
                    from="1"
                    to="10000"
                    step="1">

                    <product>
                        <sku>SKU#intI#</sku>
                        <name>Product #intI#</name>
                        <price>#RandRange( 1, 99 )#.99</price>
                        <quantity>#RandRange( 1, 5 )#</quantity>
                    </product>

                </cfloop>

            </products>

        </order>
    </cfoutput>
</cfsavecontent>


<!--- Write the XML data to the file. --->
<cffile
    action="write"
    file="#ExpandPath( './products.xml' )#"
    output="#strXML#"
    />

Here, we are creating an ORDER XML document that starts off with a properties node and is followed by 10,000 PRODUCT nodes. I don't even know if this scenario necessarily makes sense, but it creates a large document, and that's really all that I need.

To make our solution more usable, we are going to wrap it up in a ColdFusion component, SubNodeXmlParser.cfc. However, before we get into how that component works, let's take a look at how we will be using it. Remember, we can't parse the entire XML document at once, so we need to attack it a node at a time:

<!---
    Create the Sub XML node parser. We are going to have this
    parser look for both the PROPERTIES and the PRODUCT nodes
    (by passing in a comma delimited list of node names).
--->
<cfset objParser = CreateObject(
    "component",
    "SubNodeXmlParser"
    ).Init(
        "properties, product",
        ExpandPath( "./products.xml" )
        )
    />


<!---
    Output the names of all the nodes found. We need to use a
    conditional loop since we don't know how many nodes there
    will be.
--->
<cfloop condition="true">

    <!--- Get the next node. --->
    <cfset VARIABLES.Node = objParser.GetNextNode() />

    <!---
        Check to see if the node was found. If not, then the
        variable, Node, will have been destroyed and will no
        longer exist in its parent scope.
    --->
    <cfif StructKeyExists( VARIABLES, "Node" )>

        <!--- Output name of node. --->
        <cfoutput>#VARIABLES.Node.XmlName#<br /></cfoutput>

    <cfelse>

        <!--- We are done finding nodes so break out. --->
        <cfbreak />

    </cfif>

</cfloop>

Notice first that our initialization method takes a comma delimited list of node names. This allows us to skip over large parts of the XML document, concentrating purely on the nodes in which we have an interest. To get at these nodes, we use the GetNextNode() method. This will scan the XML file as a text document and look for the next XML node pattern. Finding it, it will parse it into a small ColdFusion XML document and return the XML node.

Running the above code, we get the following output:

properties
product
product
product
product
product
.... a few thousand more times ....

As you can see, it found the Properties node as well as all of the Product nodes. When running this code, we have to run in a Conditional loop since we have no idea how large the XML document will be. Essentially, we have to keep asking the parser for more data until it runs out (and returns a VOID response).

So again, we do lose something with not being able to see the entire XML document in one view, but since you need to be inserting the data into a database, I am guessing that the piece-wise fashion will suit you just fine.

Ok, so now let's take a look at the ColdFusion component that makes this possible:

<cfcomponent
    output="false"
    hint="I help to parse large XML files by matching patterns and then only parsing sub-nodes of the document.">


    <cffunction
        name="Init"
        access="public"
        returntype="any"
        output="false"
        hint="I return an initialized object.">

        <!--- Define arguments. --->
        <cfargument
            name="Nodes"
            type="string"
            required="true"
            hint="I am the list of node names that will be parsed using regular expressions."
            />

        <cfargument
            name="XmlFilePath"
            type="string"
            required="true"
            hint="I am the file path for the large XML file to be parsed."
            />

        <cfargument
            name="BufferSize"
            type="numeric"
            required="false"
            default="#(1024 * 1024 * 5)#"
            hint="I am the size of the buffer which will be used to make reads to the input stream."
            />

        <!--- Define the local scope. --->
        <cfset var LOCAL = {} />

        <!---
            Create the regular expression pattern based on the
            node list. We have to match both standard nodes and
            self-closing nodes. The first thing we have to do is
            clean up the node list.
        --->
        <cfset LOCAL.Nodes = ListChangeDelims(
            ARGUMENTS.Nodes,
            "|",
            ", "
            ) />

        <!--- Define the pattern. --->
        <cfset LOCAL.Pattern = (
            "(?i)" &
            "<(#LOCAL.Nodes#)\b[^>]*(?<=/)>|" &
            "<(#LOCAL.Nodes#)\b[^>]*>[\w\W]*?</\2>"
            ) />

        <!--- Set up the instance variables. --->
        <cfset VARIABLES.Instance = {

            <!---
                This is the compiled version of our regular
                expression pattern. By compiling the pattern,
                it allows us to access the Matcher functionality
                later on.
            --->
            Pattern = CreateObject(
                "java",
                "java.util.regex.Pattern"
                ).Compile(
                    JavaCast( "string", LOCAL.Pattern )
                    ),

            <!---
                This is the data buffer that will hold our
                partial XML file data.
            --->
            DataBuffer = "",

            <!---
                The transfer buffer is what we will use to
                transfer data from the input file stream into
                our data buffer. It is this buffer that will
                determine the size of each file read.
            --->
            TransferBuffer = RepeatString( " ", ARGUMENTS.BufferSize ).GetBytes(),

            <!---
                This will be our buffered file input stream
                which lets us read in the large XML file a
                chunk at a time.
            --->
            InputStream = ""

            } />

        <!---
            Set up the file input stream. This buffered input
            stream will allow us to read in the XML file in
            chunks rather than as a whole.
        --->
        <cfset VARIABLES.Instance.InputStream = CreateObject(
            "java",
            "java.io.BufferedInputStream"
            ).Init(
                CreateObject(
                    "java",
                    "java.io.FileInputStream"
                    ).Init(
                        JavaCast(
                            "string",
                            ARGUMENTS.XmlFilePath
                            )
                        )
                )
            />

        <!--- Return an initialized object. --->
        <cfreturn THIS />
    </cffunction>


    <cffunction
        name="Close"
        access="public"
        returntype="void"
        output="false"
        hint="This closes the input file stream. It is recommended that you call this if you finish before all nodes have been matched.">

        <!--- Close the file input stream. --->
        <cfset VARIABLES.Instance.InputStream.Close() />

        <!--- Return out. --->
        <cfreturn />
    </cffunction>


    <cffunction
        name="GetNextNode"
        access="public"
        returntype="any"
        output="false"
        hint="I return the next node in the XML document. If no node can be found, I return VOID.">

        <!--- Define the local scope. --->
        <cfset var LOCAL = {} />

        <!--- Create a matcher for our current buffer. --->
        <cfset LOCAL.Matcher = VARIABLES.Instance.Pattern.Matcher(
            JavaCast( "string", VARIABLES.Instance.DataBuffer )
            ) />


        <!--- Try to find the next node. --->
        <cfif LOCAL.Matcher.Find()>

            <!---
                The matcher found a pattern match. Let's pull out
                the matching XML.
            --->
            <cfset LOCAL.XMLData = LOCAL.Matcher.Group() />

            <!---
                Now that we have the pattern matched, we need to
                figure out how many characters to leave in our
                buffer.
            --->
            <cfset LOCAL.CharsToLeave = (
                Len( VARIABLES.Instance.DataBuffer ) -
                (LOCAL.Matcher.Start() + Len( LOCAL.XMLData ))
                ) />

            <!---
                Check to see if we have any characters to leave
                in the buffer after this match.
            --->
            <cfif LOCAL.CharsToLeave>

                <!--- Trim the buffer. --->
                <cfset VARIABLES.Instance.DataBuffer = Right(
                    VARIABLES.Instance.DataBuffer,
                    LOCAL.CharsToLeave
                    ) />

            <cfelse>

                <!---
                    No character data should be left in the
                    buffer. Just set it to empty string.
                --->
                <cfset VARIABLES.Instance.DataBuffer = "" />

            </cfif>

            <!---
                Now that we have the buffer updated, parse the
                XML data and return the root element.
            --->
            <cfreturn
                XmlParse( Trim( LOCAL.XMLData ) )
                    .XmlRoot
                />

        <cfelse>

            <!---
                The pattern matcher could not find the next node.
                This might be because our buffer does not contain
                enough information. Let's try to read more of our
                XML file into the buffer.
            --->

            <!--- Read input stream into local buffer. --->
            <cfset LOCAL.BytesRead = VARIABLES.Instance.InputStream.Read(
                VARIABLES.Instance.TransferBuffer,
                JavaCast( "int", 0 ),
                JavaCast( "int", ArrayLen( VARIABLES.Instance.TransferBuffer ) )
                ) />

            <!---
                Check to see if we read any bytes. If we didn't
                then we have run out of data to read and cannot
                possibly match any more node patterns; just
                return void.
            --->
            <cfif (LOCAL.BytesRead EQ -1)>

                <!--- Release the file input stream. --->
                <cfset THIS.Close() />

                <!--- No more data to be matched. --->
                <cfreturn />

            <cfelse>

                <!---
                    We have read data in from the buffered file
                    input stream. Now, let's append that to our
                    internal buffer. Be sure to only move over
                    the bytes that were read - this might not
                    include the whole buffer contents. Note that
                    this treats one byte as one character, so
                    multi-byte encodings such as UTF-8 may need a
                    character-based reader instead.
                --->
                <cfset VARIABLES.Instance.DataBuffer &= Left(
                    ToString( VARIABLES.Instance.TransferBuffer ),
                    LOCAL.BytesRead
                    ) />

            </cfif>


            <!---
                Now that we have updated our buffer, we want to
                give the pattern matcher another chance to find
                the node pattern.
            --->
            <cfreturn GetNextNode() />

        </cfif>
    </cffunction>

</cfcomponent>

The code for this is quite small and straightforward. The ColdFusion component basically opens up the file as a buffered input stream and makes repeated reads to the stream until it can match a node pattern. Once it matches the node pattern, it parses it out into an XML document and returns the root node (the target node). It then goes back to the buffered input stream for more data. When it has no more data to read and no more pattern matches to make, it simply returns VOID, signaling the end of the search.
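For anyone who wants to see the same read-match-trim loop outside of ColdFusion, here is a compact sketch of it in plain Java. It is not a port of the component: the class name ChunkedNodeReader and its nextNode() method are made up for illustration, it returns the matched text rather than a parsed node, it reads characters (assuming UTF-8) instead of raw bytes, and it uses a loop where the component uses recursion.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChunkedNodeReader {
    private final Pattern pattern;
    private final Reader reader;
    private final char[] transfer;
    private final StringBuilder buffer = new StringBuilder();
    private boolean exhausted = false;

    // names is a pipe-delimited list such as "properties|product".
    ChunkedNodeReader(String names, InputStream in, int chunkSize) {
        pattern = Pattern.compile(
            "(?i)<(" + names + ")\\b[^>]*(?<=/)>|"
            + "<(" + names + ")\\b[^>]*>[\\w\\W]*?</\\2>");
        reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        transfer = new char[chunkSize];
    }

    // Returns the text of the next matching node, or null when the input is done.
    String nextNode() throws IOException {
        while (true) {
            Matcher matcher = pattern.matcher(buffer);
            if (matcher.find()) {
                String node = matcher.group();
                // Drop everything up to the end of the match; keep the rest.
                buffer.delete(0, matcher.end());
                return node;
            }
            if (exhausted) {
                return null;
            }
            // No match yet: pull another chunk into the buffer and try again.
            int read = reader.read(transfer, 0, transfer.length);
            if (read == -1) {
                exhausted = true;
                reader.close();
            } else {
                buffer.append(transfer, 0, read);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String xml = "<order><properties vendor=\"Kinky Solutions\" />"
            + "<products><product><sku>SKU1</sku></product></products></order>";
        ChunkedNodeReader nodes = new ChunkedNodeReader(
            "properties|product",
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
            16);
        for (String node = nodes.nextNode(); node != null; node = nodes.nextNode()) {
            System.out.println(node);
        }
    }
}
```

One design choice worth noting: the exhausted flag means that asking for another node after the end simply returns null again, rather than attempting a read on a stream that has already been closed.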

This solution may not be exactly what you were looking for, but at the least, I hope that it has given you some ideas.




Reader Comments

Hi Ben,

Great example and tutorial, thanks!

I also just want to let you know that we had same kind of issue with CF7 for a client in D.C. area 2 years ago and we preferred SAX with Java libraries and as I remember it was not as complicated as expected.

Just my 2 cent. :)

Reply to this Comment

I have had success using Apache Digester (which comes with CF 7+) in CF to parse a 300MB+ file. Apache Digester is an easy to use SAX parser.

Reply to this Comment

Hi Ben
Question. You state "I like to build everything that I can in ColdFusion", but you use Java to do your regexes and file ops, both of which could have been done in CF, and more simply to boot. Any reason for that?

--
Adam

Reply to this Comment

re Adam:

I'm guessing that ColdFusion's string handling is particularly slow when it comes to large streams such as the one being discussed. Using the Java string handling expedites the process dramatically.

Or I could be talking nonsense :(

Reply to this Comment

@Oguz, Kurt,

I will get around to trying those libraries one of these days. My biggest gripe, and this may be totally unfounded, is that I am not sure that they can work using ColdFusion listeners. Again, I am not talking from experience, but from what I have read, these types of libraries use the event listener model; and, since I believe it is quite difficult to invoke a ColdFusion component method from within a Java object, I assume that the listeners passed in have to be actual Java objects, not CFCs. If that is not the case, then it would make trying this much easier.

@Adam,

I am using a buffered input stream so that I don't have to read in the entire file into memory at one time. I guess I could have used a FileRead() action to do this, but frankly, I forgot that ColdFusion has that newer functionality.

As far as the regular expression parsing, using Java's Pattern and Matcher objects is actually much faster and easier to use than a pure CF solution. In ColdFusion, the regular expression find only returns the position and length of the match, and you have to manually keep looping over it with an explicit start value to get at all the matched patterns; using the Java regex utilities, looping over and getting access to all the patterns is extremely straightforward. I have to disagree with you from lots of experience that this would be easier in a pure CF solution.

I would say that using FileRead() might have been a bit easier, though.

Reply to this Comment

Yeah, it was the fileRead() thing instead of <cffile> I meant there.

In regards to the regex side of things, I realise the Java implementation offers much more power, but how you're using it here doesn't seem to be any different (except in a more convoluted way) than using a single reFind(). You're not doing multiple find() calls on the Matcher, so the benefit of having the Matcher keep track of how far down the string the find() is at is irrelevant.

For a lot of things, the Java implementation seems like a lot of unnecessary horsing around to me, but I suppose if one is using 'em all the time, it becomes second nature. I guess I need some practise!

Still, I converted your code to use native CF and the Java really is an awful lot faster (35sec to 200sec, averaged out!). Also I note that you're using a zero-width positive look-behind (*) in your regex, which CF won't accept. It seems strange that CF isn't just passing the regex straight to the underlying Java regex processor. I wonder why it sticks its beak in? Oh well.

All interesting stuff.

--
Adam

(*) I have no idea what one of those is: I just looked it up when CF errored. I was able to simplify the regex a bit so it worked with reFind()...

Reply to this Comment

@Adam,

Yeah, that's true - in this scenario, I am not really taking full advantage of the pattern matcher. However, as you point out, Java regular expressions are simply faster and I am using the positive look behind which.... (?<=/) simply means that the character "/" must exist just before this "point" in the pattern.

I agree that it does seem silly that ColdFusion doesn't just hand off the regular expression stuff to Java. Not sure why.

At the end of the day, this could have been done other ways, but I suppose I am so used to the Java pattern matcher that it just pops into my head as the first tool to try.
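To make the look-behind concrete, here is a tiny Java sketch that isolates the self-closing half of the pattern for a single, hard-coded node name. The class and method names (LookBehindSketch, isSelfClosing) are just for illustration.

```java
import java.util.regex.Pattern;

class LookBehindSketch {

    // (?<=/) asserts that the character just before the closing ">" is "/".
    // It consumes nothing; it only checks what was already matched by [^>]*.
    static final Pattern SELF_CLOSING = Pattern.compile("<product\\b[^>]*(?<=/)>");

    static boolean isSelfClosing(String tag) {
        return SELF_CLOSING.matcher(tag).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSelfClosing("<product />"));           // true
        System.out.println(isSelfClosing("<product>"));             // false
        System.out.println(isSelfClosing("<product sku=\"a/b\">")); // false
    }
}
```

The last case shows why the check sits right before the ">": a slash elsewhere in the tag, such as inside an attribute value, does not count.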

Reply to this Comment

I was able to parse approx 5,000,000 "rows/node" from a 13 GB XML file using <cfloop file="file.xml" index="currentLine"> XML parse and such </cfloop> in CF 8. My understanding is that this uses Java file streaming to get to the meat.

In this instance, the 100+ MB file sounds simple. Java burps at about 5 million with an out of memory error, however. .NET has the same problem at about the same place, so I'm not sure if there is some funky line or if the lack of memory management control (always an issue in CF) is creating the problem.

Reply to this Comment

@Crania,

Sounds like it might be a garbage collection problem. On requests that handle a large amount of information within a current request, even in small increments, I have found that ColdFusion has some trouble with garbage clean up. It seems to require a new page request to clean up some of the memory used up in the previous request. As such, I generally have to break mammoth tasks up across various page requests.

Reply to this Comment

Thanks for the reply.

I am still battling the file. I thought I would pit .Net against cf but both seem to quit at just over 5 million rows in.

Perhaps there is a utility out there which will split the file up based on newline chars.

Will post back here when I find a solution. A 1 mil line XML file worked great with the cfloop. It could be that line 5 million has a flaw which blows up heap size. :)

Reply to this Comment

@Crania,

Good luck! If you attack this in multiple page requests, I am sure you can get something working.

Reply to this Comment

Given a Sql Server back end, seems like it would be simpler and faster to load the xml doc into Sql Server and then parse it out to an 'edge table' with openXml(). Sql Server can quickly parse very large xml docs, and edge tables, like flat files, are easy to work with.

Since you can easily import xml docs into Sql Server from the query pane, I assume that it can also be done through cfquery, so grabbing the xml doc & parsing it in Sql Server is managed through cfquery with a few lines of t-sql.

Reply to this Comment

@Bill,

That sounds pretty cool. I have not used SQL server to parse XML documents before. Thanks for the tip.

Reply to this Comment

Some great ideas. I actually tried to break the doc up initially and found something stupid - after row 5,000,000 (ish) the file was full of [spaces] - so the "line" that both .NET and CF were blowing up on was one 5GB line of spaces. Once I split the file up and opened up some of the smaller files in a text editor it was easy to spot.

I'm quite happy with how the cfloop performed on a large file - was pretty skeptical, but it made it through about 5GB of data without a hiccup.

Reply to this Comment

Crania, do you have any examples of this? The XML feed that I am being given does not contain any line breaks and so it does not seem to want to loop over it correctly.

Reply to this Comment

@Crania - that's great, but it's not actually handling any of the data, just spitting out the individual lines. Both CF8 and Railo can do this very efficiently like this:

  • <cfscript>
  • myfile = FileOpen(ExpandPath( "./myFile.xml" ));
  • while(NOT FileIsEOF(myfile))
  • {
  • WriteOutput(FileReadLine(myfile)&"<br>");
  • }
  • FileClose(myfile);
  • </cfscript>

@Ben is there a way of doing getPrevious() in your code?

If I can read one xml "node" at a time, it would be nice to find a way to paginate back and forth between nodes.

I was wondering about finding a way to "index" a large XML file and then retrieve a specific node. A bit like a Master/Detail view, using XML (not by choice I must add) as my data set.

Cheers
Marty

Reply to this Comment

@Martin,

I suppose you could keep a prevNode reference after every parsing; it might be simple, or it might get complicated depending on your needs.

Thanks for pointing out those file-based methods, though. I am not sure that I have used those ones before.

Reply to this Comment

@Ben,

I found some new file functions that were added in CF8: http://livedocs.adobe.com/coldfusion/8/htmldocs/functions-pt0_20.html#1100017

It appears reading individual XML lines (or any other "BIG" file) with CF is really quick and easy HOWEVER getting ColdFusion to recognise those lines as XML nodes without parsing the whole document is much harder.
Your Java regex search pattern is obviously the way to go Ben - very smart indeed.

Reply to this Comment

@Martin,

CF8 definitely had some sweet updates. That kind of file looping can now also be done using the CFLoop tag. I assume these functions are simply the script-based equivalent of what the CFLoop tag is doing. Good to know them.

Reply to this Comment

Simply genial!
My previous solution for parsing large xml files was to develop a dll in visual basic calling Microsoft SAXXMLReader60. It works, but it is Windows dependent and extremely difficult to manage. In the previous scenario, every time I had to apply a modification, I had to open the vb project, recompile the dll, uninstall the old dll, install the new dll and reload coldfusion! Your solution is a bit slower, but really, Really, REALLY easy to manage.

Thanks

P.S. free beers for you if you come in Italy someday!

Paolo

Reply to this Comment

@Paolo,

Yeah, I've been told that using Java or some compiled DLL is going to be faster; but, I am glad that you are finding this to be more manageable as it is in the native language! Awesome.

Reply to this Comment

Ben:

I have tried your code example and always get an error.

With all the code, the error is 'Invalid CFML Construct.CFML was looking at the following text...<' on the line where <cfcomponent...> starts, indicating an error in the earlier code.

If I remove the code from <cfcomponent...> to the end, then the error is

"Could not find the ColdFusion Component or Interface SubNodeXmlParser"

Any suggestions as to what is causing this?

Ted

Reply to this Comment

I tried adding to the DB during the loop like follows... but it only adds a few products in the loop... Seems like it's not getting enough time to write to DB before the loop "move on". Is that possible?

<cfloop condition="true">
<cfset VARIABLES.Node = objParser.GetNextNode() />
<cfif StructKeyExistS( VARIABLES, "Node" )>

<cfset product = Node.XmlChildren>
<cfset ppVar = "#product[10].xmlText#">

<!--- Save to DB if previousPrice is defined --->
<cfif ppVar GTE 1>

<cfscript>
//writeDump(product);
//writeDump(product[1].xmlText);
item = new products();
//item.setextproductid(product[7].xmlText);
item.setname(product[1].xmlText);
item.setdescription(product[4].xmlText);
item.setprice(product[5].xmlText);
item.setpreviousprice(product[10].xmlText);
item.setproducturl(product[2].xmlText);
entitySave(item);
writeOutput(item.getname() & " saved...<br/>");
</cfscript>

<cfelse>
</cfif>
<cfelse>
<cfbreak />
</cfif>
</cfloop>

Reply to this Comment

Thanks Ben, I was looking into using a SAX parser, but this solution works for me. I was able to parse a 7400 node XML file, build and populate a db schema in under 10 seconds using your method.

Reply to this Comment

Ben,

Let me add my voice to the chorus of people who found this solution immensely helpful. I lifted your regular expression from the cfc and applied the Java Pattern Matcher technique over a number of iterations to break large XML files into database-sized chunks.

It's been working great for me in 99% of cases, but I've got a corner case that has me stumped.
I'm using the following expressions for the Matcher:

<cfset oddREpattern = ("(?i)" & "<(head|p|list)\b[^>]*(?<=/)>|" & "<(head|p|list)\b[^>]*>[\w\W]*?</\2>")>

and, for list nodes returned, I then do

<cfset oddListREpattern = ("(?i)" & "<(head|p|item)\b[^>]*(?<=/)>|" & "<(head|p|item)\b[^>]*>[\w\W]*?</\2>")>

This works great for pulling most of the list items and their associated headers and commentary, except in one recursive case where list items contain other lists that also have items. The first expression doesn't see the list node at all in this case.

An abbreviated version of the offending section of XML follows. When I parse it using the first Pattern, only the head node is returned by the Matcher. Can you suggest a way to get the Matcher to find the list and its items? I'm OK with not parsing the inner lists and just returning the outer list items, but the recursive structure makes the outer list unmatchable.

Ultimately I'm planning to put these into a simple database table that, beyond primary and foreign keys, has just a varchar field to hold the text content of head, p, and item tags, an increment to keep lists separate, a line number to keep the list order, and a bit field that tells me if I'm in a sub-list and need to indent deeper.

The XML:

<odd type="index">
<head>Richard Aldington Collection--Index of Works</head>
<list type="simple">
<item>
<title render="italic" linktype="simple">Enquiry</title>
<list type="simple">
<item> Review of
<title render="italic" linktype="simple"> T.E. Lawrence: A Biographical Enquiry </title>
by Richard Aldington--2.9 </item>
<item> Summary of Wills and Settlements made by various members of the Chapman family--2.9 </item>
</list>
</item>
<item> Patmore, Derek
<list type="simple">
<item>
<title render="doublequote" linktype="simple">The Poetry and Prose of Richard
Aldington</title> (radio broadcast)--2.9 </item>
</list>
</item>
</list>
</odd>

Thanks again for your time and help,

Reply to this Comment

I also had problems with parsing large XML files. I tried Expresso XML Parser. It can parse files up to 35GB and it's really fast. It's really easy to use. You set up parsing rules on a website and test your file online and then use their client code to access your parsing rules from java or javascript. They have a free developer version at www.sxml.com.au

Reply to this Comment

Hi Ben,
I find your blog stimulating and informative. It seems most of the time when I am looking for an idea on how to approach a particular problem, you have already tackled it.

I am working on parsing a rather large XML document so I naturally found this page. I have one small wrinkle to add to the problem. In my case the XML may contain sub nodes of the same type. E.g.

  • <value type="object">
  • <options>
  • <list>
  • <value type="object">
  • <vendor>Google</vendor>
  • <id type="int">101</id>
  • </value>
  • </list>
  • </options>
  • </value>

When I try parsing with your code here, I bomb out when I hit the embedded value tag. The code tries to pass the partial node to XmlParse which complains that the start and end tags don't match.

Do you see any way to handle this sort of parsing chore on large XML documents?

Thanks for your thoughts
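The failure described here follows from the pattern itself: the lazy [\w\W]*?</\2> stops at the first closing tag it meets, so a node nested inside a node of the same name gets cut short. One possible workaround, sketched below in plain Java under loud assumptions, is to scan only the opening and closing tags of the target node and count depth until the outermost one closes. The names (BalancedNodeSketch, firstBalanced) are hypothetical, the sketch works on an in-memory string, and it would be fooled by tag-like text inside comments or CDATA sections.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BalancedNodeSketch {

    // Return the first complete <name>...</name> (or self-closing <name/>)
    // fragment, counting nested nodes of the same name. Returns null if the
    // text ends before the outermost node is closed.
    static String firstBalanced(String xml, String name) {
        // Group 1 is "/" for a closing tag; group 2 is "/" for a self-closing tag.
        Pattern tag = Pattern.compile(
            "<(/?)" + Pattern.quote(name) + "\\b[^>]*?(/?)>",
            Pattern.CASE_INSENSITIVE);
        Matcher matcher = tag.matcher(xml);
        int depth = 0;
        int start = -1;

        while (matcher.find()) {
            boolean closing = !matcher.group(1).isEmpty();
            boolean selfClosing = !matcher.group(2).isEmpty();

            if (selfClosing && !closing) {
                // A self-closing node never changes the depth.
                if (depth == 0) {
                    return matcher.group();
                }
                continue;
            }
            if (!closing) {
                if (depth == 0) {
                    start = matcher.start();
                }
                depth++;
            } else if (depth > 0 && --depth == 0) {
                return xml.substring(start, matcher.end());
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String xml = "<value type=\"object\"><options><list>"
            + "<value type=\"object\"><vendor>Google</vendor>"
            + "<id type=\"int\">101</id></value>"
            + "</list></options></value>";
        System.out.println(firstBalanced(xml, "value"));
    }
}
```

In a chunked reader, a null result would be the cue to read more data and try again, and the returned fragment is well-formed enough to hand to an XML parser.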

Reply to this Comment

Ben 2 the rescue! Grrrreat! This works like a charm on a 190MB large XML-file ;-) Thanx a million Ben for this insightful article and completely different take on parsing XML-documents. Now I just have to figure out how to extract the data into the database without doing the cumbersome VARIABLES.Node.tagname.XmlText all the time ;-)

Reply to this Comment

Looking at my first ever XML document that I have to parse and put into MS SQL 2000 with CF8.

I get it to list the desired field name, many times over, and have a long list of this field name displayed, like you show using your SubNodeXMLParser (once I added a <CFOUTPUT> tag).

Can you explain where I go from here to get the actual data out of the XML file into the database in, I presume, some sort of a loop process.

Reply to this Comment

I had a problem with reading a UTF-8 XML file. This was only within a cfthread with Railo 4, but I'd still like to share what I did to fix it and speed up the XML reading in the process.

I used BufferedReader instead of BufferedInputStream. The main difference is that it reads chars instead of raw bytes. This way we can specify which charset should be used.

I added this argument:
<cfargument name="CharSet" type="string" required="false" default="UTF-8" hint="charset of the xml file">

Changed this:
TransferBuffer = RepeatString(" ",ARGUMENTS.BufferSize).toCharArray(),

And this:
<cfset VARIABLES.Instance.InputStream = CreateObject("java","java.io.BufferedReader").Init(
CreateObject("java","java.io.InputStreamReader").Init(
CreateObject("java","java.io.FileInputStream").Init(JavaCast("string",ARGUMENTS.XmlFilePath)),JavaCast("string",ARGUMENTS.CharSet)),ARGUMENTS.BufferSize)>
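The reason a character-based reader matters can be shown with a short Java sketch. When a multi-byte UTF-8 character straddles two reads, decoding each byte chunk on its own corrupts it, while an InputStreamReader keeps the partial bytes until the rest arrive. The class and method names here (CharsetReadSketch, readWithReader, readChunkByChunk) are made up for the example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetReadSketch {

    // Decode through a Reader: characters split across reads are reassembled.
    static String readWithReader(InputStream in, Charset charset, int chunkSize) throws IOException {
        Reader reader = new BufferedReader(new InputStreamReader(in, charset));
        char[] transfer = new char[chunkSize];
        StringBuilder text = new StringBuilder();
        int read;
        while ((read = reader.read(transfer, 0, transfer.length)) != -1) {
            text.append(transfer, 0, read);
        }
        reader.close();
        return text.toString();
    }

    // Decode every byte chunk separately: a character split across two
    // chunks turns into replacement characters.
    static String readChunkByChunk(InputStream in, Charset charset, int chunkSize) throws IOException {
        byte[] transfer = new byte[chunkSize];
        StringBuilder text = new StringBuilder();
        int read;
        while ((read = in.read(transfer, 0, transfer.length)) != -1) {
            text.append(new String(transfer, 0, read, charset));
        }
        in.close();
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String xml = "<name>Caf\u00e9</name>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        // With 10-byte chunks, the two bytes of the accented "e" land in different reads.
        System.out.println(xml.equals(
            readWithReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8, 10)));
        System.out.println(xml.equals(
            readChunkByChunk(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8, 10)));
    }
}
```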

Reply to this Comment

Having a bit of an issue. Everything runs fine until it gets to the last entry in the XML file and then I get a 'Stream closed' error.

Any ideas?

Thanks everyone!

Reply to this Comment

Thanks for the suggestion.

Doesn't look like it's an issue of XML size. I ran the code using an XML file with 10 entries and received the error.

'Stream closed'

Reply to this Comment

@Robert, I'm not saying it is an issue with the XML size. I'm saying I'm using a different way of reading the XML file, and maybe that different way (BufferedReader) helps you with your "stream closed" issue.

Reply to this Comment

I noticed that in all the examples the XML document is accessed locally. My problem is that I am accessing a 3MB XML document from a URL. So it would seem my long processing time is coming from cfhttp downloading the file into memory before it writes it locally, where I then use the aforementioned code to process the file.

Is it possible to "stream" from a URL as it has been demonstrated from a physical file?

Reply to this Comment
