ColdFusion 10 - Parsing Dirty HTML Into Valid XML Documents
As I blogged earlier, ColdFusion 10 now supports XPath 2.0 in the xmlSearch() and xmlTransform() functions. This might not sound like a very exciting upgrade; however, when you realize that ColdFusion 10 now enables the parsing of "dirty" HTML code into valid XML documents, suddenly, the world of XML becomes a lot more interesting.
NOTE: At the time of this writing, ColdFusion 10 was in public beta.
ColdFusion 10 doesn't provide a native htmlParse() method; however, ColdFusion 10 now ships with the TagSoup 1.2 library pre-installed. This means that we can now instantiate the TagSoup classes and use them to convert our HTML documents into valid XML documents. And, of course, once we do that, we can use xmlSearch() to easily extract elements from our target HTML source code.
To demonstrate this functionality, I'm going to create "dirty" HTML content and then parse it into a searchable XML document. When I use the term "dirty," I simply mean that the HTML will have things like missing close-tags, missing attribute quotes, poor nesting, and upper-case element and attribute names.
As you can see, the HTML code is pretty sloppy. Even so, we take our HTML document, run it through our user-defined htmlParse() function, and then search the resultant XML document for various elements. When we run the above code, we get the following page output:

As you can see, the dirty HTML was successfully parsed into a valid ColdFusion XML document, which we were able to search with XPath 2.0 and xmlSearch(). The TagSoup library was able to convert our element and attribute names to lowercase, handle tags that don't require closing (e.g. BR and IMG), and close tags that were improperly left open.
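To make the "search after cleanup" step concrete at the Java level, here is a minimal sketch of running XPath queries against markup that is already well-formed and lowercase. Everything in it is an assumption for illustration: the class name CleanedHtmlSearch, the search() helper, and the CLEANED sample string are hypothetical, and the sample is not TagSoup's actual output. The JDK's built-in javax.xml.xpath engine implements XPath 1.0, so the expressions stay within that subset rather than the XPath 2.0 that xmlSearch() supports.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CleanedHtmlSearch {

    // Hypothetical sample: the kind of well-formed, lowercase markup a
    // cleanup pass might produce. This is NOT TagSoup's actual output.
    static final String CLEANED =
        "<html><body>"
        + "<p class=\"intro\">Hello <b>world</b><br/></p>"
        + "<img src=\"a.png\"/><img src=\"b.png\"/>"
        + "</body></html>";

    // Evaluate an XPath 1.0 expression against the markup and return the
    // text value of every matching node (element text or attribute value).
    static List<String> search(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("XPath search failed", e);
        }
    }

    public static void main(String[] args) {
        // Attribute values of every image, then the bold text in the intro.
        System.out.println(search(CLEANED, "//img/@src"));
        System.out.println(search(CLEANED, "//p[@class='intro']/b"));
    }
}
```

XPath name tests are case-sensitive, so a query such as //IMG finds nothing here. That is why the lowercase normalization matters: queries can be written one way regardless of how the original HTML was cased.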
The TagSoup library, on its own, is nothing new. I tried playing around with it a few years ago, loading it into the ColdFusion context with a Groovy class loader. The difference here is that TagSoup now ships with ColdFusion 10. Of course, now that ColdFusion 10 allows per-application Java Class loading, this becomes much less of an issue. But still, pretty cool!
Reader Comments
@All,
ColdFusion 10 also appears to ship with the NekoHTML parser:
http://nekohtml.sourceforge.net/
However, from some brief experimentation, I was getting better results with less effort from the TagSoup parser.
Very cool! I love the idea of being able to reliably parse HTML pages into XML data.
Really Cool! That's really useful for all your scraping needs!
@Ben,
This may give XHTML the biggest boost it's ever gotten! Rapid Application Development + cleanup = actual use!
@Ben,
"Of course, now that ColdFusion 10 allows per-application Java Class loading"
What is this new witchcraft you mention?
Very cool. However, I have been using Railo for a while, and it also comes with an htmlParse() function that converts a string to an XML doc.
Anyway, it is a bit annoying that once you have an XML doc, you cannot really convert it back to the original HTML string perfectly with toString(xml), because toString() adds indentation and line breaks to the XML.
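On the indentation point, one possible workaround is to serialize the DOM through a JAXP Transformer with indenting switched off. The sketch below is a minimal illustration, assuming the XML object can be handed to Java as an org.w3c.dom.Document; the class name CompactSerializer and its helper methods are hypothetical. It only avoids added whitespace. It does not restore the original "dirty" markup, since the parser has already normalized that away.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CompactSerializer {

    // Serialize a DOM document without pretty-printing and without the
    // leading XML declaration.
    static String serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "no");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    // Helper for the demo only: build a DOM from a well-formed string.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(parse("<p>Hi <b>there</b></p>")));
    }
}
```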
Hi Ben,
I know all the buzz is about CF10 right now but readers may be interested to know that your example works under CF9 using JavaLoader to load TagSoup.
I'm getting errors on larger documents but will need to test with CF10 to see if that's version specific or just an issue with the library.
Great post!
Rob
Ben, I'm getting this error from your code:
Unable to find a constructor for class com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM that accepts parameters of type ( '' ).
I'm guessing the path in
var saxDomBuilder = createObject( "java", "com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM" ).init();
is incorrect for my server. How do I find out what it should be? Or what is my issue? Thanks
Hi Ben - your code works in CF10 Ent with a slight update to the line:
var saxDomBuilder = createObject( "java", "com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM" ).init();
Edit this to read ...init(true);
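Another way to sidestep the constructor mismatch may be to avoid the JDK-internal SAX2DOM class altogether. The sketch below is a minimal Java illustration, assuming the standard JAXP TransformerHandler is an acceptable SAX-to-DOM bridge. The names SaxToDom, buildDom, and standInReader are hypothetical, and the JDK's own SAX parser stands in for TagSoup so the example runs without extra JARs. TagSoup's org.ccil.cowan.tagsoup.Parser implements XMLReader, so it should be possible to pass it in place of the stand-in reader, though that swap is untested here.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class SaxToDom {

    // Feed the SAX events emitted by any XMLReader into a DOM tree using
    // only public JAXP classes (no com.sun.* internals).
    static Document buildDom(XMLReader reader, String markup) {
        try {
            SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
            // An identity handler: it copies incoming SAX events to the result.
            TransformerHandler handler = factory.newTransformerHandler();
            DOMResult result = new DOMResult();
            handler.setResult(result);
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(markup)));
            return (Document) result.getNode();
        } catch (Exception e) {
            throw new IllegalStateException("SAX-to-DOM conversion failed", e);
        }
    }

    // Stand-in for TagSoup: the JDK's strict SAX parser. Unlike TagSoup,
    // it accepts only well-formed input.
    static XMLReader standInReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("Could not create SAX reader", e);
        }
    }

    public static void main(String[] args) {
        Document doc = buildDom(standInReader(),
            "<html><body><p class=\"note\">hi</p></body></html>");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Because the bridge is obtained through the public factory, the code does not depend on which constructor a particular JDK build gives the internal class.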
What I found with TagSoup is that it fails when an element has an id attribute.
For example <table id="mytable"> will result in an error.
Here is also a good article on comparisons:
http://www.benmccann.com/blog/java-html-parsing-library-comparison/
Thanks