Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.
I am the chief technical officer at InVision App, Inc - a prototyping and collaboration platform for designers, built by designers. I also rock out in JavaScript and ColdFusion 24x7.

ColdFusion 10 - Parsing Dirty HTML Into Valid XML Documents

By Ben Nadel
Tags: ColdFusion

As I blogged earlier, ColdFusion 10 now supports XPath 2.0 in the xmlSearch() and xmlTransform() functions. This might not sound like a very exciting upgrade; however, once you realize that ColdFusion 10 also makes it possible to parse "dirty" HTML code into valid XML documents, the world of XML suddenly becomes a lot more interesting.

NOTE: At the time of this writing, ColdFusion 10 was in public beta.

ColdFusion 10 doesn't provide a native htmlParse() method; however, ColdFusion 10 now ships with the TagSoup 1.2 library pre-installed. This means that we can now instantiate the TagSoup classes and use them to convert our HTML documents into valid XML documents. And, of course, once we do that, we can use xmlSearch() to easily extract elements from our target HTML source code.

To demonstrate this functionality, I'm going to create "dirty" HTML content and then parse it into a searchable XML document. When I use the term "dirty," I simply mean that the HTML will have things like missing close-tags, missing attribute quotes, poor nesting, and upper-case element and attribute names.

<!---
    Create our "dirty" HTML document. Dirty in the sense that it
    cannot be parsed as valid XML. In order to make this document
    "bad", we'll have tags that don't self-close and perhaps a
    missing close-tag or two.
--->
<cfsavecontent variable="dirtyHtml">

    <!doctype html>
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>Dana Linn Bailey</title>
        <meta name="description" content="Strong female muscle, FTW!">
        <meta name="keywords" content="female muscle,femmuscle,sexy">
    </head>
    <body>

        <h1>
            Dana Linn Bailey
        </h1>

        <h2>
            Professional Bodybuilder
        </h2>

        <p>
            <IMG
                SRC="//www.danalinn.com/images/photos/DanaLinnBailey_3.jpg"
                ALT="Dana Linn Bailey"
                HEIGHT=250>
            <br>
        </p>

        <h3>
            Professional Services
        </h3>

        <ul>
            <li>Full Contest Preparation
            <li>12-Week Weight Management Program
            <li>ONE-TIME Personalized Diet Plan
            <li>ONE-TIME Personalized Week Training Program
            <li>Train with DLB herself!!!
        </ul>

        <h2>
            Biography
        </h2>

        <p>
            I grew up a jock. At age 6, I was already on the swim
            team, waking up and going to practice just like the big
            kids. Up until high school, I was a 6-sport athlete all
            year round, playing soccer, basketball, field hockey,
            softball, running track and also swim team. In high
            school I continued with my 3 favorite sports, soccer,
            basketball, and field hockey and excelled in all with
            many awards.

        <p>
            <a href=http://www.danalinn.com/about.html>Read More</a>.
        </p>

    </body>
    </html>

</cfsavecontent>


<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->


<cfscript>


    // I take an HTML string and parse it into an XML(XHTML)
    // document. This is returned as a standard ColdFusion XML
    // document.
    function htmlParse( htmlContent, disableNamespaces = true ){

        // Create an instance of the Xalan SAX2DOM class as the
        // recipient of the TagSoup SAX (Simple API for XML) compliant
        // events. TagSoup will parse the HTML and announce events as
        // it encounters various HTML nodes. The SAX2DOM instance will
        // listen for such events and construct a DOM tree in response.
        // NOTE: Per the reader comments below, some servers reject the
        // no-argument constructor and need init( true ) here instead.
        var saxDomBuilder = createObject( "java", "com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM" ).init();

        // Create our TagSoup parser.
        var tagSoupParser = createObject( "java", "org.ccil.cowan.tagsoup.Parser" ).init();

        // Check to see if namespaces are going to be disabled in the
        // parser. If so, then they will not be added to elements.
        if (disableNamespaces){

            // Turn off namespaces - they are lame and nobody likes
            // to perform xmlSearch() methods with them in place.
            tagSoupParser.setFeature(
                tagSoupParser.namespacesFeature,
                javaCast( "boolean", false )
            );

        }

        // Set our DOM builder to be the listener for SAX-based
        // parsing events on our HTML.
        tagSoupParser.setContentHandler( saxDomBuilder );

        // Create our content input. The InputSource encapsulates the
        // means by which the content is read.
        var inputSource = createObject( "java", "org.xml.sax.InputSource" ).init(
            createObject( "java", "java.io.StringReader" ).init( htmlContent )
        );

        // Parse the HTML. This will trigger events which the SAX2DOM
        // builder will translate into a DOM tree.
        tagSoupParser.parse( inputSource );

        // Now that the HTML has been parsed, we have to get a
        // representation that is similar to the XML document that
        // ColdFusion users are used to having. Let's search for the
        // ROOT document and return it.
        return(
            xmlSearch( saxDomBuilder.getDom(), "/node()" )[ 1 ]
        );

    }


    // ------------------------------------------------------ //
    // ------------------------------------------------------ //


    // Parse the "dirty" HTML into a valid XML document.
    xhtml = htmlParse( dirtyHtml );

    // Query for the head contents.
    headContents = xmlSearch( xhtml, "/html/head/*" );

    // Query for the body contents.
    bodyContents = xmlSearch( xhtml, "/html/body/*" );

    // Output the two values.
    writeDump( headContents );
    writeDump( bodyContents );


</cfscript>

As you can see, the HTML code is pretty sloppy. And still, we take our HTML document, run it through htmlParse(), and then search the resultant XML document for various elements. When we run the above code, we get the following page output:

[Figure: Parsing HTML code into XML documents using ColdFusion 10 and TagSoup.]

As you can see, the dirty HTML was successfully parsed into a valid ColdFusion XML document, which we were then able to search with XPath 2.0 and xmlSearch(). The TagSoup library converted our element and attribute names to lowercase, handled tags that don't require closing (e.g. BR and IMG), and closed tags that were improperly left open.
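To make the clean-up a bit more concrete, here is a minimal sketch of a few follow-up queries against the parsed document. It assumes the htmlParse() function and the dirtyHtml variable from the listing above are already in scope; the variable names imageSources and services are just illustrative, and this is only one way such queries might be written.

```cfml
<cfscript>

    // ASSUMPTION: htmlParse() and dirtyHtml are defined as in the listing above.
    xhtml = htmlParse( dirtyHtml );

    // TagSoup lower-cased the attribute names, so we query for @src
    // rather than the @SRC that appeared in the original markup.
    imageSources = xmlSearch( xhtml, "//img/@src" );

    if (arrayLen( imageSources )){

        // Attribute nodes expose their value through xmlValue.
        writeOutput( "Image: " & htmlEditFormat( imageSources[ 1 ].xmlValue ) & "<br />" );

    }

    // The unclosed LI tags were closed by the parser, so each list
    // item should come back as its own element node.
    services = xmlSearch( xhtml, "//ul/li" );

    for (service in services){

        writeOutput( htmlEditFormat( trim( service.xmlText ) ) & "<br />" );

    }

    writeOutput( "Services found: " & arrayLen( services ) );

</cfscript>
```

Because namespaces are disabled by default in htmlParse(), the XPath expressions can use plain element names like img and li without any namespace prefixes.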

The TagSoup library, on its own, is nothing new. I tried playing around with it a few years ago, loading it into the ColdFusion context with a Groovy class loader. The difference here is that TagSoup now ships with ColdFusion 10. Of course, now that ColdFusion 10 allows per-application Java Class loading, this becomes much less of an issue. But still, pretty cool!
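For anyone curious what per-application class loading looks like, here is a rough sketch of an Application.cfc that uses the javaSettings feature introduced in ColdFusion 10. The application name and the ./lib folder are hypothetical placeholders; the idea is simply that any JAR files dropped into that folder become available to createObject( "java" ) within this one application.

```cfml
component {

    // ASSUMPTION: hypothetical application name.
    this.name = "HtmlParsingDemo";

    // Per-application Java class loading (ColdFusion 10+).
    this.javaSettings = {
        // ASSUMPTION: hypothetical folder holding the JAR files to load.
        loadPaths = [ "./lib" ],
        // Also make ColdFusion's own class path available to the loader.
        loadColdFusionClassPath = true,
        // Don't watch the folder for changed JAR files.
        reloadOnChange = false
    };

}
```

With something like this in place, a newer or different build of a Java library could be loaded for one application without touching the server-wide class path.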




Reader Comments

@Ben,

"Of course, now that ColdFusion 10 allows per-application Java Class loading"

What is this new witchcraft you mention ?

Reply to this Comment

Very cool. However I have been using Railo for a while, and it also comes with a function htmlParse() to convert string to XML Doc.

Anyway, it is a bit annoying that once you have an XML doc, you cannot convert it back to the original HTML string perfectly with toString(xml), because toString() adds its own XML indentation and line breaks.

Reply to this Comment

Hi Ben,

I know all the buzz is about CF10 right now but readers may be interested to know that your example works under CF9 using JavaLoader to load TagSoup.

I'm getting errors on larger documents but will need to test with CF10 to see if that's version specific or just an issue with the library.

Great post!

Rob

Reply to this Comment

Ben I'm getting this error from your code.

Unable to find a constructor for class com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM that accepts parameters of type ( '' ).

I'm guessing the path in

var saxDomBuilder = createObject( "java", "com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM" ).init();

is incorrect for my server. How do I find out what it should be? Or what is my issue? Thanks

Reply to this Comment

Hi Ben - your code works in CF10 Ent with a slight update to the line:
var saxDomBuilder = createObject( "java", "com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM" ).init();

Edit this to read ...init(true);

What I found with TagSoup is that it fails when a DOM element has an id attribute.
For example <table id="mytable"> will result in an error.

Here is also a good article on comparisons:
http://www.benmccann.com/blog/java-html-parsing-library-comparison/

Thanks

Reply to this Comment
