ColdFusion 10 - Parsing Dirty HTML Into Valid XML Documents

Posted February 28, 2012 at 9:50 AM by Ben Nadel

Tags: ColdFusion

As I blogged earlier, ColdFusion 10 now supports XPath 2.0 in the xmlSearch() and xmlTransform() functions. This might not sound like a very exciting upgrade; however, when you realize that ColdFusion 10 now enables the parsing of "dirty" HTML code into valid XML documents, suddenly, the world of XML becomes a lot more interesting.

NOTE: At the time of this writing, ColdFusion 10 was in public beta.

ColdFusion 10 doesn't provide a native htmlParse() method; however, ColdFusion 10 now ships with the TagSoup 1.2 library pre-installed. This means that we can now instantiate the TagSoup classes and use them to convert our HTML documents into valid XML documents. And, of course, once we do that, we can use xmlSearch() to easily extract elements from our target HTML source code.

To demonstrate this functionality, I'm going to create "dirty" HTML content and then parse it into a searchable XML document. When I use the term "dirty," I simply mean that the HTML will have things like missing close-tags, missing attribute quotes, poor nesting, and upper-case element and attribute names.
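For contrast, it may help to see what a strict XML parser does with this kind of markup. The following is a minimal sketch in plain Java (the platform that ColdFusion runs on), using only the JDK's built-in parser; the class and method names are hypothetical and are not part of ColdFusion or TagSoup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
class StrictParseSketch {

    // Returns true only if a strict XML parser accepts the markup as well-formed.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The markup is not well-formed XML.
            return false;
        } catch (Exception e) {
            throw new IllegalStateException("Parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        // Properly closed list items.
        System.out.println(isWellFormedXml("<ul><li>One</li><li>Two</li></ul>"));
        // Missing close-tags on the list items.
        System.out.println(isWellFormedXml("<ul><li>One<li>Two</ul>"));
        // Unquoted attribute value and an unclosed IMG tag.
        System.out.println(isWellFormedXml("<p><img height=250></p>"));
    }
}
```

Only the first input is well-formed; the other two are rejected, and that is the gap a forgiving parser like TagSoup is meant to close.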

<!---
    Create our "dirty" HTML document. Dirty in the sense that it
    cannot be parsed as valid XML. In order to make this document
    "bad", we'll have tags that don't self-close and perhaps a
    missing close-tag or two.
--->
<cfsavecontent variable="dirtyHtml">

    <!doctype html>
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>Dana Linn Bailey</title>
        <meta name="description" content="Strong female muscle, FTW!">
        <meta name="keywords" content="female muscle,femmuscle,sexy">
    </head>
    <body>

        <h1>
            Dana Linn Bailey
        </h1>

        <h2>
            Professional Bodybuilder
        </h2>

        <p>
            <IMG
                SRC="//www.danalinn.com/images/photos/DanaLinnBailey_3.jpg"
                ALT="Dana Linn Bailey"
                HEIGHT=250>
            <br>
        </p>

        <h3>
            Professional Services
        </h3>

        <ul>
            <li>Full Contest Preparation
            <li>12-Week Weight Management Program
            <li>ONE-TIME Personalized Diet Plan
            <li>ONE-TIME Personalized Week Training Program
            <li>Train with DLB herself!!!
        </ul>

        <h2>
            Biography
        </h2>

        <p>
            I grew up a jock. At age 6, I was already on the swim
            team, waking up and going to practice just like the big
            kids. Up until high school, I was a 6-sport athlete all
            year round, playing soccer, basketball, field hockey,
            softball, running track and also swim team. In high
            school I continued with my 3 favorite sports, soccer,
            basketball, and field hockey and excelled in all with
            many awards.

        <p>
            <a href=http://www.danalinn.com/about.html>Read More</a>.
        </p>

    </body>
    </html>

</cfsavecontent>


<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->


<cfscript>


    // I take an HTML string and parse it into an XML (XHTML)
    // document. This is returned as a standard ColdFusion XML
    // document.
    function htmlParse( htmlContent, disableNamespaces = true ){

        // Create an instance of the Xalan SAX2DOM class as the
        // recipient of the TagSoup SAX (Simple API for XML) compliant
        // events. TagSoup will parse the HTML and announce events as
        // it encounters various HTML nodes. The SAX2DOM instance will
        // listen for such events and construct a DOM tree in response.
        // NOTE: This is a JDK-internal class, so its availability and
        // constructor signature may differ on other JVM versions.
        var saxDomBuilder = createObject( "java", "com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM" ).init();

        // Create our TagSoup parser.
        var tagSoupParser = createObject( "java", "org.ccil.cowan.tagsoup.Parser" ).init();

        // Check to see if namespaces are going to be disabled in the
        // parser. If so, then they will not be added to elements.
        if (disableNamespaces){

            // Turn off namespaces - they are lame and nobody likes
            // to perform xmlSearch() methods with them in place.
            tagSoupParser.setFeature(
                tagSoupParser.namespacesFeature,
                javaCast( "boolean", false )
            );

        }

        // Set our DOM builder to be the listener for SAX-based
        // parsing events on our HTML.
        tagSoupParser.setContentHandler( saxDomBuilder );

        // Create our content input. The InputSource encapsulates the
        // means by which the content is read.
        var inputSource = createObject( "java", "org.xml.sax.InputSource" ).init(
            createObject( "java", "java.io.StringReader" ).init( htmlContent )
        );

        // Parse the HTML. This will trigger events which the SAX2DOM
        // builder will translate into a DOM tree.
        tagSoupParser.parse( inputSource );

        // Now that the HTML has been parsed, we have to get a
        // representation that is similar to the XML document that
        // ColdFusion users are used to having. Let's search for the
        // ROOT document and return it.
        return(
            xmlSearch( saxDomBuilder.getDom(), "/node()" )[ 1 ]
        );

    }


    // ------------------------------------------------------ //
    // ------------------------------------------------------ //


    // Parse the "dirty" HTML into a valid XML document.
    xhtml = htmlParse( dirtyHtml );

    // Query for the head contents.
    headContents = xmlSearch( xhtml, "/html/head/*" );

    // Query for the body contents.
    bodyContents = xmlSearch( xhtml, "/html/body/*" );

    // Output the two values.
    writeDump( headContents );
    writeDump( bodyContents );


</cfscript>

As you can see, the HTML code is pretty sloppy. And still, we take our HTML document, run it through htmlParse(), and then search the resultant XML document for various elements. When we run the above code, we get the following page output:


 
 
 

 
[Screenshot: Parsing HTML code into XML documents using ColdFusion 10 and TagSoup.]
 
 
 

As you can see, the dirty HTML was successfully parsed into a valid ColdFusion XML document, which we were able to search with XPath 2.0 and xmlSearch(). The TagSoup library was able to convert our element and attribute names to lowercase, handle tags that don't require closing (i.e., BR and IMG), and close tags that were improperly left open.
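The short paths used above, like "/html/head/*", depend on namespaces being turned off in the parser. As a rough illustration of why, here is a sketch in plain Java using only the JDK's DOM parser and its XPath 1.0 engine as stand-ins for TagSoup and xmlSearch(); the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
class NamespaceXPathSketch {

    // A small, well-formed XHTML document with a default namespace.
    static final String XHTML =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
        + "<head><title>t</title><meta name=\"description\" content=\"d\"/></head>"
        + "<body><h1>h</h1></body></html>";

    // Parses the markup (with or without namespace support) and counts the
    // nodes that the XPath expression selects.
    static int countMatches(String markup, boolean namespaceAware, String xpath) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            Document dom = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, dom, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + xpath, e);
        }
    }

    public static void main(String[] args) {
        String plainPath = "/html/head/*";
        String verbosePath = "/*[local-name()='html']/*[local-name()='head']/*";

        // Namespaces on: the unprefixed path no longer selects the head contents.
        System.out.println(countMatches(XHTML, true, plainPath));
        // Namespaces on: matching by local-name() works, but is noisy.
        System.out.println(countMatches(XHTML, true, verbosePath));
        // Namespaces off: the short path works again.
        System.out.println(countMatches(XHTML, false, plainPath));
    }
}
```

With namespaces on, an unprefixed name like "html" only matches elements in no namespace, so queries need either a prefix mapping or the local-name() workaround. Turning namespaces off sidesteps both.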

The TagSoup library, on its own, is nothing new. I tried playing around with it a few years ago, loading it into the ColdFusion context with a Groovy class loader. The difference here is that TagSoup now ships with ColdFusion 10. Of course, now that ColdFusion 10 allows per-application Java Class loading, this becomes much less of an issue. But still, pretty cool!
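One portability caveat: the SAX2DOM class used in htmlParse() lives in a JDK-internal package. If it is not usable on a given JVM, the same SAX-to-DOM wiring can be built from public JAXP classes. The following is a sketch in plain Java, not something tested inside ColdFusion. It assumes the JDK's own SAX parser as a stand-in for TagSoup, so the sample input has to be well-formed; in practice, the TagSoup parser (which is also an XMLReader) would be passed in instead. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper, for illustration only.
class SaxToDomSketch {

    // Feeds the SAX events emitted by any XMLReader into a DOM tree,
    // using only public JAXP classes.
    static Document toDom(XMLReader reader, String markup) {
        try {
            SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            // With no stylesheet, the handler copies events straight to the result.
            TransformerHandler domBuilder = factory.newTransformerHandler();
            DOMResult result = new DOMResult();
            domBuilder.setResult(result);

            reader.setContentHandler(domBuilder);
            reader.parse(new InputSource(new StringReader(markup)));

            Node node = result.getNode();
            return (node instanceof Document) ? (Document) node : node.getOwnerDocument();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build a DOM from the SAX events", e);
        }
    }

    // Stand-in for the TagSoup parser: the JDK's strict SAX parser.
    static XMLReader standInReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("Could not create a SAX parser", e);
        }
    }

    public static void main(String[] args) {
        Document dom = toDom(
            standInReader(),
            "<html><body><ul><li>One</li><li>Two</li></ul></body></html>"
        );
        System.out.println(dom.getDocumentElement().getNodeName());
        System.out.println(dom.getElementsByTagName("li").getLength());
    }
}
```

The design is the same as in htmlParse(): the parser announces events and a content handler assembles the tree. The only difference is that the handler comes from a public factory rather than from an internal class.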



Reader Comments

Feb 28, 2012 at 10:18 AM

@All,

ColdFusion 10 also appears to ship with the NekoHTML parser:

http://nekohtml.sourceforge.net/

However, from some brief experimentation, I was getting better results with less effort from the TagSoup parser.


Feb 28, 2012 at 10:19 AM

Very cool! I love the idea of being able to reliably parse HTML pages into XML data.


Feb 28, 2012 at 2:51 PM

Really Cool! That's really useful for all your scraping needs!


Feb 28, 2012 at 7:07 PM

@Ben,

This may give XHTML the biggest boost it's ever gotten! Rapid Application Development + cleanup = actual use!


Feb 29, 2012 at 6:15 AM

@Ben,

"Of course, now that ColdFusion 10 allows per-application Java Class loading"

What is this new witchcraft you mention?


Feb 29, 2012 at 6:16 AM

Very cool. However, I have been using Railo for a while, and it also comes with an htmlParse() function that converts a string to an XML document.

Anyway, it is a bit annoying that once you have an XML document, you cannot convert it back to the original HTML string exactly with toString(xml), because toString() adds indentation and line breaks to the XML.


Mar 5, 2012 at 3:33 PM

Hi Ben,

I know all the buzz is about CF10 right now, but readers may be interested to know that your example works under CF9 using JavaLoader to load TagSoup.

I'm getting errors on larger documents but will need to test with CF10 to see if that's version specific or just an issue with the library.

Great post!

Rob


Oct 12, 2012 at 6:24 PM

Ben, I'm getting this error from your code:

Unable to find a constructor for class com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM that accepts parameters of type ( '' ).

I'm guessing the path in

var saxDomBuilder = createObject( "java", "com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM" ).init();

is incorrect for my server. How do I find out what it should be? Or what is my issue? Thanks.

