ColdFusion 10 - Parsing Dirty HTML Into Valid XML Documents

Posted February 28, 2012 at 9:50 AM by Ben Nadel

Tags: ColdFusion

As I blogged earlier, ColdFusion 10 now supports XPath 2.0 in the xmlSearch() and xmlTransform() functions. This might not sound like a very exciting upgrade; however, when you realize that ColdFusion 10 now enables the parsing of "dirty" HTML code into valid XML documents, suddenly, the world of XML becomes a lot more interesting.

NOTE: At the time of this writing, ColdFusion 10 was in public beta.

ColdFusion 10 doesn't provide a native htmlParse() method; however, ColdFusion 10 now ships with the TagSoup 1.2 library pre-installed. This means that we can now instantiate the TagSoup classes and use them to convert our HTML documents into valid XML documents. And, of course, once we do that, we can use xmlSearch() to easily extract elements from our target HTML source code.

To demonstrate this functionality, I'm going to create "dirty" HTML content and then parse it into a searchable XML document. When I use the term "dirty," I simply mean that the HTML will have things like missing close-tags, missing attribute quotes, poor nesting, and upper-case element and attribute names.

<!---
    Create our "dirty" HTML document. Dirty in the sense that it
    cannot be parsed as valid XML. In order to make this document
    "bad", we'll have tags that don't self-close and perhaps a
    missing close-tag or two.
--->
<cfsavecontent variable="dirtyHtml">

    <!doctype html>
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>Dana Linn Bailey</title>
        <meta name="description" content="Strong female muscle, FTW!">
        <meta name="keywords" content="female muscle,femmuscle,sexy">
    </head>
    <body>

        <h1>
            Dana Linn Bailey
        </h1>

        <h2>
            Professional Bodybuilder
        </h2>

        <p>
            <IMG
                SRC="//www.danalinn.com/images/photos/DanaLinnBailey_3.jpg"
                ALT="Dana Linn Bailey"
                HEIGHT=250>
            <br>
        </p>

        <h3>
            Professional Services
        </h3>

        <ul>
            <li>Full Contest Preparation
            <li>12-Week Weight Management Program
            <li>ONE-TIME Personalized Diet Plan
            <li>ONE-TIME Personalized Week Training Program
            <li>Train with DLB herself!!!
        </ul>

        <h2>
            Biography
        </h2>

        <p>
            I grew up a jock. At age 6, I was already on the swim
            team, waking up and going to practice just like the big
            kids. Up until high school, I was a 6-sport athlete all
            year round, playing soccer, basketball, field hockey,
            softball, running track and also swim team. In high
            school I continued with my 3 favorite sports, soccer,
            basketball, and field hockey and excelled in all with
            many awards.

        <p>
            <a href=http://www.danalinn.com/about.html>Read More</a>.
        </p>

    </body>
    </html>

</cfsavecontent>


<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->


<cfscript>


    // I take an HTML string and parse it into an XML (XHTML)
    // document. This is returned as a standard ColdFusion XML
    // document.
    function htmlParse( htmlContent, disableNamespaces = true ){

        // Create an instance of the Xalan SAX2DOM class as the
        // recipient of the TagSoup SAX (Simple API for XML) compliant
        // events. TagSoup will parse the HTML and announce events as
        // it encounters various HTML nodes. The SAX2DOM instance will
        // listen for such events and construct a DOM tree in response.
        // NOTE: This is a JDK-internal class, so its constructor
        // signature may vary across JVM versions.
        var saxDomBuilder = createObject( "java", "com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM" ).init();

        // Create our TagSoup parser.
        var tagSoupParser = createObject( "java", "org.ccil.cowan.tagsoup.Parser" ).init();

        // Check to see if namespaces are going to be disabled in the
        // parser. If so, then they will not be added to elements.
        if (disableNamespaces){

            // Turn off namespaces - they are lame and nobody likes
            // to perform xmlSearch() methods with them in place.
            tagSoupParser.setFeature(
                tagSoupParser.namespacesFeature,
                javaCast( "boolean", false )
            );

        }

        // Set our DOM builder to be the listener for SAX-based
        // parsing events on our HTML.
        tagSoupParser.setContentHandler( saxDomBuilder );

        // Create our content input. The InputSource encapsulates the
        // means by which the content is read.
        var inputSource = createObject( "java", "org.xml.sax.InputSource" ).init(
            createObject( "java", "java.io.StringReader" ).init( htmlContent )
        );

        // Parse the HTML. This will trigger events which the SAX2DOM
        // builder will translate into a DOM tree.
        tagSoupParser.parse( inputSource );

        // Now that the HTML has been parsed, we have to get a
        // representation that is similar to the XML document that
        // ColdFusion users are used to having. Let's search for the
        // ROOT node and return it.
        return(
            xmlSearch( saxDomBuilder.getDom(), "/node()" )[ 1 ]
        );

    }


    // ------------------------------------------------------ //
    // ------------------------------------------------------ //


    // Parse the "dirty" HTML into a valid XML document.
    xhtml = htmlParse( dirtyHtml );

    // Query for the head contents.
    headContents = xmlSearch( xhtml, "/html/head/*" );

    // Query for the body contents.
    bodyContents = xmlSearch( xhtml, "/html/body/*" );

    // Output the two values.
    writeDump( headContents );
    writeDump( bodyContents );


</cfscript>

As you can see, the HTML code is pretty sloppy. And still, we take our HTML document, run it through htmlParse(), and then search the resultant XML document for various elements. When we run the above code, we get the following page output:

[Figure: Parsing HTML code into XML documents using ColdFusion 10 and TagSoup.]

As you can see, the dirty HTML was successfully parsed into a valid ColdFusion XML document, which we were able to search with XPath 2.0 and xmlSearch(). The TagSoup library was able to convert our element and attribute names to lowercase, handle tags that don't require closing (i.e., BR and IMG), and close tags that were improperly left open.
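
Once the markup is in XML form, narrower XPath queries work the same way as the head and body queries. The following is a minimal sketch, assuming the htmlParse() function and the dirtyHtml variable from the demo above are already defined on the page; the variable names services and links are only illustrative.

```cfml
<cfscript>

    // ASSUMPTION: htmlParse() and dirtyHtml come from the demo above.
    xhtml = htmlParse( dirtyHtml );

    // Gather the list items. This path has no namespace prefixes
    // because htmlParse() disables namespaces by default.
    services = xmlSearch( xhtml, "//ul/li" );

    // Output the text of each list item, escaped for display.
    for ( service in services ) {
        writeOutput( htmlEditFormat( trim( service.xmlText ) ) & "<br />" );
    }

    // Attribute nodes can be selected as well; each match exposes
    // its value through the xmlValue property.
    links = xmlSearch( xhtml, "//a/@href" );

    if ( arrayLen( links ) ) {
        writeOutput( htmlEditFormat( links[ 1 ].xmlValue ) );
    }

</cfscript>
```

Keeping namespaces disabled is what lets these paths stay short; with namespaces enabled, each element name in the XPath would need to account for the XHTML namespace.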

The TagSoup library, on its own, is nothing new. I tried playing around with it a few years ago, loading it into the ColdFusion context with a Groovy class loader. The difference here is that TagSoup now ships with ColdFusion 10. Of course, now that ColdFusion 10 allows per-application Java Class loading, this becomes much less of an issue. But still, pretty cool!
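
For reference, per-application class loading is configured through this.javaSettings in Application.cfc. This is a minimal sketch, assuming a hypothetical "lib" folder next to Application.cfc that holds the JAR files; the application name and folder are placeholders and not part of the original demo.

```cfml
// Application.cfc
component {

    // ASSUMPTION: placeholder application name.
    this.name = "TagSoupDemo";

    // ASSUMPTION: a hypothetical "lib" folder that sits next to this
    // file and contains the JAR files to load for this application.
    this.javaSettings = {
        loadPaths = [ getDirectoryFromPath( getCurrentTemplatePath() ) & "lib" ],
        loadColdFusionClassPath = true,
        reloadOnChange = false
    };

}
```

With a setting like this in place, createObject( "java", ... ) can resolve classes from the listed paths without a separate class loader.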





Reader Comments

Feb 28, 2012 at 10:18 AM

@All,

ColdFusion 10 also appears to ship with the NekoHTML parser as well:

http://nekohtml.sourceforge.net/

However, from some brief experimentation, I was getting better results with less effort from the TagSoup parser.


Feb 28, 2012 at 10:19 AM

Very cool! I love the idea of being able to reliably parse HTML pages into XML data.


Feb 28, 2012 at 2:51 PM

Really Cool! That's really useful for all your scraping needs!


Feb 28, 2012 at 7:07 PM

@Ben,

This may give XHTML the biggest boost it's ever gotten! Rapid Application Development + cleanup = actual use!


Feb 29, 2012 at 6:15 AM

@Ben,

"Of course, now that ColdFusion 10 allows per-application Java Class loading"

What is this new witchcraft you mention ?


Feb 29, 2012 at 6:16 AM

Very cool. However I have been using Railo for a while, and it also comes with a function htmlParse() to convert string to XML Doc.

Anyway, it is a bit annoying that once you have an XML doc, you cannot convert it back to the original HTML string perfectly with toString(xml), because toString() adds XML indentation and line breaks.


Mar 5, 2012 at 3:33 PM

Hi Ben,

I know all the buzz is about CF10 right now but readers may be interested to know that your example works under CF9 using JavaLoader to load TagSoup.

I'm getting errors on larger documents but will need to test with CF10 to see if that's version specific or just an issue with the library.

Great post!

Rob


Oct 12, 2012 at 6:24 PM

Ben I'm getting this error from your code.

Unable to find a constructor for class com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM that accepts parameters of type ( '' ).

I'm guessing the path in

var saxDomBuilder = createObject( "java", "com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM" ).init();

is incorrect for my server. How do I find out what it should be? Or what is my issue? Thanks

