Parsing Invalid HTML Into XML Using ColdFusion, Groovy, And TagSoup

By Ben Nadel on
Tags: ColdFusion

I try to write my HTML to be as XHTML-compliant as possible, which makes it a subset of XML; but, that's not always easy or possible, and oftentimes the HTML that we deal with is downright dirty. That makes parsing the HTML into a usable data structure a total nightmare! As part of my exploration of Groovy, I wanted to see if ColdFusion could leverage Groovy in such a way that the responsibility of HTML cleanup could be outsourced.

After a little bit of Googling, I came across the Java package, TagSoup. TagSoup is a SAX-compliant parser that reads HTML in and cleans it up along the way, such that the resultant document is actually an XML document (XHTML). Once the HTML is XML, we can easily parse it and extract the data.
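
To see the core of that idea in isolation, here is a minimal Groovy sketch (this is not the code from the demo below - it assumes the TagSoup JAR is already on the classpath and uses a made-up snippet of dirty HTML):

// Minimal sketch: TagSoup's Parser is a SAX-compliant XMLReader, which
// XmlSlurper will accept directly. Assumes tagsoup-1.2.jar is on the classpath.
def dirtyHtml = "<ul><li>Athletic<li>Curvy</ul>";

def tagSoup = new org.ccil.cowan.tagsoup.Parser();
def page = new XmlSlurper( tagSoup ).parseText( dirtyHtml );

// TagSoup wraps the fragment in html/body, so every LI is now a
// well-formed XML node that we can navigate with GPath.
page.body.ul.li.each {
	println( it.text() );
}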

To play with this concept, I created an invalid XHTML document:

web.htm

<!DOCTYPE HTML>
<html>
<head>
	<title>ColdFusion And Groovy HTML Parsing</title>
</head>
<body>

	<h1>
		ColdFusion And Groovy HTML Parsing
	</h1>

	<p>

		The following are all properties of
		<strong class="girl">Joanna</strong>

		<ul>
			<li>Athletic
			<li>Curvy
			<li>Sexy
			<li>Brunette
		</ul>

</body>
</html>

As you can see above, the paragraph tag (P) has no closing tag and none of the list item tags (LI) have closing tags. There is no way that this HTML could be parsed into an XML document; and therefore, our ability to treat this HTML as a structured document from which we could extract information is rather limited. That's where TagSoup comes into play. In the following code, we're going to load up TagSoup in a Groovy context and let it convert the raw HTML into an XML document (which we will then re-serialize for ColdFusion).

<!--- Import the CFGroovy tag library. --->
<cfimport prefix="g" taglib="../cfgroovy/" />

<!---
	Get the current directory path. This will be used so we
	don't have to use expandPath() on the JAR or HTML file paths.
--->
<cfset currentDirectory = getDirectoryFromPath(
	getCurrentTemplatePath()
	) />

<!---
	Get the file path to the TagSoup jar file (which will be
	loaded by the Groovy script engine).
--->
<cfset tagSoupJarFile = (currentDirectory & "tagsoup-1.2.jar") />

<!--- Read in the raw HTML file. --->
<cfset html = fileRead( currentDirectory & "web.htm" ) />


<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->
<g:script>

	<!---
		Get the class loader being used by the Groovy script
		engine. We will need this to load classes in the TagSoup
		JAR file.
	--->
	def classLoader = this.getClass().getClassLoader();

	<!---
		Add the TagSoup JAR file to the class loader's list of
		classes that it can instantiate.
	--->
	classLoader.addURL(
		new URL( "file:///" + variables.tagSoupJarFile )
		);

	<!---
		Get an instance of the TagSoup HTML parser from the
		class loader. This is a SAX-compliant parser.
	--->
	def tagSoupParser = Class.forName(
		"org.ccil.cowan.tagsoup.Parser",
		true,
		classLoader
		)
		.newInstance()
	;

	<!---
		Create an instance of the Groovy XML Slurper using the
		TagSoup parsing engine.
	--->
	def htmlParser = new XmlSlurper( tagSoupParser );

	<!---
		Parse the raw HTML text into a valid XHTML document using
		the TagSoup parsing engine. This will give us a GPathResult
		XML document.
	--->
	def xhtml = htmlParser.parseText( variables.html );

	<!---
		Now that we have an XHTML (XML) document, we need to
		serialize it back into HTML markup.
	--->
	def cleanHtmlWriter = new StringWriter();

	<!---
		This builds the markup in the string writer using the
		streaming markup builder.

		NOTE: This step loses me a bit. I have not been able to
		find any great documentation on how this is used or what
		exactly it does. But, it looks like somehow the XHTML
		document is being bound to the markup builder, which is
		then serialized to the string writer.
	--->
	cleanHtmlWriter << new groovy.xml.StreamingMarkupBuilder().bind(
		{
			mkp.declareNamespace( '': 'http://www.w3.org/1999/xhtml' );
			mkp.yield( xhtml );
		}
		);

	<!---
		Now that we have our X(HTML) document serialized in our
		string writer, let's convert it to a string and store it
		back into the ColdFusion variables scope.
	--->
	variables.xhtml = cleanHtmlWriter.toString().trim();

</g:script>
<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->


<!---
	Now that we have cleaned the HTML into XHTML, we can parse it
	using xmlParse().
--->
<cfset xml = xmlParse( xhtml ) />

<!---
	Now that the HTML is parsed into an XML document, we can
	output it the way we would normally with any XML document.
--->
<cfoutput>

	#xml.html.body.p.strong.xmlText# is:<br />

	<!--- Loop over LI elements. --->
	<cfloop
		index="attribute"
		array="#xml.html.body.ul.xmlChildren#">

		- #attribute.xmlText#<br />

	</cfloop>

</cfoutput>

I am still extremely new to Groovy, so the last part of the code above is a bit beyond my full comprehension. I grasp that the parsed XHTML document is being bound to the StreamingMarkupBuilder instance, which somehow allows it to be transformed and written to a string buffer; but, the actual workings of it are not quite clicking in my head just yet. Believe it or not, that piece of code took me about three hours to research and write! But, in the end I got it working; and, as you can see, once the serialized XHTML is passed back to ColdFusion, we can easily parse it back into a ColdFusion XML document from which we can extract targeted information.
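
If that StreamingMarkupBuilder step feels as opaque to you as it does to me, one possibly simpler route is groovy.xml.XmlUtil, whose serialize() method accepts a GPathResult. Treat this as a sketch only - it assumes your Groovy version ships XmlUtil, that the TagSoup JAR is on the classpath (rather than loaded through the class loader as in the demo above), and I have not verified that its output is byte-for-byte identical to the StreamingMarkupBuilder version:

def html = "<p>The following are all properties of <strong class=girl>Joanna";

// Slurp the dirty HTML through TagSoup...
def xhtml = new XmlSlurper( new org.ccil.cowan.tagsoup.Parser() ).parseText( html );

// ...and let XmlUtil serialize the GPathResult back to an XML string
// (it includes an XML declaration, which ColdFusion's xmlParse() copes with fine).
println( groovy.xml.XmlUtil.serialize( xhtml ).trim() );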

Running the code above, we get the following output:

Joanna is:
- Athletic
- Curvy
- Sexy
- Brunette

I don't know about you, but I think that is hella sexy! I only wish that I understood the Groovy syntax a bit better; I am finding my unfamiliarity with it to be a large stumbling block in my learning process. I think it's time to actually get a good book and learn my stuff. Unfortunately, it looks like some of the books about Groovy are already out of date; for example, I came across a post in my research in which someone explained that the popular book, Groovy in Action, didn't even cover mkp in the builder classes, which was required for the last, crucial portion of the code above. But, that's a whole other discussion.




Reader Comments

That's slick, Ben. I haven't played with XmlSlurper much because I never have valid HTML, but wrapping it around TagSoup is hella clever.


@Jim,

Oh cool, I was not aware of that; a quick Google search and I see Brian Rinaldi has it in his open source list. I'll have to take a look. Thanks!

@Barney,

Thanks man. I'm still enjoying a lot of the Groovy stuff. It's odd - I actually get conflicted about how much / little definition to use when I declare things. It seems that Groovy works with practically nothing - it's making me lazy :)


Hi Ben.

I'm curious why you've used Groovy instead of just using TagSoup via Java?

I have to admit I don't know a lot about Groovy yet. What benefit has it brought to this?


@Gareth,

This experiment started out simply as a way to experiment with Groovy - the TagSoup part of it was just incidental.

I would be happy to try it in the Java way. I'll probably take a look at what @Jim did with his Crouton project, as I would guess that it is integrating directly with Java.

From what I read in the TagSoup docs (of which there are almost none), you need something SAX-compliant that can consume the TagSoup parser; I don't know how to do that in Java.... of course, when I started this, I didn't know how to do it in Groovy either - it just so happens that XmlSlurper() takes a SAXParser as a potential constructor argument.

Perhaps trying this in Java would be a great follow up post :)


@Ben,
You seem to be running into quite a few of the things that I came across while beginning with Groovy. It's been out for many years, but doesn't seem to have as many bloggers pumping out code (like bennadel.com has done for CF over the years), thus making searches a little bit more time consuming and painful. It took me a great deal of trial and error with the code to get some things to work; there was much less copying and pasting, and less figuring out how someone else had already managed to accomplish the same thing. Having said that, though, there's definitely a great sense of accomplishment when you figure out "ah, so *that's* how you do it" :) I've found that the Groovy community at large is willing to help out if you do run into an issue that's causing you grief. Anyway, keep up with the Groovy posts, as I'm always interested in some new techniques to integrate with CF.


@Gareth,

Thanks for the encouragement. I have a few eBooks that I'm gonna try to make my way through.


@Alessandro,

That's pretty cool! I was not aware that the W3C offered any web services. It looks like you need to have a membership to use these APIs - is this true? I had trouble finding the service description on the W3C site.


@Jim,

Very cool component. It's funny to now come across it after working on creating a CFC myself. I see that with yours you are using the JavaClassLoader, which is great for those not able to add the JAR to their classpath. I was looking into doing the very same thing, but stumbled upon the JAR already being in our classpath here.

@Ben,

I was playing around with this for a form validator I have developed where you pass attributes to the HTML tags, such as: <input name="firstname" validation="required,min_length[3]" ... />, and have a custom tag parse the HTML as XML so I can grab the tag attributes and handle them accordingly. I tried jTidy at first, but even the slightest corrupt HTML would just return a blank string for me, so that's why I read into TagSoup. I'm not sure how long this will be around for, but it might be worth checking out just to get an idea: http://pastie.org/1359532

Jim used some classes from the javax.xml.transform package, but that didn't work well for me. I actually used the TransformerHandler as the ContentHandler, and it caused issues with <input ... checked="checked" /> becoming <input ... checked />, which would not parse. I see Jim used SAX2DOM, which I haven't had experience with. I just used the XMLWriter class that came with TagSoup and that worked just fine.
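
Roughly, that TagSoup-plus-XMLWriter wiring looks something like the following (a simplified Groovy sketch of the approach rather than the CFC itself, with a made-up input string):

import org.ccil.cowan.tagsoup.Parser;
import org.ccil.cowan.tagsoup.XMLWriter;
import org.xml.sax.InputSource;

def dirtyHtml = '<input name="firstname" validation="required,min_length[3]" checked>';

// XMLWriter is a SAX ContentHandler that writes whatever events it
// receives, as XML, to the supplied writer.
def output = new StringWriter();
def writer = new XMLWriter( output );

// Let TagSoup parse the dirty markup and feed the events to the writer.
def parser = new Parser();
parser.setContentHandler( writer );
parser.parse( new InputSource( new StringReader( dirtyHtml ) ) );

// The string writer now holds the cleaned-up markup as parseable XML.
println( output.toString() );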

I may have to go back through, though, and add in some accessors like Jim used, if this CFC ever gets used by anyone but myself at work. I do a bit of cleaning up of the parsed content just for the specific use here. When I pass in HTML, I don't need any namespace or html/body tags. I just want the output to be the same as the input, except cleaned up.

Good article!


@Tristan,
I'm glad you found it useful. I should probably go back and revisit it at some point. My skills are a lot stronger now than they were back then.


@Ben,
I am writing a script which parses an HTML file and finds out how many table tags there are; based on that, I would do some operation.

When I convert the HTML into proper XML, the text inside the <p> tag won't have its own tag, so it is failing. In the above example:

<p>
The following are all properties of
<strong class="girl">Joanna

How did your code remove the text "The following are all properties of", or how did XmlSlurper suppress it?

Here is my code

import org.ccil.*

def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
def slurper = new XmlSlurper( tagsoupParser )

def htmlParser = slurper.parse( "D:\\Temp\\t.html" )

htmlParser.'**'.findAll { it.@class.toString().contains( "table" ) }.each {
	println it
}

Note: tagsoup-1.2.1.jar is added to the Groovy script classpath.


This takes almost any URL and formats it in a way that xmlParse() can work with:
<cfhttp method="POST" url="http://services.w3.org/tidy/tidy" resolveurl="yes"><cfhttpparam type="formfield" value="#myURL#" name="docAddr"></cfhttp>

The following line is a fix you suggested in another post, Ben:
<cfset xmlResult = XmlParse(REReplace(cfhttp.FileContent, "^[^<]*", "", "all" )) />

Finally, parse the result into an XML structure
<cfset myXML = xmlParse(xmlResult)>

This works pretty well for web scraping. However, there are occasions where I do get blocked. I stumbled upon your post looking for alternatives to using the w3.org tidy service.

