Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.

Parsing HTML Natively With htmlParse() In Lucee

By Ben Nadel
Tags: ColdFusion

Parsing HTML isn't a task that I often have to perform during the "normal operation" of a ColdFusion application. However, parsing HTML can be a helpful feature when it comes to data migration; for example, when migrating from an old, HTML-based content management system (CMS) to a Markdown-based one. And, as someone who recently started using Flexmark and Markdown to author blog posts, this might be a migration that I attempt to take on. In the past, I would have reached for TagSoup or jSoup to perform such parsing. But, it turns out that Lucee 5.3 now provides HTML-parsing natively with the htmlParse() function.

The htmlParse() function is simple. You pass it a String; it returns an XML document. You can then traverse the returned XML data structure manually; or, you can use XPath to query for target elements. The complexity with the XPath option is that the htmlParse() function applies a name space to the document, which means that instead of using simple queries like:

	doc.search( "//p" )

... you have to use somewhat janky queries like this:

	doc.search( "//*[ local-name() = 'p' ]" )
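
If hand-writing those local-name() predicates gets tedious, one option is a tiny helper that builds them from a tag name. This is only a sketch: localNameXPath() is a hypothetical function name that I'm making up for illustration (it is not built into Lucee), and it assumes simple tag names that contain no quotes.

```cfml
<cfscript>

	/**
	* HYPOTHETICAL helper (not part of Lucee): I build a name-space-agnostic XPath
	* query that matches all elements with the given tag name.
	*
	* @tagName I am the element name to match (assumed to contain no quotes).
	*/
	public string function localNameXPath( required string tagName ) {

		return( "//*[ local-name() = '#tagName#' ]" );

	}

	doc = htmlParse( "<body><p>One</p><p>Two</p></body>" );

	// Same shape of query as the janky one above, just generated for us.
	paragraphs = doc.search( localNameXPath( "p" ) );

	dump( label = "P-Nodes", var = paragraphs );

</cfscript>
```

This only hides the syntax; the query that runs is still the local-name() version.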

The degree to which this bothers you is strictly personal. That said, to make this exploration a bit more interesting, I wanted to create a wrapper function for htmlParse() that uses XSLT (XML Transforms) to remove name spaces from the resultant data structure.

I created a function called htmlParseNoNamespaces(). To see it in action, I'm going to parse a simple (but invalid) HTML string and then query for the paragraphs:

<!---
	CAUTION: The htmlParse() function is very forgiving with invalid markup. But, it
	seems to work most consistently when there is a single ROOT NODE in the given markup.
	For example, in the following markup, if I remove the "body" wrapper, the structure
	of the parsed document completely changes, placing one SECTION element inside the
	other SECTION element (presumably because I am not closing the P-tag).
--->
<cfsavecontent variable="markup">

	<body class=dark-mode>
		<section>
			<!-- Testing some cool things. -->
			<p id=intro class='content'>This is very interesting!
			<p>I agree, this is <u class="em">player!</u>
			<p>But, will it <strong>work</strong>!</p>
		</section>
		<section>
			<p>One wonders.
		</section>
	</body>

</cfsavecontent>
<cfscript>

	doc = htmlParseNoNamespaces( markup );

	// Gather all of the text from the P-elements. Since the parsed HTML document is
	// returned as an XML document, we can use XPath to locate the paragraphs.
	paragraphs = doc
		.search( "//p" )
		.map(
			( node ) => {

				// For each paragraph, aggregate the string value of the entire node,
				// which will include all of the descendant nodes as well.
				return( node.search( "normalize-space( string( . ) )" ) );

			}
		)
	;

	dump( label = "P-Text", var = paragraphs );
	echo( "<br />" );
	dump( label = "HtmlDoc", var = doc.search( "//body" )[ 1 ] );

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	/**
	* When the native htmlParse() function runs, it includes XML name-spaces which make
	* it much harder to search the subsequent document using XPath. This method strips
	* those XML name-spaces from the parsed document, allowing XPath to target node-names
	* more directly.
	*
	* @htmlMarkup I am the HTML string being parsed.
	* @output false
	*/
	public any function htmlParseNoNamespaces( required string htmlMarkup ) {

		// To strip out the name-spaces, we're going to use XSLT (XML Transforms). The
		// following XSLT document will traverse the parsed HTML document and copy nodes
		// over to a new output string using only the node names.
		// --
		// Read More:
		var removeNamespacesXSLT = trim('
			<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

				<!-- Force XML output so the result can be re-parsed with xmlParse(). -->
				<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="no" />

				<!-- Keep comment nodes. -->
				<xsl:template match="comment()">
					<xsl:copy>
						<xsl:apply-templates />
					</xsl:copy>
				</xsl:template>

				<!-- Keep element nodes. -->
				<xsl:template match="*">
					<!-- Remove element prefix. -->
					<xsl:element name="{ local-name() }">
						<!-- Process attributes. -->
						<xsl:for-each select="@*">
							<!-- Remove attribute prefix. -->
							<xsl:attribute name="{ local-name() }">
								<xsl:value-of select="." />
							</xsl:attribute>
						</xsl:for-each>

						<!-- Copy child nodes. -->
						<xsl:apply-templates />
					</xsl:element>
				</xsl:template>

			</xsl:stylesheet>
		');

		// In order to remove the name-spaces, we have to parse the document twice -
		// once to parse the HTML into an XML document. Then, once again to parse the
		// transformed XML string (less the name-spaces) back into an actual XML
		// document that we can search using XPath.
		return( xmlParse( htmlParse( htmlMarkup ).transform( removeNamespacesXSLT ) ) );

	}

</cfscript>

As you can see, I'm using the htmlParse() function to parse the incoming HTML string. Then, I use the .transform() method to remove the name spaces (using the given XSLT document), which produces a serialized XML string. I then parse that XML string back into an XML Document, which can be searched easily with XPath and simple Element selectors.

Now, when I run this Lucee CFML document, I get the following output:

[Screenshot: HTML parsed with Lucee 5.3 and queried with XPath.]

As you can see, once the non-name-space XML document is produced, I can use a simple XPath query like //p to search for the Paragraph nodes. I then use a .map() method to map the P-nodes onto String values.
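
To make the node-to-String mapping a little more concrete, here is a minimal sketch that pulls the text out of a single P-node in two ways. It calls htmlParse() directly (so it stands on its own) and therefore uses the local-name() style of query. The sample markup is made up for illustration, and I'm dumping the .xmlText property next to the XPath value purely for comparison, without assuming the two agree when the node contains nested elements.

```cfml
<cfscript>

	// Sample markup (made up for this sketch) with a nested STRONG element.
	doc = htmlParse( "<body><p>But, will it <strong>work</strong>!</p></body>" );

	// Name-space-agnostic query, since this sketch uses htmlParse() directly.
	node = doc.search( "//*[ local-name() = 'p' ]" )[ 1 ];

	// XPath string-value: the text of the P-node and all of its descendants, with
	// leading, trailing, and repeated white-space collapsed.
	fullText = node.search( "normalize-space( string( . ) )" );

	// The .xmlText property of the same node, for comparison.
	directText = node.xmlText;

	dump( label = "XPath string()", var = fullText );
	dump( label = ".xmlText", var = directText );

</cfscript>
```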

Again, parsing HTML isn't something that I have to do very often in my ColdFusion code. That said, it's really cool that parsing HTML is now a native feature of Lucee. It will make the few times that I do need the functionality all that much easier to consume.

Reader Comments

This is amazing. I actually never got on too well with jSoup, so I am looking forward to using:


And, I actually like using 'XPath'. It's also great that it still parses everything, even when some of the paragraph tags are not closed.
Impressive stuff...


Pretty exciting, right? One of the things that I would like to be able to do with it eventually is extract fenced-code-blocks from my markdown content. I explored this idea this morning:

Having to use XPath is not great when compared with something like CSS-based selectors. But, for small things, like my aforementioned exploration, it's a perfectly enjoyable option.

Hi Ben,

I read this article with great interest and it brings me to a thought about a problem that has been bothering me for a long time.

There is a table. The rows (first column) are attributes of something; the columns (first row) are numbers / ratings. The position of an X in the table determines the rating of an attribute.

Is it possible to extract the information out of the table like that?

Attribute A - 2
Attribute B - 3
Attribute C - 1
Attribute D - 2


Sorry for the late reply. But, I assume (without knowing the exact data model here), that you could .map() the rows onto the given structure you are looking for. So, some pseudo-code for this might look like:

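// Using the name-space-free wrapper from the post so that "//tr" can match.
var data = htmlParseNoNamespaces( your_html ).search( "//tr" ).map(
	( row ) => {
		// Find the values in each row that you are looking for. The .search() method
		// returns an array, so we grab the first match.
		var attribute = row.search( "./....." )[ 1 ].xmlText;
		var rating = row.search( "./....." )[ 1 ].xmlText;

		return({
			attribute: attribute,
			rating: rating
		});
	}
);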

I don't know what those internal .search() values would actually look like since I don't know what kind of HTML you are working with. But, essentially, each .map() iteration would give you access to the parsed TR row. Then, you can perform a .search() off of the TR context to get your specific values (and maybe apply some different logic to translate locations to ratings).
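
To make that a bit more concrete, here is a rough sketch under some loud assumptions: the table markup below is hypothetical (a header row of ratings, then one row per attribute with a single "X" in the matching column), and your real HTML will almost certainly differ. It uses htmlParse() directly, so the queries use local-name() to get around the name space.

```cfml
<cfscript>

	// ASSUMPTION: a hypothetical table in which the header row holds the ratings and
	// each data row holds an attribute name followed by one "X" in the matching column.
	tableHtml = '
		<table>
			<tr><th>Attribute</th><th>1</th><th>2</th><th>3</th></tr>
			<tr><td>Attribute A</td><td></td><td>X</td><td></td></tr>
			<tr><td>Attribute B</td><td></td><td></td><td>X</td></tr>
			<tr><td>Attribute C</td><td>X</td><td></td><td></td></tr>
		</table>
	';

	doc = htmlParse( tableHtml );

	// Using local-name() since htmlParse() applies a name space to the document.
	headers = doc.search( "//*[ local-name() = 'th' ]" );
	rows = doc.search( "//*[ local-name() = 'tr' ][ *[ local-name() = 'td' ] ]" );

	data = rows.map(
		( row ) => {

			var cells = row.search( "./*[ local-name() = 'td' ]" );
			var rating = "";

			// Cell 1 is the attribute name; look for the column that holds the "X".
			for ( var i = 2 ; i <= cells.len() ; i++ ) {

				if ( trim( cells[ i ].xmlText ) == "X" ) {

					// Translate the column position into the rating in the header row.
					rating = trim( headers[ i ].xmlText );

				}

			}

			return({
				attribute: trim( cells[ 1 ].xmlText ),
				rating: rating
			});

		}
	);

	dump( label = "Ratings", var = data );

</cfscript>
```

The part that does the work is the relative .search() off of each TR; the position-to-rating translation is just a loop over that row's cells.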

Hopefully that helps a little bit.


After working extensively with htmlParse() over the weekend, I stumbled upon something interesting - escaped HTML entities are no longer escaped when I read them out through the .xmlText property of the resultant XML document:

.... I had to get around this by using the toString() function on the given TEXT_NODE and then stripping off the XML DOCTYPE.
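
For what it's worth, a different way to get escaped text back (this is not the toString() approach that I described above, just an alternative sketch) might be to re-escape the value with the built-in xmlFormat() function. The sample markup here is made up, and note that xmlFormat() escapes quotes as well, so the result may not match the original source character-for-character.

```cfml
<cfscript>

	// Sample markup (made up for this sketch) containing escaped HTML entities.
	doc = htmlParse( "<body><p>Use &lt;strong&gt; for emphasis.</p></body>" );

	node = doc.search( "//*[ local-name() = 'p' ]" )[ 1 ];

	// The text as handed back by the .xmlText property.
	rawText = node.xmlText;

	// ALTERNATIVE to the toString() workaround: re-escape the XML special characters.
	escapedText = xmlFormat( rawText );

	dump( label = ".xmlText", var = rawText );
	dump( label = "xmlFormat()", var = escapedText );

</cfscript>
```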