Ben Nadel at BFusion / BFLEX 2009 (Bloomington, Indiana) with: Bob Burns

Avoiding Self-Closing IFRAME Tags Using htmlParse() In Lucee CFML 5.3.4.80

Over the past week, I've been working to retrofit Markdown onto all of my old blog content using Lucee CFML. It's been an exciting journey with a lot of trial and error. For example, the other day, I realized the .xmlText property wasn't giving me escaped HTML entities; and, just this morning, I realized that iframe tags with no content were getting re-serialized as self-closing tags. While this is valid for XML (any tag with no children can be self-closing), only void elements in HTML, such as img, br, and meta, can be written that way. And, the iframe is not one of them. As such, I had to re-process all of my posts, ensuring that iframe tags were serialized using both an open and a close tag in Lucee CFML 5.3.4.80.

To see the issue I was running into, let's look at a stand-alone example. In the following ColdFusion code, we're going to parse an HTML snippet using htmlParse(). And then, simply serialize it back to HTML using toString():

<cfscript>

	```
	<cfsavecontent variable="htmlContent">

		<p>
			Heck, checkout this video:
		</p>

		<p>
			<iframe src="video.mp4"></iframe>
		</p>

	</cfsavecontent>
	```

	// The htmlParse() function parses the HTML into an XML document. The rules for XML
	// documents are different than the rules for HTML documents. This can cause a
	// re-serialization problem for non-self-closing tags with empty-content.
	xmlContent = htmlParse( htmlContent );

	// Because the IFRAME element has no child-nodes, stringification of the XML document
	// will render the IFRAME as a SELF-CLOSING tag. This is valid for XML but is NOT
	// valid for HTML.
	echo( encodeForHtml( toString( xmlContent.html.body ) ) );

</cfscript>

As you can see, the HTML content being parsed contains an iframe tag with no children:

<iframe src="video.mp4"></iframe>

And, when we serialize this using toString(), we get the following markup (I've manually added white-space to make it more readable):

<?xml version="1.0" encoding="UTF-8"?>
<body xmlns="http://www.w3.org/1999/xhtml">
	<p>
		Heck, checkout this video:
	</p>
	<p>
		<iframe frameborder="1" scrolling="auto" src="video.mp4"/>
	</p>
</body>

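As you can see, the iframe tag is being serialized as a self-closing tag, in that it now ends with /> rather than with </iframe>. If I were to try and get the browser to render this iframe, the page would break. It wouldn't throw an error; it would simply hit the <iframe/> tag and stop rendering the rest of the page output. This is because an HTML parser ignores the trailing slash on a non-void element and treats <iframe/> as an opening tag, so everything after it gets consumed as the iframe's content.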
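Since the failure is silent, it can help to scan serialized output for the problematic form before publishing it. The following is a minimal sketch and not part of my actual workflow: findSelfClosedIframes() is a hypothetical helper name, and the pattern assumes attribute values don't contain a raw > character.

```cfml
<cfscript>

	// HYPOTHETICAL helper (not a Lucee built-in): returns every IFRAME tag in the
	// given markup that was serialized in the self-closing form.
	function findSelfClosedIframes( required string markup ) {

		// Matches an IFRAME tag whose final characters are "/>".
		return( reMatchNoCase( "<iframe\b[^>]*/>", markup ) );

	}

	serialized = '<p><iframe frameborder="1" scrolling="auto" src="video.mp4"/></p>';
	selfClosedTags = findSelfClosedIframes( serialized );

	// A non-empty array means the content needs to be re-processed.
	echo( "Self-closing iframe tags found: " & selfClosedTags.len() );

</cfscript>
```

A check like this only flags the problem; it doesn't fix it.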

NOTE: Literally, as I am writing this, I just noticed that the htmlParse() method seems to have injected frameborder and scrolling attributes into my iframe tag.

To get around this, I have to force the iframe tag to have at least one child-node. If it has one child node, then the toString() call will correctly render it with the </iframe> closing tag.

The easiest way I can think of to do this is to simply append an empty HTML comment to the iframe content. This shouldn't have any bearing on the visual rendering of the page; but, it will force the iframe tree-fragment to be non-empty. I'm going to do this before I run the HTML content through htmlParse():

<cfscript>

	```
	<cfsavecontent variable="htmlContent">

		<p>
			Heck, checkout this video:
		</p>

		<p>
			<iframe src="video.mp4"></iframe>
		</p>

	</cfsavecontent>
	```

	// In order to get IFRAME tags to re-serialize with the desired, two-tag format, we
	// have to ensure that the IFRAME contains at least one child-node. In this case, we
	// can use the innocuous COMMENT node to force children.
	htmlContent = htmlContent
		.reReplaceNoCase( "></iframe>", "><!-- --></iframe>", "all" )
	;

	// With the inserted COMMENT, our IFRAME element in the resultant XML document will
	// no longer be empty.
	xmlContent = htmlParse( htmlContent );

	// ... which means, when re-serialized, it will render as <iframe>....</iframe>.
	echo( encodeForHtml( toString( xmlContent.html.body ) ) );

</cfscript>

As you can see, before I call htmlParse(), I'm finding any iframe closing tag that butts-up against another tag artifact (angle bracket) and I'm inserting an empty HTML comment. Now, when we re-serialize the content using the toString() function, we get the following markup (again, I've manually added white-space to make it more readable):

<?xml version="1.0" encoding="UTF-8"?>
<body xmlns="http://www.w3.org/1999/xhtml">
	<p>
		Heck, checkout this video:
	</p>
	<p>
		<iframe frameborder="1" scrolling="auto" src="video.mp4"><!-- --></iframe>
	</p>
</body>

As you can see, because we force the iframe tag to have at least one child node, it now gets re-serialized with the </iframe> closing tag.

To be clear, I'm talking about the iframe tag in this case because that's the tag that caused my page-rendering issues. However, this same rule applies to any HTML tag that has no children. Of course, tags like img and meta are allowed to be self-closing and won't be a problem. It just happens that the iframe tag will break the page if it is self-closing.
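If other empty, non-void tags show up in your content, the same comment-injection trick could be widened to cover them. This is only a sketch under assumptions: the tag list below is illustrative (I've only hit the problem with iframe), and the pattern also tolerates white-space between the opening and closing tags. I've left out tags like textarea and script on purpose, since a comment inside those is treated as text rather than as a comment.

```cfml
<cfscript>

	// ASSUMPTION: this tag list is illustrative. Adjust it to the non-void tags that
	// actually appear, empty, in your own content.
	tagNames = "iframe|video|div|span";

	htmlContent = '<p><iframe src="video.mp4"></iframe></p><div class="spacer">  </div>';

	// Group 1 = the opening tag, group 2 = the tag name, group 3 = the matching
	// closing tag. The \s* allows for white-space between the two tags.
	htmlContent = htmlContent.reReplaceNoCase(
		"(<(#tagNames#)\b[^>]*>)\s*(</\2>)",
		"\1<!-- -->\3",
		"all"
	);

	// Each empty tag in the list should now contain an empty COMMENT node.
	echo( encodeForHtml( htmlContent ) );

</cfscript>
```

The back-reference (\2) keeps the pattern from pairing an opening tag with a closing tag of a different name.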

Ultimately, there may be other ways to deal with HTML parsing and sanitization; such as by using XSLT and xmlTransform() in ColdFusion. However, htmlParse() feels like a nice combination of ease-of-use and powerful functionality. It just happens that it has caveats that you have to watch out for in Lucee CFML.

Reader Comments

@Johannes,

I don't know about any bug reports. The core of the problem is that Lucee is parsing HTML into XML, and XML just has different rules about what is and is not valid. Lately, I've been using jSoup to do the HTML parsing. Yes, it's a 3rd-party library that I now have to pull in; but, it was designed to parse and render HTML specifically, so it knows how to do the right thing.

I was actually just using it the other day to fix some names before rendering some stored content. You might just be curious to see it in action:

www.bennadel.com/blog/4315-using-jsoup-to-fix-post-marriage-name-changes-in-coldfusion-2021.htm
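
As a rough idea of what that looks like, here is a minimal sketch rather than the code from that post. It assumes the jSoup JAR is already on the class path (for example, loaded via this.javaSettings in Application.cfc):

```cfml
<cfscript>

	// ASSUMPTION: the jSoup JAR file has already been made available to Lucee.
	jsoup = createObject( "java", "org.jsoup.Jsoup" );

	htmlContent = '<p>Heck, checkout this video:</p><p><iframe src="video.mp4"></iframe></p>';

	// parseBodyFragment() builds an HTML document around the snippet.
	dom = jsoup.parseBodyFragment( htmlContent );

	// body().html() serializes the BODY contents using HTML rules, not XML rules,
	// so the empty IFRAME should keep its closing tag.
	echo( encodeForHtml( dom.body().html() ) );

</cfscript>
```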
