Over the past week, I've been working to retrofit Markdown onto all of my old blog content using Lucee CFML. It's been an exciting journey with a lot of trial and error. For example, the other day, I realized the .xmlText property wasn't giving me escaped HTML entities; and, just this morning, I realized that iframe tags with no content were getting re-serialized as self-closing tags. While this is valid for XML - any tag with no children can be self-closing - only certain tags in HTML can be self-closing. And, the iframe is not one of them. As such, I had to re-process all of my posts, ensuring that iframe tags were serialized using both an Open and Close tag in Lucee CFML 126.96.36.199.
To see the issue I was running into, let's look at a stand-alone example. In the following ColdFusion code, we're going to parse an HTML snippet using htmlParse(). And then, simply serialize it back to HTML using toString():

    <cfscript>

        ```
        <cfsavecontent variable="htmlContent">
            <p>
                Heck, checkout this video:
            </p>
            <p>
                <iframe src="video.mp4"></iframe>
            </p>
        </cfsavecontent>
        ```

        // The htmlParse() function parses the HTML into an XML document. The rules for XML
        // documents are different than the rules for HTML documents. This can cause a
        // re-serialization problem for non-self-closing tags with empty-content.
        xmlContent = htmlParse( htmlContent );

        // Because the IFRAME element has no child-nodes, stringification of the XML document
        // will render the IFRAME as a SELF-CLOSING tag. This is valid for XML but is NOT
        // valid for HTML.
        echo( encodeForHtml( toString( xmlContent.html.body ) ) );

    </cfscript>

As you can see, the HTML content being parsed contains an iframe tag with no children:

    <iframe src="video.mp4"></iframe>

And, when we serialize this using toString(), we get the following markup (I've manually added white-space to make it more readable):

    <?xml version="1.0" encoding="UTF-8"?>
    <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
            Heck, checkout this video:
        </p>
        <p>
            <iframe frameborder="1" scrolling="auto" src="video.mp4"/>
        </p>
    </body>

As you can see, the iframe tag is being serialized as a self-closing tag, in that it now ends with /> rather than with </iframe>. If I were to try and get the browser to render this iframe, the page would break. It wouldn't throw an error; it would simply hit the <iframe/> tag and stop rendering the rest of the page output.
NOTE: Literally, as I am writing this, I just noticed that the htmlParse() method seems to have injected frameborder and scrolling attributes into my iframe tag.
To get around this, I have to force the iframe tag to have at least one child-node. If it has one child node, then the toString() call will correctly render it with the </iframe> closing tag.
The easiest way I can think of to do this is to simply append an empty HTML comment to the iframe content. This shouldn't have any bearing on the visual rendering of the page; but, it will force the iframe tree-fragment to be non-empty. I'm going to do this before I run the HTML content through htmlParse():

    <cfscript>

        ```
        <cfsavecontent variable="htmlContent">
            <p>
                Heck, checkout this video:
            </p>
            <p>
                <iframe src="video.mp4"></iframe>
            </p>
        </cfsavecontent>
        ```

        // In order to get IFRAME tags to re-serialize with the desired, two-tag format, we
        // have to ensure that the IFRAME contains at least one child-node. In this case, we
        // can use the innocuous COMMENT node to force children.
        htmlContent = htmlContent
            .reReplaceNoCase( "></iframe>", "><!-- --></iframe>", "all" )
        ;

        // With the inserted COMMENT, our IFRAME element in the resultant XML document will
        // no longer be empty.
        xmlContent = htmlParse( htmlContent );

        // ... which means, when re-serialized, it will render as <iframe>....</iframe>.
        echo( encodeForHtml( toString( xmlContent.html.body ) ) );

    </cfscript>

As you can see, before I call htmlParse(), I'm finding any iframe closing tag that butts up against another tag artifact (angle bracket) and I'm inserting an empty HTML comment. Now, when we re-serialize the content using the toString() function, we get the following markup (again, I've manually added white-space to make it more readable):

    <?xml version="1.0" encoding="UTF-8"?>
    <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
            Heck, checkout this video:
        </p>
        <p>
            <iframe frameborder="1" scrolling="auto" src="video.mp4"><!-- --></iframe>
        </p>
    </body>

As you can see, because we force the iframe tag to have at least one child node, it now gets re-serialized with the </iframe> closing tag.
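One limitation of the "></iframe>" pattern is that it only matches when the closing tag sits directly against the preceding angle bracket. As a rough sketch, and not something taken from the code above, a slightly more tolerant variation could capture the opening iframe tag and allow white-space before the closing tag. The padEmptyIframes() name is hypothetical, and whether Lucee's htmlParse() actually collapses white-space-only iframes is an assumption that would need to be checked:

```cfml
<cfscript>

	// HYPOTHETICAL helper (not part of Lucee): injects an empty HTML comment into
	// any IFRAME whose body is empty or contains only white-space.
	function padEmptyIframes( required string html ) {

		// \1 is the full opening tag (including attributes); \2 is the closing tag.
		// Any white-space between the two is replaced by the comment.
		return html.reReplaceNoCase(
			"(<iframe\b[^>]*>)\s*(</iframe>)",
			"\1<!-- -->\2",
			"all"
		);

	}

	htmlContent = '<p><iframe src="video.mp4">  </iframe></p>';

	// Pad the IFRAME before handing the markup to htmlParse(), as before.
	echo( encodeForHtml( padEmptyIframes( htmlContent ) ) );

</cfscript>
```

Capturing the opening tag, rather than just the closing angle bracket, keeps the replacement scoped to iframe elements even when other tags precede the closing tag.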
To be clear, I'm talking about the iframe tag in this case because that's the tag that caused my page-rendering issues. However, this same rule applies to any HTML tag that has no children. Of course, tags like meta are allowed to be self-closing and won't be a problem. It just happens that the iframe tag will break the page if it is self-closing.
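If other empty, non-void tags turn out to be a problem, the same comment-injection idea could, in principle, be applied to a list of tag names. The following is only a sketch: the tag list and sample markup are made up for illustration, and the list is far from exhaustive:

```cfml
<cfscript>

	// Sample input (illustrative only).
	htmlContent = '<div class="spacer"></div><p><span></span><iframe src="video.mp4"></iframe></p>';

	// ASSUMPTION: a hand-picked list of tags that must not be self-closing. Tags
	// whose content is treated as raw text (such as TEXTAREA) are deliberately
	// left out, since the injected comment would show up there as literal text.
	tagNames = [ "iframe", "div", "span" ];

	for ( tagName in tagNames ) {

		// Same technique as with the IFRAME: capture the opening tag, then inject
		// an empty comment so that the element has at least one child-node.
		htmlContent = htmlContent.reReplaceNoCase(
			"(<#tagName#\b[^>]*>)(</#tagName#>)",
			"\1<!-- -->\2",
			"all"
		);

	}

	echo( encodeForHtml( htmlContent ) );

</cfscript>
```

Looping over the tag names keeps each pattern simple and avoids relying on back-references inside the regular expression itself.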
Ultimately, there may be other ways to deal with HTML parsing and sanitization; such as by using XSLT and xmlTransform() in ColdFusion. However, htmlParse() feels like a nice combination of ease-of-use and powerful functionality. It just happens that it has caveats that you have to watch out for in Lucee CFML.
Hi, I had the same problem. Self-closing iframes are invalid HTML. I wonder if there is a bug report anywhere.
I found this solution also very helpful: https://stackoverflow.com/q/41890415/1337474
Thanks for sharing!
I don't know about any bug reports. The core of the problem is that Lucee is parsing HTML into XML, and XML just has different rules about what is and is not valid. Lately, I've been using jSoup to do the HTML parsing. Yes, it's a 3rd-party library that I now have to pull in; but, it was designed to parse and render HTML specifically, so it knows how to do the right thing.
I was actually just using it the other day to fix some names before rendering some stored content. You might just be curious to see it in action: