Over the past week, I've been working to retrofit Markdown onto all of my old blog content using Lucee CFML. It's been an exciting journey with a lot of trial and error. For example, the other day, I realized the
.xmlText property wasn't giving me escaped HTML entities; and, just this morning, I realized that
iframe tags with no content were getting re-serialized as self-closing tags. While this is valid for XML - any tag with no children can be self-closing - only certain tags in HTML can be self-closing. And, the
iframe is not one of them. As such, I had to re-process all of my posts, ensuring that
iframe tags were serialized using both an Open and Close tag in Lucee CFML 188.8.131.52.
To see the issue I was running into, let's look at a stand-alone example. In the following ColdFusion code, we're going to parse an HTML snippet using
htmlParse(). And then, simply serialize it back to HTML using
toString():
<cfscript>

	```
	<cfsavecontent variable="htmlContent">
		<p>
			Heck, checkout this video:
		</p>
		<p>
			<iframe src="video.mp4"></iframe>
		</p>
	</cfsavecontent>
	```

	// The htmlParse() function parses the HTML into an XML document. The rules for XML
	// documents are different than the rules for HTML documents. This can cause a
	// re-serialization problem for non-self-closing tags with empty-content.
	xmlContent = htmlParse( htmlContent );

	// Because the IFRAME element has no child-nodes, stringification of the XML document
	// will render the IFRAME as a SELF-CLOSING tag. This is valid for XML but is NOT
	// valid for HTML.
	echo( encodeForHtml( toString( xmlContent.html.body ) ) );

</cfscript>
As you can see, the HTML content being parsed contains an
iframe tag with no children:

<iframe src="video.mp4"></iframe>
And, when we serialize this using
toString(), we get the following markup (I've manually added white-space to make it more readable):
<?xml version="1.0" encoding="UTF-8"?>
<body xmlns="http://www.w3.org/1999/xhtml">
	<p>
		Heck, checkout this video:
	</p>
	<p>
		<iframe frameborder="1" scrolling="auto" src="video.mp4"/>
	</p>
</body>
As you can see, the
iframe tag is being serialized as a self-closing tag, in that it now ends with
/> rather than with
</iframe>. If I were to try and get the browser to render this
iframe, the page would break. It wouldn't throw an error; it would simply hit the
<iframe/> tag and stop rendering the rest of the page output.
NOTE: Literally, as I am writing this, I just noticed that the
htmlParse() method seems to have injected
frameborder and scrolling attributes into my
iframe tag.
To get around the self-closing problem, I have to force the
iframe tag to have at least one child-node. If it has one child node, then the
toString() call will correctly render it with the
</iframe> closing tag.
The easiest way I can think of to do this is to simply append an empty HTML comment to the
iframe content. This shouldn't have any bearing on the visual rendering of the page; but, it will force the
iframe tree-fragment to be non-empty. I'm going to do this before I run the HTML content through
htmlParse():
<cfscript>

	```
	<cfsavecontent variable="htmlContent">
		<p>
			Heck, checkout this video:
		</p>
		<p>
			<iframe src="video.mp4"></iframe>
		</p>
	</cfsavecontent>
	```

	// In order to get IFRAME tags to re-serialize with the desired, two-tag format, we
	// have to ensure that the IFRAME contains at least one child-node. In this case, we
	// can use the innocuous COMMENT node to force children.
	htmlContent = htmlContent
		.reReplaceNoCase( "></iframe>", "><!-- --></iframe>", "all" )
	;

	// With the inserted COMMENT, our IFRAME element in the resultant XML document will
	// no longer be empty.
	xmlContent = htmlParse( htmlContent );

	// ... which means, when re-serialized, it will render as <iframe>....</iframe>.
	echo( encodeForHtml( toString( xmlContent.html.body ) ) );

</cfscript>
As you can see, before I call
htmlParse(), I'm finding any
iframe closing tag that butts up directly against the end of the preceding tag (its closing angle bracket) and I'm inserting an empty HTML comment between the two. Now, when we re-serialize the content using the
toString() function, we get the following markup (again, I've manually added white-space to make it more readable):
<?xml version="1.0" encoding="UTF-8"?>
<body xmlns="http://www.w3.org/1999/xhtml">
	<p>
		Heck, checkout this video:
	</p>
	<p>
		<iframe frameborder="1" scrolling="auto" src="video.mp4"><!-- --></iframe>
	</p>
</body>
As you can see, because we force the
iframe tag to have at least one child node, it now gets re-serialized with the
</iframe> closing tag.
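As a point of comparison, the same problem could also be approached from the other side, by repairing the markup after it has been serialized instead of padding it before parsing. The following is only a minimal sketch of that alternative, not the approach used in this post: the variable names (serialized, repaired) are made up for illustration, the sample string is trimmed down from the toString() output shown earlier, and the pattern assumes that no attribute value contains an angle bracket.

```cfml
<cfscript>

	// Sample serialized markup (assumption: trimmed down from the toString() output
	// shown earlier in this post).
	serialized = '<p><iframe frameborder="1" scrolling="auto" src="video.mp4"/></p>';

	// Expand any self-closing IFRAME into an explicit open / close pair. The \1
	// back-reference carries the captured attributes over unchanged.
	repaired = reReplaceNoCase(
		serialized,
		"<iframe\b([^>]*)/>",
		"<iframe\1></iframe>",
		"all"
	);

	echo( encodeForHtml( repaired ) );

</cfscript>
```

The trade-off is that this operates on the serialized string, so it has to run after every toString() call; the comment-injection approach only has to touch the input once.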
To be clear, I'm talking about the
iframe tag in this case because that's the tag that caused my page-rendering issues. However, this same rule applies to any HTML tag that has no children. Of course, tags like
meta are allowed to be self-closing and won't be a problem. It just happens that the
iframe tag will break the page if it is self-closing.
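Since the rule applies to any empty, non-self-closing tag, the comment-injection step could be generalized to a list of tag names. What follows is a rough sketch under a few assumptions: padEmptyTags() is a hypothetical helper (it is not a built-in function), the tag list is just an example, and the pattern assumes that attribute values don't contain angle brackets.

```cfml
<cfscript>

	// Hypothetical helper (not a built-in function): pads empty elements with an
	// HTML comment so that they keep their closing tag after the parsed document
	// is re-serialized.
	function padEmptyTags( required string html, required string tagNames ) {

		// Turn a list like "iframe,div" into the regex alternation "iframe|div".
		var tagPattern = listChangeDelims( tagNames, "|" );

		// Group 1 = the open tag, group 2 = the tag name, group 3 = the close tag
		// that immediately follows (matched via the \2 back-reference).
		return reReplaceNoCase(
			html,
			"(<(#tagPattern#)\b[^>]*>)(</\2>)",
			"\1<!-- -->\3",
			"all"
		);

	}

	// Example list of tags; adjust to whatever tags show up empty in your content.
	padded = padEmptyTags( '<p><iframe src="video.mp4"></iframe></p>', "iframe,div" );

	echo( encodeForHtml( padded ) );

</cfscript>
```

One caveat with this kind of generalization: tags like textarea and title treat their content as text, so a comment injected there would show up literally instead of being ignored. The list of tag names is best kept to elements where a comment is inert.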
Ultimately, there may be other ways to deal with HTML parsing and sanitization; such as by using XSLT and
xmlTransform() in ColdFusion. However,
htmlParse() feels like a nice combination of ease-of-use and powerful functionality. It just happens that it has caveats that you have to watch out for in Lucee CFML.