Retrofitting Markdown Onto 15-Years Of Articles Using htmlParse(), XPath, And Lucee CFML 5.3.4.80

By Ben Nadel

Published 2020-03-16 in ColdFusion — Comments (5)

In 2019, I finally dumped my Windows VirtualBox and XStandard ActiveX blog authoring and moved my content-creation workflow over to using Markdown in ColdFusion. Markdown has been a total joy to work with for new articles; however, I still have 15 years of old content that is hard-coded as HTML in my database. In order to make those old articles editable as Markdown, I wanted to see if I could programmatically convert the HTML content over to Markdown using Lucee CFML 5.3.4.80.

I currently use Flexmark to convert my Markdown content into HTML in ColdFusion. The Flexmark library is awesome! And, it happens to have an extension that can be used to convert HTML to Markdown. However, customizing the logic of said extension requires Java expertise - expertise that I do not have. As such, I knew that whatever solution I came up with had to entail raw Lucee CFML / ColdFusion logic.

Historically, when using Adobe ColdFusion, I would parse HTML using something like TagSoup or jSoup. But, now that I've embraced the unyielding power of Lucee CFML, I can parse HTML natively with the htmlParse() function.

The Document Object Model (DOM) produced by the htmlParse() function is XML. XML is not the most enjoyable document-type to work with; but, ColdFusion includes XPath search functionality, which means that we should be able to move around and extract portions of the generated XML document pretty easily.
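As a rough illustration of that idea, here is a minimal sketch that parses a tiny fragment and then locates an element with XPath. The fragment and the variable names are hypothetical. Since the parsed elements carry an XML name-space (something the full code below deals with by stripping the name-spaces), this sketch matches on local-name() rather than on the bare element name:

```cfml
<cfscript>

	// Hypothetical HTML fragment, standing in for a stored blog entry.
	htmlFragment = "<p>Hello <b>world</b></p>";

	// The htmlParse() function returns an XML document that wraps the fragment.
	dom = htmlParse( htmlFragment );

	// The parsed elements are name-spaced, so match on the local name instead of
	// using a bare "//b" path.
	boldNodes = dom.search( "//*[ local-name() = 'b' ]" );

	// Output the text of the first (and only) matching element.
	echo( encodeForHtml( boldNodes.first().xmlText ) );

</cfscript>
```

Matching on local-name() is fine for a one-off lookup; but, it gets noisy quickly, which is why removing the name-spaces up-front makes the rest of the XPath queries easier to read.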

Now, while I use Lucee CFML at work, I still use Adobe ColdFusion for my blog. This is because I pay for managed-hosting (which has better tooling and support for Adobe ColdFusion). That said, through the miracle of CommandBox, I am still using Lucee CFML in my local development environment to run this task against a local copy of my database.

The ColdFusion code that powers this migration took me all weekend to write. In fact, the code that I'm sharing below is my second attempt that, more-or-less, completely replaced my first attempt. The output of this transformation is not perfect. The hope was just to get 90% of the way there such that whenever I need to go back and edit an old post, I'd only have to tweak the Markdown in order to get it ready for re-processing.

The algorithm below is not generic in any way. It is completely custom-tailored for my content and looks for CSS classes and other DOM-related "hints" that I've used over the years in my HTML.

As you look through the code, you'll see that it has explicit checks for Elements and CSS classes; and, throws an error whenever it encounters anything it doesn't expect. I took this approach so that the code would break every time it encountered something new. This would give me a chance to look at the given HTML content and figure out what kind of Markdown it should produce.

Note that I am using Tag Islands to write a CFQuery tag in my CFScript. I freaking love that feature of Lucee CFML so hard!

With that said, here's my ColdFusion code. It finds all of the blog-entries that lack Markdown, loops over them, and converts them in turn. Right now, I'm just writing the Markdown to .md files; but, eventually, this will turn into an UPDATE SQL statement.

<cfscript>

	param name="url.id" type="numeric" default="0";

	```
	<cfquery name="posts" datasource="bennadel" returntype="array">
		SELECT
			e.id,
			e.name,
			e.content,
			e.formatted_content
		FROM
			blog_entry e
		WHERE
			NOT LENGTH( LEFT( e.content_markdown, 10 ) )

			<cfif url.id>
				AND
					e.id = <cfqueryparam value="#url.id#" sqltype="integer" />
			</cfif>
		ORDER BY
			e.id ASC

		<cfif ! url.id>
			LIMIT
				100
			OFFSET
				0
		</cfif>
		;
	</cfquery>
	```

	// Setup some short-hands.
	newline = chr( 10 );
	newline2 = newline.repeatString( 2 );
	tab = chr( 9 );

	// For each post, write-out the markdown version to a file to help with debugging
	// the conversion process.
	for ( post in posts ) {

		echo( "<a href='https://localhost/blog/#post.id#-my-post.htm##site-content' target='preview'>" );
		echo( "Processing #post.id# -- #encodeForHtml( post.name )#" );
		echo( "</a> <br />" );

		// Over time, the storage of the content moved around depending on what level of
		// formatting was being applied.
		htmlContent = ( post.formatted_content.len() )
			? post.formatted_content
			: post.content
		;

		try {

			markdownContent = convertToMarkdown( htmlContent );

			// Write the converted content to file.
			// --
			// NOTE: Eventually, this will be a database UPDATE; but, for now, let's
			// write it to a file so we can review the result without committing to it.
			fileWrite(
				"./output/#numberFormat( post.id, '00000' )#.md",
				markdownContent,
				"utf-8"
			);

			// If we're limiting scope to a single post, output the results to the page.
			if ( posts.len() == 1 ) {

				echo( "<a href='./index.cfm?id=#( url.id + 1 )#'>next</a> <br />" );
				echo( "<br />" );
				echo( "<pre style='white-space: pre-wrap ; tab-size: 4 ;'>" );
				echo( encodeForHtml( markdownContent ) );
				echo( "</pre>" );

			}

		} catch ( any error ) {

			echo( "<hr />" );
			dump( htmlContent );
			echo( "<hr />" );
			dump( error );
			dump( post );
			abort;

		}

	}

	// ------------------------------------------------------------------------------- //
	// CONVERSION METHODS.
	// ------------------------------------------------------------------------------- //

	/**
	* I convert the given HTML content to markdown content.
	*/
	public string function convertToMarkdown( required string htmlContent ) {

		// The HTML content of my blog entries consist of HTML fragments (not a valid
		// website). However, the htmlParse() function will automatically create a BODY
		// tag that houses those fragments. As such, we can locate the parsed markup by
		// getting the children of the BODY tag.
		var contentDom = htmlParseNoNamespaces( htmlContent );
		var bodyNode = contentDom.search( "//body" ).first();

		return( convertNodesToMarkdown( "", "", newline2, bodyNode.xmlChildren ) );

	}


	/**
	* I convert the given XML node-list to markdown content.
	*/
	public string function convertNodesToMarkdown(
		required string prefixFirst,
		required string prefixRest,
		required string infix,
		required array nodes
		) {

		var markdownNodes = nodes
			// Since we are traversing HTML markup, filter-OUT any newline character
			// nodes. They may not mean anything in HTML; but, they are "meaningful" in
			// a markdown context. As such, we don't want our output getting confused.
			.filter(
				( node ) => {

					return( ! isNewlineNode( node ) );

				}
			)
			.map(
				( node ) => {

					// SPECIAL CASE FOR NESTED LISTS. My logic around mixing inline and
					// block elements is not good - as such, I'm just hacking this
					// special case right into the core traversal.
					if (
						prefixFirst.len() &&
						( node.getNodeType() == "ELEMENT_NODE" ) &&
						( ( node.xmlName == "ol" ) || ( node.xmlName == "ul" ) )
						) {

						return( newline & convertNodeToMarkdown( node ) );

					} else {

						return( convertNodeToMarkdown( node ) );

					}

				}
			)
		;

		var markdownContent = markdownNodes.toList( infix );

		// If any of the lines have to be prefixed, we have to split the content and
		// then map it back onto a prefixed version.
		if ( prefixFirst.len() || prefixRest.len() ) {

			markdownContent = markdownContent
				.listToArray( newline )
				.map(
					( markdownLine, i ) => {

						if ( i == 1 ) {

							return( prefixFirst & markdownLine );

						} else {

							return( prefixRest & markdownLine );

						}

					}
				)
				.toList( newline )
			;

		}

		return( markdownContent );

	}


	/**
	* I convert the given node to markdown.
	* 
	* CAUTION: I am being EXTREMELY explicit about which elements are expected, throwing
	* an error for any discovered element that was not expected. I did this so that the
	* code would break every time it came across something I hadn't planned-for. This
	* would give me an opportunity to examine the offending code and write an explicit
	* use-case for it. I do the same for all CSS class names as well (later on).
	*/
	public string function convertNodeToMarkdown( required xml node ) {

		if ( node.getNodeType() == "TEXT_NODE" ) {

			return( escapeMarkdown( node.xmlText ) );

		}

		switch ( node.xmlName ) {
			case "a":

				return( convertNodeToMarkdown_A( node ) );

			break;
			case "b":
			case "strong":

				return( convertNodeToMarkdown_B( node ) );

			break;
			case "br":

				return( convertNodeToMarkdown_BR( node ) );

			break;
			case "div":

				return( convertNodeToMarkdown_DIV( node ) );

			break;
			case "h1":
			case "h2":
			case "h3":
			case "h4":
			case "h5":

				return( convertNodeToMarkdown_H( node ) );

			break;
			case "i":
			case "em":

				return( convertNodeToMarkdown_I( node ) );

			break;
			case "img":

				return( serializeXmlNode( node ) );

			break;
			case "ol":

				return( convertNodeToMarkdown_OL( node ) );

			break;
			case "p":

				return( convertNodeToMarkdown_P( node ) );

			break;
			case "span":

				return( convertNodeToMarkdown_SPAN( node ) );

			break;
			case "table":

				return( convertNodeToMarkdown_TABLE( node ) );

			break;
			case "ul":

				return( convertNodeToMarkdown_UL( node ) );

			break;
			default:

				dump( node );
				throw( type = "UnexpectedNodeName" );

			break;
		}

	}


	public string function convertNodeToMarkdown_A( required xml node ) {

		var anchorText = convertNodesToMarkdown( "", "", "", node.xmlNodes );
		var anchorLink = node.xmlAttributes.href;

		return( "[#anchorText#](#anchorLink#)" );

	}


	public string function convertNodeToMarkdown_B( required xml node ) {

		var boldText = convertNodesToMarkdown( "", "", "", node.xmlNodes );

		return( "**" & boldText & "**" );

	}


	public string function convertNodeToMarkdown_BR( required xml node ) {

		return( "  " & newline );

	}


	public string function convertNodeToMarkdown_DIV( required xml node ) {

		var className = ( node.xmlAttributes.class ?: "" );

		// Special case for really really really old code formatting.
		if ( ( className == "code" ) && node.search( "./p" ).len() && ! node.search( "./ul" ).len() ) {

			return( convertNodeToMarkdown_DIV_INDENT( node ) );

		}

		switch ( className ) {
			case "code":
			case "codefixed":

				return( convertNodeToMarkdown_DIV_CODE( node ) );

			break;
			case "hrule":

				return( convertNodeToMarkdown_DIV_HRULE( node ) );

			break;
			case "seo":

				return( convertNodeToMarkdown_DIV_SEO( node ) );

			break;
			case "stacktrace":

				return( convertNodeToMarkdown_DIV_STACKTRACE( node ) );

			break;
			default:

				dump( node );
				throw( type = "UnexpectedClassName" );

			break;
		}

	}


	public string function convertNodeToMarkdown_DIV_CODE( required xml node ) {

		var linesOfCode = node.search( "./ul/li" ).map(
			( node ) => {

				if (
					( node.xmlChildren.len() == 1 ) &&
					( node.xmlChildren[ 1 ].xmlName == "br" )
					) {

					return( "" );

				}

				var tabCount = ( node.xmlAttributes.keyExists( "class" ) )
					? val( node.xmlAttributes.class.replace( "tab", "" ) )
					: 0
				;

				return( tab.repeatString( tabCount ) & unescapeCode( getNodeText( node ).trim() ) );

			}
		);

		var codeContent = linesOfCode.toList( newline );
		var fence = getCodeFence( codeContent );

		if ( node.xmlAttributes.keyExists( "data-gist-filename" ) ) {

			var fileName = node.xmlAttributes[ 'data-gist-filename' ];
			var fileExt = listLast( fileName, "." );

			return(
				"<div data-gist-filename=""#fileName#"" class=""code"">" & newline2 &
				fence & fileExt & newline &
				codeContent & newline &
				fence & newline2 &
				"</div>"
			);

		} else {

			return(
				"<div class=""code"">" & newline2 &
				fence & newline &
				codeContent & newline &
				fence & newline2 &
				"</div>"
			);

		}

	}


	public string function convertNodeToMarkdown_DIV_HRULE( required xml node ) {

		return( "----" );

	}


	public string function convertNodeToMarkdown_DIV_INDENT( required xml node ) {

		var linesOfCode = node.search( "./p" ).map(
			( node ) => {

				var tabCount = ( node.xmlAttributes.keyExists( "class" ) )
					? 1
					: 0
				;

				var lineContent = node.xmlNodes.map(
					( childNode ) => {

						if (
							( childNode.getNodeType() == "ELEMENT_NODE" ) && 
							( childNode.xmlName == "br" )
							) {

							return( newline );

						}

						return( childNode.xmlText );

					}
				).toList( tab.repeatString( tabCount ) );

				return( tab.repeatString( tabCount ) & unescapeCode( lineContent ) );

			}
		);

		var codeContent = linesOfCode.toList( newline );
		var fence = getCodeFence( codeContent );

		if ( node.xmlAttributes.keyExists( "data-gist-filename" ) ) {

			var fileName = node.xmlAttributes[ 'data-gist-filename' ];
			var fileExt = listLast( fileName, "." );

			return(
				"<div data-gist-filename=""#fileName#"" class=""code"">" & newline2 &
				fence & fileExt & newline &
				codeContent & newline &
				fence & newline2 &
				"</div>"
			);

		} else {

			return(
				"<div class=""code"">" & newline2 &
				fence & newline &
				codeContent & newline &
				fence & newline2 &
				"</div>"
			);

		}

	}


	public string function convertNodeToMarkdown_DIV_SEO( required xml node ) {

		return(
			"<!--" & newline &
			serializeXmlNode( node ) & newline &
			"-->"
		);

	}


	public string function convertNodeToMarkdown_DIV_STACKTRACE( required xml node ) {

		var fence = getCodeFence( node.xmlChildren[ 1 ].xmlText );

		return(
			fence & "txt" & newline &
			node.xmlChildren[ 1 ].xmlText & newline &
			fence
		);

	}


	public string function convertNodeToMarkdown_H( required xml node ) {

		var nodeName = node.xmlName;
		var titleHeading = val( right( nodeName, 1 ) );
		var titleText = convertNodesToMarkdown( "", "", "", node.xmlNodes );
		var className = ( node.xmlAttributes.class ?: "" );

		switch ( className ) {
			case "":

				return( "##".repeatString( titleHeading ) & " " & titleText );

			break;
			default:

				dump( node );
				throw( type = "UnexpectedClassName" );

			break;
		}

	}


	public string function convertNodeToMarkdown_I( required xml node ) {

		var italicText = convertNodesToMarkdown( "", "", "", node.xmlNodes );

		return( "_" & italicText & "_" );

	}


	public string function convertNodeToMarkdown_OL( required xml node ) {

		var className = ( node.xmlAttributes.class ?: "" );

		switch ( className ) {
			case "paragraphspacing":
			case "":

				var infix = ( className == "paragraphspacing" )
					? newline2
					: newline
				;

				var listItems = node.search( "./li" ).map(
					( childNode ) => {

						return( convertNodesToMarkdown( "1. ", "   ", "", removeTrailingBR( childNode.xmlNodes ) ) );

					}
				);

				// Add trailing comment to make sure this doesn't bleed into next list.
				listItems.append( "<!-- -->" );

				return( listItems.toList( infix ) );

			break;
			default:

				dump( node );
				throw( type = "UnexpectedClassName" );

			break;
		}

	}


	public string function convertNodeToMarkdown_P( required xml node ) {

		// Special case for really really really old code formatting.
		if ( node.search( ".//span[ @class = 'cfmarkup' ]" ).len() ) {

			return( convertNodeToMarkdown_P_MARKUP( node, "cfmarkup", "cfml" ) );

		// Special case for really really really old code formatting.
		} else if ( node.search( ".//span[ @class = 'htmlmarkup' ]" ).len() ) {

			return( convertNodeToMarkdown_P_MARKUP( node, "htmlmarkup", "html" ) );

		// Special case for image wrapper.
		} else if (
			( node.xmlChildren.len() == 1 ) &&
			( node.xmlChildren[ 1 ].xmlName == "img" )
			) {

			return( convertNodeToMarkdown_P_IMG( node ) );

		// Special case for video wrapper.
		} else if (
			( node.xmlChildren.len() == 1 ) &&
			( node.xmlChildren[ 1 ].xmlName == "object" )
			) {

			return( convertNodeToMarkdown_P_OBJECT( node ) );

		}

		var className = ( node.xmlAttributes.class ?: "" );

		switch ( className ) {
			case "blockquote":
			case "indented":
			case "question":

				return( convertNodesToMarkdown( "> ", "> ", "", node.xmlNodes ) );

			break;
			case "":

				return( convertNodesToMarkdown( "", "", "", node.xmlNodes ) );

			break;
			default:

				dump( node );
				throw( type = "UnexpectedClassName" );

			break;
		}

	}


	public string function convertNodeToMarkdown_P_IMG( required xml node ) {

		return(
			"<div class=""m-image-tile"">" & newline &
			tab & serializeXmlNode( node.xmlChildren[ 1 ] ) & newline &
			"</div>"
		);

	}


	public string function convertNodeToMarkdown_P_MARKUP(
		required xml node,
		required string className,
		required string languagePrefix
		) {

		var spanNodes = node.search( "./span" );

		if ( spanNodes.len() != 1 ) {

			dump( spanNodes );
			throw( type = "UnexpectedNodesLength" );

		}

		var linesOfCode = spanNodes.first().xmlNodes
			.filter(
				( node ) => {

					switch ( node.getNodeType() ) {
						case "ELEMENT_NODE":

							switch ( node.xmlName ) {
								case "br":
									return( false );
								break;
								default:
									dump( node );
									throw( type = "UnexpectedMarkupNodeType" );
								break;
							}

						break;
						default:
							return( true );
						break;
					}

				}
			)
			.map(
				( node ) => {

					return( unescapeCode( node.xmlText ) );

				}
			)
		;

		var codeContent = linesOfCode.toList( newline );
		var fence = getCodeFence( codeContent );

		return(
			fence & languagePrefix & newline &
			codeContent & newline &
			fence
		);

	}


	public string function convertNodeToMarkdown_P_OBJECT( required xml node ) {

		return(
			"<div class=""m-video-tile"">" & newline &
			tab & serializeXmlNode( node.xmlChildren[ 1 ] ) & newline &
			"</div>"
		);

	}


	public string function convertNodeToMarkdown_SPAN( required xml node ) {

		var className = ( node.xmlAttributes.class ?: "" );

		switch ( className ) {
			case "asdf":
			case "highlight":

				return( serializeXmlNode( node ) );

			break;
			case "red":

				return( convertNodesToMarkdown( "", "", "", node.xmlNodes ) );

			break;
			default:

				dump( node );
				throw( type = "UnexpectedClassName" );

			break;
		}

	}


	public string function convertNodeToMarkdown_TABLE( required xml node ) {

		var className = ( node.xmlAttributes.class ?: "" );

		switch ( className ) {
			case "imageborder":

				return( convertNodeToMarkdown_TABLE_IMGBORDER( node ) );

			break;
			default:

				dump( node );
				throw( type = "UnexpectedClassName" );

			break;
		}

	}


	public string function convertNodeToMarkdown_TABLE_IMGBORDER( required xml node ) {

		var imageContainer = node.search( ".//tr[ 2 ]/td[ 2 ]" ).first();

		var containerType = ( serializeXmlNode( imageContainer ).reFindNoCase( "vimeo|youtube" ) )
			? "m-video-tile"
			: "m-image-tile"
		;

		return(
			"<div class=""#containerType#"">" & newline &
			tab & serializeXmlNode( imageContainer.xmlChildren[ 1 ] ) & newline &
			"</div>"
		);

	}


	public string function convertNodeToMarkdown_UL( required xml node ) {

		var className = ( node.xmlAttributes.class ?: "" );

		switch ( className ) {
			case "--na1":
			case "--na2":
			case "--na3":
			case "--na4":
			case "paragraphspacing":
			case "":

				var infix = ( className == "paragraphspacing" )
					? newline2
					: newline
				;

				var listItems = node.search( "./li" ).map(
					( childNode ) => {

						return( convertNodesToMarkdown( "* ", "  ", "", removeTrailingBR( childNode.xmlNodes ) ) );

					}
				);

				// Add trailing comment to make sure this doesn't bleed into next list.
				listItems.append( "<!-- -->" );

				return( listItems.toList( infix ) );

			break;
			default:

				dump( node );
				throw( type = "UnexpectedClassName" );

			break;
		}

	}

	// ------------------------------------------------------------------------------- //
	// UTILITY METHODS.
	// ------------------------------------------------------------------------------- //

	/**
	* I escape any embedded special characters that may be interpreted as markdown.
	*/
	public string function escapeMarkdown( required string content ) {

		return( content.reReplace( "([\\`*_{}##()\[\]])", "\\1", "all" ) );

	}


	/**
	* I return a code-fence that will be sufficiently long for any content that contains
	* embedded code-fence syntax.
	*/
	public string function getCodeFence( required string content ) {

		var backticks = content.reMatch( "`+" );

		// Sort backtick matches by length (longest first).
		backticks.sort(
			( a, b ) => {

				return( b.len() - a.len() );

			}
		);

		// If there are no backticks in the code; or, if the longest backticks match is
		// less than three, return three, which is the standard code fence delimiter.
		if ( ! backticks.len() || ( backticks[ 1 ].len() < 3 ) ) {

			return( "```" );

		}

		// Double whatever the longest embedded set of backticks is.
		return( backticks[ 1 ].repeatString( 2 ) );

	}


	/**
	* I "unwrap" the text from the given node, concatenating all of the text within the
	* set of nested elements.
	*/
	public string function getNodeText( required xml node ) {

		return( node.search( "normalize-space( string( . ) )" ) );

	}


	/**
	* When the native htmlParse() function runs, it includes XML name-spaces which make
	* it much harder to search the subsequent document using XPath. This method strips
	* those XML name-spaces from the parsed document, allowing XPath to target node-names
	* more directly.
	*/
	public xml function htmlParseNoNamespaces( required string htmlMarkup ) {

		// To strip out the name-spaces, we're going to use XSLT (XML Transforms). The
		// following XSLT document will traverse the parsed HTML document and copy nodes
		// over to a new output string using only the node names.
		// --
		// Read More:  https://www.ibm.com/support/knowledgecenter/en/ssw_ibm_i_74/rzasp/rzaspxml4369.htm
		var removeNamespacesXSLT = trim('
			<xsl:stylesheet
				version="1.0"
				xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

				<xsl:output
					method="xml"
					version="1.0"
					encoding="UTF-8"
					indent="yes"
				/>

			 	<!-- Keep comment nodes. -->
			 	<xsl:template match="comment()">
					<xsl:copy>
						<xsl:apply-templates />
					</xsl:copy>
				</xsl:template>

				<!-- Keep element nodes. -->
				<xsl:template match="*">
					<!-- Remove element prefix. -->
					<xsl:element name="{ local-name() }">
						<!-- Process attributes. -->
						<xsl:for-each select="@*">
							<!-- Remove attribute prefix. -->
							<xsl:attribute name="{ local-name() }">
								<xsl:value-of select="." />
							</xsl:attribute>
						</xsl:for-each>

						<!-- Copy child nodes. -->
						<xsl:apply-templates />
					</xsl:element>
				</xsl:template>

			</xsl:stylesheet>
		');

		// In order to remove the name-spaces, we have to parse the document twice -
		// once to parse the HTML into an XML document. Then, once again to parse the
		// transformed XML string (less the name-spaces) back into an actual XML
		// document that we can search using XPath.
		return( xmlParse( htmlParse( htmlMarkup ).transform( removeNamespacesXSLT ) ) );

	}


	/**
	* I determine if the given node is a newline text node.
	*/
	public boolean function isNewlineNode( required xml node ) {

		return( ( node.getNodeType() == "TEXT_NODE" ) && ( node.xmlText == newline ) );

	}


	/**
	* I remove any trailing BR and subsequent white-space.
	*/
	public array function removeTrailingBR( required array nodes ) {

		var length = nodes.len();

		if (
			( length > 2 ) &&
			( nodes[ length ].getNodeType() == "TEXT_NODE" ) &&
			( nodes[ length ].xmlText.trim() == "" ) &&
			( nodes[ length - 1 ].getNodeType() == "ELEMENT_NODE" ) &&
			( nodes[ length - 1 ].xmlName == "br" )
			) {

			return( nodes.slice( 1, ( length - 2 ) ) );

		} else {

			return( nodes );

		}

	}


	/**
	* I generate the string representation of the given node.
	*/
	public string function serializeXmlNode( required xml node ) {

		var serializedNode = toString( node ).trim();

		// Removes: <?xml version="1.0" encoding="UTF-8"?>
		return( listRest( serializedNode, ">" ) );

	}


	/**
	* I unescape HTML that is embedded within a code-block.
	*/
	public string function unescapeCode( required string escapedCode ) {

		var code = escapedCode
			.replace( "&lt;", "<", "all" )
			.replace( "&gt;", ">", "all" )
			.replace( "&quot;", """", "all" )
		;

		return( code );

	}

</cfscript>

Whoa, that's a lot of code :D But, remember, I wrote this very iteratively as each new artifact raised an exception. I ran this in chunks of 500 using LIMIT and OFFSET in my CFQuery tag. And, when it was done, I ended up with a directory full of .md files:

A Markdown file generated using htmlParse(), XPath, and Lucee CFML.

Converting HTML to Markdown is a messy process. Especially when the source HTML contains artifacts that have evolved over a long period of time. Thankfully, Lucee CFML has some pretty powerful tooling like htmlParse() and XPath that makes this possible. Again, this algorithm isn't generic or perfect. But, it gets me 95% of the way there. I'm just thrilled to be inching towards a complete Markdown authoring solution.

Short link: https://bennadel.com/3788

Reader Comments

Ben Nadel Mar 18, 2020 at 7:41 AM

@All,

After posting this, I stumbled upon something that had not been obvious to me before. Escaped HTML entities are no longer escaped when I read them out through the .xmlText property of the resultant XML document:

www.bennadel.com/blog/3789-reading-xmltext-values-from-the-xml-document-produced-by-htmlparse-in-lucee-cfml-5-3-4-80.htm

... I had to get around this by using the toString() function on the given TEXT_NODE and then stripping off the XML DOCTYPE.

Ben Nadel Mar 19, 2020 at 5:21 AM

@All,

After posting this, and discovering the issue with .xmlText (see comment above), I went back and updated my getNodeText() function to be this:

/**
* I "unwrap" the text from the given node, concatenating all of the text within the
* set of nested elements.
*/
public string function getNodeText( required xml node ) {

	if ( node.getNodeType() == "TEXT_NODE" ) {

		// The only way to get the raw, original text with embedded escaped entities
		// is to stringify the text-node and then strip off the XML DOCTYPE that gets
		// prepended to the result.
		return( toString( node ).listRest( ">" ) );

	}

	var buffer = node.xmlNodes.map(
		( childNode ) => {

			return( getNodeText( childNode ) );

		}
	);

	return( buffer.toList( "" ) );

}

... then, I went into the code and changed all (most) of the node.xmlText references to be getNodeText( node ) instead. I've already seen a few instances that have been fixed by this.

Ben Nadel Mar 19, 2020 at 6:02 AM

@All,

I also had to update my unescape-code method to include ampersands:

/**
* I unescape HTML that is embedded within a code-block.
*/
public string function unescapeCode( required string escapedCode ) {

	var code = escapedCode
		.replace( "&lt;", "<", "all" )
		.replace( "&gt;", ">", "all" )
		.replace( "&amp;", "&", "all" ) // ... added this.
		.replace( "&quot;", """", "all" )
	;

	return( code );

}

Ben Nadel Mar 22, 2020 at 8:12 AM

@All,

This morning, I found another issue with my sanitization approach. It seems that all my iframe elements were being rendered as self-closing tags. Example:

<iframe />

While this is valid XML, it is not valid HTML. And, the browser simply stops rendering the page when it hits this markup (I assume because it thinks that the rest of the content is a child of the iframe tag).

To get around this, I need to force the iframe tags in the HTML content to have at least one child node. And, the easiest way to do that is to insert an empty comment before parsing:

www.bennadel.com/blog/3790-avoiding-self-closing-iframe-tags-using-htmlparse-in-lucee-cfml-5-3-4-80.htm

This turns content like this:

<p>
	<iframe src="..."></iframe>
</p>

Into content like this:

<p>
	<iframe src="..."><!-- --></iframe>
</p>

... such that the resultant XML document returned from the htmlParse() function has one child-node (COMMENT_NODE) within it.
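One way this pre-processing step might look is sketched below. This is only an illustration under assumptions: the padEmptyIframes() name and the regular expression are hypothetical and are not taken from the linked post. It simply injects an empty comment into any iframe that has nothing (or only white-space) between its opening and closing tags:

```cfml
<cfscript>

	/**
	* I insert an empty comment into any IFRAME that has no content so that the parsed
	* XML node ends up with a child node.
	* 
	* NOTE: Hypothetical helper - the name and the pattern are assumptions made for
	* this sketch.
	*/
	public string function padEmptyIframes( required string htmlMarkup ) {

		return(
			htmlMarkup.reReplaceNoCase(
				"(<iframe\b[^>]*>)\s*(</iframe>)",
				"\1<!-- -->\2",
				"all"
			)
		);

	}

	// Usage: run the raw HTML through the helper before handing it to htmlParse().
	paddedMarkup = padEmptyIframes( '<p><iframe src="..."></iframe></p>' );

	echo( encodeForHtml( paddedMarkup ) );

</cfscript>
```

An iframe that already contains fallback content does not match the pattern and is left untouched.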

Ben Nadel Feb 8, 2022 at 1:21 PM

Two years later, I'm now starting to play with jSoup on my blog to do some content clean-up. In retrospect, this whole conversion to Markdown would likely have been easier with jSoup since it is intended to be an HTML DOM - unlike htmlParse() in Lucee, which is working with an XML DOM.

www.bennadel.com/blog/4201-using-jsoup-to-clean-up-and-normalize-html-in-coldfusion-2021.htm

Anyway, many ways to get it done! Learning as I go.
