Ben Nadel at Scotch On The Rock (SOTR) 2010 (London) with: Justin Carter

Parsing Liquid Tag Embeds With jSoup And Lucee CFML 5.3.8.201

On a recent episode of Dev Discuss, Arit Amana talked about refactoring the way Liquid Tags are processed in the Forem platform. I had never heard of Liquid Tags before. Apparently, it's a syntax that some platforms use to enable dynamic content. One subset of this syntax allows users to embed external content within their own content. This piqued my curiosity since something like this might give me a way to allow readers of this blog to embed fun things within the comments. As such, I wanted to experiment with parsing and processing Liquid Tags in Lucee CFML 5.3.8.201.

To lay the groundwork for this exploration, consider the following HTML snippet, which I'm going to assume was generated from Markdown using Flexmark and ColdFusion:

<h1>
	Testing Liquid Tag Processing In ColdFusion
</h1>
<p>
	Hello there, this is a sample input with Liquid Tags for
	YouTube and Vimeo videos. Note that the URLs here are for
	the user-facing URLs, not the embed URLs.
</p>
<p>
	{% embed https://www.youtube.com/watch?v=S9z5aWAQQ34 %}
</p>
<p>
	{% embed https://vimeo.com/712352623 %}
</p>
<p>
	I love ColdFusion!
</p>

Notice that two of the paragraph elements (<p>) in this sample input contain text in the form of:

{% embed EXTERNAL_URL %}

This {% %} notation denotes the Liquid Tag template language. Now, I don't think that the embed directive is part of the original Liquid Tag specification - I think that's the thing that Arit was building specifically for Forem: instead of having separate directives for each type of external content, she was creating a simplified embed directive that allowed users to copy-paste whatever URL they were looking at.

In order to transpile the Liquid Tag embeds into the materialized embed markup for each type of external content, we're going to have to parse the HTML. And to parse HTML in ColdFusion, I've been loving jSoup. It provides a familiar, jQuery-inspired fluent DOM (Document Object Model) manipulation API. Using jSoup, we should be able to easily locate the Liquid Tag markup and replace it with alternate HTML. The challenging part will be the parsing and translating of the embed URLs.
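To give a feel for that jQuery-style API, here's a minimal sketch of selecting and mutating nodes with jSoup from CFML. The inline HTML and the "note" and "seen" class names are made up for illustration, and the JAR path is assumed to match the one used in the demo later in this post:

```cfml
<cfscript>

	// ASSUMPTION: the jSoup JAR lives at this path (same path as the demo below).
	jSoup = createObject(
		"java",
		"org.jsoup.Jsoup",
		[ expandPath( "./vendor/jsoup-1.15.1.jar" ) ]
	);

	// Hypothetical sample markup, parsed into a jSoup document.
	dom = jSoup.parseBodyFragment( "<p>First</p><p class='note'>Second</p>" );

	// CSS selectors work much like they do in jQuery.
	echo( dom.select( "p.note" ).text() & "<br />" ); // Second

	// Mutating calls return the matched elements, which is what makes the API fluent.
	dom.select( "p" ).addClass( "seen" );

	echo( dom.select( "p.seen" ).size() ); // 2

</cfscript>
```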

To keep my CFScript and my HTML markup separated, I created two ColdFusion templates: one for the YouTube embed and one for the Vimeo embed. These templates will be included into a CFScript context as part of a Function invocation. As such, they are expecting to read from and write to the local scope. This approach is a little funky; but, since this is just an exploration, I didn't feel the need to make this more elegant.

Here's the YouTube CFML template:

<!---
	CAUTION: This template is expected to be called as part of a Function execution. As
	such, it is going to use the local (var) scope to parameterize its inputs and to store
	its output.
--->

<!--- Define the dynamic inputs of this template. --->
<cfparam name="local.videoID" type="string" />

<!--- Store the rendered output of this template. --->
<cfsavecontent variable="local.embedHtml">
	<cfoutput>

		<div class="liquid-tag-embed" style="max-width: 500px ; margin: 20px 0px ;">
			<div style="padding: 56.25% 0px 0px 0px ; position: relative ;">
				<iframe
					src="https://www.youtube.com/embed/#encodeForUrl( videoID )#"
					title="YouTube video player"
					width="560"
					height="315"
					frameborder="0"
					allowfullscreen
					style="position: absolute ; top: 0px ; left: 0px ; width: 100% ; height: 100% ;"
				></iframe>
			</div>
		</div>

	</cfoutput>
</cfsavecontent>

And, here's the Vimeo CFML template:

<!---
	CAUTION: This template is expected to be called as part of a Function execution. As
	such, it is going to use the local (var) scope to parameterize its inputs and to store
	its output.
--->

<!--- Define the dynamic inputs of this template. --->
<cfparam name="local.videoID" type="string" />

<!--- Store the rendered output of this template. --->
<cfsavecontent variable="local.embedHtml">
	<cfoutput>

		<div class="liquid-tag-embed" style="max-width: 500px ; margin: 20px 0px ;">
			<div style="padding: 56.25% 0px 0px 0px ; position: relative ;">
				<iframe
					src="https://player.vimeo.com/video/#encodeForUrl( videoID )#"
					title="Vimeo video player"
					frameborder="0"
					allow="fullscreen; picture-in-picture"
					allowfullscreen
					style="position: absolute ; top: 0px ; left: 0px ; width: 100% ; height: 100% ;"
				></iframe>
			</div>
		</div>

	</cfoutput>
</cfsavecontent>

Again, since these are intended to be pulled into a CFScript context (via the CFInclude tag), they are reading from the local scope (local.videoID) and writing to the local scope (local.embedHtml). This resultant embedHtml is the HTML markup that we will replace into our final, transpiled document.

The algorithm for this transpilation process is going to follow these steps:

  • Parse the HTML into a jSoup DOM.
  • Gather all paragraph (<p>) nodes.
  • Filter results down to those nodes that contain Liquid Tag templates.
  • For each Liquid Tag template:
    • Extract and parse the embed URL.
    • Map the URL domain to the proper CFML template.
    • Replace the CFML template output into the jSoup DOM.

Unfortunately, there's no native way to parse a URL in ColdFusion. The Java layer - below the ColdFusion layer - has the java.net.URL Class. But, even this falls short since it doesn't parse the Query String (which we'll need to do in order to extract aspects of the embed URLs). As such, I created a ColdFusion component, DynamicUrl.cfc, that helps with this. I'll share that component at the end; but, just know that this is how I'm parsing the URLs.
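To illustrate the gap, here's a small sketch of what java.net.URL provides out of the box. The t=10s parameter is a hypothetical addition to the sample YouTube URL, included only so that there is more than one parameter to look at:

```cfml
<cfscript>

	// java.net.URL splits the URL into coarse components for us.
	uri = createObject( "java", "java.net.URL" )
		.init( "https://www.youtube.com/watch?v=S9z5aWAQQ34&t=10s" )
	;

	echo( uri.getHost() & "<br />" );  // www.youtube.com
	echo( uri.getPath() & "<br />" );  // /watch

	// But the query string comes back as one raw, still-encoded string. Splitting
	// it into individual key-value pairs is left up to the caller.
	echo( encodeForHtml( uri.getQuery() ) ); // v=S9z5aWAQQ34&t=10s

</cfscript>
```

Getting at the v parameter from here still means splitting on & and = and then decoding each piece, which is the part that DynamicUrl.cfc takes care of.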

With that said, here's the Liquid Tag parsing and transpilation algorithm that I came up with for ColdFusion:

<cfscript>

	jSoup = javaNew( "org.jsoup.Jsoup" );

	// Parse the sample input into a jSoup DOM (Document Object Model). This will allow us
	// to iterate over the content and look for fragments that look like Liquid Tag embed
	// directives. Then, using the jSoup API, we'll be able to replace those directives
	// with materialized embed markup.
	htmlContent = fileRead( "./inputs/sample.htm", "utf-8" );
	htmlDom = jSoup.parseBodyFragment( htmlContent );

	// NOTE: I'm assuming that the sample HTML was generated from MARKDOWN. As such, I'm
	// expecting each Liquid Tag to be defined on its own markdown line which will, in
	// turn, lead to them being defined within their own stand-alone Paragraph elements.
	liquidContainers = htmlDom
		.getElementsByTag( "p" )
		// Filter down to the paragraph nodes that contain the "{% embed URL %}" pattern.
		.filter(
			( node ) => {

				return( isLiquidTagText( node.text() ) );

			}
		)
		// For each {embed URL} pattern, replace the paragraph with a materialized embed.
		// --
		// NOTE: Paragraphs with unsupported embed URLs will just be removed.
		.each(
			( node ) => {

				var embedUrl = getLiquidTagUrl( node.text() );
				var embedContent = getLiquidTagEmbed( embedUrl );

				// Just because the paragraph contains a Liquid Tag format, it doesn't
				// mean that we support said format. Any embed directive that is not
				// supported by the platform will result in an empty response. The parent
				// element of this invalid embed will be removed.
				if ( ! embedContent.len() ) {

					node.remove();
					return;

				}

				// The jSoup API doesn't provide a way to replace a node with raw HTML. As
				// such, we have to parse the embed content into a jSoup node and then use
				// that to replace the Liquid Tag directive.
				var embedNode = jSoup
					.parseBodyFragment( embedContent )
					.body()
					.child( 0 )
				;

				node.replaceWith( embedNode );

			}
		)
	;

	// Render the sample input with the replaced-in Liquid Tag embeds.
	echo( htmlDom.toString() );

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	/**
	* I test the given input text to see if it matches a Liquid Tag directive format.
	*/
	public boolean function isLiquidTagText( required string text ) {

		return( !! text.trim().reFindNoCase( "^\{% embed https?://\S+ %\}$" ) );

	}


	/**
	* I extract the embed URL from the given Liquid Tag directive.
	* 
	* CAUTION: This assumes that the text has already been vetted to match the Liquid Tag
	* input pattern. As such, this method does no validation.
	*/
	public string function getLiquidTagUrl( required string text ) {

		return( text.trim().reMatchNoCase( "https?://\S+" ).first() );

	}


	/**
	* I get the materialized embed content for the given URL. Returns an empty string if
	* the given URL is not recognized / supported.
	*/
	public string function getLiquidTagEmbed( required string embedUrl ) {

		var uri = new DynamicUrl( embedUrl );

		switch ( uri.getHost() ) {
			case "vimeo.com":
				return( getLiquidTagEmbedForVimeo( uri ) );
			break;
			case "www.youtube.com":
				return( getLiquidTagEmbedForYouTube( uri ) );
			break;
			default:
				return( "" );
			break;
		}

	}


	/**
	* I get the materialized embed content for the given Vimeo URL. Returns an empty
	* string if the given URL is not recognized / supported.
	*/
	public string function getLiquidTagEmbedForVimeo( required any uri ) {

		var embedHtml = "";

		if ( uri.getScriptName().reFind( "^/\d+\b" ) ) {

			var videoID = uri.getScriptName().listFirst( "/" );

			include "./embeds/vimeo.cfm";

		}

		return( embedHtml );

	}


	/**
	* I get the materialized embed content for the given YouTube URL. Returns an empty
	* string if the given URL is not recognized / supported.
	*/
	public string function getLiquidTagEmbedForYouTube( required any uri ) {

		var embedHtml = "";

		if (
			uri.getScriptName().reFindNoCase( "^/watch\b" ) &&
			uri.getUrlParam( "v" ).len()
			) {

			var videoID = uri.getUrlParam( "v" );

			include "./embeds/youtube.cfm";

		}

		return( embedHtml );

	}


	/**
	* I create a new Java class wrapper using the jSoup JAR files.
	*/
	public any function javaNew( required string className ) {

		var jarPaths = [
			expandPath( "./vendor/jsoup-1.15.1.jar" )
		];

		return( createObject( "java", className, jarPaths ) );

	}

</cfscript>

One of the really cool things about jSoup is that you're working with a full Document Object Model. Which means, even when we're iterating over and manipulating individual nodes, we're still affecting the parent document. This is why the following low-level calls still affect our overall result:

  • node.remove()
  • node.replaceWith( embedNode )
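As a rough sketch of that behavior in isolation, the following snippet only ever calls methods on individual nodes, yet the document's body reflects the changes. The sample markup and the placeholder DIV text are hypothetical, and the JAR path is assumed to be the same one used in the demo above:

```cfml
<cfscript>

	// ASSUMPTION: the jSoup JAR lives at this path (same path as the demo above).
	jSoup = createObject(
		"java",
		"org.jsoup.Jsoup",
		[ expandPath( "./vendor/jsoup-1.15.1.jar" ) ]
	);

	dom = jSoup.parseBodyFragment( "<p>Replace me</p><p>Remove me</p>" );
	paragraphs = dom.getElementsByTag( "p" );

	// Swap the first paragraph for a new element (a stand-in for embed markup).
	paragraphs.first().replaceWith(
		dom.createElement( "div" ).text( "Embed goes here" )
	);

	// Detach the second paragraph from the tree.
	paragraphs.last().remove();

	// The body should now contain the DIV and neither of the original paragraphs.
	echo( encodeForHtml( dom.body().html() ) );

</cfscript>
```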

Which is why, when we run this ColdFusion code, we get the following browser output:

[Screenshot: HTML generated with Liquid Tag embeds and ColdFusion.]

As you can see, our two {% embed %} directives were successfully replaced with YouTube and Vimeo embed code!

Now, just because there is a way for users to embed external content within their content, it doesn't necessarily mean that I want to allow that feature on my blog. It's one more thing - one additional point of complexity - that I would have to maintain over time. I'm not entirely against it - I just need to think it through some more.

WORST CASE: The worst case fear here is that a malicious actor would post a comment with a link to something relevant to the article. Then, after the comment was approved, they would replace the embedded content with something hateful. And, since this replacement happens in content that I don't own, there's no way that I would even know it happened.

This definitely gives me a lot to think about. I'm always on the lookout for ways to improve the user experience (UX) on this blog; and, allowing for some more dynamic commenting is always something I'm curious to explore. If I were to go this route, the Liquid Tag syntax seems like it would be a clean approach.

And finally, I had mentioned the DynamicUrl.cfc ColdFusion component that I was using to parse the embed URLs. This uses java.net.URL under the hood; but then goes the extra mile to parse the query string as well:

component
	output = false
	hint = "I provide methods for parsing, updating, and serializing a dynamic URL."
	{

	/**
	* I initialize the dynamic URL with the given path and query-string.
	*/
	public void function init( string input ) {

		variables.protocol = "";
		variables.host = "";
		variables.scriptName = "";
		variables.urlParams = [:];
		variables.fragment = "";

		parseUrl( input );

	}

	// ---
	// PUBLIC METHODS.
	// ---

	/**
	* I add the given fragment to the dynamic URL.
	*/
	public any function addFragment( required string fragment ) {

		variables.fragment = arguments.fragment;
		return( this );

	}


	/**
	* I add the given URL parameter to the dynamic URL.
	*/
	public any function addUrlParam(
		required string name,
		required string value
		) {

		urlParams[ name ] = value;
		return( this );

	}


	/**
	* I add the given URL parameter to the dynamic URL if the parameter is not-empty.
	*/
	public any function addUrlParamIfPopulated(
		required string name,
		required string value
		) {

		if ( value.len() ) {

			urlParams[ name ] = value;

		}

		return( this );

	}


	/**
	* I add the given URL parameters struct to the dynamic URL.
	*/
	public any function addUrlParams( required struct newParams ) {

		urlParams.append( newParams );
		return( this );

	}


	/**
	* I remove the given URL parameter from the dynamic URL.
	*/
	public any function deleteUrlParam( required string name ) {

		urlParams.delete( name );
		return( this );

	}


	/**
	* I get the current host value.
	*/
	public string function getHost() {

		return( host );

	}


	/**
	* I get the current script name.
	*/
	public string function getScriptName() {

		return( scriptName );

	}


	/**
	* I get the URL param with the given name; or, fallback to the given default.
	*/
	public string function getUrlParam(
		required string name,
		string defaultValue = ""
		) {

		return( urlParams[ name ] ?: defaultValue );

	}


	/**
	* I parse the given URL string and use it to construct the dynamic URL components.
	*/
	public any function parseUrl( required string newUrl ) {

		var uri = createObject( "java", "java.net.URL" )
			.init( newUrl )
		;

		variables.protocol = uri.getProtocol();
		variables.host = uri.getHost();
		variables.scriptName = uri.getPath();
		variables.urlParams = parseQueryString( uri.getQuery() ?: "" );
		// NOTE: getRef() returns null when the URL has no fragment.
		variables.fragment = ( uri.getRef() ?: "" );

		return( this );

	}


	/**
	* I serialize the dynamic URL into a composite string.
	*/
	public string function toUrl() {

		var queryStringPairs = [];

		loop
			key = "local.key"
			value = "local.value"
			struct = urlParams
			{

			queryStringPairs.append( encodeForUrl( key ) & "=" & encodeForUrl( value ) );

		}

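		// NOTE: java.net.URL reports the protocol without the "://" separator, so it
		// has to be added back in when serializing.
		var baseUrl = ( protocol & "://" & host & scriptName );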

		if ( queryStringPairs.len() ) {

			baseUrl &= ( "?" & queryStringPairs.toList( "&" ) );

		}

		if ( fragment.len() ) {

			baseUrl &= ( "##" & fragment );

		}

		return( baseUrl );

	}

	// ---
	// PRIVATE METHODS.
	// ---

	/**
	* I parse the given query-string value into a collection of key-value pairs.
	*/
	private struct function parseQueryString( required string queryString ) {

		var params = [:];

		if ( ! queryString.len() ) {

			return( params );

		}

		for ( var pair in queryString.listToArray( "&" ) ) {

			if ( pair.find( "=" ) ) {

				var encodedKey = pair.listFirst( "=" );
				var encodedValue = pair.listRest( "=" );

			} else {

				var encodedKey = pair;
				var encodedValue = "";

			}

			var decodedKey = canonicalize( encodedKey, true, true );
			// CAUTION: We need to be more relaxed with the VALUE of the URL parameter
			// since it may contain nested encodings, especially if the parameter is
			// pointing to another full URL (that may also contain encoded values).
			var decodedValue = urlDecode( encodedValue );

			params[ decodedKey ] = decodedValue;

		}

		return( params );

	}

}
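For reference, here's a minimal usage sketch of the component's read-side API. It assumes that DynamicUrl.cfc is resolvable from the calling template's directory; the "list" parameter name and the example.com URL are placeholders that don't appear in the demo:

```cfml
<cfscript>

	// ASSUMPTION: DynamicUrl.cfc sits alongside this template.
	uri = new DynamicUrl( "https://www.youtube.com/watch?v=S9z5aWAQQ34" );

	echo( uri.getHost() & "<br />" );           // www.youtube.com
	echo( uri.getScriptName() & "<br />" );     // /watch
	echo( uri.getUrlParam( "v" ) & "<br />" );  // S9z5aWAQQ34

	// Missing parameters fall back to the given default (or the empty string).
	echo( uri.getUrlParam( "list", "none" ) & "<br />" ); // none

	// Parameter values are URL-decoded as part of the parsing.
	search = new DynamicUrl( "https://example.com/search?q=hello%20world" );

	echo( search.getUrlParam( "q" ) ); // hello world

</cfscript>
```

This is the same pattern that getLiquidTagEmbed() relies on: switch on getHost(), then inspect getScriptName() and getUrlParam() to pull out the video ID.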

CFML is life!

Want to use code from this post? Check out the license.
