Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.
Ben Nadel at InVision In Real Life (IRL) 2019 (Phoenix, AZ) with: Jeremiah Lee
Ben Nadel at InVision In Real Life (IRL) 2019 (Phoenix, AZ) with: Jeremiah Lee@JeremiahLee )

Using Flexmark 0.32.24 To Parse Markdown Content Into HTML Output In ColdFusion

By Ben Nadel on
Tags: ColdFusion

Yesterday, I took a look at using the OWASP (Open Web Application Security Project) AntiSamy project in ColdFusion in order to help sanitize and validate untrusted, user-provided HTML content. I'm particularly interested in AntiSamy because I want to enable markdown in my blog comments. However, since markdown allows arbitrary embedded HTML code, I needed to make sure that I would have a way to prevent XSS (Cross-Site Scripting) attacks. And now that I know AntiSamy can protect me, I want to look at how to actually enable markdown. For this, I'll be using the Flexmark Java library, which implements the CommonMark specification.

View this code in my Flexmark 0.32.24 With ColdFusion project on GitHub.

Flexmark appears to be a very extensible Markdown implementation. Out of the box, the API surface area is very simple. However, you can start to add parsing and rendering extensions that enable additional markdown "flavors" and a variety of formatting add-ons. For example, I want to allow for plain-text URLs to be automagically converted into HTML Anchor tags. The Flexmark core library doesn't support this. But, it does provide an "Autolinking" extension that does. The extensions are all provided as additional JAR files, making the entire Flexmark project very modular.

To load the Flexmark JAR files in our ColdFusion application, I'm going to use Mark Mandel's JavaLoader project - the gift that just keeps on giving to the ColdFusion community. For the purposes of this demo, I'm going to be loading the core Flexmark files plus the extension (and its dependencies) for the Autolinking feature. I'll be creating and caching the JavaLoader instances during the bootstrapping of our ColdFusion application file:

  • component
  • output = false
  • hint = "I provide the application settings and event handlers."
  • {
  •  
  • // Define the application.
  • this.name = hash( getCurrentTemplatePath() );
  • this.applicationTimeout = createTimeSpan( 0, 0, 10, 0 );
  • this.sessionManagement = false;
  •  
  • // Setup the application mappings.
  • this.directory = getDirectoryFromPath( getCurrentTemplatePath() );
  • this.mappings[ "/" ] = this.directory;
  • this.mappings[ "/flexmark" ] = ( this.directory & "vendor/flexmark-0.32.24/" );
  • this.mappings[ "/javaloader" ] = ( this.directory & "vendor/javaloader-1.2/javaloader/" );
  • this.mappings[ "/javaloaderfactory" ] = ( this.directory & "vendor/javaloaderfactory/" );
  •  
  • // ---
  • // PUBLIC METHODS.
  • // ---
  •  
  • /**
  • * I initialize the application.
  • *
  • * @output false
  • */
  • public boolean function onApplicationStart() {
  •  
  • // In order to prevent memory leaks, we're going to use the JavaLoaderFactory to
  • // instantiate our JavaLoader. This will keep the instance cached in the Server
  • // scope so that it doesn't have to continually re-create it as we test our
  • // application configuration.
  • application.javaLoaderFactory = new javaloaderfactory.JavaLoaderFactory();
  •  
  • // Create a JavaLoader that can access the Flexmark 0.32.24 JAR files.
  • // --
  • // NOTE: This list of JAR files contains the CORE Flexmark functionality plus
  • // the Autolink extension. Flexmark is configured such that each extension is
  • // packaged as a separate, optional set of JAR files.
  • application.flexmarkJavaLoader = application.javaLoaderFactory.getJavaLoader([
  • expandPath( "/flexmark/autolink-0.6.0.jar" ),
  • expandPath( "/flexmark/flexmark-0.32.24.jar" ),
  • expandPath( "/flexmark/flexmark-ext-autolink-0.32.24.jar" ),
  • expandPath( "/flexmark/flexmark-formatter-0.32.24.jar" ),
  • expandPath( "/flexmark/flexmark-util-0.32.24.jar" )
  • ]);
  •  
  • // Indicate that the application has been initialized successfully.
  • return( true );
  •  
  • }
  •  
  • }

As you can see, I'm just pointing the JavaLoader to the list of relevant JAR files. I downloaded these JAR files from the Flexmark Maven project page. Again, the Flexmark project is very modular; so, I only downloaded the JAR files needed to implement the core functionality plus the autolinking.

Once we have our JavaLoader instance created and cached, parsing markdown is fairly straightforward. The process involves a Parser, which takes the markdown content and generates an Abstract Syntax Tree (AST). That AST is then passed to a Renderer, which flattens the node tree down into an HTML string. Both the Parser and the Renderer are built using an Options map, which is how we enable features like the Autolinking extension.

To see this in action, I'm taking some buffered markdown content, parsing it, and outputting to the page (both in active and encoded format):

  • <!---
  • Setup our markdown content.
  • --
  • NOTE: Indentation is meaningful in Markdown. As such, the fact that our content is
  • indented by one tab inside of the CFSaveContent buffer is problematic. But, it makes
  • the demo code easier to read. As such, we'll be stripping out the extra tab after the
  • content buffer has been defined.
  • --->
  • <cfsavecontent variable="markdown">
  •  
  • # About Me
  •  
  • My name is Ben. I really love to work in the world of web development. Being able to
  • turn thoughts into user experiences is just ... well, it's magical. You should check
  • out my blog: **www.bennadel.com**.
  •  
  • I like to write about the following things:
  •  
  • * Angular
  • * Angular 2+
  • * AngularJS
  • * NodeJS
  • * ColdFusion
  • * SQL
  • * ReactJS
  • * CSS
  •  
  • After people read my stuff, they can often be heard to say &mdash; and I quote:
  •  
  • > Ben who?
  •  
  • Probably, they are super impressed with my use of white-space in code:
  •  
  • ```js
  • function convert( thing ) {
  •  
  • var mapped = thing.forEach(
  • ( item ) => {
  •  
  • return( item.name );
  •  
  • }
  • );
  •  
  • return( mapped );
  •  
  • }
  • ```
  •  
  • _**NOTE**: I am using [Prism.js](https://prismjs.com/ "Prism is very cool!") to add
  • syntax highlighting._
  •  
  • ## The Hard Truth
  •  
  • Though, it's also possible that people just come for the pictures of my dog:
  •  
  • [![Lucy the Goose](./goose-duck.jpg)](./goose-duck.jpg "Click to download Goose Duck.")
  •  
  • <style type="text/css">
  • img { width: 250px ; }
  • </style>
  •  
  • How <span style="text-transform: uppercase ;">freakin' cute!</span> is that goose?!
  • I am such a lucky father!
  •  
  • </cfsavecontent>
  •  
  • <!--- ------------------------------------------------------------------------------ --->
  • <!--- ------------------------------------------------------------------------------ --->
  •  
  • <cfscript>
  •  
  • // As per comment above, we need to strip off one tab from each line-start.
  • markdown = reReplace( markdown, "(?m)^\t", "", "all" );
  •  
  • // Create some of our Class definitions. We need this in order to access some static
  • // methods and properties.
  • AutolinkExtensionClass = application.flexmarkJavaLoader.create( "com.vladsch.flexmark.ext.autolink.AutolinkExtension" );
  • HtmlRendererClass = application.flexmarkJavaLoader.create( "com.vladsch.flexmark.html.HtmlRenderer" );
  • ParserClass = application.flexmarkJavaLoader.create( "com.vladsch.flexmark.parser.Parser" );
  •  
  • // Create our options instance - this dataset is used to configure both the parser
  • // and the renderer.
  • options = application.flexmarkJavaLoader.create( "com.vladsch.flexmark.util.options.MutableDataSet" ).init();
  •  
  • // Define the extensions we're going to use. In this case, the only extension I want
  • // to add is the Autolink extension, which automatically turns URLs into Anchor tags.
  • // --
  • // NOTE: If you want to add more extensions, you will need to download more JAR files
  • // and add them to the JavaLoader class paths.
  • options.set(
  • ParserClass.EXTENSIONS,
  • [
  • AutolinkExtensionClass.create()
  • ]
  • );
  •  
  • // Configure the Autolink extension. By default, this extension will create anchor
  • // tags for both WEB addresses and MAIL addresses. But, no one uses the "mailto:"
  • // link anymore -- totes ghetto. As such, I am going to configure the Autolink
  • // extension to ignore any "link" that looks like an email. This should result in
  • // only WEB addresses getting linked.
  • options.set(
  • AutolinkExtensionClass.IGNORE_LINKS,
  • javaCast( "string", "[^@:]+@[^@]+" )
  • );
  •  
  • // Create our parser and renderer - both using the options.
  • // --
  • // NOTE: In the demo, I'm re-creating these on every page request. However, in
  • // production I would probably cache both of these inside of some Abstraction
  • // (such as MarkdownParser.cfc) which would, in turn, get cached inside the
  • // application scope.
  • parser = ParserClass.builder( options ).build();
  • renderer = HtmlRendererClass.builder( options ).build();
  •  
  • // Parse the markdown into an AST (Abstract Syntax Tree) document node.
  • document = parser.parse( javaCast( "string", markdown ) );
  •  
  • // Render the AST (Abstract Syntax Tree) document into an HTML string.
  • html = renderer.render( document );
  •  
  • </cfscript>
  •  
  • <!doctype html>
  • <html lang="en">
  • <head>
  • <meta charset="utf-8" />
  • <title>
  • Using Flexmark 0.32.24 To Parse Markdown Into HTML in ColdFusion
  • </title>
  • </head>
  • <body>
  •  
  • <h1>
  • Using Flexmark 0.32.24 To Parse Markdown Into HTML in ColdFusion
  • </h1>
  •  
  • <h2>
  • Rendered Output:
  • </h2>
  •  
  • <hr />
  •  
  • <cfoutput>#html#</cfoutput>
  •  
  • <hr />
  •  
  • <h2>
  • Rendered Markup:
  • </h2>
  •  
  • <pre class="language-html"
  • ><code class="language-html"
  • ><cfoutput>#encodeForHtml( html )#</cfoutput></code></pre>
  •  
  • <!-- For our fenced code-block syntax highlighting. -->
  • <link rel="stylesheet" type="text/css" href="./vendor/prism-1.14.0/prism.css" />
  • <script type="text/javascript" src="./vendor/prism-1.14.0/prism.js"></script>
  •  
  • </body>
  • </html>

In this demo, I'm re-creating the Flexmark Parser and Renderer on each page load since it makes the development life-cycle very easy. In a real application, however, I'd be caching these objects inside some sort of abstraction (like a MarkdownService.cfc), which would itself be cached in some persistent scope.

That said, there's not a whole lot going on here. In the options object, I'm telling Flexmark to use the Autolinking extension. And, I'm also configuring the Autolinking extension. By default, the Autolinking extension will match WEB addresses (ex, www.*) and MAILTO addresses (ex. ben@bennadel.com). The "mailto:" protocol is so ghetto though (which I say mainly because it requires a native mail app to be configured). So, I am providing a Regular Expression pattern that will get the Autolinking extension to ignore links that looks like email addresses.

Now, if we run this page in the browser, we get the following output:


 
 
 

 
 Using Flexmark 0.32.24 to parse markdown content into HTML output using ColdFusion. 
 
 
 

As you can see, our markdown content was easily converted into HTML and rendered to the page. In this case, I'm using Prism.js to format the fenced code-blocks on the client-side. But, the rest of this is pure markdown to HTML support.

Once again, just a reminder that markdown will allow any embedded code. As such, markdown is not secure on its own. The markdown content has to be coming from a tursted source. Or, the rendered HTML has to be run through a sanitization process, like AntiSamy.

Now, I'm one step closer to being able to enable markdown in my blog comments. And, hopefully this helps anyone else who is interested in processing markdown content in ColdFusion.



Reader Comments

@All,

One thing I realized last night is that I was losing all the soft-line breaks in the rendered output of the parsed content. That's because, by default, markdown doesn't care about line-breaks inside a single content block. However, given the fact that I am just closing a Haiku content, line-breaks are kind of important.

To get Flexdown to turn soft-breaks into hard-breaks, you have to set some options on the Renderer:

options.set(
	HtmlRendererClass.SOFT_BREAK,
	javaCast( "string", "<br />#chr( 10 )#" )
);

This will use the <br />\n string wherever it encounters a soft-break inside a contiguous content block.

Reply to this Comment

** This was very helpful, thanks. **

I needed to import data (from Harvest) into a Coldfusion app, and your code and libraries worked and parsed perfectly.

The JavaLoader library was helpful too and I can definitely make use of that for other java libraries we use in other apps. So double win!

Thanks for saving us time once again!

Reply to this Comment

@Paolo,

My absolute pleasure. The JavaLoader is awesome. I feel like the entire ColdFusion community collectively owes Mark Mandel a huge THANK YOU for that one.

One tip that I didn't understand at first when I was starting to use the JavaLoader, create a new one for each library / feature that you want to use in your ColdFusion app. My mistake early on was trying to put as many Java libraries into a single instance of JavaLoader as possible. But, this causes all kinds of version conflicts and odd dependencies over time.

If you create a new instance of JavaLoader for each feature, the features and their Java libraries can evolve independently.

So, to be clear, you would create an instance of JavaLoader for Flexmark. But, you might later create a different JavaLoader for PDF generation; and maybe a different one later for Image Processing. Don't try to use your Flexmark one for anything else other than Flexmark work :D

Reply to this Comment

Hi Ben. Would you advise stripping out any HTML tags:


<cfset content = REReplaceNoCase(content,"<[^>]*>","","ALL")>

From, let's say, a textarea [ie comments field] form submission before adding content to the DB. Any MarkDown should be preserved, allowing your MarkDown library to work its magic, once the content has been fetched from the DB and displayed to the page.

And when you say ANTISAMY, is this what:

encodeForHtml()

Is for?

Reply to this Comment

OK. I have just read your ANTISAMY post.

I guess my HTML tag stripping regex is just a harsher approach to what ANTISAMY is trying to achieve?

To be honest, if you are allowing MarkDown, I don't really see why you would allow any HTML tags? If this is the case, then ANTISAMY would not be required? I can just strip all the HTML tags, using simple regex.

Reply to this Comment

@Charles,

So, the encodeForHtml() is just a Function that escapes characters that could be meaningful in an HTML context. It is intended to protect against malicious user-provided content (or, more generally, to prevent embedded HTML from acting like HTML). This is different from stripping HTML, which will remove the tags from the content, rather than escaping it.

What you do all depends on where the content is coming from; and, how much you can trust that data. For example, I use Markdown to author my posts. And, I use it in the comments. For the post-authoring, all content is coming from me. So, I trust it implicitly. As such, I don't escape anything and I don't run it through Antisammy.

For comments, on the user hand, which can be added by anyone, I run it through Antisammy with very strict rules so that very few HTML tags are allowed. However, I don't want to strip HTML tags, since that would kill all the formatting that was applied by the Markdown itself.

It's really two different concerns. Markdown is for formatting. Then, Antisammy is for security.

Not sure I am answering your questions :D

Reply to this Comment

Hi Ben. Thanks for the explanation.

RE: user comments

But, surely, it only gets converted to HTML after its gone through the MarkDown parser.

I presume I am storing the MarkDown version of the content in the Database and not the HTML version of the content?

Actually, I am going to set up this feature this afternoon, so I will find out how it all works. And then I will be kicking myself for being so dumb;)

Reply to this Comment

@Charles,

Correct, the HTML is generated from the Markdown. But, I actually store both versions. In my comment table, I think I have something like this (trying to remember off the top of my head):

  • .content_markdown
  • .content_formatted

Where the content_markdown is the raw markdown. And, the content_formatted is the converted and sanitized HTML. When I output the comments in the ColdFusion template, I consider content_formatted to be a trusted value; so, I don't have to run it through encodeForHtml() or anything like that.

Of course, I can only consider it "trusted" because I run it through Anti-sammy first, before I save it.

Reply to this Comment

Ben. Here is my mental model of MarkDown:

  1. User creates a comment, using MarkDown
  2. User submits comment
  3. Comment with MarkDown goes into DB
  4. DB retrieves comment with MarkDown
  5. MarkDown comment is converted into HTML, using MarkDown parser
  6. HTML comment is displayed to user

Now, this is where I get stuck, because if the user edits that comment and resubmits it, then it now contains HTML, rather than MarkDown, which is bad, unless I can convert it back to MarkDown before it goes back into the database. And I have a feeling this is not possible.

I know that you don't allow users to edit their comments, so this is OK, because the MarkDown version would always remain in the database. But, I want to allow users to edit their comments.

Reply to this Comment

OK. I understand. The MarkDown converts to HTML, which means I cannot strip out the HTML, using regex. This makes sense!

I need to use ANTISAMY for any untrusted HTML that a user might add. But, I think this is only necessary, if a user is allowed to edit their comment.

Theoretically, if I don't allow a user to edit a comment, the comment would only be added to the DB, once. I could just store the MarkDown version and parse it each time into HTML to display to the user.

If I use the 'non edit' methodology, I would be able to strip out any HTML tags, before the comment is added to the DB. I would then be left with just MarkDown!

Reply to this Comment

Horaaay! I managed to get everything to work. I followed the above post to the letter and it worked first time.

I am going to prevent users from editing the comment, which means I can just store the MarkDown in the DB and strip the HTML tags, using regex. It will only ever be submitted to the DB once.

I may add the ANTISAMY stuff later, if I decide to allow users to edit their comments. If I do this, I will have to remove the HTML stripper.

Thanks so much for your help! Awesome...

Reply to this Comment

OK. Now I am with you.

I didn't realise you could use HTML tags in MarkDown.

Things like:

<style type="text/css">
     img { width: 250px ; }
</style>

And:

<span style="text-transform: uppercase ;">freakin' cute!</span>

And:

<img src="markdownmonstericon.png" alt="Markdown Monster icon" style="float: left; margin-right: 10px;" />

This changes everything. So, I cannot strip all HTML tags. Damn!

My apologies. You were right all along...

Reply to this Comment

Having said all this, I decided that I don't think it's necessary to allow users to add HTML to MarkDown [MD].

All of the examples above can be rewritten in pure MD, so I decided to strip out all the HTML tags before inserting the MD into the DB. Further more, I am removing any content between certain tags like:

<InvalidTag>
<form>
<table>

I have warned users not to use HTML tags. If they ignore this warning, I don't think the content within a TABLE or FORM will display in a meaningful way.

I then parse the MD, each time it is displayed.

Users cannot edit their comments, which I feel is acceptable and seems to be a fairly standard practice.

Now, their maybe a performance hit, each time the MD is parsed, but I have placed all the MD Java classes, including parser & renderer in the request scope. In MURA CMS 6.1, the request scope for some reason is a persistent scope, like the application scope and can only be refreshed, using the app reload key. So, effectively, I am caching all the Flexmark stuff!

All is working like a dream now...

Reply to this Comment

@Charles,

Very cool stuff! Yeah, it's the ability to include HTML tags directly in the Markdown that makes it dangerous - theoretically allowing for persisted XSS attacks. Hence, the anti-sammy stuff. However, if you can prevent HTML tags from being added, then you should be protected. I think. Security is such a crazy game of cat-and-mouse. That's why I am using Anti-Sammy - so I don't have to "think" about it.

That's awesome that you got everything working! It's fun to be so excited about something, right?! Rock on with your bad self!

Reply to this Comment

Yes. I shall definitely add ANTISAMY, at some point. I think I need to do a bit more research on the policy rules, first. It looks quite in depth!

The main thing is that I now understand the process, thanks to your explanation.

It's good to feel excited about programming stuff! Keeps us moving forward!

Reply to this Comment

Post A Comment

You — Get Out Of My Dreams, Get Into My Comments
Live in the Now
Oops!
NEW: Some basic markdown formatting is now supported: bold, italic, blockquotes, lists, fenced code-blocks. Read more about markdown syntax »
Comment Etiquette: Please do not post spam. Please keep the comments on-topic. Please do not post unrelated questions or large chunks of code. And, above all, please be nice to each other - we're trying to have a good conversation here.