Ben Nadel at InVision In Real Life (IRL) 2019 (Phoenix, AZ) with: Lisa Tierney

Generating A Table Of Contents Using jSoup And ColdFusion

By Ben Nadel

I'm authoring my Feature Flags Book using Markdown. Then, I'm converting the Markdown into HTML using Flexmark and ColdFusion. And, once I have the raw HTML, I'm using jSoup to augment the DOM for output. As part of this, I'm dynamically injecting a Table of Contents (ToC). In the book, I'm only including the h2 headings; but, it got me thinking about how I might use jSoup and ColdFusion to create a more inclusive table of contents.

The primary issue here is the "impedance mismatch" between the structure of the HTML document and the structure of the table of contents. The HTML structure is relatively flat (maybe even completely flat), wherein all of the heading elements can be siblings. We think of these headings as being hierarchical. But, this is a "mental model", not a structural model.

The table of contents, on the other hand, is (often) a hierarchical model, wherein "nested headers" are rendered as nested lists. As such, in order to dynamically render the ToC, we have to translate the implicit hierarchy of headers into an explicit hierarchy of data structures.

This is an algorithm that we intuitively understand; but, it's not the easiest to describe. Given a header (H), we need to walk up the pending tree structure, starting at the most recently added header, until we find a parent header (P) whose level is numerically lower than the level of H (for example, an h2 when H is an h3). This means that we've located the direct parent of the given header; and, at that point, we can append the header (H) to the children of (P).
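To make the walk-up step concrete, here is a minimal sketch in plain Java (the language jSoup is written in) rather than CFML. It has no dependencies, and the names `WalkUpSketch`, `Node`, `findParent`, and `append` are hypothetical. They are illustrations, not part of jSoup or of the demo code in this post. The sketch assumes a root node with level 0, so that the climb always stops.

```java
import java.util.ArrayList;
import java.util.List;

final class WalkUpSketch {

    // Hypothetical node type: heading level, title, parent link, and child sections.
    static final class Node {
        final int level;
        final String title;
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(int level, String title) {
            this.level = level;
            this.title = title;
        }
    }

    // Start at the most recently added node and climb the parent chain until we reach
    // a node whose level is numerically lower than the new heading's level. The root
    // uses level 0, so the climb always terminates for h1 through h6.
    static Node findParent(Node lastAdded, int headingLevel) {
        Node candidate = lastAdded;
        while (headingLevel <= candidate.level) {
            candidate = candidate.parent;
        }
        return candidate;
    }

    // Attach a new heading under its semantic parent and return the new node.
    static Node append(Node lastAdded, int headingLevel, String title) {
        Node node = new Node(headingLevel, title);
        node.parent = findParent(lastAdded, headingLevel);
        node.parent.children.add(node);
        return node;
    }

    public static void main(String[] args) {
        Node root = new Node(0, "root");
        Node h1 = append(root, 1, "My Groovy Manifesto");
        Node h2a = append(h1, 2, "Chapter 1");
        Node h3 = append(h2a, 3, "Subsection 1-1");
        // The next h2 climbs past the h3 and the first h2, landing under the h1.
        Node h2b = append(h3, 2, "Chapter 2");
        System.out.println(h2b.parent.title); // My Groovy Manifesto
    }
}
```

An h4 that follows the h3 would stay put (no climbing), while a second h2 has to climb two steps before it finds the h1.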

To explore this, I created a flat HTML file that has a series of header elements from H1 all the way down to H6 (content abbreviated for the blog):

<h1>My Groovy Manifesto (h1)</h1>
<h2>Chapter 1 (h2)</h2>
<h3>Subsection 1-1 (h3)</h3>
<h4>Subsection 1-1-1 (h4)</h4>
<h5>Subsection 1-1-1-1 (h5)</h5>
<h6>Subsection 1-1-1-1-1 (h6)</h6>
<h4>Subsection 1-1-2 (h4)</h4>
<h3>Subsection 1-2 (h3)</h3>
<h3>Subsection 1-3 (h3)</h3>
<h2>Chapter 2 (h2)</h2>
<h3>Subsection 2-1 (h3)</h3>
<h4>Subsection 2-1-1 (h4)</h4>
<h4>Subsection 2-1-2 (h4)</h4>
<h2>Chapter 3 (h2)</h2>
<h3>Subsection 3-1 (h3)</h3>

As you can see, all of the headers are siblings of each other - the "hierarchy" is semantic, not structural. Generating a structural table of contents in ColdFusion (Lucee CFML) looks like this:

<cfscript>

	document = javaNew( "org.jsoup.Jsoup" )
		.parseBodyFragment( fileRead( "./content.htm" ) )
	;

	// The heading nodes in the HTML content are hierarchical from a semantic standpoint,
	// but are all siblings from a structural standpoint. As such, we need to translate
	// that FLAT structure into a TREE structure for our table of contents. Each section /
	// heading is going to contain a level and a set of sub-sections (children).
	toc = [
		level: 0,
		children: []
	];

	// In order to generate a hierarchical structure, we need to keep track of the
	// "parent" heading. This way, we'll know when we encounter a child of the previous
	// heading; or, if we have to traverse back up the "parent chain" to find an
	// appropriate location in a different heading.
	parent = toc;

	// I determine how deep the table of contents should go. Not every single header
	// necessarily adds value to the ToC (from a user experience standpoint).
	maxLevelInToc = 5;

	for ( node in document.select( "h1, h2, h3, h4, h5, h6" ) ) {

		current = [
			level: val( node.tagName().right( 1 ) ),
			title: node.text(),
			children: [],
			// NOTE: By default, we're going to assume that the current heading node is a
			// subsection of the parent heading node. We'll validate this below.
			parent: parent
		];

		if ( current.level > maxLevelInToc ) {

			continue;

		}

		// The current/parent assumption above is ONLY CORRECT if the current level is
		// greater than the parent level (ex, h3 vs h2). However, if the current level is
		// smaller than or equal to the parent level, we have to travel up the TREE until
		// we find the appropriate parent (ex, if current node is h2 and parent is h2, we
		// have to travel up the parent-chain until we find the h1 that will contain the
		// current h2).
		while ( current.level <= current.parent.level ) {

			current.parent = current.parent.parent;

		}

		// Now that we've identified the correct parent/child relationship for our current
		// node, we can add it to the proper children collection and then track the
		// current node as the parent for subsequent headings. This will create a bi-
		// directional tree structure.
		current.parent.children.append( current );
		parent = current;

	} // END: For-loop.

	// At this point, we've aggregated all of our document headings. Render them as a
	// series of nested lists, starting with our root TOC container.
	renderSection( toc.children );

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	/**
	* I render the given table-of-content (ToC) sections. This function calls itself
	* recursively while there are children to render.
	*/
	public void function renderSection( required array sections ) {

		if ( ! sections.len() ) {

			return;

		}

		```
		<cfoutput>
			<ul>
				<cfloop item="local.section" array="#sections#">
					<li>
						#encodeForHtml( section.title )#
						#renderSection( section.children )#
					</li>
				</cfloop>
			</ul>
		</cfoutput>
		```

	}


	/**
	* I create a new Java class wrapper using the jSoup JAR files.
	*/
	public any function javaNew( required string className ) {

		var jarPaths = [
			expandPath( "./jsoup-1.16.1.jar" )
		];

		return( createObject( "java", className, jarPaths ) );

	}

</cfscript>

As you can see, our tree structure is bidirectional. As we iterate over the header elements, we build a connection from each parent heading to its subheadings as well as a connection from each subheading back to its parent. This bidirectionality allows us to walk back up the ToC structure when we need to find the appropriate semantic parent.
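If the parent references aren't needed once the tree is built, the same result can be reached without them. The sketch below is an alternative technique, not the approach used in the code above: it keeps the chain of "open" ancestors on a stack and pops until the top of the stack is a shallower heading. It is written in plain Java with no dependencies; `StackTocSketch`, `Section`, and `build` are hypothetical names, and the parallel `levels` / `titles` arrays stand in for whatever would be read from the DOM.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class StackTocSketch {

    // Hypothetical section type: note that there is no parent link.
    static final class Section {
        final int level;
        final String title;
        final List<Section> children = new ArrayList<>();

        Section(int level, String title) {
            this.level = level;
            this.title = title;
        }
    }

    // Build the tree from a flat, document-ordered list of heading levels and titles.
    static Section build(int[] levels, String[] titles) {
        Section root = new Section(0, "root");
        // The stack holds the open ancestor chain; the top is the current parent candidate.
        Deque<Section> open = new ArrayDeque<>();
        open.push(root);

        for (int i = 0; i < levels.length; i++) {
            Section section = new Section(levels[i], titles[i]);
            // Pop until the top of the stack is shallower than the new heading. The
            // root (level 0) is never popped because heading levels start at 1.
            while (section.level <= open.peek().level) {
                open.pop();
            }
            open.peek().children.add(section);
            open.push(section);
        }
        return root;
    }

    public static void main(String[] args) {
        Section root = build(
            new int[] {1, 2, 3, 2},
            new String[] {"My Groovy Manifesto", "Chapter 1", "Subsection 1-1", "Chapter 2"}
        );
        Section h1 = root.children.get(0);
        System.out.println(h1.children.size()); // 2
    }
}
```

The trade-off is that the stack only exists while the tree is being built, so the finished structure has no cycles, but it also can't be walked upward afterwards.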

Once we have the nested data structure, we can then render it as a series of nested lists:

[Image: A table of contents rendered as a series of nested unordered lists.]
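The recursive rendering can also be written to return a string instead of writing straight to the output, which makes the nesting easier to inspect. This is a plain Java sketch with hypothetical names (`RenderTocSketch`, `Section`, `render`, `escapeHtml`). The hand-rolled `escapeHtml` is an assumption made to keep the example dependency-free; it only covers text content and is not a substitute for a vetted encoder such as the `encodeForHtml()` call used above.

```java
import java.util.ArrayList;
import java.util.List;

final class RenderTocSketch {

    // Hypothetical section type: a title plus nested child sections.
    static final class Section {
        final String title;
        final List<Section> children = new ArrayList<>();

        Section(String title) {
            this.title = title;
        }
    }

    // Minimal escaping for text content only (illustrative, not a full encoder).
    static String escapeHtml(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Render the sections as nested <ul> lists. An empty list renders nothing, which
    // is what stops the recursion at the leaf headings.
    static String render(List<Section> sections) {
        if (sections.isEmpty()) {
            return "";
        }
        StringBuilder html = new StringBuilder("<ul>");
        for (Section section : sections) {
            html.append("<li>")
                .append(escapeHtml(section.title))
                .append(render(section.children))
                .append("</li>");
        }
        return html.append("</ul>").toString();
    }

    public static void main(String[] args) {
        Section chapter = new Section("Chapter 1");
        chapter.children.add(new Section("Subsection 1-1"));
        List<Section> toc = new ArrayList<>();
        toc.add(chapter);
        System.out.println(render(toc));
        // <ul><li>Chapter 1<ul><li>Subsection 1-1</li></ul></li></ul>
    }
}
```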

jSoup is such a wonderful tool. I was rather slow to adopt it (it's been around for years). But, now that I have it as part of my ColdFusion tool-belt, I'm always finding more ways to leverage it.

Want to use code from this post? Check out the license.


Ben Nadel