Ben Nadel at CFCamp 2023 (Freising, Germany) with: Zac Spitzer

Maintaining White Space Using jSoup And ColdFusion


jSoup is a Java library for parsing and manipulating HTML strings. For the last few years, I've been using jSoup to clean up and normalize my blog posts. And now, I'm looking to use jSoup to help me transform and cache GitHub Gists. At the time of this writing, Gist code is rendered in an HTML <table> with cells that use white-space: pre as the means of controlling white space output. jSoup doesn't parse the CSS; so, it doesn't understand that it needs to maintain this white space when serializing the document back into HTML. If we want to keep this white space in the resultant document, we have to disable pretty printing.

ASIDE: jSoup will naturally maintain white space that is contained within a <pre> tag. However, that doesn't apply to elements using white-space: pre CSS properties.
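To make that distinction concrete, here is a minimal sketch that runs the same padded text through a real <pre> tag and through a <div> that only uses the CSS property. It assumes the jSoup JAR sits next to the template (as in the demo later in this post); the markup and variable names are made up for illustration. With pretty printing left at its default, I'd expect the <pre> content to keep its padding while the <div> content gets normalized:

```cfml
<cfscript>

	// ASSUMPTION: the jSoup JAR lives next to this template, same as the main demo.
	jsoup = createObject( "java", "org.jsoup.Jsoup", [ expandPath( "./jsoup-1.16.1.jar" ) ] );

	// Same padded text in both elements. The PRE tag is a real PRE; the DIV only
	// behaves like one in the browser (via CSS), which jSoup does not parse.
	input = "<pre>   padded   </pre><div style='white-space: pre'>   padded   </div>";

	doc = jsoup.parseBodyFragment( input );

	// Pretty printing is still enabled here (the default), so any difference between
	// the two elements comes from the tag name, not from the output settings.
	writeOutput( "<pre>" & encodeForHtml( doc.body().html() ) & "</pre>" );

</cfscript>
```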

The pretty print settings control how white space is handled when markup is generated by the .html() and .outerHtml() methods. These methods can be used to access parts of the jSoup Document Object Model (DOM); and, are used internally during the serialization process.

The pretty print settings are defined at the Document level and can be accessed at:

document.outputSettings()

This object provides a getter / setter for the pretty printing:

outputSettings.prettyPrint( [ boolean ] )
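As a rough sketch of the getter / setter duality: calling prettyPrint() with no arguments reads the current flag, and calling it with a boolean changes it. The JAR location and the variable names here are assumptions for illustration; how the boolean is rendered as a string depends on your CFML engine:

```cfml
<cfscript>

	// ASSUMPTION: the jSoup JAR lives next to this template, same as the main demo.
	jsoup = createObject( "java", "org.jsoup.Jsoup", [ expandPath( "./jsoup-1.16.1.jar" ) ] );

	doc = jsoup.parseBodyFragment( "<p>   Hello   </p>" );
	settings = doc.outputSettings();

	// Called with no arguments, prettyPrint() acts as the getter.
	writeOutput( "Before: " & settings.prettyPrint() & "<br />" );

	// Called with a boolean, prettyPrint() acts as the setter. It changes the
	// settings object that the document itself is holding onto.
	settings.prettyPrint( false );

	writeOutput( "After: " & settings.prettyPrint() & "<br />" );

</cfscript>
```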

In order to disable pretty printing and maintain the original white space, we have to invoke this method with (false) before we serialize our document. To see this in action, I'm going to parse a Paragraph tag that contains leading and trailing white space. Then, I'll serialize the resultant document: once with pretty printing and then once after pretty printing has been disabled:

<cfscript>

	// Note that our inner content is surrounded by leading / trailing spaces.
	input = "<p>     Some content with spaces     </p>";

	document = javaNew( "org.jsoup.Jsoup" )
		.parseBodyFragment( input )
	;

	// Let's update the document content (to demonstrate that we have reason to parse and
	// then re-serialize the content).
	document.selectFirst( "p" )
		.attr( "data-edited", "true" )
	;

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	// By default, pretty printing is enabled within the document. This means, when we go
	// to serialize the document as HTML, it will normalize all the text. Which means,
	// any "unnecessary" leading / trailing spaces will be trimmed.
	writeOutput( "<h2> Pretty Print Enabled </h2>" );
	renderDocumentAsPre( document );

	// When we disable pretty printing, jSoup will leave all the text nodes AS IS, even if
	// they aren't strictly necessary.
	document.outputSettings()
		.prettyPrint( false )
	;

	writeOutput( "<h2> Pretty Print Disabled </h2>" );
	renderDocumentAsPre( document );

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	/**
	* I render the given jSoup document as an escaped markup within PRE tags.
	*/
	public void function renderDocumentAsPre( required any document ) {

		writeOutput(
			"<pre>" &
				encodeForHtml( document.body().html() ) &
			"</pre>"
		);

	}


	/**
	* I create a new Java class wrapper using the jSoup JAR files.
	*/
	public any function javaNew( required string className ) {

		var jarPaths = [
			expandPath( "./jsoup-1.16.1.jar" )
		];

		return( createObject( "java", className, jarPaths ) );

	}

</cfscript>

Essentially, this ColdFusion code is taking the jSoup DOM and calling .html() on it in order to serialize the DOM back into an HTML string. It's doing this twice, once before and once after the pretty printing has been disabled. And, when we run this ColdFusion code, we get the following output:

As you can see, the first serialization of the jSoup DOM resulted in stripped-out white space. However, after we disabled pretty printing, the second serialization of the jSoup DOM leaves our white space intact.
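Bringing this back to the Gist use case that motivated the post, the same switch can be flipped before serializing a table whose cells depend on white-space: pre. This is only a sketch: the markup below is a hypothetical stand-in (not actual Gist HTML), and the JAR path and variable names are assumptions carried over from the demo above:

```cfml
<cfscript>

	// ASSUMPTION: the jSoup JAR lives next to this template, same as the main demo.
	jsoup = createObject( "java", "org.jsoup.Jsoup", [ expandPath( "./jsoup-1.16.1.jar" ) ] );

	// HYPOTHETICAL markup: a table cell that relies on CSS (not a PRE tag) in order
	// to keep its leading indentation.
	input = "<table><tr><td style='white-space: pre'>    indented = true;</td></tr></table>";

	doc = jsoup.parseBodyFragment( input );

	// Disable pretty printing BEFORE serializing so that the text nodes inside the
	// cell are written back out as they were parsed.
	doc.outputSettings()
		.prettyPrint( false )
	;

	writeOutput( "<pre>" & encodeForHtml( doc.body().html() ) & "</pre>" );

</cfscript>
```

Since the setting lives on the document, it only needs to be set once; every subsequent call to .html() on that document's elements will use it.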


