Ben Nadel at NCDevCon 2016 (Raleigh, NC) with: Dan Skaggs ( @dskaggs )

Decoding The EncodeForJavaScript() Output In ColdFusion

By Ben Nadel

In ColdFusion, I often embed JSON payloads in a JavaScript context using the built-in encodeForJavaScript() function. This function makes sure to escape the given value such that a persisted cross-site scripting (XSS) attack cannot be perpetrated. On the JavaScript side, I then consume this encoded string value using JSON.parse()—note that JSON (JavaScript Object Notation) is the intermediary representation format.

<cfscript>

	// COLDFUSION context.
	data = {
		id: 1,
		name: "Kimmie Bo-Bimmie",
		contact: {
			type: "mobile",
			number: "212-555-1199"
		}
	};

</cfscript>
<script type="text/javascript">

	// ColdFusion data being embedded in JAVASCRIPT context.
	console.log(
		JSON.parse( "<cfoutput>#encodeForJavaScript( serializeJson( data ) )#</cfoutput>" )
	);

</script>

Heretofore, this has been a one-way data conversion. But, recently, I've been building an export feature at work; and, I've become curious to know if there's a way to parse the encodeForJavaScript() string back into a ColdFusion value.

The encodeForJavaScript() documentation doesn't provide much detail. And, the Lucee source code (ESAPI extension) seems to just hand off to the OWASP ESAPI encoder. But, the OWASP documentation doesn't seem to match the output. Based solely on trial-and-error, it seems that all non-alpha-numeric characters are encoded as hexadecimal using either 2-digit notation (\xHH) or 4-digit notation (\uHHHH).
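To see the two notations side by side, a minimal sketch along these lines can be dropped into a scratch .cfm file. The sample characters are arbitrary picks for illustration (they are not taken from the ESAPI documentation), and encodeForHtml() is only there so that the browser shows the raw backslash sequences.

```cfml
<cfscript>

	// Arbitrary sample characters (assumptions for illustration): a letter, a digit,
	// some punctuation, and a character above code point 255 (the Euro sign).
	samples = [ "a", "7", "<", "/", '"', chr( 8364 ) ];

	for ( sample in samples ) {

		// encodeForHtml() is only used so that the browser renders the raw output
		// rather than interpreting it as markup.
		writeOutput(
			encodeForHtml( sample ) & " => " &
			encodeForHtml( encodeForJavaScript( sample ) ) &
			"<br />"
		);

	}

</cfscript>
```

Going by the trial-and-error described above, the letter and the digit should pass through untouched while the remaining samples come back in either \xHH or \uHHHH form.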

And, after more trial-and-error, I was able to decode the encodeForJavaScript() output by creating a regular expression (RegEx) pattern that searches for both notations, extracts the HEX value, converts it to decimal value (ie, the code point), and then generates the character string for that code point.

When testing this, I wanted to make sure that I included code points beyond 65,535, which is the highest code point that ColdFusion's chr() function can handle. Beyond that, we move from the Basic Multilingual Plane (BMP) into the "supplementary characters" (such as emoji) that are a bit harder to consume in ColdFusion.
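As a rough illustration of that boundary, the sketch below builds one BMP character with chr() and one supplementary character through Java's Character.toChars(). The two code points (8364 and 128512) are just example values chosen for this sketch.

```cfml
<cfscript>

	// A BMP character (the Euro sign, U+20AC): chr() handles this directly.
	euro = chr( 8364 );

	// A supplementary character (U+1F600, code point 128512). Character.toChars()
	// returns a Java char[] that holds the surrogate pair for the code point.
	surrogates = createObject( "java", "java.lang.Character" )
		.toChars( javaCast( "int", 128512 ) )
	;
	emoji = arrayToList( surrogates, "" );

	writeDump( len( euro ) );
	writeDump( arrayLen( surrogates ) );
	writeDump( emoji );

</cfscript>
```

For any supplementary code point, Character.toChars() hands back two chars (a high surrogate and a low surrogate), which is why the resulting ColdFusion string is made up of more than one "character".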

Because the pattern matching in this is a bit more nuanced, I'm dropping down into the Java layer to use the Pattern and Matcher classes. Normally, I would use my JRegEx project for this; but, to keep things simple, I'm just inlining the necessary functionality.

The RegEx pattern that I'm using is case-insensitive ((?i)) and OR's together both hexadecimal encodings (I'm adding spaces to make this more readable):

(?i) \\x([0-9a-f]{2}) | \\u([0-9a-f]{4})

Notice that we have two capture groups. The first capture group captures the 2-digit hexadecimal encoding and the second capture group captures the 4-digit hexadecimal encoding. As we loop over all the matches, we're going to convert this encoding to decimal using inputBaseN(16) before we convert it to a character.
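The per-match conversion can be sketched in isolation like this. The two captured values are hypothetical stand-ins for what the capture groups would hold after matching \x3C and \u20AC.

```cfml
<cfscript>

	// Hypothetical captured values (assumptions for illustration).
	hexCapture = "3C";       // As captured from \x3C by the first group.
	unicodeCapture = "20AC"; // As captured from \u20AC by the second group.

	// inputBaseN() parses the hexadecimal string into a decimal code point.
	writeDump( inputBaseN( hexCapture, 16 ) );     // 60
	writeDump( inputBaseN( unicodeCapture, 16 ) ); // 8364

	// Both of these code points sit inside the BMP, so chr() is enough here.
	writeDump( chr( inputBaseN( hexCapture, 16 ) ) ); // <

</cfscript>
```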

Here's what I came up with. To build the test data, I'm looping from code point 0 up to code point 100,000 (well above the 65,535 chr() limit).

<cfscript>

	inputChars = [];

	// I'm going high enough up in the code-point value to make sure we move from the
	// Basic Multilingual Plane (BMP) range into the supplementary characters range (ie,
	// non-fixed width characters).
	for ( i = 0 ; i <= 100000 ; i++ ) {

		inputChars.append( chrFromCodePoint( i ) );

	}

	input = inputChars.toList( "" );
	encoded = encodeForJavaScript( input );
	decoded = decodeForJavaScript( encoded ); // My custom function for decoding.

	// Did we successfully decode the encoded JavaScript value.
	writeDump( input == decoded );

	// For debugging.
	// writeDump( input );
	// writeDump( encoded );
	// writeDump( decoded );

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	/**
	* I decode the given JavaScript-encoded value.
	*/
	public string function decodeForJavaScript( required string input ) {

		// When encoding for JavaScript, each of the special characters appears to be
		// encoded using hexadecimal format with either a 2-digit notation (\xHH) or
		// a 4-digit notation (\uHHHH). We can create a RegEx pattern that looks for both
		// encodings, capturing each in a different group.
		var decodedInput = jreReplaceAllQuoted(
			input,
			"(?i)\\x([0-9a-f]{2})|\\u([0-9a-f]{4})",
			( $0, $1, $2 ) => {

				var codePoint = $1.len()
					? $1 // Hex encoding.
					: $2 // Unicode encoding.
				;

				return chrFromCodePoint( inputBaseN( codePoint, 16 ) );

			}
		);

		return decodedInput;

	}


	/**
	* I replace all of the pattern matches in the given input with the result of the given
	* operator function. The replacements are quoted (ie, cannot consume back-references).
	*/
	public string function jreReplaceAllQuoted(
		required string input,
		required string pattern,
		required function operator
		) {

		var matcher = createObject( "java", "java.util.regex.Pattern" )
			.compile( pattern )
			.matcher( input )
		;
		var buffer = createObject( "java", "java.lang.StringBuffer" )
			.init()
		;

		while ( matcher.find() ) {

			var args = [ matcher.group() ];

			for ( var i = 1 ; i <= matcher.groupCount() ; i++ ) {

				// NOTE: If I try to combine the .group() call with the fallback (?:)
				// operator, it always results in an empty string. As such, I need to
				// break the reading of the value into its own line. I believe this is a
				// known bug in the Elvis operator implementation.
				var groupValue = matcher.group( javaCast( "int", i ) );
				args.append( groupValue ?: "" );

			}

			matcher.appendReplacement(
				buffer,
				matcher.quoteReplacement( operator( argumentCollection = args ) )
			);

		}

		matcher.appendTail( buffer );

		return buffer.toString();

	}


	/**
	* I return the String corresponding to the given codePoint. If the codePoint is
	* outside the Basic Multilingual Plane (BMP) range (ie, above 65535), then the
	* resultant string may contain multiple "characters".
	*/
	public string function chrFromCodePoint( required numeric codePoint ) {

		// The in-built chr() function can handle code-point values up to 65535 (these
		// are characters in the fixed-width 16-bit range, sometimes referred to as the
		// Basic Multilingual Plane (BMP) range). After 65535, we are dealing with
		// supplementary characters that require more than 16-bits. For that, we have to
		// drop down into the Java layer.
		if ( codePoint <= 65535 ) {

			return chr( codePoint );

		}

		// Since we are outside the Basic Multilingual Plane (BMP) range, the resulting
		// array should contain the surrogate pair (ie, multiple characters) required to
		// represent the supplementary Unicode value.
		var chars = createObject( "java", "java.lang.Character" )
			.toChars( val( codePoint ) )
		;

		return arrayToList( chars, "" );

	}

</cfscript>

Ultimately, the test consists of me calling encodeForJavaScript(); passing the encoded value to my custom function, decodeForJavaScript(); and then seeing if the inputs and the outputs match. And, they match!

Normally, I wouldn't use the encodeForJavaScript() output as a storage format - I could just use JSON directly. However, in my case, I'm wondering if I could parse this value out of a programmatically-generated JavaScript file. But, that's a topic for another post.


Reader Comments

24 Comments

This is great code to share. And a good lesson about > 65535 characters. (Didn't even know that was a thing!)

I also like how you only dip to Java if necessary. Likely, most "business data" probably won't contain the emojis (I guess depending on the audience), so a little less overhead there most of the time.

208 Comments

Daaaamn, Gina! This is some fancy code-dancing you're doing to decodeForJavascript. I like it! And I'm impressed!

15,688 Comments

@Will,

The moment you have a commenting system in a business app, people start using emoji 😜 that's just life now, it seems. But, in this particular case, it would only matter when trying to parse the data - as long as we're UTF-8 encoding all the things, emojis should "just work" seamlessly.

The hardest part for us (at work) emoji-wise is that we have a lot of old MySQL tables that sort of pre-date emoji usage. So, they all use utf8, which doesn't actually support emoji. We have to slowly migrate things over to utf8mb4; and, that's only if Product lets us do it (politics get in the way of so many things).

Eh, I'm way off topic now. Programming is fun.

15,688 Comments

@Chris,

Ha ha, I had to google for the meme. I guess that's an old Martin Lawrence reference from the 90s. Kicking it old-school today! 🤣

