Comparing Java's MessageDigest To ColdFusion's hash() Function In Lucee CFML

By Ben Nadel

Published 2023-01-19 in ColdFusion — Comments (3)

Last week, I implemented a ColdFusion port of the CUID2 library. My version seems to work correctly; however, it has some performance problems when compared to the Java version. When I instrumented the ColdFusion component methods, nothing really jumped out at me. But, I have a hunch that I could make the SHA hashing more performant. Only, I don't have a great mental model for hashing. As such, I wanted to perform a small comparison of Java's MessageDigest class with ColdFusion's native hash() function for hashing a compound input.

As of ColdFusion 10, the hash() function can hash a binary value. And, before ColdFusion 10, we could still dip down into the Java layer to hash binary values with the MessageDigest class. However, I've historically only ever hashed a single value. And, with the CUID2 library, the hash is generated from a compound value that composes several sources of entropy. I'm curious to see if I can get better performance by generating the entropy as byte arrays, skipping any stringification of values, and hashing all the byte arrays together as a single composite value.

When using Java's MessageDigest class, hashing a compound input is seemingly straightforward since I can call .update(bytes) on the instance as many times as I want before completing the hash with a call to .digest(). But, as much as possible, I bias towards the native ColdFusion methods rather than dipping down into the Java layer. To that end, I want to see if passing concatenated byte arrays into the hash() function produces the same result as calling the .update() method several times on MessageDigest.

To test this, I'm going to create several binary values from different sources (text, secure-random, and image); and then, try to create a single SHA-256 hash from the compound input. The following ColdFusion code has three tests; but, the last two tests exercise the same hash() function - I'm simply building the byte[] (Byte Array) input differently.

<cfscript>

	// A collection of binary data read from different sources.
	parts = [
		// From string data.
		charsetDecode( "This is a string", "utf-8" ),
		// From secure entropy data.
		createObject( "java", "java.security.SecureRandom" )
			.init()
			.generateSeed( 100 )
		,
		// From image data.
		fileReadBinary( "./logo.png" )
	];

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	// TEST ONE: We're going to take the various binary values / byte arrays and hash them
	// all together using the MessageDigest. The nice thing about the MessageDigest class
	// is that you can call .update() multiple times to feed-in the inputs one at a time.
	messageDigest = createObject( "java", "java.security.MessageDigest" )
		.getInstance( "sha-256" )
	;

	for ( part in parts ) {

		messageDigest.update( part );

	}

	// The .digest() method completes the hashing algorithm and returns the bytes for the
	// hash calculation. Since the native hash() method returns hex, let's encode the
	// results as hex for comparison.
	hexEncoding = binaryEncode( messageDigest.digest(), "hex" );

	dump(
		var = hexEncoding,
		label = "Using MessageDigest"
	);

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	// TEST TWO: We're going to take the various binary values / byte arrays and hash them
	// all together using the ColdFusion native hash() function. Unlike MessageDigest,
	// there's no way to incrementally build the hash input. As such, we're going to have
	// to reduce the parts down into a single binary value by appending them all together.
	aggregatedBytes = parts.reduce(
		( reduction, part ) => {

			return( reduction.append( part, true ) );

		},
		[]
	);

	// And, once we have a single COLDFUSION ARRAY of bytes, we have to CAST it to a
	// BINARY value in order to get it to work with the hash() function.
	hexEncoding = hash( javaCast( "byte[]", aggregatedBytes ), "sha-256" );

	dump(
		var = hexEncoding,
		label = "Using hash()"
	);

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	// TEST THREE: This is the same as TEST TWO, only I want to use a ByteArrayOutputStream
	// to aggregate the inputs instead of using javaCast() with the ColdFusion array. This
	// doesn't really add much value; but, I am just trying to fill my head with options.
	buffer = createObject( "java", "java.io.ByteArrayOutputStream" )
		.init()
	;

	for ( part in parts ) {

		buffer.write( part );

	}

	hexEncoding = hash( buffer.toByteArray(), "sha-256" );

	dump(
		var = hexEncoding,
		label = "Using hash( byte buffer )"
	);

</cfscript>

As you can see, in the Java-oriented code, each binary part is passed to a separate .update() call. And, in the ColdFusion-oriented code, the binary parts are combined into a single binary value before being passed to the hash() function: either by reducing (ie, flattening) them into one ColdFusion array and casting it to byte[], or by writing them into a ByteArrayOutputStream. And, when we run this ColdFusion code, we get the following output:

Three different SHA-256 hash values all showing the same string.

All three SHA-256 hash generation approaches resulted in the same hex-encoded output. As such, I think we can conclude that calling .update() multiple times on the MessageDigest instance is functionally equivalent to concatenating multiple binary values and passing the composite to ColdFusion's hash() function.
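To sanity-check that conclusion outside of ColdFusion, here's a minimal Java sketch of the same comparison. The class and method names (CompoundHashDemo, hashIncrementally, hashConcatenated, toHex) are made up for illustration and aren't part of the code above; and, a few literal bytes stand in for the image data. It only uses standard JDK classes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

// Hypothetical demo class; the names here are illustrative only.
class CompoundHashDemo {

    // Feed each part into the digest with its own update() call.
    static byte[] hashIncrementally(byte[][] parts) throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        for (byte[] part : parts) {
            digest.update(part);
        }
        return digest.digest();
    }

    // Concatenate all parts into one byte array, then hash it in a single call.
    static byte[] hashConcatenated(byte[][] parts) throws NoSuchAlgorithmException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        for (byte[] part : parts) {
            buffer.write(part, 0, part.length);
        }
        return MessageDigest.getInstance("SHA-256").digest(buffer.toByteArray());
    }

    // Hex-encode the digest bytes so they can be compared as strings.
    static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            hex.append(String.format("%02X", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] randomPart = new byte[100];
        new SecureRandom().nextBytes(randomPart);

        byte[][] parts = {
            "This is a string".getBytes(StandardCharsets.UTF_8),
            randomPart,
            new byte[] { 1, 2, 3 } // Stand-in for the image bytes (assumption).
        };

        String incremental = toHex(hashIncrementally(parts));
        String concatenated = toHex(hashConcatenated(parts));

        System.out.println(incremental);
        System.out.println(concatenated);
        System.out.println("Match: " + incremental.equals(concatenated));
    }
}
```

One consequence worth keeping in mind: since the digest only ever sees the concatenated byte stream, the boundaries between the parts aren't part of the hash. Calling .update() with "ab" and then "c" yields the same digest as calling it with "a" and then "bc".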

Now that I can, in theory, generate a hash from multiple binary values in ColdFusion, I might be able to update my CUID2 library to deal directly with byte arrays, removing unnecessary encoding. Of course, it remains to be seen as to whether or not that actually makes it any faster.

Short link: https://bennadel.com/4393

Reader Comments

Brad Wood Jan 19, 2023 at 6:54 PM

I assume you know this, but Lucee's hash() BIF just uses the MessageDigest class under the hood.

https://github.com/lucee/Lucee/blob/5.3/core/src/main/java/lucee/runtime/functions/string/Hash.java#L90

Furthermore, there is no "stringification" of byte array inputs. All hashing accepts a byte array at the end of the day, so if you give Lucee a byte array, it uses it directly. If you give it a string, it calls input.getBytes() on it (which comes with no overhead).

Here's the relevant Lucee Java source code from the file above:

	if (input instanceof byte[]) data = (byte[]) input;
	else data = Caster.toString(input).getBytes(encoding);

The only issue I've had with hash() is it assumes you want the output (which again is always a byte array) hex encoded into a human-readable string. While hex encoding is most common, I've had places like S3's API which required digests/hashes to be base64 encoded instead.

Ben Nadel Jan 19, 2023 at 8:10 PM

@Brad,

I love having the Lucee CFML code on GitHub - it makes it so easy to look things up. Plus, I know that if I use the "Go To File" feature in GitHub, each Function has its own .java file.

When talking about "stringifying", I mean in the code that I have in my CUID library. The algorithm calls for hashing several different data-points together, like:

hash( "#a##b##c#" )

And, if I can produce some of those variables as byte-arrays, then I wanted to see if I could just pass a concatenated byte-array into hash() without converting the parts to strings and then concatenating them as string values.

Ben Nadel Jan 21, 2023 at 2:11 PM

@All,

As a follow-up exploration, I wanted to see if the order of the .update() calls on MessageDigest mattered:

www.bennadel.com/blog/4394-does-the-order-of-hash-inputs-matter-in-terms-of-uniqueness-and-distribution.htm

Since I don't know much about maths or security, I'm not looking at this from an engineering standpoint. Instead, I'm rendering hashes to a ColdFusion canvas and then making a visual judgement as to whether or not the order of the .update() appears to affect the hash characteristics.
