Using Checksums To Calculate Consistent User Bucket Assignment In ColdFusion

By Ben Nadel on April 4, 2016

Yesterday, I took a look at generating CRC-32 and Adler-32 checksums in ColdFusion. I've never used this kind of checksum before. But, I had been reading through the popular Ruby library - Rollout - when I came across a very interesting technique. The Rollout library was using the checksum of a user identifier in order to assign that user to a "percentage" bucket in a feature flag rollout. As someone who recently started using LaunchDarkly to experiment with feature flags in ColdFusion, this was a technique that I wanted to explore a little more closely.

When I talk about "buckets," I'm simply referring to a consistent subset of users within an application. When I talk about a given bucket, really, I'm talking about all the users that have been assigned to that bucket. Now, if you think about an application that has 100 buckets; and, the users within that application have been equally distributed into those 100 buckets; then, each of those 100 buckets should contain roughly 1% of the user base.

In a feature flag system, this mapping of percentages onto buckets becomes helpful when thinking about a phased rollout. With each bucket accounting for roughly 1% of the user base, rolling a feature out to one bucket would roll it out to 1% of the users. And, rolling a feature out to two buckets would roll it out to 2% of the users. And, so on, up to 100 buckets or, 100% of the user population.

If your application is already assigning users to buckets, then rolling a feature out to a set of buckets is easy. But, if your application doesn't have the concept of user buckets, then what we need is a way to take a user identifier (whether it be a UUID, an auto-incrementing ID, an email address, an IP address, or any other type of consistent identifier) and consistently calculate a bucket number. And, we need to be able to do this in a way that evenly distributes user identifiers across the set of all possible buckets.

As it turns out, a checksum is a consistent and well distributed way of translating user identifiers into buckets. Of course, the selection of the checksum algorithm will affect the distribution of buckets. But, the Ruby Rollout library uses CRC-32, so that's what I'll use in my demonstration.

First, let's take a look at how we can use the modulus operator to translate any user identifier into a bucket number. The checksum interface in Java returns Long values, which don't work very well with the modulus operator in ColdFusion. So, we'll defer to the BigInteger class to calculate the large-number remainder for us.

<cfscript>

	// The identifier can be any simple value as long as you can consistently provide
	// the same value as a means to identify the given user / client.
	identifiers = [
		1, // user ID.
		14, // user ID.
		10934, // user ID.
		83937531, // user ID.
		"ben@bennadel.com", // user Email.
		"ben+sarah@bennadel.com", // user Email.
		"ben+apr-5-2016@bennadel.com", // user Email.
		"127.0.0.1", // remote IP address.
		"192.168.1.1", // remote IP address.
		"108.98.22.100" // remote IP address.
	];

	// Output the bucket for each type of identifier.
	for ( id in identifiers ) {

		writeOutput( id & " ... " & getBucket( id ) & "<br />" );

	}


	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //


	/**
	* I return a bucket, 1-100, to which the given identifier is assigned. The bucket
	* value is consistently calculated by passing the identifier through the CRC32
	* checksum algorithm. The identifier can be any string.
	*
	* TODO: You could update this method to accept the bucket count as an argument.
	* Since it is only being used as a modulus operand internally, you could easily
	* refactor this to accept a dynamic bucket count.
	*
	* @identifier I am the identifier being assigned to a bucket.
	* @output false
	*/
	public numeric function getBucket( required string identifier ) {

		// The checksum algorithm interface returns a LONG value, which we cannot use
		// with the normal modulus operator. As such, we have to fallback to using the
		// BigInteger to perform the modulus operation.
		var BigInteger = createObject( "java", "java.math.BigInteger" );

		// Generate our BigInteger operands.
		var checksum = BigInteger.valueOf( javaCast( "long", getChecksum( identifier ) ) );
		var bucketCount = BigInteger.valueOf( javaCast( "int", 100 ) );

		return( checksum.mod( bucketCount ) + 1 );

	}


	/**
	* I compute the checksum for the given string, returning a Long value.
	*
	* @input I am the input being checked.
	* @output false
	*/
	public numeric function getChecksum( required string input ) {

		var checksum = createObject( "java", "java.util.zip.CRC32" ).init();

		checksum.update( charsetDecode( input, "utf-8" ) );

		return( checksum.getValue() );

	}

</cfscript>

As you can see, the user identifier can be any simple value, numeric or otherwise. We then use the CRC-32 checksum and the modulus operator to calculate the bucket. And, when we run the above code, we get the following output:

1 ... 84
14 ... 33
10934 ... 85
83937531 ... 69
ben@bennadel.com ... 25
ben+sarah@bennadel.com ... 32
ben+apr-5-2016@bennadel.com ... 46
127.0.0.1 ... 33
192.168.1.1 ... 64
108.98.22.100 ... 24

As you can see, each of the user identifiers was assigned to a bucket between 1 and 100. But, are these numbers evenly distributed? Meaning, with this approach, will one bucket truly account for roughly 1% of the user base? With a small sample set, it's hard to tell. So, I've created a text file with 100,000 unique IP addresses for more thorough testing. Let's read this file in, assign each IP address to a bucket, and then see how well the values are distributed:

<cfscript>

	// We're going to read in 100,000 sampled IP addresses and see what kind of
	// distribution we can get for those identifiers. Each index in this array will act
	// as a unique bucket of identifiers.
	buckets = [];

	// As we loop over the IP addresses, we're going to tally which bucket they fall
	// into. As such, let's start the count at zero for all buckets.
	arraySet( buckets, 1, 100, 0 );

	// As we iterate over the file (that contains the IP addresses), we want to keep
	// track of the file line count so that we can calculate a percentage of values that
	// get placed into each bucket.
	lineCount = fileEachLine(
		expandPath( "./ips.txt" ),
		function( required string ipAddress ) {

			buckets[ getBucket( ipAddress ) ] += 1;

		}
	);

	writeOutput( "Checking #numberFormat( lineCount )# IP addresses." );
	writeOutput( "<br /><br />" );

	// Output the percentage of IP addresses that were placed in each bucket.
	for ( i = 1 ; i <= arrayLen( buckets ) ; i++ ) {

		percentage = ( buckets[ i ] / lineCount * 100 );

		writeOutput( "#i# ... " & numberFormat( percentage, "0.00" ) );
		writeOutput( "<br />" );

	}


	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //


	/**
	* I iterate over each line in the given file and invoke the given callback. Returns
	* the number of lines that were processed.
	*
	* @filePath I am the filepath to open for read-access.
	* @callback I am the operator to invoke on each line.
	* @output false
	*/
	public numeric function fileEachLine(
		required string filePath,
		required function callback
		) {

		var lineCount = 0;
		var file = fileOpen( filePath, "read" );

		try {

			while ( ! fileIsEoF( file ) ) {

				callback( trim( fileReadLine( file ) ) );
				lineCount++;

			}

		} finally {

			fileClose( file );

		}

		return( lineCount );

	}


	/**
	* I return a bucket, 1-100, to which the given identifier is assigned. The bucket
	* value is consistently calculated by passing the identifier through the CRC32
	* checksum algorithm. The identifier can be any string.
	*
	* @identifier I am the identifier being assigned to a bucket.
	* @output false
	*/
	public numeric function getBucket( required string identifier ) {

		// The checksum algorithm interface returns a LONG value, which we cannot use
		// with the normal modulus operator. As such, we have to fallback to using the
		// BigInteger to perform the modulus operation.
		var BigInteger = createObject( "java", "java.math.BigInteger" );

		// Generate our BigInteger operands.
		var checksum = BigInteger.valueOf( javaCast( "long", getChecksum( identifier ) ) );
		var bucketCount = BigInteger.valueOf( javaCast( "int", 100 ) );

		return( checksum.mod( bucketCount ) + 1 );

	}


	/**
	* I compute the checksum for the given string, returning a Long value.
	*
	* @input I am the input being checked.
	* @output false
	*/
	public numeric function getChecksum( required string input ) {

		var checksum = createObject( "java", "java.util.zip.CRC32" ).init();

		checksum.update( charsetDecode( input, "utf-8" ) );

		return( checksum.getValue() );

	}

</cfscript>

Here, we're simply reading in each IP address, calculating the target bucket by way of the CRC-32 checksum, and then incrementing the counter for each bucket. And, when we run this code, we get the following output:

Checking 100,000 IP addresses.

1 ... 0.98
2 ... 1.00
3 ... 0.97
4 ... 0.98
5 ... 1.02
6 ... 1.05
7 ... 0.99
8 ... 0.98
9 ... 1.03
10 ... 1.00
11 ... 0.96
12 ... 0.98
13 ... 0.98
14 ... 1.02
15 ... 1.02
16 ... 1.04
17 ... 0.96
18 ... 1.02
19 ... 1.05
20 ... 1.02
21 ... 1.03
22 ... 0.97
23 ... 1.03
24 ... 1.08
25 ... 1.05
26 ... 1.01
27 ... 1.01
28 ... 1.01
29 ... 1.04
30 ... 1.00
31 ... 0.95
32 ... 1.02
33 ... 1.00
34 ... 1.01
35 ... 0.98
36 ... 0.98
37 ... 1.03
38 ... 0.99
39 ... 1.02
40 ... 0.96
41 ... 1.03
42 ... 1.02
43 ... 1.01
44 ... 1.03
45 ... 0.98
46 ... 1.01
47 ... 0.97
48 ... 1.07
49 ... 1.03
50 ... 0.98
51 ... 1.09
52 ... 1.00
53 ... 0.98
54 ... 1.04
55 ... 1.01
56 ... 0.98
57 ... 1.06
58 ... 1.00
59 ... 1.06
60 ... 0.94
61 ... 1.01
62 ... 0.99
63 ... 0.99
64 ... 0.96
65 ... 0.97
66 ... 0.99
67 ... 1.02
68 ... 1.00
69 ... 0.98
70 ... 0.95
71 ... 1.02
72 ... 0.96
73 ... 0.97
74 ... 0.98
75 ... 1.00
76 ... 1.00
77 ... 1.01
78 ... 0.99
79 ... 0.96
80 ... 1.02
81 ... 1.04
82 ... 0.98
83 ... 0.98
84 ... 0.98
85 ... 0.97
86 ... 1.00
87 ... 1.01
88 ... 1.01
89 ... 0.98
90 ... 0.97
91 ... 0.98
92 ... 0.96
93 ... 1.00
94 ... 1.02
95 ... 0.98
96 ... 1.02
97 ... 0.97
98 ... 1.01
99 ... 0.99
100 ... 0.96

As you can see, each of the 100 buckets truly accounts for roughly 1% of the unique values (IP addresses in this case). Some are a little more than 1% - some are a little less. But, overall, it's a sufficiently even distribution.

Again, if your application already assigns and persists a bucket for each user, then you can always use that to rollout features. But, if you don't have buckets - or, if your buckets don't align well with the desired granularity of your feature rollout - then, you can always use a checksum to consistently calculate a bucket of a desired size (in terms of user base percentage).

Want to use code from this post? Check out the license.

Short link: https://bennadel.com/3066

Reader Comments

Oh my chickens, this post is old!

Hit me up on Twitter if you want to discuss it further.