Using An AtomicLong With A Modulo Operation May Be Faster Than Using compareAndSet() For Thread-Safe Range-Based Counters

By Ben Nadel

Published 2018-03-27 in ColdFusion

Last week, I looked at using Java's AtomicInteger class in order to loop over a range of numbers in a thread-safe manner without having to add any additional locking to the code. The core of the logic revolved around executing a while(true) loop that repeatedly invoked a .compareAndSet() method until it returned successfully. After I posted that exploration, fellow InVisioneer, Ben Darfler, suggested that I test this approach against the modulus operator. I had originally thought about using the modulus operator; but, opted-out for fear of running out of INT-values. To this, Darfler reminded me that I could use an AtomicLong for an increased number space. So, let's have at it!

Rather than trying to re-use the AtomicRange.cfc component that I made the other week, I wanted to boil this speed test down to just the necessary ingredients. As such, these tests will be entirely self-contained within a single ColdFusion script page. Each test will attempt to generate a large set of numbers over a static range. And, will output the time (in milliseconds) that it took for the test to run. Since the general concept has already been proven, I'm not going to validate the number generation - only the speed with which the numbers get generated.

And, of course, all the usual caveats about speed tests apply - this is being run on my local development environment while I was doing other stuff. So, the goal here is to see general magnitudes - not to calculate exact comparisons.

That said, let's look at the while-loop approach that I used in last week's blog post:

<cfscript>

	counter = createObject( "java", "java.util.concurrent.atomic.AtomicInteger" ).init();

	// Define the parallel testing threads.
	testThreadCount = 11;
	testIncrementCount = 21733579;
	testRange = 33;
	totalCount = ( testThreadCount * testIncrementCount );

	// Start recording the duration.
	startedAt = getTickCount();

	for ( i = 0 ; i < testThreadCount ; i++ ) {

		thread
			name = "parallel-counter-#i#"
			action = "run"
			{

			for ( var r = 0 ; r < testIncrementCount ; r++ ) {

				do {

					var currentValue = counter.get();
					var nextValue = ( currentValue + 1 );

					// Ensure that the counter never goes outside the desired range.
					if ( nextValue >= testRange ) {

						nextValue = 0;

					}

					// Note: This value may not actually be the final value.
					thread.lastValue = currentValue;

				// Since we are explicitly setting the value of the counter, there is
				// going to be a race-condition. As such, we have to keep looping until
				// the next value can be set successfully, based on the current value.
				} while ( ! counter.compareAndSet( javaCast( "int", currentValue ), javaCast( "int", nextValue ) ) );

			}

		}

	}

	// Block / wait for the asynchronous threads to return.
	thread action = "join";

	// Calculate the duration of the test (ie, how long it took for all of the numbers
	// to be generated using the WHILE-True approach).
	duration = ( getTickCount() - startedAt );

	writeOutput( "#numberFormat( totalCount )# values generated in #numberFormat( duration )#ms. <br />" );

	// Output the last value generated in each thread (just as a sanity check).
	for ( threadName in cfthread ) {

		writeOutput( "Last value: #cfthread[ threadName ].lastValue# <br />" );

	}

</cfscript>

As you can see, the core logic of the test performs a .compareAndSet() operation from within a do-while loop. If I run this three times in a row, we get the following concatenated output:

239,069,369 values generated in 563,014ms.
Last value: 31
Last value: 24
Last value: 26
Last value: 22
Last value: 18
Last value: 2
Last value: 1
Last value: 10
Last value: 3
Last value: 25
Last value: 23

239,069,369 values generated in 573,330ms.
Last value: 19
Last value: 10
Last value: 1
Last value: 28
Last value: 23
Last value: 32
Last value: 18
Last value: 1
Last value: 25
Last value: 21
Last value: 29

239,069,369 values generated in 565,516ms.
Last value: 13
Last value: 14
Last value: 10
Last value: 30
Last value: 29
Last value: 24
Last value: 10
Last value: 0
Last value: 21
Last value: 12
Last value: 22

For 239-million values, over a range of [0,33) (ie, 33 not inclusive), the test took around 567-seconds.

Now, let's look at the test that uses an ever-increasing AtomicLong in conjunction with the modulus operator in order to iterate over the same range:

<cfscript>

	// Since we're going to be using the modulo operation in this test, I want to use
	// a LONG rather than an INT in order to make sure that I am unlikely to run out
	// of numbers in a production scenario.
	counter = createObject( "java", "java.util.concurrent.atomic.AtomicLong" ).init();

	// The module operator can only work on an INT in ColdFusion. As such, we'll need
	// to use the Long class in order to perform the modulo operation on our counter.
	LongClass = createObject( "java", "java.lang.Long" );

	// Define the parallel testing threads.
	testThreadCount = 11;
	testIncrementCount = 21733579;
	testRange = 33;
	totalCount = ( testThreadCount * testIncrementCount );

	// Start recording the duration.
	startedAt = getTickCount();

	for ( i = 0 ; i < testThreadCount ; i++ ) {

		thread
			name = "parallel-counter-#i#"
			action = "run"
			{

			for ( var r = 0 ; r < testIncrementCount ; r++ ) {

				var index = counter.getAndIncrement();

				// Since the counter, in this demo, is always increasing, we need to
				// calculate the modulus in order to overlay the counter on the range
				// of acceptable values.
				var remainder = LongClass.remainderUnsigned(
					javaCast( "long", index ),
					javaCast( "long", testRange )
				);

				thread.lastValue = remainder;

			}

		}

	}

	// Block / wait for the asynchronous threads to return.
	thread action = "join";

	// Calculate the duration of the test (ie, how long it took for all of the numbers
	// to be generated using the MODULO approach).
	duration = ( getTickCount() - startedAt );

	writeOutput( "#numberFormat( totalCount )# values generated in #numberFormat( duration )#ms. <br />" );

	// Output the last value generated in each thread (just as a sanity check).
	for ( threadName in cfthread ) {

		writeOutput( "Last value: #cfthread[ threadName ].lastValue# <br />" );

	}

</cfscript>

As you can see, this approach ditches in internal do-while loop, using only the AtomicLong's .getAndIncrement() method. The result of the ever-increasing AtomicLong counter is then wedged into the same [0,33) range using a modulus operation - the Long class' .remainderUnsigned() method. If I run this three times in a row, we get the following concatenated output:

239,069,369 values generated in 248,951ms.
Last value: 20
Last value: 30
Last value: 31
Last value: 10
Last value: 22
Last value: 17
Last value: 28
Last value: 8
Last value: 6
Last value: 10
Last value: 14

239,069,369 values generated in 217,446ms.
Last value: 9
Last value: 24
Last value: 2
Last value: 16
Last value: 10
Last value: 17
Last value: 27
Last value: 5
Last value: 4
Last value: 32
Last value: 9

239,069,369 values generated in 223,999ms.
Last value: 5
Last value: 2
Last value: 16
Last value: 10
Last value: 22
Last value: 31
Last value: 3
Last value: 23
Last value: 30
Last value: 28
Last value: 26

For 239-million values, over the same range, the test took around 230-seconds. Wow - that's kind of shocking! Even if this is just a fuzzy comparison, the do-while approach took twice as long as the modulo approach. And, I believe the modulo-based code is objectively more simple and easy to understand.

As someone who is basically naive to the workings of computers (despite the fact that I program on them daily), seeing division operate with such speed is very surprising. But, I am sure that those who are familiar with how computers actually work could probably explain that non-decimal math is blazing fast in modern computer architectures because of "science" and "math" reasons. And, to that, I say, awesome!

Normally, I don't care too much about speed comparisons. For the most part, I like to write code that's easy to understand and is "fast enough." However, in this case, I think the modulo code is both easier to understand and, at the same time, appears to be about twice as fast so the equivalent do-while loop. As such, I can't see much of a downside to using the modulo approach to iterate over a range of numbers in a thread-safe manner. And, the AtomicLong should provide a large enough number space to get you to the next deployment (at which time the numbers will all get reset).

Want to use code from this post? Check out the license.

Short link: https://bennadel.com/3427

Reader Comments

Oh my chickens, this post is old!

Hit me up on Twitter if you want to discuss it further.