
Extending A Distributed Lock TTL Using CFThread, Redis, And Lucee CFML 5.3.7.47


The other day, after posting about the idea of dynamically extending a distributed lock timeout in Lucee CFML, Jan Sedivy - our lead engineer on the Freehand whiteboarding product at InVision - mentioned that he does something similar in Golang; only, instead of having the synchronized task explicitly update the TTL (Time To Live) on the Redis key, he has an asynchronous Goroutine that updates the Redis key behind the scenes. This sounded like a really clever approach. So, I wanted to see if I could achieve the same thing using CFThread, Redis, and Lucee CFML 5.3.7.47.

The overall goal here is to be able to take a chunk of code, synchronize its execution across a set of horizontally-scaled ColdFusion pods, and then - most importantly - try to fail gracefully in the event that a pod gets terminated by the platform while running synchronized code. Meaning, we want to create a "failure mode" whereby an established distributed lock doesn't remain open in an invalid state (for too long).

In my previous post, I attempted to achieve this kind of failure mode by keeping the TTL (Time To Live) on the underlying Redis key short; and then, having the synchronized algorithm explicitly push the TTL out into the future during its own execution (see the rough sketch after the following list):

  1. Obtain distributed lock (with a short TTL).
  2. Perform some synchronized work.
    • Push out the TTL a small amount.
    • Do some more work.
    • Push out the TTL a small amount.
    • Do some more work.
    • Push out the TTL a small amount.
    • Do some more work.
  3. Release the distributed lock.
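As a rough sketch only (this is not the actual code from that post), the explicit approach looks something like the following, where obtainLock(), extendLockTtl(), doSomeWork(), and releaseLock() are all hypothetical helpers:

<cfscript>

	// ROUGH SKETCH of the previous, explicit approach. All of the functions used here
	// (obtainLock, extendLockTtl, doSomeWork, releaseLock) are HYPOTHETICAL helpers -
	// the point is only to show that the synchronized code itself has to remember to
	// keep pushing the TTL out into the future.
	lockKey = "synchronized-processing-lock";

	// Obtain the distributed lock with a short TTL (in seconds).
	obtainLock( lockKey, 60 );

	try {

		doSomeWork();
		extendLockTtl( lockKey, 60 ); // Push the TTL out a small amount.

		doSomeWork();
		extendLockTtl( lockKey, 60 ); // Push the TTL out again.

	} finally {

		releaseLock( lockKey );

	}

</cfscript>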

In today's post, we want to achieve the same result; only, we want to move the TTL management down into the distributed lock layer, leaving the synchronized code free of any lock management logic. To do this, I'm going to spawn an asynchronous CFThread tag that runs in parallel to the synchronized code and "touches" the TTL of the underlying Redis key every few seconds.

Before we dive into the lower-level implementation, let's look at what consuming this code might look like:

<cfscript>

	// Distributed locks can prevent different servers from stepping on each other's
	// processing workflows. However, in a horizontally-scaled system, pods can die at
	// any time (ex, the pod might crash, Kubernetes might need to suddenly schedule the
	// pod on a different node, or Amazon AWS might revoke a spot-instance). As such, an
	// "open lock" may be orphaned unexpectedly, leaving the lock OPEN for an
	// unnecessarily long period of time. For long-lived locks, this can pose a problem
	// because it leaves the system in an unmanaged state. To cope with this, we can
	// create a lock with shorter TTL and then use a BACKGROUND THREAD to start pushing
	// that TTL out into the future. This way, if the process dies unexpectedly, the
	// underlying Redis key will be expunged shortly thereafter.
	synchronizeAcrossNodes(
		"synchronized-processing-lock",
		() => {

			// Simulate some "work" inside the distributed lock.
			sleep( 20 * 1000 );

			echo( "Woot! I can haz success! Lock will be released!" );

		}
	);

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	/**
	* I automatically wrap a distributed lock around the given operator. Any value
	* returned from the operator is passed back to the calling context.
	* 
	* @lockName I am the name of the distributed lock.
	* @lockOperator I am the operator to synchronize.
	*/
	public any function synchronizeAcrossNodes(
		required string lockName,
		required function lockOperator
		) {

		// CAUTION: Throws an error if lock cannot be obtained.
		var distributedLock = new DistributedLock( application.redisPool, lockName )
			.get()
		;

		try {

			return( lockOperator() );

		} finally {

			distributedLock.release();

		}

	}

</cfscript>

As you can see, the operator being passed to the synchronizeAcrossNodes() function has no internal reference to the distributed lock. It just runs, performs its synchronized task duties, and then exits - all the lock management is being pushed down into the DistributedLock.cfc ColdFusion component.

And, if we run this ColdFusion code and look at the TTL of the underlying Redis key, here's what we see in the terminal:

[ Terminal recording: the TTL of the Redis key is being changed dynamically over time. ]

As you can see, in the 20-seconds that our synchronized code is executing, the TTL on the underlying Redis key is being periodically reset to about 60-seconds. This way, as the ColdFusion code continues to execute, the distributed lock will be held open. And, if the ColdFusion code (or underlying Docker container) were to suddenly crash, the TTL would stop getting updated and the distributed lock would naturally "expire" in less than a minute.
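If you want to observe this yourself, here's a minimal sketch (not part of the demo) that polls the TTL from the ColdFusion side. It assumes the same application.redisPool gateway from the code above, and that the pooled Redis resource exposes a ttl() method (the Jedis client does):

<cfscript>

	// MINIMAL SKETCH: poll the remaining TTL on the lock key once per second while the
	// synchronized code is running elsewhere. ASSUMES the application.redisPool gateway
	// from the demo and a Jedis-style ttl() method on the pooled Redis resource.
	for ( i = 1 ; i <= 30 ; i++ ) {

		remainingTtl = application.redisPool.withRedis(
			( redis ) => {

				return( redis.ttl( "synchronized-processing-lock" ) );

			}
		);

		systemOutput( "TTL: #remainingTtl# seconds", true );
		sleep( 1000 );

	}

</cfscript>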

To achieve this behind-the-scenes update of the TTL, I'm using a CFThread tag that is tied to a stored ThreadID. So, when it comes time to release the distributed lock, not only do we delete the underlying Redis key, we also clear the ThreadID, allowing the asynchronous CFThread tag to self-terminate.

I wanted to make sure the asynchronous thread self-terminates, as opposed to calling something like threadTerminate(), because I am worried that killing the thread may end-up leaving one of the Redis connections in an invalid state. To be honest, I don't know much about low-level Java threading; and, as much as possible, I want to avoid "forcing" any thread terminations that may have unintended side-effects.

In the following DistributedLock.cfc ColdFusion component, most of the relevant logic is right in the get() method - this is where the distributed lock is obtained and the CFThread tag is spawned:

component
	output = false
	hint = "I manage the getting and releasing of a distributed lock with a globally-unique name."
	{

	/**
	* I manage a distributed lock with the given name.
	* 
	* @redisPool I am the Redis gateway in which distributed locks are stored.
	* @name I am the name of the distributed lock.
	*/
	public void function init(
		required any redisPool,
		required string name
		) {

		// NOTE ON REDIS POOL USAGE: Since this lock may be wrapped around a long-running
		// process, we don't want to hold a Redis Resource open for the entire duration
		// of the lock as this may end-up exhausting the Redis pool (if the application
		// node has several open locks running concurrently). As such, we're going to get
		// a new Redis Resource from the pool every time we need to interact with the
		// shared key representation.
		variables.redisPool = arguments.redisPool;
		variables.name = arguments.name;

		// Instead of creating a distributed lock with an arbitrarily large TTL (time to
		// live), we're going to set a small TTL and then spawn a background thread that
		// incrementally pushes-out the TTL on the current lock key.
		variables.ttlThreadID = "";
		variables.ttlDeltaInSeconds = 60;

	}

	// ---
	// PUBLIC METHODS.
	// ---

	/**
	* I obtain the distributed lock; or, throw an error if the lock could not be
	* obtained.
	*/
	public any function get() {

		// Try to obtain the lock.
		// --
		// NOTE: In a "production" setting, we might have some sort of exponential back-
		// off that waits for the lock for some "timeout" period. However, in this demo,
		// to keep things simple, I'm either getting the lock OR FAILING fast.
		if ( ! redisSetNxEx( name, "DistributedLock", ttlDeltaInSeconds ) ) {

			throw(
				type = "LockFailure",
				message = "Failed to obtain distributed lock",
				detail = "Distributed lock [#name#] was already obtained by a competing process."
			);

		}

		// At this point, we've obtained the distributed lock with a small TTL. Now, we
		// have to spawn a background thread that will periodically push-out the TTL on
		// the key so that we can hold the lock open.
		ttlThreadID = "DistributedLockThread::#createUniqueId()#";

		// CAUTION: The CFThread tag is limited by the top-level request-timeout. If you
		// expect to hold the lock open for longer than the current request, you either
		// need to jack-up the request-timeout setting; or, you need to start invoking
		// this thread recursively to get around the timeout. For the sake of the demo,
		// I'm just keeping things as simple as possible.
		thread
			name = ttlThreadID
			thisThreadID = ttlThreadID
			action = "run"
			{

			// When the distributed lock is released, the ttlThreadID value will be
			// cleared. As such, this background thread will naturally "self terminate".
			while ( thisThreadID == ttlThreadID ) {

				// NOTE: Since we'll be attempting to update the lock several times
				// within the duration of the TTL, we can handle the occasional error
				// (such as the Redis pool being exhausted). As such, let's just log any
				// errors and then try again shortly.
				try {

					if ( ! redisExpireAt( name, getFutureExpireAt() ) ) {

						return;

					}

				} catch ( any error ) {

					logBackgroundError( error );

				}

				sleep( 5000 );

			}

		} // END: Thread.

		return( this );

	}


	/**
	* I release the current distributed lock.
	*/
	public void function release() {

		// By clearing the thread ID, the background task will naturally self-terminate
		// after it wakes up. I want to avoid calling threadTerminate() explicitly,
		// because I'm worried that it might leave a Redis connection in a strange state.
		// I'd rather give the CFThread tag time to clean-up its own resources, and then
		// quietly exit-out.
		ttlThreadID = "";
		redisDel( name );

	}

	// ---
	// PRIVATE METHODS.
	// ---

	/**
	* I get the next future expireAt value to be set by the background TTL thread.
	*/
	private numeric function getFutureExpireAt() {

		return( fix( getTickCount() / 1000 ) + ttlDeltaInSeconds );

	}


	/**
	* I log an error that was thrown within an asynchronous process.
	* 
	* @error I am the error being logged.
	*/
	private void function logBackgroundError( required struct error ) {

		systemOutput( error, true, true );

	}


	/**
	* I delete the given key in Redis. Returns true if a key was deleted or false if the
	* key didn't exist.
	* 
	* @key I am the key being deleted.
	*/
	private boolean function redisDel( required string key ) {

		var delResult = redisPool.withRedis(
			( redis ) => {

				return( redis.del( key ) );

			}
		);

		return( !! delResult );

	}


	/**
	* I update the TTL of the given Redis key using absolute epoch seconds. Returns true
	* if the expiration was set or false if the key did not exist.
	* 
	* @key I am the key being updated.
	* @expireAt I am the epoch seconds at which to expunge the key.
	*/
	private boolean function redisExpireAt(
		required string key,
		required numeric expireAt
		) {

		var expireAtResult = redisPool.withRedis(
			( redis ) => {

				return( redis.expireAt( key, expireAt ) );

			}
		);

		return( !! expireAtResult );

	}


	/**
	* I set the given Redis key with the given ttl; but, only if the key does not exist.
	* Returns true if the key was set or false if the key already existed.
	* 
	* @key I am the key being set.
	* @value I am the value being assigned to the key.
	* @ttlInSeconds I am the TTL in seconds to apply to the key.
	*/
	private boolean function redisSetNxEx(
		required string key,
		required string value,
		required numeric ttlInSeconds
		) {

		var setResult = redisPool.withRedis(
			( redis ) => {

				return( redis.set( key, value, "NX", "EX", ttlInSeconds ) );

			}
		);

		return( isNull( setResult ) ? false : true );

	}

}

As you can see, the body of the CFThread is a while() loop that runs while its own ID matches the ttlThreadID stored in the component. Then, when the distributed lock is released - and we clear the ttlThreadID - this while() condition no longer holds true and the CFThread peacefully exits.

With this configuration, most of the failure modes leave a Redis key in place for less than a minute. But, there are two edge-cases that I can think of that I haven't accounted for in the code:

  • If the asynchronous CFThread tag cannot connect to the Redis database for over a minute, the distributed lock key will expire even though the calling context thinks it still has a lock.

  • The execution of the CFThread tag is controlled by the overall request-timeout setting for the parent page. As such, if the lock is intended to be held open for longer than the parent page, the request-timeout setting would need to be adjusted by the calling context (a rough sketch of this follows below).
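For that second case, the calling context could bump the request timeout before obtaining the lock. A minimal sketch (the 300-second value here is arbitrary):

<cfscript>

	// SKETCH: give the parent request - and therefore the spawned CFThread tag - enough
	// runtime to hold the lock open. The 300-second value here is arbitrary.
	setting requestTimeout = 300;

	synchronizeAcrossNodes(
		"synchronized-processing-lock",
		() => {

			// ... long-running synchronized work ...

		}
	);

</cfscript>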

Distributed locks are fairly complicated (and brittle) beasts; and, should probably be avoided if at all possible. Even with the steps that I've taken, there are still failure modes that can leave the system in an invalid state. That said, I do think this approach, suggested by Jan, is quite clever! And, can definitely simplify the way I implement Distributed Locks in Lucee CFML 5.3.7.47.


Reader Comments


@Paolo,

I think most of this code should be Adobe ColdFusion compatible. I think the only thing you might have to change is the "fat arrow" syntax functions:

() => { ... }

to become:

function() { }

But, that might be all you have to change.
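For example (just a sketch), the call from the demo would become:

synchronizeAcrossNodes(
	"synchronized-processing-lock",
	function() {

		sleep( 20 * 1000 );

	}
);

The same change would apply to the closures passed to the withRedis() method inside the component.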


@Paolo,

Ahh, right. Perhaps you could replace the systemOutput() with a writeLog() and write it to a log file instead. You could also dip down into the Java layer and grab the system output object - it's been a while since I've done it, but I think you can do something like:

createObject( "java", "java.lang.System" ).out.println( your_string )
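And, if you go the writeLog() route, the logBackgroundError() method might become something like this (the "distributed-lock" log file name here is arbitrary):

// NOTE: the "distributed-lock" file name is arbitrary - use whatever log makes sense.
writeLog(
	type = "error",
	file = "distributed-lock",
	text = "Distributed lock TTL update failed: #error.message#"
);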