Enforcing Unique Naming Constraints Using A SERIALIZABLE Transaction In Lucee CFML 5.3.7.47

By Ben Nadel

Published 2021-09-22 in ColdFusion, SQL — Comments (2)

The other day, I had to write a SQL script to clean up some "dirty data" in our database that violated a unique-naming constraint imposed by the business logic. When I looked at the ColdFusion code that created the dirty data, I noticed that it wasn't using a transaction tag. Now, as much as I love databases and writing SQL, I'm not super confident when it comes to certain aspects of transactions - namely, isolation levels. As such, I wanted to experiment a little with using transaction isolation in order to enforce unique naming constraints in MySQL 5.7.32 and Lucee CFML 5.3.7.47

Why Not Use a `UNIQUE KEY` to Enforce Unique Naming Constraints?

Before we look at transaction isolation, let's address the elephant in the room: if our application needs to enforce unique naming constraints, why not just apply a unique index to the column (or set of columns) in question? And to be fair, that's a totally legitimate solution - a UNIQUE KEY on the column(s) would enforce unique values in the data-table. In fact, at InVision, we have a UNIQUE KEY on the email that users create as part of their login credentials.

But, that approach isn't always possible - at least not with MySQL InnoDB. For example, what if the application's business logic only applies the uniqueness constraint to a subset of records? For example, imagine that the entities within the application can be archived using an isArchived boolean column; and, imagine that the uniqueness constraint only applies to the active records (ie, those where isArchived=0). In MySQL - as far as I know - there's no way to define an index that only affects conditional rows within a table.

Another reason that a UNIQUE KEY may not be desirable is simply when the only value of the index is the uniqueness constraint. Indexes have a cost both in terms of storage and write-performance. That cost is generally worthwhile because of the massive performance gains that a database index can provide during a table scan. But, if the "unique" column isn't being searched (ie, we're never looking up rows based on the "unique" value), then we're not leveraging the main benefit of the index. As such, we'd be incurring all of the cost associated with the index maintenance but reaping little of the benefit. In such cases, the overall cost of the uniqueness constraint may make more sense to be pushed up into the application layer.

ASIDE: The UNIQUE KEY on the email column that I mentioned above is a win-win because it both enforces uniqueness of the login credentials and is used to quickly locate a user's record by email. As such, the cost of the index is well worth the benefit.

One final reason that a UNIQUE KEY on a column may not be desirable is if the column is too wide for an index. To be honest, I don't understand this aspect of index very well at all; but, from what I can see in the MySQL documentation, some VARCHAR indexes can only contain up to 255 characters:

The index key prefix length limit is 767 bytes for InnoDB tables that use the REDUNDANT or COMPACT row format. For example, you might hit this limit with a column prefix index of more than 255 characters on a TEXT or VARCHAR column, assuming a utf8mb3 character set and the maximum of 3 bytes for each character.

So - at least in some cases - if the column in question is more than 255 characters, a UNIQUE KEY index may not be sufficient for enforcing unique values. Though, it does appear that there are ways around that; but, again, my understanding here is very limited.

Why Not Use a Distributed Lock to Enforce Unique Naming Constraints?

If we're going to rule-out a UNIQUE KEY index for the purposes of our discussion, the next option might be to use a distributed lock to synchronize access to the code that mutates the database records. This would totally work; and, in fact, is an approach that I've used in the past. But, distributed locks add complexity and they add another point-of-failure. So, while I used to lean on locks a lot more often, for the last few years I've been trying to emphasize idempotent workflows without distributed locking.

NOTE: A distributed lock is only needed when you're operating more than one application server. If you only have a single ColdFusion server operating on the database, a simple CFLock tag would be sufficient for synchronization.

Enforcing Unique Naming Constraints Using SERIALIZABLE Transactions

Now that we've addressed our elephants, let's consider unique naming constraints in the application layer within our business logic. To set the premise for this exploration, imagine that we have a widget table with the following schema:

CREATE TABLE `widget` (
	`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
	`userID` int(10) unsigned NOT NULL,
	`name` varchar(100) NOT NULL,
	`createdAt` datetime NOT NULL,
	`updatedAt` datetime NOT NULL,
	PRIMARY KEY (`id`),
	KEY `IX_byUser` (`userID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

In this table, the userID is the "owner" of the widget. And, in order to quickly locate all widgets owned by a particular user, we have an index on the userID column. The name column is the column in which we want to enforce unique values; however, we only want to enforce unique values under a given user. So, essentially, we want the (userID,name) tuple to be unique across the entire table.

To this end, when a user goes to create a new widget, we want our business logic to look something like this:

Check to see if there's an existing widget record with the given name (owned by the given user).
If so, throw an error.
If not, allow the new widget record to be created.

To help separate out the interesting aspects from the mundane aspects within this exploration, I've created a set of CRUD (Create, Read, Update, Delete) functions for interacting with the widget table:

<cfscript>

	/**
	* I create a new widget record for the given user.
	*/
	public numeric function createWidget(
		required numeric userID,
		required string name
		) {

		```
		<cfquery result="local.insertResult" datasource="testing">
			INSERT INTO
				widget
			SET
				userID = <cfqueryparam value="#userID#" sqltype="integer" />,
				name = <cfqueryparam value="#name#" sqltype="varchar" />,
				createdAt = UTC_TIMESTAMP(),
				updatedAt = UTC_TIMESTAMP()
		</cfquery>
		```

		return( insertResult.generatedKey );

	}


	/**
	* I get all the widgets for the given user.
	*/
	public query function getWidgetsByUserID( required numeric userID ) {

		```
		<cfquery name="local.results" datasource="testing">
			SELECT
				w.id,
				w.userID,
				w.name
			FROM
				widget w
			WHERE
				w.userID = <cfqueryparam value="#userID#" sqltype="integer" />
		</cfquery>
		```

		return( results );

	}


	/**
	* I determine if a widget with the given name already exists under the given user.
	*/
	public boolean function isWidgetWithNameExists(
		required numeric userID,
		required string name
		) {

		```
		<cfquery name="local.results" datasource="testing">
			SELECT
				1
			FROM
				widget w
			WHERE
				w.userID = <cfqueryparam value="#userID#" sqltype="integer" />
			AND
				w.name = <cfqueryparam value="#name#" sqltype="varchar" />
		</cfquery>
		```

		return( !! results.recordCount );

	}

</cfscript>

We'll use these CRUD methods to create our test. What we want to try and do is create a race-condition in which two different asynchronous threads attempt to create a new widget with a given name. This would serve to simulate the notorious "double submit" that can occur when a user double clicks on a <form>'s submit button (and there is no client-side code to catch-and-dismiss such an action).

In ColdFusion, this is quite easy to do with the CFThread tag. In the following test, we're going to generate a random userID and a unique name and then try to run the same exact insert workflow in two different asynchronous threads:

<cfscript>

	include "./db-functions.cfm";

	// We're going to have two CFThread tags race to insert this record.
	userID = randRange( 1111, 9999 );
	widgetName = "widget-#createUniqueId()#";

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	thread
		name = "competingThreadOne"
		userID = userID
		widgetName = widgetName
		{

		transaction isolation = "serializable" {

			// MySQL is going to LOCK the index range scanned during this SERIALIZABLE
			// transaction, which in our case - with an index on `userID` - is going to
			// be all the rows that are owned by the given user.
			if ( isWidgetWithNameExists( userID, widgetName ) ) {

				throw( type = "NameAlreadyExists" );

			}

			// Race condition - both competing threads MAY HAVE time to get to this
			// point, each thinking that the given widget does NOT EXIST yet.
			// --
			// NOTE: We can use a sleep() to make sure both competing threads get to
			// this point while both transactions are being held open.
			// sleep( 2000 );
			createWidget( userID, widgetName );

		}

	}

	thread
		name = "competingThreadTwo"
		userID = userID
		widgetName = widgetName
		{

		transaction isolation = "serializable" {

			// MySQL is going to LOCK the index range scanned during this SERIALIZABLE
			// transaction, which in our case - with an index on `userID` - is going to
			// be all the rows that are owned by the given user.
			if ( isWidgetWithNameExists( userID, widgetName ) ) {

				throw( type = "NameAlreadyExists" );

			}

			// Race condition - both competing threads MAY HAVE time to get to this
			// point, each thinking that the given widget does NOT EXIST yet.
			// --
			// NOTE: We can use a sleep() to make sure both competing threads get to
			// this point while both transactions are being held open.
			// sleep( 2000 );
			createWidget( userID, widgetName );

		}

	}

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	// Wait for both competing threads to re-join the parent thread.
	thread action = "join";

	dump(
		label = "Widgets by User",
		var = getWidgetsByUserID( userID ),
		metainfo = false
	);
	dump(
		label = "CFThreads",
		var = cfthread
	);

</cfscript>

Because both of the CFThread tag-bodies are executing in parallel, there's a good chance that both threads will run the isWidgetWithNameExists() method at roughly the same time. As such, they will both return false; and then both threads will try and proceed to the createWidget() call at the same time.

Without a parent transaction, this would lead to "dirty data" in which two, duplicate records would be created. However, in this case, I am wrapping the workflow in a CFTransaction tag with SERIALIZABLE isolation. According to the MySQL documentation, this means:

This level is like REPEATABLE READ, but InnoDB implicitly converts all plain SELECT statements to SELECT ... LOCK IN SHARE MODE if autocommit is disabled.... Using LOCK IN SHARE MODE sets a shared lock that permits other transactions to read the examined rows but not to update or delete them.... InnoDB locks the index range scanned, using gap locks or next-key locks to block insertions by other sessions into the gaps covered by the range.

This means that the SELECT statement within our isWidgetWithNameExists() method call ends up locking all the rows scanned by our index. And, since we have an index on userID, MySQL locks all the rows with the given userID value. So, even if both CFThread tags enter the CFTransaction tag at the same time, only one of them will obtain the row-level locks on the userID value. Which, in turn, means that the subsequent call to the createWidget() method will ultimately fail for one of the asynchronous threads.

Ultimately, this leads to a deadlock for one of the threads, resulting in the following SQL error:

Deadlock found when trying to get lock; try restarting transaction

If unhandled, this would result in a 500 Server Error for the user, which is a poor user experience (UX). But, we can always add client-side logic to prevent the double-submission in the first place; or, we can add server-side logic that retries the creation workflow (ie, restarts the transaction) and then handles the thrown NameAlreadyExists custom error more gracefully.

By using the SERIALIZABLE isolation level in our CFTransaction tag, we are able to prevent dirty data. But, this comes at a cost - remember, this isolation locks all the rows owned by a given userID value. Which means, if other asynchronous threads / requests are attempting to run UDPATE or DELETE statements on any one of those rows at the same time, they may also run into deadlocks. To cut down on these race-conditions, the logic within your CFTransaction tags should be as fast as possible and do as little as possible so that the transaction can be committed as soon as possible thereby freeing up the locks.

In 99% of cases, when I use the CFTransaction tag, I'm doing it to enforce all-or-nothing updates on the database. Meaning, I'm not using the transaction to enforce uniqueness constraints, I'm using it to create atomic mutations on the data. As such, my mental model for isolation level is somewhat lacking. Hopefully, I haven't said anything here that is blatantly wrong or misleading. If so, please let me know!

The Default Isolation Level in MySQL is `REPEATABLE_READ`

The way that I understand these isolation levels, if this demo used:

transaction isolation = "repeatable_read" { ... }

Or, just used the CFTransaction tag without an explicit isolation (allowing MySQL to use the default isolation level):

transaction { ... }

... then we could end up with dirty data. This is because the REPEATABLE_READ isolation level is not converting our SELECT statement into a SELECT ... LOCK IN SHARE MODE statement. As such, our userID row-scan in isWidgetWithNameExists() wouldn't end up locking rows, which could allow both INSERT statements in our createWidget() call to succeed.

But, again, my mental model for isolation levels in transactions is very poor. It's definitely something I need to work on.

Want to use code from this post? Check out the license.

Short link: https://bennadel.com/4119

Reader Comments

Ben Nadel Sep 24, 2021 at 7:20 AM

16,255 Comments

@All,

After this post, I wanted to do a quick follow-up post to look specifically at the scope of the row-locking applied by the SERIALIZABLE transaction. In this post, I claimed that it was locked-down the userID - but, I wanted to actually demonstrate that this was true:

www.bennadel.com/blog/4122-exploring-the-scope-of-serializable-transaction-row-locking-in-lucee-cfml-5-3-7-47.htm

This follow-up updates multiples rows across multiple users in order to show that the execution times don't interact.

Ben Nadel Oct 17, 2021 at 6:29 AM

16,255 Comments

@All,

Another follow-up on the scope of row-locking - it seems that if a given slice of the index is empty - ie, there are now rows yet associated with a given userID (in our demo) - then we can end-up getting deadlocks:

www.bennadel.com/blog/4133-the-scope-of-serializable-transaction-row-locking-is-larger-when-rows-dont-yet-exist-in-mysql-5-7-32.htm

This appears to be related to the way in which MySQL has to lock something called supermum. I don't really understand what this means; but, I demonstrate the change in the above link.

Oh my chickens, this post is old!

Hit me up on LinkedIn if you want to discuss it further.

Why Not Use a UNIQUE KEY to Enforce Unique Naming Constraints?

Why Not Use a Distributed Lock to Enforce Unique Naming Constraints?

Enforcing Unique Naming Constraints Using SERIALIZABLE Transactions

The Default Isolation Level in MySQL is REPEATABLE_READ

Reader Comments

Why Not Use a `UNIQUE KEY` to Enforce Unique Naming Constraints?

The Default Isolation Level in MySQL is `REPEATABLE_READ`