Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.
Ben Nadel at the jQuery Conference 2010 (Boston, MA) with: Yehuda Katz
Ben Nadel at the jQuery Conference 2010 (Boston, MA) with: Yehuda Katz@wycats )

Considering Index Design When Using A Nullable DateTime Column To Record A Scheduled Action

By Ben Nadel on
Tags: SQL

I had a lot of fun writing yesterday's post on creating a composite index using ancestral keys in a hierarchical database. Index design - and database schema design in general - is fascinating to me. I've been using relational databases for years and I'm still learning how to leverage them most effectively. In the spirit of that journey, I wanted to share another thought that I've had recently about index design, this time around the use of a nullable DateTime column used to record the execution of a scheduled action. To me, this is an interesting use-case because the order of the columns in the composite index is opposite of what it is in both the MySQL table and in my mental model.

To set the context for this exploration, I'm building an internal scheduling system at work. Essentially, I have a table that keeps track of notifications that have to be sent. The records in this table are created explicitly by an internal user action; and then, they are "processed" by an ongoing scheduled task.

The scheduling and notification algorithm is controlled by two columns:

  • scheduledAt - A non-nullable DateTime column that indicates when the given notification should be sent.

  • sentAt - A nullable DateTime column that indicates if and when the given notification was sent by the scheduled task.

When these records are inserted into the database, the scheduledAt column is defined as some future date and the sentAt column is set to NULL. The NULL value indicates that the notification is "pending"; and, it's up to the scheduled task to locate these records and send them at the appropriate time.

When I first went to author the scheduled task query that locates the "overdue" notifications, my WHERE column looked something like this:

SELECT
	*
FROM
	demo_notification n
WHERE
	n.scheduledAt <= '2019-07-10' -- Indicates "overdue".
AND
	n.sentAt IS NULL -- Indicates "pending" state.

This WHERE column matches the voice in my head: "Find the notifications that are overdue and have not yet been processed". And, at first, I created an index that aligned with this mindset:

KEY `IX_byDate` ( `scheduledAt`, `sentAt` )

The problem with this approach is that my query is using a range condition on the first column of the index and then an equality comparison on the second column of the index. At least in MySQL, this doesn't work. MySQL can only consume a composite index prefix up to and including a range condition. It then ignores everything after the range (in terms of index consumption). So, in this case, the sentAt condition is being checked by Using Where only after the rows are identified by the index range condition:

MySQL will only consume a composite indiex upto and including a range condition; it will then perform subsequent filtering using WHERE.

In this case - probably because I have so little data, and because the range isn't very limiting - MySQL is bypassing the index altogether and just performing a full table scan.

In order to create a better index for my scheduled notifications algorithm, we have to reverse the way we think about gathering the data. Instead of locating the "overdue" notifications first, we have to focus on locating "unprocessed" records first. The new voice in my head then becomes: "Find the notifications that haven't been processed and then limit that to the ones that are overdue".

With this new mindset, the order of the columns in the composite index is flipped, with sentAt being first:

KEY `IX_byDate` ( `sentAt`, `scheduledAt` )

And, we can also flip the conditions in the SQL query as well. Though, to be clear, this is just a personal preference that defines "intent" - to line the SQL conditions up with the way in which the MySQL query optimizer will consume them. Technically, the query optimizer will do this for you behind the scenes:

SELECT
	*
FROM
	demo_notification n
WHERE
	n.sentAt IS NULL -- Indicates "pending" state.
AND
	n.scheduledAt <= '2019-07-10' -- Indicates "overdue".

As you can see, in this version of the query, the equality condition is being performed on the first column of the index, sentAt, while the range condition is being performed on the second column of the index, scheduledAt. This allows MySQL to use the composite index to perform both comparisons:

MySQL can consume the entire composite index if the range condition is applied to the last column being consumed.

To me, this is a really fun SQL exploration because it forces you to think deeply about how the data in your application will be accessed; and then, to modify that thinking based on the constraints of the underlying database engine. In this case, I had to flip the order of the columns in my index because my data was being accessed using a range condition. This is sharp contrast to my normal ID-based conditions that can be structured in closer alignment with my mental model.

I just love SQL.



Reader Comments

Ben. I never knew this. Thanks for the heads up. I'd be interested to know why MySQL will not process other indexes after a RANGE column in a composite index? This seems a bit weird to me. Possibly a bug? I guess, if this is the case for MSSQL, Oracle etc, then it is probably normal behaviour.

Reply to this Comment

@Charles,

I'm not 100% on this, but I believe this is more-or-less the way other SQL engines use indexes as well. I did a little Googling and found this post -- https://www.vividcortex.com/blog/the-left-prefix-index-rule -- that generalizes it for most databases:

There's more to the left-prefix rule. The usable prefix of the index, in most databases and under most circumstances, is up to and including the first inequality or range search condition.

Suppose we have a four-column index on columns (a, b, c, d). If we search for records where a=X, and b=Y, and c>Q, and d=R, then the usable prefix is only the first three columns. That's because the database will be able to progressively narrow the search until it gets to the fourth column...

I think it has to do with how the index is structured internally. But, that's a lot more lower-level than I understand. I just sort of accept that it stops at the first range-condition :D

Reply to this Comment

Post A Comment

You — Get Out Of My Dreams, Get Into My Comments
Live in the Now
Oops!
NEW: Some basic markdown formatting is now supported: bold, italic, blockquotes, lists, fenced code-blocks. Read more about markdown syntax »
Comment Etiquette: Please do not post spam. Please keep the comments on-topic. Please do not post unrelated questions or large chunks of code. And, above all, please be nice to each other - we're trying to have a good conversation here.