Skip to main content
Ben Nadel at CFinNC 2009 (Raleigh, North Carolina) with: Simon Free and Dan Wilson
Ben Nadel at CFinNC 2009 (Raleigh, North Carolina) with: Simon Free@simonfree ) and Dan Wilson@DanWilson )

Irrational Guilt Over Using ORDER BY id In My ColdFusion SQL Queries

By on
Tags: ,

Most of the relational database tables that I create in my ColdFusion applications have an AUTO_INCREMENT primary key, id, and some sort of createdAt date/time column. The nature of the primary key along with the nature of a timestamp means that both of these columns "increase" at the same time. Which means that - in almost all cases - I can use the id to sort the query results in chronological order. But, I have a lot of irrational guilt over doing this since the id column is not semantically the same as the createAt column. But, this guilt is stupid; and succumbing to it would actually lead to worse performance in my ColdFusion and MySQL applications.

To get a sense of what kind of database schema I'm talking about, consider the following CREATE TABLE statement:

CREATE TABLE `doc` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `userID` int(10) unsigned NOT NULL,
  `name` varchar(255) NOT NULL,
  `createdAt` datetime NOT NULL,
  PRIMARY KEY (`id`),
  KEY `IX_byUser` (`userID`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

As you can see, I have an id column (an auto-incrementing integer) and a createAt column. As the ColdFusion application consumes this table, all INSERT operations will include a createdAt value that represents the current moment-in-time. As such, this value will never get smaller - it will only ever increase (or stay the same depending on the volume of operations per second).

To drive home the point: Both the id and createdAt columns will only ever increase over time - they will never decrease. As such, the id value can be a rough stand-in for the createdAt value in terms of sorting.

To see why this is actually good for performance, let's throw some test data into this table. In the following ColdFusion code, I'm just going to execute 5,000 INSERT statements, each with a createdAt date on a unique day (to keep things simple):

<cfscript>

	// Increase the page-request timeout so that our 5,000 queries have time to run.
	cfsetting( requestTimeout = 300 );

	for ( i = 1 ; i <= 5000 ; i++ ) {

		userID = randRange( 1, 100 );
		name = "Doc for user #userID#, v#createUuid()#";
		// NOTE: For the sake of the demo data, we want to the DATE and the PKEY to
		// increment in lock-step. This is how the data would work normally; but, since
		// we're populating the demo data programmatically, we have to force the
		// synchronicity a bit.
		createdAt = createDate( 1990, 1, 1 )
			.add( "d", i )
		;

		createDoc( userID, name, createdAt );

	}

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	/**
	* I create a record in the doc table using the given properties.
	*/
	public void function createDoc(
		required numeric userID,
		required string name,
		required date createdAt
		) {

		queryExecute(
			"
				INSERT INTO
					doc
				SET
					userID = :userID,
					name = :name,
					createdAt = :createdAt
				;
			",
			{
				userID: {
					value: userID,
					cfsqltype: "cf_sql_bigint"
				},
				name: {
					value: name,
					cfsqltype: "cf_sql_varchar"
				},
				createdAt: {
					value: createdAt,
					cfsqltype: "cf_sql_timestamp"
				}
			}
		);

	}

</cfscript>

As you can see, both the id and the createdAt columns are increasing by 1 on each loop. The id column does this because that's what an AUTO_INCREMENT column does; and, the createdAt column does this because I'm using the for-loop iteration index in a dateAdd() call.

ORDER BY id and ORDER BY createdAt are Essentially the Same

To demonstrate that I can use either the id column or the createdAt column in an ORDER BY clause to achieve the same results, let's grab all the documents for a given user and order them in chronological order:

<cfscript>

	idLists = {
		"id": "",
		"createdAt": ""
	};

	// Iterate over each COLUMN to run the same query using a different ORDER BY clause.
	for ( orderByColumn in idLists ) {

		results = queryExecute("
			SELECT
				*
			FROM
				doc
			WHERE
				userID = 4
			ORDER BY
				#orderByColumn# ASC
		");

		// Get the list of IDs from this query.
		idLists[ orderByColumn ] = valueList( results.id );

	}

	// Now that we've run the query twice, once with ORDER BY id and once with ORDER BY
	// createdAt, let's compare the two lists of ids.
	writeOutput(
		"Same ID list: " &
		yesNoFormat( idLists.id == idLists.createdAt )
	);

</cfscript>

This ColdFusion code runs the same query twice, but with different ORDER BY clauses:

  • ORDER BY id ASC
  • ORDER BY createdAt ASC

Then, it pulls out the values from the id column and compares them:

Same ID list: Yes

As you can see, both ORDER BY clauses result in the same set of results order in the same way. Because, again, id can act as a rough stand-in for the createdAt when it comes to ordering rows.

ORDER BY id is Better for ColdFusion Application Performance

Ok, now let's talk performance; and, why using ORDER BY id has a positive affect on your ColdFusion application performance. If you recall from my post on the dark art of database indexing, all secondary indexes on a relational database table implicitly include the primary key for that table at the end of each index entry. Which means that the secondary index on our doc table:

KEY `IX_byUser` (`userID`) USING BTREE

... is actually using this under the hood:

KEY `IX_byUser` (`userID`,`id`) USING BTREE

Only, the id portion doesn't need to be explicitly defined because it's inherent to the way secondary indexes work. A secondary index doesn't actually point to the row - it points to the primary key which, in turn, points to the row.

The other thing to know about secondary indexes is that they are stored in sort-order. And, when you perform a SQL query using a secondary index, the rows are returned in that same sort order.

This behavior applies to the prefix-consumption of the secondary index. Imagine that you have a secondary index with the following columns:

A,B,C,D [,pkey]

If you execute a SQL query with a WHERE condition that uses columns A and B for filtering, the rows will come back naturally with the ascending sort order of the next column, C.

If you execute a SQL query that uses all of the columns in the secondary index, A, B, C, and D, the rows will come back naturally with the ascending sort order of the next column, which is implicitly the primary key.

So, for our demo with the secondary index on the userID column, a WHERE clause that uses the userID will inherently return the results using the ascending sort order of the id column, which is our primary key. And, since id is a rough stand-in for createdAt in this case, we actually wouldn't need to order the query at all if we wanted to return the results in chronological order.

And, this actually changes the query execution plan! To see what I mean, let's run an EXPLAIN on selecting documents for a given user (4), but using the different ORDER BY clauses:

As you can see here, when we ORDER BY id, the SQL query execution is able to sort the query using the index alone. However, when we ORDER BY createdAt, the SQL query execution has to use an additional filesort in order to return the records in the requested order.

And, since id is a rough stand-in for createdAt when it comes to sorting, the latter query is doing more work for no reason - both queries result in the same exact data.

Of course, the huge caveat here is that in order to leverage the stored sort of the implicit id column on the secondary index, you have to consume all of the columns in the secondary index. Meaning, if our IX_byUser index was actually defined using the name column as well:

KEY `IX_byUser` (`userID`,`name`) USING BTREE

... then both of our EXPLAIN outcomes above would have included a filesort. And both queries would have had the same performance characteristics.

In total, what this means is that at worst (depending on the indexes) ORDER BY id and ORDER BY createdAt have the same performance. But, at best (depending on the indexes an our queries) we can bring back the records in chronological order (ascending or descending) with better performance when using ORDER BY id.

In other words, it's always better to use ORDER BY id because, at least sometimes, it has better performance. And, I'm done feeling guilty about using the id column in this manner.

If You Don't Use an Incrementing Primary Key

This whole post applies to a situation in which you have a primary key what is incrementing / increasing monotonically. If you don't have that in your database table - such as when using a UUID as the primary key - then this post doesn't apply to you.

It's Not Premature Optimization, It's How Databases Work

It would be easy to look at this post and claim that using id instead of createdAt is a case of premature optimization; but, I would push back against that notion. Everything about how we structure the database is geared towards performance. From normalization to secondary indexes - it's all about performance; and how we organize our data in a way that balances performance and usability. If you think using id is premature, then you might as well say that using secondary indexes is premature as well since you can pull back data without them.

Want to use code from this post? Check out the license.

Reader Comments

152 Comments

I always learn so much about how databases work from your posts. It is by far the most opaque technically I work with regularly...and I appreciate you pulling back the curtain for me every once in a while. I need a master's course in EXPLAIN

15,230 Comments

@Chris,

I'm so happy you find this stuff interesting - you are likely in the small minority of people ๐Ÿ˜†๐Ÿ˜†๐Ÿ˜† But yeah, it took me years to start to feel really comfortable with relational databases. And, I'm still learning - and there's still so much stuff that I don't know. Even with EXPLAIN, I only vaguely know what I'm talking about.

When I talk to the people on our data-services team, I feel like I'm speaking a different language. They start talking about "bin logs" and "change data capture" and I'm like wat?! ๐Ÿ˜ฎ

Just taking it a day at a time.

15,230 Comments

@Nanos,

It's a good question - and something I didn't really touch-on in the post. At a high-level, yes, adding createdAt to the existing index would achieve the same performance (since the secondary index is stored in sort-order).

But, indexed are not free. They have storage costs and performance costs when it comes to maintaining them over time. So, if all you were doing was adding createdAt to get the sort to be faster, then you've incurred that cost for no real benefit since you can get the sort of the id anyway.

That said, if you need to search on the createdAt date, then that's a different story - in the case, adding the createdAt date to the index could be beneficial, assuming that you're first limiting the query by userID such that the createdAt column is the next column in the secondary index:

WHERE
	d.userID = 4
AND
	d.createdAt >= DATE_ADD( UTC_TIMESTAMP(), INTERVAL -30 DAY )

Of course, if you were less concerned with the "last 30 days" and would be OK with just "recent" documents, then you could do something like:

WHERE
	d.userID = 4
ORDER BY
	d.id DESC
LIMIT
	10

... where you can leverage the primary key for the sorting without incurring any performance penalty or having to index the createdAt column.

Ultimately, this is all more art than science; and every app is different and every View has different requirements. It's super subjective - I really only intended to alleviate my own guilt on the matter ๐Ÿคช Your mileage will definitely vary.

2 Comments

@Ben,

Yeah makes sense. In my case, I almost always need to be able to search by date, so I often end up adding an index on that column. But if you don't need that then you certainly have a point...

15,230 Comments

@Nanos,

Yeah, exactly - every app is different. I'm sure I actually do end up search by date more often than I realize. Or, even if it's not in the Product UI, there's always some back-end report that I need to generate for the Sales or Customer Success team. But, for those one-off things, if I can narrow the records based on an indexed value, the additional filtering on date - even when not indexed - is usually pretty fast.... But, not always. More art than science. ๐Ÿค“

Post A Comment — I'd Love To Hear From You!

Oops!
NEW: Some basic markdown formatting is now supported: bold, italic, blockquotes, lists, fenced code-blocks. Read more about markdown syntax »
Comment Etiquette: Please do not post spam. Please keep the comments on-topic. Please do not post unrelated questions or large chunks of code. And, above all, please be nice to each other - we're trying to have a good conversation here.