Skip to main content
Ben Nadel at Scotch On The Rock (SOTR) 2010 (Munich) with: Christian Etbauer and Harry Klein
Ben Nadel at Scotch On The Rock (SOTR) 2010 (Munich) with: Christian Etbauer and Harry Klein@kleinh )

MySQL Query Optimization + Forgetting To Run EXPLAIN = Full Table Scan

By on
Tags:

Yesterday, while editing some ColdFusion code for this blog, I had a face-palm moment: I came across a query - the "recent comments" query on the homepage - that was almost certainly performing a full table scan of comments. At first glance, I thought I must have forgotten to check the performance characters of the SQL query with EXPLAIN. But, as I began to tweak the SQL query, I realized what happened: an earlier version of the query was using indexes; but then, I made a small change that completely altered MySQL's query execution. I love how clever the MySQL query optimizer is sometimes; so, I thought this might be worth a closer look.

If you go to the homepage of this blog, you'll see a "Recently Posted Comments" section that shows the 10 most recent comments along with the "member" that posted the comment and the "entry" on which the comment was posted. This data is gathered across three tables:

  • blog_comment
  • blog_entry
  • member

While there is nothing of particular surprise in the blog_comment and member tables, the blog_entry table has an isActive boolean. This boolean allows me to take blog posts offline without actually deleting them. But, when I first authored this query, I suspect that I had forgotten about the isActive check and left it out.

Let's take a quick look at that incomplete query first so that we can see exactly where I messed up. Here's a truncated version of the SQL query without the isActive check (SELECT statement is truncated because it's not relevant to the conversation):

SELECT
	/* ... truncated ... */
FROM
	blog_comment c
INNER JOIN
	blog_entry b
ON
	b.id = c.blogEntryID
INNER JOIN
	member m
ON
	m.id = c.memberID
ORDER BY
	c.id DESC
LIMIT
	10

As you can see, it's a relatively straightforward SQL query. I'm defining the JOIN conditions using best practices: data that I have is on the right, data that I need is on the left. Then I'm using ORDER BY id DESC in lieu of date-based sorting in order to get the most recent comments.

In fact, this SQL query is so simple that the MySQL query optimizer does something very clever! It knows that it has to do a full table scan; but, it also knows that it doesn't have to do any filtering; so, it attempts to run the query "backwards," so to speak.

Let's look at the EXPLAIN output for this query:

mysql> EXPLAIN SELECT  .... ;
+----+-------------+-------+------------+--------+----------------------+---------+---------+------------------------+------+----------+---------------------+
| id | select_type | table | partitions | type   | possible_keys        | key     | key_len | ref                    | rows | filtered | Extra               |
+----+-------------+-------+------------+--------+----------------------+---------+---------+------------------------+------+----------+---------------------+
|  1 | SIMPLE      | c     | NULL       | index  | IX_join,IX_by_member | PRIMARY | 4       | NULL                   |   10 |   100.00 | Backward index scan |
|  1 | SIMPLE      | b     | NULL       | eq_ref | PRIMARY              | PRIMARY | 4       | bennadel.c.blogEntryID |    1 |   100.00 | NULL                |
|  1 | SIMPLE      | m     | NULL       | eq_ref | PRIMARY,IX_info      | PRIMARY | 4       | bennadel.c.memberID    |    1 |   100.00 | NULL                |
+----+-------------+-------+------------+--------+----------------------+---------+---------+------------------------+------+----------+---------------------+
3 rows in set, 1 warning (0.00 sec)

Notice that the blog_comment (alias c) only has to scan 10 rows. In the Extra column, it states:

Backward index scan

Essentially what (I believe) is happening here is that the query execution is just scanning the primary key index on the blog_comment table backwards, grabbing the "first" 10-rows, and then performing the JOIN conditions. Basically, it's merging the requirements of ORDER BY into the ON condition of the first JOIN.

To be clear, this is a query optimization that is happening behind the scenes - it is not what the query would be doing if it were naively executing the JOIN.

I think about SQL JOINs in the same way that I think about using .map() and .filter() methods in ColdFusion and JavaScript. Meaning that the result-set of a cross-table query is calculated one JOIN at a time; and then, the materialized results of that JOIN are passed-down to the next JOIN. I don't know how accurate this mental model is in relation to the nested-loop join algorithm that MySQL uses; but, it keeps it simple and helps me think about index structures and performance.

Accepting that I did run an EXPLAIN on my original query and saw that it was scanning 10-rows, given my mental model for how JOINs work, I likely glossed-over the whole "backwards index scan" optimization and just assumed that the query was using standard JOIN mechanics. Which means, when I went to add the missing isActive=1 condition to my ON clause, I assumed that I was doing little more than updating the cross-product calculation that was already in place.

As such, I likely assumed that I didn't have to re-run the EXPLAIN on the updated SQL statement.

However, by adding the isActive condition, I was negating the ability for MySQL to apply the "backwards index scan" optimization, which fundamentally changed the query plan. Here's the updated SQL statement:

SELECT
	/* ... truncated ... */
FROM
	blog_comment c
INNER JOIN
	blog_entry b
ON
	(
			b.id = c.blogEntryID
		AND
			b.isActive = 1 -- Active posts only!
	)
INNER JOIN
	member m
ON
	m.id = c.memberID
ORDER BY
	c.id DESC
LIMIT
	10
;

All I did was add one condition to the first ON clause. However, if we now run an EXPLAIN on this, we get a radically different result from our earlier EXPLAIN:

mysql> EXPLAIN SELECT .... ;
+----+-------------+-------+------------+--------+----------------------+---------+---------+---------------------+------+----------+----------------------------------------------+
| id | select_type | table | partitions | type   | possible_keys        | key     | key_len | ref                 | rows | filtered | Extra                                        |
+----+-------------+-------+------------+--------+----------------------+---------+---------+---------------------+------+----------+----------------------------------------------+
|  1 | SIMPLE      | b     | NULL       | ALL    | PRIMARY              | NULL    | NULL    | NULL                | 2605 |    50.00 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | c     | NULL       | ref    | IX_join,IX_by_member | IX_join | 4       | bennadel.b.id       |   14 |   100.00 | NULL                                         |
|  1 | SIMPLE      | m     | NULL       | eq_ref | PRIMARY,IX_info      | PRIMARY | 4       | bennadel.c.memberID |    1 |   100.00 | NULL                                         |
+----+-------------+-------+------------+--------+----------------------+---------+---------+---------------------+------+----------+----------------------------------------------+
3 rows in set, 1 warning (0.00 sec)

This time, instead of the query scanning the blog_comment table first, it performs a full table scan of the blog_entry table in order limit the results to where the isActive=1 condition can be satisfied.

Dr. House saying Oops.

The fundamental problem here is that my SQL query didn't have a good way to limit the cross-product between the tables as they were being calculated. To remedy this, I can break the query into two parts:

  1. Find an appropriate comment ID that represents the earliest of the "recent comments" - a "cut-off" ID.

  2. Use the cut-off ID to limit the cross-product of the subsequent JOIN.

The following SQL query isn't exactly equivalent to the one above; but, we'll get into that shortly:

/**
* Let's get the 11th-from-last comment ID that we can use to drastically limit
* the INNER JOIN condition in the subsequent query. Since this query is only
* inspecting the primary key of the table, it can get this ID instantly using
* the backward index scan approach.
*/
SET @cutoffID = (

	SELECT
		c.id
	FROM
		blog_comment c
	ORDER BY
		c.id DESC
	LIMIT
		1
	OFFSET
		10 -- We want the most recent 10 comments, skip over first 10 IDs.

);

SELECT
	/* ... truncated ... */
FROM
	blog_comment c
INNER JOIN
	blog_entry b
ON
	(
			c.id > @cutoffID -- !! USING THE CUT-OFF ID TO LIMIT CROSS-PRODUCT !!
		AND
			b.id = c.blogEntryID
		AND
			b.isActive = 1
	)
INNER JOIN
	member m
ON
	m.id = c.memberID
ORDER BY
	c.id DESC
;

Now, when we run the SQL query, we're including our @cutoffID variable in the INNER JOIN which is allowing the query execution to greatly limit the results of the cross-product. In fact, since our cut-off ID is already limiting us to the most recent 10 IDs in the blog_comment table, we can remove the LIMIT 10 from the second portion of the query since it is now redundant (given that all of our JOIN relationships are one-to-one).

If we run an EXPLAIN on this new query, we get the following:

mysql> EXPLAIN SELECT .... ;
+----+-------------+-------+------------+--------+------------------------------+---------+---------+------------------------+------+----------+----------------------------------+
| id | select_type | table | partitions | type   | possible_keys                | key     | key_len | ref                    | rows | filtered | Extra                            |
+----+-------------+-------+------------+--------+------------------------------+---------+---------+------------------------+------+----------+----------------------------------+
|  1 | SIMPLE      | c     | NULL       | range  | PRIMARY,IX_join,IX_by_member | PRIMARY | 4       | NULL                   |   10 |   100.00 | Using where; Backward index scan |
|  1 | SIMPLE      | b     | NULL       | eq_ref | PRIMARY                      | PRIMARY | 4       | bennadel.c.blogEntryID |    1 |    50.00 | Using where                      |
|  1 | SIMPLE      | m     | NULL       | eq_ref | PRIMARY,IX_info              | PRIMARY | 4       | bennadel.c.memberID    |    1 |   100.00 | NULL                             |
+----+-------------+-------+------------+--------+------------------------------+---------+---------+------------------------+------+----------+----------------------------------+
3 rows in set, 1 warning (0.01 sec)

As you can see, we're back to performing a "Backward index scan" on the primary key of the blog_comment table that filters the cross-product to 10-records. Only this time, it's performing a "range" look-up with our pre-calculated @cutoffID value.

That said, this query and the previous query are not equivalent. In the first version of this query, if I were to mark the most recent blog post as inactive, the query would simply move onto the active blog posts and return the 10 most recent comments. However, in this latter query, if I were to mark the most recent blog post as inactive, I would likely lose results since my @cutoffID doesn't take into account any inactive blog posts.

To balance out performance and robustness, I could do something like limit the @cutoffID to the 100th or 200th comment ID and then add the LIMIT back into the final query. This would build-in some wiggle-room for the isActive=1 JOIN condition.

But, since I almost never mark a blog post as inactive, especially after it has received comments, I'm not even going to worry about that edge-case at this time.

Ultimately, I'm the one responsible for running EXPLAIN on my SQL queries before I deploy them to production. So, this was just sloppiness on my part. But, it's pretty cool to see how the MySQL query optimizer was able to save me from myself in at least one of the full table scan scenarios.

Want to use code from this post? Check out the license.

Reader Comments

4 Comments

In general, I despise sql syntax and I've always found it confusing. I find it easier to decipher full blown assembler, than SQL syntax. I guess that makes me odd. So unfortunately, the sql in your post looks like:

intaddr equ 1ch*4	; interrupt address
segaddr equ 62h*4	; segment address of first copy
mfactor equ 17478	; minute conversion factor * 16
whozat	equ 1234h	; signature
color	equ 14h		; crazy output

But maybe that's why I still can't find work ๐Ÿค”

15,192 Comments

@Anthony,

It's just arc-tangents and assembler code for you, huh? ๐Ÿ˜† But, seriously, SQL definitely rubs some people the wrong way. I know there are people who love document databases; but, when I look at the syntax for dealing with those APIs, they feel soo wonky to me. I guess so much of it is just what you're used to looking at.

Post A Comment — I'd Love To Hear From You!

Oops!
NEW: Some basic markdown formatting is now supported: bold, italic, blockquotes, lists, fenced code-blocks. Read more about markdown syntax »
Comment Etiquette: Please do not post spam. Please keep the comments on-topic. Please do not post unrelated questions or large chunks of code. And, above all, please be nice to each other - we're trying to have a good conversation here.