The other morning, I was up until 2:30 AM working with the Data Services team, trying to figure out why a MySQL data migration script was behaving badly. This script, which had completed successfully several times in the past, had suddenly started exhibiting strange behavior when processing a new data-set. At this time, my best guess is that the problematic behavior was caused by the fact that I wasn't using a deterministic ORDER BY clause in my chunking query.
Imagine that we have a table that looks something like this:
CREATE TABLE tokens (
id int(11) NOT NULL AUTO_INCREMENT,
value varchar(30) DEFAULT NULL,
PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
This table only has two columns, one of which is the Primary Key, which will implicitly determine the order in which the records come back in a normal SELECT statement. Now, in the data migration script, I had to iterate over this table - a chunk of records at a time - performing additional processing for every single record.
Some of the records contained duplicate tokens; so, in an attempt to process only unique records, I was selecting DISTINCT token values:
CAUTION: This SQL statement is broken!
DISTINCT( value )
10 -- Chunk size.
8510 -- Where I am in the migration of the table.
NOTE: This query shouldn't have used DISTINCT to begin with since the master process which determines the chunking behavior wasn't using DISTINCT in its calculation. So, this query had a number of problems.
When I wrote this code originally, I had assumed that the order of the records would be consistent across all OFFSET values (as it had been in the past). For this migration, however, it stopped working. Suddenly, the DISTINCT / OFFSET query started returning the same rows over and over again, no matter what the OFFSET was.
I don't know why this outcome was different. Perhaps it was the size of the table or the version of MySQL running on the particular machine. I know that I can't replicate this behavior in my local development environment. That said, from what I have been reading, this does not appear to be a bug. Since I omitted an ORDER BY clause in the SQL statement, it was up to the query optimizer to determine how to most efficiently return rows. And, without an ORDER BY clause, there is nothing inherent to the OFFSET or LIMIT directives that guarantee consistent and predictable ordering across queries.
So, the real take away for me is, always use a deterministic ORDER BY when using the LIMIT and OFFSET directives in MySQL. Otherwise, there is no guarantee of consistent and predictable ordering.
Also, I would favor GROUP BY over DISTINCT as I believe it forces you to think about the query more effectively. I think a developer can too easily pop DISTINCT into the top of a query without having a solid understanding of the implications.
Want to use code from this post? Check out the license.