Exploring The Cardinality And Selectivity Of SQL Conditions
After posting Rick Osborne's comments about Selectivity in a SQL, I thought I would take my lunch break to sort of wrap my head around the concept a little bit more fully. After some Googling, I came across a good comment on Experts Exchange:
the selectivity is what goes with the cardinality concept. the "cardinality" refers to the number of "distinct" values, as in the set theory so, take a column "SEX". the possible values are "male" and "female" (ignoring for the moment other possible values like "unknown" or even "other" ) ... so, your cardinality for that column would be 2, no matter how many rows you have in that table.
the selectivity is the "number of rows" / "cardinality", so if you have 10K customers, and search for all "female", you have to consider that the search would return 10K/2 = 5K rows, so a very "bad" selectivity.
the column for the primary key on the other side is "unique", and hence the cardinality is equal to the number of rows, by definition. so, the selectivitiy for searching a value in that column will be 1, by definition, which is the best selectivity possible
After reading this, it might seem like thinking about the Cardinality of a set of values is a bit overkill; why not just think about the conditions that will return fewer sets? Well, for me, it's not always obvious as to which conditions will actually return the least number of records. As such, I think that evaluating the Cardinality will actually help me out.
Let's take a look at the example from my original post (and follow-up comments) about the database as a bottleneck. In the example I presented, I had a Job table that has the following keys (the relevant ones at least):
- contact_id (int)
- institution_id (int)
- contact_role_id (int)
- date_started (date/time)
- date_ended (date/time)
From this table, I was trying to select all contacts that had the Job type (contact_role_id) as "CEO". For this one, I am going to select the jobs in which a given contact has a given role type. Before I really thought about any of the selectivity of my statements, I would have done something like this:
FROM contact c INNER JOIN job j ON ( c.is_deleted = 0 AND c.id = j.contact_id AND j.is_deleted = 0 AND j.contact_id = #FORM.contact_id# AND j.contact_role_id = #ContactRoles.CEO# AND j.date_started <= NOW() AND ( j.date_ended IS NULL OR j.date_ended > NOW() ) )
My database uses logical rather than physical deletes, so as a matter of practice, I put the is_deleted checks on either side of the Join so that I would not forget to include them. After that, I then included the contact ID (assuming we are looking for jobs for a given contact) and the role, followed lastly by the date/time constraints (assuming we're looking for active jobs only).
Now that that is out there, let's walk through my choices in terms of Cardinality and Selectivity. I am going to throw out some estimated and hypothetical values. The inaccuracy of these values may affect the specifics, but I think it will still be relevant. For the following thought experiment, I am going to assume that I have a 1,000 contacts and 1,2000 jobs. I am also going to assume that there will always be more jobs than contacts and that, in fact, this 1:1.2 ratio will remain fairly constant.
is_deleted. This field is a yes/no toggle, therefore the Cardinality of the field is 2. So, if we take a sample size of 1,000 contact records, we get a Selectivity of 1000 / 2 = 500. If we take the sample size of 1,200 jobs, we get a Selectivity of 1,2000 / 2 = 600. The relative selectivity here tells us that our logical delete should be checked on the contact first and then on the job second. However, neither of these selectivities are very good, so they should come later in our JOIN clause, probably not at the top where I have them.
contact_id. This field is a foreign key to the contact table. This is an interesting one because I could check the contact_id in the contact table OR I could check it in the job table. Let's see if that matters; if we put the filter on the contact table, we will have a separate ID for each contact. This means that for 1,000 contacts, our cardinality will also be 1,000 which will yield a selectivity of 1 (the best we can hope for). On the other hand, if we go with the contact_id in the job table (as we are in the example above), we have a cardinally of 1,000 and a row count of 1,200 which yields a selectivity of 1,2000 / 1,000 = 1.2. This is still very good, but not as good as the first choice. Clearly, our FORM.contact_id should be checked against contact.id rather than job.contact_id.
contact_role_id. There are only about 10 job types in my system, giving us a cardinality of 10. And, since this column only exists in the job table, this yields a selectivity of 1,2000 / 10 = 120. This is much better than the is_deleted checks, but worse than the contact_id check; it should therefore go between the above two conditions.
date_started. This is a strange one to think about. Assuming that the system does not allow anyone to create future jobs (which mine does not currently do), it means that all rows in the job table will have a past date_started column value. But, how does that affect Cardinality? How can one put a solid number on this "set" of data that will change with time? For the sake of argument (and please correct me if this is bad), since the date lives in a binary context - either before today or after today - I'm going to give it a cardinality of 2 yielding a selectivity of 1,200 / 2 = 600. This selectivity is pretty bad, among the worst that we have seen so far, and should therefore go at the end of the conditions.
date_ended. At first, I started to think about this differently than the date_started column since it helps us to return meaningful, active jobs. But, is this really any different? What are the possible values for this column? Yes, there can be many dates, but what are the sub-sets of possible values: NULL, before today, and after today. There are three sets of categorical values. I think we can translate this to a cardinality of 3 yielding a selectivity of 1,200 / 3 = 400.
Things are starting to get interesting when we think about the selectivity of our different conditions, especially with cases like "contact_id" which can be tested in several different tables. Based only on the selectivities calculated above, I suppose I would order the JOIN conditions as such:
FROM contact c INNER JOIN job j ON ( <!--- Selectivity: 1. ---> c.id = #FORM.contact_id# <!--- Selectivity: 120. ---> AND j.contact_role_id = #ContactRoles.CEO# <!--- Selectivity: 400. ---> AND ( j.date_ended IS NULL OR j.date_ended > NOW() ) <!--- Selectivity: 500. ---> AND c.is_deleted = 0 <!--- Selectivity: 600. ---> AND j.is_deleted = 0 <!--- Selectivity: 600. ---> AND j.date_started <= NOW() <!--- Join condition. ---> AND c.id = j.contact_id )
As you can see, this ordering is MUCH different than the one I was using before. Now, based on Rick Osborne's advice, I am doing all of the table-specific conditions first and then putting my actual JOIN clause at the end. I am not 100% sure why this is done, but I believe it allows us to take better advantage of the table indexing.
But what about the splitting up of the two tables? As you can see from the above JOIN, I am (by coinscedence) alternating checks on the contact table with checks on the job table. I assume that this will affect our ability to leverage indexes - but how? Should I be grouping related conditions together (in order)? Or, will the database know how to use the appropriate index given the columns that are present? Is this where we can start to take advanteage of "covering" indexes?
I really can't answer these last few questions. my gut tells me that the answers will weigh heavily on the solutions, but that will have to come in a different blog post. For now, I am feeling really good with this thought experiment. I am sure I have a long way to go in this "art form", but at least now I feel like I have a foot in the door.
Want to use code from this post? Check out the license.
"I am doing all of the table-specific conditions first and then putting my actual JOIN clause at the end. I am not 100% sure why this is done, but I believe it allows us to take better advantage of the table indexing."
For most modern DBMSes this probably won't matter. SQL Server, for example, has a pretty good optimizer and you can reorder these any way you like and not get different results.
However, on some DBMSes, SQL is less of a declarative language and more of an instructional one. That is to say, they will follow the SQL step-by-step without doing any real optimization on it. DB2/400 is one such system that hasn't caught up to the rest of the world. (Because DB2/400 is v5 of DB2, which is about 5 versions behind non-400 DB2. But I digress.)
So, in that case, there's a distinct difference between:
"(j.is_deleted = 0) AND (c.id = j.contact_id)"
... and ...
"(c.id = j.contact_id) AND (j.is_deleted = 0)"
... if the DBMS processes the instructions in order, instead of optimizing them. The first scenario filters the table then does a join, while the second scenario does the join and then filters the result of the join. It may not make a big deal for small tables, but for millions of rows it can make a huge deal. Especially with high selectivity where the table size is significantly larger than the selection size.
If your criteria filter the table from 100 rows down to 25 rows, and then join with another table that is just as large, your working set won't be any more than 625 rows. But if you join first and then filter, your working set could be as large as 10000 rows.
Yeah, it's dumb to have to tell the SQL optimizer how to do its job. But, the reality is what it is.
"Should I be grouping related conditions together (in order)?"
Again, this is DBMS-specific. I'd hate to give advice one way, beyond just saying that for each DBMS you use you're going to have to try it and see. It probably wouldn't be a bad habit to get used to doing: (left conditionals) then (right conditionals) then (join clauses). But definitely double-check me on that.
"will the database know how to use the appropriate index given the columns that are present?"
Hopefully, yes. Again, it depends on how smart the optimizer is.
Definitely play around, but I advise that you do it in an older and less-optimized database to really see the results. Fiddling around in recent versions of SQL Server or Oracle isn't going to show you much, as they have pretty good optimizers. An older version of mySQL would be good, as would an older version of DB2. You can really see the differences there.
"Is this where we can start to take advanteage of "covering" indexes?"
Sort of. Covering indexes come down to this: "Is *every* column in this query, including SELECT and WHERE and GROUP and everything, in a single index?" If so, that's a covering index and your optimizer never needs to go anywhere else but that index. (These are truly awesome to behold, btw. They can change your run times by orders of magnitude.)
Ok, thanks, I think I am getting a better picture here. I know you work with DB2 a lot, but for a database like MySQL, which I think a lot of people are working with days (LAMP and all), would you say that ordering JOIN clauses is gonna be beneficial? Or, do you think that the SQL optimizer in MySQL can handle that stuff efficiently?
Regardless of that question, I can definitely see a HUGE benefit to using Selectivity to order clauses if for no other reason, it helps you to create much more relevant table Indexes.
The same with grouping LEFT-Join table conditions and then RIGHT-Join table conditions - it will help me to visualize the proper indexes.
Stop me if you've heard this before, but ...
You can also use the Wayback Machine to visit the late 80's and early 90's before SQL was mostly standardized. Back then, you didn't join tables with the words "INNER JOIN" or "OUTER JOIN" or "LEFT JOIN" or anything so nice and descriptive. Joins used to look like this:
/* tables list with no clear relations to each other */
FROM contact AS c, job AS j
/* conditionals and keys all jumbled together in any order */
WHERE (c.id = #FORM.contact_id#)
AND (j.contact_role_id = #ContactRoles.CEO#)
AND (COALESCE(j.date_ended,0) > NOW())
AND (c.is_deleted = 0)
AND (j.is_deleted = 0)
AND (j.date_started <= NOW())
AND (c.id = j.contact_id)
See how messy that was? Imagine a 6-table join and how big the WHERE clause would be. But you can also see how the engines worked -- tell me all the tables you want, then tell me the operations you want on each table. The order of the WHERE clauses may or may not have been important (any wasn't for many engines) because it really didn't matter -- it's all just a bunch of filters.
Also, outer joins weren't much of an option. So along came this notation:
FROM aa, bb, cc
WHERE (aa.a1 = bb.b1)
AND (bb.b2 *= cc.c2)
That's AA inner joined to BB, then left outer joined to CC. As you can imagine, since *= is left outer join, =* is right outer join. And = is still = and still means inner join. (I don't remember if there was a cross join or full outer join syntax, but I don't think so. And Wikipedia calls = an "equijoin", but I've never heard any real person actually use that term.)
But even then, the optimizers were pretty dumb and the WHERE clause wasn't much more than a bunch of filters. But now, the filters had the option of saying "don't throw out this row, just leave some of it blank" for the outer joins.
With outer joins came interesting problems with the mess-o-filters way of doing things. For example:
FROM aa, bb WHERE (aa.a1 *= bb.b1) AND (bb.b2 IS NULL)
... may not be the same as ...
FROM aa, bb WHERE (bb.b2 IS NULL) AND (aa.a1 *= bb.b1)
Let's say that AA has 100 rows, while BB has 50 rows, 25 of which have a null value for B2.
If you do the join first, then your working set is still 100 rows as it is an outer join. Then apply the filter and what happens? You'll get the 25 rows in BB that have a legitimate null value, but you'll also get the 50 rows in AA that don't have a matching BB row. You will not get the 25 rows where there is a match but B2 is not null. Your end result will have 75 rows.
Do it in the other order. Filter first, taking BB down to 25 rows. Then left outer join. You have 100 rows in your result set, 25 of which have a matching row in BB.
Hence the need to be able to promote conditionals and join clauses up to the FROM clause to be more explicit in what they mean.
(To be fair, you can still do this with today's explicit join syntax, but it is much harder to do and much easier to see.)
To be honest, I've got a mySQL v5 install here, but I don't play with it much. However, some quick testing tells me that its optimizer isn't too dumb:
Given a table p10000 with column (i int not null) which has values from 0 to 9999, and the following query:
FROM p10000 AS a, p10000 AS b
WHERE (a.i = b.i)
AND (a.i IN (2, 4, 6, 8))
A stupid optimizer would join first, then filter second. But either way I run it, mySQL does it in the same time so I assume it's not that dumb. When I use mySQL's "explain" functionality, it shows me this:
id, select_type, table, type, key, rows, extra
1, SIMPLE, a, range, PRIMARY, 4, using where; using index
2, SIMPLE, b, eq_ref, PRIMARY, 1, using index
You can see that it is looking through its primary keys for both tables, but what it doesn't tell you is which order it executes these in.
Now, if I drop the primary keys, it's a different story:
id, select_type, table, type, key, rows, extra
1, SIMPLE, a, ALL, NULL, 10000, using where
2, SIMPLE, b, ALL, NULL, 10000, using where
As you can see, it now has to do a table scan (ALL). But even after I drop the primary keys and flip the order of the clauses back and forth, the execution time doesn't change significantly so I'd say mySQL is pretty smart.
Unfortunately, my DB2/400 explain functionality is busted, or I could show you an example of an engine that isn't that smart.
Even with a smart query optimizer, like I said before, I can still see a benefit to thinking about this selectivity stuff as it might help to think about what would be the more effective Indexing strategies.
Good stuff both of you guys!
I've been following your posts for a few weeks and love to read your thought processes. And between you and Rick, I must thank you both, as until now I was afraid that the only future prospects for my teenaged son, who lives to argue and argues to live, might be that of a lawyer. But now I see that he has prospects of becoming a database programmer or even software engineer. :o
I just ran across a great set of articles specifically about MySQL and Indexing:
I read through 2 of them and skimmed the other 2, there are some great examples of optimization here.
That's too funny :) Send him along, we'll take good care of him. If he loves to argue, then perhaps he just loves to get to the bottom of the problem - something that is essential for computer science.
I just read the first two articles also - great links, thanks.
I am a little confused on how indexing relates to the SELECT and ORDER BY clauses (and probably GROUP BY as well). I used to think that an index would be used to find a spot on the file system and then things would be pulled from the file system, but I am thinking that this is incorrect thinking?
It seems that if SELECT or ORDER BY use data from a table that is NOT in the index, then the index is not used. BUT, what doesn't make sense, or at least what I don't understand from the examples is that even though an index was not being used, there were still huge performance increases in the select. So, is the index being used partially but not full? Or is it not being used at all?
Basically, what I don't get is in the EXTRAS column of the EXPLAIN, if it didn't have "using indexes" then how were they also seeing good performance increases?
I'm doing some studying for one of the MCITP exams in SQL Server and one of my study topics for the night was Selectivity vs Cardinality. Came across your blog during my search and it was exactly what the doctor ordered. Thanks a lot.
Very interesting discussion. why should you know arithmetic since we have calculators. Yes, Thank God for query optimizers (as its a practical tool). However, building execution plan in optimum way is a good practice, there are PHD's written on the subject of query process optimization whether in heuristic or cost estimation approach. So knowing and understanding the various cost results of the arrangement of operations is a must for any DB developer.