Over Thanksgiving, I wrote a blog post on the importance of database performance in data-driven applications. I stated that since database calls are most often the bottleneck in an application, it is essential that I, as a software developer, have really good SQL skills. I also vented that I don't know nearly enough about database indexing or how SQL statements should be written in relation to index structures.
In the comments to my blog entry, Rick Osborne gave me some really awesome insight and tips on best practices. Rick always has great stuff to say and I wanted to take a few minutes just to summarize his comments here in case anyone missed them in the previous post.
To keep things simple, you should almost never have to rewrite queries based on how your indexes are set up. Almost. There are edge cases where you *can* be forced to, but those are endgame cases. Most of the time it is the other way around: write your queries, run them, then see if you need to tweak the indexes.
Until you get to very large databases with millions of rows, ignore anyone who tells you that too many indexes will slow down Inserts and Updates. That was true 20 years ago, but is not an issue now for most small- to mid-sized databases.
Most indexes are dead simple. "Does this column show up in my WHERE clause more than a couple of times?" Yes? Index it.
3a. The counter-argument to the last point is "will an index here actually do anything for me?", which has to do with selectivity.
Selectivity in a nutshell: Your data is a card catalog in a library. (Remember those things?) Or a filing cabinet full of manila folders. An index is the tabs on the cards or folders that helps you find what you are looking for. Selectivity is the ratio of unique keys to the total number of rows.
Say you have 100 rows in your table and two columns that you are thinking about indexing, AA and BB. If AA has 2 unique values, its selectivity is 2/100. Similarly, if BB has 20 unique values, its selectivity is 20/100. Go back to the card catalog example: you have either 2 tabbed divider cards or 20 tabbed divider cards. Which is going to help you find data faster? The 20, right? That column has higher selectivity, and is thus going to get you to your data faster. The 2-tab index is almost useless. (And, in some cases, can actually make your query run slower.)
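Rick's ratio can be checked directly with a query. This is just a sketch; the table name `my_table` is a placeholder, and `AA` and `BB` are the candidate columns from the example above:

```sql
-- Rough selectivity check for two candidate index columns.
-- Multiply by 1.0 to force decimal (rather than integer) division.
SELECT
    COUNT(DISTINCT AA) * 1.0 / COUNT(*) AS selectivity_aa,  -- e.g. 2/100  = 0.02
    COUNT(DISTINCT BB) * 1.0 / COUNT(*) AS selectivity_bb   -- e.g. 20/100 = 0.20
FROM my_table;
```

The column with the ratio closer to 1.0 is the better indexing candidate.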
In most modern DBMSes, you hardly need to add indexes at all as long as you have Primary *and* Foreign keys set up. Joins are where indexes really shine, so proper keying will get you 90% of the way there. Set your keys up, then go back to "what's always in the WHERE clause?" and you're 99% of the way there.
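As a sketch, here is what that keying might look like for the contact/job tables used later in this post (constraint names are made up; note that some engines, such as SQL Server, do not automatically build an index on a foreign key column, so you may still want an explicit one there):

```sql
-- Primary keys on both tables, plus a foreign key tying jobs to contacts.
ALTER TABLE contact ADD CONSTRAINT pk_contact PRIMARY KEY (id);
ALTER TABLE job     ADD CONSTRAINT pk_job     PRIMARY KEY (id);
ALTER TABLE job     ADD CONSTRAINT fk_job_contact
    FOREIGN KEY (contact_id) REFERENCES contact (id);
```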
Covering indexes are mid-advanced juju magic, and are especially awesome for data-driven apps. A covering index simply means that your query only uses data from an index and thus never has to go to the actual table. For example, you might create a covering index on your Users table that includes Username, Password, and IsActive. It seems like the index would only need to cover the Username as that's the real key, but if you include all three then a simple login-validation query wouldn't need to go to the actual table, it could just use the index. Cool, huh? But, as I said, they are semi-advanced and should almost never be built into an app from Day One. They are an endgame optimization for when things start to get too popular and bog down.
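If you're on SQL Server 2005 or later, a covering index like the Users example can be built with the INCLUDE clause, which carries the extra columns at the leaf level without making them part of the key (the index name and `@`-parameters here are just illustrative):

```sql
-- Keyed on Username; Password and IsActive ride along so a login
-- check can be answered entirely from the index.
CREATE NONCLUSTERED INDEX ix_users_login
    ON Users (Username)
    INCLUDE (Password, IsActive);

-- This query never has to touch the Users table itself:
SELECT IsActive
FROM Users
WHERE Username = @username AND Password = @password;
```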
Learn to use and love "Show Estimated Execution Plan" in the MS SQL Query Analyzer. You'll learn to tell the difference between the types of index operations (Seeks versus Scans, etc.), and it'll help you grok how to index better. Similarly, the Index Tuning Wizard is great for getting started. You'll eventually learn that you can't always trust it, but it's not a bad place to start.
(continued from another comment; the following comments were made in response to my questions and the following sample SQL:)
FROM contact c INNER JOIN job j ON ( c.is_deleted = 0 AND c.id = j.contact_id AND j.is_deleted = 0 AND j.contact_role_id = #ContactRoles.CEO# )
Most engines will use only one index per join, so multi-column indexes are a must. In your example, if you had one index on is_deleted and another on contact_role_id, the optimizer is going to choose whichever one it thinks is better and use only it. You'd probably be best off having a single index on (c.id, c.is_deleted) and another on (j.contact_id, j.is_deleted, j.contact_role_id). But, again, different optimizers are going to have different rules for how they use indexes.
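In DDL, the two composite indexes Rick suggests would look something like this (the index names are made up):

```sql
-- One multi-column index per side of the join, covering every
-- column that join touches on that table.
CREATE INDEX ix_contact_join ON contact (id, is_deleted);
CREATE INDEX ix_job_join     ON job (contact_id, is_deleted, contact_role_id);
```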
- Yes, put as much of the filtering as you can in ON clauses. Not only does it put the conditions where they are most relevant, but in some engines you'll get orders of magnitude better performance. The DB2/400 optimizer is so dumb (how dumb is it?) that if you put the conditions in the WHERE instead of the ON it will do the joins first, no matter how big the tables, and then only apply the conditions at the end. For extremely large tables, this is a nightmare.
1a. I've gotten in the habit of ordering the conditions in the ON so that the foreign key parts are always last and the conditionals are always first. Again, I find that this helps the DB2/400 optimizer figure out that it needs to filter the tables before it joins them. In your specific example, you'd just move the (c.id = j.contact_id) to the end of the list.
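Applying that habit to the sample query, the filtering conditions come first and the foreign-key comparison moves to the end of the ON list:

```sql
FROM contact c
INNER JOIN job j ON (
    c.is_deleted = 0
    AND j.is_deleted = 0
    AND j.contact_role_id = #ContactRoles.CEO#
    AND c.id = j.contact_id  -- foreign-key part last
)
```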
1b. You might want to also play around with the order of the tables in the join. I find that some engines will execute a query faster if you order your tables from smallest to largest, while others are the exact opposite. Yeah, it seems like it shouldn't matter, but sometimes it does. (Of course, if you are talking about a lot of Outer joins, you may not have much of a chance for reordering.)
Some optimizers really want the ON or WHERE conditions to be in the order that the index is in. But some perform better as I outlined in 1a. That's really just something you're going to have to fool around with and see. (And no, the dream of having one query work perfectly on multiple engines is really just a dream.)
It's probably not a bad idea to develop the habit of having the index columns in the order of most to least selective. But, again, I've seen engines where it didn't matter in the slightest and all of the DBAs were in the habit of putting them in the same order as they were in the table.
This is some great information. I still have a ton of questions, but Rick gave me a lot to think about. I want to get more comfortable with this idea of database "selectivity." Looks like that's my next thing to research. Thanks Rick!
Selectivity? I've always heard what you're describing being called Cardinality.
Great post, thanks Ben and Rick! One thing I remember being taught is to put your most restrictive conditionals first, e.g. in your example I'd maybe put
j.contact_role_id = #ContactRoles.CEO#
as the first clause
I believe that is true; I think that is what Rick is talking about when he refers to "Selectivity". I need to get a better handle on what it means to be selective. I understand the concept in a general way, but I need to dig into the actual meaning much more.
Very interesting. Nice post
It all comes down to which book you learned it from.
Cardinality, in terms of databases, can mean a few things. Not only is it used for counting distinct values, but also for counting total rows, and also how many relationships a key constraint can have. (That is, "one to one", "one to many", or "many to many".)
I use Selectivity instead of Cardinality to avoid the ambiguity.
I tried to take this concept of Selectivity and Cardinality and run with it:
How does the grouping of table-specific conditions come into play? Should all conditions for a given table be grouped in order to better leverage indexing?
Check out the O'Reilly book SQL Cookbook. It has some great examples of query situations that, at least for me, have paid for the book many times over.
I really like that they give examples in all the major DB types, too.
I think we might have that in our company library. I'll double check, thanks.
Just a quick comment to say "thanks" for posting this -- very helpful!