Rick Osborne On Database Indexing And SQL Best Practices

Posted December 8, 2008 at 9:01 AM by Ben Nadel

Tags: SQL

Over Thanksgiving, I wrote a blog post on the importance of database performance in data-driven applications. I had stated that since the database calls were most often the bottleneck in an application, it is absolutely essential that I, as a software developer, have really good SQL skills. I vented that I don't know nearly enough about database indexing and how SQL statements should be prepared in relation to index structures.

In the comments to my blog entry, Rick Osborne gave me some really awesome insight and tips on best practices. Rick always has great stuff to say and I wanted to take a few minutes just to summarize his comments here in case anyone missed them in the previous post.

. . . .

1. To keep things simple, you should almost never have to rewrite queries based on how your indexes are set up. Almost. There are edge cases where you *can* be forced to, but those are endgame cases. Most of the time it is the other way around: write your queries, run them, then see if you need to tweak the indexes.

2. Until you get to very large databases with millions of rows, ignore anyone who tells you that too many indexes will slow down Inserts and Updates. That was true 20 years ago, but is not an issue now for most small- to mid-sized databases.

3. Most indexes are dead simple. "Does this column show up in my WHERE clause more than a couple times?". Yes? Index it.

3a. The counter-argument for the last one is "will an index here actually do anything for me?" and has to do with selectivity.

Selectivity in a nutshell: Your data is a card catalog in a library. (Remember those things?) Or a filing cabinet full of manila folders. An index is the tabs on the cards or folders that helps you find what you are looking for. Selectivity is the ratio of unique keys to the total number of rows.

Say you have 100 rows in your table and two columns that you are thinking about indexing, AA and BB. If AA has 2 unique values, its selectivity is 2/100. Similarly, if BB has 20 unique values, its selectivity is 20/100. Go back to the card catalog example: you have either 2 tabbed divider cards or 20 tabbed divider cards. Which is going to help you find data faster? The 20, right? That column has higher selectivity, and is thus going to get you to your data faster. The 2-tab index is almost useless. (And, in some cases, can actually make your query run slower.)

4. In most modern DBMSes, you almost don't need to index as long as you have Primary *and* Foreign keys set up. Joins are where indexes really shine, so proper keying will get you 90% of the way there. Set them up then go back to your "what's always in the WHERE clause?" and you're 99% of the way there.

5. Covering indexes are mid-advanced juju magic, and are especially awesome for data-driven apps. A covering index simply means that your query only uses data from an index and thus never has to go to the actual table. For example, you might create a covering index on your Users table that includes Username, Password, and IsActive. It seems like the index would only need to cover the Username as that's the real key, but if you include all three then a simple login-validation query wouldn't need to go to the actual table, it could just use the index. Cool, huh? But, as I said, they are semi-advanced and should almost never be built into an app from Day One. They are an endgame optimization for when things start to get too popular and bog down.

6. Learn to use and love the "Show Estimated Execution Plan" in the MS SQL query analyzer. You'll learn how to tell the difference between the types of index operations (Seeks versus Scans, etc) and it'll help you grok how to better index. Similarly, the Index Tuning Wizard is great for helping you get started. You'll eventually learn that you can't always trust it, but it's not a bad place to start.

(continued from another comment; the following comments were made in response to my questions and the following sample SQL:)

  • FROM
  • contact c
  • INNER JOIN
  • job j
  • ON
  • (
  • c.is_deleted = 0
  • AND
  • c.id = j.contact_id
  • AND
  • j.is_deleted = 0
  • AND
  • j.contact_role_id = #ContactRoles.CEO#
  • )

Most engines will use only one index per join, so multi-column indexes are a must. In your example, if you had one index on is_deleted and another on contact_role_id, the optimizer is going to choose whichever is the better one and use it. You'd probably be best off to have a single index on (c.id, c.is_deleted) and another on (j.contact_id, j.is_deleted, j.contact_role_id). But, again, different optimizers are going to have different rules for how they use indexes.

1. Yes, put as much of the filtering as you can in ON clauses. Not only does it put the conditions where they are most relevant, but in some engines you'll get orders of magnitude better performance. The DB2/400 optimizer is so dumb (how dumb is it?) that if you put the conditions in the WHERE instead of the ON it will do the joins first, no matter how big the tables, and then only apply the conditions at the end. For extremely large tables, this is a nightmare.

1a. I've gotten in the habit of ordering the conditions in the ON so that the foreign key parts are always last and the conditionals are always first. Again, I find that this helps the DB2/400 optimizer figure out that it needs to filter the tables before it joins them. In your specific example, you'd just move the (c.id = j.contact_id) to the end of the list.

1b. You might want to also play around with the order of the tables in the join. I find that some engines will execute a query faster if you order your tables from smallest to largest, while others are the exact opposite. Yeah, it seems like it shouldn't matter, but sometimes it does. (Of course, if you are talking about a lot of Outer joins, you may not have much of a chance for reordering.)

2. Some optimizers really want the ON or WHERE conditions to be in the order that the index is in. But some perform better as I outlined in 1a. That's really just something you're going to have to fool around with and see. (And no, the dream of having one query work perfectly on multiple engines is really just a dream.)

3. It's probably not a bad idea to develop the habit of having the index columns in the order of most to least selective. But, again, I've seen engines where it didn't matter in the slightest and all of the DBAs were in the habit of putting them in the same order as they were in the table.

. . . .

This is some great information. I still have a ton of questions, but Rick gave me a lot to think about. I want to get more comfortable with this idea of database "selectivity." Looks like that's my next thing to research. Thanks Rick!




Reader Comments

Dec 8, 2008 at 10:37 AM // reply »
33 Comments

Selectivity? I've always heard what you're describing being called Cardinality.

George.


Dec 8, 2008 at 11:08 AM // reply »
8 Comments

Great post, thanks BenRick! One thing I remember being taught is to put your most restrictive conditionals first. e.g. in your example I'd maybe put
j.contact_role_id = #ContactRoles.CEO#
as the first clause


Dec 8, 2008 at 12:17 PM // reply »
10,638 Comments

@Duncan,

I believe that is true; I think that is what Rick is talking about when it refers to "Selectivity". I need to get a little bit better handle on what it means to be selective. I understand the concept in a general way, but I need to get into the actual meaning much more.


Dec 8, 2008 at 1:57 PM // reply »
15 Comments

Very interesting. Nice post


Dec 8, 2008 at 2:04 PM // reply »
153 Comments

It all comes down to which book you learned it from.

Cardinality, in terms of databases, can mean a few things. Not only is it used for counting distinct values, but also for counting total rows, and also how many relationships a key constraint can have. (That is, "one to one", "one to many", or "many to many".)

I use Selectivity instead of Cardinality to avoid the ambiguity.


Dec 8, 2008 at 3:19 PM // reply »
10,638 Comments

@Rick,

I tried to take this concept of Selectivity and Cardinality and run with it:

http://www.bennadel.com/index.cfm?dax=blog:1424.view

How does the grouping of table-specific conditions come into play? Should all conditions for a given table be grouped in order to better leverage indexing?


Dec 11, 2008 at 9:30 PM // reply »
34 Comments

Ben,

Check out the O'Reilly book SQL Cookbook. That has some great examples of query situations. That at least for me have paid for the book many times over.

I really like that they give examples in all the major Db types too.


Dec 11, 2008 at 9:56 PM // reply »
10,638 Comments

@Kevin,

I think we might have that in our company library. I'll double check, thanks.


CM
Dec 24, 2008 at 4:58 PM // reply »
1 Comments

Just a quick comment to say "thanks" for posting this -- very helpful!


Post A Comment

Comment Etiquette: Please do not post spam. Please keep the comments on-topic. Please do not post unrelated questions or large chunks of code. And, above all, please be nice to each other - we're trying to have a good conversation here.

Please review the following issues:

Author Name:


Author Email:

Author Website:

Comment:

Supported HTML tags for formatting: <strong>bold</strong>   <em>italic</em>   <code>code</code>







  • Help Wanted - Find Your Next ColdFusion Job
InVision App - Prototyping Made Beautiful With Prototyping Tools Ben Nadel's Company - Epicenter Consulting Recent Blog Comments
Feb 3, 2012 at 10:49 PM
How I Got Node.js Running On A Linux Micro Instance Using Amazon EC2
Wow this was really helpful! Only thing I would add is you need to update your .bash_profile after you edit the secure_path. This is what I did: $ . ~/.bash_profile Otherwise, NPM won't be found. ... read »
Feb 3, 2012 at 10:14 PM
Pushing Base64-Encoded Images Over HTML5 WebSockets With Pusher And ColdFusion
@Ben, Just wanted to let you know that pusher are soon to start limiting sizes on messages. This was the detail that came through in the Feb dispatch: "However, we will soon be limiting the s ... read »
Feb 3, 2012 at 5:05 PM
Regular Expressions Make CSV Parsing In ColdFusion So Much Easier (And Faster)
I tried using your RegEx in my C# program, but it was matching an extra empty-string at the end and so I would end up with an extra field that doesn't exist, so I changed it to this: (^|,)("(?: ... read »
Feb 3, 2012 at 3:47 PM
ColdFusion Supports HTTP Verbs PUT And DELETE (As Well As GET And POST)
Josh Cyr posted this on Twitter just a little bit ago. Thought it was appropriate. http://stackoverflow.com/questions/1619152/how-to-create-rest-urls-without-verbs/1619677#1619677 ... read »
Feb 3, 2012 at 2:28 PM
Changing The Execution Context Of Your Self-Executing Function Blocks In JavaScript
@Michael, You definitely make a good point (and extra points for quoting movies - I love movies). When you use a return() statement to define the object's public API, it does provide a consistent a ... read »
Feb 3, 2012 at 2:04 PM
Changing The Execution Context Of Your Self-Executing Function Blocks In JavaScript
To quote Jurassic Park: "Just because you can doesn't mean you should". I completely, utterly disagree with the thought that this is more readable. Consider the current module pattern: if ... read »
Feb 3, 2012 at 1:10 PM
REST API Design Rulebook By Mark Masse
@Jordan, Yeah, WRML was created by Mark Masse (author of the book). I also found it to be a bit convoluted. I suppose it is intended to allow the Client to be able to programmaticaly respond to cha ... read »
Feb 3, 2012 at 1:08 PM
ColdFusion Supports HTTP Verbs PUT And DELETE (As Well As GET And POST)
@Jason, To be honest, I don't have good answers for that kinds of stuff. And, to the point, that is specifically why I *really* liked the REST API Design Rulebook by Mark Masse - he just cuts throu ... read »