Random Thought: The Database Bottleneck And Elegance
Posted November 29, 2008 at 9:39 AM by Ben Nadel
I forget exactly what I was doing, but last week, I had a random thought about application architecture and the database. From everything that I have been told, people say that the database is the bottleneck in a data-driven application. If that is the case, (which I would definitely agree with) then shouldn't the data layer be the most elegant, graceful, and efficient part of the application? I know that I personally carry a lot of shame that I don't know enough about proper table indexing and triggers and stored procedures and things of that nature. And I know that I am supposed to be architecting applications in a domain-driven mind set, not a data-driven one; but, when I stop to think that the database is most often the root-cause of poor performance, it makes me realize that my data layer needs a swift kick in its big ass.
I think the first thing that I need to start looking into immediately is database indexing. I know so little about that. Well, I do know enough to know that I don't know nearly enough. For instance, my indexing strategies will directly affect the way I need to write queries based on commonly used columns and multi-column indexes and left-hand indexing, etc. (sorry if the terminology is incorrect). Just the thought of this makes me nervous - how do I know what columns will be used often until I write my SQL. Then does my indexing need to reflect my SQL? Or does my SQL need to be refactored to reflect the indexing choices that I've made after I get a better feeling over what data will be used? Do primary key indexes get created automatically in all database systems? Can you index a tiny-int column? Can I index a View?
Indexing is a huge topic; clearly, I can't just make up these rules on my own. I need some books. Can anyone recommend any good books that discuss indexing strategies (Rick, I'm looking at you buddy!)?
Don't know about books because i myself am not a sql guru, but there are a lot of articles about it around in internet. Some time ago i took some time to search, try, etc and made a component with some functions that may be of help when it come to SQL performance, indexing decisions, etc. It's cfSQLMaster and is available on http://1smartsolution.com . As i see it got 350 downloads so far so i hope there was at least 10% of that number in people who found it useful. Maybe you will find something useful for yourself there too.
Short short short version:
1. To keep things simple, you should almost never have to rewrite queries based on how your indexes are set up. Almost. There are edge cases where you *can* be forced to, but those are endgame cases. Most of the time it is the other way around: write your queries, run them, then see if you need to tweak the indexes.
2. Until you get to very large databases with millions of rows, ignore anyone who tells you that too many indexes will slow down Inserts and Updates. That was true 20 years ago, but is not an issue now for most small- to mid-sized databases.
3. Most indexes are dead simple. "Does this column show up in my WHERE clause more than a couple times?". Yes? Index it.
3a. The counter-argument for the last one is "will an index here actually do anything for me?" and has to do with selectivity.
Selectivity in a nutshell: Your data is a card catalog in a library. (Remember those things?) Or a filing cabinet full of manila folders. An index is the tabs on the cards or folders that helps you find what you are looking for. Selectivity is the ratio of unique keys to the total number of rows.
Say you have 100 rows in your table and two columns that you are thinking about indexing, AA and BB. If AA has 2 unique values, its selectivity is 2/100. Similarly, if BB has 20 unique values, its selectivity is 20/100. Go back to the card catalog example: you have either 2 tabbed divider cards or 20 tabbed divider cards. Which is going to help you find data faster? The 20, right? That column has higher selectivity, and is thus going to get you to your data faster. The 2-tab index is almost useless. (And, in some cases, can actually make your query run slower.)
4. In most modern DBMSes, you almost don't need to index as long as you have Primary *and* Foreign keys set up. Joins are where indexes really shine, so proper keying will get you 90% of the way there. Set them up then go back to your "what's always in the WHERE clause?" and you're 99% of the way there.
5. Covering indexes are mid-advanced juju magic, and are especially awesome for data-driven apps. A covering index simply means that your query only uses data from an index and thus never has to go to the actual table. For example, you might create a covering index on your Users table that includes Username, Password, and IsActive. It seems like the index would only need to cover the Username as that's the real key, but if you include all three then a simple login-validation query wouldn't need to go to the actual table, it could just use the index. Cool, huh? But, as I said, they are semi-advanced and should almost never be built into an app from Day One. They are an endgame optimization for when things start to get too popular and bog down.
6. Learn to use and love the "Show Estimated Execution Plan" in the MS SQL query analyzer. You'll learn how to tell the difference between the types of index operations (Seeks versus Scans, etc) and it'll help you grok how to better index. Similarly, the Index Tuning Wizard is great for helping you get started. You'll eventually learn that you can't always trust it, but it's not a bad place to start.
Our dba lets us build our schema (tables,views,stored procedures, indexes). After I give my best guess, I have him do an explain plan (Oracle, SQL Server) on my bad sql times to see what can be done to decrease those times. I look at all of my architecture to see those weakest links, because performance is critical for the user.
Oh, and I should directly answer your questions, eh?
First, yes, on some systems you can index views. However, it generally requires a specific table setup and fairly rigid constraints. And, in my experience, it's almost always a sign that you need to go back to the drawing board. I'm not saying that there's never a case for an indexed view, just that you should be really, really sure that it is exactly what you need because it probably isn't.
Second, stored procedures are holy-war territory. Honest. For every person like me that wants to stick a #2 pencil through the eyeballs of anyone even considering adding sprocs to a db, there's a monocular jihadist standing behind them with a fountain pen and a CREATE PROC statement.
Third, I don't have any good books to recommend, sorry. See, the problem is that while the theory of indexes is fairly cross-platform (most of which I outlined above), the reality is that each platform is going to do it a little differently. And, in some cases, the same platform might even do it several different ways. (DB2, for example, has at least 3 different types of indexes with drastically different performance characteristics.)
But ... if you can grok the concepts of Keys, Selectivity, and Covering then you really do have almost everything you need to know. Beyond that, it's pretty much down to sitting at a prompt and running tests until you see what works. Write a query, run it 3 times to see how long it takes, then add an index and run it another 3 times to see if it is faster.
I know that sounds horrifically old-school, but trying to outguess the query optimizer is the path to madness -- do what makes sense to you, and it will probably make sense to the DBMS. Try to get witty and you and the optimizer will probably end up chasing each other around the mulberry bush.
>The 2-tab index is almost useless. (And, in some cases, can actually make your query run slower.)
I thought that a higher "selectivity" of a value in a column will mean that the query optimiser will ignore and given index (on that column) because there are too many values to be unique enough, so the database engine will run a full table-scan. This will be slow.
So if I had a bit field "isActive" I would never put an index on it as it will have (most of the time) only 2 unique values.
"Username" however is a perfect candidate as it's selectivity is such that you really have unique values...thus a perfect candidate for an index (as of course you will often use it in a WHERE clause for authentication).
I've also learned that you should order you column filters in your WHERE statement by what you think will be most unique subset first.
So if you're searching for all Users who are "active" but who's lastname starts with "s" you might have:
WHERE Users.LastName LIKE 's%'
AND IsActive = 1
Note that you filter on "LastName" before filtering on "IsActive" because that will filter out a higher initial result-set, whilst still using the index on LastName.
Also interesting to note that if you had a wildcard in the LastName first, (e.g. '%s') then the index will be ignored.
Yep. Selectivity is a double-edged sword. Most modern DBMSes will ignore indexes with extremely low selectivity, but there are still some around that won't. (In my personal experience, DB2/400 doesn't and it drives me crazy when people add indexes willy-nilly without knowing this.) But, at the same time, indexing a unique value and then *not* clustering it can be almost as bad depending on the DBMS and the table size.
"I've also learned that you should order you column filters in your WHERE statement by what you think will be most unique subset first."
This is a good rule of thumb, and will generally serve you well, but ...
It's also not always true. Some older DBMSes (namely SQL Server 7 and before, IIRC, as well as mySQL4, I think?) won't use an index unless the columns in the Where clause are in the same order as the columns in the index, or the same order as the columns are in the table. Personally, I can never remember which DBMSes are smart enough to figure it out so I try to run it a few different ways to be sure.
As for LIKE and indexes, yeah, they don't get along. If you find yourself doing lots of substring lookups, however, many modern DBMSed *will* allow you to index a substring of a column. You can then change your like to a LEFT or RIGHT or SUBSTRING, make an index to match, and have a margarita on the beach. But generally this is a bad idea and is really only useful in very specific edge cases.
One of the most useful things for tuning performance are Materialised Views (oracle speak) which SQL server added in the last release from memory.
Basically it lets you make a view into a normal table which is dynamically updated by the database either on the fly, or on demand.
You can also create indexes on the materialised views.
Usually it all comes back to schema design, I have seen many projects grind to a slow crawl performance wise from silly untested approaches pushed down from management, like lets make this complex system generic or lets build the capture system first and worry about reporting later.
Or systems which have been designed initially for data capture with out thought for how the data will be used.
Abstractly, an index is used to keep the data stored in fashion which is pre-cooked for performant use, therefore, if you design the database with that in mind, your schema will complement your indexes and it should fly.
Although I would never argue against doing one's best to ensure that appropriate indexes are in place, I would suggest that in many cases ensuring that you've implemented an appropriate caching strategy for your data is equally important. I would even go as far as suggesting that in many cases proper caching is going to have a greater effect on performance than tweaking indexes.
You don't mention which DB you are using but if it is MySQL then I can thoroughly recommend O'Reilly's "High Performance MySQL". It talks in depth about indexing and other important optimisations.
In fact, even if you are not using MySQL I would recommend it as many of the principles apply to all databases. I read it in preparation for migrating from MS SQL to MySQL at some point in the future. However with what I learnt was able to transform the performance of our existing MS SQL database.
I'd agree with Bob 100%. While you should strive to make the DB as efficient as you can, your caching strategy is probably even more important. If you look at any of the large sites operating on the internet (think Flickr, Facebook, etc.), they all have much larger caching farms than they do DB farms.
As food for thought, I'll have my slides from MAX on advanced caching strategies online tomorrow.
I agree that caching is important. One of the times when I see the largest SQL run times is when I build reporting sections for a given piece of software. Sometimes, my reports have HUGE joins and can bring in maybe 10 or 12 tables. In those cases, I don't think there is a good caching mechanism that could be in place - except maybe to have compiled reports - but even then, I usually have user-selected report criteria such as time-span or state.
Thank's very much for the tips. That was brilliant. If I may dig your brain a bit more though - a lot of my performance issues comes from my JOINs in queries that pull from a lot of tables (generally in reporting situations). When I have JOINs, I tend to move my "where" clauses up into JOINs so I can start filtering the data as soon as possible to reduce the size of intermediary result sets.... at least that is my theory.
In order to help minimize the possibility of forgetting crucial "active" columns, I usually put those on either side of the JOIN as a matter of habit (like I said - so I don't forget them):
. . . . contact c
. . . . job j
. . . . (
. . . . . . . . . . . . c.is_deleted = 0
. . . . . . . . AND
. . . . . . . . . . . . c.id = j.contact_id
. . . . . . . . AND
. . . . . . . . . . . . j.is_deleted = 0
. . . . . . . . AND
. . . . . . . . . . . . j.contact_role_id = #ContactRoles.CEO#
. . . . )
In this case, I probably have an index on the "contact_role_id" column since it defines large groups of people, but not on the "is_deleted" columns as they are not very selective (or am I reversed on that?).
Do you think this is a bad move?
1. Should I be putting so much filtering in my ON clauses?
2. If my is_deleted column is just a tinyint, should it come last in the list of filters, wich "contact_role_id" first?
And what about indexes that contain multiple columns? Right now, I use a column-per-index strategy, but I see in the MySQL GUI (Navicat), I can add multiple columns per index and even order those columns within the index. Can you offer any insight on that?
It seems to suggest that I should pick a standard of ordering my columns and stick to that.
Here's where we start getting into the vagaries of the different database engines.
Most engines will use only one index per join, so multi-column indexes are a must. In your example, if you had one index on is_deleted and another on contact_role_id, the optimizer is going to choose whichever is the better one and use it. You'd probably be best off to have a single index on (c.id, c.is_deleted) and another on (j.contact_id, j.is_deleted, j.contact_role_id). But, again, different optimizers are going to have different rules for how they use indexes.
To answer your specific questions:
1. Yes, put as much of the filtering as you can in ON clauses. Not only does it put the conditions where they are most relevant, but in some engines you'll get orders of magnitude better performance. The DB2/400 optimizer is so dumb (how dumb is it?) that if you put the conditions in the WHERE instead of the ON it will do the joins first, no matter how big the tables, and then only apply the conditions at the end. For extremely large tables, this is a nightmare.
1a. I've gotten in the habit of ordering the conditions in the ON so that the foreign key parts are always last and the conditionals are always first. Again, I find that this helps the DB2/400 optimizer figure out that it needs to filter the tables before it joins them. In your specific example, you'd just move the (c.id = j.contact_id) to the end of the list.
1b. You might want to also play around with the order of the tables in the join. I find that some engines will execute a query faster if you order your tables from smallest to largest, while others are the exact opposite. Yeah, it seems like it shouldn't matter, but sometimes it does. (Of course, if you are talking about a lot of Outer joins, you may not have much of a chance for reordering.)
2. Some optimizers really want the ON or WHERE conditions to be in the order that the index is in. But some perform better as I outlined in 1a. That's really just something you're going to have to fool around with and see. (And no, the dream of having one query work perfectly on multiple engines is really just a dream.)
3. It's probably not a bad idea to develop the habit of having the index columns in the order of most to least selective. But, again, I've seen engines where it didn't matter in the slightest and all of the DBAs were in the habit of putting them in the same order as they were in the table.
Oh, and I have to say that I completely disagree with the points above about developing a caching strategy first. No way.
Even if you are building a million-user application, caching should always be the dead last thing on your mind. Get it running first, then get your database out from under itself, then worry about caching.
Worrying about caching first is premature optimization.
I really like the idea of putting the foreign keys last in the ON clauses; I think that will go a long way to helping me visualize what the indexes should look like cause I can think in terms of per-table filtering.
As always, your excellent guidance is most appreciated!