Lately, I have been looking into SQL query optimization. We recently installed SeeFusion on our server and I can see where my long running tasks are causing the server to slow down. Turns out, not suprisingly, that the slow pages are very query-intense. Granted, a lot of these pages were pages years ago before I knew what nice code looked like, but the good news it, lots of room for optimization and clean up.
To start out, I thought it would be good to look up the order in which SQL directives get executed as this will change the way I can optimize:
- FROM clause
- WHERE clause
- GROUP BY clause
- HAVING clause
- SELECT clause
- ORDER BY clause
This order holds some very interesting pros/cons:
Since this clause executes first, it is our first opportunity to narrow down possible record set sizes. This is why I put as many of my ON rules (for joins) as possible in this area as opposed to in the WHERE clause:
- contact c
- INNER JOIN
- display_status d
- c.display_status_id = d.id
- d.is_active = 1
- d.is_viewable = 1
This way, by the time we get to the WHERE clause, we will have already excluded rows where is_active and is_viewable do not equal 1.
With the WHERE clause coming second, it becomes obvious why so many people get confused as to why their SELECT columns are not referencable in the WHERE clause. If you create a column in the SELECT directive:
- ( 'foo' ) AS bar
It will not be available in the WHERE clause because the SELECT clause has not even been executed at the time the WHERE clause is being run.
ORDER BY Clause
It might confuse people that their calculated SELECT columns (see above) are not available in the WHERE clause, but they ARE available in the ORDER BY clause, but this makes perfect sense. Because the SELECT clause executed right before hand, everything from the SELECT should be available at the time of ORDER BY execution.
I am sure there are other implications based on the SQL clause order of operations, but these are the most obvious to me and can help people really figure out where to tweak their code.
Looking For A New Job?
- Back-End Engineer - Node.js & Mongo at Interface Foundry
- Senior ColdFusion Web Developer at HD Web Studio
- In House ColdFusion at Marketing Holdings
I suppose an actual RDBMS (MySQL, Oracle, etc.) will perform optimizations out of this order. For example, MySQL probably looks for things like "WHERE a.b=c.d" and moves that filtering up into the FROM processing.
Good job on this entry, Ben. I actually have it bookmarked.
I was looking at it again as I prepared the schema for a new database and started to wonder about optimizing the "where" clause of an inner join.
If you read most tutorials on inner joins, they'll use some syntax like "select * FROM flintstones, bedrock WHERE flintstones.name = bedrock.resident and bedrock.gender = 'male'".
If we imagine that these are both large tables with thousands of members each, I'd imagine that the inner join can be optimized for faster execution and less memory usage since we only want data from "flintstones" when it corresponds with a record from "bedrock" that meets our criteria.
But what's the best way to optimize that query for maximum performance / minimum resource usage?
I have two ideas.
1: Put the "flintstones.name = bedrock.resident" at the end of the WHERE clause, so the cross-matching isn't done until the result set has already been narrowed by the "bedrock.gender = 'male'" part. Problem is that I haven't been able to find anything on the execution order of subcomponents of a WHERE clause to know if this might make a difference.
2: Reformat the query to "SELECT * from (SELECT * from bedrock WHERE gender = 'male') as bedrock, flintstones WHERE flintstones.name = bedrock.resident". It seems that this would constrain the "bedrock" dataset prior to the join, resulting in a faster query.
That's an interesting thought to make the intermediary result set. I wonder how that interacts with table indexing. Unfortunately, I don't really know enough about database execution to think at this level. What I have found is that good indexes almost make more of a difference that some of the low-level optimizations.
I'm still trying to find the sweet spot :)
Thanks for the info, i've bookmarked this page.
I appreciate this page. This is exactly the information I was looking for when I googled this afternoon. I, too, appreciate it.
@Greg -- got a kick out of this, Ben, having just started tweeting recently...anyway...
I would write the join as follows, and especially after looking at this order of operations post...
FROM [Flintstones] f
JOIN [Bedrock] b
ON f.[Name] = b.[Resident]
AND b.[Gender] = 'male'
...I'd explicitly list columns, have singular table names and prefix everything with schema, and key the gender, but beyond that the join should as shown should be fastest and ultimately look and feel the "cleanest" once you start writing this way.
I've been structuring joins in much this manner, but was about to write one this afternoon and wanted to double check the order of operation with an eye to the performance implication of the particular statement I was working, hence this search. I have opted to include anything relative among the entities being joined and typically isolated key conditional "business" checks in the where clause for readability. Now, in a couple of key procs, I will be moving those checks into the first join.
The lightbulb came on this afternoon, thanks again, Ben. I thank the Lord I have been as close to the mark as I've been already. It does make sense when you visualize SQL Server "building out" the recordset...the innermost join should have the biggest performance impact. So for industrial strength code, this should pay nice dividends.
Thank you again, I hope someone finds this comment interesting if not helpful.
Peace to all.
Glad to help out my man. From what I have been told, the SQL server should do some of this optimization for you after it parses the SQL. But, I like to be as explicit as possible in my thinking.
sir I would like to know what is good to study? sql, web designing or programming there in new york? cuz I have a plan in staying there when I graduate in college!
These are three overlapping skills - ideally you should know *something* about design; but also, web programming and SQL go hand in hand.
Thanks for the hint, certainly it helps
This post was really helpful. It answered quite a few of my questions regarding query optimization. And also cleared many concepts that why they were so.
Glad to help out guys. I thnk understanding the order really help you think more effectively about SQL queries.
It is a very simple and nicely explained article. Was really useful.
Way to go BEN !!!
Thanks my man.
hi, this is exactly what i needed, nicely written
I was executing a query and puzzled with the execution of the query. Then I googled for the order of SQL directives and found you post.
It is really good.
Glad you guys enjoyed this post. I hope it was helpful.
Great post! Found it by googling "mysql order of execution". It confirmed my assumption that the SELECT clause would execute one of the last, and now I'm absolutely clueless about the following:
I have a rather complex subquery in the SELECT portion, expecting that the main query will return only a limited number of results. BUT the funny thing is, even if there are 0 results returned, adding this subquery into the SELECT makes the script run 1000slower. How's that possible? If SELECT is executed after the WHERE and WHERE returned 0 results, the SELECT should not even execute, right? Or am I missing someting? ...and if it does not execute why does it slow down the query? Any ideas?
If you had DISTINCT in your select query where would that go in the list, in my guess this should go above Order By clause?
What do you say?
Hmm, that's a good question. Perhaps the SQL engine is reworking the execution flow for some sort of optimization? I don't know how SQL engines determine their most optimized work flows. What kind of database are you using?
I believe that DISTINCT executes as part of the SELECT statement. But, I'm not entire sure on that - it's just a guess.
Hi Ben, thanks for your comment. It's mysql database with MyISAM tables. Don't know if it matters...
I ended up just re-doing the query altogether. Works now, that's what counts at the end!
Cool - thanks for this info, handy to know.
Just thought that something might be of interest.
I recently attended an Oracle DBA course, and the instructor mentioned that in the newer versions of Oracle, it shouldnt matter how you throw your SQL statements, the database should tune itself to perform the statement in the most effecient way.
SQL tuning in this case was almost always at the database end, rather than on the application side (where the SQL was written).
I havent looked that much into SQL server or mysql to see if they have a similar type of process that the database performs.
On a completely unrelated note, its a real shame that Oracle is becoming more and more the big bad wolf of the Software/I.T. world.
It's interesting that you mention that - I was not aware that there were different types of MySQL out there until this weekend. Bob Silverberg said there's a MyISAM and an InnoDB version and that you should pretty much always use InnoDB. Of course, he was talking about this in the context of ColdFusion's new ORM settings - I'm not sure how that applies to anything else.
But, as far as flavor-specific caveats, I am not in the know about this.
I've definitely heard that kind of idea from a number of people - that the Database will optimize the execution regardless of how you order your operations. I really wish I knew more about how databases actually work.
As far as the Oracle thing, people freak out any time they have to pay for software.... the irony of which is that they then turn around a charge their clients :D
I was searching for info on how SQL Server engine prioritises Logical operations, AND + OR, as in BODMAS for arithmetic operations. (Brackets, Order(powers), Division, Multiplication,Addition,Subtraction),
If I write A AND B OR C it appears to evaluate A AND B then ORs the result with C. If I write A AND (B OR C) that forces the OR to evaluate first then AND with A. However if the number of terms is extended greater than 3 A,B,C,D etc it becomes a far more unpredictable scenario. Take A OR B AND C OR D, I expected that to execute B AND C ORed with A then ORed with D but it only seems to give me that result if I explictly insert brackets like
A OR (B AND C) OR D. Can anyone add any experience of this to the discussion.
Further if you go in and out of the Query builder the expression come back out as
(A OR B) AND (A OR C) OR D ????? what is that all about. I am a didital electronics engineer by degree and this is the reverse principle of the distributive laws and extends rather than simplifies the expression.
Nice post, and obviously a very useful one to a lot of people - posted in June 5, 2006 and people (like myself) are still commenting! :)
I'm going to be doing a lot of optimization over the coming months, and this information is going to prove useful.
This post is really helpful for me. many time i read your blog and found lots of things which is to be learn and helpful in my work.
i need your favour, Would you please tell me the Query Execution Architecture in SQL Server.
Well, Thanks in Advance...:)
@All, hope this helps:
The optimizer will execute the query according to the lowest cost execution path regardless of the way you write the query so it doesn't really matter about this order of precedence...
First of all, thank you for this post. It is of great help to me.
But I found a wired thing here. As you said, SELECT columns are not referencable in the WHERE clause. However, they are referencable in the GROUP BY clause in my test. Like I renamed some columns and refer to them in group by clause. It did work.
My question here is: is this related to some sort of optimization? Does select have priority to group by, or the columns in group by clause had been executed twice, first in group by and then in select?
Thanks a lot
really a good information...:)
thanks a tone...
but i am still confused in the execution order between from and where??
i heard from someone before reading this is that execution start from where clause --> from clause.
now in a dual state of mind..:(
please help me ..
Great post, thanks for the info! Googled "SQL order of operations" and this post from years ago was still #1.
Think about the query this way. How can you know the WHERE data that qualifies the statement without knowing the location FROM which the data is being retrieved?
You know you want an orange, but you have to go to the store where it comes FROM first :)
I know its a very basic explanation but according to the SQL "order of operations" that is how it works.
Of course there are many things now in DBMs that optimize the query for you so the order listed on this page may or may not be used exactly as stated anymore.
Thanks for this post.
Select * from Dept where DeptNo in (Select DeptNo from Emp)
I would like know how the above query will execute.
As of my knowledge, The above query will execute in the following Order:
1. Main Query From Clause
2. main Query Where Clause
3. Sub Query From Clause
4. Sub Query Where Clause
5. Sub Query Select
6. Main Query Select
Please Let me know my thought process is correct or not.
Thanks in advance.
Always kept forgetting the order. But not anymore:
@Sekhar,that was amazing . But can you tell us general form where there will more than one sub query. and how preference is decided ?