SQL Optimization Case Study - JOIN Clause vs. IN Clause
Posted September 4, 2007 at 6:02 PM by Ben Nadel
Just a quick note, not much to see here. I've been working on a report for a client that queried over a lot of tables. As its first step, it created a temporary, in-memory table, @users, into which I was INSERTing filtered IDs (based on user-entered search criteria). I then used that @users table in a bunch of subsequent queries so that I didn't have to re-run the filtering for each piece of calculated data.
Anyway, long story short, the report has recently started to timeout (as the data tables have started to grow quickly in size) and the report has a Request Timeout of about 5 minutes. I took a look at the SQL (which I haven't touched in a long time) to see what could be optimized. Two of the calculated columns used a SQL IN clause on the @users table; something to this effect:
- ... WHERE id IN ( SELECT u.id FROM @users u )
Since these were one-to-one type relationships (only one record per-user-id in the parent query), I changed the IN clauses to INNER JOIN clauses. After making this change in two places, the query went from timing out after 300 seconds to executing in about 15 seconds.
There's still a lot of indexing that needs to be done (still working with the senior tech guy to get approval on that), but this was a HUGE performance difference. I am sure many of you are looking at this and thinking, "IN clause - that's such a rookie mistake!" Yeah, yeah it is, but I thought I would point it out here so others can learn from my mistake. Sometimes I get concerned that the joins will take a lot of time, but as you can see above, it is the much-lesser of two evils.
What Other People Are Searching For
Sometimes you just need to use IN, specifically if there are some ids in ColdFusion that need to be part of the query. This could happen for example when you have results from search server but need to filter out them based on some SQL based logic.
When that happens one thing to check is that number of ids never exeed capability of database. Oracle 9 "in list iterator" can handle 1000 items, I haven't checked this with other databases.
If you anticipate having more than that it's relatively easy to build code that splits list to serveral parts.
Of course if you run into this kind of situation it's always good idea to really think if there is some other way - it will never be really efficient.
Yes, certainly there are cases where an IN clause just makes sense. I use them all a good deal, especially when dealing with a short list of values (usually involving a CFQueryParam with a List attribute set to True).
I just wanted to point out that in some case (especially big sub-query cases) this can be a huge bottleneck (at least it was for me in this case).
you might want to check the EXISTS clause, too. E.g.
SELECT requestID, request_title
FROM Request r
Given proper indexes, EXISTS is much faster than IN. And, considering that NOT IN is even slower than IN, NOT EXISTS can really bring dramatic performance improvements.
You still have to keep in mind that different database engines handle things differently. For example, with MSSQL (2000) it is generally better to use SELECT * in the subquery (instead of SELECT fieldname), because the query optimizer will choose the best index or combination of indexes.
It's funny - I have never used EXISTS but I have looked it up many times. There's something about it that I just can't seem to wrap my head around. Sometimes, I think about using it, so I look up how it works and when I read the definition, it seems to be not usable. From what I gather, it evaluates to TRUE if it returns any rows, but is it on a per-row basis of the parent query. Meaning, can the EXISTS clause reference values in the parent SELECT?
yes, EXISTS can reference values from the outer select.
Look at the condition of the subquery from my example:
Here r is an alias for the table from the outer select. The subquery only selects those records that have a corresponding userID in the outer select. Those records that do NOT have a matching userID are not returned.
And, you are understanding EXISTS quite correctly. EXISTS evaluates to true if a matching record is found.
And, EXISTS indeed evaluates per row. Thus, a record (from the outer select) that has a corresponding record in the sub query is returned in your result set, whereas a record, that does not have a match in the subquery, is not returned.
And, because of the matching WHERE clause in the subquery, the subquery.recordCount can only be 1 or 0. And that is pretty much like CFML: recordcount = 0 = FALSE, recordcount >0 = TRUE
So the following statements are equal in result (compared to my last example, I switched tables here... makes more sense to me ;-)):
-- Select all users that have entries in request table
SELECT u.userID, u.lastname, u.firstname
FROM User u
WHERE userID IN (SELECT userID FROM Request)
-- again, select all users that have entries in request table
SELECT u.userID, u.lastname, u.firstname
FROM User u
Plus, you could use a join, too:
SELECT DISTINCT u.userID, u.lastname, u.firstname
FROM User u, Request r
But, note the use of DISTINCT here. Without it, you will get multiple records for users with more than 1 request. The cost of DISTINCT is pretty high, so that alternative is usually slower than EXISTS.
Side note: all those variants can be used to return only those records without a match, too.
Thanks for the most excellent and thorough explanation. This helps a lot. So, just so we are all on the same page, EXISTS is simply faster than IN? If we can do the same thing using both types of sub-queries, then we should generally / always opt for EXISTS?
Man, use .NET technologies to connect to your Data Base, and everything gonna be allright!
Thanks for the tips. The SQL EXISTS clause was really easy to use now that I have seen some good explanation of it:
interesting discussion. :)
found it quite helpful for the reports i need to generate. thanks!
Interestingly I've had the same experience. I used subselects with IN statements thinking it would be much faster because in my case the outer table jas to be joined 3 times and i was thinking it'll scan the table 3 times.
It was something like this;
select name from names where id in ( creating a list of ids with simple select from another table)
It took minutes.
I changed this to
select n1.name, n2.name, n3.name from names n1, n2, n3, othertable ot where
n1.id=*ot.f1 and n2.id=*ot.f2 and n3.id=*ot.f3
The join runs way faster then first version...
You should be the no.1 google hit, not no. 2. Hopefully since I'm spending more time on your site, you'll be no. 1 for this topic.
I've always done Joins, so it is second-nature to me. However, I wanted to understand which is the recommended best-method. In order to think out of the box, and see what has been tested, I was wondering if IN clause would work better - apparently not. Thanks for the post.
IN vs. EXISTS performance can vary depending on the size and indexing of both the outer and inner tables. Also a neat behavior of EXISTS is that it only needs to find one row that matches then it can stop looking. For example, this will return a 0 or 1 (in Oracle - good "row_exists" function):
from dual where exists (select null from t1 where ...)
I was with the belief that an IN condition is better than a JOIN condition. But today, I came across a scenario where the JOIN condition performed better than IN.
To add to my woes, the IN condition and the JOIN condition both retrieve only 9 rows.
The cost of the query where IN was used is less than the Cost of the query where the JOIN is used.
Can anyone explain how this is possible?
I am confused in using Inner JOin and Outer join.
My sql queries having many tables references. So i have used LEFT OUTER JOIN.
Still my query performance is slow.
Any idea to speed up my queries?
I think it all depends on the complexity of the queries and the indexing and what not. But, I am not a SQL expert.
When in doubt, try adding "indexes" to the fields used in the join clauses.
I still a bit confused here...
I did too find that at one times that in is faster than join, and visa versa.
So, have you found any based that when is in is faster and when is the oposite?
I think it depends on what type of query you are running and how big the data tables are. I don't think there is any set rule. If you find that your queries are running slow, this gives you one thing to try and change for performance improvements.