SQL Optimization Case Study - JOIN Clause vs. IN Clause

Posted September 4, 2007 at 6:02 PM

Tags: SQL

Just a quick note, not much to see here. I've been working on a report for a client that queried over a lot of tables. As its first step, it created a temporary, in-memory table, @users, into which I was INSERTing filtered IDs (based on user-entered search criteria). I then used that @users table in a bunch of subsequent queries so that I didn't have to re-run the filtering for each piece of calculated data.

Anyway, long story short, the report has recently started to timeout (as the data tables have started to grow quickly in size) and the report has a Request Timeout of about 5 minutes. I took a look at the SQL (which I haven't touched in a long time) to see what could be optimized. Two of the calculated columns used a SQL IN clause on the @users table; something to this effect:

 Launch code in new window » Download code as text file »

  • ... WHERE id IN ( SELECT u.id FROM @users u )

Since these were one-to-one type relationships (only one record per-user-id in the parent query), I changed the IN clauses to INNER JOIN clauses. After making this change in two places, the query went from timing out after 300 seconds to executing in about 15 seconds.

There's still a lot of indexing that needs to be done (still working with the senior tech guy to get approval on that), but this was a HUGE performance difference. I am sure many of you are looking at this and thinking, "IN clause - that's such a rookie mistake!" Yeah, yeah it is, but I thought I would point it out here so others can learn from my mistake. Sometimes I get concerned that the joins will take a lot of time, but as you can see above, it is the much-lesser of two evils.

Download Code Snippet ZIP File

Post Comment  |  Ask Ben  |  Permalink  |  Other Searches  |  Print Page


You Might Also Be Interested In:



Learning ColdFusion 9 - ColdFusion 9 tutorials, samples, examples, demos

Reader Comments

Tero Pikala
Sep 4, 2007 at 7:27 PM // reply »
9 Comments

Sometimes you just need to use IN, specifically if there are some ids in ColdFusion that need to be part of the query. This could happen for example when you have results from search server but need to filter out them based on some SQL based logic.

When that happens one thing to check is that number of ids never exeed capability of database. Oracle 9 "in list iterator" can handle 1000 items, I haven't checked this with other databases.

If you anticipate having more than that it's relatively easy to build code that splits list to serveral parts.

Of course if you run into this kind of situation it's always good idea to really think if there is some other way - it will never be really efficient.


Sep 4, 2007 at 8:08 PM // reply »
6,371 Comments

@Tero,

Yes, certainly there are cases where an IN clause just makes sense. I use them all a good deal, especially when dealing with a short list of values (usually involving a CFQueryParam with a List attribute set to True).

I just wanted to point out that in some case (especially big sub-query cases) this can be a huge bottleneck (at least it was for me in this case).


Christoph Schmitz
Sep 5, 2007 at 4:50 AM // reply »
24 Comments

Hi Ben,

you might want to check the EXISTS clause, too. E.g.

SELECT requestID, request_title
FROM Request r
WHERE EXISTS
(
SELECT *
FROM User
WHERE userID=r.userID
)

Given proper indexes, EXISTS is much faster than IN. And, considering that NOT IN is even slower than IN, NOT EXISTS can really bring dramatic performance improvements.

You still have to keep in mind that different database engines handle things differently. For example, with MSSQL (2000) it is generally better to use SELECT * in the subquery (instead of SELECT fieldname), because the query optimizer will choose the best index or combination of indexes.

Best,

Chris


Sep 5, 2007 at 8:09 AM // reply »
6,371 Comments

@Chris,

It's funny - I have never used EXISTS but I have looked it up many times. There's something about it that I just can't seem to wrap my head around. Sometimes, I think about using it, so I look up how it works and when I read the definition, it seems to be not usable. From what I gather, it evaluates to TRUE if it returns any rows, but is it on a per-row basis of the parent query. Meaning, can the EXISTS clause reference values in the parent SELECT?


Christoph Schmitz
Sep 5, 2007 at 8:39 AM // reply »
24 Comments

@Ben,

yes, EXISTS can reference values from the outer select.
Look at the condition of the subquery from my example:

WHERE userID=r.userID

Here r is an alias for the table from the outer select. The subquery only selects those records that have a corresponding userID in the outer select. Those records that do NOT have a matching userID are not returned.

And, you are understanding EXISTS quite correctly. EXISTS evaluates to true if a matching record is found.

And, EXISTS indeed evaluates per row. Thus, a record (from the outer select) that has a corresponding record in the sub query is returned in your result set, whereas a record, that does not have a match in the subquery, is not returned.

And, because of the matching WHERE clause in the subquery, the subquery.recordCount can only be 1 or 0. And that is pretty much like CFML: recordcount = 0 = FALSE, recordcount >0 = TRUE

So the following statements are equal in result (compared to my last example, I switched tables here... makes more sense to me ;-)):

-- Select all users that have entries in request table
SELECT u.userID, u.lastname, u.firstname
FROM User u
WHERE userID IN (SELECT userID FROM Request)

-- again, select all users that have entries in request table
SELECT u.userID, u.lastname, u.firstname
FROM User u
WHERE EXISTS
(
SELECT *
FROM Request
WHERE userID=u.userID
)

Plus, you could use a join, too:

SELECT DISTINCT u.userID, u.lastname, u.firstname
FROM User u, Request r
WHERE r.userID=u.userID

But, note the use of DISTINCT here. Without it, you will get multiple records for users with more than 1 request. The cost of DISTINCT is pretty high, so that alternative is usually slower than EXISTS.

Side note: all those variants can be used to return only those records without a match, too.

Chris


Sep 5, 2007 at 8:54 AM // reply »
6,371 Comments

@Chris,

Thanks for the most excellent and thorough explanation. This helps a lot. So, just so we are all on the same page, EXISTS is simply faster than IN? If we can do the same thing using both types of sub-queries, then we should generally / always opt for EXISTS?


Sep 5, 2007 at 2:18 PM // reply »
1 Comments

Man, use .NET technologies to connect to your Data Base, and everything gonna be allright!


Sep 5, 2007 at 2:39 PM // reply »
6,371 Comments

@Christoph,

Thanks for the tips. The SQL EXISTS clause was really easy to use now that I have seen some good explanation of it:

http://www.bennadel.com/index.cfm?dax=blog:941.view

Thanks.


joan
Sep 9, 2007 at 10:09 AM // reply »
1 Comments

hei.

interesting discussion. :)
found it quite helpful for the reports i need to generate. thanks!

joan


said okur
Sep 24, 2007 at 4:11 PM // reply »
1 Comments

Interestingly I've had the same experience. I used subselects with IN statements thinking it would be much faster because in my case the outer table jas to be joined 3 times and i was thinking it'll scan the table 3 times.
It was something like this;
select name from names where id in ( creating a list of ids with simple select from another table)
It took minutes.
I changed this to
select n1.name, n2.name, n3.name from names n1, n2, n3, othertable ot where
n1.id=*ot.f1 and n2.id=*ot.f2 and n3.id=*ot.f3

The join runs way faster then first version...


Fahim
Oct 25, 2007 at 2:15 PM // reply »
1 Comments

You should be the no.1 google hit, not no. 2. Hopefully since I'm spending more time on your site, you'll be no. 1 for this topic.

I've always done Joins, so it is second-nature to me. However, I wanted to understand which is the recommended best-method. In order to think out of the box, and see what has been tested, I was wondering if IN clause would work better - apparently not. Thanks for the post.


Post Comment  |  Ask Ben

Recent Blog Comments
Nov 7, 2009 at 5:53 PM
Ask Ben: Javascript String Replace Method
You can find here an advanced function that prepared with javascript replace function. This can make the first letters of words, sentences, lines and whatever you define automatically: http://www.m ... read »
Andrew Neely
Nov 7, 2009 at 4:56 PM
A Moment That Touched Me - The Fountainhead
Ben, Glad you enjoyed the podcast. Yeah, the Tank Riot guys can get really chatty during the episodes, but that's part of the charm of it for me. They've covered everything from Nichola Tesla to Cha ... read »
Nov 7, 2009 at 4:43 PM
Building A Fixed-Position Bottom Menu Bar (ala FaceBook)
Is it possible to make some more MenĂ¼`s ? ... read »
Jill
Nov 7, 2009 at 11:40 AM
How To Unformat Your Code (Like A Pro)
Derek, I think you might be right - sweet! Thanks for the link :) ... read »
Nov 7, 2009 at 11:25 AM
How To Unformat Your Code (Like A Pro)
I think it would be way easier to just use this http://www.logichammer.com/html-formatter/ He just released v3 and it rocks. ... read »
Jill
Nov 7, 2009 at 7:58 AM
How To Unformat Your Code (Like A Pro)
LMAO - this was pretty funny! I have to admit - I also love to reformat code so I can read it. My boss used to tell me to leave my OCD at home. Now I don't feel so bad after reading everyone else' ... read »
Nov 6, 2009 at 10:10 PM
How To Unformat Your Code (Like A Pro)
The timing of this post is just uncanny. I spent the last 15-20 minutes manually un-formatting my "Ben Nadel" style code within a CFC of mine. I was really digging the readability a few weeks ago, bu ... read »
Roe
Nov 6, 2009 at 5:11 PM
Passing Arrays By Reference In ColdFusion - SWEEET!
ArraySort also reorders the results of these java obj's ... read »