SQL Optimization Case Study - JOIN Clause vs. IN Clause

Posted September 4, 2007 at 6:02 PM

Tags: SQL

Just a quick note, not much to see here. I've been working on a report for a client that queried over a lot of tables. As its first step, it created a temporary, in-memory table, @users, into which I was INSERTing filtered IDs (based on user-entered search criteria). I then used that @users table in a bunch of subsequent queries so that I didn't have to re-run the filtering for each piece of calculated data.

Anyway, long story short, the report has recently started to timeout (as the data tables have started to grow quickly in size) and the report has a Request Timeout of about 5 minutes. I took a look at the SQL (which I haven't touched in a long time) to see what could be optimized. Two of the calculated columns used a SQL IN clause on the @users table; something to this effect:

 Launch code in new window » Download code as text file »

  • ... WHERE id IN ( SELECT u.id FROM @users u )

Since these were one-to-one type relationships (only one record per-user-id in the parent query), I changed the IN clauses to INNER JOIN clauses. After making this change in two places, the query went from timing out after 300 seconds to executing in about 15 seconds.

There's still a lot of indexing that needs to be done (still working with the senior tech guy to get approval on that), but this was a HUGE performance difference. I am sure many of you are looking at this and thinking, "IN clause - that's such a rookie mistake!" Yeah, yeah it is, but I thought I would point it out here so others can learn from my mistake. Sometimes I get concerned that the joins will take a lot of time, but as you can see above, it is the much-lesser of two evils.

Download Code Snippet ZIP File

Post Comment  |  Ask Ben  |  Permalink  |  Other Searches  |  Print Page


You Might Also Be Interested In:



Learning ColdFusion 9 - ColdFusion 9 tutorials, samples, examples, demos

Reader Comments

Sep 4, 2007 at 7:27 PM // reply »
9 Comments

Sometimes you just need to use IN, specifically if there are some ids in ColdFusion that need to be part of the query. This could happen for example when you have results from search server but need to filter out them based on some SQL based logic.

When that happens one thing to check is that number of ids never exeed capability of database. Oracle 9 "in list iterator" can handle 1000 items, I haven't checked this with other databases.

If you anticipate having more than that it's relatively easy to build code that splits list to serveral parts.

Of course if you run into this kind of situation it's always good idea to really think if there is some other way - it will never be really efficient.


Sep 4, 2007 at 8:08 PM // reply »
6,516 Comments

@Tero,

Yes, certainly there are cases where an IN clause just makes sense. I use them all a good deal, especially when dealing with a short list of values (usually involving a CFQueryParam with a List attribute set to True).

I just wanted to point out that in some case (especially big sub-query cases) this can be a huge bottleneck (at least it was for me in this case).


Sep 5, 2007 at 4:50 AM // reply »
24 Comments

Hi Ben,

you might want to check the EXISTS clause, too. E.g.

SELECT requestID, request_title
FROM Request r
WHERE EXISTS
(
SELECT *
FROM User
WHERE userID=r.userID
)

Given proper indexes, EXISTS is much faster than IN. And, considering that NOT IN is even slower than IN, NOT EXISTS can really bring dramatic performance improvements.

You still have to keep in mind that different database engines handle things differently. For example, with MSSQL (2000) it is generally better to use SELECT * in the subquery (instead of SELECT fieldname), because the query optimizer will choose the best index or combination of indexes.

Best,

Chris


Sep 5, 2007 at 8:09 AM // reply »
6,516 Comments

@Chris,

It's funny - I have never used EXISTS but I have looked it up many times. There's something about it that I just can't seem to wrap my head around. Sometimes, I think about using it, so I look up how it works and when I read the definition, it seems to be not usable. From what I gather, it evaluates to TRUE if it returns any rows, but is it on a per-row basis of the parent query. Meaning, can the EXISTS clause reference values in the parent SELECT?


Sep 5, 2007 at 8:39 AM // reply »
24 Comments

@Ben,

yes, EXISTS can reference values from the outer select.
Look at the condition of the subquery from my example:

WHERE userID=r.userID

Here r is an alias for the table from the outer select. The subquery only selects those records that have a corresponding userID in the outer select. Those records that do NOT have a matching userID are not returned.

And, you are understanding EXISTS quite correctly. EXISTS evaluates to true if a matching record is found.

And, EXISTS indeed evaluates per row. Thus, a record (from the outer select) that has a corresponding record in the sub query is returned in your result set, whereas a record, that does not have a match in the subquery, is not returned.

And, because of the matching WHERE clause in the subquery, the subquery.recordCount can only be 1 or 0. And that is pretty much like CFML: recordcount = 0 = FALSE, recordcount >0 = TRUE

So the following statements are equal in result (compared to my last example, I switched tables here... makes more sense to me ;-)):

-- Select all users that have entries in request table
SELECT u.userID, u.lastname, u.firstname
FROM User u
WHERE userID IN (SELECT userID FROM Request)

-- again, select all users that have entries in request table
SELECT u.userID, u.lastname, u.firstname
FROM User u
WHERE EXISTS
(
SELECT *
FROM Request
WHERE userID=u.userID
)

Plus, you could use a join, too:

SELECT DISTINCT u.userID, u.lastname, u.firstname
FROM User u, Request r
WHERE r.userID=u.userID

But, note the use of DISTINCT here. Without it, you will get multiple records for users with more than 1 request. The cost of DISTINCT is pretty high, so that alternative is usually slower than EXISTS.

Side note: all those variants can be used to return only those records without a match, too.

Chris


Sep 5, 2007 at 8:54 AM // reply »
6,516 Comments

@Chris,

Thanks for the most excellent and thorough explanation. This helps a lot. So, just so we are all on the same page, EXISTS is simply faster than IN? If we can do the same thing using both types of sub-queries, then we should generally / always opt for EXISTS?


Sep 5, 2007 at 2:18 PM // reply »
1 Comments

Man, use .NET technologies to connect to your Data Base, and everything gonna be allright!


Sep 5, 2007 at 2:39 PM // reply »
6,516 Comments

@Christoph,

Thanks for the tips. The SQL EXISTS clause was really easy to use now that I have seen some good explanation of it:

http://www.bennadel.com/index.cfm?dax=blog:941.view

Thanks.


Sep 9, 2007 at 10:09 AM // reply »
1 Comments

hei.

interesting discussion. :)
found it quite helpful for the reports i need to generate. thanks!

joan


Sep 24, 2007 at 4:11 PM // reply »
1 Comments

Interestingly I've had the same experience. I used subselects with IN statements thinking it would be much faster because in my case the outer table jas to be joined 3 times and i was thinking it'll scan the table 3 times.
It was something like this;
select name from names where id in ( creating a list of ids with simple select from another table)
It took minutes.
I changed this to
select n1.name, n2.name, n3.name from names n1, n2, n3, othertable ot where
n1.id=*ot.f1 and n2.id=*ot.f2 and n3.id=*ot.f3

The join runs way faster then first version...


Oct 25, 2007 at 2:15 PM // reply »
1 Comments

You should be the no.1 google hit, not no. 2. Hopefully since I'm spending more time on your site, you'll be no. 1 for this topic.

I've always done Joins, so it is second-nature to me. However, I wanted to understand which is the recommended best-method. In order to think out of the box, and see what has been tested, I was wondering if IN clause would work better - apparently not. Thanks for the post.


Post Comment  |  Ask Ben

Recent Blog Comments
Nov 21, 2009 at 11:03 AM
Groovy Operator Overloading Does Not Work In The ColdFusion Context
Hi Ben, Thanks for this informative post. Now I am reading ur old posts too ... read »
Nov 21, 2009 at 10:56 AM
HostMySite.com Has The Best ColdFusion Hosting
@Mehul, Yes very nice people, however several downtimes per day which was not acceptable. Hence we had to move out. I am glad you are having good luck with them so far. ... read »
Nov 20, 2009 at 11:32 PM
Five Months Without Hungarian Notation And I'm Loving It
I've used headless camel case for years for not only ColdFusion variables, but also SQL tables and fields... pretty much everything involving code. I also subscribe to the "don't abbreviate and clea ... read »
Nov 20, 2009 at 11:00 PM
Five Months Without Hungarian Notation And I'm Loving It
@Marcel, Yeah, I always err on the side of longer but more readable variable names. As for the camel casing of CF methods and the headless camel casing of custom items, I get around this by always ... read »
Nov 20, 2009 at 10:56 PM
Five Months Without Hungarian Notation And I'm Loving It
I use the following and love it: my.namespace.MyComponents.functionMethodsOrUDF() CONSTANT_VALUES_OR_PROPERTIES One thing I always try is to CamelCaseBuiltInColdFusionFunctions() so others can tell ... read »
Nov 20, 2009 at 5:38 PM
Learning ColdFusion 8: CFImage Part I - Reading And Writing Images
Hi Ben, Great article. I've been looking around to see if ColdFusion image engine can programatically create the following "wrap around" effect: http://www.creativepro.com/article/photoshop-s-she ... read »
Nov 20, 2009 at 5:35 PM
Maintaining ColdFusion Sessions Across SMS Text Message Requests Without Cookies
@Dave: I talked to Gert he suggested: <cfhttp method="get" url="http://{some cf website}" result="stuff" addtoken="yes" /> Note the addition of cfhttp attribute addtoken. That should persist y ... read »