This week has been a very good week for me in terms of learning some great SQL information. My newest nugget of gold is the SQL EXISTS clause. I recently came across a situation where using an INNER JOIN blew away the performance of a SQL IN clause; on that blog post, Christoph Schmitz suggested that I take a look at the SQL EXISTS clause stating that it was generally faster than the IN clause and that NOT EXISTS can be insanely fast when compared to a NOT IN clause.
Some of Christoph's advice:
Given proper indexes, EXISTS is much faster than IN. And, considering that NOT IN is even slower than IN, NOT EXISTS can really bring dramatic performance improvements.
You still have to keep in mind that different database engines handle things differently. For example, with MSSQL (2000) it is generally better to use SELECT * in the subquery (instead of SELECT fieldname), because the query optimizer will choose the best index or combination of indexes.
Additionally, I found some really good tips from Josh Berkus on PostgreSQL.org:
The rule I use is: if I expect the sub-select to return more than 12 records 20% or more of the time, use EXISTS. The speed gain for IN on small lists is not as dramatic as the speed loss for EXISTS on large lists.
More importantly, the difference between NOT IN and NOT EXISTS can be as much as 20:1 on large sub-selects, as opposed to IN and EXISTS, where I have rarely seen a difference of more than 3:1. As I understand it, this is because NOT EXISTS can use optimized join algorithms to locate matching rows, whereas NOT IN must compare each row against every possible matching value in the subselect.
It also makes a difference whether or not the referenced field(s) in the subselect is indexed. EXISTS will often use an index to compare the values in the master query to the sub-query. As far as I know, IN can use an index to retrieve the subquery values, but not to sort or compare them after they have been retreived into memory.
.... Incidentally, the dramatic differences between IN and EXISTS are not only a "PostgreSQL Thing". The same rules apply to MS SQL Server and SQL Anywhere, for the same reasons.
From Chris' and Josh's advice, it would seem that I should err on the side of SQL EXISTS most of the time. Of course, I am not excluding all uses of the SQL IN clause - it has its time and place and even ColdFusion's CFQueryParam tag has a List attribute to make the SQL IN clause easier to use.
Anyway, I thought I would build upon my UPDATE and DELETE posts to explore the EXISTS statement. As before, I will be building in-memory tables to test with. Since I don't want to repeat this code for the different examples, I am using ColdFusion's CFSaveContent tag to store the SQL table build scripts and then just including them into the subsequent examples:
<!--- Save SQL that builds temporary in-memory tables so that we can use them in more than one example without taking up too much room. ---> <cfsavecontent variable="strSQLBuild"> <!--- Declare in-memory data tables. ---> DECLARE @boy TABLE ( id INT, name VARCHAR( 30 ) ) ; DECLARE @girl TABLE ( id INT, name VARCHAR( 30 ) ) ; DECLARE @relationship TABLE ( boy_id INT, girl_id INT, date_started DATETIME, date_ended DATETIME ) ; <!--- Populate the boy table with some information. ---> INSERT INTO @boy ( id, name )( SELECT 1, 'Ben' UNION ALL SELECT 2, 'Arnold' UNION ALL SELECT 3, 'Vincent' ); <!--- Populate the girl table with some information. ---> INSERT INTO @girl ( id, name )( SELECT 1, 'Maria Bello' UNION ALL SELECT 2, 'Christina Cox' UNION ALL SELECT 3, 'Winona Ryder' ); <!--- Populate the relationship table. ---> INSERT INTO @relationship ( boy_id, girl_id, date_started, date_ended )( SELECT 1, 1, '2007/01/01', NULL UNION ALL SELECT 1, 3, '2004/09/15', '2005/06/15' UNION ALL SELECT 2, 1, '2006/05/14', '2006/05/23' UNION ALL SELECT 3, 2, '2007/02/14', NULL ); </cfsavecontent>
Now that we have that taken care of, I am going to show you an old-school example that uses the SQL IN clause. Here, we are going to select all boys that are currently in an open relationship (not in the swinging sense, but rather as in the relationship has not yet ended):
<!--- Query for boys using SQL EXISTS construct. ---> <cfquery name="qINTest" datasource="#REQUEST.DSN.Source#"> <!--- Include SQL build statements. ---> #PreserveSingleQuotes( strSQLBuild )# <!--- Select all the boys who are CURRENTLY in relationships. This will denoted by records in the relationship table that doesn't have an END date. ---> SELECT b.id, b.name FROM @boy b WHERE b.id IN ( SELECT r.boy_id FROM @relationship r <!--- Notice that our sub query can reference row/column-specific values in the parent query (b.id). ---> WHERE r.boy_id = b.id AND r.date_ended IS NULL ) ; </cfquery> <!--- Dump out the returned record set. ---> <cfdump var="#qINTest#" label="SQL IN Test: @boy" />
Notice that the subquery in the SQL IN clause can make row-specific references to the data in the parent query. Running the above code, we get the following CFDump output:
As you can see, Ben is returned since he is still dating Maria Bello (lucky bastard!) and Vincent is returned as he is still dating Christina Cox.
Ok, now, I am going to swap out the SQL IN clause with the SQL EXISTS clause. Since the subquery of an EXISTS clause can also reference row-specific data in the parent query, almost nothing has to change:
<!--- Query for boys using SQL EXISTS construct. ---> <cfquery name="qExistsTest" datasource="#REQUEST.DSN.Source#"> <!--- Include SQL build statements. ---> #PreserveSingleQuotes( strSQLBuild )# <!--- Select all the boys who are CURRENTLY in relationships. This will denoted by records in the relationship table that doesn't have an END date. ---> SELECT b.id, b.name FROM @boy b WHERE EXISTS ( SELECT * FROM @relationship r <!--- Notice that our sub query can reference row/column-specific values in the parent query (b.id). ---> WHERE r.boy_id = b.id AND r.date_ended IS NULL ) ; </cfquery> <!--- Dump out the returned record set. ---> <cfdump var="#qExistsTest#" label="SQL EXISTS Test: @boy" />
Running the above code, we get the following CFDump output:
While my in-memory data tables are very small and don't have any indexing or good stuff like that, this should be a better way to go (or at least a better general plan of attack). I have looked at the SQL EXISTS clause before and something about it always confused me; I am happy to finally have my head wrapped at least mostly around it at this point. I am sure other will have more tips on when to use and when to not use it, but this is another good step on my way to SQL mastery.
EXISTS is your friend. ;-)
I recommend the book "SQL Queries for Mere Mortals", which is an excellent resource for advanced SQL.
My favorite use of EXISTS with ColdFusion is the ability to use one query to do an insert or update:
IF EXISTS (SELECT girlName FROM girls WHERE girlID = #girlID#)
SET girlName = 'Halle'
WHERE girlID = #girlID#
INSERT INTO girls(name)
I see there are now two version of this book (a 2nd Edition came out recently). Do you think it's worth getting the new one? Or should I just get a used copy of the first one and call it a day?
That is cool stuff. I didn't think of using it as stand-alone boolean condition.
Man, I like EXISTS, and I'm happy to see someone writing about it from a CF perspective.
One thing I do with it is write things:
EXISTS (SELECT 1 FROM...)
The "SELECT 1" just adds a nice semantical clue as to what the exists bit is doing...
Yeah, I like that idea. The more info that can be gotten across to the programmer, the better. Actually, I was just reading over on The SQL Authority where Pinal Dave was talking about the same sort of thing, but in his case, I think it was in regards to using COUNT():
See point #13. I think he is making the point that COUNT() will always return a record, where as SELECT 1 will depend on actual returned records. Certainly, both good things to think about.
Not only does SELECT 1 make it more clear, but it will be slightly faster. This is because it isn't actually retrieving any column data, but instead is only returning the literal value 1.
Ben, get the new version of the book. It is quite a bit better.
Another useful SQL technique is using SELECT in conjunction with INSERT in order to insert multiple rows via a single SQL statement.
INSERT users (id, name, email)
SELECT userid, fullname, emailaddress
ORDER BY fullname
Nice - I didn't even think of that fact that it would be faster for not having to read actual data. Good call.
I'd double check the use of 1 as a column.
In MS SQL Server, SELECT 1 is the same as SELECT whatEverYourFirstColumnIs. The same might be true in sub queries and agregate functions.
I read that as well and thought about this. However, since data is not really being returned, I think its OK either way.
But would that go against this:
"with MSSQL (2000) it is generally better to use SELECT * in the subquery (instead of SELECT fieldname), because the query optimizer will choose the best index or combination of indexes"
SELECT 1 might be the same as SELECT myFirstColumn.
I've just changed a query that did NOT IN to a NOT EXISTS on a large table that is forever growing. No visible change to the speed of the page load but I'm hoping it's made some difference as the server is up and down at the moment and this page is the prime suspect :)
Hmmm, interesting. You could be right. However, as a rule, my tables usually have an autoincrementing ID column as the first column, which should be index (I think). But, that is just a convention. I guess using * might be the best way to go.
Always good to know a bit more SQL.
Isn't it enough we have to learn CF to a high standard! :P
Regarding SELECT 1, in the ANSI standard this should select the literal value 1. So if MSSQL does it differently, that's something they've chosen to ignore.