When it comes to database queries, the only kind of JOIN that I like to perform with SQL is an INNER JOIN. They are wicked fast and always satisfying. Any time that I have to figure out which records don't have relationships, it means that I have to do a LEFT OUTER JOIN; this is always depressing and often slow and I try to avoid it as much as possible. That's why I ended up trying something a bit counterintuitive the other day. I was in a situation where I needed to populate one table (T) with the records from table (A) that did not have a related record in table (B). My first instinct was an insert with a LEFT OUTER JOIN between A and B. This was wicked slow (dealing with hundreds of thousands of records). Then, I decided to try a full insert followed by a subsequent INNER JOIN Delete. This, as it turns out, was wicked fast!
To see what I'm talking about, imagine that I have three tables:
The Profile table has a foreign key for the User table - userID - such that (User.id == Profile.userID). The Guest table is empty. What I need to do is populate the Guest table with all User records that do not have a related Profile record.
My first instinct was to do something like this:
<!--- We want to move some of the USER records into the GUEST table. We want to do this when the USER record does not have a matching PROFILE record. ---> <cfquery name="populateGuest" datasource="testing"> <!--- Clear out guest table. ---> TRUNCATE TABLE guest; <!--- Populate the guest table with users that do NOT have profiles. That is, where there is no matching profile record. ---> INSERT INTO guest ( id, name )( SELECT u.id, u.name FROM user u LEFT OUTER JOIN profile p ON u.id = p.userID <!--- Make sure this user doesn't have a matching profile record joined. ---> WHERE p.id IS NULL ); <!--- Return SOME value so we can CFDump the query. ---> SELECT ( 'Done' ) AS message ; </cfquery>
As you can see, I am using a LEFT OUTER JOIN to figure out which User records don't have a Profile. This derived recordset is then inserted directly into the Guest table. This approach works, but is kind of slow; even with my relatively small demo tables (10,000 records), the query runs in about 28ms.
Once I saw this poor performance, I started to think if there was any way that I could accomplish the same thing using an INNER JOIN rather than a LEFT OUTER JOIN. Then it occurred to me! What if I simply INSERT all the User records; then, go back and DELETE any of the Guest records that do have a profile. In this way, I could skip the LEFT OUTER JOIN between User and Profile and replace it with an INNER JOIN between Profile and Guest.
<!--- We want to move some of the USER records into the GUEST table. We want to do this when the USER record does not have a matching PROFILE record. ---> <cfquery name="populateGuest" datasource="testing"> <!--- Clear out guest table. ---> TRUNCATE TABLE guest; <!--- First, we're gonna insert ALL the users into the guest table. Then, we'll worry about pruning the guest table. ---> INSERT INTO guest ( id, name )( SELECT u.id, u.name FROM user u ); <!--- Now that we have completed populated the guest table, we want to delete any record that has a matching profile. ---> DELETE g FROM guest g INNER JOIN profile p ON g.id = p.userID ; <!--- Return SOME value so we can CFDump the query. ---> SELECT ( 'Done' ) AS message ; </cfquery>
As you can see here, I am moving all the User records into Guest. Then, I am pruning the Guest table to remove any rows that were moved over unnecessarily. This might sound like a lot of extra work and overhead; but, with the INNER JOIN instead of the LEFT OUTER JOIN, this query executes much faster! Even with my relatively small demo tables, this query runs in about 14ms - twice as fast as the first approach!
As the record sets get larger, I found the difference in performance to be even more pronounced. It might seem totally counterintuitive to move records unnecessarily; but, when such a move allows you to use an INNER JOIN as opposed to a LEFT OUTER JOIN, the speed of the subsequent DELETE is far better than the slothfulness of the original LEFT OUTER JOIN.
Want to use code from this post? Check out the license.
Couldn't you just do something like
insert into guest(id, name)
select u.id, u.name
from user u
where u.id not in (
select id from profile
Not sure if it'd be faster but I imagine the insert then delete would play havoc with any indexes.
I just tried that and the performance seems to be on par with the LEFT OUTER JOIN ~ 30ms.
So I thought, those in clauses can be murder. I just have an unfounded dislike of inserting data only to immediately delete it. But like you said, it's a counter intuitive approach.
Totally counter-intuitive! I only tried it because my original INSERT with the LEFT OUTER JOIN was taking waaaay too long. It was an act of desperation; and then it worked super fast!
FYI, I am using MySQL in this post.
I will warn you that your results may vary based on the size of the data. While what you're doing is faster over 10,000 rows, it may indeed be slower when doing it over 10,000,000 rows.
That's the one thing that's always difficult about performance testing on small datasets, what works efficiently on small datasets, doesn't always run well on large datasets.
Also, you might want to make sure you have all indexes on all your important columns--that can obviously have a huge impact on performance.
Lastly, is there really a reason to insert the "Name" column into your Guests table. I know this is a demo, but this breaks the relational model.
Not sure if it would be any different from JoeM's query, but my first instinct would be to use an exists subquery:
insert into guest(id, name)
select u.id, u.name
from user u
where not exists (
select * from profile where userID = u.userID
No, no reason to insert the Name column. I'm just not that good at dumbing down examples :)
I think the table that I was working with in production was like 600,000+ records. The Insert/Delete ran in about 2 seconds. The LEFT OUTER JOIN insert was taking like 30 seconds or something crazy. I probably could have had a better index somewhere.
I just need to get to the point where I could run it fast locally for testing, then move to production to run again.
But yeah, the caveat with SQL is definitely that indexing will always radically change the performance. Almost to a shocking degree. Indexing is something that I am still learning about.
The problem I see with EXISTS is that we're looking for records that fail. As such, I don't think that NOT EXISTS would actually be any different than NOT IN as both are looking for zero results per record.
A lot of it depends on what proportion of user records have corresponding profile records. If most of them do, then exists should perform well. If most of them don't, not so much.
I'm guessing that the reason your insert/delete method works well is that there aren't that many records to delete. Is that right?
Maybe I have low standards, but you're talking about 28 milliseconds and 14 milliseconds, right? Is that really a difference anyone would notice?
NOT IN has always been known to be a performance killer. One of the slowest operations in SQL.
NOT EXISTS on the other hand can take full advantage of the indices that exist and usually is much faster than NOT IN or LEFT OUTER JOIN.
If we are talking about many records, this difference can be huge.
I have had cases where NOT IN would run for say 20 secods and NOT EXISTS would be done in less than a second.
You could even speed this up a little more by not using SELECT * , but a fixed value instead, so the subquery does not actually have to built a result set.
insert into guest(id, name) select u.id, u.name from user u where not exists ( select 1 from profile where userID = u.userID )
Also, you might want to experiment with indices on userID, you might see another performance increase with the proper index (depending on the number of records of course).
Ben - I'd be curious to see an EXPLAIN on the two different queries. It seems like an indexing issue.
I agree with Matt, 28ms is not a slow query in the first place. And I wouldn't use time to measure performance on such a small scale. Better to look at the IO, CPU cost, and the sub-tree cost in your execution plan (as well as the plan itself). There's a million and a half things that can influence the performance of a query, and time is not always a good indicator. For example, did you clean your server's data and plan cache before running each query. If you didn't, the second query might have had the data it needed in memory while the first one didn't. Or did you mean seconds, not milliseconds? In that case, it's fair enough to look at how long the query takes.
Performance would also depend on the join selectivity. i.e. if the ratio of users vs profiles is low and you have a covering index on userID,name, the optimiser will be able to perform IX seek (or CIX seek if you clustered on userID). If that ratio is high, it will perform a (C)IX scan. Also keep in mind that your example is extremely simplistic, which skews the results. In reality, the table would probably be wider, with more indexes on it, all of which have to be updated by both your delete and insert statements. That can take time, too.
Not sure about mySql, buth with SQL server, you could use a MERGE operator, which is optimized for this kind of stuff. You could also write it with EXCEPT, but wouldn't gain anything in terms of performance.
How about something like:
insert into guest(id, name)
from profile p
where u.id = p.userID) = 0)
Talk to you later,
Huh. I thought a "left outer join" was different than a "left join" but it's not. Looks like in your case you are doing good.
Recently, I needed all records from one table, with data from several other tables, also where I wanted to have the data. If I had 'inner joined' all of the tables together, I probably wouldn't have had much data at all.
In Oracle, you can do a "CTAS" (Create Table As Select). Not sure if this returns the same results as your example:
CREATE TABLE guest AS SELECT user.id , user.name FROM user INNER JOIN profile ON profile.user_id = user.id WHERE profile.user_id is null
Yes, you can use aliases, but if the table name is short and the query fairly simple, I don't like taxing my brain unnecessarily ;-) Verbosity for the win!
I agree it seems counter-intuitive, but if it works, it works. I wonder if you can get the same performance improvements in Oracle, and if it would be just as good as indexing. I recently had a query I had to write with multiple joins and left joins in Oracle, and for them, performance was a huge issue. But I used indexing and it improved my performance by leaps and bounds. On the subject of performance, is it just more, or has anyone else noticed how much distinct slows a query down? I guess it makes sense when you consider the order execution of the sql statement, but still...it seems like it slows it down more than it should. :-/
In my experience you're much better off doing a GROUP BY as opposed to DISTINCT.
Yeah, I may be missing something due to my admitted lack of experience with database, but I am using a group by clause in the current query in question, but it alone didn't produce accurate results, and I stuck a distinct in there, and it does, it just seems to cost a lot in processing times.
I'm glad you clarified MySQL at work here. I'm pretty sure if you got the EXPLAIN on the three ways to do this, NOT IN / LEFT OUTER JOIN-IS NULL and the NOT EXISTS, you'd find three totally different execution plans, and you'd find, probably, NOT EXISTS to be the slowest, almost inexplicably, when I look at my own tests of the same thing.
And for the kind of operation you're attempting to do, NOT IN should look about the same speed wise as the LEFT JOIN / IS NULL and would look totally different in EXPLAIN. MySQL is still executing the same algorithms, just shuffles where it takes place in code. NOT EXISTS just happens to be the most horrible of all the possibilities you have without introducing, as you have, something counter-intuitive.
It's funny, if this were MS SQL, I'd be saying the opposite, LEFT JOIN / IS NULL vs. NOT EXISTS will generate relatively similar execution plans, where as NOT IN will be the performance hog.
And MySQL has its own version of MERGE, though not as robust nor flexible, in the form of INSERT ... ON DUPLICATE KEY.
Someone also said it, but it bears repeating, and this is product dependent to some degree, indexed tables will normally reduce the number of scans in favor of seeks, which usually is where, once you've accounted for which syntactic operation will perform the best, you'll find your performance boondoggles.
I'd be interested in what version of MySQL we're talking about here. Not all optimizers are created equal between versions.
I think David cautioned as well that this might look good for one off use, but you might find that the performance of your original method would look better under load.
I beg to differ (for MSSQL, anyway). While GROUP BY can achieve what DISTINCT does, it is meant for aggregation. i.e. if you just want to output distinct data without counting/summing... use DISTINCT, it's more readable and will end up using the same execution plan (namely a combination of CIX/IX/heaps scan or seek followed with a hash match). Same execution plan means same performance on the same server and database at a given point in time.
Aside from performance, the major difference here is the actual result set returned. While LJ+IS NULL and NOT EXISTS are logically identical, NOT IN isn't because of how it treats NULL values. The IN operator is trivalent which means it can return TRUE/FALSE/NULL (the other 2 are bivalent). For example, say you have table with 100 products:
select name from products where id NOT IN (1,2,3)
will return the other 97 products but
select name from products where id NOT IN (1,2,3, NULL)
will return an empty record set because NULL effectively means unknown. As far as the server is concerned it could be 69, 666, "I can has cheezburger", so it will return unknown. unknown <> true, so it returns nada. I understand my example is a bit dumb, but if you substituted (1,2,3, NULL) with a sub-query hitting a large table, chances are you'd find NULL records in there.
you're right, not exist is often faster because of the way it operates. The LJ/IS NULL version has to build the entire record set in a hash table first, then evaluate each row to check if the right part is null, whereas the NOT EXISTS can do a scan of the left table (which theoretically is a smaller set (more rows in orderDetails than products) ), loop through those records and do an index seek in the right table. That's provided the right table has a non clustered index on its foreign key, but as a rule of thumb it should be implemented.
@JoeM and @tof,
thanks for clarifying. There are still cases where I can not get the same results using group by as I can using distinct, but again, that may be due to my limited knowledge of database.
Show us the queries :-). There's probably a tinny difference between the two. At the end of the day databases are like compilers, they do what you ask them to. Mostly ;-D
I would have done like this:
INSERT INTO guest
u.id not in (select user_id from profile)
I didn't read the comments, my solution was already in the comments.
that was the exact exist query I came on here to find. Thanks. You saved my day. And my job. Thank you. Much appreciated.
To be honest, I can't remember exactly what I was doing this for (I'm falling way behind on comments lately). I would guess that there were more matching joins than non-matching... but I don't really know more sure.
Definitely, in this example the time difference is hardly anything; but I wasn't using many records. In my production system, which I was trying to model, there were many many records and the time difference because rather enormous.
It's good to hear that IN vs. EXISTS can have such a big difference. I have also had some pretty bad experiences with IN. Typically, what I use IN for is feeding the results of one query into another like:
SELET * FROM table1 WHERE fk IN ( #valueList( firstQuery.id )# )
Here, I am using IN as means to link one existing query to a new query. Other than that, however, I try not to use it too often.
Definitely, it's hard to simplify an example to the point where it's easy to discuss - you're bound to lose some of the important details. And, on top of that, I'm not a SQL guru or anything, so it doesn't necessarily occur to me which parts are critical to the conversation. That said, there was a much much greater difference in performance on the larger, production record set as opposed to the smaller demo one.
When I learned SQL, I learned to use "outer"; though, I see that people don't always use them. I was never sure if there was a difference.
As @Joe said, I tend to just go with GROUP BY rather than DISTINCT. It helps me think about the records more... and, I am not sure that MySQL always supported DISTINCT (thought it does now). I think it used to be a MS SQL only feature.
A lot of the SQL stuff is still black magic as far as I'm concerned :) I'm using MySQL 5 (though Im not sure what dot-release). Also, I don't know anything about MERGE (or MySQL's version of it). I'll have to look that up.
I've experienced things like this as well. Sometimes we humans try to think so hard like machines and make our commands as efficient as possible, and yet a more inefficient or "wasteful" way may actually be faster! Great example. Thanks.
Thanks for the kind words!
Good article and good discuss
Thank's a lot for you guys
any one of you tested below query ?
g.id = p.userID
Given query is not functioning properly , it has been delteting all data in both case
g.id = p.userID
g.id <> p.userID
Thanks & Regards
Jayant Kumar Das