After my database was accidentally upgraded to MySQL 8.0.28, I started to look at the features released in version 8. One fun feature that caught my eye was this idea of a
LATERAL derived table. This is a derived table in which the resultant rows are calculated based on other tables in the query. Not only that, the "lateral rows" are based on other rows in recordset. This feature, added in MySQL 8.0.14, will make it easier to gather relevant aggregate data for each row in the results.
In the past, if I needed to pull back a number of related aggregations for a given row, I would
JOIN to a derived "stats" table that contained the aggregations for the predicted records in the outer query. The downside to this approach is that - because the derived query can't reference the outer query - I need to duplicate all of the filtering inside of the derived query in order to limit the number of records that it generates.
To see what I mean, let's look at an example using data from this blog. In my database schema, I have two tables:
member- Contains all of the author information for the comments.
blog_comment- Contains all of the comments left by the members.
This is a 1-to-many relationship in which every member can leave N-number of comments (related by
blog_comment.member_id). If I want to gather aggregate information about the comments for a given user, I would have to filter records using the
member_id in both the outer query and the derived query:
SET @userID = 1; SELECT m.id, m.name, -- Gather the aggregate data from the derived stats table. stats.commentCount, stats.minCreatedAt, stats.maxCreatedAt FROM member m INNER JOIN -- Generate a DERIVED TABLE with multiple aggregations relating to the -- commenting for the given user. This allows us to run the related query -- "once for the member" instead of "once for each aggregate". ( SELECT c.member_id, COUNT( * ) AS commentCount, MIN( c.date_created ) AS minCreatedAt, MAX( c.date_created ) AS maxCreatedAt FROM blog_comment c -- NOTE: Since the derived query CANNOT reference the outer query - it -- has to be a constant evaluation - I have to repeat the filtering in -- the derived query in order to avoid a FULL TABLE SCAN. In this case, -- it's only one column; but, the more complex the filtering gets, the -- more conditions would need to be duplicated. WHERE c.member_id = @userID GROUP BY c.member_id ) AS stats ON ( m.id = @userID -- Filter OUTER QUERY to given member. AND stats.member_id = m.id ) ;
As you can see, I'm using the
@userID to limit rows in both the outer query as well as in the derived
stats query. Since the derived query cannot reference rows in the outer query - derived queries have to be "constant" in the query execution plan - I have to reproduce all of the desired filtering within the derived query in order to prevent a catastrophic full-table scan.
LATERAL derived table removes this point-of-friction and allows our derived query to reference rows in the outer query! All we have to do is include the
LATERAL keyword prior to the
SELECT. Here's the same query, refactored to use a
LATERAL derived table:
SET @userID = 1; SELECT m.id, m.name, -- Gather the aggregate data from the derived stats table. stats.commentCount, stats.minCreatedAt, stats.maxCreatedAt FROM member m INNER JOIN -- Generate a DERIVED TABLE with multiple aggregations relating to the -- commenting for the given user. This allows us to run the related query -- "once for the member" instead of "once for each aggregate". LATERAL ( SELECT COUNT( * ) AS commentCount, MIN( c.date_created ) AS minCreatedAt, MAX( c.date_created ) AS maxCreatedAt FROM blog_comment c -- Since we are using a LATERAL derived table, it means that this inner -- query is run once per outer-query row. Which means, we can now -- reference the row in the outer-query in order to perform the -- necessary filtering. WHERE c.member_id = m.id ) AS stats ON m.id = @userID -- Filter OUTER QUERY to given member. ;
As you can see, I included the
LATERAL keyword just prior to my derived table calculation. This changes the derived table from one that must be constant within the query execution plan to one that will execute once per row in the outer query. This means that our derived table query can reference the outer query which allows us to simplify:
Remove the duplicated filtering reference,
@userID, in the derived table query - we can now limit the query based on the condition,
c.member_id = m.id.
stats.member_id = m.id. Since we know that our
LATERALderived table query is running once per row, there's no need to limit the relationship within the
ONclause - the derived table SQL is already applying the
c.member_idcolumn from the derived table query. Since we no longer need to use it within the
ONclause, we no longer need to select the column.
Since we are able to remove the non-aggregate column (
c.member_id) from the derived table query, we are also able to remove the
GROUP BYand just run the aggregates (
MAX) on the entirety of the derived recordset.
Even though this query is relatively low in complexity - we're filtering based on a single ID, using the
LATERAL derived table is already simplifying the structure of the SQL. The more complex the query gets - and the more filtering it uses - the more savings we'll see with this new type of per-row cross-product.
LATERAL Derived Tables and Performance
The nice thing about a traditional derived table is that it has to be "constant" within the query execution plan. Which means, MySQL can run it once and then cache it for the rest of the query. A
LATERAL derived table, on the other hand, can't be cached since it is executed once per row of the outer query. This has some performance implications.
But, that doesn't mean that the performance implications are bad. Consider the use-case problem that the
LATERAL derived table is solving for. If you look back up at the first SQL statement that we were executing, we're basically doing the same thing: we're generating a derived table "for each row" in the outer query. Only, instead of doing this with a row-specific reference, we're duplicating the filtering conditions inside the derived table.
So, it's not like the
LATERAL derived table is doing any additional "work" - it's just simplifying the SQL statement that we have to write.
LATERAL Derived Tables Can Return Multiple Rows
For me, the obvious use-case for this is returning multiple aggregates per row without having to rerun the related queries. But, there's nothing about the
LATERAL derived table specification that says the derived table can only return one row. If the inner query returns multiple rows, each derived row will be joined to the outer row the same way any
JOIN product works in MySQL.
LATERAL derived table looks like it will be a MySQL feature that I'll enjoy using. I'm already running SQL queries like this, with derived stats. But, throwing
JOINs into the mix is just going to make the SQL easier to write and to reason about. I'm pretty excited about this one.
Want to use code from this post? Check out the license.
Hi Ben. This looks interesting.
What happens if you take out the LATERAL keyword and just use the clause, inside parentheses, as a traditional SubQuery?
How will the results differ?
It will throw an error saying that the
m.id column in the
WHERE c.member_id = m.id
... is an unknown column. I believe the derived query is kind of factored-out and executed ahead of time. So, in a traditional derived table, there's no
member table context.
OK. I see. That's really cool.
So with LATERAL, you can access the outer table from within the INNER JOIN SubQuery?
That really opens up some nice possibilities! 👏🏻
Unfortunately, I am still on MySQL 5.6 😂
By the way, I am loving the edit feature on these comments! 👏🏻
You can also use
LATERAL to move calculated column data into a
LATERAL statement (at least in Postgres, haven't tried in MySQL).
SELECT a.total_bill FROM billing_table bt LATERAL (bt.previous_bill - bt.current_bill as total_bill) a WHERE a.total_bill > 200
SELECT bt.previous_bill - bt.current_bill as total_bill FROM billing_table bt WHERE bt.previous_bill - bt.current_bill > 200
which removes duplicate logic when using calculated data in multiple places
Much cleaner. I like it 👍🏻
Oh, super interesting!! I'll have to try that after work (currently only have MySQL 5.7 running). I don't think I saw anything in the documentation about using
LATERAL without a derived table; but, that could be because I was in the derived table documentation :D I'll have to see.
Thanks! The editing stuff has been really nice! I find so many little mistakes right after I post 🤣
Ok so I checked it out and Yes you can do this in MySQL. The premise of a
LATERAL JOIN is that you are able to use row-level column data in a JOIN. Typically this would be to join another table but it can just as easily work as a temporary table. Here is a real-world example:
SELECT b.id as blog_id, age_in_years, date( created_on + (age_in_years + 1) * interval '1 year' ) AS next_anniversary, date( created_on + (age_in_years + 1) * interval '1 year' ) - date(now()) AS days_to_next_anniversary FROM blog b CROSS JOIN LATERAL ( SELECT cast( extract(YEAR FROM age(now(), b.created_on)) AS int ) AS age_in_years ) AS t ORDER BY blog_id
which I took from this post: https://vladmihalcea.com/sql-lateral-join/
Ahh, I see - so you would use the
SELECT to create a temp-table out of some calculated values. That makes sense! Thanks for the tip - I don't think this would have occurred to me.