Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.
I am the chief technical officer at InVision App, Inc - a prototyping and collaboration platform for designers, built by designers. I also rock out in JavaScript and ColdFusion 24x7.
Meanwhile on Twitter
Loading latest tweet...
Ben Nadel at the jQuery Conference 2009 (Cambridge, MA) with:

Selecting Top X From Each Group

By Ben Nadel on
Tags: ColdFusion, SQL

Over the weekend, I was talking to my friend who is building a fantasy sports site. In this site, he has different leagues, and within each league, many members. He asked me how to build a query that selects the top 5 performing members from each league. This is one of those queries that always FEELS like it should be really easy, but the moment you try to build one, it just never comes naturally.

At first, I wanted to try some sort of GROUP BY, right? I mean, basically, we want to group by each league and then select the top 5 from each of those groups. This is the kind of thinking that makes it seem easy, and then turns out to be difficult. We can't use any GROUP BY clause, because we don't want to totally collapse each group (league), we want to extract a few rows from it.

Then, I thought maybe I could apply a pivot table. But, that doesn't work because you don't want to apply the pivot to the whole table; you would kind of want to apply the pivot table to each ordered group within the entire table. Not gonna happen.

After a lot of chin scratching and Googling, I really only came up with two solutions. Well, four solutions, but the latter three are really variations on the same solution with different levels of dynamics. I have to say that none of these solutions feels elegant to me and none of them feels efficient. Frankly, I am surprised that this is such a hard query to run.

To demonstrate what I came up with, I will first build a girl table and then populate it:

  • <!--- Create SQL population scripts. --->
  • <cfsavecontent variable="strInsertSQL">
  • <!--- Create our girls table. --->
  • DECLARE @girl TABLE (
  • id INT IDENTITY( 1, 1 ),
  • name VARCHAR( 30 ),
  • hair VARCHAR( 10 ),
  • score TINYINT
  • );
  •  
  •  
  • <!---
  • Populate girls table. The ID column will populate and
  • increment automatically as records are inserted.
  • --->
  • INSERT INTO @girl
  • (
  • hair,
  • name,
  • score
  • )(
  • SELECT 'Brunette', 'Kim', 8 UNION ALL
  • SELECT 'Brunette', 'Anne', 7 UNION ALL
  • SELECT 'Brunette', 'Sarah', 10 UNION ALL
  • SELECT 'Brunette', 'Deborah', 9 UNION ALL
  • SELECT 'Brunette', 'Mia', 5 UNION ALL
  • SELECT 'Brunette', 'Samantha', 0 UNION ALL
  • SELECT 'Blonde', 'Jo Ann', 7 UNION ALL
  • SELECT 'Blonde', 'Katie', 8 UNION ALL
  • SELECT 'Blonde', 'Becca', 9 UNION ALL
  • SELECT 'Blonde', 'Mini', 5 UNION ALL
  • SELECT 'Blonde', 'Lauren', 4 UNION ALL
  • SELECT 'Blonde', 'Kit', 10
  • );
  • </cfsavecontent>

I am building the SQL temp table and populating it within a ColdFusion CFSaveContent tag so that I can easily include it in the rest of my test queries without having to retype it each time. As you can see from the build SQL, we are creating a temp table, @girl, which has columns for ID, name, hair, and score. For the rest of this post, we are going to consider Hair to be our "category"; we want to select the top 3 rated girls from each hair category (where score is the rating value).

The first solution that I came up with was also the most simple, but I also believe, the least efficient:

  • <!--- Get the top rated girls from each category. --->
  • <cfquery name="qTop3Girl" datasource="#REQUEST.DSN.Source#">
  • <!--- Create and populate girls table. --->
  • #PreserveSingleQuotes( strInsertSQL )#
  •  
  • <!--- Select top 3 from each hair. --->
  • SELECT
  • g.id,
  • g.name,
  • g.hair,
  • g.score
  • FROM
  • @girl g
  • WHERE
  • <!---
  • Make sure that the ID of the current row
  • is in the top 3 records of the same category
  • (hair color).
  • --->
  • g.id IN
  • (
  • SELECT TOP 3
  • ghair.id
  • FROM
  • @girl ghair
  • WHERE
  • ghair.hair = g.hair
  • ORDER BY
  • ghair.score DESC
  • )
  • ;
  • </cfquery>
  •  
  •  
  • <!--- Dump out results. --->
  • <cfdump
  • var="#qTop3Girl#"
  • label="Select TOP 3 From Girls By Group (Hair)"
  • />

In this solution, as we query the table, we check each row to see if the ID of the current girl with the current hair color is in the TOP 3 ids of the group with the same hair color. Again, this solution is fairly simple (especially for our example), but it is hugely inefficient; we have to run a sub-query for each row in the master table. If the master table had hundreds of thousands of rows, I could only assume this would be significantly slower, even with good indexes.

Inelegant, but it works. Running the above code, we do get the top 3 scorers from each category:


 
 
 

 
SELECT TOP X FROM GROUP - top 3 girls by hair color  
 
 
 

Then, I thought, If only I could just grab the IDs that I needed, I could use those to get the target records. In the past, I have been a huge fan of creating temporary SQL tables to hold IDs and then performing an INNER JOIN to that table. It just works nicely and, most of the time, it seems to be fairly efficient.

The problem that I quickly ran into with this technique is that you really have the same exact problem - getting the top 3 performers from each category. The only difference is that this time, I am trying to get only the IDs rather than all of the columns. Basically, the same problem, right? Well, I thought it was worth exploring because maybe it would allow me to hit upon some eureka moment:

  • <!--- Get the top rated girls from each category. --->
  • <cfquery name="qTop3Girl" datasource="#REQUEST.DSN.Source#">
  • <!--- Create and populate girls table. --->
  • #PreserveSingleQuotes( strInsertSQL )#
  •  
  •  
  • <!---
  • Create a new table to hold JUST the ids of the valid
  • IDs (the top 3 rated girls from each category).
  • --->
  • DECLARE @valid TABLE (
  • id INT
  • );
  •  
  •  
  • <!---
  • Populate the valid ID table with IDs from each of the
  • categories. For the time being, we are hardcoding the
  • values of the categories.
  • --->
  • INSERT INTO @valid
  • (
  • id
  • )(
  •  
  • <!--- Get the top Brunettes. --->
  • (
  • SELECT
  • t.id
  • FROM
  • (
  • SELECT TOP 3
  • g.id,
  • g.score
  • FROM
  • @girl g
  • WHERE
  • g.hair = 'Brunette'
  • ORDER BY
  • g.score DESC
  • ) AS t
  • )
  •  
  • UNION ALL
  •  
  • <!--- Get the top Blondes. --->
  • (
  • SELECT
  • t.id
  • FROM
  • (
  • SELECT TOP 3
  • g.id,
  • g.score
  • FROM
  • @girl g
  • WHERE
  • g.hair = 'Blonde'
  • ORDER BY
  • g.score DESC
  • ) AS t
  • )
  • );
  •  
  •  
  •  
  • <!---
  • Select top 3 from each hair category. We can do
  • this by using an INNER JOIN to the valid ID table
  • which should only contain the top 3 performers from
  • each category.
  • --->
  • SELECT
  • g.id,
  • g.name,
  • g.hair,
  • g.score
  • FROM
  • @girl g
  • INNER JOIN
  • @valid v
  • ON
  • g.id = v.id
  • ;
  • </cfquery>

This solution is a lot more wordy in terms of what needs to get done. And, on top of that, SQL Server has all kinds of odd restrictions as to where and when ORDER BY clauses can be used in sub queries. As you can see, my individual UNION ALL queries are actually queries of sub queries. This is done purely because SQL Server would not allow me to use the ORDER BY clause in the top level UNION ALL query. Not sure why - seems quite arbitrary to me.

Now, this is a much bigger solution, but I feel that it is much more efficient (at least from my totally uneducated SQL point of view). Instead of checking each row of our master query against a sub query, we are first collecting JUST the IDs that we need and then joining them to our master query. Like I said, this feels more efficient, but I don't know for sure.

To get this solution to work, one of the things I had to do was hard code the names of the categories. Clearly, this was easy for our situation, but I want to avoid having to do this, since most of the time, our categories are going to be database driven and not known at the time the query is written.

To get around this fact, I am using a little ColdFusion magic and splitting the solution across two queries. The first query just gets all the distinct category names (hair colors). The second query then loops over the first query to build our temp ID table as described in the solution above:

  • <!--- Get unique hair categories. --->
  • <cfquery name="qHair" datasource="#REQUEST.DSN.Source#">
  • <!--- Create and populate girls table. --->
  • #PreserveSingleQuotes( strInsertSQL )#
  •  
  • <!--- Select distinct hair categories. --->
  • SELECT DISTINCT
  • g.hair
  • FROM
  • @girl g
  • ;
  • </cfquery>
  •  
  •  
  • <!--- Get the top rated girls from each category. --->
  • <cfquery name="qTop3Girl" datasource="#REQUEST.DSN.Source#">
  • <!--- Create and populate girls table. --->
  • #PreserveSingleQuotes( strInsertSQL )#
  •  
  •  
  • <!---
  • Create a new table to hold JUST the ids of the valid
  • IDs (the top 3 rated girls from each category).
  • --->
  • DECLARE @valid TABLE (
  • id INT
  • );
  •  
  •  
  • <!---
  • Populate the valid ID table with IDs from each of the
  • categories. This time, in stead of hardocding the hair
  • categories, we are going to use a ColdFusion loop to
  • select from each category.
  • --->
  • <cfloop query="qHair">
  •  
  • INSERT INTO @valid
  • (
  • id
  • )(
  • SELECT
  • t.id
  • FROM
  • (
  • SELECT TOP 3
  • g.id,
  • g.score
  • FROM
  • @girl g
  • WHERE
  • g.hair = <cfqueryparam value="#qHair.hair#" cfsqltype="cf_sql_varchar" />
  • ORDER BY
  • g.score DESC
  • ) AS t
  • );
  •  
  • </cfloop>
  •  
  •  
  • <!---
  • Select top 3 from each hair category. We can do
  • this by using an INNER JOIN to the valid ID table
  • which should only contain the top 3 performers from
  • each category.
  • --->
  • SELECT
  • g.id,
  • g.name,
  • g.hair,
  • g.score
  • FROM
  • @girl g
  • INNER JOIN
  • @valid v
  • ON
  • g.id = v.id
  • ;
  • </cfquery>

This solution is about the same size as the previous one - the hard coded category query was replaced by the DISTINCT hair query. Basically, this is the same solution but with dynamic categories rather than hard coded ones.

After going back over the previous two solutions, I realized something: why even bother getting an intermediary temp table for the IDs? Think about it? There's no complicated logic that goes between the ID table and the @girl table; I'm just joining one to the other to get all the appropriate columns. Well, why not just move the whole query into the UNION ALL clauses?

In this last solution, I am scrapping the concept of the temporary ID table and simply creating a UNION ALL of the top 3 girls from each category in which each category gets its own sub query:

  • <!--- Get unique hair categories. --->
  • <cfquery name="qHair" datasource="#REQUEST.DSN.Source#">
  • <!--- Create and populate girls table. --->
  • #PreserveSingleQuotes( strInsertSQL )#
  •  
  • <!--- Select distinct hair categories. --->
  • SELECT DISTINCT
  • g.hair
  • FROM
  • @girl g
  • ;
  • </cfquery>
  •  
  •  
  • <!--- Get the top rated girls from each category. --->
  • <cfquery name="qTop3Girl" datasource="#REQUEST.DSN.Source#">
  • <!--- Create and populate girls table. --->
  • #PreserveSingleQuotes( strInsertSQL )#
  •  
  • <!---
  • Select top 3 from each hair category and then
  • union them together into one result set.
  • --->
  • <cfloop query="qHair">
  •  
  • (
  • SELECT
  • *
  • FROM
  • (
  • SELECT TOP 3
  • g.id,
  • g.name,
  • g.hair,
  • g.score
  • FROM
  • @girl g
  • WHERE
  • g.hair = <cfqueryparam value="#qHair.hair#" cfsqltype="cf_sql_varchar" />
  • ORDER BY
  • g.score DESC
  • ) AS t
  • )
  •  
  • <!--- Check for UNION ALL clause. --->
  • <cfif (qHair.CurrentRow LT qHair.RecordCount)>
  •  
  • UNION ALL
  •  
  • </cfif>
  •  
  • </cfloop>
  • </cfquery>

I feel like this last solution is really the best of all worlds. It's fairly short and fairly straightforward; but, at the same time, it's going to be pretty efficient as it doesn't run comparative sub queries (like our first solution did).

If anyone has a better solution, I would love to hear about it. I still feel like there should be something built into SQL that allows for this type of query, as I think this is something that people need to do a lot.

Tweet This Deep thoughts by @BenNadel - Selecting Top X From Each Group Thanks my man — you rock the party that rocks the body!



Reader Comments

Try this:

  • SELECT
  • c.*,
  • d.ranknum
  • FROM
  • @girl AS c
  • INNER JOIN
  • (
  • SELECT
  • a.id,
  • COUNT(*) AS ranknum
  • FROM
  • @girl AS a
  • INNER JOIN
  • @girl AS b
  • ON
  • (a.hair = b.hair)
  • AND
  • (a.score <= b.score)
  • GROUP BY
  • a.id
  • HAVING
  • COUNT(*) <= 3
  • ) AS d
  • ON
  • (c.id = d.id)

The secret is the self-join and the HAVING. The "a" table joins to the "b" table by COUNT()ing all of the girls from "b" that have a higher or equal score, thus ranking them. The HAVING then filters it out so that you only get the top 3 for each. You then join back to the original table to get all of the relevant other info.

(Technically, you can do this with just the subquery by creative use of aggregate functions, but I find it's actually faster to do this way in most cases unless you only want a very few columns.)

HTH,
-R

Reply to this Comment

@Rick,

As always, your SQL advice blows my mind. I don't quite understand it yet, but I just ran it, and I know that it works. I am having a little trouble visualizing what the sub query does. I think once I run that on it's own, and take away things like the HAVING clause, I will start to get a better feeling for how this is being built.

Very cool though.

Reply to this Comment

P.S. - There's a great zoo (O'Reilly) book called "Transact-SQL Cookbook" by Jonathan Gennick and Ales Spetic. It's chock full of really great answers and explanations of a lot of things that seem like really tricky SQL problems. It taught me more about SQL theory than any other book I've read.

This type of query, for example, is in section 2.12 of that book.

http://www.amazon.com/Transact-SQL-Cookbook-OReilly-Windows-Spetic/dp/1565927567/
or
http://www.oreilly.com/catalog/transqlcook/index.html

(And don't let the title fool you. Although it is geared towards T-SQL, instead of PL/SQL or something else, pretty much every solution also presents the "standard SQL" alternative. It may not be as succinct as the T-SQL solution, but at least it's there and you can see the differences.)

-R

Reply to this Comment

I'll whip up a quick blog post to detail how it works. I'm sure you're not the only one who would be curious.

Reply to this Comment

Maybe I'm overlooking something, but wouldn't this query do what you're looking for?

SELECT *
FROM @girl g1
WHERE id IN ( SELECT TOP(3) id FROM @girl g2 WHERE g1.hair = g2.hair ORDER BY Score DESC )
ORDER BY Hair, Score DESC

Reply to this Comment

@Jason,

Yep. That's what my first solution above was. My concern was that this would have to perform a sub-query for each record in the master table. No problem for a small record set, but this could be trouble for a large table.

Reply to this Comment

Anecdotally, I just created a version of the "girl" table with 30,000 records and random scores. The GROUP/HAVING solution took 43 seconds, and I had to kill the TOP version of the query after it ran 3+ minutes.

Reply to this Comment

However ... (yeah, apparently I'm in a comment-spam kind of mood today), indexing really, REALLY speeds up the TOP version of the query.

CREATE INDEX girl_all ON girl (hair, score, id)
/* That's the dirtiest-sounding index I think I've ever created. */

That reduces the TOP query time to under a second. Oddly enough, it also seems to throw off the query optimizer, as the GROUP/HAVING query now runs significantly longer.

So there you go -- proof that the best of theories can be thrown off by practice, and that looking at your indexes is always worth the time and effort.

Reply to this Comment

@Rick,

Sometimes, a little girl ON girl index action is really all that's needed at the end of the day ;)

Seriously though, thank you for your tremendous help and exploration of this idea. I can tell you that in my Googling, I didn't find anything half as helpful as our little conversation.

Reply to this Comment

I am the friend building the fantasy "sports" (more like sports entertainment) site.

I took the query that Rick wrote and was able to modify it to work for my system. It's a bit more intense since I don't just have the one convenient table, so I figured I'd share it, although it is essentially the same query. Thanks for the query, Rick!

  • SELECT
  • fw_league_id,
  • SUM(points) AS total_points,
  • SUM(num_correct) AS total_num_correct
  • FROM
  • (
  • SELECT
  • u.name,
  • u.fw_league_id,
  • d.ranknum,
  • t.points,
  • t.num_correct
  • FROM
  • full_fw_user u
  • INNER JOIN
  • fw_user_total t
  • ON
  • u.fw_user_id = t.fw_user_id
  • INNER JOIN
  • (
  • SELECT
  • a.fw_user_id,
  • COUNT(*) AS ranknum
  • FROM
  • (
  • SELECT
  • u.fw_user_id,
  • u.fw_league_id,
  • t.points,
  • t.num_correct
  • FROM
  • full_fw_user u
  • INNER JOIN
  • fw_user_total t
  • ON
  • t.fw_user_id = u.fw_user_id
  • ) AS a
  • INNER JOIN
  • (
  • SELECT
  • u.fw_user_id,
  • u.fw_league_id,
  • t.points,
  • t.num_correct
  • FROM
  • full_fw_user u
  • INNER JOIN
  • fw_user_total t
  • ON
  • t .fw_user_id = u.fw_user_id
  • ) AS b
  • ON
  • (
  • a.fw_league_id = b.fw_league_id
  • AND
  • a.points <= b.points
  • )
  • GROUP BY
  • a.fw_user_id
  • HAVING COUNT(*) <= 15
  • ) d
  • ON
  • u.fw_user_id = d .fw_user_id
  • ) t
  • GROUP BY
  • fw_league_id

I do have one question though - I'd like the TOP 15 results that I get back per league to be ordered first by points DESC, then by num_correct DESC. Any idea on how to add that into the above query? This is because two users from one league could have the same points, but different values for num_correct, and if they are the 15th and 16th users for the league, I might not get what I want for the SUM(num_correct).

Reply to this Comment

Just wanted to thank you guys for documenting and explaining all of this - I would have never figured it out without your help. You turned what I thoguht was going to be a logn weekend of searching SQL books and randomly trying stuff - to one I will be thinking about girl_all ON girl Indexes...

Reply to this Comment

For SQL Server, ROW_NUMBER() allows for a single scan of the table, which should provider better performance on larger tables, especially when filtering non-indexed columns versus a join/insert method.

SELECT id, name, hair, score, ranknum
FROM
(
SELECT id, name, hair, score, ROW_NUMBER() OVER ( PARTITION BY hair ORDER BY score DESC) ranknum
) g
WHERE ranknum <= 3

Reply to this Comment

@Jaden,

I've heard ROW_NUMBER() is really useful. I recently got SQL 2005 (where it first became available I think). I should play around with it some more. I've been using MySQL mostly these days for work.

Reply to this Comment

Super late to the party here, but I just came across your post, and later I separately found an MSSQL2005-specific solution that I thought might be useful to add to the conversation. It may be just like one of the other solutions posted; in this case the data's all in one table.

Here, I have a single table containing tweets. These tweets are pulled into the table using a scheduled task, from several twitter accounts. Today I needed to create some output where I'm only showing the top 5 most recent tweets from each account. The solution I came up with was:

SELECT *
FROM

(
SELECT id,
twitterAccountId,
status_created_at,
status_id,
status_text,
user_id,
user_name,
user_screen_name,
user_profile_image_url,
RANK() OVER (PARTITION BY twitterAccountId ORDER BY status_id DESC) AS rank
FROM Tweet) AS rs

WHERE Rank <= 5

This depends on twitter always assigning ever-increasing integer ID's to their tweets in order to find the "most recent" ones (which I wouldn't anticipate to be a problem...) but I could also sort by the status_created_at field in the PARTITION clause.

And, I'm not sure whether I should be using RANK() or simply ROW_NUMBER() as suggested in another comment -- but this query structure really came together quickly in my own head. If this helps anyone, great... if it's utterly wrong/unscalable/bad, I'd love to hear about that too.

I also came across NTILE() which is really cool too:

http://www.databasejournal.com/features/mssql/article.php/3780311/Exploring-SQL-2005s-Ranking-Functions--NTILE-and-ROWNUMBER.htm

HTH

Marc

Reply to this Comment

The posts above were very helpful but I have one other twist that perhaps you guys can help with.

I want to select the top 100 records but keep the amount from each group as evenly distributed as possible.

I was able to get to 100 by just increasing or decreasing the rank selection, but is there a piece of code that I can write to accomplish that for me?

Reply to this Comment

Finding TOP X records from each group

SQL> select * from emp e
2 where e.empno in (select d.empno from emp d
3 where d.deptno=e.deptno and rownum<3)
4 order by deptno
5 ;

EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
---------- ---------- --------- ---------- --------- ---------- ---------- ----------
7782 CLARK MANAGER 7839 09-JUN-81 2450 10
7839 KING PRESIDENT 17-NOV-81 5000 10
7369 SMITH CLERK 7902 17-DEC-80 800 20
7566 JONES MANAGER 7839 02-APR-81 2975 20
7499 ALLEN SALESMAN 7698 20-FEB-81 1600 300 30
7521 WARD SALESMAN 7698 22-FEB-81 1250 500 30

6 rows selected.

Reply to this Comment

@MCouvi

Here are two examples of selecting the top n rows, with rows distributed evenly across the groups.

These tables are from the MS AdventureWorksLT sample database. SQL Server 2005+ is required, since it uses a CTE and ROW_NUMBER().

This query selects across 3 groups (CountryRegion).

DECLARE @Rows INT
SELECT @Rows = 100

;WITH Q AS
(
SELECT A.StateProvince, A.CountryRegion,
ROW_NUMBER() OVER (PARTITION BY A.CountryRegion ORDER BY A.StateProvince, A.CountryRegion) GrpRow
FROM SalesLT.Address A
)
SELECT TOP(@Rows) Q.*
FROM Q
WHERE GrpRow <= 1 + CEILING(@Rows * 1.0 / ( SELECT COUNT(DISTINCT CountryRegion) FROM Q))

This query selects across 30+ groups (ProductCategory).

DECLARE @Rows INT
SELECT @Rows = 100

;WITH Q AS
(
SELECT P.Name PName, PC.Name PCName,
ROW_NUMBER() OVER (PARTITION BY PC.Name ORDER BY PC.Name, P.Name) GrpRow
FROM SalesLT.Product P
INNER JOIN SalesLT.ProductCategory PC ON P.ProductCategoryID = PC.ProductCategoryID
)
SELECT TOP(@Rows) Q.*
FROM Q
WHERE GrpRow <= 1 + CEILING(@Rows * 1.0 / ( SELECT COUNT(DISTINCT PCName) FROM Q))

Reply to this Comment

Post A Comment

You — Get Out Of My Dreams, Get Into My Comments
Live in the Now
Oops!
Comment Etiquette: Please do not post spam. Please keep the comments on-topic. Please do not post unrelated questions or large chunks of code. And, above all, please be nice to each other - we're trying to have a good conversation here.