Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.
I am the chief technical officer at InVision App, Inc - a prototyping and collaboration platform for designers, built by designers. I also rock out in JavaScript and ColdFusion 24x7.

SQL Pivot Tables Are Rockin' Awesome (Thanks Rick Osborne)

By Ben Nadel on
Tags: SQL

A few months ago, Rick Osborne introduced me to the idea of a SQL database pivot table. A pivot table, in this sense, is just a table with a single indexed ID column whose values count up from one (I'm sure there are variations on this) to some set number. For instance, the pivot table [pivot100] has an ID column with values from 1 to 100. Rick told me that he has tables that go to 100, 1,000, and 10,000, but so far I have only rocked the 100 pivot table.
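For illustration, here is a minimal sketch of building such a pivot (numbers) table. The post uses SQL Server and ColdFusion; I am using Python and SQLite here purely for demonstration, and only the pivot100 name comes from the article:

```python
import sqlite3

# A minimal sketch of a pivot (numbers) table. The post uses SQL Server +
# ColdFusion; SQLite is used here purely for illustration. Only the name
# "pivot100" comes from the article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pivot100 ( id INTEGER PRIMARY KEY )")

# Populate it with the integers 1 through 100.
conn.executemany(
    "INSERT INTO pivot100 ( id ) VALUES ( ? )",
    [(i,) for i in range(1, 101)],
)

row_count, min_id, max_id = conn.execute(
    "SELECT COUNT(*), MIN(id), MAX(id) FROM pivot100"
).fetchone()
print(row_count, min_id, max_id)  # 100 1 100
```

Once such a table exists, any query that needs a sequence of numbers (buckets, ranks, date offsets) can join against it instead of looping.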

So anyway, this pivot table just cut a HUGE, NASTY, UGLY SQL query from about 90ms down to about 20ms. That is an outstanding performance increase. This query was huge and nasty and for reporting, so it did a ton of things, but I just wanted to show you a small part of it. It is for a recycling program that has a pickups table [pickup] and a join table from users to pickups [pickup_useraccount_jn]. As PART of this massive query, I had to find the number of users whose net weight of recycling over a certain date period fell into a given 5 lb bucket (i.e. 0-5 lbs, 5-10 lbs, etc.) all the way up to 80 (which was 80+, NOT 80-85 lbs).

First, let's look at the original SQL query. You can assume that all the SQL variables have already been defined and that lstAvailableUsers is a list of User IDs to be included in the report (I know that is lame... I am overhauling everything):

<!---
	Now, we have to loop over the 5 lb increments and find
	the number of users that recycled a net pickup weight
	in that sum category.
--->
<cfloop
	index="i"
	from="0"
	to="80"
	step="5">

	SET @number_#i#_#(i + 5)# = ISNULL(
		(
			SELECT
				COUNT( * )
			FROM
				(

					SELECT
						SUM( p.net_weight ) AS net_weight
					FROM
						pickup p
					WHERE
						p.date_pickup >= @date_from
					AND
						p.date_pickup < @date_to
					AND
						p.user_id IN ( #lstAvailableUsers# )

					GROUP BY
						p.user_id

				) AS TempWeightTable#i#
			WHERE
				net_weight >= #i#

			<!--- Check for the catch-all weight bucket. --->
			<cfif (i LT 80)>
			AND
				net_weight < #(i + 5)#
			</cfif>
		),

		<!--- If above is null, return zero. --->
		0
	);

</cfloop>

As you can see, I am basically running that inner query 17 times, stepping from 0 to 80. That is a SLOW query to be running so many times. That's where the pivot table comes into play. Well, not just the pivot table, but a pivot table and a temporary table variable. To clean up the query, I start out by creating a table variable for the steps (0-5, 5-10, etc):

DECLARE @number_step_5 TABLE (
	step INT,
	user_count INT
);

Then, I go about populating that table using the help of the SQL pivot table:

<!---
	Set up the step-5 table. This will step from 0 to 80,
	getting counts of users who recycled a sum weight in
	that 5 lb weight bucket.
--->
INSERT INTO @number_step_5
(
	step,
	user_count
)(
	SELECT
		tpiv.id AS step,
		COUNT( * )
	FROM
		<!---
			Take the pivot100 table and trim it so that we
			are only returning pivot IDs that are a 5 lb
			increment. Also, we need to subtract one from our
			1-based ID column as our weight buckets
			start at zero.
		--->
		(

			SELECT
				( piv.id - 1 ) AS id
			FROM
				pivot100 piv
			WHERE
				piv.id < 82
			AND
				( ( piv.id - 1 ) % 5 ) = 0

		) AS tpiv
	INNER JOIN
		<!---
			Create an in-memory temp table that has all the
			different net recycling weight sums by user. We
			don't care about the user at this point, just the
			total list of sums.
		--->
		(

			SELECT
				SUM( p.net ) AS net_sum
			FROM
				pickup p
			INNER JOIN
				pickup_useraccount_jn pujn
			ON
				(
					p.id = pujn.pickup_id
				AND
					p.date_picked_up BETWEEN @date_from AND @date_to
				AND
					pujn.useraccount_id IN ( #lstAvailableUsers# )
				)
			GROUP BY
				pujn.useraccount_id

		) AS tsum
	ON
		<!---
			When joining the sum table to the pivot table,
			we want the sum to be between the initial pivot
			value (our start value) and the increment value
			(5 lbs). However, the last bucket is a catch-all
			for 80 lbs to infinity... which we will get by
			using 999999 (good enough).
		--->
		(
			tsum.net_sum >= tpiv.id
		AND
			tsum.net_sum <
			(
				CASE
					WHEN tpiv.id < 80
					THEN ( tpiv.id + 5 )
					ELSE 999999
				END
			)
		)
	GROUP BY
		tpiv.id
);

Now, this might not be the easiest query to follow, but let me tell you that it works, and it is quite a bit faster than the original query. Notice that the BIG chunk of the query (the INNER JOIN temp table) is only being run once (and then joined to) as opposed to the 17 times it was being run in the original query. But, we are not quite done yet. Now that I have this temp table populated, I still have to set the variables that I set in the original query:
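To see the shape of this join in miniature, here is a toy version in Python/SQLite. The schema is simplified from the article (no user join table, no date filter), so treat it as a sketch of the technique rather than the real query:

```python
import sqlite3

# A toy version of the pivot-table bucketing join, in SQLite for illustration.
# The schema is simplified from the article (no user join table, no date
# filter), so this only demonstrates the shape of the technique.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pivot100 ( id INTEGER PRIMARY KEY );
    CREATE TABLE pickup ( user_id INTEGER, net_weight REAL );
""")
conn.executemany("INSERT INTO pivot100 VALUES ( ? )", [(i,) for i in range(1, 101)])
conn.executemany(
    "INSERT INTO pickup VALUES ( ?, ? )",
    [(1, 3.0), (1, 4.0), (2, 6.0), (3, 90.0)],  # sums: user 1 -> 7, user 2 -> 6, user 3 -> 90
)

rows = conn.execute("""
    SELECT
        tpiv.id AS step,
        COUNT( * ) AS user_count
    FROM
        -- Trim the pivot table down to the 5 lb increments, 0 through 80.
        ( SELECT ( piv.id - 1 ) AS id
          FROM pivot100 piv
          WHERE piv.id < 82 AND ( ( piv.id - 1 ) % 5 ) = 0 ) AS tpiv
    INNER JOIN
        -- One net-weight sum per user; the expensive part, run only once.
        ( SELECT SUM( p.net_weight ) AS net_sum
          FROM pickup p
          GROUP BY p.user_id ) AS tsum
    ON (
        tsum.net_sum >= tpiv.id
        AND tsum.net_sum <
            ( CASE WHEN tpiv.id < 80 THEN ( tpiv.id + 5 ) ELSE 999999 END )
    )
    GROUP BY tpiv.id
    ORDER BY tpiv.id
""").fetchall()
print(rows)  # [(5, 2), (80, 1)]
```

Two users summed into the 5-10 lb bucket and one landed in the 80+ catch-all; empty buckets simply produce no row, which is why the ISNULL() step below is still needed.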

<!---
	Now, we have to loop over the increments and find the
	number of users that recycled a net weight in the
	given bucket.
--->
<cfloop
	index="i"
	from="0"
	to="80"
	step="5">

	SET @number_#i#_#(i + 5)# = ISNULL(
		(
			SELECT
				user_count
			FROM
				@number_step_5
			WHERE
				step = #i#
		),
		0
	);

</cfloop>

This queries the temp table to find the value for that weight group. Since there might not be any pickups in that weight group, I must use an ISNULL() function call. The genius here, though, is that the temp table [@number_step_5] should never have more than 17 rows in it, and most likely less than that. That means that each of these subsequent queries that sets a variable will execute almost immediately with such a minimal amount of data to search.
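The fallback behavior is easy to see in miniature. Here is a sketch in Python/SQLite, where SQL Server's ISNULL() is spelled COALESCE() (or IFNULL()):

```python
import sqlite3

# A miniature of the ISNULL() fallback: a bucket with no row should come back
# as zero, not NULL. SQLite spells the fallback COALESCE() (or IFNULL()).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE number_step_5 ( step INTEGER, user_count INTEGER )")
conn.execute("INSERT INTO number_step_5 VALUES ( 5, 2 )")

query = """
    SELECT COALESCE(
        ( SELECT user_count FROM number_step_5 WHERE step = ? ),
        0
    )
"""
hit = conn.execute(query, (5,)).fetchone()[0]
miss = conn.execute(query, (10,)).fetchone()[0]
print(hit, miss)  # 2 0
```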

Pivot tables are totally awesome! Thanks Rick, you helped me rock this SQL query hard core!




Reader Comments

I just learned of pivots today, and found this post when looking for the MSDN reference. I don't know how I missed it the first time around, but I thought it was cool that you were in the top 10 for my search.


I know this is an older post, but is still relevant. Have you looked into OLAP databasing?

With OLAP, all the aggregation is done in advance in a data cube and is generally refreshed each time you update the OLAP database (eg: nightly, weekly, etc).

I am not sure about other OLAP products, but I know that ColdFusion can talk to MS SQL Server's Analysis Services and its cubes directly.


There is no physical difference in the database itself but more to do with how you structure it.

The power of OLAP comes from pre-aggregating the data and storing redundant/visual data (like dates and bucket ranges) for joining purposes.

With your above example you could create 3 dimension tables (users, dates and bucket ranges) and a fact table (the aggregation of weights per the 3 dimensions); then change your multiple SQL queries into 1 MDX query.
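As a rough relational sketch of the star schema described above (all table and column names here are hypothetical, and this uses SQLite rather than a real OLAP engine, so it only shows the shape of the dimension/fact layout):

```python
import sqlite3

# A rough relational sketch of a star schema: three dimension tables plus a
# pre-aggregated fact table. All names here are hypothetical; a real OLAP
# setup would build a cube and query it with MDX, not plain SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_user   ( user_id INTEGER PRIMARY KEY, user_name TEXT );
    CREATE TABLE dim_date   ( date_id INTEGER PRIMARY KEY, pickup_date TEXT );
    CREATE TABLE dim_bucket ( bucket_id INTEGER PRIMARY KEY, start_weight INTEGER, end_weight INTEGER );

    -- The fact table sits at the center of the star, keyed by the dimensions.
    CREATE TABLE fact_weight (
        user_id      INTEGER REFERENCES dim_user,
        date_id      INTEGER REFERENCES dim_date,
        bucket_id    INTEGER REFERENCES dim_bucket,
        total_weight REAL
    );
""")
conn.execute("INSERT INTO dim_user VALUES ( 1, 'Ben' )")
conn.execute("INSERT INTO dim_date VALUES ( 1, '2009-04-01' )")
conn.execute("INSERT INTO dim_bucket VALUES ( 1, 5, 10 )")
conn.execute("INSERT INTO fact_weight VALUES ( 1, 1, 1, 7.0 )")

# A report is then a join from the fact table out to its dimensions.
row = conn.execute("""
    SELECT u.user_name, b.start_weight, b.end_weight, f.total_weight
    FROM fact_weight f
    INNER JOIN dim_user u ON u.user_id = f.user_id
    INNER JOIN dim_bucket b ON b.bucket_id = f.bucket_id
""").fetchone()
print(row)  # ('Ben', 5, 10, 7.0)
```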

It is rather difficult to explain OLAP structure in a simple comment, so instead I would suggest having a look at some other articles like http://www.cubeslice.com/session1.htm

There is also an interesting article on SQL vs MDX at http://msdn.microsoft.com/en-us/library/aa216779(SQL.80).aspx


@Andrew,

Wow, OLAP seems pretty interesting. I like the fact that it has like 7 different definitions :)


@Ben,

I know what you mean. When I first got into OLAP (not too long ago), I found it rather difficult to understand with an OLTP mindset. Once I stopped trying to optimize the database structures (in my mind) to suit transactions, it made it a lot easier to understand the benefits.

If you are interested in learning OLAP for MS SQL, the MS press step by step book for analysis services is a good place to start. It doesn't really go into using data cubes online, however a quick Google search on ColdFusion and OLAP gives some examples.


@Ben,

I use it for cases where it will improve performance (which is primarily reporting cases). There is a bit of overhead with setting up the OLAP database and the transfer mechanisms to populate it.

For a small report that can be easily answered with a simple SQL query, it is just not worth it.

However, for something that has a few reports (especially historical, aggregated reports), once the setup is complete, adding extra tables and MDX queries to answer more questions is similar to relational databasing.


@Andrew,

Ah, ok gotcha. So it's not an all-or-nothing approach to database design. I worked one time on a system that had a tremendous amount of data; the technical guy on the client side wanted to start implementing a parallel database for reporting and data mining in something called a "star schema", which I gathered was a denormalized version of the database meant to be super fast for reporting.

Reporting is something that I find quite interesting. I wish I had more opportunity to look into things like this.


@Ben,

Star schema is a database structuring approach to OLAP, and your description also describes OLAP itself.

My previous post on the 19th outlines a star schema example. The fact table with the aggregated values is the center of the star; and the 3 dimension tables joining to the fact table, explaining the fact tables values with their own data, make the outside points of the star.

I used to struggle with having to denormalize my reporting tables inside of my transaction database; mainly because of my attitude towards keeping my data organized and structured (I know this attitude is rather stubborn at times, but meh). Then after discussing a training application I was working on with a friend of mine who does Business Intelligence, he suggested I look into OLAP and data cubes.

Now I separate what I can from my base transactional database, like reporting or archiving (soft-deletes). The thing I realized is, all the data is still held on the file system in the same structure. It is the DBMS that separates it into databases, tables and cells. Plus it is not too hard to do a multi-table or distributed (across different servers) query.


@Andrew,

Ahh, thanks for clearing that up. When I heard about Star Schema, I was really excited to learn more about it; unfortunately, when the project really got underway, we were bogged down with a whole bunch of other deadlines that took priority and I never got to learn more about it.


@Ben,
I used to be more focused on data architecture, and it has really paved the way for how I approach problems and data access in general. Any time that you spend learning about RDBMS theory, OLAP cubes, etc. is time well spent. Consider that Structured Query Language and the relational model have survived for more than 30 years, and it becomes readily apparent that, as a developer, this is truly your most valuable and transferable skill. There's a lot of good information out there and I'd recommend reading anything written by Michael Hernandez, E.F. Codd (the father of the RDBMS), or Chris Date. Here is some very basic info regarding Codd's 12 rules for relational database management systems...
Rule 0: The system must qualify as relational, as a database, and as a management system.

For a system to qualify as a relational database management system (RDBMS), that system must use its relational facilities (exclusively) to manage the database.

Rule 1: The information rule:

All information in the database is to be represented in one and only one way, namely by values in column positions within rows of tables.

Rule 2: The guaranteed access rule:

All data must be accessible. This rule is essentially a restatement of the fundamental requirement for primary keys. It says that every individual scalar value in the database must be logically addressable by specifying the name of the containing table, the name of the containing column and the primary key value of the containing row.

Rule 3: Systematic treatment of null values:

The DBMS must allow each field to remain null (or empty). Specifically, it must support a representation of "missing information and inapplicable information" that is systematic, distinct from all regular values (for example, "distinct from zero or any other number", in the case of numeric values), and independent of data type. It is also implied that such representations must be manipulated by the DBMS in a systematic way.

Rule 4: Active online catalog based on the relational model:

The system must support an online, inline, relational catalog that is accessible to authorized users by means of their regular query language. That is, users must be able to access the database's structure (catalog) using the same query language that they use to access the database's data.

Rule 5: The comprehensive data sublanguage rule:

The system must support at least one relational language that

1. Has a linear syntax
2. Can be used both interactively and within application programs,
3. Supports data definition operations (including view definitions), data manipulation operations (update as well as retrieval), security and integrity constraints, and transaction management operations (begin, commit, and rollback).

Rule 6: The view updating rule:

All views that are theoretically updatable must be updatable by the system.

Rule 7: High-level insert, update, and delete:

The system must support set-at-a-time insert, update, and delete operators. This means that data can be retrieved from a relational database in sets constructed of data from multiple rows and/or multiple tables. This rule states that insert, update, and delete operations should be supported for any retrievable set rather than just for a single row in a single table.

Rule 8: Physical data independence:

Changes to the physical level (how the data is stored, whether in arrays or linked lists etc.) must not require a change to an application based on the structure.

Rule 9: Logical data independence:

Changes to the logical level (tables, columns, rows, and so on) must not require a change to an application based on the structure. Logical data independence is more difficult to achieve than physical data independence.

Rule 10: Integrity independence:

Integrity constraints must be specified separately from application programs and stored in the catalog. It must be possible to change such constraints as and when appropriate without unnecessarily affecting existing applications.

Rule 11: Distribution independence:

The distribution of portions of the database to various locations should be invisible to users of the database. Existing applications should continue to operate successfully:

1. when a distributed version of the DBMS is first introduced; and
2. when existing distributed data are redistributed around the system.

Rule 12: The nonsubversion rule:

If the system provides a low-level (record-at-a-time) interface, then that interface cannot be used to subvert the system, for example, bypassing a relational security or integrity constraint.


There are numerous ways that this problem could've been solved, but here's my take on it. Why not create a table that holds the brackets? This table's data could be updated by users when they want to change the behavior (the number of buckets and the from/to values of the report buckets) of the report, instead of having to call a programmer.

The Bracket table would look like so...
BracketID INT
StartWeight INT
EndWeight INT

Then the entire logic of the report simply changes to something as condensed as...

SELECT
	a.BracketID,
	a.StartWeight,
	a.EndWeight,
	COUNT( t.UserID ) AS UsersInThisBracket
FROM
	ReportBracket a
LEFT JOIN
	( SELECT UserID, SUM( NetWeight ) AS TotalWeight
	  FROM Pickup
	  GROUP BY UserID ) AS t
ON
	t.TotalWeight BETWEEN a.StartWeight AND a.EndWeight
GROUP BY
	a.BracketID, a.StartWeight, a.EndWeight

And we now also have made the system friendlier by allowing end users to change the behavior of the report by merely adjusting the low and high values of the report's buckets using a simple setup table (I called it Bracket in this instance).
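Here is a runnable sketch of that bracket-table idea, in Python/SQLite for illustration; the per-bracket count is expressed as a LEFT JOIN against the per-user sums so that the aggregate yields exactly one value per bracket:

```python
import sqlite3

# A runnable sketch of the bracket-table report, in SQLite for illustration.
# The per-bracket count is a LEFT JOIN against the per-user sums so the
# aggregate yields exactly one value per bracket.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ReportBracket ( BracketID INTEGER, StartWeight INTEGER, EndWeight INTEGER );
    CREATE TABLE Pickup ( UserID INTEGER, NetWeight REAL );
""")
conn.executemany("INSERT INTO ReportBracket VALUES ( ?, ?, ? )", [(1, 0, 5), (2, 5, 10)])
conn.executemany(
    "INSERT INTO Pickup VALUES ( ?, ? )",
    [(1, 3.0), (2, 4.0), (2, 3.0)],  # sums: user 1 -> 3, user 2 -> 7
)

rows = conn.execute("""
    SELECT
        a.BracketID, a.StartWeight, a.EndWeight,
        COUNT( t.UserID ) AS UsersInThisBracket
    FROM ReportBracket a
    LEFT JOIN
        ( SELECT UserID, SUM( NetWeight ) AS TotalWeight
          FROM Pickup
          GROUP BY UserID ) AS t
        ON t.TotalWeight BETWEEN a.StartWeight AND a.EndWeight
    GROUP BY a.BracketID, a.StartWeight, a.EndWeight
    ORDER BY a.BracketID
""").fetchall()
print(rows)  # [(1, 0, 5, 1), (2, 5, 10, 1)]
```

Because of the LEFT JOIN, a bracket that matches no users still appears in the output with a count of zero.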


@Ben,

You're welcome. It is always frustrating to get excited about learning something new, only to not get the opportunity because of other commitments.

I would also suggest having a look at the blog by Kenneth Downs 'The Database Programmer' @ http://database-programmer.blogspot.com/

I have not found much on OLAP there, however what he has to say on relational databasing is really interesting. He also has a table of contents @ http://database-programmer.blogspot.com/2008/09/comprehensive-table-of-contents.html


@Andy S,

I think having a solid understanding of databases and database architecture is even more useful than people think it is. Database activity is almost always the limiting factor (bottleneck) on a page request, so being able to maximize understanding of that aspect is awesome.

I like to think I know a thing or two, but I have so much to learn. I have on the desk next to me, Transact-SQL Cookbook, but I have yet to read it. Just need to make some time!

@Andrew B,

Thanks for the links. I'll definitely take a look.


This is a really interesting find! I would like to use this to do a report (joining such that rows are displayed for dates and users even where there are no records) in PHP/MySQL. Is a pivot table, e.g. pivot100, available in MySQL? All my searches for this either end up back on your blog or on pivot tables referring to crosstab queries like MS Access.

This is basically what I'm trying to do...

I have a table where users can enter amounts under their user id for the quarters selected, i.e., March, June, September or December of a year. I currently have a report on this which churns out the data like this:

User 1 | Mar 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 1 | Jun 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 1 | Sep 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 1 | Dec 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 2 | Jun 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 2 | Sep 2010 | NNNN.NN | NNNN.NN | NNNN.NN

As you can see, User 2 did not enter any records for Mar or Dec 2010 hence no display for those records. However, the requirement is to display an empty row if no records are available like this:

User 1 | Mar 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 1 | Jun 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 1 | Sep 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 1 | Dec 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 2 | Mar 2010 |         |         |
User 2 | Jun 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 2 | Sep 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 2 | Dec 2010 |         |         |

I was thinking of creating a temp table which stores unique combinations of users and quarters. Maybe the pivot table can help me? I do hope it's available in MySQL...

Thanks,
Karen
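To the question above: a pivot table is just a regular table you create yourself, so it is available in MySQL (or any database). For the missing rows specifically, a CROSS JOIN of users and quarters, LEFT JOINed back to the data, produces exactly the empty rows described. A sketch (SQLite here, with hypothetical table names; the same CROSS JOIN / LEFT JOIN shape works in MySQL):

```python
import sqlite3

# A sketch of filling in missing user/quarter rows: CROSS JOIN every user with
# every quarter, then LEFT JOIN the actual data back in. Table names are
# hypothetical; the same join shape works in MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users ( user_id INTEGER PRIMARY KEY );
    CREATE TABLE quarters ( quarter TEXT );
    CREATE TABLE amounts ( user_id INTEGER, quarter TEXT, amount REAL );
""")
conn.executemany("INSERT INTO users VALUES ( ? )", [(1,), (2,)])
conn.executemany("INSERT INTO quarters VALUES ( ? )", [("Mar 2010",), ("Jun 2010",)])
conn.executemany(
    "INSERT INTO amounts VALUES ( ?, ?, ? )",
    [(1, "Mar 2010", 10.0), (1, "Jun 2010", 20.0), (2, "Jun 2010", 30.0)],
)

rows = conn.execute("""
    SELECT u.user_id, q.quarter, a.amount
    FROM users u
    CROSS JOIN quarters q
    LEFT JOIN amounts a
        ON a.user_id = u.user_id AND a.quarter = q.quarter
    ORDER BY u.user_id, q.quarter
""").fetchall()
print(rows)
# User 2 gets a "Mar 2010" row with a NULL (None) amount even though no record exists.
```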

