Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.
I am the chief technical officer at InVision App, Inc - a prototyping and collaboration platform for designers, built by designers. I also rock out in JavaScript and ColdFusion 24x7.

SQL Pivot Tables Are Rockin' Awesome (Thanks Rick Osborne)

By Ben Nadel on
Tags: SQL

A few months ago, Rick Osborne introduced me to the idea of a SQL database pivot table. A pivot table, in this sense, is just a table with a single indexed ID column whose values count up from one (I'm sure there are variations on this) to some set number. For instance, the pivot table [pivot100] has an ID column with values from 1 to 100. Rick told me that he has tables that go to 100, 1,000, and 10,000, but so far I have only rocked the 100 pivot table.
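For illustration, here is a minimal sketch of building such a pivot (numbers) table. The post uses SQL Server and ColdFusion; I am using Python and SQLite here purely for demonstration, and only the pivot100 name comes from the article:

```python
import sqlite3

# A minimal sketch of a pivot (numbers) table. The post uses SQL Server +
# ColdFusion; SQLite is used here purely for illustration. Only the name
# "pivot100" comes from the article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pivot100 ( id INTEGER PRIMARY KEY )")

# Populate it with the integers 1 through 100.
conn.executemany(
    "INSERT INTO pivot100 ( id ) VALUES ( ? )",
    [(i,) for i in range(1, 101)],
)

row_count, min_id, max_id = conn.execute(
    "SELECT COUNT(*), MIN(id), MAX(id) FROM pivot100"
).fetchone()
print(row_count, min_id, max_id)  # 100 1 100
```

Once such a table exists, any query that needs a sequence of numbers (buckets, ranks, date offsets) can join against it instead of looping.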

So anyway, this pivot table just cut a HUGE, NASTY, UGLY SQL query from about 90ms down to about 20ms. That is an outstanding performance increase. This query was huge and nasty and for reporting, so it did a ton of things, but I just wanted to show you a small part of it. It is for a recycling program that has a pickups table [pickup] and a join table from users to pickups [pickup_useraccount_jn]. As PART of this massive query, I had to find the number of users whose net weight of recycling over a certain date period fell into a given 5 lb bucket (i.e. 0-5 lbs, 5-10 lbs, etc.) all the way up to 80 (which was 80+, NOT 80-85 lbs).

First, let's look at the original SQL query. You can assume that all the SQL variables have already been defined and that lstAvailableUsers is a list of User IDs to be included in the report (I know that is lame... I am overhauling everything):

<!---
	Now, we have to loop over the 5 lb increments and find
	the number of users that recycled a net pickup weight
	in that sum category.
--->
<cfloop
	index="i"
	from="0"
	to="80"
	step="5">

	SET @number_#i#_#(i + 5)# = ISNULL(
		(
			SELECT
				COUNT( * )
			FROM
				(

					SELECT
						SUM( p.net_weight ) AS net_weight
					FROM
						pickup p
					WHERE
						p.date_pickup >= @date_from
					AND
						p.date_pickup < @date_to
					AND
						p.user_id IN ( #lstAvailableUsers# )

					GROUP BY
						p.user_id

				) AS TempWeightTable#i#
			WHERE
				net_weight >= #i#

			<!--- Check for the catch-all weight bucket. --->
			<cfif (i LT 80)>
			AND
				net_weight < #(i + 5)#
			</cfif>
		),

		<!--- If above is null, return zero. --->
		0
	);

</cfloop>

As you can see, I am basically running that inner query 17 times, stepping from 0 to 80. That is a SLOW query to be running so many times. That's where the pivot table comes into play. Well, not just the pivot table, but a pivot table and a temporary table variable. To clean up the query, I start out by creating a table variable for the steps (0-5, 5-10, etc):

DECLARE @number_step_5 TABLE (
	step INT,
	user_count INT
);

Then, I go about populating that table using the help of the SQL pivot table:

<!---
	Set up the step-5 table. This will step from 0 to 80,
	getting counts of users who recycled a sum weight in
	that 5 lb weight bucket.
--->
INSERT INTO @number_step_5
(
	step,
	user_count
)(
	SELECT
		tpiv.id AS step,
		COUNT( * )
	FROM
		<!---
			Take the pivot100 table and trim it so that we
			are only returning pivot IDs that are a 5 lb
			increment. Also, we need to subtract one from our
			1-based ID column as our weight buckets
			start at zero.
		--->
		(

			SELECT
				( piv.id - 1 ) AS id
			FROM
				pivot100 piv
			WHERE
				piv.id < 82
			AND
				( ( piv.id - 1 ) % 5 ) = 0

		) AS tpiv
	INNER JOIN
		<!---
			Create an in-memory temp table that has all the
			different net recycling weight sums by user. We
			don't care about the user at this point, just the
			total list of sums.
		--->
		(

			SELECT
				SUM( p.net ) AS net_sum
			FROM
				pickup p
			INNER JOIN
				pickup_useraccount_jn pujn
			ON
				(
					p.id = pujn.pickup_id
				AND
					p.date_picked_up BETWEEN @date_from AND @date_to
				AND
					pujn.useraccount_id IN ( #lstAvailableUsers# )
				)
			GROUP BY
				pujn.useraccount_id

		) AS tsum
	ON
		<!---
			When joining the sum table to the pivot table,
			we want the sum to be between the initial pivot
			value (our start value) and the increment value
			(5 lbs). However, the last bucket is a catch-all
			for 80 lbs to infinity... which we will get by
			using 999999 (good enough).
		--->
		(
			tsum.net_sum >= tpiv.id
		AND
			tsum.net_sum <
			(
				CASE
					WHEN tpiv.id < 80
					THEN ( tpiv.id + 5 )
					ELSE 999999
				END
			)
		)
	GROUP BY
		tpiv.id
);

Now, this might not be the easiest query to follow, but let me tell you that it works, and it is quite a bit faster than the original query. Notice that the BIG chunk of the query (the INNER JOIN temp table) is only being run once (and then joined to) as opposed to the 17 times it was being run in the original query. But, we are not quite done yet. Now that I have this temp table populated, I still have to set the variables that I set in the original query:
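To see the shape of this join in miniature, here is a toy version in Python/SQLite. The schema is simplified from the article (no user join table, no date filter), so treat it as a sketch of the technique rather than the real query:

```python
import sqlite3

# A toy version of the pivot-table bucketing join, in SQLite for illustration.
# The schema is simplified from the article (no user join table, no date
# filter), so this only demonstrates the shape of the technique.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pivot100 ( id INTEGER PRIMARY KEY );
    CREATE TABLE pickup ( user_id INTEGER, net_weight REAL );
""")
conn.executemany("INSERT INTO pivot100 VALUES ( ? )", [(i,) for i in range(1, 101)])
conn.executemany(
    "INSERT INTO pickup VALUES ( ?, ? )",
    [(1, 3.0), (1, 4.0), (2, 6.0), (3, 90.0)],  # sums: user 1 -> 7, user 2 -> 6, user 3 -> 90
)

rows = conn.execute("""
    SELECT
        tpiv.id AS step,
        COUNT( * ) AS user_count
    FROM
        -- Trim the pivot table down to the 5 lb increments, 0 through 80.
        ( SELECT ( piv.id - 1 ) AS id
          FROM pivot100 piv
          WHERE piv.id < 82 AND ( ( piv.id - 1 ) % 5 ) = 0 ) AS tpiv
    INNER JOIN
        -- One net-weight sum per user; the expensive part, run only once.
        ( SELECT SUM( p.net_weight ) AS net_sum
          FROM pickup p
          GROUP BY p.user_id ) AS tsum
    ON (
        tsum.net_sum >= tpiv.id
        AND tsum.net_sum <
            ( CASE WHEN tpiv.id < 80 THEN ( tpiv.id + 5 ) ELSE 999999 END )
    )
    GROUP BY tpiv.id
    ORDER BY tpiv.id
""").fetchall()
print(rows)  # [(5, 2), (80, 1)]
```

Two users summed into the 5-10 lb bucket and one landed in the 80+ catch-all; empty buckets simply produce no row, which is why the ISNULL() step below is still needed.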

<!---
	Now, we have to loop over the increments and find the
	number of users that recycled a net weight in the
	given bucket.
--->
<cfloop
	index="i"
	from="0"
	to="80"
	step="5">

	SET @number_#i#_#(i + 5)# = ISNULL(
		(
			SELECT
				user_count
			FROM
				@number_step_5
			WHERE
				step = #i#
		),
		0
	);

</cfloop>

This queries the temp table to find the value for that weight group. Since there might not be any pickups in that weight group, I must use an ISNULL() function call. The genius here, though, is that the temp table [@number_step_5] should never have more than 17 rows in it, and most likely less than that. That means that each of these subsequent queries that sets a variable will execute almost immediately with such a minimal amount of data to search.
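The fallback behavior is easy to see in miniature. Here is a sketch in Python/SQLite, where SQL Server's ISNULL() is spelled COALESCE() (or IFNULL()):

```python
import sqlite3

# A miniature of the ISNULL() fallback: a bucket with no row should come back
# as zero, not NULL. SQLite spells the fallback COALESCE() (or IFNULL()).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE number_step_5 ( step INTEGER, user_count INTEGER )")
conn.execute("INSERT INTO number_step_5 VALUES ( 5, 2 )")

query = """
    SELECT COALESCE(
        ( SELECT user_count FROM number_step_5 WHERE step = ? ),
        0
    )
"""
hit = conn.execute(query, (5,)).fetchone()[0]
miss = conn.execute(query, (10,)).fetchone()[0]
print(hit, miss)  # 2 0
```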

Pivot tables are totally awesome! Thanks Rick, you helped me rock this SQL query hard core!




Reader Comments

I just learned of pivots today, and found this post when looking for the MSDN reference. I don't know how I missed it the first time around, but I thought it was cool that you were in the top 10 for my search.


I know this is an older post, but is still relevant. Have you looked into OLAP databasing?

With OLAP, all the aggregation is done in advance in a data cube and is generally refreshed each time you update the OLAP database (eg: nightly, weekly, etc).

I am not sure about other OLAP products, but I know that ColdFusion can talk to MS SQL Server's Analysis Services and its cubes directly.


There is no physical difference in the database itself but more to do with how you structure it.

The power of OLAP comes from pre-aggregating the data and storing redundant/visual data (like dates and bucket ranges) for joining purposes.

With your above example you could create 3 dimension tables (users, dates and bucket ranges) and a fact table (the aggregation of weights per the 3 dimensions); then change your multiple SQL queries into 1 MDX query.
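As a rough relational sketch of the star schema described above (all table and column names here are hypothetical, and this uses SQLite rather than a real OLAP engine, so it only shows the shape of the dimension/fact layout):

```python
import sqlite3

# A rough relational sketch of a star schema: three dimension tables plus a
# pre-aggregated fact table. All names here are hypothetical; a real OLAP
# setup would build a cube and query it with MDX, not plain SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_user   ( user_id INTEGER PRIMARY KEY, user_name TEXT );
    CREATE TABLE dim_date   ( date_id INTEGER PRIMARY KEY, pickup_date TEXT );
    CREATE TABLE dim_bucket ( bucket_id INTEGER PRIMARY KEY, start_weight INTEGER, end_weight INTEGER );

    -- The fact table sits at the center of the star, keyed by the dimensions.
    CREATE TABLE fact_weight (
        user_id      INTEGER REFERENCES dim_user,
        date_id      INTEGER REFERENCES dim_date,
        bucket_id    INTEGER REFERENCES dim_bucket,
        total_weight REAL
    );
""")
conn.execute("INSERT INTO dim_user VALUES ( 1, 'Ben' )")
conn.execute("INSERT INTO dim_date VALUES ( 1, '2009-04-01' )")
conn.execute("INSERT INTO dim_bucket VALUES ( 1, 5, 10 )")
conn.execute("INSERT INTO fact_weight VALUES ( 1, 1, 1, 7.0 )")

# A report is then a join from the fact table out to its dimensions.
row = conn.execute("""
    SELECT u.user_name, b.start_weight, b.end_weight, f.total_weight
    FROM fact_weight f
    INNER JOIN dim_user u ON u.user_id = f.user_id
    INNER JOIN dim_bucket b ON b.bucket_id = f.bucket_id
""").fetchone()
print(row)  # ('Ben', 5, 10, 7.0)
```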

It is rather difficult to explain OLAP structure in a simple comment, so instead I would suggest having a look at some other articles like http://www.cubeslice.com/session1.htm

There is also an interesting article on SQL vs MDX at http://msdn.microsoft.com/en-us/library/aa216779(SQL.80).aspx


@Andrew,

Wow, OLAP seems pretty interesting. I like the fact that it has like 7 different definitions :)


@Ben,

I know what you mean. When I first got into OLAP (not too long ago), I found it rather difficult to understand with an OLTP mindset. Once I stopped trying to optimize the database structures (in my mind) to suit transactions, it made it a lot easier to understand the benefits.

If you are interested in learning OLAP for MS SQL, the MS press step by step book for analysis services is a good place to start. It doesn't really go into using data cubes online, however a quick Google search on ColdFusion and OLAP gives some examples.


@Ben,

I use it for cases where it will improve performance (which is primarily reporting cases). There is a bit of overhead with setting up the OLAP database and the transfer mechanisms to populate it.

For a small report that can be easily answered with a simple SQL query, it is just not worth it.

However, for something that has a few reports (especially historical, aggregated reports), once the setup is complete, adding extra tables and MDX queries to answer more questions is similar to relational databasing.


@Andrew,

Ah, ok gotcha. So it's not an all-or-nothing approach to database design. I worked one time on a system that had a tremendous amount of data; the technical guy on the client side wanted to start implementing a parallel database for reporting and data mining in something called a "star schema", which I gathered was a denormalized version of the database meant to be super fast for reporting.

Reporting is something that I find quite interesting. I wish I had more opportunity to look into things like this.


@Ben,

Star schema is a database structuring approach to OLAP, and your description also describes OLAP itself.

My previous post on the 19th outlines a star schema example. The fact table with the aggregated values is the center of the star; and the 3 dimension tables joining to the fact table, explaining the fact tables values with their own data, make the outside points of the star.

I used to struggle with having to denormalize my reporting tables inside of my transaction database; mainly because of my attitude towards keeping my data organized and structured (I know this attitude is rather stubborn at times, but meh). Then after discussing a training application I was working on with a friend of mine who does Business Intelligence, he suggested I look into OLAP and data cubes.

Now I separate what I can from my base transactional database, like reporting or archiving (soft-deletes). The thing I realized is, all the data is still held on the file system in the same structure. It is the DBMS that separates it into databases, tables and cells. Plus it is not too hard to do a multi-table or distributed (across different servers) query.


@Andrew,

Ahh, thanks for clearing that up. When I heard about Star Schema, I was really excited to learn more about it; unfortunately, when the project really got underway, we were bogged down with a whole bunch of other deadlines that took priority and I never got to learn more about it.


@Ben,
I used to be more focused on data architecture, and it has really paved the way for how I approach problems and data access in general. Any time that you spend learning about RDBMS theory, OLAP cubes, etc. is time well spent. Consider that Structured Query Language and the relational model have survived for more than 30 years, and it becomes readily apparent that, as a developer, this is truly your most valuable and transferable skill. There's a lot of good information out there and I'd recommend reading anything written by Michael Hernandez, E.F. Codd (the father of the RDBMS), or Chris Date. Here is some very basic info regarding Codd's 12 rules for relational database management systems...
Rule 0: The system must qualify as relational, as a database, and as a management system.

For a system to qualify as a relational database management system (RDBMS), that system must use its relational facilities (exclusively) to manage the database.

Rule 1: The information rule:

All information in the database is to be represented in one and only one way, namely by values in column positions within rows of tables.

Rule 2: The guaranteed access rule:

All data must be accessible. This rule is essentially a restatement of the fundamental requirement for primary keys. It says that every individual scalar value in the database must be logically addressable by specifying the name of the containing table, the name of the containing column and the primary key value of the containing row.

Rule 3: Systematic treatment of null values:

The DBMS must allow each field to remain null (or empty). Specifically, it must support a representation of "missing information and inapplicable information" that is systematic, distinct from all regular values (for example, "distinct from zero or any other number", in the case of numeric values), and independent of data type. It is also implied that such representations must be manipulated by the DBMS in a systematic way.

Rule 4: Active online catalog based on the relational model:

The system must support an online, inline, relational catalog that is accessible to authorized users by means of their regular query language. That is, users must be able to access the database's structure (catalog) using the same query language that they use to access the database's data.

Rule 5: The comprehensive data sublanguage rule:

The system must support at least one relational language that

1. Has a linear syntax
2. Can be used both interactively and within application programs,
3. Supports data definition operations (including view definitions), data manipulation operations (update as well as retrieval), security and integrity constraints, and transaction management operations (begin, commit, and rollback).

Rule 6: The view updating rule:

All views that are theoretically updatable must be updatable by the system.

Rule 7: High-level insert, update, and delete:

The system must support set-at-a-time insert, update, and delete operators. This means that data can be retrieved from a relational database in sets constructed of data from multiple rows and/or multiple tables. This rule states that insert, update, and delete operations should be supported for any retrievable set rather than just for a single row in a single table.

Rule 8: Physical data independence:

Changes to the physical level (how the data is stored, whether in arrays or linked lists etc.) must not require a change to an application based on the structure.

Rule 9: Logical data independence:

Changes to the logical level (tables, columns, rows, and so on) must not require a change to an application based on the structure. Logical data independence is more difficult to achieve than physical data independence.

Rule 10: Integrity independence:

Integrity constraints must be specified separately from application programs and stored in the catalog. It must be possible to change such constraints as and when appropriate without unnecessarily affecting existing applications.

Rule 11: Distribution independence:

The distribution of portions of the database to various locations should be invisible to users of the database. Existing applications should continue to operate successfully:

1. when a distributed version of the DBMS is first introduced; and
2. when existing distributed data are redistributed around the system.

Rule 12: The nonsubversion rule:

If the system provides a low-level (record-at-a-time) interface, then that interface cannot be used to subvert the system, for example, bypassing a relational security or integrity constraint.


There are numerous ways that this problem could've been solved, but here's my take on it. Why not create a table that holds the brackets? This table's data could be updated by users when they want to change the behavior (the number of buckets and the from/to values of the report buckets) of the report, instead of having to call a programmer.

The Bracket table would look like so...
BracketID INT
StartWeight INT
EndWeight INT

Then the entire logic of the report simply changes to something as condensed as...

SELECT
	a.BracketID,
	a.StartWeight,
	a.EndWeight,
	COUNT( t.UserID ) AS UsersInThisBracket
FROM
	ReportBracket a
LEFT JOIN
	( SELECT UserID, SUM( NetWeight ) AS TotalWeight
	  FROM Pickup
	  GROUP BY UserID ) AS t
ON
	t.TotalWeight BETWEEN a.StartWeight AND a.EndWeight
GROUP BY
	a.BracketID, a.StartWeight, a.EndWeight

And we now also have made the system friendlier by allowing end users to change the behavior of the report by merely adjusting the low and high values of the report's buckets using a simple setup table (I called it Bracket in this instance).
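Here is a runnable sketch of that bracket-table idea, in Python/SQLite for illustration; the per-bracket count is expressed as a LEFT JOIN against the per-user sums so that the aggregate yields exactly one value per bracket:

```python
import sqlite3

# A runnable sketch of the bracket-table report, in SQLite for illustration.
# The per-bracket count is a LEFT JOIN against the per-user sums so the
# aggregate yields exactly one value per bracket.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ReportBracket ( BracketID INTEGER, StartWeight INTEGER, EndWeight INTEGER );
    CREATE TABLE Pickup ( UserID INTEGER, NetWeight REAL );
""")
conn.executemany("INSERT INTO ReportBracket VALUES ( ?, ?, ? )", [(1, 0, 5), (2, 5, 10)])
conn.executemany(
    "INSERT INTO Pickup VALUES ( ?, ? )",
    [(1, 3.0), (2, 4.0), (2, 3.0)],  # sums: user 1 -> 3, user 2 -> 7
)

rows = conn.execute("""
    SELECT
        a.BracketID, a.StartWeight, a.EndWeight,
        COUNT( t.UserID ) AS UsersInThisBracket
    FROM ReportBracket a
    LEFT JOIN
        ( SELECT UserID, SUM( NetWeight ) AS TotalWeight
          FROM Pickup
          GROUP BY UserID ) AS t
        ON t.TotalWeight BETWEEN a.StartWeight AND a.EndWeight
    GROUP BY a.BracketID, a.StartWeight, a.EndWeight
    ORDER BY a.BracketID
""").fetchall()
print(rows)  # [(1, 0, 5, 1), (2, 5, 10, 1)]
```

Because of the LEFT JOIN, a bracket that matches no users still appears in the output with a count of zero.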


@Ben,

You're welcome. It is always frustrating to get excited about learning something new, only to not get the opportunity because of other commitments.

I would also suggest having a look at the blog by Kenneth Downs 'The Database Programmer' @ http://database-programmer.blogspot.com/

I have not found much on OLAP there, however what he has to say on relational databasing is really interesting. He also has a table of contents @ http://database-programmer.blogspot.com/2008/09/comprehensive-table-of-contents.html


@Andy S,

I think having a solid understanding of databases and database architecture is even more useful than people think it is. Database activity is almost always the limiting factor (bottleneck) on a page request, so being able to maximize understanding of that aspect is awesome.

I like to think I know a thing or two, but I have so much to learn. I have on the desk next to me, Transact-SQL Cookbook, but I have yet to read it. Just need to make some time!

@Andrew B,

Thanks for the links. I'll definitely take a look.


This is a really interesting find! I would like to use this to do a report (joining such that rows are displayed for dates and users even where there are no records) in PHP/MySQL. Is a pivot table, e.g. pivot100, available in MySQL? All my searches for this either end up back on your blog or on pivot tables referring to crosstab queries like MS Access.

This is basically what I'm trying to do...

I have a table where users can enter amounts under their user id for the quarters selected, i.e., March, June, September or December of a year. I currently have a report on this which churns out the data like this:

User 1 | Mar 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 1 | Jun 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 1 | Sep 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 1 | Dec 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 2 | Jun 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 2 | Sep 2010 | NNNN.NN | NNNN.NN | NNNN.NN

As you can see, User 2 did not enter any records for Mar or Dec 2010 hence no display for those records. However, the requirement is to display an empty row if no records are available like this:

User 1 | Mar 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 1 | Jun 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 1 | Sep 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 1 | Dec 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 2 | Mar 2010 |         |         |
User 2 | Jun 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 2 | Sep 2010 | NNNN.NN | NNNN.NN | NNNN.NN
User 2 | Dec 2010 |         |         |

I was thinking of creating a temp table which stores unique combinations of users and quarters. Maybe the pivot table can help me? I do hope it's available in MySQL...

Thanks,
Karen
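To the question above: a pivot table is just a regular table you create yourself, so it is available in MySQL (or any database). For the missing rows specifically, a CROSS JOIN of users and quarters, LEFT JOINed back to the data, produces exactly the empty rows described. A sketch (SQLite here, with hypothetical table names; the same CROSS JOIN / LEFT JOIN shape works in MySQL):

```python
import sqlite3

# A sketch of filling in missing user/quarter rows: CROSS JOIN every user with
# every quarter, then LEFT JOIN the actual data back in. Table names are
# hypothetical; the same join shape works in MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users ( user_id INTEGER PRIMARY KEY );
    CREATE TABLE quarters ( quarter TEXT );
    CREATE TABLE amounts ( user_id INTEGER, quarter TEXT, amount REAL );
""")
conn.executemany("INSERT INTO users VALUES ( ? )", [(1,), (2,)])
conn.executemany("INSERT INTO quarters VALUES ( ? )", [("Mar 2010",), ("Jun 2010",)])
conn.executemany(
    "INSERT INTO amounts VALUES ( ?, ?, ? )",
    [(1, "Mar 2010", 10.0), (1, "Jun 2010", 20.0), (2, "Jun 2010", 30.0)],
)

rows = conn.execute("""
    SELECT u.user_id, q.quarter, a.amount
    FROM users u
    CROSS JOIN quarters q
    LEFT JOIN amounts a
        ON a.user_id = u.user_id AND a.quarter = q.quarter
    ORDER BY u.user_id, q.quarter
""").fetchall()
print(rows)
# User 2 gets a "Mar 2010" row with a NULL (None) amount even though no record exists.
```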

