Thoughts On Storing Duplicate / Calculated Data In The Database
The other day, I was talking to this girl, Franziska, and all I could think about was how funny she was. Then, as I was walking home later that day, I kept thinking about how funny she was. As I was doing this, I started to think about database design. And, it occurred to me that by thinking about how funny Franziska was when she was not right in front of me was really the equivalent of storing duplicate data in a database. Think about it - the assessment of "humor" that this Franziska possessed could be seen as an aggregate of her individual data points?
Now, I walk around with this vision in my head all the time. However, if I didn't want to store duplicate data or calculated data (being vague here cause I am really just thinking out loud), I would have to see this Franziska every time I wanted to assess her humor. Could you imagine: wanting to see someone every time you wanted to think about an over-arching feeling about them? Seems crazy, right? Well, how is this applied to database design?
If I had an eCommerce order in the database, do I store the total? Or do I always find the total by summing the product prices at the time of checkout? My preference is to store the sub totals, tax totals, discount totals, grand totals, etc. in some sort of a WebOrder or Cart record. I don't want to have to re-sum all the cart-products to get the cart total. I think this is a commonly held feeling.
But, what about something like dollars spent over the life time of an eCommerce account? Do I want to store that number all the time? Or, when asked for this value, do I simply want to add all the available transaction values for an account? In this case, I would opt for the on-the-fly sum any time that this value is requested. But, why? After all, is this concept not very similar to the idea of orders and products? Well, similar but different. In an order, the value is never changing once the order has been placed. With a complete transactional total, the value is always changing (every time a new order is placed or a void transaction added).
I was going to say that the frequency of reference was also different. My gut said to say that since order history information is accessed more regularly than transaction totals, it should be static so that you don't have the overhead of aggregate queries. But this isn't really true, is it? If I think about my Amazon.com account, I would probably be more interested in total spent over my lifetime than I would about any individual order history (interest here is relative as I probably never look up either of these values).
So then, I thought about something more work related; I have this database that has contacts that have continuing education credits. At one point, they can have a credit for teaching, at which point they can teach a certain course. How do I want to store this? Everytime I need to see if they can to teach a course, should I test to see if a given CEC (continuing education credit) exists? That's seems a little silly for such an important data point. Or, rather, should I have a table that joins contacts to courses (that they can teach)? This feels more "static", but is it really? I am still needing to join my contact to a teaching table, so why is that any different from joining the contact to a CEC table?
Ok, let's switch to a really small scenario. What about the idea of flagging an order as Shipped. Let's say we have an Order record that has these two columns:
- is_shipped (boolean)
- date_shipped (date/time)
Is this considered duplicate data? Think about this for a second. If is_shipped is true, then we MUST have a date in the date_shipped column. And vice versa, if we have no date_shipped value, then we know that is_shipped is false. Really, is_shipped is just a calculated column that you could think of as:
is_shipped = (date_shipped IS NOT NULL)
So really, this is_shipped column is totally not necessary. However; does that mean it's bad? Does that mean it's overkill? When I query for all shipped orders, is it better to filter on a (is_shipped = 1) column or a (date_shipped IS NOT NULL) column?
But, it's not that simple! We have to think about intent, right? What is my intent. Am I really querying for all records that have a shipped date? Not really. My intent is only to query records that have been shipped - date of shipping is of no concern to me. But, does my intent have to line up exactly with the implementation of the code?
Switching modes again, let's think of a transactional system where ever order placed has an associated transaction. Now, let's say that this system allows us to VOID transactions. So now, we have an Order record that can join to a Transaction record that might have a (status = 'void') value. Seems straight forward, right? But now stop and think about the Transaction table; this table is going to contain every transaction for every user for the entire life of the system! Is this really a table to which you want to JOIN every time you need to get information about an Order? A join of that size must have a huge overhead even with amazing indexes.
So, then, if we Void an order, is it overkill to store the Void status in both the transaction record and the order record? My gut says no, it's not overkill since it will allow us to view order status without having to join to the transaction table. But this is clearly duplicate data.
I don't know if there are any best practices or design patterns in this area. I just go with my gut, but I feel guiltily that I don't really know what is considered right. Does anyone have any "rules of thumb" that they use in this area of design? When is duplication OK? When is it overkill? Must my query intent line up exactly with the data implementation?
Views my friend... views.
Pfft... let your DBA or Database Developer worry about that sort of thing ;P
Seriously though... like Tony Said, views help out a lot when it comes to this sort of thing.
I am not sure I follow. Wouldn't value inside the view still also suffer from the concept of calculated vs. static data?
My only rule of thumb is to store aggregate values in a new table or view if querying the detail table gets takes too long. One company I worked at had 20 million records in one table, and grouping aggregates by month and userID was taking too long.
By rolling up the monthly results by userID into a new table and running the job nightly, I was able to reduce the page load time by several seconds.
The only drawback was that the totals updated nightly. If any new information was added to the detail table on any single day, that information would not show up until the next morning. I think I remember posting something on the web page informing the user that this was refreshed nightly.
The other cool thing about this sort of solution is that query caching comes into play, further improving performance.
It was the sort of situation where any way to speed up page load time was considered a good thing.
Yeah, I am thinking that this sort of stuff is maybe a "per basis" decision to be made. Not sure there are any rules of thumb that can be applied.
I think they are "rules of thumb" (I feel funny using that expression after it's explanation in "The Boondock Saints", you must see that if you haven't already).
That being said, my rules that would apply here are:
- First and foremost a good database design. Get familiar (if you are not already) with the 3 normal forms of database design and follow them. You can read more at http://en.wikipedia.org/wiki/Database_normalization.
- Store most details information
- Use views, stored procedures and temporary table to return data as needed in whatever format (aggregates and custom stuff)
You are completely wrong.
And absolutely right.
The voices in your head arguing are what I like to call the "scientist" and the "engineer". The scientist knows there's one best way to do this, that is perfect and awe-inspiring in it's correctness; but the engineer knows that in the real world, fifth normal form often performs like crap and is difficult to access via code. Views can help with the latter problem, but can exacerbate the former problem (in cases where an important filter is omitted that was needed for the view to be efficient.)
Having this argument in your head ("but this is more correct!"... "But this will perform better!") several times a day is normal and healthy. I try to bias strongly toward the scientist. If you're going to break the rules, there should be a very good, documented reason. Resist the temptation to performance tune before you have a performance problem, though, as that road leads to unnecessarily "sloppy" code and data structures.
Ha ha, that was a good explanation. I see what you are saying, and it make me feel more comfortable.
Dude....you need an intervention!
i think you situation would be the perfect example where an indexed view could be used to increase performance. However if your DB doesn't support them, then yes I would have done EXACTLY what you did.
Thanks, will keep in mind. I remember being bummed over the 24 hr. wait period for new data, and would have liked a better solution.
I'm not so good at indexing, although I've heard it can work wonders.
Sounds like you want either indexed (SQL Server) or materialized (Oracle) views.
Computing some value and storing it in the database like that is really bad unless you take precautions.
First, quite often this kind of optimization is not needed. For instance the old cfunited scheduler (pre 2007) had "time blocks" which were numbered and then used a comma delimited list in a column in the database that stored which blocks each of the sessions overlapped with. This makes things very fragile since a single schedule change makes that entire column potentially invalid. The new scheduler uses a simple query that compares times to figure out overlaps at query time and there was no noticeable performance difference.
Remember your database is very fast, and joining on a properly indexed table, even with thousands or millions or rows need not be that slow.
" Is this really a table to which you want to JOIN every time you need to get information about an Order? A join of that size must have a huge overhead even with amazing indexes."
Isn't really true. The database can page data and use all kinds of tricks to do this fast. Provided it can keep the entire index in memory a index search is O(log n), the data will be clustered by the index, so the join is wicked fast once the position is found. Possibly event near O(1) if the data is cached or other tricks are used.
Second, always, *always* use CONSTRAINT's to make sure this data is valid. What happens if the is_shipped column gets set to true by some errant application code or a developer but the date doesn't match that? What if an order is voided in once place, but not the other or unvoided by someone. It's bad practice to have duplicate, dependent data unless you have constraints that makes sure that no data can be changed to invalidate the entire set.
(Actually, for that matter you should be using constraint whenever possible on your tables and columns, even if they aren't dependent....)
Also, what happens if you write a new application in say, php, java, or flex that doesn't interact with CF? Now you have to duplicate all this special logic about these properties that relate the two columns on the tables.
So yeah, I'd definitely try and stay away from this. It sounds a lot like a pre-optimization that's just going to cause maintainability issues later. Better to use a View of some kind and attach triggers to it so you can insert and delete directly.
Also, don't underestimate how fast your database is. A well designed database structure and well written queries can handle insane amounts of data very fast.
Have you heard of data transformation and data warehouse?
Wait....am I the only one who wants to hear more about the insanely cute German chick? The curved hips? The "cute bottom"? The fierce German accent....umm, what?
We tend to try to look what the users need today and then what will they ask for next and in many cases we end up doing what I call data staging - ie: the king of data duplication - We have a large data warehouse of transactional sales and deliveries (SAP ERP data). Our reporting interface is all done in CF with the MS-SQL backend doing the heavy lifting. When we load the data we do the logic/analysis to profile the orders into groups or classes and then add to the records a common set of relational data. For example, users want to see all their orders, well their orders can only be found by relating the order to the material master (1.7M records) - so why would I want to join millions of sales orders to 1.7M records each time a users wants to see their stuff - much more efficient to do the join on data load and add to our transactional data the material data we need - now our app only gets data from one table versus a join. Since we have CF set for 30 sec timeouts - I hate waiting for data and end users hate it more - we do what it takes to get done what needs in LT 30 secs and this means non-normalized data and letting the DB do what it needs vs having CF do what DBs where built to do. I find it easier to manage data via ETL vs where clauses in CF select statements
I like the scientist vs. engineer metaphor. You could apply that to pretty much any situation in IT where you have customers or deadlines. That is worthy of a lengthy blog post in its own right!
I was as surprised as Davo that most of the commenters seemed way more interested in fifth normal form than in your romantic escapades. Granted, this *is* a technical blog... but I too yearn for more details! :)
I agree with your model of tending to store "one time" computations, such as Order Totals, in the DB at the time they're 'fixed' in place. I think one of the main reasons not to also keep a master roll-up at a level beyond that (total sales for the month, or whatever) is that so often subsequent querying isn't predictable. The vendor wants to see "just how many photos of the German chick did we sell last month?" or maybe "show me this Feb total vs last Feb total", while the customer might only care to look up a specific order to double-check his credit card receipt when it comes in.
Dunno, doesn't answer your question, but I think I agree with the scientist vs engineer paradigm in that regard. Clearly, once datasets reach sizes where simple queries bog down, then things change and indexing needs to be tweaked and views adjusted as necessary.
I frequently store the total or other similar data when dealing with information that is spread across multiple tables (ie, Order with multiple OrderItem).
a) I like to think its faster. Most of my databases don't take up that much space, relative to system ram.
b) I make mistakes and having this extra data helps to figure out what happened. When the order has a total of $10.00 but when I sum up the OrderItems and get $8.00 I have an easier time of finding where the problem occured. (Of course, I could just remove all this information once I've finished testing, but I don't think it hurts)
Denormalizing data is a pretty controversial topic - why couldn't you have just talked more about Franziska, Politics or Money - now you're really going to stir the pot! LOL
In all seriousness though, the case for denormalizing data is not a very good one. The simple fact is that modern relational databases are just too damn efficient to make a case for denormalization. We've all been guilty of doing it at one time or another but it all boils down to planned programmer laziness - when you design tables like this you're planning on being lazy down the road when you're building the app - am I right?
Using that volume of data if you still wanted to keep the reports up to date you could aggregate the collated information in the nightly table with data from the source table only using a where clause to restrict the date/time being checked in the source table.
So long as the date/time field was indexed (which it probably should be in this scenario anyway) then the query should run fast even with 20million+ rows.
Disclaimer: I've never been a fan of views
To latch on to what David just said - it is important to really think about your indexes when constructing the table. A good basic rule to follow relates heavily to the "cardinality" of your data. Meaning, how different are the values from one record to the next for that particular column. Dates typically will have a very high degree of cardinality. The more the rdbms knows about the data the better. So, how do you help keep the rdbms aware of what it's holding? If you're on say SQL Server as your platform then one way to do this is to realize that it is important to UPDATE STATISTICS on the table(s) in question.
When you're designing the database think hard about the cardinality rule. For instance, what will ultimately end up in your WHERE clauses throughout the application and how different is the data for those columns from one rec to the next? Also keep in mind that based on data cardinality you can also safely assume that people who put boolean data types in an index are idiots.
Good luck - hope you get the girl!
What about triggers? A trigger that will calculate one field based on other existing fields will give you the overall benefit of a calculated field along with the benefit of not requiring the db load when totaling multiple rows. The calculation is done as rows are being updated, so this load for reporting can be considered "distributed".
I agree it "feels bad" to store additional data, but this method has always satiated both sides of the brain for me.
Yes, but what constitutes a trigger? A smile from the girl in question, or just a certain look that triggers the trigger - and does that set in motion and endless loop causing your DBA major headaches? And then there's the question of
<cfif girlInQuestionAction CONTAINS "looking_at_her_boyfriend_behind_you">
Seriously though, great (mind-boggling) article to begin with - I really enjoy reading your "development blog with philosophical questions thrown in". I think you should write a book, you've got a unique style that is blissfully absent of "too much geekiness"! :-)
Sorry I haven't been an active part of this conversation; I have been upstate for the last few days making some business happen. I really love all the comments given here.
From what everyone has to say, I am starting to see that my desire to store duplicate data stems from the fact that my databases might not be performing as well as I would like them to. This, in turn, stems in part from the fact that I am relatively new to indexes, especially any sort of multi-column index.
I can really see that database design is something that requires much more than just properly normalized data. I have to start making DB "best practices" a priority learning component.
Gee, Ben, reding about your interest in the cute girl puts a whole new spin on data "model"ing.
I try to bring in the fresh perspective from time to time.
Actually, on the way to work this morning, I started thinking about Albert Einstein. In an audio book I listened to (I think it was the Magic of Thinking Big), a story was told of the time someone asked Einstein how many feet there were in a mile. Einstein did not know the answer. When it was then asked how one of the greatest minds of all time didn't know such a simple fact, he responded that it was silly to waste time trying to remember information that could easily be looked up.
I thought this was a relevant to the conversation. Just as Al didn't want to waste his time storing easily-gotten information, it seems people don't want the overhead of storing duplicate information when the original data can be easily looked up (assuming proper indexes and normalized data).
I don't really have a point here, just another real-world take.
In regards to your "Order" scenario, I would definitely not have an "is_shipped" column and a "Date_Shipped" column. But I do typically have an status_id column that is has a foreign key to an "oder_status" table which would contain all the possible states that an order could be in. So if I had the statuses:
Status_ID | Status
1 | New
2 | Cancelled
3 | On Hold
4 | Paid
5 | Shipped
6 | Complete
7 | RMA Issued
8 | Returned
9 | Refunded
and in my application I just wanted to get a count of how many orders have been shipped. I would do
WHERE status_id >= 5
If I wanted to exclude orders from that list which have been returned I would do
WHERE status_id in (5,6,7)
I will still typically have date columns for any events I need to track for the order like date_entered, date_paid, date_shipped, etc, that way if I need to report on something that occurred within a specific date range, I can do that. But it avoids me having to create columns like is_shipped, is_paid, etc, just so I can can make some report faster.
This approach usually works out quite nicely for me, but in some of the larger scale operations I have been involved with, where the company has millions of order records, there have been times where an indexed view was necessary for particular reports, and there have also been times when I end up creating an "order_summary" table that I write a job to truncate and repopulate on a nightly basis. This will usually contain some sort of aggregate data about orders that can be used to fuel certain financial reports that don't necessarily need to be "real time" but need to be accurate as of the close of business for the previous day, and run too slow when queried directly against the gigantic orders table.
I don't know that much about Coldfusion but in RoR, something like this would be taken care of by the ORM. ActiveRecord auto maintains aggregated fields and I believe has different mechanisms for how they are updated.
I don't know much of anything about Ruby on Rails, or object relation mapping, for that matter. But, from what I *think*, we are talking about a layer of code that sits above the database - we are talking about the domain model and service layer in ActiveRecord?? This is different than the actual database implementation, which may or may not contain duplicate / calculated data?
I don't really know enough about Ruby or ActiveRecord to say. Maybe someone else here can add a comment.
I see your point, which is essentially where/if aggregated data should be stored in the database. Caching information that is too expensive to runtime query for. My feelings are that this type of dare i see hack is fine, even though it denormalizes the database under the condition that it is abstractly enabled. Rails accomplishes this transparently, so a programmer has no real knowledge of the cached field after its set up. Activerecord ensures that the field is properly maintained and no side effects can occur that would allow a loss of synchronization between the cached value and the actual data. If you can provide this guarantee in coldfusion all is well. If not, in my opinion the code becomes less "clean".
So, did you ask her out?