In my post yesterday on clearing
NULLable database values in ColdFusion, I was using the concept of "Task Management" as my exploratory context. And, in the
task database table that I created for the demo, I included both an
isComplete column and a
completedAt column. In theory, I could have written the demo using a single column,
completedAt, since a non-
NULL value within the
completedAt column would indicate that the Task in question had been completed. But, I ended up using two columns because I believe they actually answer two different semantic questions.
Here is the
CREATE TABLE statement for the Task table in this thought-experiment:
CREATE TABLE `task` ( `id` int unsigned NOT NULL AUTO_INCREMENT, `description` varchar(255) NOT NULL, `isComplete` tinyint unsigned NOT NULL, `createdAt` datetime NOT NULL, `completedAt` datetime DEFAULT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB;
NOTE: I don't have any secondary indices defined on this table. But, ignore that fact for this post - this isn't a discussion about query performance or optimization, it's a discussion about column semantics.
As you can see, I am defining both an
isComplete and a
completedAt column. Understandably, one might therefore assume that the following SQL statements both select the same data:
SELECT t.* FROM task t WHERE -- Using "isComplete" to filter query for COMPLETED tasks. t.isComplete = TRUE ; SELECT t.* FROM task t WHERE -- Using "completedAt" to filter query for COMPLETED tasks. t.completedAt IS NOT NULL ;
Which could mean that I can get
isComplete column altogether and just use the
NULL vs non-
NULL value of the
completedAt column to determine the "completed" state of a given task.
I spent a few minutes going back-and-forth on whether I wanted to do this in my table design. Ultimately, I decided to include both columns because I felt that using a single
completedAt column would be overloading the semantics of the column. Meaning, the
completedAt column would be using a single value to store two different pieces of information:
Is the task complete?
When was the task marked as completed?
The problem that I couldn't get past was that these two pieces of information may be incidentally linked in some cases but are not inherently linked in all cases. I kept asking myself questions about the history of the data and the application in which it is being consumed (remember, this was just a thought-experiment so I didn't have all that information on hand):
completedAtcolumn added from the start? Or, was it added later on in the application development life-cycle? And, if it's the latter, were there existing records wherein I knew that the task had been completed (via
isComplete) but I didn't actually have a record of when the task was completed?
If a completed task is marked as incomplete, do I necessarily need to
completedAtcolumn? Or, can I use that column to indicate when the task was "last marked as complete" - something that I might want to show the user within the application interface?
In the future, might I want to move the
completedAtcolumn into another table altogether so that I can create an "audit trail" regarding the "Who" and "When" of task completion?
Obviously, there's something to be said about solving the problem you have in front of you and not prematurely optimizing for scenarios that don't (yet) exist. But, the fact that I had all of these questions floating around in my head felt a bit like a "code smell" - an indication that using a single column to store multiple pieces of semantically-separate information was likely not the right choice.
There's also something to be said about the fundamental readability of the SQL itself. I would argue that most people - myself very much included - find a positive assertion easier to read and to understand than a negative assertion. Meaning:
isComplete = TRUE
... has quite a bit less cognitive load when compared to:
completedAt IS NOT NULL
And, since storage is cheap and we simple humans spend the majority of our time reading code, not writing code, I would argue that the approach with the least cognitive load is implicitly the better option.
Database schema design isn't simple. And, it certainly isn't static - it must evolve in lock-step with the needs of the application. It's also not cut-and-dry - there are usually multiple ways to solve the same problem. And, in my case, I believe two columns with slightly different semantics is better than a single column with somewhat overloaded semantics. Your mileage may vary.
Want to use code from this post? Check out the license.