SQL Queries That Look The Same Are Not Violating The DRY Principle
A couple of months ago, I tried to list out several lessons that I wish I had learned much earlier in my web development career. One of the more important lessons in that list is the proper application of the DRY principle; otherwise known as the, "Don't Repeat Yourself", principle. After writing that post, it occurred to me that it would be worth exploring SQL queries in the context of DRY. At first, it can feel like two SQL queries represent duplicate code if they contain the same query. But, this does not violate DRY; not unless the queries contain "business logic."
Imagine a SQL query in your application that looks like this:
SELECT id, name, email FROM account WHERE id = 1
Now, imagine a second query in the same application that looks like this:
SELECT id, name, email, passwordHash -- This field is the only difference. FROM account WHERE id = 1
To the newer developer, these two SQL queries are essentially the same. Which is why this scenario feels so much like a violation of the DRY principle. This is especially true when you are first learning about Design Patterns and are very eager to refactor everything you see into the clean abstractions that you just read about.
Trust me, I know how that feels. I've been there. I've used Design Patterns that I didn't fully understand. I've made my life much harder than it has to be.
The trap here is that the developer inappropriately applies the DRY principle to "syntax". Meaning, the developer sees the same (or similar) series of tokens, references, and operators and they believe that this represents "duplication". But, the DRY principle applies to the duplication of business logic, not the duplication of "code".
These queries have no inherent business logic. They are just data-access.
If we zoom-out from the queries a little bit, the picture starts to come into focus. Let's step up one level in the data-access context:
<!--- Our first query. ---> <cfquery name="local.results" cachedwithin="#one_hour#" datasource="#read_replica#"> SELECT id, name, email FROM account WHERE id = 1 </cfquery> <!--- Our second query. ---> <cfquery name="local.results"> SELECT id, name, email, passwordHash -- This field is the only difference. FROM account WHERE id = 1 </cfquery>
Now that we see a bit more of the context, albeit a heavily truncated view, we can see that these two queries are executing under very different constraints. The first query is running against a read replica and is also using some low-level ColdFusion caching. The second query, on the other hand, is running against the master database.
What we begin to understand is that these two queries change for different reasons, which is a great indication that "duplication" is not taking place. The first query is clearly being used in a context in which replica lag and stale data is an "acceptable" trade-off. The second query, on the other hand, is being run in a context where only immediately-consistent data is acceptable for consumption.
Now, to be clear, I am not saying that you should be continually duplicating query logic all over your application. If you have query logic that can be wrapped-up and reused, by all means do it:
<cffunction name="getByFilter" returntype="query"> <cfargument name="id" type="numeric" required="false" /> <cfquery name="local.results"> SELECT id, name, email, passwordHash FROM account WHERE 1 = 1 <cfif structKeyExists( arguments, "id" )> AND id = <cfqueryparam value="#id#" cfsqltype="cf_sql_integer" /> </cfif> </cfquery> <cfreturn results /> </cffunction> <!--- .... then, elsewhere in your application .... ---> <cfscript> var account1 = someAbstraction.getByFilter( id = 1 ); var account2 = someAbstraction.getByFilter( id = 2 ); var account3 = someAbstraction.getByFilter( id = 3 ); </cfscript>
Here, we've created an abstraction that allows us to simplify some of our calling code. But, this is not an exercise in "DRYing the code out"; nor does this fundamentally change the situation - it has only changed the "syntax" of the data-access. No "logic" has been encapsulated - we've only created a "short-hand" for a query that lends itself to re-use.
You can think of this change more along the lines of the "Extract Method" refactoring pattern, not an exercise in removing duplication. It is an implementation detail only; granted, one that happens to reduce the total lines-of-code (LOC) and will likely lead to greater developer velocity.
The DRY principle is about "business logic". If a SQL query contains business logic, it should not be duplicated. Take, for example, the following deletion query:
INSERT IGNORE INTO account_archive SELECT id, name, email, ( '[sanitized]' ) AS passwordHash FROM account WHERE id = 1 ; DELETE FROM account WHERE id = 1 ;
This query is not just data-access. This query contains business logic that defines how two concepts - "account" and "account archive" - have to change in lock-step. It also defines the very act of deletion as a "soft action"; meaning, the data is not removed from the database, only moved around.
This kind of logic should be in one place and once place only. Duplicating this particular query would be a duplication of business logic; it would violate the DRY principle; and, it would almost certainly lead to bugs and inconsistent data over time.
Once you realize that duplicating SQL is not, inherently, a violation of the DRY principle, it frees you up to embrace "performance as a feature". It grants you permission to write queries that are optimized for a particular use-case; queries that change for different reasons; and, queries that are easy to delete. As long as you are not duplicating "business logic", you should feel free to duplicate as much SQL as you feel is necessary.
Epilogue: What if my SQL queries contain lots of conditions?
If a SQL query contains a lot of conditions, it can be tempting to view those conditions as "business logic". For example:
SELECT id, name, createdAt, updatedAt FROM project WHERE type = 'simple' AND isLocked = 0 AND isArchived = 0 AND isSample = 0 ;
If this set of
WHERE clauses were to show up in two different queries in the same application, it's easy to feel like some sort of duplication is taking place. But, it is not. This is not "business logic" - this is just "data-access". You happen to be querying for data that adheres to these conditions in this particular context. And, later on, you might need to query for data in a different context that is slightly different.
The trick it to ask yourself the question: Is it possible that one query will ever need to change while the other one doesn't. If so, then these two queries change for different reasons and are not violations of DRY.
Epilogue: What if I make a fundamental change and have to update a lot of SQL queries?
Dovetailing with the query above, applications evolve over time. That's natural. And, as an application evolves, so too will its queries. Imagine creating an application where the concept of "archiving" an entity didn't get introduced until much later in the application life-cycle. There's no doubt that you would have to add an
isArchived check to many of the existing SQL queries.
Having to make a change in a lot of places may feel like "duplication"; or rather, a byproduct of duplicated logic. But, it is not. A fundamental change to an application is never easy. And, often involves touching a lot of code. That's what makes the change, "fundamental".
When making such a change to an application, each interface has to be reviewed and evaluated uniquely in the context of said change. For example, an
isArchived check may need to be added to a "user's view" of the application; but, probably not to an "Support administrator's view". As such, fundamental changes can rarely be applied in as a sweeping update; instead, fundamental changes demand a holistic view of the application landscape.
Epilogue: What if I forget to add a necessary condition to a SQL query?
That's a bug. Bugs happen. If you can use the "Extract Method" refactoring pattern, it may help reduce the likelihood of bugs in the future. But, if you can't, just fix the bug and move on. Bugs are not inherently an indication of duplicated business logic.
Want to use code from this post? Check out the license.
If the same query is used for multiple customers and one customer has a new requirement, you also have to take into consideration if the change will cause an unavoidable performance hit ( no matter how small) if it is worth it to make a version just for that customer. Now in the case of business logic, a flexible rules engine can be a lifesaver.
Exactly! At work, we actually had a somewhat similar thing happen recently. One team had to add a query for an edge-case; but, there was no index that made the query fast. And, while it only ran a few times, it had to do a full-collection scan each time. The burden on the Database was so great that it ended up crashing the service, even tough most of the service didn't even care about that query.
In the example below:
Just keep in mind if you rely heavily on queries with multiple dynamic criteria, the database server may cache each rendered version of the passed sql separately, adding to the amount of cached queries on your server causing them to rotate much faster than anticipated.
One approach that adds a bit of verbose code but helps, is to write the sql in a way that the criteria passed determines what is filtered but the passed sql statement remains the same:
Very useful information. I would usually add an 'isArchived' check right from the start. So, these kind of issues should be resolved during the application blue printing phase. I think the trick is to imagine what sort of features, an application is likely to require, at the beginning of an application build. This is where experienced developers become invaluable, because they are able to draw on decades of knowledge, wisdom, failure & success.
However, I do acknowledge that it is virtually impossible to predict absolutely every feature that might or might not be required during the lifetime of an application, so understanding things like the DRY principle, is definitely an added benefit when constructing scalable systems.
That's super interesting! I know that one of the benefits of using the parameterized queries in the caching; but, I don't think I ever considered the fact that dynamic
WHEREclauses actually create multiple versions of the query that is passed over-the-wire and therefore different queries that are cached. Outstanding point!
And, the fact that you are setting the value to
nullwhen the argument doesn't exist -- so well thought-out. Because that is something that I've actually run into before (realizing that
<cfqueryparam>"executes" even if it is inside a SQL conditional because, obviously it does that, in retrospect).
I still write the vast majority of my SQL in tag-based components using
<cfqueryparam>, so I am gonna noodle on how I might want to format this in that context. You've given me much to think about.
So much this:
Having more experience people involved in the discussion of a project is so helpful. Actually, it's helpful to have both experienced and novice people. Having that dichotomy can be great for sparking discussions. The advanced devs can share insights and the novice devs can ask for more information and explanation. This forces everyone to really think through an idea rather than taking anything at face-value.
And, hopefully, with more experience, you can get better at wearing both hats - pulling on your past experience, but then challenging yourself to really confirm that your ideas are worth adding.
I am fascinated by your approach. I had to sit down and try it out:
One byproduct of this approach is that it would really open up the possibility for me to move to an all CFScript syntax. And that, in and of itself, might be enough to sell me on this concept.
I also feel like, in a way I can't really articulate, makes the dynamic query easier to reason about. Since the query is always the same, it feels like there's less mental overhead.
I ran into this approach once we realized that our database was caching too many prepared statements needlessly, some search queries had multiple possible criteria for custom user dashboards, for example.
I always have been a big proponent of letting the database do what is best for, especially large record set retrievals and our search queries were meant to help expedite performance and not bog down the server.
Glad to be able to contribute to all the great discussions on your blog buddy, keep them coming!
Well, it has certainly given me a lot to think about.
A few years ago, I needed to build parameterised queries dynamically.
It came about when I needed a routine to add over 250 global setting column values to a MySQL table. But, the routine would never know ahead of time, how many of the 250, would need updating.
People kept telling me that it couldn't be done, which is like a red rag to a bull, for me. So, I built a little library that allows one to pass in an unlimited amount of WHERE clause parameters.
But, the reason I wanted to join the discussion, is that I built these functions, using CF tags rather than CFScript. So, it is possible to do this kind of thing, in either. Ironically, I use CFScript much more, nowadays, but at the time, CFScript was like a foreign language to me:)
Anyway, I was going to post one of the functions that converts the WHERE clause params, but decided against it, because it is quite lengthy.
The only thing, I didn't build in, was the ability to handle sub queries, but it handles INNER JOINS etc.
I am just wondering whether this kind of approach would be an example of DRY programming?
That sounds like an intense piece of SQL :D If you want to share it, I'd suggest putting it in a GitHub Gist and then sharing the Gist URL here -- that way, you can also edit it if you need to in the gist.
Ha ha, I 100% understand this mentality :D In some ways, it's also why I change technologies very slowly - I am always convinced I can solve the problems using the technologies I already know.
As far as "DRY", I am not sure it relates to your query mechanics. It sounds like you had to create something complex to solve a complex problem. In and of itself, though, I wouldn't be able to say of that speaks to DRY.
DRY, YAGNI, Rule of Three, and certain statements made by a certain Algorithms professor at Stanford often get taken out of context. And worse, as hard and fast rules to which we must adhere at all costs.
I was just talking about this to a group of developers. I had given them a code challenge where they were going to have to sacrifice something to get a good result. Whether doing something in a very unorthodox manner, or having a very poorly performing application, or whatever, something was going to have to go. The point was to show that these concepts are ideas to evaluate in the context of what makes sense for your application. And by extension, the purpose of a software developer isn't to write code. We're writing code that does that. Our purpose is to decide what is important and what is not when that code gets written.
100%. In a different post that I recently wrote about indexes, I made sure to keep iterating, index design is about trade-offs. But, really, so is all programming. We are constantly trying to balance many competing forces that affect both immediate and long-term development of the application.
One of the most frustrating things that I've had to deal with is when developers continue to augment an existing query because they think that is "better" than writing the same query twice in two different context. So, what started off as a query of the
usertable slowly gets joined to
account... and then to
subscription... and then to
password_policy... and then has calculated columns .... and then as sub-queries.
And then, before you know it, a "simple query" that gets called on every page request, suddenly has 8-JOINs in it and 3 sub-queries for calculated columns and it becomes a tight coupling hot mess!!!