Last night, I finally finished reading Release It! Design and Deploy Production-Ready Software by Michael T. Nygard. Now, given my recent reviews of Scalability Rules and Building Microservices, you might just assume that this is the latest in a series of books that I'm reading about web application and system design. But, you'd be wrong - this was actually the first one. It's taken me close to two-and-a-half years to finish reading this book (I purchased it July 2014). This is a great book. Unfortunately, I didn't have the emotional fortitude to finish it all at once. Instead, I had to spread the "joy" of this book over several years.
Release It! was my "come to Jesus" book. It was the first book that I read that really explored the complexity of building scalable and available applications. But, more than that, it was the first book that I read that really articulated - in cringe-worthy detail - just how much everything can go wrong at any moment. I think my first attempt to read Release It! got as far as this passage:
One way to prepare for every possible failure is to look at every external call, every I/O, every use of resources, and every expected outcome and ask, "What are all the ways this can go wrong?" Think about the different types of impulse and stress that can be applied:
* What if I can't make the initial connection?
* What if it takes ten minutes to make the connection?
* What if I can make the connection and then it gets disconnected?
* What if I can make the connection and I just can't get any response from the other end?
* What if it takes two minutes to respond to my query?
* What if 10,000 requests come in at the same time?
* What if my disk is full when I try to log the error message about the SQLException that happened because the network was bogged down with a worm?
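To make the first few of those questions concrete, here is a minimal Python sketch of a call that bounds both the connection attempt and the wait for a response. The function name `query_with_timeouts` and its default timeout values are my own assumptions for illustration; nothing here comes from the book.

```python
import socket


def query_with_timeouts(host, port, payload, connect_timeout=2.0, read_timeout=2.0):
    """Send payload and return the first chunk of the reply.

    Hypothetical helper: raises socket.timeout (an OSError) if connecting,
    sending, or waiting for the reply takes longer than the limits given.
    """
    # Bound the time spent on "What if I can't make the initial connection?"
    with socket.create_connection((host, port), timeout=connect_timeout) as conn:
        # Bound the time spent on "What if I just can't get any response?"
        conn.settimeout(read_timeout)
        conn.sendall(payload)
        reply = conn.recv(4096)
        if not reply:
            # Connected, then disconnected without an answer.
            raise ConnectionError("peer closed the connection without responding")
        return reply
```

The point of the sketch is only that every blocking step has an upper bound, so a stalled remote system costs the caller a couple of seconds instead of a thread that never comes back.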
I'm pretty sure that I had a minor panic attack (seriously not joking) after this set of questions and had to stop reading the book (for the first time). If I think back to where we were two-and-a-half years ago, InVision was such a different company. We were still just a handful of engineers trying to manage a web application that was becoming wildly popular. Nothing was automated. Nothing was scalable. Nothing was a microservice. I still remember late nights where the invaluable Jon Dowdle and I would perform rolling updates by manually removing servers from the F5 load balancer so that we could RDP into those servers and manually restart processes in order to assimilate JAR file updates.
Even now, my heart is beginning to race just thinking about it.
Part of what makes Release It! so effective - and so terrifying for me - is that it isn't just a technical exploration of system design. The book's teachings are underscored by several accounts of "from the trenches" failures that Nygard experienced first-hand. Over and over again, Nygard walks the reader through whole-day and multi-day failure scenarios that cost his employers hundreds of thousands of dollars.
There goes my heart again.
In retrospect, I wish that I had muscled through my discomfort and finished the book. There's quite a bit of value in it beyond the war-stories that he shares. The primary lesson of the book hinges on the fact that inter-system communication is basically the cause of all your problems.
Integration points are the number-one killer of systems.
If you've heard of the concept of a Circuit Breaker, it's probably because of this book. Michael Nygard popularized the concept of Circuit Breakers, which allow a system to start "failing fast" when it believes external systems have stopped working. A Circuit Breaker is just one of the ways a system can protect itself - it's one of the ways that a system can rightly express its own cynical nature.
Enterprise software must be cynical. Cynical software expects bad things to happen and is never surprised when they do. Cynical software doesn't even trust itself, so it puts up internal barriers to protect itself from failures. It refuses to get too intimate with other systems, because it could get hurt.
By being cynical, an application can prevent problems from a downstream system from propagating upward, towards the end-user.
Cascading failures require some mechanism to transmit the failure from one layer to another. The failure "jumps the gap" when bad behavior in the calling layer gets triggered by the failure condition in the called layer.
Cascading failures often result from resource pools that get drained because of a failure in a lower layer. Integration Points without Timeouts is a surefire way to create Cascading Failures.
Just as integration points are the number-one source of cracks, cascading failures are the number-one crack accelerator. Preventing cascading failure is the very key to resilience. The most effective patterns to combat cascading failures are Circuit Breakers and Timeouts.
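For anyone who hasn't seen a Circuit Breaker before, here is a minimal Python sketch of the "failing fast" idea described above. It is my own simplified illustration, not Nygard's implementation: the class name, the consecutive-failure threshold, the cool-down period, and the injectable `clock` are all assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after_s = reset_after_s          # cool-down before a trial call
        self.clock = clock                          # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Open: fail fast instead of tying up resources on a dead system.
                raise CircuitOpenError("failing fast; dependency presumed down")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()       # open (or re-open) the circuit
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In use, the integration point gets wrapped rather than called directly, e.g. `breaker.call(fetch_profile, user_id)`, where `fetch_profile` stands in for whatever remote call you are protecting. Once the breaker opens, callers get an immediate `CircuitOpenError` instead of waiting on a downstream system that has stopped answering.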
What I love is that Nygard will articulate these problems and then drop some minor comment that shines a spotlight on all of the secret insecurity that you harbor about your application:
Hope is not a design method.
Is it getting tachycardic in here, or is it just me?
But in all seriousness, I really did enjoy this book and would certainly recommend it. In some ways, books like Scalability Rules are more practical; but, I think a book like Release It! does a wonderful job of codifying problems and demonstrating just how important good application design is for the success of your business.
Before I end, I wanted to share just a few more passages that I found particularly interesting. To start, this book was the first time that I've ever read the original quote pertaining to "premature optimization":
C.A.R. Hoare famously said, "Premature optimization is the root of all evil." This has often been misused as an excuse for sloppy design. Hoare's full quote said, "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." His true warning was against chasing small gains at the expense of complexity and development time.
The problem is that optimization happens late, which often means "not at all" when schedules are tight. (When aren't they tight?) Furthermore, optimization can increase the performance of individual routines by percentages, but it cannot lead you to fundamentally better designs. You would never optimize your way from a bubble-sort to a quicksort. Choosing a better design or an architecture optimized for scaling effects is the opposite of premature optimization; it obviates the need for optimization altogether.
I think that really puts the concept in a different light - at least from how I am used to hearing about "premature" optimization.
I also thought this quote about design was very interesting, especially as someone who fancies himself a product designer:
In "The Evolution of Useful Things", Henry Petroski argues that the old dictum "form follows function" is false. In its place, he offers the rule of design evolution, "form follows failure." That is, changes in the design of such commonplace things as forks and paper clips are motivated more by the things early designs do poorly than those things they do well. Not even the humble paper clip sprang into existence in its present form. Each new attempt differs from its predecessor mainly in its attempts to correct flaws.
And then, there was this quote about an engineer's non-tangible connection to their systems - this really hit home for me:
Experienced engineers on ships can tell when something is about to go wrong by the sound of the giant diesel engines. They've learned, by living with their engines, to recognize normal, nominal, and abnormal. In their case, they cannot help being surrounded by the sounds and rhythms of their environment. When something is wrong, the engineers' knowledge of the linkages within the engines can lead them to the problem with a speed and accuracy - and with just one or two clues - in a way that can seem psychic.
To be honest, this is how I feel about InVision. Well, maybe more so when we had Fusion Reactor installed on all of the ColdFusion servers. I used to spend so much time looking at the metrics for the system that I swear I could just "feel" when something wasn't right. If the database activity looked "wrong" or the response times "seemed off", I would know. I would just feel it. Unfortunately, I could never describe the problem well, so people rarely took me seriously; but like a momma-bear, I would just intuitively know that something was wrong with my baby cub.
Nowadays, InVision is too large and distributed to have that kind of emotional bond. But, there are still portions of the application that I can connect with at that level. For certain aspects, I can see error rates, CPU graphs, and statsD charts and know, at a glance, if anything is wrong. But, those experiences are fewer and farther between.
And then, of course - OH MY CHICKENS - so much this:
Integration databases -- don't do it! Seriously! Not even with views. Not even with stored procedures. Take it up a level, and wrap a web service around the database. Then make the web service redundant and access through a virtual IP. Build a test harness to verify what happens when the web service is down. That's an enterprise integration technology. Reaching into another system's database is just ... icky.
Nothing hobbles a system's ability to adapt quite like having other systems poking into its guts. Database "integrations" are pure evil. They violate encapsulation and information hiding by exposing the most intimate details about a system's inner workings. They encourage inappropriate coupling at both the structural and semantic levels.
I can't tell you how many of my refactoring attempts have been hampered by the sudden realization that some other "microservice" (he says with disdain) reaches directly into "my" database. This is one of my biggest sources of sadness and frustration.
Well, that and "debug" style log entries:
While I'm on the subject of logging levels, I'll address a pet peeve of mine: "debug" logs in production. This is rarely a good idea and can create so much noise that real issues get buried in tons of method traces or trivial checkpoints.... I recommend adding a step to your build process that automatically removes any configs that enable debug or trace log levels.
Our application emits so much data that it can make it difficult to pinpoint any problem. This is particularly frustrating during an incident where I am digging around in logs for an area of the application with which I am not familiar. In such cases, you're likely to hear me utter phrases like, "Oh my god - why on earth are they logging entire Node.js stream objects? Do they hate me? Is that why they do it?".
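Nygard's suggested build step could be sketched roughly like this in Python. The properties-style config format (one `logger.name=LEVEL` entry per line) and the function name `strip_noisy_levels` are assumptions made for the example; a real build would need to match whatever logging configuration your platform actually uses.

```python
import re

# Assumption: a properties-style logging config where each entry looks like
# "logger.name=LEVEL" or "logger.name=LEVEL, appender". Comment lines start with "#".
_NOISY = re.compile(r"^\s*[^#=\s][^=]*=\s*(DEBUG|TRACE)\b", re.IGNORECASE)


def strip_noisy_levels(config_text):
    """Return config_text without the entries that enable DEBUG or TRACE logging."""
    kept = [line for line in config_text.splitlines() if not _NOISY.match(line)]
    return "\n".join(kept)


sample = "log.root=INFO\nlog.app=DEBUG, file\n# log.db=TRACE\nlog.http=trace"
print(strip_noisy_levels(sample))
```

Commented-out entries are left alone, and only lines that actively turn on debug or trace output are dropped before the config ships.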
This whole book is valuable; but, these were just a few of the passages that struck a particular chord or seemed to stick like burs to my mental socks. Overall, definitely a recommended book, especially for anyone that deals with large web-based application development.