Over the weekend, I read Distributed Systems Observability: A Guide To Building Robust Systems by Cindy Sridharan. It's a short e-book (about 36 pages), provided as a free download by Humio (in collaboration with O'Reilly Media). And, while it's short, Sridharan paints a good overview of the complexities of observing a system that's layered on top of the highly dynamic, spin-up/spin-down cloud-based infrastructures available today. And for me, as someone who is still very much a novice in this brave new world, continuous reminders like this e-book help keep my ship pointed in the right direction.
First off, let me just point out one line from this book that every developer should have tattooed on the inside of their eyelids:
... research has shown that something as simple as "testing error handling code could have prevented 58% of catastrophic failures" in many distributed systems. (Page 14)
Sridharan brings this up in the context of "distributed systems"; but, this take-away can be universally applied to all programming situations, big and small. I literally can't stress this enough. Literally! Designing for failure is only helpful if you actually test the failure. And, I'm not talking about "Chaos Testing" or complex integration testing. I'm talking about things like writing a try/catch statement and then explicitly throwing an Error to make sure that your catch-block actually works as expected. It's just that simple. And yet, it's never as obvious as it should be.
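To make that concrete, here's a minimal sketch (in TypeScript, with hypothetical function names) of what I mean: temporarily throw an Error at the top of the try-block just to prove that the catch-block actually behaves the way you expect:

```typescript
function saveUserPreferences(prefs: object): string {
	try {
		// TEMPORARY: throw before the real work runs, purely to verify
		// that the catch-block below executes and returns a sane fallback.
		throw new Error("Simulated failure - testing the catch-block.");

		// ... the real persistence logic would go here ...
	} catch (error) {
		// If this branch has a bug, we want to find out now - not in
		// production, when a real failure finally triggers it.
		console.error("Could not save preferences:", error);
		return "save-failed";
	}
}

// Exercising the failure path on purpose:
console.log(saveUserPreferences({ theme: "dark" })); // prints "save-failed"
```

Once you've seen the catch-block do the right thing with your own eyes, you delete the temporary throw and move on.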
I find it very helpful to write my try/catch/finally blocks first and code them to do little more than turn around and call other methods. Not only does this provide a nice separation of concerns, it ensures that I have at least some error handling in place. And, it primes my brain for the expectation that errors will happen; and, that my code needs to embrace them rather than deny them.
Error-handling soap-box aside, one of the most salient points that Sridharan makes in her book is that we really have to adjust the way that we think about alerting. When we're building containerized systems with health checks, readiness probes, liveness probes, etc., we're building systems that are designed to fail and recover. Which puts us in the somewhat uncomfortable position of having to embrace failure as both an expected and an acceptable outcome (can you see the trend appearing?).
Building systems on top of relaxed guarantees means that such systems are, by design, not necessarily going to be operating while 100% healthy at any given time. It becomes unnecessary to try to predict every possible way in which a system might be exercised that could degrade its functionality and alert a human operator. It's now possible to design systems where only a small sliver of the overall failure domain is of the hard, urgently human-actionable sort. Which begs the question: where does that leave alerting? (Page 7)
This is something that I wrestle with every day as I adjust alert thresholds in our Logging and Time Series Databases. My guiding star - my true North in this sea of confusion - is the point that Sridharan underscores in the book:
... all alerts (and monitoring signals used to derive them) need to be actionable. (Page 7)
When an alert is not actionable, it creates fatigue. And when we suffer from alert fatigue, we end up missing the alerts that really matter.
Another part of the book that really struck a chord in me is the idea that pre-production testing is no longer a sufficient testing plan. As our distributed systems get more complex and unpredictable, it's no longer possible to understand and account for all failure modes that may occur (again, the trend of embracing failure).
However, it's becoming increasingly clear that a sole reliance on pre-production testing is largely ineffective in surfacing even the known-unknowns of a system. Testing for failure, as such, involves acknowledging that certain types of failures can only be surfaced in the production environment. (Page 14)
This notion is front-of-mind for me because we've been talking a lot lately about load-testing at work. Now, granted, my area of expertise (if you can even call it that) is very narrow. But, in my experience, load-testing outside of production has never felt like a fruitful venture. No testing environment is truly like production. And, no set of automation steps can really represent the demand that flesh-and-blood users will place on a system and its underlying data-stores.
As such, I do all of my load-testing in production. With real users. Using incremental roll-outs. I feel safe doing this because I always have the prerequisites that Sridharan outlines in her book:
Being able to test in production requires that testing be halted if the need arises. This in turn means that one can test in production only if one has the following:
- A quick feedback loop about the behavior of the system under test.
- The ability to be on the lookout for changes to key performance indicators of the system.
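Those prerequisites can be sketched as a simple gate around an incremental roll-out. This is my own illustration, not code from the book, and the KPI, threshold, and function names are all hypothetical: keep widening the roll-out only while the key indicator stays healthy, and halt the moment it doesn't:

```typescript
interface RolloutStep {
	percentOfUsers: number;
}

// Hypothetical KPI probe - in a real system, this would query your
// monitoring / time-series database (the "quick feedback loop").
type ErrorRateProbe = () => number;

function runIncrementalRollout(
	steps: RolloutStep[],
	getErrorRate: ErrorRateProbe,
	maxErrorRate: number
): number {
	let reachedPercent = 0;

	for (const step of steps) {
		// Be on the lookout for changes to key performance indicators:
		if (getErrorRate() > maxErrorRate) {
			// Halt the test if the need arises:
			break;
		}
		reachedPercent = step.percentOfUsers;
	}

	return reachedPercent;
}

// Simulated probe: the error rate degrades partway through the roll-out.
const samples = [0.01, 0.02, 0.09];
let sampleIndex = 0;
const probe: ErrorRateProbe = () => samples[sampleIndex++];

const reached = runIncrementalRollout(
	[{ percentOfUsers: 10 }, { percentOfUsers: 50 }, { percentOfUsers: 100 }],
	probe,
	0.05
);
console.log(reached); // → 50
```

The roll-out stops at 50% because the third KPI sample (0.09) exceeds the 0.05 threshold, which is exactly the "halt if the need arises" behavior that makes testing in production tolerable.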
To be honest, this approach to testing has always made me feel a little bit reckless. Or, perhaps more accurately, a bit fraudulent, insofar as I wasn't "smart enough" or "motivated enough" to figure out how to better test in a pre-production environment. This is why it was like a breath of fresh air to read Sridharan's thoughts on the matter. It gave me some permission to shed this load of guilt that I tend to carry around with me.
At 36 pages, I hope you can see that Distributed Systems Observability by Cindy Sridharan is short but dense with information. I think it could be worthwhile for everyone to take an hour and read it. Not only does it emphasize the importance of observability, monitoring, and alerting, it also underscores the idea that these are organization-wide concerns. And, that they are most effective when they are woven into the very fabric of a company's culture. Maybe that cultural change starts with you.