In my ongoing journey to better understand monitoring and logging (my big hairy goal for 2018), I just read Practical Monitoring: Effective Strategies for the Real World by Mike Julian. I really enjoyed this book. Not only does the author write with a friendly and welcoming tone, he strikes an excellent balance between breadth of content and depth of detail. Simply, he makes monitoring feel like an approachable landscape, despite the fact that it is multi-faceted and requires a great deal of effort to implement in a meaningful way.
Last week, I reviewed the book, Effective Monitoring and Alerting by Slawek Ligus. Whereas Effective Monitoring was a deeper dive into the technical details of configuring and maintaining monitors and alerts for Operations, Practical Monitoring by Mike Julian approaches monitoring and alerting from a much broader, more holistic viewpoint. Mike does touch on many of the same metrics covered in Effective Monitoring, including a brief discussion of counters, timers, gauges, percentiles, and standard deviations. But, neither book really left me feeling like I had a solid understanding of how statistics truly empower me in the context of monitoring.
But, I now believe that this is due to false assumptions on my part, not a shortcoming of the books. As someone who is relatively new to monitoring and alerting, I had hoped that if I could just wrap my head around timing metrics and counters then, suddenly, the systems that I was observing would start "working". Mike quickly dispels this notion:
Many people I've spoken to over the years are operating on the misapprehension that "rubbing a little stats on it" will result in magic coming out the other end. Unfortunately, that isn't quite the case. (Kindle Location 889)
I had hoped that if I could pick just the right statistic and observe it within just the right timing window and apply just the right smoothing and anomaly detection algorithms, then my alert would tell me that "everything is OK." I focused so heavily on the stats because I knew, in my heart, that the system was not OK. And, part of me wanted to believe that I could find the right maths to make it OK.
As a company (or perhaps this is just my personal struggle), I believe that we became so obsessed with the idea of "measuring everything" that we lost sight of the fact that metrics don't actually solve any problems. Metrics answer questions and offer insights; but, they won't actually improve the state of an application:
I once worked with a team that ran a legacy PHP app. This app had a large amount of poorly written and poorly understood code. As things tended to break, the team's usual response was to add more monitoring around whatever it was that broke. Unfortunately, while this response seems at first glance to be the correct response, it does little to solve the real problem: a poorly built app.
Avoid the tendency to lean on monitoring as a crutch. Monitoring is great for alerting you to problems, but don't forget the next step: fixing the problems. If you find yourself with a finicky service and you're constantly adding more monitoring to it, stop and invest your effort into making the service more stable and resilient instead. More monitoring doesn't fix a broken system, and it's not an improvement in your situation. (Kindle Location 300)
In a weird way, this may have been the most meaningful statement in the book. It was the reality check that I needed! It was the reminder that, as Mike bluntly puts it (quoting his colleague), at some point, you just have to fix your shit! This may be obvious to those that have more experience; but, I have to admit that I got so mired in the very act of measuring the system that I forgot - or didn't want to acknowledge - that measuring is only the tip of the iceberg.
Monitoring doesn't fix anything. You need to fix things after they break. To get out of firefighting mode, you must spend time and effort on building better underlying systems. More resilient systems have less show-stopping failures, but you'll only get there by putting in the effort on the underlying problems. There are two effective strategies to get into this habit:
- Make it the duty of on-call to work on systems resiliency and stability during their on-call shift when they aren't fighting fires.
- Explicitly plan for systems resiliency and stability work during the following week's sprint planning / team meeting (you are doing those, right?), based on the information collected from the previous on-call week. (Kindle Location 762)
As someone who is currently in the stage of bolting metrics on after the fact, I also enjoyed the notion that monitoring and alerting is part of what makes a feature "production ready":
Strive to make monitoring a first-class citizen when it comes to building and managing services. Remember, it's not ready for production until it's monitored. The end result will be far more robust monitoring with great signal-to-noise ratio, and likely far better signal than you've ever had before. (Kindle Location 253)
... but, again, I truly appreciated the reality check that not every metric needs to feed into an alert:
I have found that many people seem to build monitoring without understanding its purpose. They seem to believe that the driving purpose of a monitoring system is to alert you when things go wrong. While older monitoring systems like Nagios certainly lead you to that conclusion, monitoring has a higher purpose. As a friend of mine once said:
"Monitoring is for asking questions." Dave Josephsen, Monitorama 2016
That is, monitoring doesn't exist to generate alerts: alerts are just one possible outcome. With this in mind, remember that every metric you collect and graph does not need to have a corresponding alert. (Kindle Location 525)
Now, I don't want you to think that the entire value-add of this book is that it forced me to question my assumptions. That was really important for me on my own personal journey. But, there's more to this book than a few paragraphs of hard-hitting commentary. Mike touches on monitoring across every layer of the application, from the browser to the web servers to the application servers to the database, and all the way through to network security, compliance, and intrusion detection systems. He even discusses things I had never heard of before, like "network taps" that act as a sort of "man in the middle" observer within your own private network.
Some of these topics are covered in lesser or greater depth. But, in each case, Mike paints a picture of the landscape and, at the very least, offers some advice on how to learn more about the topic; and, what tools may be helpful.
And, with each topic, Mike always comes back to the question of "Why?" Why are we measuring some value? What purpose does it serve? How does it make the application better? He is continually taking the technical discussion and reframing the conversation from a pragmatic viewpoint. He does this both from the user's perspective, recommending that we measure user touchpoints first:
The best place to add monitoring first is at the point(s) users interact with your app. A user doesn't care about the implementation details of your app.... One of the most effective things to monitor is simply HTTP response codes (especially of the HTTP 5xx variety). Monitoring request times (aka latency) after that is also useful. Neither of these will tell you what is wrong, only that something is and that it's impacting users. (Kindle Location 551)
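To make those two signals concrete, here is a minimal Python sketch of how a 5xx error rate and a latency percentile could be computed from a batch of request samples. This is an illustration only, not code from the book: the `RequestSample` type, the function names, and the nearest-rank percentile method are all assumptions made for the example, and a real setup would more likely lean on whatever metrics library or monitoring service is already in place.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    """One observed request (hypothetical shape, for illustration only)."""
    status: int        # HTTP response status code
    latency_ms: float  # time taken to serve the request, in milliseconds


def error_rate_5xx(samples):
    """Return the fraction of requests that ended in an HTTP 5xx response."""
    if not samples:
        return 0.0
    errors = sum(1 for s in samples if 500 <= s.status <= 599)
    return errors / len(samples)


def latency_percentile(samples, pct):
    """Return the pct-th percentile of latency using the nearest-rank method."""
    if not samples:
        raise ValueError("cannot compute a percentile of zero samples")
    ordered = sorted(s.latency_ms for s in samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = max(math.ceil(pct / 100 * len(ordered)), 1)
    return ordered[rank - 1]


samples = [
    RequestSample(200, 10.0),
    RequestSample(200, 20.0),
    RequestSample(500, 30.0),
    RequestSample(503, 40.0),
]
print(error_rate_5xx(samples))          # share of requests that failed server-side
print(latency_percentile(samples, 95))  # latency that 95% of requests stayed at or under
```

As the quote says, neither number explains what is wrong. They only indicate that something is wrong and that users are feeling it.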
But, he also does this from the business' perspective, recommending that we, as engineers, understand what the Key Performance Indicators (KPIs) are for the business and how we can measure them such that we can spot trends over time:
We learned that starting your monitoring efforts from the outside, rather than deep in the bowels of the infrastructure where most people start, is a far better approach as it provides you with immediate insight into the actual questions people are asking ("Is the site up?" "Are users impacted?") and sets the stage to iteratively go deeper.
The questions asked by business owners are often vastly different than those asked by software engineers or infrastructure engineers, and I think this is an area where we as engineers can improve our skills and understanding. Once we learn to ask the questions the executives are asking, we can really begin to work on the most important and highest-leverage problems facing the business.
.... A key performance indicator (KPI) is a metric that measures how your company is doing along lines the company has deemed important to the health of the business as a whole. A KPI, like a performance metric does for the app and infrastructure, tells you how your business is doing. Also like performance metrics, some metrics can be rather fuzzy about what they tell you and may require some degree of judgment in order to make decisions with them. (Kindle Location 1033)
To underscore the concept of KPIs, Mike lists a few questions that a high-level executive might ask:
- Are customers able to use the app/service?
- Are we making money?
- Are we growing, shrinking, or stagnant?
- How profitable are we? Is profitability increasing, decreasing, or stagnant?
- Are our customers happy?
To be honest, I don't know how to measure some of these values. At InVision, many of these questions are handled by non-product-engineers. For example, our Customer Success Team knows more about customer happiness by performing customer interviews and watching NPS (Net Promoter Score) results. And, our Data Science and Growth teams know more about profitability, customer cost, and customer lifetime value (LTV). As a product engineer, I think it would be great to understand more about the business. I have no doubt that it would help color my view of the product and of the decisions I make (and of the impact that those decisions have).
Hopefully you're beginning to see the 360-degree perspective that Mike Julian takes in this book. Monitoring and alerting isn't just about metrics. It isn't just about alerts. It isn't just about request tracing or network security or SOC2 compliance. It isn't just about driving the business forward and creating a better user experience.
It's about all of that!
And, as he closes the book, Mike reminds us that this is just the beginning:
Monitoring is never done, since the business, application, and infrastructure will continue to evolve over time. (Kindle Location 2579)
Practical Monitoring: Effective Strategies for the Real World by Mike Julian is a quick and easy read. I read it in about 3 days. And, I got a lot of value out of it. If you're interested in monitoring your web application, it's definitely a book that I would recommend.