At work, we've fully embraced the concept of "measure everything." We've wrapped StatsD (DogStatsD) timing metrics around all external calls; we record the latency of all web requests; we monitor HTTP status code volumes in nginx; we use New Relic to observe CPU utilization and Garbage Collection frequency. We have system metrics coming in from every which way. But, the reality is, most of the time, I feel like I have no idea what I'm doing. My life has become an exercise in trying to remain vigilant in the face of a flood of false positives. In an effort to bring more meaning to the madness, I've read Effective Monitoring and Alerting: For Web Operations by Slawek Ligus. To be honest, the book wasn't exactly what I was expecting. But, I found it to be fascinating and helpful nonetheless.
I believe that much of my flummoxing is due to two facts: one, our system has some growing pains (that we are continuously addressing); and two, large-scale systems are fundamentally different than small-scale systems. Together, these two conditions create a context in which every little spark feels like a potential fire that could rage out of control at any moment. Which is, needless to say, an exhausting mental state to live in.
As such, I took much comfort in Ligus' statement that continuous errors are just a fact of large-scale systems. In the section on RCA (Root Cause Analysis) documents, Ligus writes:
In large-scale system operation failure is a norm. Transient errors often occur very briefly, sometimes in spikes at unpredictable intervals. Low percentage errors occur continuously during normal system operation and constitute a tiny percentage of failed events (not more than 0.1%) in context of the all successful events.
Both types of errors will crop up at large scale and are seen as potential threats to availability levels agreed in the SLAs. This belief is not unjustified, but it is important to keep a healthy sense of proportion as the real threat to availability comes from long lasting outages and not occasional errors. Despite that fact, there is a tendency for more human effort to be invested in root cause analysis of petty issues than prevention of potentially disastrous outages. (Page 114)
Along the same lines, Ligus recommends that we shouldn't be too quick to alert on recoverable events:
.... recoverable events should not trigger alerts until it becomes apparent that their recovery would take too long or could evolve into more serious problems. (Page 52)
This is really where I'm currently having a lot of trouble. Much of my alerting is based around web-request latency. And, what I monitor is the 95th-percentile of times, maxed over any given cluster. So, while the average latency of any cluster is typically fine, the 95th-percentile latency is very spiky and somewhat concerning. But, inevitably, it recovers.
Now, given the recoverability of my 95th-percentile metrics, I'm left with two questions. Building on what Ligus said above about recoverable events, I'm wondering if I need to increase the time-frame in which my latency threshold is breached (before I alert)? Or, should I be using something other than the 95th-percentile?
On the topic of threshold breaches, Ligus has some specific timing recommendations based on the urgency of the given issue:
The selection of breach and clear delay is almost as important as accurate threshold estimation.
Selecting the number of data points to alarm should reflect alerting goals. For critical and urgent issues it makes sense to alarm as early as possible, because quick response in these cases is vital. For less urgent and recoverable issues, it's okay to wait a little longer, because these are unlikely to immediately catch the operator's attention anyway. Letting a few more data points arrive raises the confidence level that something important is happening and improves precision.
The question is how to set off alarms as soon as possible, but not too soon. Unfortunately, it's hard to get an easy answer along these lines, and the objectives will have to be balanced. The sooner the alarm comes, the more likely it's just an anomaly. Of course, with issues of high criticality it is better to be safe than sorry, but too many false positives lead to desensitization in operators, which has some serious adverse effects.
The following [list] is a recommendation for an allocation of the monitor's breach delay, based on my experience of working in ops teams.
* Super Critical: 1-2 mins ( ex: Shutdown, high-visibility outage )
* High Priority: 3-5 mins ( ex: Partial loss of availability, high latency )
* Normal: 6-10 mins ( ex: Approaching resource saturation )
* Recoverable: 11+ mins ( ex: Failed back-end build )
Here, he recommends using 3-5 minutes for high latency, which is what I'm measuring. But, the reality is, I'm already using [at least] 10 minutes for my breach windows because using anything less increases the number of alerts that I get. So, I'm already using a higher value than he recommends.
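To make the idea of a "breach delay" and "clear delay" concrete, here is a minimal Python sketch of how I picture it working, assuming one data point per minute. The function name, its parameters, and the sample numbers are all hypothetical; this is not code from the book or from any particular monitoring tool.

```python
from typing import Iterable, List


def breach_alarm_states(
    values: Iterable[float],
    threshold: float,
    breach_points: int,
    clear_points: int,
) -> List[bool]:
    """Return the alarm state (True = alarming) after each data point.

    The alarm only fires after `breach_points` consecutive values above the
    threshold, and only clears after `clear_points` consecutive values at or
    below it. Both delays are assumptions chosen by the operator.
    """
    alarming = False
    over = 0   # consecutive data points above the threshold
    under = 0  # consecutive data points at or below the threshold
    states = []
    for value in values:
        if value > threshold:
            over += 1
            under = 0
        else:
            under += 1
            over = 0
        if not alarming and over >= breach_points:
            alarming = True
        elif alarming and under >= clear_points:
            alarming = False
        states.append(alarming)
    return states


# Hypothetical p95 latencies (ms), one per minute.
latencies = [200, 900, 950, 210, 900, 920, 940, 960, 980, 200, 190, 180]
print(breach_alarm_states(latencies, threshold=800, breach_points=3, clear_points=2))
```

With these sample values, the two-minute spike near the start never alarms; the alarm only turns on at the seventh data point, once three consecutive values have breached the threshold. A longer breach delay suppresses more of the short spikes, at the cost of a slower alert.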
Which then speaks to my second question - should I even be using the 95th-percentile for this kind of monitor? According to Ligus, maybe not. When Ligus discusses the different types of metrics and what they're good for, he definitely mentions that the 95th-percentile metric is meant for failures that require immediate intervention. And, for events that have a high degree of recoverability, such as my random web-request latency spikes, he says that an average is preferred to the high-percentiles:
n, sum: good fit for measuring the rate of inflow and outflow, such as traffic levels, revenue stream, ad clicks, items processed, etc.
Average, median (p50): suitable for monitoring a measurement of center. Timeseries generated from these statistics give a feel for what the common performance level is and reliably illustrate its sudden changes. When components and processes have a fair degree of recoverability, an average is preferred to percentiles. When looking for most typical input in the population, the median is preferred to the average.
High and low percentiles: suited to monitoring failures that require immediate intervention. The extreme percentiles of the input distribution can reveal potential bottlenecks early, through making observations about small populations for which performance has degraded drastically. For speedy detection of faults, percentiles are preferred to the average because extreme percentile values deviate from their baseline more readily and thus cross the threshold sooner. (Page 65)
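To see why the high percentiles cross a threshold so much sooner than the average, consider this rough Python sketch. The latency numbers are made up, and the `percentile` helper is a simple nearest-rank implementation that I'm assuming for illustration; real monitoring systems may interpolate differently.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value that is greater than or
    equal to `pct` percent of the samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical minute of traffic: 94 fast requests and 6 slow ones (ms).
latencies = [100] * 94 + [2000] * 6

print("mean:", mean(latencies))            # 214
print("p50: ", percentile(latencies, 50))  # 100
print("p95: ", percentile(latencies, 95))  # 2000
```

Only 6% of the requests are slow, yet the 95th-percentile jumps straight to the slow value while the average barely doubles and the median does not move at all. That sensitivity is what makes the percentile good for fast detection and noisy for recoverable spikes.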
Based on this, and the fact that request latency historically recovers (in our application), I feel like I should be switching from the 95th-percentile to the average; and, lowering my threshold breach window down to 3-5 minutes. After that, I have to figure out what new, reasonable threshold works well for an average.
Which segues nicely into the topic of automation. Ligus would be saddened by the thought of me trying to figure out a "reasonable threshold" for my latency monitors because he believes manual configuration to be a brittle and non-scalable approach to monitoring and alerting. In the book, he recommends that all of the setup and the upkeep of your application's monitors should be managed by an automated system. And, that "reasonable thresholds" should be automatically based on historical data; and, that those thresholds should be automatically adjusted over time using scheduled tasks.
To be frank, this part of the book - which is like 1/3 of the book - went over my head. Here, Ligus talks about creating configuration files that contain meta-data about each monitor. And then, consuming those configuration files in scripts that query the monitoring system for historical data, calculate historical trends, and then use the monitoring system's API to update monitoring thresholds on an ongoing basis.
At work, we do have some configuration-driven dashboard generation. But, nothing near the level of automation that he outlines in the book. In theory, it sounds amazing. In reality, however, I don't have the experience to even begin to know how to get to that kind of a place. But, it sure does sound beautiful!
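For what it's worth, the general shape of that workflow (configuration in, historical data queried, thresholds pushed back out) might look something like the Python sketch below. To be clear, this is my own guess and not Ligus' implementation: the monitor name, the "mean plus a few standard deviations" rule, and the `fetch_history` / `update_threshold` callables are all hypothetical stand-ins for whatever your monitoring system's API actually provides.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List

# Hypothetical monitor meta-data, as it might be loaded from a config file.
MONITORS: Dict[str, Dict[str, float]] = {
    "web.eu.latency.avg": {"sigmas": 3.0, "floor": 250.0},
}


def compute_threshold(history: List[float], sigmas: float, floor: float) -> float:
    """Derive a threshold from historical data: baseline plus `sigmas`
    standard deviations, never lower than `floor`. This rule is an assumption."""
    baseline = mean(history)
    spread = pstdev(history)
    return max(floor, baseline + sigmas * spread)


def refresh_thresholds(
    monitors: Dict[str, Dict[str, float]],
    fetch_history: Callable[[str], List[float]],
    update_threshold: Callable[[str, float], None],
) -> Dict[str, float]:
    """Recalculate and push a threshold for every configured monitor.

    `fetch_history` and `update_threshold` are placeholders for calls to a
    monitoring system; they are injected so this sketch stays self-contained.
    """
    applied = {}
    for name, cfg in monitors.items():
        history = fetch_history(name)
        if len(history) < 2:
            continue  # not enough data to estimate a baseline
        threshold = compute_threshold(history, cfg["sigmas"], cfg["floor"])
        update_threshold(name, threshold)
        applied[name] = threshold
    return applied
```

Run from a scheduled task, something like `refresh_thresholds(MONITORS, fetch_history, update_threshold)` would keep the thresholds tracking the recent baseline without anyone hand-tuning them.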
As a final thought, there was one other recommendation that I found particularly interesting: don't try to be creative with your monitors and alerts. Meaning, don't even try to give them descriptive names or bylines. The names of the alerts should simply describe the system and metric being monitored:
The alarm setup should reflect the logical topography of the system. Giving alarms namespace-like names brings order and allows you to maintain a hierarchical view. Namespacing provides a convenient abstract container and helps you divide and conquer huge amounts of independent monitors and alarms by functional classification. It's a much better idea to give your alarms systematic rather than descriptive names, such as "All throttled requests for the EU website." (Page 83)
... and, the alert should describe the threshold breach:
Put specific symptoms in ticket's title. "Slow response times" is less informative than "Response times p99 exceeded 3 seconds for 3 data points." (Page 72)
I'm tickled by the idea of removing the thinking from this task. Monitoring and Alerting is already difficult enough - why add the overhead of trying to be creative? This isn't something I'm definitely sold on yet; but, it is an approach that I'll be experimenting with.
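As a small experiment, here is how I might generate those names and titles instead of writing them by hand. This Python sketch is my own interpretation: the dot-separated namespace format and both function names are assumptions, not something the book prescribes.

```python
def alarm_name(*components: str) -> str:
    """Build a namespace-like alarm name from the system hierarchy,
    e.g. ("Website", "EU", "Requests", "Throttled")."""
    return ".".join(c.strip().lower().replace(" ", "_") for c in components)


def ticket_title(metric: str, statistic: str, threshold: float,
                 unit: str, data_points: int) -> str:
    """Describe the exact threshold breach rather than a vague symptom."""
    return (f"{metric} {statistic} exceeded {threshold} {unit} "
            f"for {data_points} data points")


print(alarm_name("Website", "EU", "Requests", "Throttled"))
print(ticket_title("Response times", "p99", 3, "seconds", 3))
```

The first call prints "website.eu.requests.throttled" and the second reproduces the title from Ligus' example, so the name and the title both fall out of the monitor's configuration with no creativity required.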
When I sat down to write this review, I had the thought that I didn't get as much out of this book as I would have liked. But now, having gone through my notes and attempting to overlay the recommendations on my current work situation, I think I actually received a lot more value than I initially realized. I'm definitely walking away with some concrete action items; and, I'm also leaving with a sense of inspiration about what a powerful and automated monitoring system could look like. For a relatively quick read, Ligus certainly packs a lot of information into each page.
In his preface, Ligus writes:
I would like this book to be a tribute to all these invisible ops guys who struggle daily to maintain the highest standards of service availability.
I would just like to state, emphatically, that at InVision, these ops guys are anything but invisible. They are the rock stars that keep this train on the tracks, moving in the right direction. Without them, I can't even imagine how we'd function. Or, if we'd even be in business at this point.
If you haven't yet seen it, you may be interested in the Site Reliability Workbook, a companion to the Site Reliability Engineering (SRE) book.
Along the same lines, Ligus recommends that we shouldn't be too quick to alert on recoverable events:
.... recoverable events should not trigger alerts until it becomes apparent that their recovery would take too long or could evolve into more serious problems.
The Workbook addresses this in Chapter 5: Alerting on SLOs, which outlines 6 different approaches to alerting in increasing sophistication.
The advised method, which can be challenging to implement, is to alert based on the burn rate of an Error Budget. The Error Budget is the difference between 100% and the SLO target; in other words, it is the acceptable error rate. By alerting on how quickly that budget is being consumed, the speed with which the alert fires can be proportional to the impact of the problem.
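As a rough illustration of the arithmetic, here is a minimal Python sketch. The function names and the 30-day (720-hour) SLO period are assumptions made for the example; the Workbook chapter covers the full multi-window approach, which this does not attempt.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of events allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """How fast the budget is being spent: 1.0 means the budget would last
    exactly the whole SLO period at the current error rate."""
    if total == 0:
        return 0.0
    return (errors / total) / error_budget(slo_target)


def budget_consumed(rate: float, window_hours: float,
                    period_hours: float = 720.0) -> float:
    """Fraction of the period's budget spent if `rate` is sustained for
    `window_hours`. The 720-hour default assumes a 30-day SLO period."""
    return rate * window_hours / period_hours


# Hypothetical hour of traffic against a 99.9% SLO.
rate = burn_rate(errors=144, total=10_000, slo_target=0.999)
print(round(rate, 1))                        # 14.4
print(round(budget_consumed(rate, 1.0), 3))  # 0.02
```

A 1.44% error rate against a 99.9% target is a burn rate of about 14.4, which spends roughly 2% of a 30-day budget in a single hour. A high burn rate can page quickly, while a slow burn can wait for a longer window before alerting.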