Create tools that allow us to discover variances and ever weaker failure signals hidden in our production telemetry so we can avert catastrophic failures.
Outlier detection finds abnormal running conditions from which significant performance degradation may result. First compute the current normal, then identify which nodes do not fit that pattern.
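The two steps above (compute the current normal, then find the nodes that do not fit it) can be sketched as follows. This is a minimal illustration, not a method prescribed by these notes: it assumes "normal" is the fleet median and measures spread with the median absolute deviation (MAD), a swapped-in robust technique. The node names, the latency values, the function name find_outlier_nodes, and the 3.5 cutoff are all hypothetical.

```python
from statistics import median

def find_outlier_nodes(value_by_node, cutoff=3.5):
    """Return the nodes whose metric value sits far from the fleet median."""
    values = list(value_by_node.values())
    # Step 1: compute the current normal (assumed here to be the median).
    center = median(values)
    # Median absolute deviation: a spread measure that a single bad node
    # cannot inflate the way it inflates a standard deviation.
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # More than half the nodes report the same value; this sketch
        # simply declines to flag anything in that case.
        return []
    # Step 2: identify the nodes that do not fit the pattern.
    outliers = []
    for node, value in value_by_node.items():
        # 0.6745 rescales MAD so the score is roughly comparable to a
        # z-score when the data is normally distributed.
        score = 0.6745 * (value - center) / mad
        if abs(score) > cutoff:
            outliers.append(node)
    return sorted(outliers)

# Hypothetical per-node request latencies in milliseconds.
fleet = {
    "web-01": 102.0, "web-02": 98.0, "web-03": 105.0,
    "web-04": 99.0, "web-05": 240.0, "web-06": 101.0,
}
print(find_outlier_nodes(fleet))  # ['web-05']
```

A median-based baseline is used here because the outlier we are hunting would otherwise pull the mean toward itself; a mean-and-standard-deviation version would follow the same two steps.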
One of the simplest statistical techniques that we can use to analyze a production metric is computing its mean and standard deviation.
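With these two values we can detect when a metric differs significantly from its norm and alert on it: for example, notify on-call staff at 2 a.m. to investigate when database queries are significantly slower than average.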
Alert fatigue is the biggest problem: staff who are paged too often start to ignore alerts. We need to be more intelligent about which deviations trigger an alert.
This simple type of statistical analysis is valuable, because no one had to define a static threshold value, something which is infeasible if we are tracking hundreds of thousands of production metrics.
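The mean-and-standard-deviation check can be sketched in a few lines. This is only an illustration under stated assumptions: the function name is_anomalous, the three-standard-deviation cutoff, and the sample query times are hypothetical, and the check is two-sided (it flags values that are unusually low as well as unusually high).

```python
from statistics import mean, stdev

def is_anomalous(history, current, num_stdevs=3.0):
    """Return True if current is more than num_stdevs standard deviations
    away from the mean of history (needs at least two past samples)."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation
    if sigma == 0:
        # A perfectly flat history: any change at all is a deviation.
        return current != mu
    return abs(current - mu) > num_stdevs * sigma

# Hypothetical recent database query times in milliseconds.
query_times_ms = [120, 130, 125, 118, 122, 127, 131, 119, 124, 126]

print(is_anomalous(query_times_ms, 128))  # False: within normal variation
print(is_anomalous(query_times_ms, 400))  # True: worth paging someone
```

No threshold such as "alert above 200 ms" appears anywhere: the cutoff is derived from each metric's own history, which is what makes the approach workable across a very large number of metrics.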
Analyze our most severe incidents in the recent past and create a list of telemetry that could have enabled earlier and faster detection and diagnosis of the problem, as well as easier and faster confirmation that an effective fix had been implemented.
Look at the leading indicators that could have warned us earlier that we were starting to deviate from standard operations:
Each of these metrics is a potential precursor to a production incident. For each one, we would configure our alerting systems to notify us when it deviates sufficiently from its mean, so that we can take action.
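One way to apply the same deviation rule to many metrics at once is to keep a running mean and variance per metric instead of storing its history. The sketch below uses Welford's online algorithm for that purpose; this is an assumption made for illustration, not something these notes specify. The class names, the metric name page_load_ms, the warm-up length, and the three-standard-deviation cutoff are all hypothetical.

```python
import math
from collections import defaultdict

class RunningStats:
    """Running mean and variance via Welford's online algorithm."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared differences from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stdev(self):
        # Sample standard deviation; undefined for fewer than two samples.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

class DeviationAlerter:
    """Tracks one baseline per metric name and flags large deviations."""
    def __init__(self, num_stdevs=3.0, min_samples=5):
        self.num_stdevs = num_stdevs
        self.min_samples = min_samples  # warm-up before any alert fires
        self.stats = defaultdict(RunningStats)

    def observe(self, metric, value):
        """Record value; return True if it deviates enough to alert."""
        s = self.stats[metric]
        alert = False
        if s.n >= self.min_samples:
            sigma = s.stdev()
            alert = sigma > 0 and abs(value - s.mean) > self.num_stdevs * sigma
        # The value is folded into the baseline even when it alerts;
        # a real system might choose to exclude it.
        s.update(value)
        return alert

alerter = DeviationAlerter()
for sample in [200, 210, 205, 198, 202, 207]:
    alerter.observe("page_load_ms", sample)

print(alerter.observe("page_load_ms", 204))  # False
print(alerter.observe("page_load_ms", 900))  # True
```

Each metric costs three numbers of state regardless of how long it has been observed, so adding another leading indicator is just a matter of calling observe with a new name.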
By repeating this process on ever weaker failure signals, we find problems ever earlier in the life cycle, resulting in fewer customer-impacting incidents and near misses. In other words, we are preventing problems as well as enabling quicker detection and correction.