My notes from the DevOps Handbook

by Gene Kim, Jez Humble, Patrick Debois, John Willis

57. Decrease incident tolerances to find ever-weaker failure signals

Lower the threshold of what counts as a problem so that we keep learning. In other words, amplify weak failure signals.

The need to amplify weak failure signals is critical to averting catastrophic failures.

Our work in the technology value stream should be approached as a fundamentally experimental endeavor and managed that way. All work we do is a potentially important hypothesis and a source of data, rather than a routine application and validation of past practice.

Redefine failure and encourage calculated risk-taking

We need leaders to continually reinforce that everyone should feel both comfortable with and responsible for surfacing and learning from failures. High-performing DevOps organizations will fail and make mistakes more often. Not only is this okay, it's what organizations need. If high performers are deploying thirty times more frequently but with only half the change failure rate, they're having more failures in absolute terms.
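The arithmetic behind that last claim can be sketched as follows. The function name and the baseline of 100 deployments are made up for illustration; only the "thirty times" and "half" multipliers come from the notes above.

```python
def failure_multiplier(deploy_multiplier: float, failure_rate_multiplier: float) -> float:
    """Return how many times more failures occur in absolute terms.

    Absolute failures = deployments * change failure rate, so the two
    multipliers simply multiply together.
    """
    return deploy_multiplier * failure_rate_multiplier


# Thirty times the deployments at half the change failure rate.
multiplier = failure_multiplier(30, 0.5)
print(multiplier)

# Hypothetical baseline: if a lower performer sees 10 failed changes out of
# 100 deployments, the high performer would see this many out of 3,000.
baseline_failures = 10
print(baseline_failures * multiplier)
```

So a lower failure rate per change and a higher absolute number of failures go together once deployment frequency rises enough.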

Inject production failures to enable resilience and learning

Injecting faults into the production environment is one way we can increase our resilience. Regularly perform tests to make certain that our systems fail gracefully.

Resilience requires that we first define our failure modes and then test that these failure modes operate as designed. Rehearse large-scale failures so we are confident we can recover from accidents when they occur.
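A minimal sketch of the idea of defining a failure mode and then testing that it behaves as designed. Everything here is hypothetical (the FaultInjector class, the fetch_recommendations dependency, the render_homepage caller), and it runs in-process rather than against a real production system. It only illustrates the shape of such a test.

```python
import random


class InjectedFault(Exception):
    """Raised by the injector in place of a real dependency failure."""


class FaultInjector:
    """Wraps calls to a dependency and fails a configurable share of them."""

    def __init__(self, failure_rate, rng=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0.0 and 1.0")
        self.failure_rate = failure_rate
        self.rng = rng or random.Random()

    def call(self, func, *args, **kwargs):
        # random() returns a value in [0.0, 1.0), so a rate of 1.0 always
        # fails and a rate of 0.0 never fails.
        if self.rng.random() < self.failure_rate:
            raise InjectedFault(f"injected fault before calling {func.__name__}")
        return func(*args, **kwargs)


def fetch_recommendations(user_id):
    """Stand-in for a call to a downstream service (hypothetical)."""
    return [f"item-{user_id}-1", f"item-{user_id}-2"]


def render_homepage(user_id, injector):
    """Designed failure mode: show the page without recommendations."""
    try:
        recs = injector.call(fetch_recommendations, user_id)
        return {"status": "ok", "recommendations": recs}
    except InjectedFault:
        return {"status": "degraded", "recommendations": []}


# Exercise both the healthy path and the designed failure mode.
print(render_homepage(7, FaultInjector(failure_rate=0.0)))
print(render_homepage(7, FaultInjector(failure_rate=1.0)))
```

Running the exercise with a failure rate between 0 and 1 on a schedule would be closer to the regular rehearsal the notes describe. The fixed rates here just keep the example deterministic.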

Repeated and regular failure exercising, even in the persistence (database) layer, should be part of every company's resilience planning.

Architectural patterns: fail fast (setting aggressive timeouts), fallbacks (designing each feature to degrade or fall back to a lower-quality representation), and feature removal (removing non-critical features when they run slowly).
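The three patterns can be sketched together in a few lines. This is an illustrative sketch under assumptions, not a production design: the names (call_fail_fast, with_fallback, FeatureSwitch, slow_reviews, render_product_page) and the timeout and budget values are invented for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)


def call_fail_fast(func, timeout_s):
    """Fail fast: stop waiting once the aggressive timeout expires."""
    future = _executor.submit(func)
    # result() raises a timeout error if func has not finished in time.
    # The worker thread itself is not cancelled; we just stop waiting.
    return future.result(timeout=timeout_s)


def with_fallback(func, timeout_s, fallback_value):
    """Fallback: return a lower-quality result when func is slow or fails."""
    try:
        return call_fail_fast(func, timeout_s)
    except Exception:
        return fallback_value


class FeatureSwitch:
    """Feature removal: turn a non-critical feature off when it runs slowly."""

    def __init__(self, latency_budget_s):
        self.latency_budget_s = latency_budget_s
        self.enabled = True

    def record_latency(self, seconds):
        if seconds > self.latency_budget_s:
            self.enabled = False


def slow_reviews():
    """Stand-in for a non-critical dependency that is running slowly."""
    time.sleep(0.2)
    return ["full review list"]


def render_product_page(reviews_switch):
    page = {"title": "Widget"}
    if reviews_switch.enabled:
        start = time.monotonic()
        page["reviews"] = with_fallback(
            slow_reviews, 0.05, ["reviews temporarily unavailable"]
        )
        reviews_switch.record_latency(time.monotonic() - start)
    return page


switch = FeatureSwitch(latency_budget_s=0.01)
print(render_product_page(switch))  # reviews fall back to the placeholder
print(render_product_page(switch))  # reviews feature has been removed
```

A real system would likely re-enable the feature after a cool-down and base the decision on more than one latency sample. The single-sample switch keeps the sketch short.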