My notes from the DevOps Handbook

by Gene Kim, Jez Humble, Patrick Debois, John Willis

56. Schedule blameless post-mortem meetings after accidents

We schedule the post-mortem as soon as possible after the accident occurs, before memories and the links between cause and effect fade or circumstances change. In the blameless post-mortem meeting, we will do the following:

Stakeholders at the meeting:

Record our best understanding of the timeline of relevant events as they occurred. This includes all actions we took and at what time, what effects we observed (metrics from our production telemetry), all investigation paths we followed, and what resolutions were considered.
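One way to keep such a timeline is as a small structured record rather than free-form notes. The following is my own minimal sketch, not something from the book: the names `Timeline`, `TimelineEntry`, and `EntryKind` are hypothetical, and the sample entries are made up for illustration. The four kinds of entry mirror the four things listed above.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class EntryKind(Enum):
    """The four kinds of event the timeline should capture."""
    ACTION = "action taken"
    OBSERVATION = "effect observed"
    INVESTIGATION = "investigation path"
    RESOLUTION = "resolution considered"


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime        # when it happened
    kind: EntryKind     # which of the four categories it belongs to
    description: str    # what happened, stated as fact


@dataclass
class Timeline:
    entries: list[TimelineEntry] = field(default_factory=list)

    def record(self, at: datetime, kind: EntryKind, description: str) -> None:
        """Add an entry; entries may arrive in any order during the meeting."""
        self.entries.append(TimelineEntry(at, kind, description))

    def in_order(self) -> list[TimelineEntry]:
        """Return entries sorted by the time they occurred."""
        return sorted(self.entries, key=lambda e: e.at)

    def render(self) -> str:
        """Format the timeline as plain text, one line per entry."""
        return "\n".join(
            f"{e.at:%H:%M} [{e.kind.value}] {e.description}"
            for e in self.in_order()
        )


# Made-up example entries, recorded out of order on purpose.
timeline = Timeline()
timeline.record(datetime(2024, 1, 1, 14, 20), EntryKind.ACTION,
                "Rolled back the latest release")
timeline.record(datetime(2024, 1, 1, 14, 5), EntryKind.OBSERVATION,
                "Error rate rose on the dashboard")
print(timeline.render())
```

Entries describe what was done or seen, not who should have done something differently, which keeps the record in line with the blameless framing.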

During the meeting, disallow the phrases "would have" and "could have". They are counterfactual statements: they describe what did not happen rather than what did.

Reserve enough time for brainstorming and deciding which countermeasures to implement. Create a timeline for implementing them.

It is not acceptable for a countermeasure to be merely "be more careful" or "be less stupid". Design real countermeasures that prevent these errors from happening again.

Examples of such countermeasures include adding new automated tests to detect dangerous conditions in our deployment pipeline, adding further production telemetry, identifying categories of changes that require additional peer review, and conducting rehearsals of this category of failure as part of the team's scheduled Game Day exercises.
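To make the "categories of changes that require additional peer review" idea concrete, here is a rough sketch of a check that could run in a deployment pipeline. It is my own illustration, not from the book: the path patterns, the category names, the approval threshold, and the function names `high_risk_categories` and `review_gate` are all assumptions that a real team would replace with its own.

```python
from fnmatch import fnmatchcase

# Hypothetical high-risk categories, keyed by name, with a path pattern each.
# A real team would derive these from its own post-mortems.
HIGH_RISK_PATTERNS = {
    "database migration": "db/migrations/*",
    "infrastructure config": "infra/*",
}

# Assumed policy: high-risk changes need this many reviewer approvals.
REQUIRED_APPROVALS = 2


def high_risk_categories(changed_paths):
    """Return the sorted names of high-risk categories touched by a change."""
    return sorted({
        name
        for name, pattern in HIGH_RISK_PATTERNS.items()
        for path in changed_paths
        if fnmatchcase(path, pattern)
    })


def review_gate(changed_paths, approvals):
    """Return a list of problems; an empty list means the change may proceed."""
    categories = high_risk_categories(changed_paths)
    if categories and approvals < REQUIRED_APPROVALS:
        return [
            f"{name}: needs {REQUIRED_APPROVALS} approvals, has {approvals}"
            for name in categories
        ]
    return []


# Made-up example: a migration with only one approval is held back.
for problem in review_gate(["db/migrations/001_add_index.sql"], approvals=1):
    print(problem)
```

Unlike "be more careful", a gate like this does not rely on anyone remembering the lesson: the pipeline enforces it on every change.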

Publish our post-mortems as widely as possible

The meeting notes should be placed in a centralized location where our entire organization can access them and learn from the incident.