My notes from the DevOps Handbook

by Gene Kim, Jez Humble, Patrick Debois, John Willis

56. Schedule blameless post-mortem meetings after accidents

We schedule the post-mortem as soon as possible after the accident occurs, before memories and the links between cause and effect fade or circumstances change. In the blameless post-mortem meeting, we will do the following:

Construct a timeline and gather details from multiple perspectives on failures, ensuring we don't punish people for making mistakes
Empower all engineers to improve safety by allowing them to give detailed accounts of their contributions to failures
Enable and encourage people who do make mistakes to be the experts who educate the rest of the organization on how not to make them in the future
Accept that there is always a discretionary space where humans can decide to take action or not, and that the judgment of those decisions lies in hindsight
Propose countermeasures to prevent a similar accident from happening in the future and ensure these countermeasures are recorded with a target date and an owner for follow-up

Stakeholders at the meeting:

people involved in decisions that may have contributed to the problem
people who identified the problem
people who diagnosed the problem
people who were affected by the problem
anyone else who is interested in attending

Record our best understanding of the timeline of relevant events as they occurred. This includes all actions we took and at what time, what effects we observed (metrics from our production telemetry), all investigation paths we followed, and what resolutions were considered.
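One way to keep such a timeline is as a list of timestamped entries sorted chronologically. The sketch below is only an illustration: the `TimelineEntry` fields, the `kind` labels, and the sample events are assumptions made for this note, not something the book prescribes.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    """One event in the incident timeline (hypothetical structure)."""
    at: datetime   # when it happened
    kind: str      # "action", "observation", "investigation" or "resolution_considered"
    who: str       # person who acted or observed
    detail: str    # what was done or seen, e.g. a telemetry reading


def build_timeline(entries):
    """Return the entries in chronological order.

    sorted() is stable, so entries with the same timestamp keep
    the order in which they were reported.
    """
    return sorted(entries, key=lambda e: e.at)


# Made-up sample events, collected from different people in arbitrary order.
entries = [
    TimelineEntry(datetime(2024, 1, 1, 10, 5), "observation", "alice",
                  "error rate rises on the production dashboard"),
    TimelineEntry(datetime(2024, 1, 1, 10, 0), "action", "bob",
                  "deployed new build"),
    TimelineEntry(datetime(2024, 1, 1, 10, 20), "resolution_considered", "alice",
                  "roll back the deploy"),
]

timeline = build_timeline(entries)
for e in timeline:
    print(f"{e.at:%H:%M} [{e.kind}] {e.who}: {e.detail}")
```

Keeping the `kind` of each entry separate makes it easy to see, after the fact, which investigation paths and resolutions were considered and not just which actions were taken.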

During the meeting, disallow the phrases "would have" and "could have", since they are counterfactual statements about events that have already occurred.

Reserve enough time for brainstorming and deciding which countermeasures to implement. Create a timeline for implementation.

It is not acceptable for a countermeasure to be merely "be more careful" or "be less stupid". Instead, design real countermeasures that prevent these errors from happening again.

Examples of such countermeasures include adding new automated tests to detect dangerous conditions in our deployment pipeline, adding further production telemetry, identifying categories of changes that require additional peer review, and conducting rehearsals of this category of failure as part of the team's scheduled Game Day exercises.
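The requirement that every countermeasure has an owner and a target date, and is more than "be more careful", can be sketched as a small record type. This is a hypothetical illustration: the `Countermeasure` class, the `VAGUE_PHRASES` list, the `overdue` helper, and the sample items are all assumptions made for this note.

```python
from dataclasses import dataclass
from datetime import date

# Phrases the notes reject as non-countermeasures (assumed, incomplete list).
VAGUE_PHRASES = ("be more careful", "be less stupid")


@dataclass
class Countermeasure:
    """A follow-up item from a post-mortem (hypothetical structure)."""
    description: str
    owner: str
    target_date: date
    done: bool = False

    def __post_init__(self):
        text = self.description.strip().lower()
        if any(phrase in text for phrase in VAGUE_PHRASES):
            raise ValueError(f"not a real countermeasure: {self.description!r}")
        if not self.owner.strip():
            raise ValueError("every countermeasure needs an owner")


def overdue(items, today):
    """Return open countermeasures whose target date has passed."""
    return [c for c in items if not c.done and c.target_date < today]


# Made-up sample items.
items = [
    Countermeasure("Add pipeline test for missing config keys", "alice", date(2024, 2, 1)),
    Countermeasure("Add telemetry for queue depth", "bob", date(2024, 3, 15)),
    Countermeasure("Require peer review for schema changes", "carol", date(2024, 1, 20), done=True),
]

late = overdue(items, today=date(2024, 2, 10))
print([c.description for c in late])
```

Checking the list against the current date at a regular interval is one simple way to make sure follow-up actually happens.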

Publish our post-mortems as widely as possible

The meeting notes should be placed in a centralized location where our entire organization can access them and learn from the incident.