My notes from the DevOps Handbook

by Gene Kim, Jez Humble, Patrick Debois, John Willis

Enable Feedback to Safely Deploy Code

We need to respond to market conditions within minutes. This means developers have to be able to make code changes quickly and get them into production as soon as possible; otherwise we will lose to faster competitors. Having separate groups for testing and even deployment is simply too slow. Integrate everyone into one group with shared responsibilities and goals. The biggest challenge is getting developers to overcome their fear of deploying their own code.

Dev often complains about Ops being afraid to deploy code. But when given the power to deploy their own code, developers can become just as afraid to perform deployments. Increase their confidence by providing faster and more frequent feedback and by reducing the batch size of their work.

When a team wants to improve the outcomes of its deployments, it should get more peer reviews, and everyone should help each other write better automated tests so that errors are found before deployment. Because everyone now knows that the smaller our production changes, the fewer problems we will have, developers start checking ever smaller increments of code into the deployment pipeline more frequently, making sure each change is working in production before moving on to the next one.

It's not enough to merely automate the deployment process - we must also integrate the monitoring of production telemetry into our deployment work, as well as establish the cultural norms that everyone is equally responsible for the health of the entire value stream.

Use telemetry to make deployments safer

Ensure we are actively monitoring production. Never consider our code deployment or production change to be done until it is operating as designed in the production environment.

Actively monitor the metrics associated with our feature during our deployment. If our change breaks or impairs any functionality, we quickly work to restore service, bringing in whoever else is required to diagnose and fix the issue.
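
A minimal sketch of what "watching the feature's metrics during deployment" could look like in code. This is my own illustration, not something from the book: the function name, the choice of error rate as the metric, and the 0.01 tolerance are all assumptions. It compares error-rate samples taken before the deploy with samples taken after it and flags the deploy when the average has clearly gotten worse.

```python
from statistics import mean


def deployment_looks_unhealthy(baseline_error_rates, post_deploy_error_rates,
                               tolerance=0.01):
    """Return True when the mean error rate after the deploy exceeds the
    pre-deploy baseline by more than `tolerance`.

    All names and the default tolerance are hypothetical; a real check would
    read these samples from whatever telemetry system is in use.
    """
    if not baseline_error_rates or not post_deploy_error_rates:
        raise ValueError("need at least one sample before and after the deploy")
    increase = mean(post_deploy_error_rates) - mean(baseline_error_rates)
    return increase > tolerance


# Example: error rate jumps from about 1.5% to about 11% after the change.
if deployment_looks_unhealthy([0.01, 0.02], [0.10, 0.12]):
    print("Error rate regressed: restore service first, then diagnose.")
```

In practice the decision would probably look at several metrics (latency, failed logins, completed orders) and not only one error rate, but the shape is the same: compare against a baseline and act quickly when the change makes things worse.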

Optimize for mean time to resolve, instead of mean time between failures.
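
A small worked example of why recovery time can matter more than failure frequency. The numbers are made up for illustration, and the helper name `summarize` is my own. It uses the standard definitions: MTBF (mean time between failures) is uptime divided by the number of failures, MTTR here is downtime divided by the number of failures, and availability is MTBF / (MTBF + MTTR).

```python
def summarize(period_minutes, outage_minutes):
    """Return (mtbf, mttr, availability) for one observation period.

    `outage_minutes` is a list with the duration of each outage in minutes.
    Hypothetical helper for illustration only.
    """
    if not outage_minutes:
        raise ValueError("need at least one outage to compute MTBF and MTTR")
    failures = len(outage_minutes)
    downtime = sum(outage_minutes)
    uptime = period_minutes - downtime
    mtbf = uptime / failures
    mttr = downtime / failures
    availability = mtbf / (mtbf + mttr)
    return mtbf, mttr, availability


THIRTY_DAYS = 30 * 24 * 60  # 43200 minutes

# Team A: fails rarely (once a month) but takes 4 hours to recover.
team_a = summarize(THIRTY_DAYS, [240])
# Team B: fails six times a month but recovers in 5 minutes each time.
team_b = summarize(THIRTY_DAYS, [5] * 6)

print(f"Team A: MTBF={team_a[0]:.0f} min, MTTR={team_a[1]:.0f} min, "
      f"availability={team_a[2]:.4%}")
print(f"Team B: MTBF={team_b[0]:.0f} min, MTTR={team_b[1]:.0f} min, "
      f"availability={team_b[2]:.4%}")
```

With these assumed numbers, Team B fails six times as often yet ends the month with higher availability (about 99.93% versus about 99.44%), because each failure costs so little time.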