My notes from the DevOps Handbook

by Gene Kim, Jez Humble, Patrick Debois, John Willis

Dev shares pager rotation duties with ops

Left unfixed, production defects can cause recurring problems and suffering for ops engineers downstream.

Even if the problem results in a defect being assigned to the feature team, it may be prioritized below the delivery of new features. This is an example of how upstream work centers can locally optimize for themselves but actually degrade performance for the entire value stream.

To prevent this from happening, we will have everyone in the value stream share the downstream responsibilities of handling operational incidents. We can do this by putting developers, development managers, and architects on pager rotation.
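A shared rotation like this can be sketched in a few lines. This is only an illustration, not something from the book: the roster names, the one-week shift length, and the `build_rotation` helper are all assumptions made for the example.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def build_rotation(engineers, start, weeks):
    """Assign one primary on-call person per week, round-robin.

    Returns a list of (week_start_date, person) tuples.
    """
    people = islice(cycle(engineers), weeks)
    return [(start + timedelta(weeks=i), person) for i, person in enumerate(people)]


# Hypothetical roster: ops, developers, a development manager, and an
# architect all take turns carrying the pager.
roster = ["ops-alice", "dev-bob", "dev-manager-carol", "architect-dave"]
schedule = build_rotation(roster, date(2024, 1, 1), 6)

for week_start, person in schedule:
    print(week_start.isoformat(), person)
```

The point of the round-robin is that nobody upstream is exempt: with four people on the roster, each one is back on call every fourth week.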

We found that when we woke up developers at 2 a.m., defects were fixed faster than ever.

Business goals are not achieved simply because features have been marked as "done". Instead, the feature is only done when it is performing as designed in production, without causing excessive escalations or unplanned work for either dev or ops. ITIL defines warranty as when a service can run in production reliably without intervention for a predefined period of time. This definition of warranty should ideally be integrated into our collective definition of done.
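One way to picture warranty as part of the definition of done is a simple check on how long a feature has run without intervention. The sketch below is my own assumption of how this could be modelled: the 14-day default, the `is_done` name, and the rule that the clock restarts after each intervention are illustrative choices, not something ITIL or the book prescribes.

```python
from datetime import datetime, timedelta


def is_done(released_at, interventions, now, warranty=timedelta(days=14)):
    """Return True once the feature has run for a full warranty period
    with no manual intervention.

    Assumption: each intervention restarts the warranty clock.
    """
    # Only interventions between the release and now are relevant.
    past = [t for t in interventions if released_at <= t <= now]
    clock_start = max(past, default=released_at)
    return now - clock_start >= warranty


released = datetime(2024, 1, 1)
hotfix = datetime(2024, 1, 10)  # hypothetical 2 a.m. page that needed a manual fix

print(is_done(released, [], datetime(2024, 1, 20)))        # clean run past 14 days
print(is_done(released, [hotfix], datetime(2024, 1, 20)))  # clock restarted on Jan 10
print(is_done(released, [hotfix], datetime(2024, 1, 25)))  # 15 clean days since the fix
```

Under these assumptions a feature marked "done" in the tracker on release day would still report as not done here until the warranty period has elapsed cleanly.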

It's becoming less and less common for companies to have dedicated on-call teams. Instead, everyone who touches production code and environments is expected to be reachable in the event of downtime.