My notes from the DevOps Handbook

by Gene Kim, Jez Humble, Patrick Debois, John Willis

Use telemetry to guide problem solving

Telemetry enables us to use the scientific method to formulate hypotheses about what is causing a particular problem and what is required to solve it.

Enable creation of production metrics as part of daily work

Enable everyone to create metrics as part of their daily work, and make those metrics easy to create, display, and analyze. Build the infrastructure and libraries that make it easy for anyone in Dev or Ops to create telemetry. Ideally, creating a new metric should take one line of code, and the metric should show up in a common dashboard where everyone in the value stream can see it.

StatsD can generate timers and counters with one line of code.
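To make the "one line of code" idea concrete, here is a minimal sketch of a StatsD-style client using only the Python standard library. It relies on the plain-text StatsD datagram format (`name:value|c` for counters, `name:value|ms` for timers) sent over UDP. The helper names (`format_metric`, `incr`, `timing`, `timer`), the metric names, and the daemon address are assumptions for illustration and are not taken from the book; a real project would more likely use an existing StatsD client library.

```python
import socket
import time
from contextlib import contextmanager

# Assumption: a StatsD daemon listening locally on the conventional UDP port.
STATSD_ADDR = ("127.0.0.1", 8125)
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


def format_metric(name, value, metric_type):
    """Build a StatsD datagram of the form '<name>:<value>|<type>'."""
    return f"{name}:{value}|{metric_type}"


def _send(payload):
    # UDP is fire-and-forget; swallow socket errors so telemetry
    # never breaks the code path being measured.
    try:
        _sock.sendto(payload.encode("ascii"), STATSD_ADDR)
    except OSError:
        pass


def incr(name, count=1):
    """Increment a counter."""
    _send(format_metric(name, count, "c"))


def timing(name, millis):
    """Record a duration in milliseconds."""
    _send(format_metric(name, millis, "ms"))


@contextmanager
def timer(name):
    """Time the enclosed block and report it as a timer metric."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timing(name, int((time.perf_counter() - start) * 1000))


# Each of these is a single line at the call site (hypothetical metric names).
incr("login.success")
with timer("checkout.render"):
    pass  # the work being timed goes here
```

With helpers like these in a shared library, adding a new metric costs a developer one line, which is the property the notes above are after.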

When we graph our telemetry, we also overlay production changes (such as deployments) onto the graphs, so we can see when a change coincides with a shift in a metric.
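One way to picture the overlay is as a join between a metric's time series and a list of change events. The sketch below attaches each change to the first datapoint at or after it, so a graphing layer could draw a marker there. The function name `overlay_changes`, the tuple shapes, and the sample data are all hypothetical; graphing tools typically offer their own annotation or event features for this.

```python
from bisect import bisect_left


def overlay_changes(points, changes):
    """Attach each change label to the first datapoint at or after it.

    points:  list of (timestamp, value), sorted by unique timestamp
    changes: list of (timestamp, label), e.g. deployments
    Returns a list of (timestamp, value, [labels]).
    """
    times = [t for t, _ in points]
    marks = {t: [] for t in times}
    for change_time, label in changes:
        i = bisect_left(times, change_time)  # first datapoint >= change_time
        if i < len(times):  # changes after the last datapoint are not shown
            marks[times[i]].append(label)
    return [(t, v, marks[t]) for t, v in points]


# Hypothetical data: error count per minute, with one deployment at t=90s.
points = [(0, 10), (60, 12), (120, 40), (180, 41)]
changes = [(90, "deploy v2")]

for t, value, labels in overlay_changes(points, changes):
    print(t, value, ", ".join(labels))
```

In this made-up series the jump in the metric lands on the same datapoint as the deployment marker, which is the kind of correlation the overlay is meant to make visible.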

By generating production telemetry as part of our daily work, we steadily improve our ability not only to see problems as they occur, but also to design our work so that problems in design and operations are revealed. This lets us track an increasing number of metrics, as we saw in the Etsy case study.

Create self-service access to telemetry and information radiators

Ensure that anyone who wants information about any of the services running in production can get it without production system access or a privileged account, and without having to open a ticket and wait several days for a graph.

Make production telemetry highly visible.

Information radiator - the generic term for any handwritten or electronic display that a team places in a highly visible location, so that all team members, as well as passersby, can see the latest information at a glance, such as the count of automated tests, velocity, incidents, and CI status.

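Placing information radiators in highly visible places demonstrates the following values: the team has nothing to hide from its visitors (customers, stakeholders), and the team has nothing to hide from itself, because it acknowledges and confronts problems.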

We may also choose to broadcast this information to our internal customers or even our external customers via a publicly viewable status page.

Creating a simple dashboard should be part of creating any new product or service. Automated tests should confirm that both the service and the dashboard are working correctly, which helps both our customers and our ability to deploy safely.
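A rough sketch of such a check, assuming both the service and its dashboard expose an HTTP endpoint that returns 200 when healthy. The function names and any URLs passed in are hypothetical; the fetch function is injectable so the check can be exercised without a network.

```python
import urllib.request


def http_status(url, timeout=5):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except OSError:
        # urllib's URLError and HTTPError are OSError subclasses, so this
        # covers connection failures and non-success HTTP responses.
        return None


def smoke_check(endpoints, fetch=http_status):
    """Return the names of endpoints that did not answer with HTTP 200.

    endpoints: dict mapping a name (e.g. "service", "dashboard") to a URL.
    """
    return [name for name, url in endpoints.items() if fetch(url) != 200]
```

A deployment pipeline could call `smoke_check({"service": ..., "dashboard": ...})` with the real URLs after each deploy and fail the run if the returned list is not empty, so a broken dashboard is treated as seriously as a broken service.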