My notes from The DevOps Handbook

by Gene Kim, Jez Humble, Patrick Debois, John Willis

Find and fill telemetry gaps

Create enough telemetry at all levels of the application stack for all our environments as well as for the deployment pipeline.

Metrics from the following levels

See the health of everything that our service relies upon.

Detect security-relevant events by monitoring for faults.

After every production incident, identify missing telemetry that could have enabled faster detection and recovery.

Application and business metrics

Ensure we generate enough telemetry not only around application health but also around the extent to which we achieve organizational goals (e.g., new users, user logins, session lengths, active users).

By radiating how customers interact with what we build in the context of our goals, we give feature teams fast feedback, so they can see whether the capabilities we are building are actually being used and to what extent they achieve business goals.
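As an illustration of emitting business metrics next to application code, here is a minimal sketch. The `MetricsRegistry` class, the `handle_login` function, and the metric names are hypothetical and not from the book; a real service would forward these values to whatever metrics backend it already uses rather than keep them in memory.

```python
from collections import Counter


class MetricsRegistry:
    """Hypothetical in-memory registry standing in for a real metrics backend."""

    def __init__(self):
        self.counters = Counter()
        self.session_lengths = []

    def increment(self, name, amount=1):
        # Count an occurrence of a named business or application event.
        self.counters[name] += amount

    def record_session(self, seconds):
        # Keep raw session lengths so averages can be derived later.
        self.session_lengths.append(seconds)

    def average_session_length(self):
        if not self.session_lengths:
            return 0.0
        return sum(self.session_lengths) / len(self.session_lengths)


metrics = MetricsRegistry()


def handle_login(user_id, is_new_user):
    # Business telemetry is emitted alongside the application logic itself.
    metrics.increment("user.login")
    if is_new_user:
        metrics.increment("user.signup")


handle_login("alice", is_new_user=True)
handle_login("bob", is_new_user=False)
metrics.record_session(120.0)
metrics.record_session(300.0)
```

Because the counters live in the same code path as the feature, a feature team can see whether a capability is being used without a separate reporting step.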

Infrastructure metrics

Generate enough telemetry for production and non-production environments so that, when a problem occurs, we can see whether the infrastructure is contributing to it.

Infrastructure telemetry should be visible to all technology stakeholders.
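One way to make infrastructure telemetry comparable across production and non-production environments is to tag every sample with the environment it came from. The sketch below is only an assumption about how that could look: `collect_infrastructure_sample` and the field names are invented for illustration, and it reads only values available from the Python standard library.

```python
import os
import shutil
import time


def collect_infrastructure_sample(environment, path="."):
    """Return one infrastructure sample tagged with its environment.

    The function name and field names are hypothetical.
    """
    usage = shutil.disk_usage(path)
    return {
        "environment": environment,  # e.g. "production", "staging", "test"
        "timestamp": time.time(),
        "disk_used_fraction": usage.used / usage.total,
        "cpu_count": os.cpu_count(),
    }


sample = collect_infrastructure_sample("staging")
```

With the same fields emitted everywhere, a dashboard can filter or group by `environment` instead of needing a separate collector per environment.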

Metrics allow us to detect when things go wrong.

It's much better to measure both Dev and Ops against the real business consequences of downtime: how much revenue we should have attained but didn't. This could be the cost of downtime or the cost associated with a late feature (cost of delay).
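The arithmetic behind these two costs can be sketched as follows. This is a deliberately simple model that assumes a constant expected revenue rate; the function names and all figures are invented for illustration and are not from the book.

```python
def cost_of_downtime(expected_revenue_per_hour, downtime_hours):
    """Revenue we should have attained but didn't while the service was down."""
    return expected_revenue_per_hour * downtime_hours


def cost_of_delay(expected_revenue_per_week, weeks_late):
    """Revenue forgone because a feature shipped later than planned."""
    return expected_revenue_per_week * weeks_late


# Hypothetical figures: a 90-minute outage on a service expected to earn
# 12,000 per hour, and a feature worth 5,000 per week shipped 3 weeks late.
outage_cost = cost_of_downtime(12_000, 1.5)
delay_cost = cost_of_delay(5_000, 3)
```

Expressing both in the same unit (lost revenue) is what lets Dev and Ops be measured against the same business consequence.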

We also need telemetry for test environments so that we can find and fix issues before they go into production.

We also want to overlay other operational activities onto our metrics, such as maintenance and backups, as well as periods when we want to suppress alerts.
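A minimal sketch of how operational activities could be recorded and used both as an overlay and for alert suppression. The `OperationWindow` type and the two helper functions are hypothetical names chosen for this example, not an existing library API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class OperationWindow:
    """A planned operational activity (hypothetical type for illustration)."""
    name: str
    start: datetime
    end: datetime
    suppress_alerts: bool = True

    def covers(self, moment):
        # The window is half-open: the start is included, the end is excluded.
        return self.start <= moment < self.end


def active_operations(moment, windows):
    """Names of the activities to overlay on a graph at this moment."""
    return [w.name for w in windows if w.covers(moment)]


def should_alert(moment, windows):
    """Alert unless a window that suppresses alerts covers this moment."""
    return not any(w.suppress_alerts and w.covers(moment) for w in windows)


windows = [
    OperationWindow("nightly backup",
                    datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 3, 0)),
    OperationWindow("deployment",
                    datetime(2024, 1, 1, 4, 0), datetime(2024, 1, 1, 4, 30),
                    suppress_alerts=False),
]
```

Keeping `suppress_alerts` separate from the window itself means an activity such as a deployment can still appear on the overlay while alerts stay active.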