My notes from the DevOps Handbook

by Gene Kim, Jez Humble, Patrick Debois, John Willis

Create telemetry to enable seeing and solving problems

To enable disciplined problem-solving behavior, we need to design our systems so that they continually create telemetry, broadly defined as an automated communications process by which measurements and other data are collected at remote points and subsequently transmitted to receiving equipment for monitoring.

Our goal is to create telemetry within our applications and environments, in both production and preproduction environments, as well as in our deployment pipeline.

Centralized telemetry infrastructure

In order for us to see all problems as they occur, we must design and develop our applications and environments so that they generate sufficient telemetry, allowing us to understand how our system is behaving as a whole. When all levels of our application stack have monitoring and logging, we enable other important capabilities, such as graphing and visualizing our metrics, anomaly detection, and proactive alerting and escalation.

Components of such an architecture

data collection at the business logic, application, and environment layers
events, logs, metrics
logs sent to a common service that enables easy centralization, rotation and deletion
we gather metrics at all layers of the application stack to better understand how our system is behaving
at the OS level: CPU, memory, disk, network
event router responsible for storing events and metrics
enables visualization, trending, alerting, anomaly detection
collecting, storing, aggregating telemetry -> further analysis and health checks
threshold based health checks
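
The components above can be sketched in a few lines of Python. This is only an illustrative, in-memory sketch, not something taken from the book: the names Metric, EventRouter, ThresholdCheck, and collect_os_metrics are hypothetical, as are the metric names "disk.used_percent" and "cpu.load_1m". A real setup would ship data to a dedicated telemetry service instead of keeping it in a dictionary.

```python
import os
import shutil
import time
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Metric:
    """One measurement taken at some layer of the stack (hypothetical type)."""
    name: str
    value: float
    timestamp: float = field(default_factory=time.time)


class EventRouter:
    """Stores every metric it receives and forwards it to subscribers.

    Assumption: an in-memory dict stands in for the real central store.
    """

    def __init__(self):
        self.store = defaultdict(list)
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, metric):
        # Store first, so the data is available for later analysis,
        # then hand it to health checks and other consumers.
        self.store[metric.name].append(metric)
        for callback in self.subscribers:
            callback(metric)


class ThresholdCheck:
    """Threshold-based health check: records an alert above a limit."""

    def __init__(self, name, limit):
        self.name = name
        self.limit = limit
        self.alerts = []

    def __call__(self, metric):
        if metric.name == self.name and metric.value > self.limit:
            self.alerts.append(metric)


def collect_os_metrics(path="."):
    """Collect a couple of OS-level metrics using only the standard library."""
    usage = shutil.disk_usage(path)
    used_percent = 100.0 * usage.used / usage.total if usage.total else 0.0
    metrics = [Metric("disk.used_percent", used_percent)]
    try:
        # Load average is not available on every platform (e.g. Windows).
        metrics.append(Metric("cpu.load_1m", os.getloadavg()[0]))
    except (AttributeError, OSError):
        pass
    return metrics


if __name__ == "__main__":
    router = EventRouter()
    disk_check = ThresholdCheck("disk.used_percent", limit=90.0)
    router.subscribe(disk_check)
    for m in collect_os_metrics():
        router.publish(m)
    print("stored metrics:", sorted(router.store))
    print("alerts raised:", len(disk_check.alerts))
```

The router stores every metric before forwarding it, so the same data can feed visualization and trending as well as the health checks; the 90 percent limit is an arbitrary example value.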