To enable disciplined problem-solving behavior, we need to design our systems so that they are continually creating telemetry, broadly defined as an automated communications process by which measurements and other data are collected at remote points and subsequently transmitted to receiving equipment for monitoring.
Our goal is to create telemetry within our applications and environments, in production and preproduction as well as in our deployment pipeline.
To see all problems as they occur, we must design and develop our applications and environments so that they generate sufficient telemetry, allowing us to understand how our system is behaving as a whole. When all levels of our application stack have monitoring and logging, we enable other important capabilities, such as graphing and visualizing our metrics, anomaly detection, and proactive alerting and escalation.
at the OS level: CPU, memory, disk, network
event router responsible for storing events and metrics
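To make the two pieces named above concrete, here is a minimal sketch in Python of an OS-level measurement being sent to an event router that stores events and metrics. Everything in it is an assumption made for illustration: the `EventRouter` class, the `collect_disk_metrics` helper, and the metric name `os.disk.used_fraction` are hypothetical, and the in-memory storage stands in for whatever persistent store a real monitoring system would use.

```python
import shutil
import time
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EventRouter:
    """Hypothetical in-memory router that stores events and metric samples."""

    events: list = field(default_factory=list)
    metrics: dict = field(default_factory=lambda: defaultdict(list))

    def record_event(self, source: str, message: str) -> None:
        # An event is a timestamped occurrence, e.g. a deployment step.
        self.events.append(
            {"ts": time.time(), "source": source, "message": message}
        )

    def record_metric(self, name: str, value: float) -> None:
        # A metric is a timestamped numeric sample stored under its name.
        self.metrics[name].append((time.time(), value))

    def latest(self, name: str):
        # Return the most recent value for a metric, or None if never seen.
        samples = self.metrics.get(name)
        return samples[-1][1] if samples else None


def collect_disk_metrics(router: EventRouter, path: str = ".") -> None:
    """Sample disk usage for `path` and send it to the router."""
    usage = shutil.disk_usage(path)
    fraction = usage.used / usage.total if usage.total else 0.0
    router.record_metric("os.disk.used_fraction", fraction)


router = EventRouter()
router.record_event("deploy-pipeline", "build started")
collect_disk_metrics(router)
```

Only disk usage is sampled here because the Python standard library exposes it portably; CPU, memory, and network figures would typically come from an agent or a platform-specific library and flow through the same `record_metric` call.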