Create enough telemetry at all levels of the application stack for all our environments as well as for the deployment pipeline.
See the health of everything that our service relies upon.
Detect security-relevant events by monitoring faults.
After every production incident, identify missing telemetry that could have enabled faster detection and recovery.
Ensure we generate enough telemetry not only around application health but also around the extent to which we achieve organizational goals (new users, user logins, session lengths, active users).
By radiating how customers interact with what we build in the context of our goals, we give feature teams fast feedback, so they can see whether the capabilities we are building are actually being used and to what extent they achieve business goals.
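As a rough illustration of the business-level telemetry described above, here is a minimal sketch of a recorder for new users, logins, session lengths, and active users. The `BusinessTelemetry` class and its method names are hypothetical, and it keeps everything in memory; a real setup would presumably forward these values to whatever metrics backend the team uses.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class BusinessTelemetry:
    """Hypothetical in-memory recorder for goal-related metrics (illustration only)."""

    counters: Counter = field(default_factory=Counter)
    session_seconds: list = field(default_factory=list)
    active_users: set = field(default_factory=set)

    def record_signup(self, user_id: str) -> None:
        # A signup counts as a new user and makes the user active.
        self.counters["new_users"] += 1
        self.active_users.add(user_id)

    def record_login(self, user_id: str) -> None:
        self.counters["logins"] += 1
        self.active_users.add(user_id)

    def record_session(self, user_id: str, seconds: float) -> None:
        # Store each session length so an average can be reported later.
        self.session_seconds.append(seconds)
        self.active_users.add(user_id)

    def summary(self) -> dict:
        n = len(self.session_seconds)
        return {
            "new_users": self.counters["new_users"],
            "logins": self.counters["logins"],
            "active_users": len(self.active_users),
            "avg_session_seconds": sum(self.session_seconds) / n if n else 0.0,
        }


if __name__ == "__main__":
    telemetry = BusinessTelemetry()
    telemetry.record_signup("alice")
    telemetry.record_login("alice")
    telemetry.record_session("alice", 120.0)
    print(telemetry.summary())
```

A summary like this could be shown on a dashboard next to application health metrics, so feature teams see usage and health in one place.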
Generate enough telemetry for production and non-production environments so that, when a problem occurs, we can see whether the infrastructure is contributing to it.
Infrastructure telemetry should be visible to all technology stakeholders.
Metrics allow us to detect when things go wrong.
It's much better to measure both Dev and Ops against the real business consequences of downtime: how much revenue we should have attained but didn't. This could be the cost of downtime and the cost associated with a late feature (cost of delay).
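To make the two costs concrete, here is a small worked sketch. It assumes a simple linear model (lost revenue is a constant rate multiplied by duration), which is a simplification chosen for illustration; the function names and the figures in the example are made up.

```python
def cost_of_downtime(revenue_per_hour: float, downtime_hours: float) -> float:
    """Revenue we should have attained but didn't while the service was down.

    Assumes revenue accrues at a constant hourly rate (a simplification).
    """
    return revenue_per_hour * downtime_hours


def cost_of_delay(expected_revenue_per_week: float, weeks_late: float) -> float:
    """Revenue forgone because a feature shipped later than planned.

    Assumes the feature would have earned a constant weekly amount.
    """
    return expected_revenue_per_week * weeks_late


if __name__ == "__main__":
    # Illustrative numbers only: a 1.5-hour outage at 12,000 per hour,
    # and a feature worth 5,000 per week that shipped 3 weeks late.
    print(cost_of_downtime(12_000, 1.5))
    print(cost_of_delay(5_000, 3))
```

Expressing both figures in the same unit (lost revenue) is what lets Dev and Ops be measured against the same business consequence.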
We also need telemetry for test environments so that we can find and fix issues before they go into production.
We want to overlay other operational activities, such as maintenance or backups, onto our telemetry, so that we can see where we may want to suppress alerts.
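One way to picture this overlay is as a list of time windows that is consulted before an alert fires. The sketch below is a minimal illustration under that assumption; `OperationWindow`, `active_operations`, and `should_alert` are hypothetical names, not part of any particular monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class OperationWindow:
    """A planned operational activity (hypothetical type for illustration)."""

    kind: str  # e.g. "maintenance" or "backup"
    start: datetime
    end: datetime

    def covers(self, moment: datetime) -> bool:
        # Start is inclusive, end is exclusive.
        return self.start <= moment < self.end


def active_operations(windows, moment):
    """Return the kinds of activity in progress at the given moment."""
    return [w.kind for w in windows if w.covers(moment)]


def should_alert(windows, moment):
    """Suppress alerts while any planned activity is in progress."""
    return not active_operations(windows, moment)


if __name__ == "__main__":
    windows = [
        OperationWindow("backup", datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 3, 0)),
    ]
    print(active_operations(windows, datetime(2024, 1, 1, 2, 30)))
    print(should_alert(windows, datetime(2024, 1, 1, 4, 0)))
```

The same window list could also be drawn as shaded regions on metric graphs, so that a dip during a backup is read in context rather than as an incident.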