Begin with questions, then collect signals

Before installing dashboards, write the questions an on-call engineer must answer: Is the service available? Are requests slower than normal? Is a container restarting? Did an external dependency fail? Useful observability reduces the time from alert to explanation.

Three signal types cover most investigations. Metrics identify changes and trends. Logs provide event detail. Traces connect latency and errors across service boundaries. For a small platform, strong metrics and structured logs are a sensible first milestone.

Measure service health and resource pressure

Prometheus-style metrics and Grafana dashboards work well for container services because they separate application outcomes from infrastructure pressure. Application dashboards should show request volume, success rate, error rate and latency percentiles. Infrastructure dashboards should expose CPU, memory, disk, network and restart activity.

# Useful dashboard groups
service: request_rate, error_rate, p95_latency
runtime: container_cpu, memory_working_set, restarts
dependency: database_connections, queue_depth, api_timeouts

A CPU alert alone says only that a machine is busy. Rising p95 latency paired with CPU and error data says users may be affected.
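To make the service group concrete, here is a minimal sketch of how request_rate, error_rate and p95_latency could be derived from raw request samples. The RequestSample type, the summarize function and the choice to count only 5xx responses as errors are assumptions for illustration. In practice a metrics system such as Prometheus computes these values from counters and histograms rather than from raw samples.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    """One observed request (hypothetical shape, for illustration only)."""
    duration_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile needs at least one sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(samples, window_seconds):
    """Summarize one time window into the three service-level signals."""
    total = len(samples)
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for s in samples if s.status >= 500)
    return {
        "request_rate": total / window_seconds,
        "error_rate": errors / total if total else 0.0,
        "p95_latency": percentile([s.duration_ms for s in samples], 95) if total else None,
    }
```

Calling summarize on the samples from a 60-second window returns a dictionary with the three values, which can then be read next to the runtime and dependency groups.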

Make logs searchable, not decorative

Use structured JSON logs with consistent keys such as service, environment, request_id, level, duration_ms and event. When every service emits the same keys, Loki or a similar log store can filter the event stream quickly, and no service has to invent its own format.
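As one possible shape for such logs, the sketch below uses Python's standard logging module with a small custom formatter. The JsonFormatter and build_logger names are hypothetical, and the key set simply mirrors the keys listed above. Per-request values are passed through the standard extra argument.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of keys."""

    def __init__(self, service, environment):
        super().__init__()
        self.service = service
        self.environment = environment

    def format(self, record):
        payload = {
            "service": self.service,
            "environment": self.environment,
            "level": record.levelname.lower(),
            "event": record.getMessage(),
            # Values passed via `extra=` become attributes on the record.
            "request_id": getattr(record, "request_id", None),
            "duration_ms": getattr(record, "duration_ms", None),
        }
        return json.dumps(payload)


def build_logger(service, environment, stream=sys.stdout):
    """Return a logger that writes one JSON line per event to `stream`."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep the root logger from re-emitting the event
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service, environment))
    logger.addHandler(handler)
    return logger
```

A call such as build_logger("checkout", "staging").info("checkout_completed", extra={"request_id": "req-123", "duration_ms": 42}) writes a single JSON line that a log store can filter by any of the keys.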

  • Never log credentials or access tokens, and log personal data only when there is a strict requirement and a protection plan.
  • Log expected validation problems at an appropriate level rather than treating them as incidents.
  • Propagate a correlation ID across HTTP calls and asynchronous tasks.
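The correlation ID point can be sketched with Python's contextvars module. The header name X-Request-ID and all function names here are assumptions for illustration, not a standard this article prescribes. The idea is to reuse an incoming ID when one exists, attach it to outgoing HTTP calls, and carry it inside asynchronous task messages.

```python
import contextvars
import uuid

# Assumed header name; use whatever your services already agree on.
# Note: this plain dict lookup is case-sensitive, unlike real HTTP headers.
HEADER = "X-Request-ID"

_request_id = contextvars.ContextVar("request_id", default=None)


def start_request(incoming_headers):
    """Reuse the caller's ID if present, otherwise generate a new one."""
    request_id = incoming_headers.get(HEADER) or uuid.uuid4().hex
    _request_id.set(request_id)
    return request_id


def outgoing_headers():
    """Headers to attach to downstream HTTP calls."""
    request_id = _request_id.get()
    return {HEADER: request_id} if request_id else {}


def enqueue_task(payload):
    """Wrap a task payload so the worker can restore the same ID."""
    return {"payload": payload, "request_id": _request_id.get()}


def run_task(message):
    """Worker side: restore the ID before doing any work or logging."""
    _request_id.set(message["request_id"])
    return _request_id.get()
```

With the ID stored in a context variable, a log formatter can read it for every event without each call site passing it explicitly.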

Alert on outcomes with runbooks

An alert should call for action: sustained elevated errors, user-facing latency, failed scheduled processing or resource exhaustion likely to cause impact. Alerts that trigger on harmless short spikes train engineers to ignore the system.
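The difference between a sustained problem and a short spike can be illustrated with a small evaluator that fires only after the threshold has been exceeded for several consecutive samples. This is a simplified sketch with a hypothetical SustainedAlert class. Prometheus alerting rules express a similar idea with the for clause, which is the better choice when that stack is already in place.

```python
from collections import deque


class SustainedAlert:
    """Fire only when the last `required_samples` values all exceed `threshold`."""

    def __init__(self, threshold, required_samples):
        self.threshold = threshold
        self.required_samples = required_samples
        # Keep only the most recent breach/no-breach results.
        self._recent = deque(maxlen=required_samples)

    def observe(self, value):
        """Record one sample and return True if the alert should fire."""
        self._recent.append(value > self.threshold)
        window_full = len(self._recent) == self.required_samples
        return window_full and all(self._recent)
```

With a threshold of 0.05 and three required samples, a single error-rate reading of 0.20 does not fire the alert, while three consecutive readings above 0.05 do.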

Incident-ready observability
  • Dashboards describe both application and container health.
  • Alerts include a dashboard link and first diagnostic steps.
  • Logs can be searched by request ID or deployment version.
  • After incidents, new signals are added only when they improve diagnosis.
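
One way to keep the second checklist item honest is to validate alert definitions before they ship. The sketch below is an assumption about how a team might model this. AlertDefinition, its fields and missing_runbook_fields are hypothetical names, not part of any alerting tool.

```python
from dataclasses import dataclass, field


@dataclass
class AlertDefinition:
    """Hypothetical alert record; the fields mirror the checklist above."""
    name: str
    summary: str
    dashboard_url: str = ""
    first_steps: list = field(default_factory=list)


def missing_runbook_fields(alert):
    """Return the names of runbook fields the alert still lacks."""
    missing = []
    if not alert.dashboard_url:
        missing.append("dashboard_url")
    if not alert.first_steps:
        missing.append("first_steps")
    return missing
```

A check like this could run in review or CI so that an alert without a dashboard link or first diagnostic steps is caught before it pages anyone.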

The goal is not the largest monitoring stack. It is a platform that tells the truth quickly enough for an engineer to act confidently.