Monitoring & Observability Roadmap
Complete observability roadmap covering Prometheus, Grafana, Loki, OpenTelemetry, distributed tracing, alerting, and SLO-driven reliability practices.
Observability Foundations
The 3 pillars: metrics, logs, and traces
What to learn
- Metrics — what they measure and when to use them
- Logs — structured vs unstructured, log levels, cardinality
- Traces — distributed request tracking across services
- Events — discrete occurrences vs continuous telemetry
- The difference between monitoring (known unknowns) and observability (unknown unknowns)
- RED method — Rate, Errors, Duration for services
- USE method — Utilization, Saturation, Errors for resources
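As a rough illustration of the RED method listed above, the sketch below summarises one window of requests as Rate, Errors, and Duration. The `Request` record shape, the "status >= 500 counts as an error" rule, and the nearest-rank p95 are assumptions made for this example, not part of any tool's API.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One handled request (hypothetical record shape for this sketch)."""
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Summarise a window of requests as Rate, Errors, Duration."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_duration_s": 0.0}
    durations = sorted(r.duration_s for r in requests)
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    # Nearest-rank p95 using integer arithmetic: ceil(0.95 * n).
    rank = (95 * len(durations) + 99) // 100
    return {
        "rate_per_s": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "p95_duration_s": durations[rank - 1],
    }
```

In practice these three numbers usually come from PromQL over a request counter and a latency histogram; the function only shows what each letter measures.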
Prometheus
The de facto metrics standard for cloud-native systems
What to learn
- Prometheus architecture — scraping, TSDB, querying
- Metric types — counter, gauge, histogram, summary
- PromQL — selectors, functions, aggregations, recording rules
- Instrumentation — exposing metrics from your apps
- Service discovery — Kubernetes SD, file-based SD
- Alertmanager — routing, inhibition, silences, receivers
- Prometheus Operator on Kubernetes — ServiceMonitor, PrometheusRule
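To make the counter metric type more concrete, here is a simplified sketch of how an increase can be computed over counter samples when the process restarts mid-window. It follows the reset-handling idea behind PromQL's `rate()` and `increase()`, but it deliberately omits the extrapolation to window boundaries that Prometheus performs, so treat it as a teaching aid rather than a reimplementation. The function names are hypothetical.

```python
def counter_increase(samples: list[float]) -> float:
    """Total increase across time-ordered counter samples, tolerating resets."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # A drop means the counter restarted from zero (e.g. a pod restart),
            # so the whole current value counts as new increase.
            total += curr
    return total


def per_second_rate(samples: list[float], window_s: float) -> float:
    """Average per-second rate over the window (no boundary extrapolation)."""
    return counter_increase(samples) / window_s


# Counter scraped five times; the process restarted before the fourth scrape.
scrapes = [100.0, 130.0, 160.0, 20.0, 50.0]
print(counter_increase(scrapes))
```

This is also why a gauge is the wrong type for "total requests": a gauge carries no reset semantics, so a restart would look like a large negative change.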
Grafana Dashboards
Visualize everything — make data tell a story
What to learn
- Panel types — time series, stat, gauge, table, heatmap
- Variables and templating — dynamic dashboards
- Annotations — mark deployments on graphs
- Alerting in Grafana — unified alerting: alert rules, contact points, notification policies
- Dashboard as code — Grafonnet / JSON provisioning
- Grafana provisioning — auto-load datasources and dashboards on start
- Community dashboards — Node Exporter Full, K8s cluster overview
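The "dashboard as code" idea can be sketched without Grafonnet by generating dashboard JSON from a script and letting file provisioning load it. The field names below follow the Grafana dashboard JSON model as commonly exported (`panels`, `targets`, `gridPos`, `templating.list`); the metric `http_requests_total`, the `job` variable, and the helper names are assumptions for illustration, and the output should be validated against your Grafana version.

```python
import json


def timeseries_panel(panel_id: int, title: str, expr: str, y: int) -> dict:
    """One full-width time series panel backed by a single PromQL query."""
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": y},
        "targets": [{"refId": "A", "expr": expr}],
    }


def build_dashboard(title: str, panels: list[dict]) -> dict:
    """Dashboard with a `job` template variable so one definition serves many services."""
    return {
        "title": title,
        "templating": {
            "list": [
                {"name": "job", "type": "query", "query": "label_values(up, job)"}
            ]
        },
        "panels": panels,
    }


dashboard = build_dashboard(
    "Service overview",
    [
        timeseries_panel(
            1, "Request rate", 'sum(rate(http_requests_total{job="$job"}[5m]))', 0
        )
    ],
)
print(json.dumps(dashboard, indent=2))
```

Keeping the generator in version control means dashboard changes go through review like any other code change.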
Log Aggregation with Loki
Prometheus-inspired logging — only labels are indexed, not log content
What to learn
- Loki architecture — Promtail, Loki, Grafana
- Label strategy — choose labels wisely to avoid cardinality explosion
- LogQL — log stream selectors, filter expressions, metric queries
- Structured logging — JSON logs for easier filtering
- Promtail configuration — pipeline stages for log parsing
- Loki on Kubernetes — Helm deployment, persistent storage
- Log correlation — jump from metric spike to relevant logs
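For the structured logging and label strategy points above, a minimal sketch using only the Python standard library is shown here. It emits one JSON object per line so that fields can be extracted at query time (for example with LogQL's `| json` parser) instead of being promoted to labels. The `fields` key passed through `extra`, and names such as `order_id` and `trace_id`, are conventions invented for this example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # High-cardinality values (order IDs, trace IDs) stay in the log body,
        # not in Loki labels, to avoid a label cardinality explosion.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def make_logger(name: str, stream) -> logging.Logger:
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = make_logger("checkout", buffer)
log.info("payment failed", extra={"fields": {"order_id": "o-123", "trace_id": "abc123"}})
print(buffer.getvalue(), end="")
```

Including a trace ID in each line is what later makes the jump from a log entry to the matching trace possible.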
Distributed Tracing
Follow a request across every microservice
What to learn
- Trace anatomy — spans, trace IDs, baggage, parent-child relationships
- Instrumentation — auto-instrumentation vs manual spans
- Jaeger — local tracing backend, UI, sampling strategies
- Grafana Tempo — trace storage backed by object storage, queried from Grafana
- Trace sampling — head-based vs tail-based sampling
- Finding bottlenecks — waterfall view, service maps
- Correlating traces with metrics and logs in Grafana
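The difference between head-based and tail-based sampling is easier to see in code. The sketch below models a span with the fields named above (trace ID, span ID, parent) and two toy samplers: one decides from the trace ID alone when the trace starts, the other waits for the finished trace and keeps it only if the root span was slow. The `Span` shape, the hashing scheme, and the "slow root span" policy are assumptions for illustration, not how any particular backend implements sampling.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str              # unique within the trace
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int


def head_sample(trace_id: str, ratio: float) -> bool:
    """Head-based: decide at trace start, deterministically from the trace ID."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < ratio


def tail_sample(spans: list[Span], slow_ms: int) -> bool:
    """Tail-based: decide after completion; keep traces whose root span was slow."""
    root = next(s for s in spans if s.parent_id is None)
    return (root.end_ms - root.start_ms) >= slow_ms


trace = [
    Span("t1", "a", None, "GET /checkout", 0, 900),
    Span("t1", "b", "a", "db.query", 100, 800),
]
print(head_sample("t1", 0.1), tail_sample(trace, slow_ms=500))
```

Head-based sampling is cheap because every service can reach the same decision independently, while tail-based sampling can keep the interesting traces at the cost of buffering whole traces first.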
OpenTelemetry
The vendor-neutral standard for all observability signals
What to learn
- OTel architecture — SDK, Collector, exporters
- Auto-instrumentation — zero-code setup for common frameworks
- OTel Collector — receive, process, and export all 3 signals
- Processor pipelines — filtering, sampling, batching
- Kubernetes operator — auto-inject instrumentation into pods
- Semantic conventions — standard attribute naming
- Migrating from Jaeger / Zipkin to OTel
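The receive, process, export flow of a Collector pipeline can be pictured with a toy model. This is not the OpenTelemetry Collector API or its configuration format (real pipelines are declared in the Collector's YAML config); it is a plain-Python sketch of a filter stage followed by a batch stage. The record shape and function names are hypothetical, and `http.route` is used as an example attribute name in the style of the semantic conventions.

```python
from typing import Callable, Iterable, Iterator

Record = dict  # one telemetry item; hypothetical shape for this sketch


def drop_health_checks(records: Iterable[Record]) -> Iterator[Record]:
    """Filter stage: discard noisy health-check traffic before export."""
    for record in records:
        if record.get("http.route") != "/healthz":
            yield record


def batch(records: Iterable[Record], size: int) -> Iterator[list[Record]]:
    """Batch stage: group records so the exporter makes fewer, larger calls."""
    buffer: list[Record] = []
    for record in records:
        buffer.append(record)
        if len(buffer) == size:
            yield buffer
            buffer = []
    if buffer:
        yield buffer  # flush the final partial batch


def run_pipeline(
    received: Iterable[Record],
    batch_size: int,
    export: Callable[[list[Record]], None],
) -> None:
    """Wire the stages together: receive -> filter -> batch -> export."""
    for chunk in batch(drop_health_checks(received), batch_size):
        export(chunk)


exported: list[list[Record]] = []
incoming = [{"http.route": "/pay"}, {"http.route": "/healthz"}, {"http.route": "/cart"}]
run_pipeline(incoming, batch_size=2, export=exported.append)
print(exported)
```

Ordering matters in the same way it does in a real pipeline: filtering before batching avoids spending export bandwidth on data that will be dropped anyway.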
SLOs & Alerting Strategy
Alert on what matters — silence the noise
What to learn
- SLI/SLO/SLA — define meaningful reliability targets
- Error budget math — 99.9% availability allows about 8.76 h of downtime per year
- Multi-window burn rate alerts — fast and slow burn detection
- Alert fatigue — reduce noise with severity levels and grouping
- On-call hygiene — escalation policies, runbook links in alerts
- Sloth / OpenSLO — SLO as code tools
- DORA metrics — embed in observability dashboards
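The error budget and burn rate arithmetic above can be worked through in a few lines. This is a sketch under stated assumptions: availability is treated as a simple time-based ratio, and the default threshold of 14.4 is the commonly cited example for a 1 h / 5 min window pair (it corresponds to spending 2% of a 30-day budget in one hour). The function names are invented for this example.

```python
def error_budget_hours(slo: float, window_days: int) -> float:
    """Allowed downtime in hours for an availability SLO over the window."""
    return (1 - slo) * window_days * 24


def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return error_ratio / (1 - slo)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo: float,
    threshold: float = 14.4,
) -> bool:
    """Multi-window check: page only if both windows burn above the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, which lets the alert reset quickly after recovery.
    """
    return (
        burn_rate(long_window_error_ratio, slo) >= threshold
        and burn_rate(short_window_error_ratio, slo) >= threshold
    )


print(round(error_budget_hours(0.999, 365), 2))      # about 8.76 hours per year
print(round(error_budget_hours(0.999, 30) * 60, 1))  # about 43.2 minutes per 30 days
print(should_page(0.02, 0.02, slo=0.999))
```

A slow-burn alert uses the same function with longer windows and a lower threshold, typically routed to a ticket rather than a page.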
Production Observability at Scale
Manage observability costs and scale gracefully
What to learn
- Thanos or Cortex — HA Prometheus with long-term storage
- VictoriaMetrics — Prometheus-compatible alternative; the vendor claims markedly better compression, so benchmark on your own data
- Observability cost control — reduce cardinality, tune sampling rates
- Exemplars — link metrics directly to traces in Grafana
- AI-assisted analysis — LLM-powered log summarization
- Continuous profiling — Pyroscope for production profiling
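Since cardinality is the main cost driver mentioned above, a back-of-the-envelope estimate is worth having. The sketch below computes the worst-case number of time series for one metric as the product of distinct values per label. The label names and counts are made up for illustration, and the real number is usually lower because not every combination occurs.

```python
import math


def max_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of label cardinalities."""
    return math.prod(label_value_counts.values())


# Hypothetical label sets for a request counter.
baseline = {"method": 5, "status": 10, "pod": 50}
with_user_id = {**baseline, "user_id": 10_000}

print(max_series(baseline))      # 2500
print(max_series(with_user_id))  # 25000000
```

One unbounded label such as a user ID multiplies every existing series, which is why such values belong in logs or traces rather than metric labels.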