Site Reliability Engineer Roadmap
Complete SRE roadmap covering observability, incident response, SLOs, error budgets, chaos engineering, and reliability practices from the Google SRE playbook.
Linux & Systems Internals
SREs debug live production systems — know Linux cold
What to learn
- Process management — ps, top, kill, systemd, journalctl
- Memory, CPU, and I/O analysis — vmstat, iostat, sar, free
- Network debugging — ss, netstat, tcpdump, curl, dig
- Log analysis — grep, awk, journalctl, log rotation
- File system — inodes, disk usage, permissions, lsof
- Performance profiling — perf, strace, ltrace
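As a small illustration of the memory-analysis item above, the sketch below parses text in the format of /proc/meminfo (the kernel file that tools such as free read) and computes a used-memory percentage. The function names and the SAMPLE values are made up for the example. Treating "used" as MemTotal minus MemAvailable is one common convention, not the only one.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field_name: integer_value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain
    counts. This sketch keeps the number and ignores the unit.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def memory_used_percent(fields: dict) -> float:
    """Used memory as a percentage, assuming used = MemTotal - MemAvailable."""
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


# Hypothetical sample so the example runs on any OS.
SAMPLE = """\
MemTotal:       16384000 kB
MemFree:         1024000 kB
MemAvailable:    4096000 kB
"""

print(f"{memory_used_percent(parse_meminfo(SAMPLE)):.1f}% used")
```

On a Linux host, the same functions can be fed the contents of /proc/meminfo instead of SAMPLE.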
Observability — Metrics, Logs, Traces
If you can't measure it, you can't improve it
What to learn
- The 3 pillars — metrics, structured logs, distributed traces
- Prometheus — scraping, metric types (counter, gauge, histogram, summary)
- Grafana — dashboards, alerts, variables, templating
- Loki — log aggregation, LogQL, label strategies
- OpenTelemetry — instrumentation, collectors, exporters
- Distributed tracing with Jaeger or Tempo — trace IDs
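To make the metric types above concrete, here is a standalone sketch of how a counter, a gauge, and a histogram behave. It is an illustration in plain Python, not the Prometheus client library. The class names and bucket bounds are assumptions chosen for the example. The one behavior taken from Prometheus is that histogram buckets are cumulative: each observation counts toward every bucket whose upper bound is at least the observed value, including a final +Inf bucket.

```python
import math


class Counter:
    """Monotonically increasing value, e.g. total requests served."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount


class Gauge:
    """Value that can go up or down, e.g. current queue depth."""

    def __init__(self):
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Observations counted into cumulative buckets, plus a sum and a count."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds) + [math.inf]  # last bucket is +Inf
        self.bucket_counts = [0] * len(self.bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        self.total += value
        self.count += 1
        for i, upper in enumerate(self.bounds):
            if value <= upper:
                self.bucket_counts[i] += 1  # cumulative: every bucket >= value


latency = Histogram([0.1, 0.5, 1.0])  # hypothetical bounds, in seconds
for seconds in (0.05, 0.2, 0.2, 1.5):
    latency.observe(seconds)
print(list(zip(latency.bounds, latency.bucket_counts)))
```

Because buckets are cumulative, the +Inf bucket always equals the total observation count, which is what makes quantile estimation from bucket counts possible.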
Kubernetes Operations
Many SRE teams run on Kubernetes — own it
What to learn
- Pod lifecycle, restarts, OOMKilled, CrashLoopBackOff debugging
- Resource requests, limits, QoS classes (Guaranteed, Burstable, BestEffort)
- HPA, VPA, Cluster Autoscaler, Karpenter
- RBAC, NetworkPolicies, PodSecurityAdmission
- Kubernetes events, audit logs, and debugging toolkit
- Draining nodes, PodDisruptionBudgets, rolling updates
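The QoS classes listed above follow from how requests and limits are set. The sketch below is a simplified model of that rule, under stated assumptions: containers are plain dicts (a hypothetical shape, not a Kubernetes client object), only cpu and memory are considered, init containers are ignored, and quantities are compared as strings, so "500m" and "0.5" would be treated as different even though Kubernetes considers them equal.

```python
RESOURCES = ("cpu", "memory")


def qos_class(containers) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort.

    `containers` is a list of dicts with optional "requests" and "limits"
    dicts, e.g. {"requests": {"cpu": "100m"}, "limits": {"cpu": "200m"}}.
    """
    any_set = False
    all_guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # When a limit is set without a request, the request defaults to the limit.
        requests = {**limits, **container.get("requests", {})}
        for resource in RESOURCES:
            if resource in requests or resource in limits:
                any_set = True
            if resource not in limits or requests.get(resource) != limits[resource]:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"  # no container sets any cpu/memory request or limit
    return "Guaranteed" if all_guaranteed else "Burstable"


print(qos_class([{"requests": {"cpu": "100m"}}]))
```

The class matters operationally because it influences which pods are evicted first under node memory pressure, with BestEffort pods the most exposed.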
Incident Response & On-Call
When prod breaks at 3am, you own it
What to learn
- Incident lifecycle — detect, triage, mitigate, resolve, review
- Incident command structure — IC, comms lead, scribe roles
- Runbooks — writing actionable, tested remediation guides
- Blameless postmortems — 5 whys, timeline reconstruction
- On-call rotation design — sustainable, low burnout
- Alert quality — actionable alerts only, suppress noise
SLOs, Error Budgets & Reliability
The mathematical core of the SRE model
What to learn
- SLIs — latency, availability, throughput, error rate
- SLOs — setting realistic targets (99.9% vs 99.99% math)
- SLAs — contractual obligations vs internal targets
- Error budgets — balancing feature velocity and reliability
- Error budget policies — what triggers freeze or rollback
- DORA metrics — deploy frequency, lead time for changes, MTTR, change failure rate
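The "99.9% vs 99.99% math" above comes down to simple arithmetic: the error budget is the fraction of the window (or of requests) that the SLO allows to fail. The sketch below works through both views. The function names, the 30-day window, and the request counts are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime an availability SLO permits over the window."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100, exclusive")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent (negative if overspent)."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100, exclusive")
    allowed_failures = total_requests * (100 - slo_percent) / 100
    return 1 - failed_requests / allowed_failures


for slo in (99.9, 99.99):
    print(f"{slo}% over 30 days allows about {allowed_downtime_minutes(slo):.2f} minutes of downtime")

# Hypothetical month: 1,000,000 requests, 250 failures, 99.9% SLO.
print(f"budget remaining: {budget_remaining(99.9, 1_000_000, 250):.0%}")
```

Each extra nine cuts the budget by a factor of ten: roughly 43 minutes per 30 days at 99.9% versus roughly 4 minutes at 99.99%. That is why the target should reflect what users actually need rather than what sounds impressive.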
Automation & Toil Reduction
If humans repeat it, automate it
What to learn
- Defining toil — manual, repetitive, automatable, tactical, devoid of enduring value, and scaling linearly with service growth
- Runbook automation — runbooks as code, triggered automatically
- Self-healing systems — auto-remediation on alert trigger
- CI/CD for infra — Terraform + GitOps for all changes
- KEDA for event-driven autoscaling
- AI-assisted operations — LLM-powered root cause analysis
Chaos Engineering
Break things safely before production does it for you
What to learn
- Chaos principles — steady-state hypothesis, minimize blast radius
- LitmusChaos on Kubernetes — pod kill, network delay, CPU stress
- GameDays — planned failure exercises with stakeholders
- Blast radius control — limit scope with namespaces and feature flags
- Observability during chaos — watch SLOs in real time
- Reporting — what broke, what held, what needs fixing
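One way to picture the steady-state hypothesis and the "watch SLOs in real time" items above is as a guard that is checked against each measurement taken while a fault is injected. The sketch below is tool-agnostic and does not use any LitmusChaos API. The class, the function, the thresholds, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class SteadyState:
    """Thresholds that define 'the system is healthy' for an experiment."""

    min_success_rate: float      # e.g. 0.999 means 99.9% of requests succeed
    max_p99_latency_ms: float

    def holds(self, success_rate: float, p99_latency_ms: float) -> bool:
        return (success_rate >= self.min_success_rate
                and p99_latency_ms <= self.max_p99_latency_ms)


def first_violation(hypothesis: SteadyState,
                    samples: Iterable[Tuple[float, float]]) -> Optional[int]:
    """Return the index of the first sample that breaks the hypothesis, or None.

    In a real experiment, a violation would be the signal to stop the fault
    injection and roll back; here it is only reported.
    """
    for index, (success_rate, p99_latency_ms) in enumerate(samples):
        if not hypothesis.holds(success_rate, p99_latency_ms):
            return index
    return None


hypothesis = SteadyState(min_success_rate=0.999, max_p99_latency_ms=300.0)
# Hypothetical (success_rate, p99_ms) readings taken during a pod-kill experiment.
observed = [(0.9995, 120.0), (0.9991, 280.0), (0.9950, 450.0)]
print(first_violation(hypothesis, observed))
```

Writing the hypothesis down before the experiment keeps the report honest: the experiment either held the stated thresholds or it did not.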
Capacity Planning & FinOps
Right-size everything; never over- or under-provision
What to learn
- Load testing and traffic forecasting — predict before it breaks
- Kubernetes resource rightsizing — VPA-informed manual tuning
- Cloud cost optimization — Spot, Reserved Instances, Savings Plans
- Chargeback and showback — cost attribution per team/service
- Unit economics — cost per request, cost per user
- FinOps culture — engineers own their cloud costs
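The rightsizing and unit-economics items above can be sketched with a few lines of arithmetic. The approach shown, setting a CPU request to a high percentile of observed usage plus headroom, is one common heuristic and an assumption of this example, not a rule from the roadmap. The function names, the 95th percentile, the 25% headroom, and all numbers are illustrative.

```python
import math


def percentile(values, p: int):
    """Nearest-rank percentile of a non-empty sequence, with p in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_cpu_request(usage_millicores, p: int = 95, headroom: float = 0.25) -> int:
    """Suggest a CPU request: the p-th percentile of usage plus headroom, rounded up."""
    return math.ceil(percentile(usage_millicores, p) * (1 + headroom))


def cost_per_request(monthly_cost: float, monthly_requests: int) -> float:
    """Unit cost: total spend for a service divided by the requests it served."""
    return monthly_cost / monthly_requests


# Hypothetical usage samples in millicores: 10, 20, ..., 200.
usage = list(range(10, 210, 10))
print(recommend_cpu_request(usage))          # request in millicores
print(cost_per_request(1200.0, 60_000_000))  # currency units per request
```

Tracking cost per request over time shows whether spend is growing faster than traffic, which a raw monthly bill cannot reveal.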