Site Reliability Engineer Roadmap

Complete SRE roadmap covering observability, incident response, SLOs, error budgets, chaos engineering, and reliability practices from the Google SRE playbook.

6–9 months
8 phases
Foundation, Intermediate, Advanced, Expert
Phase 1

Linux & Systems Internals

SREs debug live production systems — know Linux cold

Foundation · 4–5 weeks

What to learn

  • Process management — ps, top, kill, systemd, journalctl
  • Memory and CPU analysis — vmstat, iostat, sar, free
  • Network debugging — ss, netstat, tcpdump, curl, dig
  • Log analysis — grep, awk, journalctl, log rotation
  • File system — inodes, disk usage, permissions, lsof
  • Performance profiling — perf, strace, ltrace
Phase 2

Observability — Metrics, Logs, Traces

If you can't measure it, you can't improve it

Foundation · 4–5 weeks

What to learn

  • The 3 pillars — metrics, structured logs, distributed traces
  • Prometheus — scraping, metric types (counter, gauge, histogram)
  • Grafana — dashboards, alerts, variables, templating
  • Loki — log aggregation, LogQL, label strategies
  • OpenTelemetry — instrumentation, collectors, exporters
  • Distributed tracing with Jaeger or Tempo — trace IDs
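
The three Prometheus metric types listed above differ mainly in what operations they allow. The following is a minimal pure-Python sketch of their semantics, not the real client library: the class and method names are illustrative assumptions, and a production service would use an official Prometheus client instead.

```python
from bisect import bisect_left


class Counter:
    """Monotonically increasing value, e.g. total requests served."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount


class Gauge:
    """Value that can go up or down, e.g. in-flight requests."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations into buckets and tracks their running sum."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)                 # bucket upper bounds
        self.counts = [0] * (len(self.bounds) + 1)   # last slot is +Inf
        self.total = 0.0

    def observe(self, value):
        # bisect_left picks the first bound >= value ("less than or equal").
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, smallest bound first."""
        running, out = 0, []
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            running += count
            out.append((bound, running))
        return out


latency = Histogram([0.1, 0.5, 1.0])   # hypothetical bucket bounds, in seconds
for seconds in (0.05, 0.3, 0.5, 2.0):
    latency.observe(seconds)
print(latency.cumulative())
```

Buckets are cumulative, so each bucket's count includes every faster observation. That is what lets a query layer estimate quantiles from bucket counts alone.
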
Phase 3

Kubernetes Operations

Most SRE teams run on Kubernetes — own it

Intermediate · 4–5 weeks

What to learn

  • Pod lifecycle, restarts, OOMKilled, CrashLoopBackOff debugging
  • Resource requests, limits, QoS classes (Guaranteed, Burstable, BestEffort)
  • HPA, VPA, Cluster Autoscaler, Karpenter
  • RBAC, NetworkPolicies, PodSecurityAdmission
  • Kubernetes events, audit logs, and debugging toolkit
  • Draining nodes, PodDisruptionBudgets, rolling updates
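
The QoS classes above follow from how requests and limits are set on each container. The sketch below approximates that rule in Python. The input shape (plain dicts) is an assumption for illustration, and quantities are compared as literal strings, so `"500m"` and `"0.5"` would not be treated as equal here.

```python
def qos_class(containers):
    """Classify a pod from its containers' CPU and memory settings.

    `containers` is a list of dicts shaped like
    {"requests": {"cpu": "500m", "memory": "256Mi"}, "limits": {...}}.
    """
    resources = ("cpu", "memory")
    any_set = False
    guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        for r in resources:
            limit = limits.get(r)
            # A missing request defaults to the limit, so mirror that here.
            request = requests.get(r, limit)
            if limit is not None or request is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


pod = [{"requests": {"cpu": "100m"}, "limits": {"cpu": "500m", "memory": "256Mi"}}]
print(qos_class(pod))  # Burstable: the CPU request differs from the CPU limit
```

The class matters during node memory pressure: BestEffort pods are evicted first and Guaranteed pods last.
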
Phase 4

Incident Response & On-Call

When prod breaks at 3am, you own it

Intermediate · 2–3 weeks

What to learn

  • Incident lifecycle — detect, triage, mitigate, resolve, review
  • Incident command structure — IC, comms lead, scribe roles
  • Runbooks — writing actionable, tested remediation guides
  • Blameless postmortems — 5 whys, timeline reconstruction
  • On-call rotation design — sustainable, low burnout
  • Alert quality — actionable alerts only, suppress noise
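
Timeline reconstruction for a postmortem often starts with how long each lifecycle stage took. The sketch below assumes a hypothetical record with one ISO 8601 timestamp per milestone. The key names are illustrative and do not come from any particular incident tool.

```python
from datetime import datetime


def phase_durations(timeline):
    """Return minutes elapsed between consecutive lifecycle milestones."""
    order = ["started", "detected", "triaged", "mitigated", "resolved"]
    stamps = [datetime.fromisoformat(timeline[key]) for key in order]
    return {
        f"{a}_to_{b}": (t2 - t1).total_seconds() / 60
        for a, b, t1, t2 in zip(order, order[1:], stamps, stamps[1:])
    }


incident = {  # made-up example data
    "started": "2024-01-01T03:00:00",
    "detected": "2024-01-01T03:04:00",
    "triaged": "2024-01-01T03:10:00",
    "mitigated": "2024-01-01T03:40:00",
    "resolved": "2024-01-01T04:30:00",
}
for phase, minutes in phase_durations(incident).items():
    print(f"{phase}: {minutes:.0f} min")
```

A long gap between start and detection usually points at alerting, while a long gap between triage and mitigation usually points at runbooks.
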

Key tools

PagerDuty, Opsgenie, Slack, Grafana OnCall, Statuspage
Phase 5

SLOs, Error Budgets & Reliability

The mathematical core of the SRE model

Intermediate · 2–3 weeks

What to learn

  • SLIs — latency, availability, throughput, error rate
  • SLOs — setting realistic targets (99.9% vs 99.99% math)
  • SLAs — contractual obligations vs internal targets
  • Error budgets — balancing feature velocity and reliability
  • Error budget policies — what triggers freeze or rollback
  • DORA metrics — deploy frequency, lead time for changes, MTTR, change failure rate
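
The "99.9% vs 99.99% math" above comes down to two small formulas. The sketch below assumes a 30-day window and an availability-style SLO, and the function names are illustrative.

```python
def allowed_downtime_minutes(slo_percent, window_days=30):
    """Minutes of full downtime an availability SLO permits over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


def burn_rate(error_ratio, slo_percent):
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return error_ratio / (1 - slo_percent / 100)


for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.2f} min per 30 days")

# A service with a 99.9% SLO that is currently failing 0.5% of requests
# is spending its budget about five times faster than the SLO allows.
print(round(burn_rate(0.005, 99.9), 2))
```

Each extra nine cuts the budget by a factor of ten, from about 43 minutes per month at 99.9% to about 4 minutes at 99.99%. This is why the tighter target usually demands automated mitigation rather than human response.
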

Key tools

Prometheus SLO alerts, Sloth, OpenSLO, Grafana
Phase 6

Automation & Toil Reduction

If humans repeat it, automate it

Advanced · 3–4 weeks

What to learn

  • Defining toil — manual, repetitive, automatable, devoid of enduring value, and growing linearly with the service
  • Runbook automation — runbooks as code, triggered automatically
  • Self-healing systems — auto-remediation on alert trigger
  • CI/CD for infra — Terraform + GitOps for all changes
  • KEDA for event-driven autoscaling
  • AI-assisted operations — LLM-powered root cause analysis
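
Auto-remediation on alert trigger can be sketched as a lookup from alert name to a remediation function, with a cap on attempts so a failing fix escalates instead of looping. Everything below is hypothetical: the alert names, the remediation functions, and the payload shape, which assumes each alert carries a `labels` dict with an `alertname` key.

```python
def restart_service(alert):
    # Placeholder: a real action would call an orchestrator or run a script.
    return f"restarted {alert['labels']['service']}"


def clear_tmp(alert):
    # Placeholder for a disk cleanup job.
    return f"cleared temp files on {alert['labels']['instance']}"


# Hypothetical mapping from alert name to its automated runbook step.
REMEDIATIONS = {
    "ServiceDown": restart_service,
    "DiskAlmostFull": clear_tmp,
}


def handle_alert(alert, attempts, max_attempts=2):
    """Run the mapped remediation, or escalate to a human."""
    name = alert["labels"]["alertname"]
    action = REMEDIATIONS.get(name)
    key = (name, alert["labels"].get("instance", ""))
    if action is None or attempts.get(key, 0) >= max_attempts:
        return "escalate to on-call"
    attempts[key] = attempts.get(key, 0) + 1
    return action(alert)


attempts = {}  # in-memory counter; a real system would persist and expire this
alert = {"labels": {"alertname": "ServiceDown", "service": "api", "instance": "node-1"}}
for _ in range(3):
    print(handle_alert(alert, attempts))
```

The attempt cap is the important design choice. Unbounded automatic restarts can mask a real fault and delay the page that a human needs to see.
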
Phase 7

Chaos Engineering

Break things safely before production does it for you

Advanced · 2–3 weeks

What to learn

  • Chaos principles — steady-state hypothesis, minimize blast radius
  • LitmusChaos on Kubernetes — pod kill, network delay, CPU stress
  • GameDays — planned failure exercises with stakeholders
  • Blast radius control — limit scope with namespaces and feature flags
  • Observability during chaos — watch SLOs in real time
  • Reporting — what broke, what held, what needs fixing
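
The steady-state hypothesis above gives every experiment the same shape: verify health, inject a fault, verify again, and always roll back. The sketch below captures that control flow with plain callables. It is not the API of LitmusChaos or any other tool, and the simulated system is made up.

```python
def run_experiment(steady_state, inject, rollback):
    """Run one chaos experiment and report whether the hypothesis held."""
    if not steady_state():
        # Never inject faults into a system that is already unhealthy.
        return {"ran": False, "held": None}
    try:
        inject()
        held = steady_state()
    finally:
        rollback()  # always undo the fault, even if a probe raised
    return {"ran": True, "held": held}


# Simulated system: three replicas, and the service is healthy with at least two.
system = {"replicas": 3}
result = run_experiment(
    steady_state=lambda: system["replicas"] >= 2,
    inject=lambda: system.update(replicas=system["replicas"] - 1),  # "pod kill"
    rollback=lambda: system.update(replicas=3),
)
print(result)
```

Refusing to run against an unhealthy system is itself a form of blast radius control, because the experiment can then never add to an incident already in progress.
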

Key tools

LitmusChaos, Gremlin, k6, Chaos Toolkit
Phase 8

Capacity Planning & FinOps

Right-size everything: never over- or under-provision

Expert · Ongoing

What to learn

  • Load testing and traffic forecasting — predict before it breaks
  • Kubernetes resource rightsizing — VPA-informed manual tuning
  • Cloud cost optimization — Spot, Reserved Instances, Savings Plans
  • Chargeback and showback — cost attribution per team/service
  • Unit economics — cost per request, cost per user
  • FinOps culture — engineers own their cloud costs
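
Showback and unit economics combine naturally: attribute cost to each service, then divide by the traffic it served. The sketch below assumes two hypothetical inputs, a cost-per-service mapping (as a cost tool might export) and a request count per service, and all figures are made up.

```python
def showback(costs_by_service, requests_by_service):
    """Build a per-service report with cost per 1,000 requests."""
    report = {}
    for service, cost in costs_by_service.items():
        requests = requests_by_service.get(service, 0)
        report[service] = {
            "cost": cost,
            # None marks services with no recorded traffic.
            "cost_per_1k_requests": (cost / requests * 1000) if requests else None,
        }
    return report


costs = {"checkout": 1200.0, "search": 800.0, "batch-jobs": 300.0}
traffic = {"checkout": 4_000_000, "search": 16_000_000}
for service, row in showback(costs, traffic).items():
    print(service, row)
```

Tracking cost per request instead of total spend separates healthy growth (spend rises with traffic) from waste (spend rises while traffic stays flat).
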

Key tools

k6, Locust, Kubecost, Infracost, AWS Cost Explorer

Frequently Asked Questions

Common questions about the Site Reliability Engineer roadmap

1. What is the difference between SRE and DevOps?
DevOps is a culture and a set of practices. SRE, which originated at Google, is one concrete implementation of DevOps with a specific methodology: SLOs, error budgets, and the philosophy that reliability is a feature. SREs write code to solve operational problems, while DevOps engineers typically focus on delivery automation and CI/CD.
2. How long does it take to become an SRE?
Most engineers transition to SRE after 2–3 years in DevOps, sysadmin, or software engineering. The full roadmap — Linux internals, observability, Kubernetes operations, incident response, and chaos engineering — takes 6–9 months of focused study.
3. Do I need to code to be an SRE?
Yes. SREs are expected to write production-quality code — automation scripts, internal tools, runbook automation, and infrastructure code. Python and Go are the most common SRE languages. You don't need to be a software engineer, but coding is non-negotiable.
4. What is an SRE salary in 2026?
In the US, junior SRE: $110K–$140K, mid-level SRE: $150K–$200K, senior SRE at top tech: $250K–$400K+ TC. In India, SRE roles at top companies: ₹15L–₹60L+. SRE roles typically pay 20–30% more than equivalent DevOps roles.
5. What is an error budget in SRE?
An error budget is the acceptable amount of downtime or errors in a given period, derived from your SLO. If your SLO is 99.9% availability, your error budget over a 30-day month is about 43 minutes (0.1% of 43,200 minutes). Under a typical error budget policy, once the budget is exhausted, feature deployments freeze and reliability work takes priority.
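
An error budget can also be counted in requests rather than minutes, which makes the freeze decision mechanical. The sketch below assumes a request-based SLO and a simple policy that freezes once the budget is fully consumed. The function name and the policy threshold are illustrative.

```python
def budget_status(total_requests, failed_requests, slo_percent):
    """Report how much of a request-based error budget has been consumed."""
    allowed_failures = total_requests * (1 - slo_percent / 100)
    if allowed_failures:
        consumed = failed_requests / allowed_failures
    else:
        consumed = float("inf")  # a 100% SLO leaves no budget at all
    return {
        "allowed_failures": allowed_failures,
        "consumed": consumed,
        "freeze": consumed >= 1.0,
    }


# 1,000,000 requests at a 99.9% SLO allow roughly 1,000 failures.
print(budget_status(1_000_000, 400, 99.9))    # under budget, keep shipping
print(budget_status(1_000_000, 1_200, 99.9))  # over budget, freeze features
```
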