Site Reliability Engineer Roadmap

Complete SRE roadmap covering observability, incident response, SLOs, error budgets, chaos engineering, and reliability practices from the Google SRE playbook.

6–9 months

8 phases

Foundation · Intermediate · Advanced · Expert

Phase 1

Linux & Systems Internals

SREs debug live production systems — know Linux cold

Foundation · 4–5 weeks

What to learn

Process management — ps, top, kill, systemd, journalctl
Memory and CPU analysis — vmstat, iostat, sar, free
Network debugging — ss, netstat, tcpdump, curl, dig
Log analysis — grep, awk, journalctl, log rotation
File system — inodes, disk usage, permissions, lsof
Performance profiling — perf, strace, ltrace
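
As a rough illustration of the log-analysis item above, the sketch below tallies HTTP status classes from access-log lines, which is roughly what a grep/awk pipeline would do on a live box. The log format, the sample lines, and the function name are assumptions made up for this example.

```python
from collections import Counter

# Hypothetical access-log lines; the field layout is an assumption for this sketch.
SAMPLE_LOG = """\
10.0.0.1 GET /api/users 200 12ms
10.0.0.2 GET /api/orders 500 340ms
10.0.0.1 POST /api/orders 201 45ms
10.0.0.3 GET /api/users 503 1200ms
10.0.0.2 GET /healthz 200 1ms
"""


def count_status_classes(log_text: str) -> Counter:
    """Count responses per status class (2xx, 4xx, 5xx, ...)."""
    counts = Counter()
    for line in log_text.splitlines():
        fields = line.split()
        # The status code is assumed to be the fourth whitespace-separated field.
        if len(fields) < 4 or not fields[3].isdigit():
            continue  # skip malformed lines instead of crashing
        counts[fields[3][0] + "xx"] += 1
    return counts


if __name__ == "__main__":
    for status_class, n in sorted(count_status_classes(SAMPLE_LOG).items()):
        print(status_class, n)
```

Real logs rarely have a layout this clean, so the field index would need adjusting per format.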

Key tools

bash, strace, perf, tmux, vim

Resources

Linux Commands for DevOps Engineers
Interview Q&A for this topic → Bundle

Phase 2

Observability — Metrics, Logs, Traces

If you can't measure it, you can't improve it

Foundation · 4–5 weeks

What to learn

The 3 pillars — metrics, structured logs, distributed traces
Prometheus — scraping, metric types (counter, gauge, histogram)
Grafana — dashboards, alerts, variables, templating
Loki — log aggregation, LogQL, label strategies
OpenTelemetry — instrumentation, collectors, exporters
Distributed tracing with Jaeger or Tempo — trace IDs
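
To make the three metric types listed above concrete, here is a toy model of their semantics in plain Python. It is a sketch for intuition only, not the API of any Prometheus client library; the class and method names are assumptions chosen for this example.

```python
import bisect


class Counter:
    """Monotonically increasing value, e.g. total requests served."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount: float = 1.0):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


class Gauge:
    """Value that can go up and down, e.g. current queue depth."""

    def __init__(self):
        self.value = 0.0

    def set(self, value: float):
        self.value = value


class Histogram:
    """Observations counted into buckets by upper bound ("less than or equal")."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot at the end acts as the +Inf bucket.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0
        self.count = 0

    def observe(self, value: float):
        # bisect_left puts a value equal to a bound into that bound's bucket.
        idx = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[idx] += 1
        self.total += value
        self.count += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, ending with +Inf."""
        running, out = 0, []
        bounds = self.upper_bounds + [float("inf")]
        for bound, n in zip(bounds, self.bucket_counts):
            running += n
            out.append((bound, running))
        return out
```

The cumulative bucket layout is what makes quantile estimation over histograms possible: each bucket answers "how many observations were at or below this bound".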

Key tools

Prometheus, Grafana, Loki, OpenTelemetry, Alertmanager

Resources

Prometheus + Grafana Guide
OpenTelemetry Complete Guide
Grafana Loki Log Aggregation
Interview Q&A for this topic → Bundle

Phase 3

Kubernetes Operations

Most SRE teams run on Kubernetes — own it

Intermediate · 4–5 weeks

What to learn

Pod lifecycle, restarts, OOMKilled, CrashLoopBackOff debugging
Resource requests, limits, QoS classes (Guaranteed, Burstable, BestEffort)
HPA, VPA, Cluster Autoscaler, Karpenter
RBAC, NetworkPolicies, PodSecurityAdmission
Kubernetes events, audit logs, and debugging toolkit
Draining nodes, PodDisruptionBudgets, rolling updates
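
The QoS classes mentioned above follow from how CPU and memory requests and limits are set on a pod's containers. The sketch below approximates that rule in Python. It is simplified: quantities are compared as plain strings (so "1" and "1000m" would not match), init containers are ignored, and the dict layout and function name are assumptions for this example.

```python
def qos_class(containers):
    """Classify a pod from a list of container resource specs.

    Each container is a dict such as
    {"requests": {"cpu": "500m", "memory": "256Mi"}, "limits": {...}}.
    """
    any_set = False
    all_guaranteed = True
    for c in containers:
        limits = c.get("limits", {})
        # When only a limit is given, the request defaults to the limit.
        requests = {**limits, **c.get("requests", {})}
        for resource in ("cpu", "memory"):
            if resource in requests or resource in limits:
                any_set = True
            # Guaranteed needs a limit, and a request equal to it, for both resources.
            if resource not in limits or requests.get(resource) != limits[resource]:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


if __name__ == "__main__":
    print(qos_class([{"limits": {"cpu": "1", "memory": "1Gi"}}]))
    print(qos_class([{"requests": {"cpu": "500m"}}]))
    print(qos_class([{}]))
```

The class matters during node memory pressure: BestEffort pods are the first candidates for eviction and Guaranteed pods the last.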

Key tools

kubectl, k9s, Helm, Karpenter, metrics-server

Resources

OOMKilled Fix Guide
Kubernetes Troubleshooting Guide
Karpenter vs Cluster Autoscaler
Interview Q&A for this topic → Bundle

Phase 4

Incident Response & On-Call

When prod breaks at 3am, you own it

Intermediate · 2–3 weeks

What to learn

Incident lifecycle — detect, triage, mitigate, resolve, review
Incident command structure — IC, comms lead, scribe roles
Runbooks — writing actionable, tested remediation guides
Blameless postmortems — 5 whys, timeline reconstruction
On-call rotation design — sustainable, low burnout
Alert quality — actionable alerts only, suppress noise
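
One common way to cut alert noise, as the last item above suggests, is to suppress repeats of the same alert within a time window. The sketch below shows that idea in isolation. The tuple layout, the 5-minute default, and the function name are assumptions for this example; paging tools offer their own grouping and deduplication features.

```python
def suppress_repeats(alerts, window_seconds=300):
    """Keep an alert only if the same (name, service) pair has not been
    forwarded within the last `window_seconds`.

    `alerts` is an iterable of (timestamp_seconds, name, service) tuples
    sorted by timestamp.
    """
    last_sent = {}
    kept = []
    for ts, name, service in alerts:
        key = (name, service)
        if key in last_sent and ts - last_sent[key] < window_seconds:
            continue  # duplicate inside the window: suppress it
        last_sent[key] = ts
        kept.append((ts, name, service))
    return kept


if __name__ == "__main__":
    stream = [
        (0, "HighLatency", "api"),
        (60, "HighLatency", "api"),   # repeat within 5 minutes
        (120, "HighLatency", "web"),  # different service, kept
        (400, "HighLatency", "api"),  # window has passed, kept
    ]
    for alert in suppress_repeats(stream):
        print(alert)
```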

Key tools

PagerDuty, Opsgenie, Slack, Grafana OnCall, Statuspage

Resources

AI-Powered Incident Response
Interview Q&A for this topic → Bundle

Phase 5

SLOs, Error Budgets & Reliability

The mathematical core of the SRE model

Intermediate · 2–3 weeks

What to learn

SLIs — latency, availability, throughput, error rate
SLOs — setting realistic targets (99.9% vs 99.99% math)
SLAs — contractual obligations vs internal targets
Error budgets — balancing feature velocity and reliability
Error budget policies — what triggers freeze or rollback
DORA metrics: deployment frequency, lead time for changes, change failure rate, time to restore service (MTTR)
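
The "99.9% vs 99.99% math" in the list above comes down to one multiplication: the allowed downtime is the period length times the fraction of time the SLO lets you be unavailable. A minimal sketch, assuming a time-based availability SLO and a 30-day period (the function name is hypothetical):

```python
def allowed_downtime_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Downtime permitted by an availability SLO over the given period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {minutes:.2f} minutes of downtime")
```

Each extra nine cuts the budget by a factor of ten: roughly 43.2 minutes per 30 days at 99.9%, and roughly 4.32 minutes at 99.99%.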

Key tools

Prometheus SLO alerts, Sloth, OpenSLO, Grafana

Resources

DORA Metrics Guide
Interview Q&A for this topic → Bundle

Phase 6

Automation & Toil Reduction

If humans repeat it, automate it

Advanced · 3–4 weeks

What to learn

Defining toil: manual, repetitive, automatable work that has no enduring value and scales linearly with service growth
Runbook automation — runbooks as code, triggered automatically
Self-healing systems — auto-remediation on alert trigger
CI/CD for infra — Terraform + GitOps for all changes
KEDA for event-driven autoscaling
AI-assisted operations — LLM-powered root cause analysis
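
Auto-remediation on alert trigger, as listed above, can be pictured as a lookup from alert name to a remediation function, with a guard so a flapping alert does not trigger the same fix forever. The sketch below is one possible shape for that, not a reference design. The class, the alert names, and the per-hour limit are all assumptions for this example.

```python
import time


class Remediator:
    """Maps alert names to remediation callables, with a per-target run limit."""

    def __init__(self, max_runs_per_hour=3, clock=time.time):
        self.actions = {}
        self.history = {}  # (alert_name, target) -> timestamps of recent runs
        self.max_runs_per_hour = max_runs_per_hour
        self.clock = clock

    def register(self, alert_name, action):
        self.actions[alert_name] = action

    def handle(self, alert_name, target):
        action = self.actions.get(alert_name)
        if action is None:
            return "escalate: no runbook"
        now = self.clock()
        key = (alert_name, target)
        recent = [t for t in self.history.get(key, []) if now - t < 3600]
        if len(recent) >= self.max_runs_per_hour:
            # Repeated automatic fixes suggest a deeper problem: page a human.
            return "escalate: remediation limit reached"
        recent.append(now)
        self.history[key] = recent
        action(target)
        return "remediated"


if __name__ == "__main__":
    r = Remediator(max_runs_per_hour=2)
    r.register("DiskFull", lambda node: print(f"cleaning temp files on {node}"))
    for _ in range(3):
        print(r.handle("DiskFull", "node-1"))
    print(r.handle("UnknownAlert", "node-1"))
```

The escalation path is the important design choice here: automation that silently retries can hide the very signal an on-call engineer needs.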

Key tools

Ansible, Terraform, KEDA, ArgoCD, GitHub Actions

Resources

AI DevOps Agent with LangGraph
Agentic SRE: Self-Healing Infrastructure
Interview Q&A for this topic → Bundle

Phase 7

Chaos Engineering

Break things safely before production does it for you

Advanced · 2–3 weeks

What to learn

Chaos principles — steady-state hypothesis, minimize blast radius
LitmusChaos on Kubernetes — pod kill, network delay, CPU stress
GameDays — planned failure exercises with stakeholders
Blast radius control — limit scope with namespaces and feature flags
Observability during chaos — watch SLOs in real time
Reporting — what broke, what held, what needs fixing
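
The steady-state hypothesis from the list above can be sketched as a small experiment loop: confirm the system is healthy, inject the fault, measure again, and always roll back. The function below is a tool-agnostic illustration. The callables, the 1% error-rate threshold, and the result fields are assumptions for this example, not part of LitmusChaos or any other tool.

```python
def run_experiment(measure_error_rate, inject_fault, rollback, max_error_rate=0.01):
    """Run one chaos experiment against a steady-state hypothesis.

    The hypothesis: the error rate stays at or below `max_error_rate`
    even while the fault is active.
    """
    baseline = measure_error_rate()
    if baseline > max_error_rate:
        # Never inject faults into a system that is already unhealthy.
        return {"status": "aborted", "reason": "system not in steady state",
                "baseline": baseline}
    inject_fault()
    try:
        during = measure_error_rate()
    finally:
        rollback()  # always undo the fault, even if measuring fails
    status = "held" if during <= max_error_rate else "broke"
    return {"status": status, "baseline": baseline, "during": during}


if __name__ == "__main__":
    readings = iter([0.001, 0.05])  # made-up readings: healthy, then degraded
    result = run_experiment(
        measure_error_rate=lambda: next(readings),
        inject_fault=lambda: print("injecting fault"),
        rollback=lambda: print("rolling back"),
    )
    print(result["status"])
```

The returned dictionary maps directly onto the reporting item: what broke, what held, and by how much.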

Key tools

LitmusChaos, Gremlin, k6, Chaos Toolkit

Resources

Chaos Engineering Guide
Interview Q&A for this topic → Bundle

Phase 8

Capacity Planning & FinOps

Right-size everything, never over- or under-provision

Expert · Ongoing

What to learn

Load testing and traffic forecasting — predict before it breaks
Kubernetes resource rightsizing — VPA-informed manual tuning
Cloud cost optimization — Spot, Reserved Instances, Savings Plans
Chargeback and showback — cost attribution per team/service
Unit economics — cost per request, cost per user
FinOps culture — engineers own their cloud costs
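
The unit-economics item above is simple division, but writing it down makes the inputs explicit. A minimal sketch, assuming one monthly bill attributed to a single service; the function name and the figures in the usage example are made up for illustration.

```python
def unit_costs(monthly_cost_usd, monthly_requests, monthly_active_users):
    """Return cost per 1,000 requests and cost per active user."""
    if monthly_requests <= 0 or monthly_active_users <= 0:
        raise ValueError("request and user counts must be positive")
    return {
        "cost_per_1k_requests": monthly_cost_usd / monthly_requests * 1000,
        "cost_per_user": monthly_cost_usd / monthly_active_users,
    }


if __name__ == "__main__":
    # Hypothetical service: $12,000 per month, 60M requests, 40k active users.
    costs = unit_costs(12_000, 60_000_000, 40_000)
    print(f"${costs['cost_per_1k_requests']:.2f} per 1k requests")
    print(f"${costs['cost_per_user']:.2f} per user")
```

Tracking these ratios over time shows whether spend is growing faster than usage, which a raw monthly total cannot tell you.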

Key tools

k6, Locust, Kubecost, Infracost, AWS Cost Explorer

Resources

FinOps Guide for DevOps Engineers
Interview Q&A for this topic → Bundle

Interview Prep

DevOps Interview Prep Bundle: 1000+ Q&A

Every topic on this roadmap has interview questions in the bundle: Docker, Kubernetes, AWS, CI/CD, Linux, SRE, FinOps, System Design. Grab it before your next interview.

Frequently Asked Questions

Common questions about the Site Reliability Engineer roadmap

1. What is the difference between SRE and DevOps?

DevOps is a culture and set of practices. SRE is Google's implementation of DevOps with a specific methodology — SLOs, error budgets, and the philosophy that reliability is a feature. SREs write code to solve operational problems, while DevOps engineers focus on automation and CI/CD.

2. How long does it take to become an SRE?

Most engineers transition to SRE after 2–3 years in DevOps, sysadmin, or software engineering. The full roadmap — Linux internals, observability, Kubernetes operations, incident response, and chaos engineering — takes 6–9 months of focused study.

3. Do I need to code to be an SRE?

Yes. SREs are expected to write production-quality code — automation scripts, internal tools, runbook automation, and infrastructure code. Python and Go are the most common SRE languages. You don't need to be a software engineer, but coding is non-negotiable.

4. What is an SRE salary in 2026?

In the US, junior SRE: $110K–$140K, mid-level SRE: $150K–$200K, senior SRE at top tech: $250K–$400K+ TC. In India, SRE roles at top companies: ₹15L–₹60L+. SRE roles typically pay 20–30% more than equivalent DevOps roles.

5. What is an error budget in SRE?

An error budget is the amount of downtime or errors you can tolerate in a given period, derived from your SLO. If your SLO is 99.9% availability, your error budget over a 30-day month is about 43 minutes (0.1% of 43,200 minutes). When the budget is exhausted, a typical error budget policy freezes feature deployments and gives reliability work priority.
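
For request-based SLOs the same idea is usually tracked in failed requests rather than minutes. The sketch below shows one way to compute how much of the budget has been consumed and whether a freeze would apply. The function name, the result fields, and the "freeze at 100% consumed" rule are assumptions for this example; real error budget policies vary by team.

```python
def budget_status(slo_percent, total_requests, failed_requests):
    """Report error budget consumption for a request-based SLO."""
    allowed_failures = total_requests * (1 - slo_percent / 100)
    if allowed_failures == 0:
        # No traffic or a 100% SLO: any failure exhausts the budget.
        consumed = 0.0 if failed_requests == 0 else float("inf")
    else:
        consumed = failed_requests / allowed_failures
    return {
        "allowed_failures": allowed_failures,
        "consumed_fraction": consumed,
        "freeze": consumed >= 1.0,
    }


if __name__ == "__main__":
    # Made-up numbers: 1M requests this period against a 99.9% SLO.
    status = budget_status(99.9, 1_000_000, 400)
    print(f"budget consumed: {status['consumed_fraction']:.0%}, freeze: {status['freeze']}")
```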