What is SRE (Site Reliability Engineering)?
Google's approach to applying software engineering practices to operations problems.
Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to operations. Created at Google, SRE teams use code and automation (instead of manual processes) to manage production systems. Core SRE concepts: SLOs/SLIs/SLAs, error budgets, toil reduction, blameless postmortems, and on-call management. SREs write software to automate operational work. The goal: reduce toil, improve reliability, and create a virtuous cycle of reliability investment.
Deep Dive Guide
sre will absorb devops roles
More DevOps Terms
Chaos Engineering
Deliberately injecting failures into a system to discover weaknesses before they cause incidents.
DevOps
A culture and practice combining software development and IT operations for faster, reliable delivery.
DORA Metrics
Four key metrics for measuring software delivery performance: deploy frequency, lead time, MTTR, and change failure rate.
FinOps
The practice of bringing financial accountability to cloud spending.
Idempotent
An operation that produces the same result no matter how many times it's executed.
MLOps
The practice of applying DevOps principles to machine learning model lifecycle management.
Test your knowledge of SRE (Site Reliability Engineering) and 130 other DevOps concepts