DevOps

What is SRE (Site Reliability Engineering)?

Google's approach to applying software engineering practices to operations problems.

Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to operations. Created at Google, SRE teams use code and automation (instead of manual processes) to manage production systems. Core SRE concepts: SLOs/SLIs/SLAs, error budgets, toil reduction, blameless postmortems, and on-call management. SREs write software to automate operational work. The goal: reduce toil, improve reliability, and create a virtuous cycle of reliability investment.

Deep Dive Guide

sre will absorb devops roles

More DevOps Terms

Chaos Engineering

Deliberately injecting failures into a system to discover weaknesses before they cause incidents.

DevOps

A culture and practice combining software development and IT operations for faster, reliable delivery.

DORA Metrics

Four key metrics for measuring software delivery performance: deploy frequency, lead time, MTTR, and change failure rate.

FinOps

The practice of bringing financial accountability to cloud spending.

Idempotent

An operation that produces the same result no matter how many times it's executed.

MLOps

The practice of applying DevOps principles to machine learning model lifecycle management.

Test your knowledge of SRE (Site Reliability Engineering) and 130 other DevOps concepts

Interview Prep Full Glossary

What is SRE (Site Reliability Engineering)?

Related Terms

More DevOps Terms