What Is an SLO, SLI, and Error Budget? Explained Simply
SRE teams talk about SLOs, SLIs, and error budgets constantly, but the terms get used loosely. Here's what each one actually means, with real numbers, and how they work together to tell you when to ship and when to slow down.
These three terms come as a set, and understanding one without the others doesn't get you very far. Let's build them up in order, with a real example.
SLI — Service Level Indicator
An SLI is a measurement. A number you can actually calculate from your system's behavior.
SLI = (good events / total events) over a time window
For availability: successful requests / total requests
Common SLIs:
- Availability: percentage of requests that succeed
- Latency: percentage of requests faster than X ms
- Error rate: percentage of requests returning 5xx
- Throughput: requests handled per second
An SLI by itself is just data. "99.95% of requests succeeded last hour" is an SLI measurement — neither good nor bad until you compare it to a target.
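To make the "an SLI is a measurement" point concrete, here is a minimal sketch that computes two of the SLIs listed above from a handful of request records. The `Request` type, the function names, and the rule that anything below a 5xx status counts as a success are assumptions made for illustration; a real system would usually derive these numbers from its metrics or logs.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # time taken to serve the request


def availability_sli(requests: list[Request]) -> float:
    """Percentage of requests that did not return a 5xx status."""
    if not requests:
        return 100.0  # assumption: no traffic means nothing failed
    good = sum(1 for r in requests if r.status < 500)
    return 100.0 * good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Percentage of requests served faster than threshold_ms."""
    if not requests:
        return 100.0
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return 100.0 * fast / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 480.0),
    Request(503, 30.0),
    Request(404, 95.0),
]
print(availability_sli(sample))    # 75.0 (one 5xx out of four requests)
print(latency_sli(sample, 100.0))  # 50.0 (two of four under 100 ms)
```

Whether a 404 counts as "good" is a choice you make when defining the SLI; here it does, because the service answered correctly.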
SLO — Service Level Objective
An SLO is the target you set for an SLI, over a defined time window.
SLO: 99.9% of requests will succeed, measured over a rolling 30-day window
This is the number your team commits to internally. Note it's not the same as an SLA (Service Level Agreement) — an SLA is the external, often contractual promise to customers, usually set looser than your internal SLO, so you have margin to miss your own target occasionally without breaching a customer contract.
Internal SLO: 99.9% (your real target, the one you design and operate around)
External SLA: 99.5% (the contractual promise, with financial penalties if missed)
The gap between SLO and SLA is your safety margin against actually paying customers money for downtime.
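As a rough sketch of how that margin plays out, the function below sorts a measured SLI into three states using the example targets above (99.9% SLO, 99.5% SLA). The function name and the wording of the returned labels are hypothetical.

```python
def classify(measured_pct: float, slo_pct: float = 99.9, sla_pct: float = 99.5) -> str:
    """Compare a measured SLI against the internal SLO and the external SLA."""
    if measured_pct >= slo_pct:
        return "meeting SLO"
    if measured_pct >= sla_pct:
        # Inside the safety margin: an internal miss, but no contract breach
        return "SLO missed, SLA still met"
    return "SLA breached"


print(classify(99.95))  # meeting SLO
print(classify(99.7))   # SLO missed, SLA still met
print(classify(99.2))   # SLA breached
```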
Error Budget — The Most Useful Part
Once you have an SLO, the error budget is its complement (100% minus the SLO): how much failure you're allowed before you breach it.
SLO: 99.9% success rate over 30 days
Error budget: 0.1% of requests can fail over 30 days
If you serve 10,000,000 requests in 30 days:
Error budget = 10,000 failed requests allowed
This number changes everything about how a team makes decisions. Instead of "zero downtime, always," you have a concrete, spendable budget.
def calculate_error_budget_remaining(total_requests: int, failed_requests: int, slo: float) -> dict:
    """Report how much error budget is left; slo is a fraction such as 0.999."""
    # Failures the SLO tolerates. Rounded because 1 - slo is not exact in floating point.
    allowed_failures = round(total_requests * (1 - slo))
    # With no budget at all (for example, zero traffic), treat the budget as fully consumed
    # so that this value agrees with budget_exhausted below.
    budget_consumed_pct = (failed_requests / allowed_failures * 100) if allowed_failures else 100.0
    return {
        "allowed_failures": allowed_failures,
        "actual_failures": failed_requests,
        "budget_remaining_pct": round(max(0.0, 100 - budget_consumed_pct), 2),
        "budget_exhausted": failed_requests >= allowed_failures,
    }

# Example: SLO of 99.9%, 10M requests this month, 7,500 failures so far
result = calculate_error_budget_remaining(10_000_000, 7_500, 0.999)
# {'allowed_failures': 10000, 'actual_failures': 7500,
#  'budget_remaining_pct': 25.0, 'budget_exhausted': False}

Why This Actually Changes Behavior
Without an error budget, every outage feels equally urgent and every "should we ship this risky change" decision is argued from scratch. With an error budget, the decision becomes mechanical:
Budget remaining is healthy (>50%) → ship new features, take calculated risks, deploy faster.
Budget is getting consumed fast → slow down, prioritize reliability work over new features, add more testing before risky deploys.
Budget is exhausted → freeze non-critical changes, all hands on reliability until you're back in budget.
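The three tiers above can be sketched as a small function. Only the "above 50% is healthy" and "exhausted means freeze" thresholds come from the list; treating everything in between as "slow down" is an assumption, and `deploy_policy` is a hypothetical name.

```python
def deploy_policy(budget_remaining_pct: float) -> str:
    """Map remaining error budget (0-100) to a deploy stance."""
    if budget_remaining_pct <= 0:
        return "freeze"     # budget exhausted: reliability work only
    if budget_remaining_pct > 50:
        return "ship"       # healthy budget: take calculated risks
    return "slow down"      # assumption: anything in between is the middle tier


for pct in (80, 25, 0):
    print(pct, "->", deploy_policy(pct))
# 80 -> ship
# 25 -> slow down
# 0 -> freeze
```

A team would normally tune these cut-offs to its own risk tolerance and write them into an agreed error budget policy.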
# This becomes an actual policy, not a vibe-based argument in Slack
# Example: alert that triggers a deploy freeze when less than 10% of the error budget remains
# (assumes you already export a metric named error_budget_remaining_pct)
- alert: ErrorBudgetExhausted
  expr: error_budget_remaining_pct < 10
  labels:
    severity: critical
  annotations:
    summary: "Error budget below 10%: freeze non-critical deploys until reliability improves"
    action: "Notify #platform-team, block non-hotfix merges to main"

Picking the Right SLO: The Part Teams Get Wrong
The instinct is to aim for 99.99% or 99.999% everywhere because it sounds impressive. Don't. Two real costs scale with stricter SLOs:
Engineering cost — going from 99.9% to 99.99% often costs disproportionately more engineering effort than going from 99% to 99.9%, because you're now defending against rarer, harder-to-reproduce failure modes.
Reduced shipping velocity — a tighter SLO means a smaller error budget, which means less room to take risks on new features.
99.9% SLO → 43 minutes of downtime/month allowed
99.99% SLO → 4.3 minutes of downtime/month allowed
99.999% SLO → 26 seconds of downtime/month allowed
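These figures follow directly from the SLO and the window length. A minimal sketch of the arithmetic, assuming a 30-day month and a hypothetical helper named `allowed_downtime`:

```python
from datetime import timedelta


def allowed_downtime(slo_pct: float, window_days: int = 30) -> timedelta:
    """Downtime an availability SLO tolerates over the window."""
    window = timedelta(days=window_days)
    return window * (1 - slo_pct / 100)


for slo in (99.9, 99.99, 99.999):
    minutes = allowed_downtime(slo).total_seconds() / 60
    print(f"{slo}% -> {minutes:.2f} minutes")
# 99.9% -> 43.20 minutes
# 99.99% -> 4.32 minutes
# 99.999% -> 0.43 minutes
```

The last line is the 26 seconds quoted above: 0.43 minutes is about 25.9 seconds.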
Pick the SLO based on what users actually need, not what sounds aspirational. An internal admin dashboard used twice a day doesn't need the same SLO as a payment processing API — and pretending it does just burns engineering effort that should go elsewhere.
Putting It Together
SLI: "98.7% of checkout requests succeeded in the last hour" (a measurement)
SLO: "99.9% of checkout requests will succeed over 30 days" (the target)
Error budget: "At 10,000,000 checkout requests a month, we can afford 10,000 failed checkouts" (the allowance)
The SLI tells you where you stand right now. The SLO tells you what you're aiming for. The error budget translates the gap between them into an actual number your team can plan around — and that's the piece that turns reliability from an abstract value into an operational decision-making tool.
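One way to connect the hourly SLI to the monthly budget is a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below applies it to the checkout numbers above. The `burn_rate` helper is hypothetical, and the time-to-exhaustion estimate assumes the failure rate and traffic stay constant, which real traffic rarely does.

```python
def burn_rate(sli_pct: float, slo_pct: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    allowed_error = 100.0 - slo_pct
    if allowed_error <= 0:
        raise ValueError("an SLO of 100% leaves no error budget to burn")
    observed_error = 100.0 - sli_pct
    return observed_error / allowed_error


rate = burn_rate(98.7, 99.9)          # last hour's SLI against the 30-day SLO
hours_to_exhaustion = 30 * 24 / rate  # if this rate held from the start of the window

print(f"burn rate: {rate:.1f}x")                              # burn rate: 13.0x
print(f"budget gone in about {hours_to_exhaustion:.0f} hours")  # budget gone in about 55 hours
```

A burn rate of 1.0 means the budget would last exactly the full 30 days; 13x means an hour like this one, sustained, would use up the month's allowance in a little over two days.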