
What Is an SLO, SLI, and Error Budget? Explained Simply

SRE teams talk about SLOs, SLIs, and error budgets constantly, but the terms get used loosely. Here's what each one actually means, with real numbers, and how they connect to decide when to ship vs slow down.

DevOpsBoys · Jun 16, 2026 · 4 min read

These three terms come as a set, and understanding one without the others doesn't get you very far. Let's build them up in order, with a real example.

SLI — Service Level Indicator

An SLI is a measurement. A number you can actually calculate from your system's behavior.

SLI = (successful requests / total requests) over a time window

Common SLIs:

  • Availability: percentage of requests that succeed
  • Latency: percentage of requests faster than X ms
  • Error rate: percentage of requests returning 5xx
  • Throughput: requests handled per second

An SLI by itself is just data. "99.95% of requests succeeded last hour" is an SLI measurement — neither good nor bad until you compare it to a target.
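
As a minimal sketch, the availability and latency SLIs listed above could be computed from raw request records like this. The `Request` record, its field names, and the 300 ms threshold are assumptions made for illustration; in practice these numbers usually come from a metrics system rather than application code.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record shape; real data would come from logs or metrics
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def availability_sli(requests: list[Request]) -> float:
    """Percentage of requests that did not return a 5xx status."""
    if not requests:
        raise ValueError("no requests in the window, SLI is undefined")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests) * 100


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Percentage of requests served faster than threshold_ms."""
    if not requests:
        raise ValueError("no requests in the window, SLI is undefined")
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests) * 100


# Four sample requests: one 5xx failure, one slow response
window = [
    Request(200, 120.0),
    Request(200, 450.0),
    Request(503, 30.0),
    Request(200, 80.0),
]
print(availability_sli(window))    # 75.0
print(latency_sli(window, 300.0))  # 75.0
```

Both functions refuse an empty window instead of returning 0 or 100, since an SLI with no traffic behind it is undefined rather than good or bad.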

SLO — Service Level Objective

An SLO is the target you set for an SLI, over a defined time window.

SLO: 99.9% of requests will succeed, measured over a rolling 30-day window

This is the number your team commits to internally. Note it's not the same as an SLA (Service Level Agreement) — an SLA is the external, often contractual promise to customers, usually set looser than your internal SLO, so you have margin to miss your own target occasionally without breaching a customer contract.

Internal SLO: 99.9% (your real target, the one you design and operate around)
External SLA: 99.5% (the contractual promise, with financial penalties if missed)

The gap between SLO and SLA is your safety margin against actually paying customers money for downtime.
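
To make the margin concrete, here is a small sketch that classifies a measured SLI against both targets. The function name and the returned labels are hypothetical; the 99.9% and 99.5% defaults are the example values used above.

```python
def check_targets(measured_sli: float, slo: float = 99.9, sla: float = 99.5) -> str:
    """Classify a measured SLI (in percent) against the internal SLO and external SLA."""
    if sla > slo:
        raise ValueError("the SLA is expected to be no stricter than the SLO")
    if measured_sli >= slo:
        return "meeting SLO"
    if measured_sli >= sla:
        # Inside the safety margin: an internal miss, but no contract breach
        return "SLO missed, SLA still met"
    return "SLA breached"


print(check_targets(99.95))  # meeting SLO
print(check_targets(99.7))   # SLO missed, SLA still met
print(check_targets(99.2))   # SLA breached
```

The middle branch is the safety margin in action: the team has work to do, but nobody is owed money yet.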

Error Budget — The Most Useful Part

Once you have an SLO, the error budget is simply its complement: 100% minus the SLO, which is how much failure you're allowed before you breach it.

SLO: 99.9% success rate over 30 days
Error budget: 0.1% of requests can fail over 30 days

If you serve 10,000,000 requests in 30 days:
Error budget = 10,000 failed requests allowed

This number changes everything about how a team makes decisions. Instead of "zero downtime, always," you have a concrete, spendable budget.

python
def calculate_error_budget_remaining(total_requests: int, failed_requests: int, slo: float) -> dict:
    # Round instead of truncating: 1 - slo is inexact in floating point
    # (e.g. 1 - 0.9999), and int() could silently drop a whole request.
    allowed_failures = round(total_requests * (1 - slo))
    if allowed_failures:
        budget_consumed_pct = failed_requests / allowed_failures * 100
    else:
        # Nothing is allowed to fail (100% SLO or no traffic): any failure spends the whole budget
        budget_consumed_pct = 100.0 if failed_requests else 0.0

    return {
        "allowed_failures": allowed_failures,
        "actual_failures": failed_requests,
        "budget_remaining_pct": max(0, 100 - budget_consumed_pct),
        "budget_exhausted": budget_consumed_pct >= 100
    }

# Example: SLO of 99.9%, 10M requests this month, 7,500 failures so far
result = calculate_error_budget_remaining(10_000_000, 7_500, 0.999)
# {'allowed_failures': 10000, 'actual_failures': 7500,
#  'budget_remaining_pct': 25.0, 'budget_exhausted': False}

Why This Actually Changes Behavior

Without an error budget, every outage feels equally urgent and every "should we ship this risky change" decision is argued from scratch. With an error budget, the decision becomes mechanical:

Budget remaining is healthy (>50%) → ship new features, take calculated risks, deploy faster.

Budget is getting consumed fast → slow down, prioritize reliability work over new features, add more testing before risky deploys.

Budget is exhausted → freeze non-critical changes, all hands on reliability until you're back in budget.

yaml
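# This becomes an actual policy, not a vibe-based argument in Slack
# Example: Prometheus-style alert that signals a deploy freeze when less than 10% of the error budget remains
# (assumes error_budget_remaining_pct is already exported, e.g. by a recording rule)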
- alert: ErrorBudgetExhausted
  expr: error_budget_remaining_pct < 10
  labels:
    severity: critical
  annotations:
    summary: "Error budget below 10% — freeze non-critical deploys until reliability improves"
    action: "Notify #platform-team, block non-hotfix merges to main"
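
The three tiers described above could also be written down as a small decision function. This is only a sketch: the function name and return labels are hypothetical, and because the article does not define "consumed fast" numerically, the middle tier is approximated here as "50% or less of the budget remaining", which is an assumption.

```python
def deploy_policy(budget_remaining_pct: float, healthy_threshold_pct: float = 50.0) -> str:
    """Map the remaining error budget (in percent) to one of three deploy stances."""
    if budget_remaining_pct <= 0:
        # Budget exhausted: freeze non-critical changes
        return "freeze"
    if budget_remaining_pct > healthy_threshold_pct:
        # Healthy budget: ship features, take calculated risks
        return "ship"
    # Assumed middle tier: prioritize reliability work, add testing
    return "slow down"


print(deploy_policy(75.0))  # ship
print(deploy_policy(25.0))  # slow down
print(deploy_policy(0.0))   # freeze
```

Keeping the threshold as a parameter makes the policy something a team agrees on once and then reads off, rather than renegotiating per deploy.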

Picking the Right SLO — The Part Teams Get Wrong

The instinct is to aim for 99.99% or 99.999% everywhere because it sounds impressive. Don't. Two real costs scale with stricter SLOs:

Engineering cost — going from 99.9% to 99.99% often costs disproportionately more engineering effort than going from 99% to 99.9%, because you're now defending against rarer, harder-to-reproduce failure modes.

Reduced shipping velocity — a tighter SLO means a smaller error budget, which means less room to take risks on new features.

99.9% SLO   → about 43 minutes of downtime allowed per 30-day month
99.99% SLO  → about 4.3 minutes of downtime allowed per 30-day month
99.999% SLO → about 26 seconds of downtime allowed per 30-day month
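
The downtime figures follow directly from the window length. A minimal sketch of the arithmetic, assuming a 30-day window and an SLO expressed in percent (the function name is hypothetical):

```python
from datetime import timedelta


def allowed_downtime(slo_pct: float, window_days: int = 30) -> timedelta:
    """Downtime permitted by an availability SLO over the given window."""
    window = timedelta(days=window_days)
    # The error budget is the fraction of the window not covered by the SLO
    return window * (1 - slo_pct / 100)


for slo in (99.9, 99.99, 99.999):
    seconds = allowed_downtime(slo).total_seconds()
    print(f"{slo}% -> {seconds:.0f} seconds ({seconds / 60:.1f} minutes)")
```

Each extra nine divides the allowance by ten: roughly 2592 seconds, then 259, then 26.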

Pick the SLO based on what users actually need, not what sounds aspirational. An internal admin dashboard used twice a day doesn't need the same SLO as a payment processing API — and pretending it does just burns engineering effort that should go elsewhere.

Putting It Together

SLI:          "98.7% of checkout requests succeeded in the last hour" (a measurement)
SLO:          "99.9% of checkout requests will succeed over 30 days" (the target)
Error budget: "At 10,000,000 checkout requests a month, we can afford 10,000 failed checkouts" (the allowance)

The SLI tells you where you stand right now. The SLO tells you what you're aiming for. The error budget turns the distance between the SLO and 100% into an actual number your team can spend and plan around, and that's the piece that turns reliability from an abstract value into an operational decision-making tool.

Build a tool to track this automatically: Build an AI SLO Budget Tracker
