
Build an AI-Powered SLO Budget Tracker with Python + Claude (2026)

Track your error budget automatically and get AI-generated burn rate alerts and incident summaries. Build a real SLO monitoring tool with Python, Prometheus, and the Claude API.

DevOpsBoys · May 24, 2026 · 8 min read

Error budgets are the heart of SRE practice — but most teams track them manually or with static dashboards. This project builds an AI-powered SLO tracker that monitors your error budget burn rate, detects anomalies, and generates natural language incident summaries using Claude.


What We're Building

A Python service that:

  1. Queries Prometheus for SLO metrics
  2. Calculates error budget burn rate
  3. Detects fast burn (budget exhaustion risk)
  4. Uses the Claude API to generate plain-English summaries
  5. Sends Slack alerts when burn rate is critical

Real output example:

🚨 SLO ALERT: api-gateway (99.9% availability)

Error budget: 12.4 of 43.2 minutes/month remaining (28.8% left)
Current burn rate: 4.2x (burning about 4x faster than allowed)
At this rate: budget exhausted in ~49 hours

AI Summary: "The api-gateway service is experiencing elevated error rates
primarily on the /checkout endpoint (HTTP 503s, ~2.1% error rate). This
started at 14:32 UTC — correlates with the deployment of v2.4.1. The fast
burn is driven by a 6x spike in 5xx errors on POST /api/orders. Recommend
immediate rollback or traffic reduction to affected endpoint."

Prerequisites

bash
pip install anthropic prometheus-client requests python-dotenv slack-sdk
bash
# .env
ANTHROPIC_API_KEY=sk-ant-...
PROMETHEUS_URL=http://your-prometheus:9090
SLACK_BOT_TOKEN=xoxb-...
SLACK_CHANNEL=#alerts

Project Structure

slo-budget-tracker/
├── main.py              # Main tracker loop
├── slo_config.py        # SLO definitions
├── prometheus.py        # Prometheus query client
├── budget_calc.py       # Error budget calculations
├── ai_summarizer.py     # Claude AI summary generation
├── slack_notifier.py    # Slack alerts
└── .env

Step 1: Define Your SLOs

python
# slo_config.py
 
SLO_DEFINITIONS = [
    {
        "name": "api-gateway-availability",
        "service": "api-gateway",
        "type": "availability",
        "target": 0.999,          # 99.9%
        "window_days": 30,
        "good_query": 'sum(rate(http_requests_total{job="api-gateway",code!~"5.."}[5m]))',
        "total_query": 'sum(rate(http_requests_total{job="api-gateway"}[5m]))',
    },
    {
        "name": "api-gateway-latency",
        "service": "api-gateway",
        "type": "latency",
        "target": 0.95,           # 95% of requests < 200ms
        "window_days": 30,
        "good_query": 'sum(rate(http_request_duration_seconds_bucket{job="api-gateway",le="0.2"}[5m]))',
        "total_query": 'sum(rate(http_request_duration_seconds_count{job="api-gateway"}[5m]))',
    },
    {
        "name": "checkout-availability",
        "service": "checkout",
        "type": "availability",
        "target": 0.9995,         # 99.95%
        "window_days": 30,
        "good_query": 'sum(rate(http_requests_total{job="checkout",code!~"5.."}[5m]))',
        "total_query": 'sum(rate(http_requests_total{job="checkout"}[5m]))',
    },
]
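
To make these targets concrete, here is a minimal sketch that converts a target and a window into an error budget in minutes. The helper name `error_budget_minutes` is an assumption made for illustration; the arithmetic is the same one Step 3 uses.

```python
def error_budget_minutes(target: float, window_days: int) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - target)


# Targets copied from SLO_DEFINITIONS, all over a 30-day window.
for name, target in [
    ("api-gateway-availability", 0.999),
    ("api-gateway-latency", 0.95),
    ("checkout-availability", 0.9995),
]:
    print(f"{name}: {error_budget_minutes(target, 30):.1f} min")
# api-gateway-availability: 43.2 min
# api-gateway-latency: 2160.0 min
# checkout-availability: 21.6 min
```

Moving from 99.9% to 99.95% halves the budget, which is why the checkout SLO will trip alerts on much smaller incidents.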

Step 2: Prometheus Client

python
# prometheus.py
import requests
from datetime import datetime, timedelta
 
 
class PrometheusClient:
    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")
 
    def query(self, promql: str) -> float:
        """Execute instant query, return scalar value."""
        response = requests.get(
            f"{self.base_url}/api/v1/query",
            params={"query": promql},
            timeout=10,
        )
        response.raise_for_status()
        data = response.json()
 
        if data["status"] != "success":
            raise ValueError(f"Prometheus query failed: {data}")
 
        results = data["data"]["result"]
        if not results:
            return 0.0
 
        return float(results[0]["value"][1])
 
    def query_range(self, promql: str, hours: int = 1) -> list[dict]:
        """Query range for trend data."""
        # Use an aware datetime: .timestamp() treats the naive utcnow() value as local time.
        end = datetime.now().astimezone()
        start = end - timedelta(hours=hours)
 
        response = requests.get(
            f"{self.base_url}/api/v1/query_range",
            params={
                "query": promql,
                "start": start.timestamp(),
                "end": end.timestamp(),
                "step": "60",   # 1-minute resolution
            },
            timeout=10,
        )
        response.raise_for_status()
        data = response.json()
 
        if data["data"]["result"]:
            return data["data"]["result"][0]["values"]
        return []
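
One thing to watch in `query`: it returns 0.0 when Prometheus matches no series, so a missing metric looks like zero traffic, and `calculate_budget` then reports a perfect SLI. A possible variant, sketched below, returns `None` for "no data" so the caller can skip the SLO instead. The function name `parse_instant_value` and the sample payloads are assumptions for illustration; the response shape follows the Prometheus HTTP API.

```python
from typing import Optional


def parse_instant_value(payload: dict) -> Optional[float]:
    """Return the first sample value, or None when the query matched no series."""
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload}")
    results = payload["data"]["result"]
    if not results:
        return None  # no data is not the same as a measured zero
    # Instant vectors carry [unix_timestamp, "value-as-string"] pairs.
    return float(results[0]["value"][1])


# Made-up payloads in the shape of an instant-query response.
with_data = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1779632000.0, "12.5"]}],
    },
}
empty = {"status": "success", "data": {"resultType": "vector", "result": []}}

print(parse_instant_value(with_data))  # 12.5
print(parse_instant_value(empty))      # None
```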

Step 3: Error Budget Calculator

python
# budget_calc.py
from dataclasses import dataclass
 
 
@dataclass
class BudgetStatus:
    slo_name: str
    service: str
    target: float
    current_sli: float          # Current good ratio
    error_budget_minutes: float  # Total allowed error minutes/month
    consumed_minutes: float      # Minutes consumed so far
    remaining_minutes: float     # Minutes remaining
    remaining_pct: float         # % of budget remaining
    burn_rate_1h: float          # Current 1h burn rate (multiple of allowed)
    burn_rate_6h: float          # 6h burn rate
    is_fast_burn: bool           # True if burning too fast
    alert_level: str             # "ok", "warning", "critical"
 
 
def calculate_budget(slo: dict, good_rate: float, total_rate: float,
                     good_rate_1h: float, total_rate_1h: float,
                     good_rate_6h: float, total_rate_6h: float) -> BudgetStatus:
 
    # Current SLI (good / total)
    current_sli = good_rate / total_rate if total_rate > 0 else 1.0
 
    # Total error budget for the window (in minutes)
    window_minutes = slo["window_days"] * 24 * 60
    allowed_error_pct = 1.0 - slo["target"]
    error_budget_minutes = window_minutes * allowed_error_pct
 
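    # How much budget is consumed, projected from the current SLI.
    # Simplification: treats the current 5m error rate as if it had held for
    # the whole window, so remaining_minutes can go negative during a spike.
    # In production, measure the SLI over the full window instead.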
    current_error_rate = 1.0 - current_sli
    consumed_pct = current_error_rate / allowed_error_pct if allowed_error_pct > 0 else 0
    consumed_minutes = consumed_pct * error_budget_minutes
    remaining_minutes = error_budget_minutes - consumed_minutes
    remaining_pct = (remaining_minutes / error_budget_minutes * 100) if error_budget_minutes > 0 else 100
 
    # Burn rate: how fast are we consuming budget vs allowed rate?
    # Burn rate of 1.0 = consuming exactly at the allowed rate (will exhaust at end of window)
    # Burn rate of 2.0 = consuming 2x faster (will exhaust in half the window)
    current_error_rate_1h = 1.0 - (good_rate_1h / total_rate_1h) if total_rate_1h > 0 else 0
    current_error_rate_6h = 1.0 - (good_rate_6h / total_rate_6h) if total_rate_6h > 0 else 0
 
    burn_rate_1h = current_error_rate_1h / allowed_error_pct if allowed_error_pct > 0 else 0
    burn_rate_6h = current_error_rate_6h / allowed_error_pct if allowed_error_pct > 0 else 0
 
    # Google SRE Workbook multiwindow thresholds (30-day window):
    # 14.4x over 1h, confirmed by the 5m window
    # 6x over 6h, confirmed by the 30m window
    # Simplified here to the 1h and 6h windows only.
    is_fast_burn = burn_rate_1h > 14.4 or (burn_rate_1h > 6 and burn_rate_6h > 6)
 
    if burn_rate_1h > 14.4:
        alert_level = "critical"
    elif burn_rate_1h > 6 or remaining_pct < 20:
        alert_level = "warning"
    else:
        alert_level = "ok"
 
    return BudgetStatus(
        slo_name=slo["name"],
        service=slo["service"],
        target=slo["target"],
        current_sli=current_sli,
        error_budget_minutes=error_budget_minutes,
        consumed_minutes=consumed_minutes,
        remaining_minutes=remaining_minutes,
        remaining_pct=remaining_pct,
        burn_rate_1h=burn_rate_1h,
        burn_rate_6h=burn_rate_6h,
        is_fast_burn=is_fast_burn,
        alert_level=alert_level,
    )
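
For comparison, here is a sketch of the full multiwindow check, where a long window fires the alert and a short window confirms the problem is still happening. It assumes you also query a 30m window, which this tracker does not do; `classify_burn` and `burn_rate` are hypothetical helpers, and mapping the two tiers to "critical" and "warning" follows this article's naming rather than the Workbook, which treats both as pages.

```python
def burn_rate(good: float, total: float, target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    allowed = 1.0 - target
    if total <= 0 or allowed <= 0:
        return 0.0
    return (1.0 - good / total) / allowed


def classify_burn(burn_5m: float, burn_30m: float,
                  burn_1h: float, burn_6h: float) -> str:
    """Long window fires the alert; the short window confirms it is ongoing."""
    if burn_1h > 14.4 and burn_5m > 14.4:
        return "critical"   # 2% of a 30-day budget spent in 1 hour
    if burn_6h > 6 and burn_30m > 6:
        return "warning"    # 5% of a 30-day budget spent in 6 hours
    return "ok"


# 15 errors per 1000 requests against a 99.9% target is roughly a 15x burn.
print(round(burn_rate(good=985, total=1000, target=0.999), 1))  # 15.0

print(classify_burn(burn_5m=16, burn_30m=15, burn_1h=15, burn_6h=4))   # critical
# Same 1h burn, but the last 5 minutes have recovered, so nothing fires.
print(classify_burn(burn_5m=2, burn_30m=15, burn_1h=15, burn_6h=4))    # ok
```

The short-window condition is what stops an alert from lingering for an hour after a brief spike has already ended.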

Step 4: AI Summarizer with Claude

python
# ai_summarizer.py
import anthropic
from budget_calc import BudgetStatus
 
 
client = anthropic.Anthropic()
 
 
def generate_slo_summary(status: BudgetStatus, recent_errors: list) -> str:
    """Generate an AI summary of the SLO breach situation."""
 
    error_sample = recent_errors[:20] if recent_errors else []
    error_text = "\n".join(error_sample) if error_sample else "No recent error log samples available."
 
    prompt = f"""You are an SRE analyst. Analyze this SLO status and provide a concise incident summary.
 
SLO: {status.slo_name} ({status.service})
Target: {status.target * 100:.3f}%
Current SLI: {status.current_sli * 100:.3f}%
Error budget remaining: {status.remaining_minutes:.1f} minutes ({status.remaining_pct:.1f}%)
Burn rate (1h): {status.burn_rate_1h:.1f}x normal
Burn rate (6h): {status.burn_rate_6h:.1f}x normal
Alert level: {status.alert_level}
 
Recent error samples:
{error_text}
 
Write a 3-4 sentence incident summary that:
1. States what's happening and how severe it is
2. Identifies the likely cause based on error patterns (if visible)
3. States when the budget will be exhausted at current rate
4. Gives a specific recommended action
 
Be direct and technical. No filler phrases."""
 
    message = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}]
    )
 
    return message.content[0].text
 
 
def generate_daily_report(all_statuses: list[BudgetStatus]) -> str:
    """Generate a daily SLO health report for all services."""
 
    services_summary = "\n".join([
        f"- {s.slo_name}: {s.remaining_pct:.1f}% budget remaining, burn rate {s.burn_rate_1h:.1f}x, status: {s.alert_level}"
        for s in all_statuses
    ])
 
    critical = [s for s in all_statuses if s.alert_level == "critical"]
    warning = [s for s in all_statuses if s.alert_level == "warning"]
 
    prompt = f"""You are an SRE lead writing a daily reliability report.
 
SLO Status Summary ({len(all_statuses)} SLOs tracked):
{services_summary}
 
Critical: {len(critical)} SLOs
Warning: {len(warning)} SLOs
Healthy: {len(all_statuses) - len(critical) - len(warning)} SLOs
 
Write a brief daily reliability report (5-8 sentences) covering:
1. Overall reliability health
2. Any critical concerns requiring immediate attention
3. Trends or patterns worth watching
4. Recommended focus areas for today
 
Be concise and actionable."""
 
    message = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=400,
        messages=[{"role": "user", "content": prompt}]
    )
 
    return message.content[0].text
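
The summary can only name a likely cause if `recent_errors` contains something to reason about. One way to fill it, sketched here under assumptions, is to collapse raw log lines into counted patterns so the 20-line cap covers more ground. The helper `condense_errors` and the log line format are hypothetical; adapt both to whatever your logging stack emits.

```python
import re
from collections import Counter


def condense_errors(log_lines: list[str], limit: int = 20) -> list[str]:
    """Collapse raw log lines into counted patterns, most frequent first."""
    counts = Counter()
    for line in log_lines:
        # Mask long digit runs (request ids and similar) so near-identical
        # lines group together; short numbers such as status codes survive.
        pattern = re.sub(r"\d{4,}", "<id>", line.strip())
        if pattern:
            counts[pattern] += 1
    return [f"{n}x {pattern}" for pattern, n in counts.most_common(limit)]


# Hypothetical log lines, for illustration only.
sample = [
    "POST /api/orders 503 upstream timeout req=184422",
    "POST /api/orders 503 upstream timeout req=184423",
    "GET /health 500 db pool exhausted req=184500",
]
for entry in condense_errors(sample):
    print(entry)
# 2x POST /api/orders 503 upstream timeout req=<id>
# 1x GET /health 500 db pool exhausted req=<id>
```

The returned list of strings can be passed directly as `recent_errors` to `generate_slo_summary`.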

Step 5: Slack Notifier

python
# slack_notifier.py
from slack_sdk import WebClient
from budget_calc import BudgetStatus
 
 
def send_alert(token: str, channel: str, status: BudgetStatus, ai_summary: str):
    client = WebClient(token=token)
 
    # Color based on alert level
    color = {"ok": "#36a64f", "warning": "#ffa500", "critical": "#ff0000"}[status.alert_level]
    emoji = {"ok": "✅", "warning": "⚠️", "critical": "🚨"}[status.alert_level]
 
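    # Wall-clock time until the budget runs out. remaining_minutes is budget
    # (minutes of full outage), spent at burn_rate_1h * (1 - target) budget
    # minutes per wall-clock minute.
    spend_per_minute = status.burn_rate_1h * (1.0 - status.target)
    hours_until_exhausted = (
        max(status.remaining_minutes, 0.0) / spend_per_minute / 60
        if spend_per_minute > 0 else float("inf")
    )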
 
    client.chat_postMessage(
        channel=channel,
        text=f"{emoji} SLO Alert: {status.slo_name}",
        attachments=[
            {
                "color": color,
                "title": f"{emoji} SLO {status.alert_level.upper()}: {status.slo_name}",
                "fields": [
                    {"title": "Service", "value": status.service, "short": True},
                    {"title": "SLO Target", "value": f"{status.target * 100:.3f}%", "short": True},
                    {"title": "Current SLI", "value": f"{status.current_sli * 100:.3f}%", "short": True},
                    {"title": "Burn Rate (1h)", "value": f"{status.burn_rate_1h:.1f}x", "short": True},
                    {
                        "title": "Budget Remaining",
                        "value": f"{status.remaining_minutes:.1f} min ({status.remaining_pct:.1f}%)",
                        "short": True,
                    },
                    {
                        "title": "Exhausted In",
                        "value": f"~{hours_until_exhausted:.1f}h" if hours_until_exhausted < 168 else "Safe",
                        "short": True,
                    },
                ],
                "text": f"*AI Analysis:*\n{ai_summary}",
                "footer": "SLO Budget Tracker",
            }
        ],
    )

Step 6: Main Loop

python
# main.py
import os
import time
import logging
from dotenv import load_dotenv
from prometheus import PrometheusClient
from slo_config import SLO_DEFINITIONS
from budget_calc import calculate_budget
from ai_summarizer import generate_slo_summary, generate_daily_report
from slack_notifier import send_alert
 
load_dotenv()
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
 
prom = PrometheusClient(os.environ["PROMETHEUS_URL"])
alerted_slos = set()   # Prevent alert spam
 
 
def check_slos():
    statuses = []
 
    for slo in SLO_DEFINITIONS:
        try:
            # Current rates
            good = prom.query(slo["good_query"])
            total = prom.query(slo["total_query"])
 
            # 1h rates
            good_1h = prom.query(slo["good_query"].replace("[5m]", "[1h]"))
            total_1h = prom.query(slo["total_query"].replace("[5m]", "[1h]"))
 
            # 6h rates
            good_6h = prom.query(slo["good_query"].replace("[5m]", "[6h]"))
            total_6h = prom.query(slo["total_query"].replace("[5m]", "[6h]"))
 
            status = calculate_budget(slo, good, total, good_1h, total_1h, good_6h, total_6h)
            statuses.append(status)
 
            logging.info(
                f"{slo['name']}: SLI={status.current_sli:.4f} "
                f"burn={status.burn_rate_1h:.2f}x budget={status.remaining_pct:.1f}% "
                f"status={status.alert_level}"
            )
 
            # Alert on critical/warning (once per incident)
            alert_key = f"{slo['name']}-{status.alert_level}"
            if status.alert_level in ("critical", "warning") and alert_key not in alerted_slos:
                summary = generate_slo_summary(status, recent_errors=[])
                send_alert(
                    os.environ["SLACK_BOT_TOKEN"],
                    os.environ["SLACK_CHANNEL"],
                    status,
                    summary,
                )
                alerted_slos.add(alert_key)
                logging.warning(f"Alert sent for {slo['name']}")
 
            # Clear alert state when recovered
            if status.alert_level == "ok":
                alerted_slos.discard(f"{slo['name']}-critical")
                alerted_slos.discard(f"{slo['name']}-warning")
 
        except Exception as e:
            logging.error(f"Error checking {slo['name']}: {e}")
 
    return statuses
 
 
def main():
    logging.info("SLO Budget Tracker started")
    check_count = 0
 
    while True:
        statuses = check_slos()
        check_count += 1
 
        # Daily report every 288 checks (24h at 5min intervals)
        if check_count % 288 == 0 and statuses:
            report = generate_daily_report(statuses)
            logging.info(f"Daily report:\n{report}")
            # Optionally send to Slack as daily digest
 
        time.sleep(300)   # Check every 5 minutes
 
 
if __name__ == "__main__":
    main()
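
The `.replace("[5m]", "[1h]")` calls in `check_slos` quietly do nothing if a query uses a different window, which would make all three burn rates identical. A slightly stricter helper, sketched below, rewrites any simple range selector and fails loudly when none is found. The name `with_window` is an assumption, and the pattern only handles single-unit durations such as `5m` or `1h`, not compound ones like `1h30m`.

```python
import re

# Matches simple range selectors such as [5m], [1h], [30d].
_RANGE = re.compile(r"\[\d+[smhdwy]\]")


def with_window(promql: str, window: str) -> str:
    """Return promql with every range selector replaced by [window]."""
    rewritten, count = _RANGE.subn(f"[{window}]", promql)
    if count == 0:
        raise ValueError(f"no range selector found in: {promql}")
    return rewritten


query = 'sum(rate(http_requests_total{job="api-gateway",code!~"5.."}[5m]))'
print(with_window(query, "1h"))
# sum(rate(http_requests_total{job="api-gateway",code!~"5.."}[1h]))
```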

Running It

bash
# Local
python main.py
 
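# Docker (assumes you add a Dockerfile; none is shown in this article)
docker build -t slo-tracker .
docker run -d \
  --env-file .env \
  --name slo-tracker \
  slo-tracker
 
# Kubernetes (assumes you write k8s/cronjob.yaml yourself). main() loops
# forever, so a CronJob should call check_slos() once per run instead.
kubectl apply -f k8s/cronjob.yaml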

Sample Output

2026-05-24 14:32:01 INFO  api-gateway-availability: SLI=0.9961 burn=4.21x budget=28.8% status=warning
2026-05-24 14:32:02 INFO  api-gateway-latency: SLI=0.9712 burn=1.02x budget=81.3% status=ok
2026-05-24 14:32:03 INFO  checkout-availability: SLI=0.9991 burn=0.98x budget=92.1% status=ok
2026-05-24 14:32:03 WARNING Alert sent for api-gateway-availability

Extensions

  • Grafana dashboard — expose metrics as Prometheus metrics, visualize in Grafana
  • Historical trending — store budget snapshots in PostgreSQL, predict future exhaustion
  • Multi-window SLOs — track 7-day and 30-day windows simultaneously
  • Auto-rollback trigger — when burn rate > 14.4x, trigger ArgoCD rollback automatically
  • PagerDuty integration — escalate critical burns to on-call rotation


Affiliate note: This project uses the Anthropic Claude API (claude-haiku-4-5-20251001, chosen for cost-efficient summaries; check Anthropic's pricing page for current per-token rates). For production deployments, Grafana Cloud has a generous free tier with built-in SLO tracking dashboards.
