Build an AI-Powered SLO Budget Tracker with Python + Claude (2026)
Track your error budget automatically and get AI-generated burn rate alerts and incident summaries. Build a real SLO monitoring tool with Python, Prometheus, and the Claude API.
Error budgets are the heart of SRE practice — but most teams track them manually or with static dashboards. This project builds an AI-powered SLO tracker that monitors your error budget burn rate, detects anomalies, and generates natural language incident summaries using Claude.
What We're Building
A Python service that:
- Queries Prometheus for SLO metrics
- Calculates error budget burn rate
- Detects fast burn (budget exhaustion risk)
- Uses Claude API to generate plain-English summaries
- Sends Slack alerts when burn rate is critical
Real output example:
🚨 SLO ALERT: api-gateway (99.9% availability)
Error budget: 12.4 of 43.2 minutes/month remaining (28.8% left)
Current burn rate: 4.2x (burning 4x faster than allowed)
At this rate: budget exhausted in ~49 hours
AI Summary: "The api-gateway service is experiencing elevated error rates
primarily on the /checkout endpoint (HTTP 503s, ~2.1% error rate). This
started at 14:32 UTC — correlates with the deployment of v2.4.1. The fast
burn is driven by a 6x spike in 5xx errors on POST /api/orders. Recommend
immediate rollback or traffic reduction to affected endpoint."
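The budget figures in an alert like this follow from simple arithmetic on the SLO target and window. A minimal sketch, assuming a 30-day window and the 99.9% target from the example; the function names are illustrative and not part of the project files:

```python
def total_budget_minutes(target: float, window_days: int = 30) -> float:
    """Total allowed 'bad' minutes in the window for a given SLO target."""
    return window_days * 24 * 60 * (1.0 - target)


def remaining_budget_minutes(target: float, remaining_pct: float,
                             window_days: int = 30) -> float:
    """Minutes of budget left, given the percentage of budget remaining."""
    return total_budget_minutes(target, window_days) * remaining_pct / 100.0


if __name__ == "__main__":
    # Total budget for a 99.9% target over 30 days
    print(f"{total_budget_minutes(0.999):.1f} minutes total")
    # Budget left when 28.8% of it remains
    print(f"{remaining_budget_minutes(0.999, 28.8):.1f} minutes remaining")
```

For a 99.9% target the total budget is 43.2 minutes per 30 days, so 28.8% remaining corresponds to roughly 12.4 minutes.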
Prerequisites
pip install anthropic prometheus-client requests python-dotenv slack-sdk

# .env
ANTHROPIC_API_KEY=sk-ant-...
PROMETHEUS_URL=http://your-prometheus:9090
SLACK_BOT_TOKEN=xoxb-...
SLACK_CHANNEL=#alerts

Project Structure
slo-budget-tracker/
├── main.py # Main tracker loop
├── slo_config.py # SLO definitions
├── prometheus.py # Prometheus query client
├── budget_calc.py # Error budget calculations
├── ai_summarizer.py # Claude AI summary generation
├── slack_notifier.py # Slack alerts
└── .env
Step 1: Define Your SLOs
# slo_config.py
SLO_DEFINITIONS = [
    {
        "name": "api-gateway-availability",
        "service": "api-gateway",
        "type": "availability",
        "target": 0.999,  # 99.9%
        "window_days": 30,
        "good_query": 'sum(rate(http_requests_total{job="api-gateway",code!~"5.."}[5m]))',
        "total_query": 'sum(rate(http_requests_total{job="api-gateway"}[5m]))',
    },
    {
        "name": "api-gateway-latency",
        "service": "api-gateway",
        "type": "latency",
        "target": 0.95,  # 95% of requests < 200ms
        "window_days": 30,
        "good_query": 'sum(rate(http_request_duration_seconds_bucket{job="api-gateway",le="0.2"}[5m]))',
        "total_query": 'sum(rate(http_request_duration_seconds_count{job="api-gateway"}[5m]))',
    },
    {
        "name": "checkout-availability",
        "service": "checkout",
        "type": "availability",
        "target": 0.9995,  # 99.95%
        "window_days": 30,
        "good_query": 'sum(rate(http_requests_total{job="checkout",code!~"5.."}[5m]))',
        "total_query": 'sum(rate(http_requests_total{job="checkout"}[5m]))',
    },
]

Step 2: Prometheus Client
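The client in this step depends on the shape of the JSON that Prometheus returns from /api/v1/query. A minimal sketch using a hand-written sample payload; the numbers are made up for illustration, and `extract_scalar` is a hypothetical helper name rather than part of the project:

```python
def extract_scalar(payload: dict) -> float:
    """Pull the first sample value out of an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error', payload)}")
    results = payload["data"]["result"]
    if not results:
        return 0.0  # no series matched; the tutorial's client also returns zero here
    # Each vector sample is [unix_timestamp, "value-as-string"].
    return float(results[0]["value"][1])


# Illustrative payload in the instant-vector format (values are invented).
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1779633121.0, "0.9958"]}],
    },
}

if __name__ == "__main__":
    print(extract_scalar(sample))
```

Note that the sample value arrives as a string, which is why the client converts it with `float()`.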
# prometheus.py
import requests
from datetime import datetime, timedelta, timezone


class PrometheusClient:
    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def query(self, promql: str) -> float:
        """Execute instant query, return scalar value."""
        response = requests.get(
            f"{self.base_url}/api/v1/query",
            params={"query": promql},
            timeout=10,
        )
        response.raise_for_status()
        data = response.json()
        if data["status"] != "success":
            raise ValueError(f"Prometheus query failed: {data}")
        results = data["data"]["result"]
        if not results:
            return 0.0
        return float(results[0]["value"][1])

    def query_range(self, promql: str, hours: int = 1) -> list[list]:
        """Query range for trend data; returns [timestamp, value] pairs."""
        # Use an aware UTC datetime: .timestamp() on a naive utcnow() value
        # is interpreted as local time and skews the range on non-UTC hosts.
        end = datetime.now(timezone.utc)
        start = end - timedelta(hours=hours)
        response = requests.get(
            f"{self.base_url}/api/v1/query_range",
            params={
                "query": promql,
                "start": start.timestamp(),
                "end": end.timestamp(),
                "step": "60",  # 1-minute resolution
            },
            timeout=10,
        )
        response.raise_for_status()
        data = response.json()
        if data["status"] != "success":
            raise ValueError(f"Prometheus range query failed: {data}")
        if data["data"]["result"]:
            return data["data"]["result"][0]["values"]
        return []

Step 3: Error Budget Calculator
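The core quantity in this step is the burn rate: the observed error rate divided by the error rate the SLO allows. A minimal sketch of that arithmetic in isolation, using the same thresholds the tutorial's calculator applies; the function names here are illustrative:

```python
def burn_rate(sli: float, target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed = 1.0 - target
    if allowed <= 0:
        return 0.0  # a 100% target leaves no budget to burn
    return (1.0 - sli) / allowed


def is_fast_burn(rate_1h: float, rate_6h: float) -> bool:
    """Same rule as the calculator: 1h above 14.4x, or both windows above 6x."""
    return rate_1h > 14.4 or (rate_1h > 6 and rate_6h > 6)


if __name__ == "__main__":
    rate = burn_rate(sli=0.9958, target=0.999)
    print(f"burn rate: {rate:.1f}x, fast burn: {is_fast_burn(rate, rate)}")
```

With a 99.9% target, an observed SLI of 99.58% works out to a burn rate of about 4.2x. At exactly 1.0x the budget lasts the full window.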
# budget_calc.py
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    slo_name: str
    service: str
    target: float
    current_sli: float           # Current good ratio
    error_budget_minutes: float  # Total allowed error minutes in the window
    consumed_minutes: float      # Minutes consumed so far
    remaining_minutes: float     # Minutes remaining
    remaining_pct: float         # % of budget remaining
    burn_rate_1h: float          # Current 1h burn rate (multiple of allowed)
    burn_rate_6h: float          # 6h burn rate
    is_fast_burn: bool           # True if burning too fast
    alert_level: str             # "ok", "warning", "critical"


def calculate_budget(slo: dict, good_rate: float, total_rate: float,
                     good_rate_1h: float, total_rate_1h: float,
                     good_rate_6h: float, total_rate_6h: float) -> BudgetStatus:
    # Current SLI (good / total)
    current_sli = good_rate / total_rate if total_rate > 0 else 1.0

    # Total error budget for the window (in minutes)
    window_minutes = slo["window_days"] * 24 * 60
    allowed_error_pct = 1.0 - slo["target"]
    error_budget_minutes = window_minutes * allowed_error_pct

    # How much budget is consumed based on current SLI.
    # Simplified: this assumes the current (5m) error rate has held for the
    # whole window, so a short spike can push remaining budget below zero.
    # In production, measure consumption from actual historical data over
    # the full window instead.
    current_error_rate = 1.0 - current_sli
    consumed_pct = current_error_rate / allowed_error_pct if allowed_error_pct > 0 else 0
    consumed_minutes = consumed_pct * error_budget_minutes
    remaining_minutes = error_budget_minutes - consumed_minutes
    remaining_pct = (remaining_minutes / error_budget_minutes * 100) if error_budget_minutes > 0 else 100

    # Burn rate: how fast are we consuming budget vs the allowed rate?
    # Burn rate of 1.0 = consuming exactly at the allowed rate (will exhaust at end of window)
    # Burn rate of 2.0 = consuming 2x faster (will exhaust in half the window)
    current_error_rate_1h = 1.0 - (good_rate_1h / total_rate_1h) if total_rate_1h > 0 else 0
    current_error_rate_6h = 1.0 - (good_rate_6h / total_rate_6h) if total_rate_6h > 0 else 0
    burn_rate_1h = current_error_rate_1h / allowed_error_pct if allowed_error_pct > 0 else 0
    burn_rate_6h = current_error_rate_6h / allowed_error_pct if allowed_error_pct > 0 else 0

    # Thresholds adapted from the Google SRE Workbook's multiwindow alerts:
    #   Page:   burn > 14.4x over 1h (the workbook also confirms with a 5m window)
    #   Ticket: burn > 6x over 6h (the workbook also confirms with a 30m window)
    # This simplified version checks only the 1h and 6h rates.
    is_fast_burn = burn_rate_1h > 14.4 or (burn_rate_1h > 6 and burn_rate_6h > 6)

    if burn_rate_1h > 14.4:
        alert_level = "critical"
    elif burn_rate_1h > 6 or remaining_pct < 20:
        alert_level = "warning"
    else:
        alert_level = "ok"

    return BudgetStatus(
        slo_name=slo["name"],
        service=slo["service"],
        target=slo["target"],
        current_sli=current_sli,
        error_budget_minutes=error_budget_minutes,
        consumed_minutes=consumed_minutes,
        remaining_minutes=remaining_minutes,
        remaining_pct=remaining_pct,
        burn_rate_1h=burn_rate_1h,
        burn_rate_6h=burn_rate_6h,
        is_fast_burn=is_fast_burn,
        alert_level=alert_level,
    )

Step 4: AI Summarizer with Claude
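The summarizer in this step makes a network call, and the main loop wraps each SLO check in a single try/except, so an exception from the summary call would also skip the Slack alert. One possible safeguard is a template-based fallback. This is a sketch under that assumption; `fallback_summary` and `summarize_with_fallback` are hypothetical helpers, not part of the tutorial's files:

```python
def fallback_summary(slo_name: str, alert_level: str, burn_rate_1h: float,
                     remaining_minutes: float, remaining_pct: float) -> str:
    """Template-based summary for when the model call fails."""
    return (
        f"{slo_name} is at alert level '{alert_level}'. "
        f"The 1h burn rate is {burn_rate_1h:.1f}x the allowed rate, with "
        f"{remaining_minutes:.1f} budget minutes ({remaining_pct:.1f}%) remaining. "
        "AI summary unavailable; check recent deploys and error logs manually."
    )


def summarize_with_fallback(generate, **fields) -> str:
    """Call `generate` (e.g. a wrapper around the Claude request); fall back on any error."""
    try:
        return generate(**fields)
    except Exception:
        # Deliberately broad: any failure should still produce an alert body.
        return fallback_summary(**fields)


if __name__ == "__main__":
    def broken_generate(**_fields):
        raise RuntimeError("API unreachable")  # simulate a failed request

    print(summarize_with_fallback(
        broken_generate,
        slo_name="api-gateway-availability",
        alert_level="warning",
        burn_rate_1h=4.2,
        remaining_minutes=12.4,
        remaining_pct=28.8,
    ))
```

The design choice here is that a degraded alert is better than no alert: the numbers still reach Slack even when the summary cannot be generated.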
# ai_summarizer.py
import anthropic
from budget_calc import BudgetStatus

client = anthropic.Anthropic()


def generate_slo_summary(status: BudgetStatus, recent_errors: list) -> str:
    """Generate an AI summary of the SLO breach situation."""
    error_sample = recent_errors[:20] if recent_errors else []
    error_text = "\n".join(error_sample) if error_sample else "No recent error log samples available."
    prompt = f"""You are an SRE analyst. Analyze this SLO status and provide a concise incident summary.

SLO: {status.slo_name} ({status.service})
Target: {status.target * 100:.3f}% availability
Current SLI: {status.current_sli * 100:.3f}%
Error budget remaining: {status.remaining_minutes:.1f} minutes ({status.remaining_pct:.1f}%)
Burn rate (1h): {status.burn_rate_1h:.1f}x normal
Burn rate (6h): {status.burn_rate_6h:.1f}x normal
Alert level: {status.alert_level}

Recent error samples:
{error_text}

Write a 3-4 sentence incident summary that:
1. States what's happening and how severe it is
2. Identifies the likely cause based on error patterns (if visible)
3. States when the budget will be exhausted at current rate
4. Gives a specific recommended action

Be direct and technical. No filler phrases."""
    message = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text


def generate_daily_report(all_statuses: list[BudgetStatus]) -> str:
    """Generate a daily SLO health report for all services."""
    services_summary = "\n".join([
        f"- {s.slo_name}: {s.remaining_pct:.1f}% budget remaining, burn rate {s.burn_rate_1h:.1f}x, status: {s.alert_level}"
        for s in all_statuses
    ])
    critical = [s for s in all_statuses if s.alert_level == "critical"]
    warning = [s for s in all_statuses if s.alert_level == "warning"]
    prompt = f"""You are an SRE lead writing a daily reliability report.

SLO Status Summary ({len(all_statuses)} SLOs tracked):
{services_summary}

Critical: {len(critical)} SLOs
Warning: {len(warning)} SLOs
Healthy: {len(all_statuses) - len(critical) - len(warning)} SLOs

Write a brief daily reliability report (5-8 sentences) covering:
1. Overall reliability health
2. Any critical concerns requiring immediate attention
3. Trends or patterns worth watching
4. Recommended focus areas for today

Be concise and actionable."""
    message = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=400,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text

Step 5: Slack Notifier
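The "Exhausted In" field in this step is an estimate of how long the remaining budget will last. One way to derive it, sketched under the assumption that the current burn rate stays constant: at a burn rate of 1.0 a full budget lasts exactly the SLO window, so the remaining share of the budget lasts that share of the window divided by the burn rate. The helper names below are illustrative:

```python
import math


def hours_until_exhausted(remaining_pct: float, burn_rate: float,
                          window_days: int = 30) -> float:
    """Hours until the budget hits zero if the current burn rate holds."""
    if burn_rate <= 0:
        return math.inf  # not burning budget at all
    if remaining_pct <= 0:
        return 0.0       # already exhausted
    window_hours = window_days * 24
    # At burn rate 1.0 a full budget lasts exactly the window.
    return (remaining_pct / 100.0) * window_hours / burn_rate


def format_eta(hours: float) -> str:
    """Render the estimate for a Slack field; a week or more is shown as 'Safe'."""
    if hours >= 168:
        return "Safe"
    return f"~{hours:.1f}h"


if __name__ == "__main__":
    eta = hours_until_exhausted(remaining_pct=28.8, burn_rate=4.2)
    print(format_eta(eta))
```

With 28.8% of a 30-day budget left and a 4.2x burn rate, this estimate comes to about 49.4 hours.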
# slack_notifier.py
from slack_sdk import WebClient
from budget_calc import BudgetStatus


def send_alert(token: str, channel: str, status: BudgetStatus, ai_summary: str):
    client = WebClient(token=token)

    # Color based on alert level
    color = {"ok": "#36a64f", "warning": "#ffa500", "critical": "#ff0000"}[status.alert_level]
    emoji = {"ok": "✅", "warning": "⚠️", "critical": "🚨"}[status.alert_level]

    # Time to exhaustion if the 1h burn rate holds. Remaining budget is in
    # "error minutes", and at burn rate B those are consumed at
    # B * (1 - target) error minutes per wall-clock minute.
    allowed_error = 1.0 - status.target
    hours_until_exhausted = (
        status.remaining_minutes / (status.burn_rate_1h * allowed_error) / 60
        if status.burn_rate_1h > 0 and allowed_error > 0 else float("inf")
    )

    client.chat_postMessage(
        channel=channel,
        text=f"{emoji} SLO Alert: {status.slo_name}",
        attachments=[
            {
                "color": color,
                "title": f"{emoji} SLO {status.alert_level.upper()}: {status.slo_name}",
                "fields": [
                    {"title": "Service", "value": status.service, "short": True},
                    {"title": "SLO Target", "value": f"{status.target * 100:.3f}%", "short": True},
                    {"title": "Current SLI", "value": f"{status.current_sli * 100:.3f}%", "short": True},
                    {"title": "Burn Rate (1h)", "value": f"{status.burn_rate_1h:.1f}x", "short": True},
                    {
                        "title": "Budget Remaining",
                        "value": f"{status.remaining_minutes:.1f} min ({status.remaining_pct:.1f}%)",
                        "short": True,
                    },
                    {
                        "title": "Exhausted In",
                        "value": f"~{hours_until_exhausted:.1f}h" if hours_until_exhausted < 168 else "Safe",
                        "short": True,
                    },
                ],
                "text": f"*AI Analysis:*\n{ai_summary}",
                "footer": "SLO Budget Tracker",
            }
        ],
    )

Step 6: Main Loop
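The main loop derives its 1h and 6h queries by swapping the `[5m]` range selector in each configured query. A plain string replacement silently does nothing if a query was written with a different range, so a stricter variant may be worth considering. This is a sketch only: `with_window` is a hypothetical helper, and the regex assumes simple single-unit ranges such as `[5m]` or `[1h]`, not compound durations or subqueries.

```python
import re

# Matches simple range selectors like [5m], [1h], [30s], [7d].
_RANGE = re.compile(r"\[\d+[smhdwy]\]")


def with_window(promql: str, window: str) -> str:
    """Return `promql` with every simple range selector replaced by `[window]`."""
    new_query, count = _RANGE.subn(f"[{window}]", promql)
    if count == 0:
        # Fail loudly instead of querying the wrong window without noticing.
        raise ValueError(f"no range selector found in query: {promql!r}")
    return new_query


if __name__ == "__main__":
    query = 'sum(rate(http_requests_total{job="api-gateway"}[5m]))'
    print(with_window(query, "1h"))
    print(with_window(query, "6h"))
```

Raising on a missing selector turns a misconfigured SLO definition into a visible error in the loop's log rather than a burn rate computed over the wrong window.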
# main.py
import os
import time
import logging
from dotenv import load_dotenv

# Load .env before importing project modules: ai_summarizer creates its
# Anthropic client at import time and reads ANTHROPIC_API_KEY then.
load_dotenv()

from prometheus import PrometheusClient
from slo_config import SLO_DEFINITIONS
from budget_calc import calculate_budget
from ai_summarizer import generate_slo_summary, generate_daily_report
from slack_notifier import send_alert

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
prom = PrometheusClient(os.environ["PROMETHEUS_URL"])
alerted_slos = set()  # Prevent alert spam


def check_slos():
    statuses = []
    for slo in SLO_DEFINITIONS:
        try:
            # Current rates
            good = prom.query(slo["good_query"])
            total = prom.query(slo["total_query"])
            # 1h rates
            good_1h = prom.query(slo["good_query"].replace("[5m]", "[1h]"))
            total_1h = prom.query(slo["total_query"].replace("[5m]", "[1h]"))
            # 6h rates
            good_6h = prom.query(slo["good_query"].replace("[5m]", "[6h]"))
            total_6h = prom.query(slo["total_query"].replace("[5m]", "[6h]"))

            status = calculate_budget(slo, good, total, good_1h, total_1h, good_6h, total_6h)
            statuses.append(status)
            logging.info(
                f"{slo['name']}: SLI={status.current_sli:.4f} "
                f"burn={status.burn_rate_1h:.2f}x budget={status.remaining_pct:.1f}% "
                f"status={status.alert_level}"
            )

            # Alert on critical/warning (once per incident)
            alert_key = f"{slo['name']}-{status.alert_level}"
            if status.alert_level in ("critical", "warning") and alert_key not in alerted_slos:
                summary = generate_slo_summary(status, recent_errors=[])
                send_alert(
                    os.environ["SLACK_BOT_TOKEN"],
                    os.environ["SLACK_CHANNEL"],
                    status,
                    summary,
                )
                alerted_slos.add(alert_key)
                logging.warning(f"Alert sent for {slo['name']}")

            # Clear alert state when recovered
            if status.alert_level == "ok":
                alerted_slos.discard(f"{slo['name']}-critical")
                alerted_slos.discard(f"{slo['name']}-warning")
        except Exception as e:
            logging.error(f"Error checking {slo['name']}: {e}")
    return statuses


def main():
    logging.info("SLO Budget Tracker started")
    check_count = 0
    while True:
        statuses = check_slos()
        check_count += 1
        # Daily report every 288 checks (24h at 5min intervals)
        if check_count % 288 == 0 and statuses:
            report = generate_daily_report(statuses)
            logging.info(f"Daily report:\n{report}")
            # Optionally send to Slack as daily digest
        time.sleep(300)  # Check every 5 minutes


if __name__ == "__main__":
    main()

Running It
# Local
python main.py

# Docker
docker build -t slo-tracker .
docker run -d \
  --env-file .env \
  --name slo-tracker \
  slo-tracker

# Kubernetes CronJob for scheduled checks
kubectl apply -f k8s/cronjob.yaml

Sample Output
2026-05-24 14:32:01 INFO api-gateway-availability: SLI=0.9961 burn=4.21x budget=28.8% status=warning
2026-05-24 14:32:02 INFO api-gateway-latency: SLI=0.9712 burn=1.02x budget=81.3% status=ok
2026-05-24 14:32:03 INFO checkout-availability: SLI=0.9991 burn=0.98x budget=92.1% status=ok
2026-05-24 14:32:03 WARNING Alert sent for api-gateway-availability
Extensions
- Grafana dashboard — expose metrics as Prometheus metrics, visualize in Grafana
- Historical trending — store budget snapshots in PostgreSQL, predict future exhaustion
- Multi-window SLOs — track 7-day and 30-day windows simultaneously
- Auto-rollback trigger — when burn rate > 14.4x, trigger ArgoCD rollback automatically
- PagerDuty integration — escalate critical burns to on-call rotation
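As a starting point for the historical trending idea, stored budget snapshots can be fitted with a straight line to estimate when the budget reaches zero. This is a minimal sketch assuming snapshots are already available as `(hours_since_start, remaining_pct)` pairs; the storage layer is left out, and `predict_exhaustion_hours` is a hypothetical name. A linear fit is a simplification that will not capture bursty burn patterns.

```python
from typing import Optional


def predict_exhaustion_hours(snapshots: list[tuple[float, float]]) -> Optional[float]:
    """Least-squares fit of remaining_pct over time.

    Returns hours from the last snapshot until the fitted line reaches 0%,
    or None if there is too little data or the budget is not declining.
    """
    n = len(snapshots)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in snapshots) / n
    mean_p = sum(p for _, p in snapshots) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in snapshots)
    if var_t == 0:
        return None  # all snapshots share one timestamp
    slope = sum((t - mean_t) * (p - mean_p) for t, p in snapshots) / var_t
    if slope >= 0:
        return None  # budget is flat or recovering
    intercept = mean_p - slope * mean_t
    t_zero = -intercept / slope  # time at which the fitted line crosses 0%
    return max(0.0, t_zero - snapshots[-1][0])


if __name__ == "__main__":
    # Budget dropping 2 percentage points per hour, starting from 100%.
    history = [(0.0, 100.0), (1.0, 98.0), (2.0, 96.0)]
    print(predict_exhaustion_hours(history))
```

In the example, the budget falls 2 points per hour from 96% at the last snapshot, so the fitted line reaches zero 48 hours later.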
Affiliate note: This project uses the Anthropic Claude API (claude-haiku-4-5-20251001, chosen for cost-efficient summaries; check Anthropic's pricing page for current per-token rates). For production deployments, Grafana Cloud has a generous free tier with built-in SLO tracking dashboards.