Build an AI On-Call Assistant with PagerDuty and Claude API
Build an AI assistant that reads PagerDuty alerts, fetches related runbooks, and generates a first-response action plan — so your on-call engineer doesn't start from zero at 3am.
It's 3am. PagerDuty fires. You wake up groggy, stare at an alert, and your brain is running at 5% capacity. You need to figure out what broke, why, and what to do.
This project builds an AI assistant that does the first 10 minutes of that work for you — reads the alert, looks up relevant runbooks, and gives you a specific action plan.
What We're Building
PagerDuty Alert: "High error rate on payment-api (5xx > 5%)"
↓
AI On-Call Assistant
↓
📋 Incident Brief:
- Service: payment-api (prod)
- Alert: 5xx error rate 8.3% (threshold: 5%)
- Started: 3 minutes ago
- Impact: Checkout flow affected
🔍 Likely Causes (based on recent deploys + runbook):
1. Deploy at 02:47 UTC may have introduced a bug
2. Downstream stripe-api has had intermittent issues this week
⚡ Immediate Actions:
1. Check deploy history: kubectl rollout history deployment/payment-api
2. Check error logs: kubectl logs -n prod -l app=payment-api --since=10m
3. Check Stripe status: https://status.stripe.com
4. If deploy is culprit: kubectl rollout undo deployment/payment-api
Stack
- Python 3.11+
- pdpyras (PagerDuty Python SDK)
- anthropic (Claude API)
- python-dotenv

pip install pdpyras anthropic python-dotenv requests

.env:
PAGERDUTY_API_KEY=your-api-key
ANTHROPIC_API_KEY=sk-ant-...
RUNBOOK_DIR=./runbooks
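The modules in the following steps read these settings through `os.getenv`, so a missing key only surfaces when the first API call fails. A minimal sketch of an up-front check, assuming the variable names from the `.env` example; the helper names `missing_settings` and `resolve_settings` are hypothetical and not part of any SDK:

```python
from collections.abc import Mapping

# Variable names taken from the .env example in this article.
REQUIRED_SETTINGS = ("PAGERDUTY_API_KEY", "ANTHROPIC_API_KEY")
OPTIONAL_DEFAULTS = {"RUNBOOK_DIR": "./runbooks"}


def missing_settings(env: Mapping[str, str]) -> list[str]:
    """Return the required setting names that are absent or blank in env."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


def resolve_settings(env: Mapping[str, str]) -> dict[str, str]:
    """Return all settings, raising early if a required one is missing."""
    missing = missing_settings(env)
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    settings = {name: env[name] for name in REQUIRED_SETTINGS}
    for name, default in OPTIONAL_DEFAULTS.items():
        # Fall back to the default when the optional value is unset or empty.
        settings[name] = env.get(name) or default
    return settings
```

Calling `resolve_settings(os.environ)` at startup, once python-dotenv's `load_dotenv()` has read the `.env` file, would fail fast with a readable message instead of an authentication error mid-incident.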
Step 1: Fetch PagerDuty Incident Data
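The `IncidentContext` defined in this step carries a `recent_deploys` field that the fetch code leaves empty, yet the sample brief points at a recent deploy as a likely cause. A rough sketch of how that field might be filled, assuming your deploy tooling can supply records as dicts with `service` and timezone-aware ISO 8601 `deployed_at` keys (a hypothetical shape, as is the `deploys_before` helper):

```python
from datetime import datetime, timedelta


def _parse(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def deploys_before(incident_created_at: str,
                   deploys: list[dict],
                   service_name: str,
                   window: timedelta = timedelta(hours=2)) -> list[dict]:
    """Return deploys for service_name that landed within `window` before the incident."""
    started = _parse(incident_created_at)
    matches = [
        d for d in deploys
        if d.get("service") == service_name
        and started - window <= _parse(d["deployed_at"]) <= started
    ]
    # Most recent deploy first, since it is usually the first suspect.
    return sorted(matches, key=lambda d: _parse(d["deployed_at"]), reverse=True)
```

The result could be assigned to `recent_deploys`. Note that the prompt in Step 3 does not include that field, so it would also need a section there before the model can use it.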
# pd_client.py
import os
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

import pdpyras


@dataclass
class IncidentContext:
    id: str
    title: str
    service_name: str
    urgency: str
    status: str
    created_at: str
    details: str
    recent_alerts: list[dict]
    recent_deploys: list[dict]


def get_pd_client():
    return pdpyras.APISession(os.getenv("PAGERDUTY_API_KEY"))


def fetch_incident(incident_id: str) -> IncidentContext:
    session = get_pd_client()

    # Get the incident
    incident = session.rget(f"/incidents/{incident_id}")

    # Get recent alerts for context
    alerts = session.rget(
        f"/incidents/{incident_id}/alerts",
        params={"limit": 10}
    )

    # Get service info
    service_id = incident.get("service", {}).get("id")
    service_name = incident.get("service", {}).get("summary", "unknown")

    # Get recent incidents for same service (pattern detection).
    # Note: `recent` is fetched here but not yet stored on IncidentContext.
    recent = session.rget(
        "/incidents",
        params={
            "service_ids[]": [service_id],
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
            "since": (datetime.now(timezone.utc) - timedelta(days=7)).isoformat(),
            "statuses[]": ["resolved"],
            "limit": 5
        }
    ) if service_id else []

    return IncidentContext(
        id=incident_id,
        title=incident.get("title", ""),
        service_name=service_name,
        urgency=incident.get("urgency", ""),
        status=incident.get("status", ""),
        created_at=incident.get("created_at", ""),
        details=str(incident.get("body", {}).get("details", "")),
        recent_alerts=[
            {
                "summary": a.get("summary", ""),
                "created_at": a.get("created_at", "")
            }
            for a in (alerts if isinstance(alerts, list) else [])
        ],
        recent_deploys=[]  # filled by deploy tracker if you have one
    )


def get_open_incidents() -> list[dict]:
    """Get all currently open incidents"""
    session = get_pd_client()
    incidents = session.rget(
        "/incidents",
        params={"statuses[]": ["triggered", "acknowledged"]}
    )
    return incidents if isinstance(incidents, list) else []

Step 2: Load Runbooks
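The loader in this step joins every runbook and cuts the combined text at 4000 characters, which can chop a runbook mid-command or spend the whole budget on one long file. One hedged alternative is to cap each runbook separately and stop once the overall budget is spent. The sketch assumes the runbooks arrive as `(name, content)` pairs already ordered by priority; `fit_to_budget` and its parameters are hypothetical names, not part of the loader:

```python
def fit_to_budget(runbooks: list[tuple[str, str]],
                  total_chars: int = 4000,
                  per_runbook_chars: int = 1500) -> str:
    """Join (name, content) pairs, capping each one and the rough overall size.

    Runbooks are expected in priority order; later ones are dropped
    whole once the total budget is used up.
    """
    sections = []
    used = 0
    for name, content in runbooks:
        body = content.strip()
        if len(body) > per_runbook_chars:
            # Prefer cutting at the last line break inside the cap so a
            # command is not split mid-line when that can be avoided.
            cut = body.rfind("\n", 0, per_runbook_chars)
            body = body[: cut if cut > 0 else per_runbook_chars] + "\n[...truncated...]"
        section = f"## RUNBOOK: {name}\n{body}"
        # Always keep the first runbook; drop later ones that do not fit.
        if used + len(section) > total_chars and sections:
            break
        sections.append(section)
        used += len(section)
    return "\n\n---\n\n".join(sections) if sections else "No runbooks available."
```

The trade-off is that a very long priority runbook loses its tail, but the service's own runbook and at least part of the next one are more likely to reach the model intact.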
# runbook_loader.py
import os
from pathlib import Path


def load_runbooks(service_name: str | None = None) -> str:
    """Load runbooks, prioritizing ones matching the service name"""
    runbook_dir = Path(os.getenv("RUNBOOK_DIR", "./runbooks"))
    if not runbook_dir.exists():
        return "No runbooks available."

    all_runbooks = []
    priority_runbooks = []

    # Sort for a stable order; glob() alone returns files in arbitrary order
    for file in sorted(runbook_dir.glob("*.md")):
        content = file.read_text(encoding="utf-8")
        # If service name matches, prioritize this runbook
        if service_name and service_name.lower() in file.stem.lower():
            priority_runbooks.append(f"## RUNBOOK: {file.stem}\n{content}")
        else:
            all_runbooks.append(f"## RUNBOOK: {file.stem}\n{content}")

    # Return priority runbooks first, then others (truncated to fit context)
    combined = priority_runbooks + all_runbooks
    full_text = "\n\n---\n\n".join(combined)

    # Truncate if too long (keep ~4000 chars for context)
    if len(full_text) > 4000:
        full_text = full_text[:4000] + "\n\n[...truncated for context...]"

    return full_text if full_text else "No runbooks available."

Step 3: Claude AI Analysis
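The functions in this step call the Claude API directly, so a rate limit or network error would leave the on-call engineer with a stack trace instead of a brief. A minimal sketch of a retry-then-fallback wrapper, assuming the brief generator is passed in as a plain callable; `brief_with_fallback`, its parameters, and the fallback message format are hypothetical:

```python
import time
from typing import Callable


def brief_with_fallback(generate: Callable[[], str],
                        fallback: str,
                        attempts: int = 3,
                        base_delay: float = 1.0,
                        sleep: Callable[[float], None] = time.sleep) -> str:
    """Call generate() up to `attempts` times with exponential backoff.

    If every attempt raises, return `fallback` plus the last error
    instead of propagating the exception.
    """
    last_error: Exception | None = None
    for attempt in range(attempts):
        try:
            return generate()
        except Exception as exc:  # broad on purpose: any failure should degrade gracefully
            last_error = exc
            if attempt < attempts - 1:
                # Wait 1x, 2x, 4x... the base delay between attempts.
                sleep(base_delay * 2 ** attempt)
    return f"{fallback}\n\n(AI brief unavailable: {last_error})"
```

With the code from this step, a call might look like `brief_with_fallback(lambda: generate_incident_brief(incident), fallback=incident.title)`, so the engineer at least sees the raw alert title when the API is unreachable.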
# ai_assistant.py
import os

import anthropic

from pd_client import IncidentContext
from runbook_loader import load_runbooks

client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))


def generate_incident_brief(incident: IncidentContext) -> str:
    runbooks = load_runbooks(incident.service_name)

    # fetch_incident requests at most 10 alerts, matching the section title in the prompt
    recent_alerts_text = "\n".join([
        f"- {a['created_at']}: {a['summary']}"
        for a in incident.recent_alerts[:10]
    ]) or "No recent alerts"

    prompt = f"""You are an expert on-call assistant. An incident just fired.
Give the on-call engineer a clear, actionable brief to start their investigation.

## INCIDENT DETAILS
- **ID**: {incident.id}
- **Title**: {incident.title}
- **Service**: {incident.service_name}
- **Urgency**: {incident.urgency}
- **Status**: {incident.status}
- **Started**: {incident.created_at}
- **Details**: {incident.details}

## RECENT ALERTS (last 10)
{recent_alerts_text}

## RUNBOOKS
{runbooks}

---
Please provide:
1. **30-Second Summary**: What broke, when, what's likely affected (2-3 sentences max)
2. **Likely Causes**: Top 3 probable root causes based on the alert details and runbooks
3. **Immediate Actions**: Exact commands/steps to run RIGHT NOW (numbered, specific, copy-pasteable)
4. **Escalation**: When to escalate and who to call (based on runbook or common sense)
5. **Key Questions**: 2-3 diagnostic questions to answer in the first 5 minutes

Format as markdown. Be concise: the engineer just woke up. No fluff."""

    message = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1500,
        messages=[{"role": "user", "content": prompt}]
    )
    return message.content[0].text


def generate_postmortem_draft(incident: IncidentContext, resolution_notes: str) -> str:
    """Generate a postmortem draft after incident resolution"""
    prompt = f"""Generate a postmortem draft for this incident:

**Incident**: {incident.title}
**Service**: {incident.service_name}
**Duration**: From {incident.created_at} to now
**Resolution Notes**: {resolution_notes}

Follow the standard postmortem format:
1. Summary
2. Impact
3. Root Cause
4. Timeline
5. What Went Well
6. What Went Poorly
7. Action Items (with owners and due dates)

Keep it factual, blameless, and actionable."""

    message = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1500,
        messages=[{"role": "user", "content": prompt}]
    )
    return message.content[0].text

Step 4: CLI Interface
# main.py
import argparse

from dotenv import load_dotenv

# Load .env first: ai_assistant reads ANTHROPIC_API_KEY at import time
load_dotenv()

from pd_client import fetch_incident, get_open_incidents  # noqa: E402
from ai_assistant import generate_incident_brief, generate_postmortem_draft  # noqa: E402


def main():
    parser = argparse.ArgumentParser(description="AI On-Call Assistant")
    subparsers = parser.add_subparsers(dest="command")

    # Brief command
    brief_parser = subparsers.add_parser("brief", help="Get AI brief for an incident")
    brief_parser.add_argument("incident_id", help="PagerDuty incident ID")

    # List command
    subparsers.add_parser("list", help="List open incidents")

    # Postmortem command
    pm_parser = subparsers.add_parser("postmortem", help="Generate postmortem draft")
    pm_parser.add_argument("incident_id")
    pm_parser.add_argument("--notes", default="", help="Resolution notes")

    args = parser.parse_args()

    if args.command == "brief":
        print(f"🔍 Fetching incident {args.incident_id}...")
        incident = fetch_incident(args.incident_id)
        print(f"🤖 Generating AI brief for: {incident.title}\n")
        print("=" * 60)
        brief = generate_incident_brief(incident)
        print(brief)
        print("=" * 60)

    elif args.command == "list":
        incidents = get_open_incidents()
        if not incidents:
            print("✅ No open incidents!")
            return
        print(f"🚨 {len(incidents)} open incident(s):\n")
        for inc in incidents:
            print(f" [{inc['urgency'].upper()}] {inc['id']} - {inc['title']}")
            print(f" Service: {inc.get('service', {}).get('summary', 'unknown')}")
            print(f" Status: {inc['status']}")
            print()
        print("Run: python main.py brief <incident_id>")

    elif args.command == "postmortem":
        print(f"📝 Generating postmortem for {args.incident_id}...")
        incident = fetch_incident(args.incident_id)
        pm = generate_postmortem_draft(incident, args.notes)

        # Save to file
        filename = f"postmortem-{args.incident_id}.md"
        with open(filename, "w", encoding="utf-8") as f:
            f.write(f"# Postmortem: {incident.title}\n\n")
            f.write(pm)
        print(f"✅ Postmortem saved to {filename}")

    else:
        parser.print_help()


if __name__ == "__main__":
    main()

Using It
# See open incidents
python main.py list
# Get AI brief for an incident
python main.py brief P1234ABC
# Generate postmortem
python main.py postmortem P1234ABC --notes "Rolled back deploy v2.1.3, fixed memory leak"

Add a Runbook
# runbooks/payment-api.md
## payment-api Runbook
### Common Issues
**High Error Rate (5xx)**
- Check recent deploys: `kubectl rollout history deployment/payment-api -n prod`
- Check logs: `kubectl logs -n prod -l app=payment-api --since=10m | grep ERROR`
- Check Stripe API status: https://status.stripe.com
**Rollback**
```bash
kubectl rollout undo deployment/payment-api -n prod
kubectl rollout status deployment/payment-api -n prod
```
### Escalation
- Payment team lead: @payment-lead on Slack
- Escalate if revenue impact > 5 minutes
---
## What's Next
- Add Slack notification with the brief automatically
- Connect to your metrics API to include real-time data
- Store incident + resolution pairs to improve recommendations over time
- Build a web UI with Streamlit for visual incident management
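
For the first item, Slack incoming webhooks accept a JSON body with a `text` field, so posting the brief needs only the standard library. A rough sketch, assuming you have created an incoming webhook and keep its URL in a `SLACK_WEBHOOK_URL` environment variable; that variable name, both helper names, and the 3000-character cap are choices made for this sketch, not part of the project above:

```python
import json
import urllib.request


def build_slack_payload(incident_id: str, title: str, brief: str,
                        max_chars: int = 3000) -> dict:
    """Build the JSON body for a Slack incoming webhook."""
    if len(brief) > max_chars:
        # Keep the message short enough to read on a phone at 3am.
        brief = brief[:max_chars] + "\n[...truncated...]"
    return {"text": f"*Incident {incident_id}: {title}*\n{brief}"}


def post_to_slack(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

In `main.py`, the `brief` command could call `post_to_slack(os.environ["SLACK_WEBHOOK_URL"], build_slack_payload(incident.id, incident.title, brief))` right after printing the brief.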
---
> **[Anthropic Claude API](https://anthropic.com)**: `claude-opus-4-7` handles complex technical analysis. Create your API key at console.anthropic.com.
> **[PagerDuty](https://pagerduty.com)**: the incident management platform this project integrates with. A free developer account is available for testing.