Build an AI On-Call Assistant with PagerDuty and Claude API

Build an AI assistant that reads PagerDuty alerts, fetches related runbooks, and generates a first-response action plan — so your on-call engineer doesn't start from zero at 3am.

DevOpsBoys · May 28, 2026 · 6 min read

3am. PagerDuty fires. You wake up groggy, stare at the alert, and your brain is running at 5% capacity. You need to figure out what broke, why, and what to do about it.

This project builds an AI assistant that does the first 10 minutes of that work for you — reads the alert, looks up relevant runbooks, and gives you a specific action plan.


What We're Building

PagerDuty Alert: "High error rate on payment-api (5xx > 5%)"
                    ↓
         AI On-Call Assistant
                    ↓
📋 Incident Brief:
- Service: payment-api (prod)
- Alert: 5xx error rate 8.3% (threshold: 5%)
- Started: 3 minutes ago
- Impact: Checkout flow affected

🔍 Likely Causes (based on recent deploys + runbook):
1. Deploy at 02:47 UTC may have introduced a bug
2. Downstream stripe-api has had intermittent issues this week

⚡ Immediate Actions:
1. Check deploy history: kubectl rollout history deployment/payment-api -n prod
2. Check error logs: kubectl logs -n prod -l app=payment-api --since=10m
3. Check Stripe status: https://status.stripe.com
4. If the deploy is the culprit: kubectl rollout undo deployment/payment-api -n prod

Stack

  • Python 3.11+
  • pdpyras (PagerDuty Python SDK)
  • anthropic (Claude API)
  • python-dotenv
bash
pip install pdpyras anthropic python-dotenv

.env:

PAGERDUTY_API_KEY=your-api-key
ANTHROPIC_API_KEY=sk-ant-...
RUNBOOK_DIR=./runbooks
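
The code in the following steps reads these values with os.getenv, which only sees variables that are already in the process environment. A minimal sketch of loading the .env file at startup and reporting missing keys early is shown here; the file name `config_check.py`, the helper `missing_settings`, and the `REQUIRED_VARS` tuple are illustrative names, not part of the original project.

```python
# config_check.py (hypothetical helper, not part of the original project)
import os

try:
    # python-dotenv is in the install list above; load_dotenv() copies values
    # from a .env file it finds into os.environ without overriding existing ones.
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass  # fall back to variables already exported in the shell

# Assumption: these are the two keys the rest of the project cannot run without.
REQUIRED_VARS = ("PAGERDUTY_API_KEY", "ANTHROPIC_API_KEY")


def missing_settings(env=None, required=REQUIRED_VARS) -> list[str]:
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not str(env.get(name, "")).strip()]
```

Calling `missing_settings()` before the first API request and exiting when the list is non-empty gives a clearer failure than an authentication error halfway through a brief.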

Step 1: Fetch PagerDuty Incident Data

python
# pd_client.py
import pdpyras
import os
from dataclasses import dataclass
from typing import Optional
from datetime import datetime, timedelta
 
@dataclass
class IncidentContext:
    id: str
    title: str
    service_name: str
    urgency: str
    status: str
    created_at: str
    details: str
    recent_alerts: list[dict]
    recent_deploys: list[dict]
 
def get_pd_client():
    return pdpyras.APISession(os.getenv("PAGERDUTY_API_KEY"))
 
def fetch_incident(incident_id: str) -> IncidentContext:
    session = get_pd_client()
    
    # Get the incident
    incident = session.rget(f"/incidents/{incident_id}")
    
    # Get recent alerts for context
    alerts = session.rget(
        f"/incidents/{incident_id}/alerts",
        params={"limit": 10}
    )
    
    # Get service info
    service_id = incident.get("service", {}).get("id")
    service_name = incident.get("service", {}).get("summary", "unknown")
    
    # Get recent incidents for same service (pattern detection)
    recent = session.rget(
        "/incidents",
        params={
            "service_ids[]": [service_id],
            "since": (datetime.utcnow() - timedelta(days=7)).isoformat(),
            "statuses[]": ["resolved"],
            "limit": 5
        }
    ) if service_id else []
    
    return IncidentContext(
        id=incident_id,
        title=incident.get("title", ""),
        service_name=service_name,
        urgency=incident.get("urgency", ""),
        status=incident.get("status", ""),
        created_at=incident.get("created_at", ""),
        details=str(incident.get("body", {}).get("details", "")),
        recent_alerts=[
            {
                "summary": a.get("summary", ""),
                "created_at": a.get("created_at", "")
            }
            for a in (alerts if isinstance(alerts, list) else [])
        ],
        recent_deploys=[]  # filled by deploy tracker if you have one
    )
 
def get_open_incidents() -> list[dict]:
    """Get all currently open incidents"""
    session = get_pd_client()
    incidents = session.rget(
        "/incidents",
        params={"statuses[]": ["triggered", "acknowledged"]}
    )
    return incidents if isinstance(incidents, list) else []
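
fetch_incident queries the last week of resolved incidents for the same service "for pattern detection", but the result stored in `recent` is not carried into IncidentContext. One possible way to put it to use, as a sketch only: count repeated titles and turn them into short lines that could be appended to the prompt. The names `summarize_recurrence` and `min_count` are hypothetical, and the sketch assumes each item is a dict with a `title` key.

```python
from collections import Counter


def summarize_recurrence(recent_incidents: list[dict], min_count: int = 2) -> list[str]:
    """Describe incident titles that repeat in a list of past incidents.

    Each item is expected to be a dict with a "title" key; items without
    a title are ignored.
    """
    titles = Counter(
        inc.get("title", "").strip() for inc in recent_incidents if inc.get("title")
    )
    # most_common() yields titles from most to least frequent.
    return [
        f"{title} (seen {count}x in the lookback window)"
        for title, count in titles.most_common()
        if count >= min_count
    ]


recent = [
    {"title": "High error rate"},
    {"title": "High error rate"},
    {"title": "Disk full"},
]
print(summarize_recurrence(recent))
```

A recurring title is a hint that the previous resolution is worth looking up before starting from scratch.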

Step 2: Load Runbooks

python
# runbook_loader.py
from pathlib import Path
import os
 
def load_runbooks(service_name: str | None = None) -> str:
    """Load runbooks, prioritizing ones matching the service name"""
    runbook_dir = Path(os.getenv("RUNBOOK_DIR", "./runbooks"))
    
    if not runbook_dir.exists():
        return "No runbooks available."
    
    all_runbooks = []
    priority_runbooks = []
    
    for file in runbook_dir.glob("*.md"):
        content = file.read_text(encoding="utf-8")
        
        # If service name matches, prioritize this runbook
        if service_name and service_name.lower() in file.stem.lower():
            priority_runbooks.append(f"## RUNBOOK: {file.stem}\n{content}")
        else:
            all_runbooks.append(f"## RUNBOOK: {file.stem}\n{content}")
    
    # Return priority runbooks first, then others (truncated to fit context)
    combined = priority_runbooks + all_runbooks
    full_text = "\n\n---\n\n".join(combined)
    
    # Truncate if too long (keep ~4000 chars for context)
    if len(full_text) > 4000:
        full_text = full_text[:4000] + "\n\n[...truncated for context...]"
    
    return full_text if full_text else "No runbooks available."
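
The loader above prioritizes runbooks only by service name, so everything else is included in whatever order the filesystem returns it. A possible refinement, sketched under the assumption that the alert title is available alongside the runbook text, is to rank runbooks by how many words they share with the alert. The function names `tokenize` and `rank_runbooks` are illustrative and not part of the original code.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric words of 3+ characters, ignoring punctuation."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) >= 3}


def rank_runbooks(alert_title: str, runbooks: dict[str, str]) -> list[str]:
    """Return runbook names ordered by how many alert words they mention."""
    alert_words = tokenize(alert_title)
    scores = {
        name: len(alert_words & tokenize(name + " " + body))
        for name, body in runbooks.items()
    }
    # Highest score first; ties are broken by name so the order is stable.
    return sorted(scores, key=lambda name: (-scores[name], name))


runbooks = {
    "payment-api": "High error rate 5xx rollback steps",
    "redis": "Memory eviction and failover",
    "ingress": "Error 502 from nginx",
}
print(rank_runbooks("High error rate on payment-api (5xx > 5%)", runbooks))
```

Feeding the ranked order into the 4000-character truncation means the text that gets cut is the least related to the alert.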

Step 3: Claude AI Analysis

python
# ai_assistant.py
import anthropic
import os
from pd_client import IncidentContext
from runbook_loader import load_runbooks
 
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
 
def generate_incident_brief(incident: IncidentContext) -> str:
    runbooks = load_runbooks(incident.service_name)
    
    recent_alerts_text = "\n".join([
        f"- {a['created_at']}: {a['summary']}"
        for a in incident.recent_alerts[:5]
    ]) or "No recent alerts"
    
    prompt = f"""You are an expert on-call assistant. An incident just fired. 
Give the on-call engineer a clear, actionable brief to start their investigation.
 
## INCIDENT DETAILS
- **ID**: {incident.id}
- **Title**: {incident.title}
- **Service**: {incident.service_name}
- **Urgency**: {incident.urgency}
- **Status**: {incident.status}
- **Started**: {incident.created_at}
- **Details**: {incident.details}
 
## RECENT ALERTS (up to 5)
{recent_alerts_text}
 
## RUNBOOKS
{runbooks}
 
---
 
Please provide:
 
1. **30-Second Summary** — What broke, when, what's likely affected (2-3 sentences max)
 
2. **Likely Causes** — Top 3 probable root causes based on the alert details and runbooks
 
3. **Immediate Actions** — Exact commands/steps to run RIGHT NOW (numbered, specific, copy-pasteable)
 
4. **Escalation** — When to escalate and who to call (based on runbook or common sense)
 
5. **Key Questions** — 2-3 diagnostic questions to answer in the first 5 minutes
 
Format as markdown. Be concise — the engineer just woke up. No fluff."""
 
    message = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1500,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return message.content[0].text
 
def generate_postmortem_draft(incident: IncidentContext, resolution_notes: str) -> str:
    """Generate a postmortem draft after incident resolution"""
    
    prompt = f"""Generate a postmortem draft for this incident:
 
**Incident**: {incident.title}
**Service**: {incident.service_name}
**Duration**: From {incident.created_at} to now
**Resolution Notes**: {resolution_notes}
 
Follow the standard postmortem format:
1. Summary
2. Impact
3. Root Cause
4. Timeline
5. What Went Well
6. What Went Poorly  
7. Action Items (with owners and due dates)
 
Keep it factual, blameless, and actionable."""
 
    message = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1500,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return message.content[0].text
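
Both functions return `message.content[0].text`, which assumes the first content block in the response is a text block. That generally holds for a plain request like the ones above, but the response content is a list of typed blocks, so a slightly more defensive reading is to join every block whose type is `text`. The helper name `extract_text` is hypothetical, and the `fake` object below is only a stand-in shaped like a response so the sketch runs without an API key.

```python
from types import SimpleNamespace


def extract_text(message) -> str:
    """Concatenate the text of every text block in a response object."""
    parts = [
        block.text
        for block in getattr(message, "content", [])
        if getattr(block, "type", None) == "text"
    ]
    return "".join(parts).strip()


# Stand-in object shaped like a response, used only to exercise the helper.
fake = SimpleNamespace(content=[
    SimpleNamespace(type="text", text="## Summary\n"),
    SimpleNamespace(type="tool_use", id="x"),
    SimpleNamespace(type="text", text="Roll back the deploy."),
])
print(extract_text(fake))
```

An empty result is also easier to detect this way, which lets the CLI print a fallback message instead of raising an IndexError.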

Step 4: CLI Interface

python
# main.py
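import argparse

from dotenv import load_dotenv

# Load .env before importing ai_assistant, which reads ANTHROPIC_API_KEY at import time.
load_dotenv()

from pd_client import fetch_incident, get_open_incidents
from ai_assistant import generate_incident_brief, generate_postmortem_draft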
 
def main():
    parser = argparse.ArgumentParser(description="AI On-Call Assistant")
    subparsers = parser.add_subparsers(dest="command")
    
    # Brief command
    brief_parser = subparsers.add_parser("brief", help="Get AI brief for an incident")
    brief_parser.add_argument("incident_id", help="PagerDuty incident ID")
    
    # List command
    subparsers.add_parser("list", help="List open incidents")
    
    # Postmortem command
    pm_parser = subparsers.add_parser("postmortem", help="Generate postmortem draft")
    pm_parser.add_argument("incident_id")
    pm_parser.add_argument("--notes", default="", help="Resolution notes")
    
    args = parser.parse_args()
    
    if args.command == "brief":
        print(f"🔍 Fetching incident {args.incident_id}...")
        incident = fetch_incident(args.incident_id)
        
        print(f"🤖 Generating AI brief for: {incident.title}\n")
        print("=" * 60)
        brief = generate_incident_brief(incident)
        print(brief)
        print("=" * 60)
        
    elif args.command == "list":
        incidents = get_open_incidents()
        if not incidents:
            print("✅ No open incidents!")
            return
        
        print(f"🚨 {len(incidents)} open incident(s):\n")
        for inc in incidents:
            print(f"  [{inc['urgency'].upper()}] {inc['id']}: {inc['title']}")
            print(f"         Service: {inc.get('service', {}).get('summary', 'unknown')}")
            print(f"         Status: {inc['status']}")
            print()
        print("Run: python main.py brief <incident_id>")
        
    elif args.command == "postmortem":
        print(f"📝 Generating postmortem for {args.incident_id}...")
        incident = fetch_incident(args.incident_id)
        pm = generate_postmortem_draft(incident, args.notes)
        
        # Save to file
        filename = f"postmortem-{args.incident_id}.md"
        with open(filename, "w", encoding="utf-8") as f:
            f.write(f"# Postmortem: {incident.title}\n\n")
            f.write(pm)
        
        print(f"✅ Postmortem saved to {filename}")
    
    else:
        parser.print_help()
 
if __name__ == "__main__":
    main()

Using It

bash
# See open incidents
python main.py list
 
# Get AI brief for an incident
python main.py brief P1234ABC
 
# Generate postmortem
python main.py postmortem P1234ABC --notes "Rolled back deploy v2.1.3, fixed memory leak"

Add a Runbook

markdown
# runbooks/payment-api.md
 
## payment-api Runbook
 
### Common Issues
 
**High Error Rate (5xx)**
- Check recent deploys: `kubectl rollout history deployment/payment-api -n prod`
- Check logs: `kubectl logs -n prod -l app=payment-api --since=10m | grep ERROR`
- Check Stripe API status: https://status.stripe.com
 
**Rollback**
```bash
kubectl rollout undo deployment/payment-api -n prod
kubectl rollout status deployment/payment-api -n prod
```

### Escalation

- Payment team lead: @payment-lead on Slack
- Escalate if revenue impact > 5 minutes
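
The runbook loader decides priority by checking whether the service name appears in the file name, so a service called "Payment API (prod)" in PagerDuty would not match `payment-api.md`. A small normalization step, sketched here with the hypothetical helpers `slugify` and `matches_runbook`, makes that match more forgiving. It is still substring-based, so very short names can match more files than intended.

```python
import re


def slugify(name: str) -> str:
    """Lowercase a name and collapse non-alphanumeric runs into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def matches_runbook(service_name: str, file_stem: str) -> bool:
    """True if the slugged service name and runbook file stem contain one another."""
    service, stem = slugify(service_name), slugify(file_stem)
    if not service or not stem:
        return False
    return stem in service or service in stem


print(slugify("Payment API (prod)"))
print(matches_runbook("Payment API (prod)", "payment-api"))
```

Naming runbook files after the slugged service name keeps the lookup predictable.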

---

What's Next

  • Post the brief to Slack automatically when an incident triggers
  • Connect to your metrics API so the brief includes real-time data
  • Store incident and resolution pairs to improve recommendations over time
  • Build a web UI with Streamlit for visual incident management

---

> **[Anthropic Claude API](https://anthropic.com)**: `claude-opus-4-7` handles complex technical analysis. Create your API key at console.anthropic.com.

> **[PagerDuty](https://pagerduty.com)** — the incident management platform this integrates with. They have a free developer account for testing.