Build a Slack DevOps Bot with Claude API: Alerts, Runbooks, and Incident Help

Build a Slack bot that uses Claude AI to explain alerts, fetch runbooks, suggest fixes, and help during incidents. Full Python project with Slack Bolt and Anthropic API.

Imagine getting a PagerDuty alert at 2 a.m., typing /ask why is memory high on prod-api in Slack, and getting a specific, actionable answer grounded in your team's runbooks. That's what we're building.

A Slack bot powered by Claude API that:

Answers DevOps questions in plain English
Explains Kubernetes errors
Suggests fixes based on your runbooks
Summarizes recent alerts

What We're Building

Engineer: @devbot what does OOMKilled mean and how do I fix it?

DevBot: OOMKilled means your container was killed by the kernel because 
it exceeded its memory limit. Here's what to check:

1. Check current memory usage: kubectl top pod <pod-name> -n <namespace>
2. Check current limit: kubectl describe pod <pod-name> | grep -A5 Limits
3. Short-term fix: increase memory limit in your deployment
4. Long-term: find the memory leak or optimize your app

Based on your team's runbook, the most common cause in prod-api is 
the /export endpoint loading large datasets into memory...

Stack

Python 3.11+
Slack Bolt (official Slack SDK)
Anthropic Claude API
Optional: FastAPI and Uvicorn for the Alertmanager webhook (Step 4)
Optional: your runbooks as Markdown files

Setup

bash

mkdir devops-slack-bot && cd devops-slack-bot
pip install slack-bolt anthropic python-dotenv aiohttp fastapi uvicorn

Create .env:

SLACK_BOT_TOKEN=xoxb-...
SLACK_APP_TOKEN=xapp-...
SLACK_SIGNING_SECRET=...
ANTHROPIC_API_KEY=sk-ant-...

Create a Slack App:

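Go to https://api.slack.com/apps → New App
Features → OAuth & Permissions → Add Bot Token Scopes:
- app_mentions:read
- chat:write
- commands
- im:history, im:read, im:write
- reactions:write (for the :thinking_face: reaction the bot adds while it works)
Enable Socket Mode (used by bot.py) and generate an App-Level Token with the connections:write scope → copy it as SLACK_APP_TOKEN
Features → Event Subscriptions → subscribe to the bot events app_mention and message.im
Features → Slash Commands → create /ask and /explain-error
Features → App Home → enable the Messages Tab so users can DM the bot
Install to workspace → copy Bot Token
Basic Information → copy Signing Secret (only needed if you switch from Socket Mode to the HTTP Events API)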
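
Before starting the bot, it can help to confirm that the tokens bot.py reads are present and look like the right kind of token. The sketch below is a sanity check under stated assumptions, not part of Slack Bolt: `check_env` and `EXPECTED_PREFIXES` are hypothetical names, and the prefixes are the usual ones (`xoxb-` for bot tokens, `xapp-` for app-level tokens, `sk-ant-` for Anthropic keys).

```python
import os

# Hypothetical helper: maps each variable the bot needs to its usual prefix.
EXPECTED_PREFIXES = {
    "SLACK_BOT_TOKEN": "xoxb-",
    "SLACK_APP_TOKEN": "xapp-",
    "ANTHROPIC_API_KEY": "sk-ant-",
}


def check_env(env: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, prefix in EXPECTED_PREFIXES.items():
        value = env.get(name, "")
        if not value:
            problems.append(f"{name} is not set")
        elif not value.startswith(prefix):
            problems.append(f"{name} should start with {prefix}")
    return problems


if __name__ == "__main__":
    for problem in check_env(dict(os.environ)):
        print(f"WARNING: {problem}")
```

Calling `check_env(dict(os.environ))` after `load_dotenv()` and exiting on a non-empty result would surface a swapped or missing token before the Socket Mode connection fails.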

Step 1: Core Bot Setup

python

# bot.py
import os
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler
from dotenv import load_dotenv
from ai_handler import DevOpsAI
 
load_dotenv()
 
app = App(token=os.environ["SLACK_BOT_TOKEN"])
ai = DevOpsAI()
 
# Handle @mentions
@app.event("app_mention")
def handle_mention(event, say, client):
    user = event["user"]
    text = event["text"]
    channel = event["channel"]
    thread_ts = event.get("thread_ts", event["ts"])
    
    # Remove the bot mention from the text
    question = text.split(">", 1)[-1].strip()
    
    if not question:
        say(
            text="Hey! Ask me anything about DevOps, Kubernetes, alerts, or paste an error message.",
            thread_ts=thread_ts
        )
        return
    
    # Add a reaction as a lightweight "working on it" indicator
    client.reactions_add(
        channel=channel,
        name="thinking_face",
        timestamp=event["ts"]
    )
    
    try:
        response = ai.answer(question)
        say(text=response, thread_ts=thread_ts)
    except Exception as e:
        say(text=f"❌ Error: {str(e)}", thread_ts=thread_ts)
    finally:
        client.reactions_remove(
            channel=channel,
            name="thinking_face",
            timestamp=event["ts"]
        )
 
# Handle direct messages
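@app.event("message")
def handle_dm(event, say):
    # Only respond to DMs (not channel messages)
    if event.get("channel_type") != "im":
        return
    
    # Skip edits, deletions, and bot-authored messages
    if event.get("subtype") or event.get("bot_id"):
        return
    
    text = event.get("text", "").strip()
    if not text:
        return
    
    try:
        response = ai.answer(text)
        say(text=response)
    except Exception as e:
        say(text=f"❌ Error: {str(e)}")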
 
# Slash command /ask
@app.command("/ask")
def handle_ask_command(ack, respond, command):
    ack()
    question = command["text"]
    
    if not question:
        respond("Usage: `/ask <your devops question>`")
        return
    
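    respond("🤔 Thinking...")
    try:
        respond(ai.answer(question))
    except Exception as e:
        # Report the failure instead of leaving "Thinking..." as the last message
        respond(f"❌ Error: {str(e)}")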
 
# Slash command /explain-error
@app.command("/explain-error")
def handle_explain_error(ack, respond, command):
    ack()
    error = command["text"]
    
    if not error:
        respond("Paste the error message after the command: `/explain-error <error>`")
        return
    
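    respond("🔍 Analyzing error...")
    try:
        respond(ai.explain_error(error))
    except Exception as e:
        # Report the failure instead of leaving "Analyzing error..." as the last message
        respond(f"❌ Error: {str(e)}")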
 
if __name__ == "__main__":
    handler = SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"])
    handler.start()
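
The mention handler above strips the bot mention with `text.split(">", 1)[-1]`, which returns an empty question when the mention comes at the end of the message ("why is memory high @devbot"). A slightly more forgiving approach, sketched here as an assumption rather than anything Slack Bolt provides, removes mentions wherever they appear. `strip_mentions` and `MENTION_RE` are hypothetical names.

```python
import re

# Slack encodes a user mention as <@U123ABC> or <@U123ABC|name>.
MENTION_RE = re.compile(r"<@[A-Z0-9]+(?:\|[^>]*)?>")


def strip_mentions(text: str) -> str:
    """Remove every user mention and trim surrounding whitespace."""
    return MENTION_RE.sub("", text).strip()


print(strip_mentions("<@U123ABC> what does OOMKilled mean?"))
print(strip_mentions("why is memory high <@U123ABC>"))
```

In `handle_mention`, `question = strip_mentions(text)` would be a drop-in replacement for the split. Newlines inside the message are left alone, so pasted multi-line errors keep their shape.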

Step 2: Claude AI Handler

python

# ai_handler.py
import anthropic
import os
from pathlib import Path
 
class DevOpsAI:
    def __init__(self):
        self.client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
        self.runbooks = self._load_runbooks()
        
    def _load_runbooks(self) -> str:
        """Load runbooks from the runbooks/ directory"""
        runbooks_dir = Path("runbooks")
        if not runbooks_dir.exists():
            return ""
        
        content = []
        for file in runbooks_dir.glob("*.md"):
            content.append(f"## Runbook: {file.stem}\n{file.read_text()}")
        
        return "\n\n".join(content)
    
    def _system_prompt(self) -> str:
        base = """You are a DevOps expert assistant in Slack. You help engineers:
- Understand Kubernetes errors and alerts
- Debug CI/CD pipeline failures  
- Explain cloud infrastructure concepts
- Suggest fixes for common DevOps problems
- Interpret monitoring alerts
 
Rules:
- Be concise — Slack messages should be scannable, not essays
- Use bullet points and code blocks for commands
- Give the most likely cause first, then alternatives
- Include the exact commands to diagnose/fix when possible
- Format code with backticks or triple backticks
- If you're not sure, say so and suggest who to escalate to"""
        
        if self.runbooks:
            base += f"\n\nTeam Runbooks (use these for context):\n{self.runbooks}"
        
        return base
    
    def answer(self, question: str) -> str:
        """Answer a general DevOps question"""
        message = self.client.messages.create(
            model="claude-opus-4-7",
            max_tokens=1024,
            system=self._system_prompt(),
            messages=[
                {"role": "user", "content": question}
            ]
        )
        return message.content[0].text
    
    def explain_error(self, error_text: str) -> str:
        """Explain an error message and suggest fixes"""
        # Slack mrkdwn renders bold with single asterisks, not Markdown's double asterisks
        prompt = (
            "Explain this error and how to fix it:\n\n"
            f"{error_text}\n\n"
            "Format your response as:\n"
            "*What happened:* (1-2 sentences)\n"
            "*Most likely cause:*\n"
            "*How to fix it:*\n"
            "1. ...\n"
            "2. ...\n"
            "*Commands to run:* (include bash commands)"
        )
 
        message = self.client.messages.create(
            model="claude-opus-4-7",
            max_tokens=1024,
            system=self._system_prompt(),
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        return message.content[0].text
    
    def summarize_alerts(self, alerts: list[dict]) -> str:
        """Summarize a list of alerts"""
        alerts_text = "\n".join([
            f"- [{a.get('severity', 'unknown')}] {a.get('name', '')}: {a.get('description', '')}"
            for a in alerts
        ])
        
        prompt = f"""Summarize these monitoring alerts and suggest priority order for investigation:
 
{alerts_text}
 
Group by: critical (fix now), warning (fix today), info (investigate when time allows)"""
 
        message = self.client.messages.create(
            model="claude-opus-4-7",
            max_tokens=512,
            system=self._system_prompt(),
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        return message.content[0].text
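
Each method above returns `message.content[0].text`, which assumes the first content block in the response is a text block. The response content is a list of typed blocks, so a small helper that joins every text block is a bit more defensive. This is a sketch: `extract_text` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins that only mimic the shape of real response blocks.

```python
from types import SimpleNamespace


def extract_text(content_blocks) -> str:
    """Join the text of every block whose type is "text"."""
    parts = [
        block.text
        for block in content_blocks
        if getattr(block, "type", None) == "text"
    ]
    return "\n".join(parts).strip()


# Stand-in objects that mimic the shape of response content blocks.
fake_content = [
    SimpleNamespace(type="text", text="Check the pod limits."),
    SimpleNamespace(type="text", text="Then restart the deployment."),
]
print(extract_text(fake_content))
```

Inside `DevOpsAI`, `return extract_text(message.content)` could replace the indexed access in all three methods.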

Step 3: Add Runbooks

Create a runbooks/ directory with your team's runbooks as markdown files:

markdown

# runbooks/high-memory.md
 
## High Memory Usage Alert
 
### Symptoms
- Alert: `MemoryUsage > 80%`
- Pod may show `OOMKilled` in describe output
 
### Investigation Steps
1. `kubectl top pods -n production | sort -k3 -hr`
2. `kubectl describe pod <pod-name> -n production | grep -A10 "Last State"`
3. Check recent deploys: `kubectl rollout history deployment/<name>`
 
### Common Causes
- Memory leak in the /export endpoint (large CSV generation)
- Uncached database queries returning large result sets
- Redis connection pool not being released
 
### Fix
- Short term: `kubectl rollout restart deployment/<name>`
- Long term: add Redis caching for expensive queries
 
### Escalation
If memory stays above 90% after restart, page the backend team.

DevOpsAI loads every .md file in runbooks/ at startup and appends it to the system prompt, so restart the bot after adding or editing a runbook.
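
Putting every runbook into the system prompt works for a handful of files, but the prompt grows with each one. One possible alternative, which is not what the handler in Step 2 does, is to send only the runbooks that share words with the question. The sketch below uses naive keyword overlap. `select_runbooks` and `_words` are hypothetical names, and the two runbook bodies are made-up sample data.

```python
import re


def _words(text: str) -> set[str]:
    """Lowercase word set, ignoring words of one or two characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 2}


def select_runbooks(question: str, runbooks: dict[str, str], limit: int = 2) -> list[str]:
    """Return up to `limit` runbook names ranked by shared-word count."""
    question_words = _words(question)
    scored = []
    for name, body in runbooks.items():
        overlap = len(question_words & _words(name + " " + body))
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


# Made-up sample data keyed by file stem.
runbooks = {
    "high-memory": "Pod may show OOMKilled. Check memory usage with kubectl top pods.",
    "disk-full": "Node disk pressure. Clean up old images and logs.",
}
print(select_runbooks("why is memory high on prod-api", runbooks))  # → ['high-memory']
```

Keyword overlap misses synonyms ("RAM" versus "memory"), so treat it as a starting point before reaching for embeddings or search.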

Step 4: PagerDuty / Alertmanager Integration

python

# webhook_handler.py
from fastapi import FastAPI, Request
from ai_handler import DevOpsAI
from slack_sdk import WebClient
import os
 
fastapi_app = FastAPI()
slack_client = WebClient(token=os.getenv("SLACK_BOT_TOKEN"))
ai = DevOpsAI()
 
ALERT_CHANNEL = "#alerts"
 
@fastapi_app.post("/webhook/alertmanager")
async def handle_alertmanager(request: Request):
    """Receive Alertmanager webhook and post to Slack with AI analysis"""
    payload = await request.json()
    alerts = payload.get("alerts", [])
    
    if not alerts:
        return {"status": "no alerts"}
    
    # Get AI summary (tolerate alerts that lack labels or annotations)
    summary = ai.summarize_alerts([
        {
            "name": a.get("labels", {}).get("alertname"),
            "severity": a.get("labels", {}).get("severity"),
            "description": a.get("annotations", {}).get("description", "")
        }
        for a in alerts
    ])
    
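    # Post to Slack
    slack_client.chat_postMessage(
        channel=ALERT_CHANNEL,
        # Fallback text for notifications and clients that cannot render blocks
        text=f"🚨 {len(alerts)} alert(s) firing",
        blocks=[
            {
                "type": "header",
                "text": {"type": "plain_text", "text": f"🚨 {len(alerts)} Alert(s) Firing"}
            },
            {
                "type": "section",
                # Section text is capped at 3000 characters
                "text": {"type": "mrkdwn", "text": summary[:3000]}
            },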
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": "Ask me for help: `/ask how to investigate high memory` or DM me directly"
                }
            }
        ]
    )
    
    return {"status": "posted"}
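
Claude tends to answer in Markdown, while Slack section blocks use mrkdwn (bold is a single asterisk) and cap section text at 3000 characters. A small pre-processing step can handle both. This is a sketch under those assumptions: `to_slack_mrkdwn` and `fit_section_text` are hypothetical helpers, and the conversion covers only bold markers, not the full Markdown syntax.

```python
import re

SECTION_TEXT_LIMIT = 3000  # maximum length of a section block's text


def to_slack_mrkdwn(text: str) -> str:
    """Convert Markdown **bold** to Slack's *bold*."""
    return re.sub(r"\*\*(.+?)\*\*", r"*\1*", text)


def fit_section_text(text: str, limit: int = SECTION_TEXT_LIMIT) -> str:
    """Convert bold markers, then truncate with an ellipsis if too long."""
    converted = to_slack_mrkdwn(text)
    if len(converted) <= limit:
        return converted
    return converted[: limit - 1] + "…"


print(fit_section_text("**Critical:** prod-api is OOMKilled"))
```

Passing `fit_section_text(summary)` as the section text keeps a long summary from being rejected by the Slack API.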

Run It

bash

# Development (Socket Mode)
python bot.py
 
# With webhook server
uvicorn webhook_handler:fastapi_app --port 8080 &
python bot.py

Deploy to Kubernetes

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: devops-bot
  namespace: monitoring
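spec:
  replicas: 1
  # apps/v1 Deployments require a selector that matches the pod template labels
  selector:
    matchLabels:
      app: devops-bot
  template:
    metadata:
      labels:
        app: devops-bot
    spec:
      containers:
      - name: bot
        image: your-registry/devops-bot:latest
        env:
        - name: SLACK_BOT_TOKEN
          valueFrom:
            secretKeyRef:
              name: slack-secrets
              key: bot-token
        # bot.py reads SLACK_APP_TOKEN for Socket Mode; the key name is an assumption
        - name: SLACK_APP_TOKEN
          valueFrom:
            secretKeyRef:
              name: slack-secrets
              key: app-token
        - name: ANTHROPIC_API_KEY
          valueFrom:
            secretKeyRef:
              name: anthropic-secret
              key: api-key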

Example Interactions

# Kubernetes error
/explain-error Error: ImagePullBackOff

# Alert investigation  
@devbot our CPU alert is firing on prod-api what should I check?

# General questions
@devbot what's the difference between liveness and readiness probes?

# Runbook lookup
@devbot the high memory runbook says to restart — what command?

What's Next

Add conversation history (multi-turn context per thread)
Connect to your metrics API for real-time data
Build a /status command showing cluster health
Add approval workflows for risky operations (e.g., /restart pod prod-api)
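
As a starting point for the first item, conversation history can be kept per Slack thread and passed to Claude as the message list. The sketch below is one possible shape, not a finished design: `ThreadHistory` is a hypothetical class, state lives in memory only (lost on restart), and `ai.answer` would need to accept a list of messages instead of a single question.

```python
from collections import defaultdict


class ThreadHistory:
    """In-memory store of recent messages, keyed by Slack thread_ts."""

    def __init__(self, max_messages: int = 10):
        self.max_messages = max_messages
        self._threads: dict[str, list[dict]] = defaultdict(list)

    def add(self, thread_ts: str, role: str, content: str) -> None:
        """Append a message and drop the oldest ones past the cap."""
        messages = self._threads[thread_ts]
        messages.append({"role": role, "content": content})
        del messages[: -self.max_messages]

    def messages(self, thread_ts: str) -> list[dict]:
        """Return a copy that opens with a user turn."""
        history = list(self._threads.get(thread_ts, []))
        # Trimming can leave an assistant reply first; skip to the next user turn.
        while history and history[0]["role"] != "user":
            history.pop(0)
        return history


history = ThreadHistory(max_messages=3)
history.add("1700000000.000100", "user", "why is memory high?")
history.add("1700000000.000100", "assistant", "Check kubectl top pods.")
history.add("1700000000.000100", "user", "and if it is OOMKilled?")
print(len(history.messages("1700000000.000100")))  # → 3
```

In `handle_mention`, the flow would be: add the user's question under `thread_ts`, send `history.messages(thread_ts)` to Claude, then add the reply with the `assistant` role.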

Anthropic Claude API — claude-opus-4-7 handles complex technical questions with nuance. Perfect for DevOps assistants.

KodeKloud — if this project got you interested in Kubernetes + monitoring deeper, their SRE and platform engineering tracks are the best hands-on path.
