Build a Slack DevOps Bot with Claude API: Alerts, Runbooks, and Incident Help
Build a Slack bot that uses Claude AI to explain alerts, fetch runbooks, suggest fixes, and help during incidents. Full Python project with Slack Bolt and Anthropic API.
Imagine getting a PagerDuty alert at 2am, typing /ask why is memory high on prod-api in Slack, and getting a specific, actionable response based on your runbooks and recent alerts. That's what we're building.
A Slack bot powered by Claude API that:
- Answers DevOps questions in plain English
- Explains Kubernetes errors
- Suggests fixes based on your runbooks
- Summarizes recent alerts
What We're Building
Engineer: @devbot what does OOMKilled mean and how do I fix it?
DevBot: OOMKilled means your container was killed by the kernel because
it exceeded its memory limit. Here's what to check:
1. Check current memory usage: kubectl top pod <pod-name> -n <namespace>
2. Check current limit: kubectl describe pod <pod-name> | grep -A5 Limits
3. Short-term fix: increase memory limit in your deployment
4. Long-term: find the memory leak or optimize your app
Based on your team's runbook, the most common cause in prod-api is
the /export endpoint loading large datasets into memory...
Stack
- Python 3.11+
- Slack Bolt (official Slack SDK)
- Anthropic Claude API
- Optional: your runbooks as text files
Setup
mkdir devops-slack-bot && cd devops-slack-bot
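pip install slack-bolt anthropic python-dotenv aiohttp fastapi uvicorn
Create .env:
SLACK_BOT_TOKEN=xoxb-...
# App-level token for Socket Mode; bot.py reads it when starting the handler
SLACK_APP_TOKEN=xapp-...
SLACK_SIGNING_SECRET=...
ANTHROPIC_API_KEY=sk-ant-...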
Create a Slack App:
- Go to https://api.slack.com/apps → New App
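- Features → OAuth & Permissions → Add Bot Token Scopes: app_mentions:read, chat:write, commands, im:read, im:write. The bot code also reads DM messages and adds reactions, which need im:history and reactions:write.
- Features → Event Subscriptions → subscribe to the bot events app_mention and message.im
- Features → Slash Commands → create /ask and /explain-error
- Enable Socket Mode (for development) or Events API. Socket Mode needs an app-level token (xapp-...) with the connections:write scope; bot.py reads it from SLACK_APP_TOKEN.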
- Install to workspace → copy Bot Token
- Basic Information → copy Signing Secret
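A missing or swapped token is the most common reason the bot fails at startup, so it can help to check the environment before connecting. The sketch below is one way to do that; `check_env` and `REQUIRED_PREFIXES` are hypothetical names, not part of Slack Bolt, and the prefixes are taken from the placeholder values in .env plus Slack's `xapp-` format for app-level tokens.

```python
import os

# Prefixes are assumptions based on the placeholder values in .env;
# SLACK_APP_TOKEN is the app-level token bot.py passes to SocketModeHandler.
REQUIRED_PREFIXES = {
    "SLACK_BOT_TOKEN": "xoxb-",
    "SLACK_APP_TOKEN": "xapp-",
    "ANTHROPIC_API_KEY": "sk-ant-",
}


def check_env(env=None) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = []
    for name, prefix in REQUIRED_PREFIXES.items():
        value = env.get(name, "")
        if not value:
            problems.append(f"{name} is not set")
        elif not value.startswith(prefix):
            problems.append(f"{name} should start with {prefix!r}")
    return problems
```

Calling `check_env()` right after `load_dotenv()` and exiting when the list is non-empty gives a clearer message than a `KeyError` from `os.environ`.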
Step 1: Core Bot Setup
# bot.py
import os

from dotenv import load_dotenv
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

from ai_handler import DevOpsAI

load_dotenv()

app = App(token=os.environ["SLACK_BOT_TOKEN"])
ai = DevOpsAI()


# Handle @mentions
@app.event("app_mention")
def handle_mention(event, say, client):
    text = event["text"]
    channel = event["channel"]
    # Reply in the existing thread, or start one under the triggering message
    thread_ts = event.get("thread_ts", event["ts"])

    # Remove the bot mention (everything up to the first ">") from the text
    question = text.split(">", 1)[-1].strip()
    if not question:
        say(
            text="Hey! Ask me anything about DevOps, Kubernetes, alerts, or paste an error message.",
            thread_ts=thread_ts
        )
        return

    # Add a reaction as a "working on it" indicator
    client.reactions_add(
        channel=channel,
        name="thinking_face",
        timestamp=event["ts"]
    )
    try:
        response = ai.answer(question)
        say(text=response, thread_ts=thread_ts)
    except Exception as e:
        say(text=f"❌ Error: {str(e)}", thread_ts=thread_ts)
    finally:
        # Always clear the indicator, even if the API call failed
        client.reactions_remove(
            channel=channel,
            name="thinking_face",
            timestamp=event["ts"]
        )


# Handle direct messages
@app.event("message")
def handle_dm(event, say):
    # Only respond to DMs (not channel messages)
    if event.get("channel_type") != "im":
        return
    text = event.get("text", "").strip()
    if not text:
        return
    response = ai.answer(text)
    say(text=response)


# Slash command /ask
@app.command("/ask")
def handle_ask_command(ack, respond, command):
    # Acknowledge first: Slack expects a reply to the command within 3 seconds
    ack()
    question = command["text"]
    if not question:
        respond("Usage: `/ask <your devops question>`")
        return
    respond("🤔 Thinking...")
    response = ai.answer(question)
    respond(response)


# Slash command /explain-error
@app.command("/explain-error")
def handle_explain_error(ack, respond, command):
    ack()
    error = command["text"]
    if not error:
        respond("Paste the error message after the command: `/explain-error <error>`")
        return
    respond("🔍 Analyzing error...")
    response = ai.explain_error(error)
    respond(response)


if __name__ == "__main__":
    # Socket Mode uses the app-level token (xapp-...), not the bot token
    handler = SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"])
    handler.start()
Step 2: Claude AI Handler
# ai_handler.py
import os
from pathlib import Path

import anthropic


class DevOpsAI:
    def __init__(self):
        self.client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
        # Runbooks are read once at startup and reused for every request
        self.runbooks = self._load_runbooks()

    def _load_runbooks(self) -> str:
        """Load runbooks from the runbooks/ directory"""
        runbooks_dir = Path("runbooks")
        if not runbooks_dir.exists():
            return ""
        content = []
        for file in runbooks_dir.glob("*.md"):
            content.append(f"## Runbook: {file.stem}\n{file.read_text()}")
        return "\n\n".join(content)

    def _system_prompt(self) -> str:
        base = """You are a DevOps expert assistant in Slack. You help engineers:
- Understand Kubernetes errors and alerts
- Debug CI/CD pipeline failures
- Explain cloud infrastructure concepts
- Suggest fixes for common DevOps problems
- Interpret monitoring alerts

Rules:
- Be concise: Slack messages should be scannable, not essays
- Use bullet points and code blocks for commands
- Give the most likely cause first, then alternatives
- Include the exact commands to diagnose/fix when possible
- Format code with backticks or triple backticks
- If you're not sure, say so and suggest who to escalate to"""
        if self.runbooks:
            base += f"\n\nTeam Runbooks (use these for context):\n{self.runbooks}"
        return base

    def answer(self, question: str) -> str:
        """Answer a general DevOps question"""
        message = self.client.messages.create(
            model="claude-opus-4-7",
            max_tokens=1024,
            system=self._system_prompt(),
            messages=[
                {"role": "user", "content": question}
            ]
        )
        return message.content[0].text

    def explain_error(self, error_text: str) -> str:
        """Explain an error message and suggest fixes"""
        # Slack's mrkdwn renders *single asterisks* as bold, so the template
        # uses them instead of Markdown's double asterisks
        prompt = (
            "Explain this error and how to fix it:\n\n"
            f"{error_text}\n\n"
            "Format your response as:\n"
            "*What happened:* (1-2 sentences)\n"
            "*Most likely cause:*\n"
            "*How to fix it:*\n"
            "1. ...\n"
            "2. ...\n"
            "*Commands to run:* (include bash commands)"
        )
        message = self.client.messages.create(
            model="claude-opus-4-7",
            max_tokens=1024,
            system=self._system_prompt(),
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        return message.content[0].text

    def summarize_alerts(self, alerts: list[dict]) -> str:
        """Summarize a list of alerts"""
        alerts_text = "\n".join([
            f"- [{a.get('severity', 'unknown')}] {a.get('name', '')}: {a.get('description', '')}"
            for a in alerts
        ])
        prompt = f"""Summarize these monitoring alerts and suggest priority order for investigation:

{alerts_text}

Group by: critical (fix now), warning (fix today), info (investigate when time allows)"""
        message = self.client.messages.create(
            model="claude-opus-4-7",
            max_tokens=512,
            system=self._system_prompt(),
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        return message.content[0].text
Step 3: Add Runbooks
Create a runbooks/ directory with your team's runbooks as markdown files:
# runbooks/high-memory.md
## High Memory Usage Alert
### Symptoms
- Alert: `MemoryUsage > 80%`
- Pod may show `OOMKilled` in describe output
### Investigation Steps
1. `kubectl top pods -n production | sort -k3 -hr`
2. `kubectl describe pod <pod-name> -n production | grep -A10 "Last State"`
3. Check recent deploys: `kubectl rollout history deployment/<name>`
### Common Causes
- Memory leak in the /export endpoint (large CSV generation)
- Uncached database queries returning large result sets
- Redis connection pool not being released
### Fix
- Short term: `kubectl rollout restart deployment/<name>`
- Long term: add Redis caching for expensive queries
### Escalation
If memory stays above 90% after restart, page the backend team.
The bot reads every .md file in runbooks/ when it starts and appends the contents to its system prompt, so restart the bot after editing a runbook.
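Because every runbook ends up in the system prompt of every request, a large runbooks/ directory raises token cost and latency. A minimal sketch of a size-capped loader follows; `load_runbooks` and `max_chars` are hypothetical names, the 20,000-character default is an arbitrary assumption, and the section format mirrors the `## Runbook: <name>` headers used in ai_handler.py.

```python
from pathlib import Path


def load_runbooks(directory: str = "runbooks", max_chars: int = 20_000) -> str:
    """Concatenate *.md runbooks in name order, up to roughly max_chars."""
    root = Path(directory)
    if not root.exists():
        return ""
    sections, used = [], 0
    # Sorting makes the result deterministic across runs and platforms
    for path in sorted(root.glob("*.md")):
        section = f"## Runbook: {path.stem}\n{path.read_text(encoding='utf-8')}"
        if used + len(section) > max_chars:
            break  # stop here rather than cutting a runbook off mid-step
        sections.append(section)
        used += len(section)
    return "\n\n".join(sections)
```

Stopping at a whole-file boundary keeps each runbook intact; a team with many runbooks would probably want to select files by relevance to the question instead of by name order.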
Step 4: PagerDuty / Alertmanager Integration
# webhook_handler.py
import os

from fastapi import FastAPI, Request
from slack_sdk import WebClient

from ai_handler import DevOpsAI

fastapi_app = FastAPI()
slack_client = WebClient(token=os.getenv("SLACK_BOT_TOKEN"))
ai = DevOpsAI()

ALERT_CHANNEL = "#alerts"


@fastapi_app.post("/webhook/alertmanager")
async def handle_alertmanager(request: Request):
    """Receive Alertmanager webhook and post to Slack with AI analysis"""
    payload = await request.json()
    alerts = payload.get("alerts", [])
    if not alerts:
        return {"status": "no alerts"}

    # Get AI summary; .get() with defaults tolerates alerts without labels or annotations
    summary = ai.summarize_alerts([
        {
            "name": a.get("labels", {}).get("alertname"),
            "severity": a.get("labels", {}).get("severity"),
            "description": a.get("annotations", {}).get("description", "")
        }
        for a in alerts
    ])

    # Post to Slack; `text` is the fallback shown in notifications
    header = f"🚨 {len(alerts)} Alert(s) Firing"
    slack_client.chat_postMessage(
        channel=ALERT_CHANNEL,
        text=header,
        blocks=[
            {
                "type": "header",
                "text": {"type": "plain_text", "text": header}
            },
            {
                "type": "section",
                "text": {"type": "mrkdwn", "text": summary}
            },
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": "Ask me for help: `/ask how to investigate high memory` or DM me directly"
                }
            }
        ]
    )
    return {"status": "posted"}
Run It
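Before pointing Alertmanager at the webhook, the field mapping can be exercised offline with a hand-written payload. The sketch below assumes the usual Alertmanager webhook shape (an `alerts` list whose items carry `labels` and `annotations`); `extract_alerts` and `SAMPLE_PAYLOAD` are hypothetical names, and the alert values are made up for illustration.

```python
def extract_alerts(payload: dict) -> list[dict]:
    """Flatten an Alertmanager-style payload into the dicts summarize_alerts takes."""
    return [
        {
            "name": a.get("labels", {}).get("alertname", ""),
            "severity": a.get("labels", {}).get("severity", "unknown"),
            "description": a.get("annotations", {}).get("description", ""),
        }
        for a in payload.get("alerts", [])
    ]


# Hand-written sample, not captured from a real Alertmanager instance
SAMPLE_PAYLOAD = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighMemory", "severity": "critical"},
            "annotations": {"description": "prod-api memory above 90%"},
        },
        # An alert with no severity label or annotations still maps cleanly
        {"status": "firing", "labels": {"alertname": "DiskPressure"}},
    ],
}

if __name__ == "__main__":
    for alert in extract_alerts(SAMPLE_PAYLOAD):
        print(alert)
```

The same dictionary, serialized as JSON, can be POSTed to /webhook/alertmanager once the webhook server is running to check the Slack message end to end.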
# Development (Socket Mode)
python bot.py
# With webhook server
uvicorn webhook_handler:fastapi_app --port 8080 &
python bot.py
Deploy to Kubernetes
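apiVersion: apps/v1
kind: Deployment
metadata:
  name: devops-bot
  namespace: monitoring
spec:
  replicas: 1
  # apps/v1 Deployments require a selector that matches the pod template labels
  selector:
    matchLabels:
      app: devops-bot
  template:
    metadata:
      labels:
        app: devops-bot
    spec:
      containers:
        - name: bot
          image: your-registry/devops-bot:latest
          env:
            - name: SLACK_BOT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: slack-secrets
                  key: bot-token
            # bot.py also reads SLACK_APP_TOKEN for Socket Mode;
            # the key name app-token is an assumption, match it to your Secret
            - name: SLACK_APP_TOKEN
              valueFrom:
                secretKeyRef:
                  name: slack-secrets
                  key: app-token
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: anthropic-secret
                  key: api-key
Example Interactions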
# Kubernetes error
/explain-error Error: ImagePullBackOff
# Alert investigation
@devbot our CPU alert is firing on prod-api what should I check?
# General questions
@devbot what's the difference between liveness and readiness probes?
# Runbook lookup
@devbot the high memory runbook says to restart — what command?
What's Next
- Add conversation history (multi-turn context per thread)
- Connect to your metrics API for real-time data
- Build a /status command showing cluster health
- Add approval workflows for risky operations (e.g., /restart pod prod-api)
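The first item, per-thread conversation history, could start from an in-memory store like the one sketched here. `ThreadHistory` is a hypothetical helper, not part of Slack Bolt or the Anthropic SDK; it assumes threads are identified by channel plus `thread_ts` (as in `handle_mention`) and that history may be lost on restart.

```python
from collections import defaultdict, deque


class ThreadHistory:
    """Keep the last few turns per Slack thread, keyed by (channel, thread_ts)."""

    def __init__(self, max_turns: int = 10):
        # One turn is a user message plus an assistant reply, hence * 2
        self._store = defaultdict(lambda: deque(maxlen=max_turns * 2))

    def add(self, channel: str, thread_ts: str, role: str, content: str) -> None:
        self._store[(channel, thread_ts)].append({"role": role, "content": content})

    def messages(self, channel: str, thread_ts: str, question: str) -> list[dict]:
        """Return prior turns plus the new question as a role/content list."""
        history = list(self._store.get((channel, thread_ts), ()))
        # Drop any leading assistant message so the list starts with a user turn
        while history and history[0]["role"] != "user":
            history.pop(0)
        return history + [{"role": "user", "content": question}]
```

In `handle_mention`, the list returned by `messages(channel, thread_ts, question)` would replace the single-message list passed to `messages.create`, followed by two `add` calls to record the question and the reply. That requires `DevOpsAI.answer` to accept a message list, which the version above does not yet do.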
Anthropic Claude API: claude-opus-4-7 handles complex technical questions with nuance, which makes it a good fit for DevOps assistants.
KodeKloud — if this project got you interested in Kubernetes + monitoring deeper, their SRE and platform engineering tracks are the best hands-on path.