Build a Slack ChatOps Bot for Kubernetes Alerts Using Claude API (2026)
Build a Slack bot that receives Kubernetes Alertmanager webhooks, calls Claude AI to explain the alert and suggest fixes, then posts actionable runbook steps in Slack.
On-call gets paged at 2am: "KubePodCrashLooping — my-app — production". Without context, you scramble. With a ChatOps bot + LLM, Slack immediately shows: what it means, likely causes, and fix commands.
Here's how to build it.
Architecture
Alertmanager → webhook → Node.js bot → Claude API → Slack
                             ↓
                       Alert context:
                       K8s cluster info
                       Runbook lookup
The bot:
- Receives webhook from Alertmanager when an alert fires
- Enriches it with pod/node details from Kubernetes API
- Asks Claude to explain the alert and generate fix steps
- Posts a formatted message to a Slack channel
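Before wiring anything up, it helps to know what the bot actually receives. A minimal sketch of the payload, following Alertmanager's documented v4 webhook format; the label values are made up for illustration:

```javascript
// Sample Alertmanager webhook payload (v4 format); values are illustrative.
const samplePayload = {
  version: '4',
  status: 'firing',
  receiver: 'slack-bot',
  alerts: [
    {
      status: 'firing',
      labels: {
        alertname: 'KubePodCrashLooping',
        namespace: 'production',
        pod: 'my-app-7d9f8c-xkp2q',
        severity: 'critical'
      },
      annotations: { summary: 'Pod is crash looping' },
      startsAt: '2026-04-12T02:13:00Z',
      generatorURL: 'http://prometheus:9090/graph'
    }
  ]
}

// The handler later destructures exactly these label fields:
const { alertname, namespace, pod } = samplePayload.alerts[0].labels
console.log(alertname, namespace, pod)
```

Note that `alerts` is an array: Alertmanager batches grouped alerts into one POST, which is why the handler loops rather than assuming a single alert.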
Prerequisites
- Kubernetes cluster with Prometheus + Alertmanager
- Slack workspace (free tier works)
- Claude API key (get at console.anthropic.com)
- Node.js 20+
Step 1: Create Slack App
- Go to api.slack.com/apps → Create New App
- From scratch → name: `k8s-alertbot` → pick your workspace
- OAuth & Permissions → Bot Token Scopes → add `chat:write`, `channels:read`
- Install to Workspace → copy the Bot User OAuth Token (`xoxb-...`)
- Invite the bot to your alerts channel: `/invite @k8s-alertbot`
Step 2: Build the Bot
mkdir k8s-alertbot && cd k8s-alertbot
npm init -y
npm install @slack/web-api @anthropic-ai/sdk express @kubernetes/client-node

bot.js
const express = require('express')
const { WebClient } = require('@slack/web-api')
const Anthropic = require('@anthropic-ai/sdk')
const k8s = require('@kubernetes/client-node')
const app = express()
app.use(express.json())
const slack = new WebClient(process.env.SLACK_BOT_TOKEN)
const claude = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })
// Kubernetes client (in-cluster config when running in K8s)
const kc = new k8s.KubeConfig()
kc.loadFromDefault() // uses in-cluster ServiceAccount if in K8s, else ~/.kube/config
const coreApi = kc.makeApiClient(k8s.CoreV1Api)
// Fetch pod details for context
// Fetch pod details for context
// Note: this uses the pre-1.0 @kubernetes/client-node API, where
// readNamespacedPod(name, namespace) returns { body }. In client-node 1.x
// the call takes an object ({ name, namespace }) and returns the pod directly.
async function getPodContext(namespace, podName) {
  try {
    const pod = await coreApi.readNamespacedPod(podName, namespace)
    const p = pod.body
    return {
      name: p.metadata.name,
      namespace: p.metadata.namespace,
      phase: p.status.phase,
      containers: p.status.containerStatuses?.map(c => ({
        name: c.name,
        ready: c.ready,
        restartCount: c.restartCount,
        state: JSON.stringify(c.state)
      })) || [],
      conditions: p.status.conditions?.map(c => `${c.type}=${c.status}`).join(', ') || 'none',
      nodeName: p.spec.nodeName
    }
  } catch (e) {
    return { error: `Could not fetch pod: ${e.message}` }
  }
}
// Ask Claude to explain alert and suggest fixes
// Ask Claude to explain alert and suggest fixes
async function explainAlert(alertName, labels, annotations, podContext) {
  const prompt = `
You are a Kubernetes SRE assistant. An alert just fired.

Alert: ${alertName}
Labels: ${JSON.stringify(labels, null, 2)}
Annotations: ${JSON.stringify(annotations, null, 2)}
${podContext.error ? `Pod fetch failed: ${podContext.error}` : `
Pod context:
- Phase: ${podContext.phase}
- Node: ${podContext.nodeName}
- Containers: ${JSON.stringify(podContext.containers, null, 2)}
- Conditions: ${podContext.conditions}
`}
Provide:
1. **What happened** — 1-2 sentence plain English explanation
2. **Most likely causes** — top 3 causes with probability
3. **Immediate fix commands** — exact kubectl commands to diagnose and fix
4. **Escalate if** — when to wake up a senior engineer

Keep it concise. Use code blocks for commands.
`
  const response = await claude.messages.create({
    model: 'claude-opus-4-6',
    max_tokens: 800,
    messages: [{ role: 'user', content: prompt }]
  })
  return response.content[0].text
}
// Alertmanager webhook handler
// Alertmanager webhook handler
app.post('/webhook', async (req, res) => {
  res.sendStatus(200) // respond immediately so Alertmanager doesn't retry
  const { alerts } = req.body
  for (const alert of alerts) {
    if (alert.status !== 'firing') continue
    const { alertname, namespace, pod } = alert.labels
    const channel = process.env.SLACK_CHANNEL || '#alerts'
    try {
      // Get pod context if available
      let podContext = {}
      if (namespace && pod) {
        podContext = await getPodContext(namespace, pod)
      }
      // Get Claude's analysis
      const analysis = await explainAlert(
        alertname,
        alert.labels,
        alert.annotations,
        podContext
      )
      // Post to Slack with rich formatting
      await slack.chat.postMessage({
        channel,
        text: `🚨 Alert: ${alertname}`,
        blocks: [
          {
            type: 'header',
            text: { type: 'plain_text', text: `🚨 ${alertname}` }
          },
          {
            type: 'section',
            fields: [
              { type: 'mrkdwn', text: `*Namespace:*\n${namespace || 'N/A'}` },
              { type: 'mrkdwn', text: `*Pod:*\n${pod || 'N/A'}` },
              { type: 'mrkdwn', text: `*Severity:*\n${alert.labels.severity || 'unknown'}` },
              { type: 'mrkdwn', text: `*Started:*\n${new Date(alert.startsAt).toLocaleString()}` }
            ]
          },
          { type: 'divider' },
          {
            type: 'section',
            text: { type: 'mrkdwn', text: `*🤖 AI Analysis:*\n${analysis}` }
          },
          {
            type: 'context',
            elements: [
              { type: 'mrkdwn', text: `📊 <${alert.generatorURL}|View in Prometheus>` }
            ]
          }
        ]
      })
      console.log(`Alert ${alertname} sent to ${channel}`)
    } catch (err) {
      console.error('Error processing alert:', err)
    }
  }
})
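Before deploying, you can smoke-test the handler locally. A minimal sketch, assuming `bot.js` is running on `localhost:3000` with the env vars set; the alert below is a trimmed, hypothetical firing alert, and `RUN_SMOKE_TEST` is an assumed guard variable:

```javascript
// smoke-test.js — POST a fake firing alert to the local bot.
const testAlert = {
  alerts: [{
    status: 'firing',
    labels: {
      alertname: 'KubePodCrashLooping',
      namespace: 'production',
      pod: 'my-app-test',
      severity: 'critical'
    },
    annotations: { summary: 'Synthetic test alert' },
    startsAt: new Date().toISOString(),
    generatorURL: 'http://prometheus:9090/graph'
  }]
}

async function sendTestAlert() {
  // Node 20+ ships a global fetch, so no extra dependency is needed.
  const res = await fetch('http://localhost:3000/webhook', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(testAlert)
  })
  console.log('Bot responded with HTTP', res.status)
}

// Only fire the request when explicitly asked, e.g.
// RUN_SMOKE_TEST=1 node smoke-test.js
if (process.env.RUN_SMOKE_TEST) sendTestAlert()
```

If everything is wired up, the bot logs the alert, calls Claude, and a formatted message appears in your Slack channel a few seconds later.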
app.get('/health', (req, res) => res.json({ status: 'ok' }))
app.listen(3000, () => console.log('Alert bot listening on :3000'))

Step 3: Deploy to Kubernetes
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-alertbot
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: k8s-alertbot
  template:
    metadata:
      labels:
        app: k8s-alertbot
    spec:
      serviceAccountName: alertbot-sa
      containers:
        - name: alertbot
          image: your-registry/k8s-alertbot:latest
          ports:
            - containerPort: 3000
          env:
            - name: SLACK_BOT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: alertbot-secrets
                  key: slack-token
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: alertbot-secrets
                  key: anthropic-key
            - name: SLACK_CHANNEL
              value: "#ops-alerts"
          resources:
            limits:
              cpu: 200m
              memory: 256Mi
---
apiVersion: v1
kind: Service
metadata:
  name: k8s-alertbot
  namespace: monitoring
spec:
  selector:
    app: k8s-alertbot
  ports:
    - port: 3000
      targetPort: 3000
---
# RBAC: bot needs to read pods
apiVersion: v1
kind: ServiceAccount
metadata:
  name: alertbot-sa
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: alertbot-reader
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "nodes", "events"]
    verbs: ["get", "list"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: alertbot-reader
subjects:
  - kind: ServiceAccount
    name: alertbot-sa
    namespace: monitoring
roleRef:
  kind: ClusterRole
  name: alertbot-reader
  apiGroup: rbac.authorization.k8s.io

# Create secrets
kubectl create secret generic alertbot-secrets \
  --from-literal=slack-token=xoxb-your-token \
  --from-literal=anthropic-key=sk-ant-your-key \
  -n monitoring
kubectl apply -f deployment.yaml

Step 4: Configure Alertmanager
# alertmanager.yaml
route:
  receiver: 'slack-bot'
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: 'slack-bot'
    webhook_configs:
      - url: 'http://k8s-alertbot.monitoring:3000/webhook'
        send_resolved: false

Step 5: Build & Push Docker Image
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY bot.js .
EXPOSE 3000
CMD ["node", "bot.js"]

docker build -t your-registry/k8s-alertbot:latest .
docker push your-registry/k8s-alertbot:latest

Example Slack Output
When KubePodCrashLooping fires:
🚨 KubePodCrashLooping
Namespace: production Pod: my-app-7d9f8c-xkp2q
Severity: critical Started: Apr 12, 2026, 2:13 AM
🤖 AI Analysis:
**What happened**
The pod my-app-7d9f8c-xkp2q has restarted 8 times in the last 10 minutes,
indicating the container is crashing immediately after starting.
**Most likely causes**
1. (70%) Application startup error — OOMKilled or config missing
2. (20%) Liveness probe too aggressive, killing healthy pods
3. (10%) Broken init container blocking main container
**Immediate fix commands**
kubectl logs my-app-7d9f8c-xkp2q -n production --previous
kubectl describe pod my-app-7d9f8c-xkp2q -n production
kubectl get events -n production --sort-by='.lastTimestamp'
**Escalate if**
Restart count > 15 or logs show panic/OOMKilled. Wake the on-call lead.
📊 View in Prometheus
Cost Estimate
- Claude API: ~$0.002 per alert (800 tokens)
- 100 alerts/day = $0.20/day = ~$6/month
- Anthropic's trial credits cover initial testing
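The back-of-envelope math above can be checked in a few lines; the per-alert price is this article's rough estimate, not a quoted API rate:

```javascript
// Rough cost projection; perAlertUSD is an assumed average, not a quoted price.
const perAlertUSD = 0.002
const alertsPerDay = 100

const dailyUSD = perAlertUSD * alertsPerDay   // 0.20
const monthlyUSD = dailyUSD * 30              // 6.00
console.log(`~$${dailyUSD.toFixed(2)}/day, ~$${monthlyUSD.toFixed(2)}/month`)
```

Swap in your own alert volume and average token counts; even at 10x the volume this stays far below the cost of a single prolonged outage.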
Resources
- Claude API Docs — Get your API key and explore models
- KodeKloud Prometheus/Alertmanager Labs — Set up the monitoring stack this bot connects to
This bot pays for itself the first time it prevents a 2am escalation by giving on-call engineers instant context.