Build a Slack Bot That Monitors Kubernetes Errors Using Claude API
Build a Slack bot that watches your Kubernetes cluster for errors and uses Claude AI to explain what went wrong and suggest fixes — in plain English.
Your Kubernetes cluster throws errors constantly. Most of them are noise. Some are critical. And when something breaks at 2 AM, you want an explanation in plain English — not a wall of YAML.
This bot watches your cluster, detects errors, and uses the Claude API to explain them and suggest fixes, directly in Slack.
What We're Building
Kubernetes Events (errors)
→ Python watcher
→ Claude API (explain + suggest fix)
→ Slack message
When a pod crashes, gets OOMKilled, or fails a liveness probe, or when an Ingress fails, your Slack channel gets a message like:
Pod api-server-abc in namespace production is CrashLoopBackOff
What happened: The container is repeatedly crashing and Kubernetes is backing off restarts. This is usually caused by an application error, missing environment variables, or resource limits being too low.
Suggested fix: Run kubectl logs api-server-abc --previous to see the last crash logs. Check whether the memory limit is set too low with kubectl describe pod api-server-abc.
Prerequisites
- Python 3.10+
- Kubernetes cluster with kubectl configured
- Anthropic API key
- Slack workspace with permissions to create apps
Step 1: Create the Slack App
- Go to api.slack.com/apps → Create New App
- From scratch → name it "K8s Monitor"
- OAuth & Permissions → add the chat:write scope
- Install to workspace → copy the Bot User OAuth Token
- Create a channel #k8s-alerts → invite the bot
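A quick way to confirm the token and the channel invite is a one-off smoke test. The sketch below is an assumption about how you might do that, not part of the bot itself: `check_slack_setup` is a hypothetical helper that accepts any client exposing `auth_test` and `chat_postMessage`, which `slack_sdk.WebClient` provides once the packages from Step 2 are installed.

```python
def check_slack_setup(client, channel: str) -> str:
    """Verify the bot token and post a test message.

    `client` is expected to behave like slack_sdk.WebClient
    (hypothetical helper, not part of the original bot).
    Returns the bot user name reported by Slack.
    """
    identity = client.auth_test()  # raises if the token is rejected
    client.chat_postMessage(
        channel=channel,
        text="K8s Monitor is connected.",  # plain-text test message
    )
    return identity["user"]


# Usage (needs slack-sdk and a real token):
#   from slack_sdk import WebClient
#   print(check_slack_setup(WebClient(token="xoxb-..."), "#k8s-alerts"))
```

If the bot has not been invited to the channel, Slack typically rejects the post with a `not_in_channel` error, which is a useful hint that the last step above was skipped.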
Step 2: Project Setup
mkdir k8s-slack-bot && cd k8s-slack-bot
python -m venv venv && source venv/bin/activate
pip install kubernetes anthropic slack-sdk python-dotenv

# .env
ANTHROPIC_API_KEY=sk-ant-...
SLACK_BOT_TOKEN=xoxb-...
SLACK_CHANNEL=#k8s-alerts
KUBERNETES_NAMESPACE=default # or "all" for all namespaces

Step 3: The Kubernetes Watcher
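The heart of this step is a filter: keep only events whose reason is on an error list and flatten them into a plain dict. As a minimal sketch, that logic can be pulled out of the watch loop so it can be exercised without a cluster. The function name `event_to_alert`, the shortened reason list, and the `SimpleNamespace` stand-ins for Kubernetes event objects are illustrative assumptions; the attribute names mirror the ones the watcher in this step reads.

```python
from types import SimpleNamespace

# Shortened, illustrative list of event reasons treated as errors.
ERROR_REASONS = {"BackOff", "Failed", "FailedScheduling", "FailedMount", "Unhealthy"}


def event_to_alert(obj, error_reasons=ERROR_REASONS):
    """Return a flat dict for error events, or None for everything else."""
    reason = obj.reason or ""
    if reason not in error_reasons:
        return None
    return {
        "reason": reason,
        "message": obj.message,
        "namespace": obj.metadata.namespace,
        "name": obj.involved_object.name,
        "kind": obj.involved_object.kind,
        "count": obj.count or 1,  # count may be unset on some events
    }


# Stand-in for an event object from the Kubernetes client (assumed shape).
fake_event = SimpleNamespace(
    reason="BackOff",
    message="Back-off restarting failed container",
    count=None,
    metadata=SimpleNamespace(namespace="production"),
    involved_object=SimpleNamespace(name="api-server-abc", kind="Pod"),
)
alert = event_to_alert(fake_event)
print(alert["reason"], alert["name"], alert["count"])
```

Keeping the filter separate from the API call also makes it easy to adjust the reason list without touching the streaming code.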
# watcher.py
from kubernetes import client, config, watch
from kubernetes.client.exceptions import ApiException


def get_k8s_client():
    try:
        config.load_incluster_config()  # running inside the cluster
    except config.ConfigException:
        config.load_kube_config()  # fall back to the local kubeconfig
    return client.CoreV1Api()


def watch_pod_events(namespace="default"):
    """Yield a flat dict for every event whose reason looks like an error."""
    v1 = get_k8s_client()
    w = watch.Watch()
    # Event reasons treated as errors
    error_reasons = {
        "CrashLoopBackOff",
        "OOMKilled",
        "BackOff",
        "Failed",
        "FailedScheduling",
        "FailedMount",
        "Unhealthy",
        "ImagePullBackOff",
        "ErrImagePull",
    }
    print(f"Watching namespace: {namespace}")
    try:
        if namespace == "all":
            stream = w.stream(v1.list_event_for_all_namespaces, timeout_seconds=0)
        else:
            stream = w.stream(v1.list_namespaced_event, namespace=namespace, timeout_seconds=0)
        for event in stream:
            obj = event["object"]
            reason = obj.reason or ""
            if reason in error_reasons:
                yield {
                    "reason": reason,
                    "message": obj.message,
                    "namespace": obj.metadata.namespace,
                    "name": obj.involved_object.name,
                    "kind": obj.involved_object.kind,
                    # count can be None on some events; treat that as a first occurrence
                    "count": obj.count or 1,
                    "first_time": str(obj.first_timestamp),
                    "last_time": str(obj.last_timestamp),
                }
    except ApiException as e:
        print(f"Kubernetes API error: {e}")

Step 4: Claude API for Explanations
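This step asks the model to reply in JSON, but a reply can arrive wrapped in a Markdown code fence or with a sentence before the object, in which case a plain `json.loads` fails. The following is a hedged sketch of a more tolerant parser. `parse_model_json` is a hypothetical helper, and the fallback shape is an assumption that matches the keys requested in the prompt.

```python
import json
import re


def parse_model_json(reply: str) -> dict:
    """Best-effort extraction of a JSON object from a model reply."""
    fallback = {"explanation": reply.strip(), "causes": [], "fix_commands": []}
    # Take the span from the first "{" to the last "}", which skips
    # any code fence or preamble around the object.
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        return fallback
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return fallback
    return parsed if isinstance(parsed, dict) else fallback


# Example: a reply wrapped in a Markdown fence (built piecewise so the
# example itself contains no literal fence).
fence = "`" * 3
payload = {
    "explanation": "The container ran out of memory.",
    "causes": ["memory limit too low"],
    "fix_commands": ["kubectl describe pod api-server-abc"],
}
reply = fence + "json\n" + json.dumps(payload) + "\n" + fence
print(parse_model_json(reply)["causes"])
```

When no JSON object can be recovered, the raw reply is still shown to the reader as the explanation, so an alert is never dropped just because of formatting.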
# analyzer.py
import json
import os

import anthropic

client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))


def analyze_k8s_error(error_event: dict) -> dict:
    prompt = f"""You are a Kubernetes expert helping a DevOps engineer debug a cluster issue.
A Kubernetes error event occurred:
- Kind: {error_event['kind']}
- Name: {error_event['name']}
- Namespace: {error_event['namespace']}
- Reason: {error_event['reason']}
- Message: {error_event['message']}
- Occurred: {error_event['count']} times
Provide:
1. A simple 2-sentence explanation of what happened (no jargon)
2. The 3 most likely root causes
3. Specific kubectl commands to investigate and fix
Keep it concise and actionable. Format as JSON with keys: explanation, causes (list), fix_commands (list of strings)."""

    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=600,
        messages=[{"role": "user", "content": prompt}],
    )
    reply = message.content[0].text
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        # Fall back to the raw text if the reply is not valid JSON
        return {
            "explanation": reply,
            "causes": [],
            "fix_commands": [],
        }

Step 5: Slack Notifier
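One thing to watch in this step is message length. Block Kit caps text length (150 characters for a header and 3000 for a section text field, according to Slack's documentation at the time of writing), and a long resource name or model reply can exceed that, which makes the post fail. A minimal sketch of a guard, assuming those limits; `clip` is a hypothetical helper, not something the notifier already has.

```python
HEADER_LIMIT = 150    # assumed Block Kit limit for header text
SECTION_LIMIT = 3000  # assumed Block Kit limit for section text


def clip(text: str, limit: int) -> str:
    """Truncate text to at most `limit` characters, marking the cut with an ellipsis."""
    if len(text) <= limit:
        return text
    return text[: limit - 1] + "…"


header = clip("BackOff: Pod/" + "x" * 200, HEADER_LIMIT)
print(len(header))
```

Wrapping the header string and the explanation in `clip(...)` before building the blocks keeps an unusually long event from blocking the alert entirely.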
# notifier.py
import os

from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

slack_client = WebClient(token=os.getenv("SLACK_BOT_TOKEN"))
channel = os.getenv("SLACK_CHANNEL", "#k8s-alerts")


def send_k8s_alert(error_event: dict, analysis: dict):
    causes_text = "\n".join(f"• {c}" for c in analysis.get("causes", []))
    commands_text = "\n".join(f"`{cmd}`" for cmd in analysis.get("fix_commands", []))
    title = f"🚨 {error_event['reason']}: {error_event['kind']}/{error_event['name']}"
    blocks = [
        {
            "type": "header",
            "text": {
                "type": "plain_text",
                "text": title,
            },
        },
        {
            "type": "section",
            "fields": [
                {"type": "mrkdwn", "text": f"*Namespace:*\n{error_event['namespace']}"},
                {"type": "mrkdwn", "text": f"*Count:*\n{error_event['count']} times"},
            ],
        },
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f"*What happened:*\n{analysis.get('explanation', 'No explanation available')}",
            },
        },
    ]
    if causes_text:
        blocks.append({
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*Likely causes:*\n{causes_text}"},
        })
    if commands_text:
        blocks.append({
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*Investigate with:*\n{commands_text}"},
        })
    try:
        # text is the plain-text fallback shown in notifications
        slack_client.chat_postMessage(channel=channel, text=title, blocks=blocks)
        print(f"Alert sent for {error_event['name']}")
    except SlackApiError as e:
        print(f"Slack error: {e.response['error']}")

Step 6: Main Loop
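The main loop throttles repeats: it alerts on the first occurrence of an error, then again only once the event count has grown by 10 since the last alert. A small self-contained sketch makes the rule concrete. Here the state is passed in as a dict, and `should_alert_for` is a hypothetical name; the step of 10 comes from the article.

```python
def should_alert_for(seen: dict, key: str, count: int, step: int = 10) -> bool:
    """Alert on count == 1, then whenever count has grown by `step` since the last alert."""
    last = seen.get(key, 0)
    if count == 1 or count - last >= step:
        seen[key] = count
        return True
    return False


seen = {}
key = "production/api-server-abc/BackOff"
decisions = [should_alert_for(seen, key, c) for c in (1, 5, 11, 12, 21)]
print(decisions)  # [True, False, True, False, True]
```

One consequence of this rule: an error first seen with a count between 2 and 9, for example because the bot started late, stays silent until its count reaches 10.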
# main.py
import os
import time

from dotenv import load_dotenv

# Load .env before importing analyzer and notifier: both read their
# credentials from the environment at import time.
load_dotenv()

from watcher import watch_pod_events
from analyzer import analyze_k8s_error
from notifier import send_k8s_alert

# Deduplicate: don't spam the same error
seen_errors = {}


def should_alert(error_event: dict) -> bool:
    key = f"{error_event['namespace']}/{error_event['name']}/{error_event['reason']}"
    last_count = seen_errors.get(key, 0)
    # Alert on first occurrence, then only after every 10 more
    if error_event['count'] == 1 or (error_event['count'] - last_count) >= 10:
        seen_errors[key] = error_event['count']
        return True
    return False


def main():
    namespace = os.getenv("KUBERNETES_NAMESPACE", "default")
    print(f"K8s Slack Bot started, watching {namespace}")
    for error_event in watch_pod_events(namespace):
        if not should_alert(error_event):
            continue
        print(f"Error detected: {error_event['reason']} on {error_event['name']}")
        # Analyze with Claude
        analysis = analyze_k8s_error(error_event)
        # Send to Slack
        send_k8s_alert(error_event, analysis)
        # Small delay to avoid rate limits
        time.sleep(1)


if __name__ == "__main__":
    main()

Deploy to Kubernetes
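In the cluster there is no .env file; the settings come from the Secret and env entries in the manifest, so a variable that is left out or empty may only surface at the first API call. A hedged sketch of a startup check that fails fast instead. `missing_settings` is a hypothetical helper, and the required names are the ones introduced in Step 2.

```python
import os

REQUIRED = ("ANTHROPIC_API_KEY", "SLACK_BOT_TOKEN")


def missing_settings(env=None, required=REQUIRED) -> list:
    """Return the names of required settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


example_env = {"ANTHROPIC_API_KEY": "sk-ant-...", "SLACK_BOT_TOKEN": ""}
print(missing_settings(example_env))  # ['SLACK_BOT_TOKEN']
```

Calling this at the top of main() and exiting with a clear message when the list is non-empty makes a misconfigured pod show its problem in kubectl logs right away.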
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-slack-bot
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: k8s-slack-bot
  template:
    metadata:
      labels:
        app: k8s-slack-bot
    spec:
      serviceAccountName: k8s-slack-bot-sa
      containers:
        - name: bot
          # Replace with an image that contains the bot code and its
          # dependencies; the stock python image has no /app/main.py.
          image: python:3.11-slim
          command: ["python", "/app/main.py"]
          env:
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: bot-secrets
                  key: anthropic-key
            - name: SLACK_BOT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: bot-secrets
                  key: slack-token
            - name: KUBERNETES_NAMESPACE
              value: "all"

# rbac.yaml: the bot needs read access to events
# ServiceAccount referenced by the Deployment and the binding
apiVersion: v1
kind: ServiceAccount
metadata:
  name: k8s-slack-bot-sa
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: k8s-slack-bot
rules:
  - apiGroups: [""]
    resources: ["events", "pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: k8s-slack-bot
subjects:
  - kind: ServiceAccount
    name: k8s-slack-bot-sa
    namespace: monitoring
roleRef:
  kind: ClusterRole
  name: k8s-slack-bot
  apiGroup: rbac.authorization.k8s.io

This bot turns cryptic Kubernetes events into actionable Slack messages. The deduplication logic prevents alert fatigue: you get notified on the first occurrence of an error and then again after every 10 further occurrences.
Get your Anthropic API key to start building. For Kubernetes hands-on practice, KodeKloud has real cluster labs.