Build a Slack Bot That Monitors Kubernetes Errors Using Claude API

Build a Slack bot that watches your Kubernetes cluster for errors and uses Claude AI to explain what went wrong and suggest fixes — in plain English.

DevOpsBoys · May 28, 2026 · 5 min read

Your Kubernetes cluster throws errors constantly. Most of them are noise. Some are critical. And when something breaks at 2 AM, you want an explanation in plain English — not a wall of YAML.

This bot watches your cluster, detects errors, and uses Claude API to explain them and suggest fixes — directly in Slack.


What We're Building

Kubernetes Events (errors) 
    → Python watcher
    → Claude API (explain + suggest fix)
    → Slack message

When a pod crashes, is OOMKilled, or fails a liveness probe, or when an Ingress fails, your Slack channel gets a message like:

Pod api-server-abc in namespace production is in CrashLoopBackOff

What happened: The container is repeatedly crashing and Kubernetes is backing off restarts. This is usually caused by an application error, missing environment variables, or resource limits being too low.

Suggested fix: Run kubectl logs api-server-abc -n production --previous to see the last crash logs. Check if memory limit is set too low with kubectl describe pod api-server-abc -n production.


Prerequisites

  • Python 3.10+
  • Kubernetes cluster with kubectl configured
  • Anthropic API key (created in the Anthropic Console)
  • Slack workspace with permissions to create apps

Step 1: Create the Slack App

  1. Go to api.slack.com/apps → Create New App
  2. From scratch → name it "K8s Monitor"
  3. OAuth & Permissions → add chat:write scope
  4. Install to workspace → copy Bot User OAuth Token
  5. Create a channel #k8s-alerts → invite the bot
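
Before moving on, it can help to confirm that the token from step 4 actually works. The sketch below is an illustration rather than part of the bot: `describe_bot` is a hypothetical helper name, and `client` is assumed to be a `slack_sdk.WebClient` built with your bot token (any object with an `auth_test()` method works, which also makes the helper easy to test).

```python
def describe_bot(client) -> str:
    """Return a one-line summary of the bot identity behind a Slack token.

    `client` is assumed to be a slack_sdk.WebClient, or any object whose
    auth_test() method returns a mapping with "user" and "team" keys.
    """
    # With a real WebClient, an invalid token raises SlackApiError here.
    response = client.auth_test()
    return f"Authenticated as {response['user']} in workspace {response['team']}"
```

With the real SDK you would pass `WebClient(token=...)` using the Bot User OAuth Token you copied above.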

Step 2: Project Setup

bash
mkdir k8s-slack-bot && cd k8s-slack-bot
python -m venv venv && source venv/bin/activate
 
pip install kubernetes anthropic slack-sdk python-dotenv

Then create a .env file in the project directory:

bash
# .env
ANTHROPIC_API_KEY=sk-ant-...
SLACK_BOT_TOKEN=xoxb-...
SLACK_CHANNEL=#k8s-alerts
KUBERNETES_NAMESPACE=default  # or "all" for all namespaces
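
A missing key in .env tends to surface late, as an authentication error on the first API call. A small startup check can fail faster. This is a minimal sketch under assumptions: `missing_settings` and `REQUIRED_VARS` are hypothetical names, and only the two secrets are treated as required because the other two settings have defaults in the code that follows.

```python
import os

# Settings the bot cannot run without (the other two have defaults).
REQUIRED_VARS = ("ANTHROPIC_API_KEY", "SLACK_BOT_TOKEN")

def missing_settings(env=None) -> list[str]:
    """Return the names of required settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

You could call `missing_settings()` right after loading .env and exit with a clear message if the list is not empty.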

Step 3: The Kubernetes Watcher

python
# watcher.py
from kubernetes import client, config, watch
from kubernetes.client.exceptions import ApiException
 
def get_k8s_client():
    try:
        config.load_incluster_config()  # running inside the cluster
    except config.ConfigException:
        config.load_kube_config()       # fall back to the local kubeconfig
    return client.CoreV1Api()
 
def watch_pod_events(namespace="default"):
    v1 = get_k8s_client()
    w = watch.Watch()
    
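    # Event reasons to alert on. Kubelet events usually carry reasons such as
    # "BackOff", "Failed", or "Unhealthy"; names like "CrashLoopBackOff" are
    # container states that tend to show up in the event message instead, so
    # they are kept here only as a catch-all.
    error_reasons = {
        "CrashLoopBackOff",
        "OOMKilled",
        "BackOff",
        "Failed",
        "FailedScheduling",
        "FailedMount",
        "Unhealthy",
        "ImagePullBackOff",
        "ErrImagePull",
    }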
    
    print(f"Watching namespace: {namespace}")
    
    try:
        if namespace == "all":
            stream = w.stream(v1.list_event_for_all_namespaces, timeout_seconds=0)
        else:
            stream = w.stream(v1.list_namespaced_event, namespace=namespace, timeout_seconds=0)
        
        for event in stream:
            obj = event['object']
            reason = obj.reason or ""
            
            if reason in error_reasons:
                yield {
                    "reason": reason,
                    "message": obj.message,
                    "namespace": obj.metadata.namespace,
                    "name": obj.involved_object.name,
                    "kind": obj.involved_object.kind,
                    "count": obj.count or 1,  # count is optional and may be None
                    "first_time": str(obj.first_timestamp),
                    "last_time": str(obj.last_timestamp),
                }
    except ApiException as e:
        print(f"Kubernetes API error: {e}")
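
As written, `watch_pod_events` returns once the watch stream ends or an `ApiException` is caught, and that would end the bot's loop as well. One way to keep it running is a small restart wrapper. This is a sketch, not part of the original code: `watch_with_restart`, `make_stream`, `max_restarts`, and `delay_seconds` are hypothetical names, and the fixed delay is an arbitrary choice.

```python
import time
from typing import Callable, Iterable, Iterator, Optional

def watch_with_restart(
    make_stream: Callable[[], Iterable[dict]],
    max_restarts: Optional[int] = None,   # None means restart forever
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield items from make_stream(), starting a new stream whenever one ends."""
    restarts = 0
    while True:
        # Drain the current stream; it ends when the generator returns.
        for item in make_stream():
            yield item
        if max_restarts is not None and restarts >= max_restarts:
            return
        restarts += 1
        sleep(delay_seconds)  # brief pause before reconnecting
```

In the bot, the call would look like `watch_with_restart(lambda: watch_pod_events(namespace))`. Note that a fresh watch can replay events that already exist in the cluster, so deduplication still matters after a restart.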

Step 4: Claude API for Explanations

python
# analyzer.py
import json
import os

import anthropic
 
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
 
def analyze_k8s_error(error_event: dict) -> dict:
    prompt = f"""You are a Kubernetes expert helping a DevOps engineer debug a cluster issue.
 
A Kubernetes error event occurred:
- Kind: {error_event['kind']}
- Name: {error_event['name']}
- Namespace: {error_event['namespace']}
- Reason: {error_event['reason']}
- Message: {error_event['message']}
- Occurred: {error_event['count']} times
 
Provide:
1. A simple 2-sentence explanation of what happened (no jargon)
2. The 3 most likely root causes
3. Specific kubectl commands to investigate and fix
 
Keep it concise and actionable. Format as JSON with keys: explanation, causes (list), fix_commands (list of strings)."""
 
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=600,
        messages=[{"role": "user", "content": prompt}]
    )
    
    reply = message.content[0].text
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        # The reply was not valid JSON: pass the raw text through as the explanation
        return {
            "explanation": reply,
            "causes": [],
            "fix_commands": []
        }
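
Model replies are not guaranteed to be bare JSON; they can arrive wrapped in a Markdown code fence or with a sentence of prose around the object, in which case a plain `json.loads` falls through to the raw-text fallback. The sketch below shows one tolerant way to parse such a reply. `parse_model_json` and `_fallback` are hypothetical helper names, and the fallback shape mirrors the one used above.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker

def _fallback(text: str) -> dict:
    return {"explanation": text, "causes": [], "fix_commands": []}

def parse_model_json(text: str) -> dict:
    """Parse a JSON object from a model reply, tolerating fences and stray prose."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line and, if present, the closing fence.
        newline = cleaned.find("\n")
        cleaned = cleaned[newline + 1:] if newline != -1 else ""
        if cleaned.rstrip().endswith(FENCE):
            cleaned = cleaned.rstrip()[:-len(FENCE)]
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        # Try the outermost {...} span in case prose surrounds the object.
        start, end = cleaned.find("{"), cleaned.rfind("}")
        if start == -1 or end <= start:
            return _fallback(text)
        try:
            parsed = json.loads(cleaned[start:end + 1])
        except json.JSONDecodeError:
            return _fallback(text)
    # Only a JSON object matches the shape the notifier expects.
    return parsed if isinstance(parsed, dict) else _fallback(text)
```

Inside `analyze_k8s_error`, this could replace the direct `json.loads` call on the reply text.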

Step 5: Slack Notifier

python
# notifier.py
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError
import os
 
slack_client = WebClient(token=os.getenv("SLACK_BOT_TOKEN"))
channel = os.getenv("SLACK_CHANNEL", "#k8s-alerts")
 
def send_k8s_alert(error_event: dict, analysis: dict):
    causes_text = "\n".join([f"• {c}" for c in analysis.get("causes", [])])
    commands_text = "\n".join([f"`{cmd}`" for cmd in analysis.get("fix_commands", [])])
    
    blocks = [
        {
            "type": "header",
            "text": {
                "type": "plain_text",
                "text": f"🚨 {error_event['reason']}: {error_event['kind']}/{error_event['name']}"
            }
        },
        {
            "type": "section",
            "fields": [
                {"type": "mrkdwn", "text": f"*Namespace:*\n{error_event['namespace']}"},
                {"type": "mrkdwn", "text": f"*Count:*\n{error_event['count']} times"},
            ]
        },
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f"*What happened:*\n{analysis.get('explanation', 'No explanation available')}"
            }
        },
    ]
    
    if causes_text:
        blocks.append({
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*Likely causes:*\n{causes_text}"}
        })
    
    if commands_text:
        blocks.append({
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*Investigate with:*\n{commands_text}"}
        })
    
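    try:
        # `text` is the plain-text fallback used in notifications
        slack_client.chat_postMessage(
            channel=channel,
            text=f"{error_event['reason']}: {error_event['kind']}/{error_event['name']} in {error_event['namespace']}",
            blocks=blocks,
        )
        print(f"Alert sent for {error_event['name']}")
    except SlackApiError as e:
        print(f"Slack error: {e.response['error']}")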
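
Slack rejects blocks whose text is too long: header text is limited to 150 characters and section text to 3000. Resource names and model explanations can exceed those limits, so it may be worth trimming before building the blocks. A minimal sketch, where `truncate` and the two constants are hypothetical names:

```python
HEADER_LIMIT = 150    # Slack's maximum length for header block text
SECTION_LIMIT = 3000  # Slack's maximum length for section block text

def truncate(text: str, limit: int) -> str:
    """Shorten text to at most `limit` characters, marking the cut with an ellipsis."""
    if len(text) <= limit:
        return text
    return text[: limit - 1] + "…"
```

For example, the header string could be wrapped as `truncate(header_text, HEADER_LIMIT)` and each section body as `truncate(body, SECTION_LIMIT)`.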

Step 6: Main Loop

python
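# main.py
import os
import time
from dotenv import load_dotenv
 
# Load .env before importing analyzer and notifier: both modules read
# their credentials from the environment at import time.
load_dotenv()
 
from watcher import watch_pod_events
from analyzer import analyze_k8s_error
from notifier import send_k8s_alert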
 
# Deduplicate so the same error doesn't spam the channel
seen_errors = {}
 
def should_alert(error_event: dict) -> bool:
    key = f"{error_event['namespace']}/{error_event['name']}/{error_event['reason']}"
    count = error_event['count'] or 1  # count can be missing on some events
    last_count = seen_errors.get(key)
    
    # Alert the first time we see this error, then only every 10 more
    if last_count is None or (count - last_count) >= 10:
        seen_errors[key] = count
        return True
    return False
 
def main():
    namespace = os.getenv("KUBERNETES_NAMESPACE", "default")
    print(f"K8s Slack Bot started — watching {namespace}")
    
    for error_event in watch_pod_events(namespace):
        if not should_alert(error_event):
            continue
        
        print(f"Error detected: {error_event['reason']} on {error_event['name']}")
        
        # Analyze with Claude
        analysis = analyze_k8s_error(error_event)
        
        # Send to Slack
        send_k8s_alert(error_event, analysis)
        
        # Small delay to avoid rate limits
        time.sleep(1)
 
if __name__ == "__main__":
    main()
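
To see what the deduplication rule does in practice, it helps to replay a run of counts through it. The sketch below restates the rule as a self-contained function with private state; `make_alert_gate` and `step` are hypothetical names. It also re-alerts when the count drops, on the assumption that a lower count means Kubernetes started a fresh Event object for the same problem.

```python
def make_alert_gate(step: int = 10):
    """Return a function that decides whether a (key, count) pair should alert."""
    last_alerted: dict[str, int] = {}

    def gate(key: str, count: int) -> bool:
        previous = last_alerted.get(key)
        first_time = previous is None
        # Alert on first sight, after `step` more occurrences, or on a count reset.
        if first_time or count - previous >= step or count < previous:
            last_alerted[key] = count
            return True
        return False

    return gate

gate = make_alert_gate(step=10)
key = "production/api-server-abc/BackOff"
alerts = [count for count in range(1, 26) if gate(key, count)]
print(alerts)  # counts that triggered an alert: [1, 11, 21]
```

Over 25 repeats of the same event, only three messages reach Slack.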

Deploy to Kubernetes

yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-slack-bot
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: k8s-slack-bot
  template:
    metadata:
      labels:
        app: k8s-slack-bot
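    spec:
      serviceAccountName: k8s-slack-bot-sa
      containers:
        - name: bot
          # Placeholder: the stock python image contains neither the bot code
          # nor its pip dependencies. Build an image that includes both and
          # reference it here.
          image: python:3.11-slim
          command: ["python", "/app/main.py"]
          env:
            # The Secret "bot-secrets" must already exist in this namespace
            # with the keys "anthropic-key" and "slack-token".
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: bot-secrets
                  key: anthropic-key
            - name: SLACK_BOT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: bot-secrets
                  key: slack-token
            - name: KUBERNETES_NAMESPACE
              value: "all"

The bot also needs a service account with read access to events:

yaml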
# rbac.yaml: service account plus read access to events
apiVersion: v1
kind: ServiceAccount
metadata:
  name: k8s-slack-bot-sa
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: k8s-slack-bot
rules:
  - apiGroups: [""]
    resources: ["events", "pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: k8s-slack-bot
subjects:
  - kind: ServiceAccount
    name: k8s-slack-bot-sa
    namespace: monitoring
roleRef:
  kind: ClusterRole
  name: k8s-slack-bot
  apiGroup: rbac.authorization.k8s.io

This bot turns cryptic Kubernetes events into actionable Slack messages. The deduplication logic limits alert fatigue: you get notified on the first occurrence of an error and then once for every 10 further repeats.

Get your Anthropic API key to start building. For Kubernetes hands-on practice, KodeKloud has real cluster labs.
