Build an AI Alert Classifier for Grafana Using LLMs (2026)

Tired of noisy Grafana alerts that wake you up for nothing? Build an AI layer that classifies incoming alerts as actionable or noise, enriches them with context, and routes them intelligently — using Claude or GPT-4 as the reasoning engine.

DevOpsBoys · May 19, 2026 · 6 min read

Your Grafana alerts fire 40 times a day. Maybe 5 of them actually matter.

The rest are flapping metrics, transient spikes, or conditions that auto-resolved before anyone looked. But you can't turn them off because occasionally they're real.

This is the alert fatigue problem. LLMs can help solve it.

In this post, we'll build an AI alert classifier — a service that sits between Grafana Alertmanager and your notification channels. It receives alerts, uses an LLM to classify them, enriches them with context, and routes only the ones that need human attention.


Architecture

Grafana Alertmanager
        │
        ▼
AI Alert Classifier (FastAPI)
  ├── Receives Alertmanager webhook
  ├── Fetches recent metric context from Prometheus
  ├── Asks LLM: is this actionable?
  ├── Classifies: CRITICAL / WARNING / NOISE
  └── Routes:
        ├── CRITICAL → PagerDuty + Slack
        ├── WARNING  → Slack only
        └── NOISE    → Logged, suppressed
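
The routing rules in the diagram can be sketched as a small lookup table. This is a minimal illustration, not the service's actual code: `ROUTES` and `destinations_for` are hypothetical names, and sending an unrecognized label down the WARNING route is an assumption, chosen so a malformed LLM answer is never silently suppressed.

```python
# Routing table mirroring the architecture diagram.
# ROUTES and destinations_for are illustrative names, not part of the service below.
ROUTES = {
    "CRITICAL": ["pagerduty", "slack"],
    "WARNING": ["slack"],
    "NOISE": [],  # logged and suppressed: no notification channel
}


def destinations_for(classification: str) -> list[str]:
    """Return the notification channels for a classification label."""
    label = classification.strip().upper()
    # Assumption: unknown labels fall back to the WARNING route so they stay visible.
    return list(ROUTES.get(label, ROUTES["WARNING"]))


if __name__ == "__main__":
    for label in ("CRITICAL", "WARNING", "NOISE", "garbage"):
        print(label, "->", destinations_for(label))
```

Keeping the table separate from the handler means a new channel or label can be added without touching the classification code.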

Prerequisites

  • Kubernetes cluster with Prometheus + Grafana (kube-prometheus-stack)
  • Python 3.11+
  • Anthropic API key (Claude) or OpenAI API key
  • Slack webhook URL

Step 1: The Alert Classifier Service

python
# classifier/main.py
from fastapi import FastAPI
from pydantic import BaseModel
from typing import List, Optional
import anthropic
import httpx
import json
import os
 
app = FastAPI(title="AI Alert Classifier")
 
anthropic_client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
PROMETHEUS_URL = os.environ.get("PROMETHEUS_URL", "http://prometheus:9090")
SLACK_WEBHOOK = os.environ["SLACK_WEBHOOK_URL"]
 
class Alert(BaseModel):
    status: str
    labels: dict
    annotations: dict
    startsAt: str
    endsAt: Optional[str] = None
 
class AlertmanagerPayload(BaseModel):
    alerts: List[Alert]
    groupLabels: dict
    commonLabels: dict
    commonAnnotations: dict
 
async def get_metric_context(alert: Alert) -> str:
    """Fetch recent metric values from Prometheus for context."""
    alertname = alert.labels.get("alertname", "")
    namespace = alert.labels.get("namespace", "")
    pod = alert.labels.get("pod", "")
 
    context_queries = []
 
    if "CPUThrottling" in alertname or "CPU" in alertname:
        context_queries.append(
            f'rate(container_cpu_usage_seconds_total{{namespace="{namespace}",pod=~"{pod}.*"}}[5m])'
        )
    elif "Memory" in alertname or "OOM" in alertname:
        context_queries.append(
            f'container_memory_working_set_bytes{{namespace="{namespace}",pod=~"{pod}.*"}}'
        )
    elif "Pod" in alertname or "Restart" in alertname:
        context_queries.append(
            f'kube_pod_container_status_restarts_total{{namespace="{namespace}",pod=~"{pod}.*"}}'
        )
 
    results = []
    async with httpx.AsyncClient() as client:
        for query in context_queries:
            try:
                resp = await client.get(
                    f"{PROMETHEUS_URL}/api/v1/query",
                    params={"query": query},
                    timeout=5.0
                )
                data = resp.json()
                if data.get("data", {}).get("result"):
                    results.append(f"Query: {query}")
                    for r in data["data"]["result"][:3]:  # Max 3 results
                        results.append(f"  Value: {r.get('value', ['', 'N/A'])[1]}")
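            except Exception:
                # A slow or unreachable Prometheus should not block classification;
                # skip this query and fall back to "No metric context available".
                continue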
 
    return "\n".join(results) if results else "No metric context available"
 
 
async def classify_alert(alert: Alert, metric_context: str) -> dict:
    """Use Claude to classify if the alert is actionable."""
 
    prompt = f"""You are an expert SRE (Site Reliability Engineer) analyzing a Kubernetes alert.
 
ALERT DETAILS:
- Alert Name: {alert.labels.get('alertname', 'Unknown')}
- Severity: {alert.labels.get('severity', 'Unknown')}
- Namespace: {alert.labels.get('namespace', 'Unknown')}
- Pod/Service: {alert.labels.get('pod', alert.labels.get('service', 'Unknown'))}
- Status: {alert.status}
- Summary: {alert.annotations.get('summary', 'No summary')}
- Description: {alert.annotations.get('description', 'No description')}
- Started At: {alert.startsAt}
 
RECENT METRIC CONTEXT:
{metric_context}
 
Based on this information, classify this alert:
 
1. CLASSIFICATION: Choose one:
   - CRITICAL: Requires immediate human action, service is degraded or down
   - WARNING: Should be investigated soon but not urgent, may auto-resolve
   - NOISE: Likely transient, flapping, or informational — does not require action
 
2. REASONING: 2-3 sentences explaining your classification
 
3. RECOMMENDED_ACTION: What the on-call engineer should do (or "None needed" for NOISE)
 
4. CONFIDENCE: HIGH / MEDIUM / LOW
 
Respond in JSON format:
{{
  "classification": "CRITICAL|WARNING|NOISE",
  "reasoning": "...",
  "recommended_action": "...",
  "confidence": "HIGH|MEDIUM|LOW"
}}"""
 
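    # Local imports keep this function self-contained; move them to the top of the file if you prefer.
    import asyncio
    import re

    # anthropic.Anthropic is a synchronous client, so run the call in a worker
    # thread instead of blocking the event loop while the API responds.
    message = await asyncio.to_thread(
        anthropic_client.messages.create,
        model="claude-sonnet-4-6",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )

    response_text = message.content[0].text

    # Extract the JSON object from the response and check that it is usable
    json_match = re.search(r'\{.*\}', response_text, re.DOTALL)
    if json_match:
        try:
            parsed = json.loads(json_match.group())
            if parsed.get("classification") in ("CRITICAL", "WARNING", "NOISE"):
                # send_to_slack indexes these keys directly, so make sure they exist
                parsed.setdefault("reasoning", "No reasoning provided")
                parsed.setdefault("recommended_action", "Manually review alert")
                return parsed
        except json.JSONDecodeError:
            pass  # fall through to the default below

    # Fail open: an unparseable answer is surfaced as a WARNING, never suppressed
    return {
        "classification": "WARNING",
        "reasoning": "Could not parse LLM response",
        "recommended_action": "Manually review alert",
        "confidence": "LOW"
    }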
 
 
async def send_to_slack(alert: Alert, classification: dict):
    """Send classified alert to Slack with enriched context."""
    emoji_map = {
        "CRITICAL": "🔴",
        "WARNING": "🟡",
        "NOISE": "⚪"
    }
 
    emoji = emoji_map.get(classification["classification"], "⚪")
    confidence = classification.get("confidence", "MEDIUM")
 
    payload = {
        "blocks": [
            {
                "type": "header",
                "text": {
                    "type": "plain_text",
                    "text": f"{emoji} {classification['classification']}: {alert.labels.get('alertname', 'Alert')}"
                }
            },
            {
                "type": "section",
                "fields": [
                    {"type": "mrkdwn", "text": f"*Namespace:*\n{alert.labels.get('namespace', 'N/A')}"},
                    {"type": "mrkdwn", "text": f"*Severity:*\n{alert.labels.get('severity', 'N/A')}"},
                    {"type": "mrkdwn", "text": f"*AI Confidence:*\n{confidence}"},
                    {"type": "mrkdwn", "text": f"*Status:*\n{alert.status}"}
                ]
            },
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*AI Analysis:*\n{classification['reasoning']}"
                }
            },
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*Recommended Action:*\n{classification['recommended_action']}"
                }
            }
        ]
    }
 
    async with httpx.AsyncClient() as client:
        await client.post(SLACK_WEBHOOK, json=payload)
 
 
@app.post("/webhook")
async def receive_alert(payload: AlertmanagerPayload):
    """Main webhook endpoint receiving Alertmanager notifications."""
    results = []
 
    for alert in payload.alerts:
        # Skip resolved alerts
        if alert.status == "resolved":
            continue
 
        # Get metric context from Prometheus
        metric_context = await get_metric_context(alert)
 
        # Classify with LLM
        classification = await classify_alert(alert, metric_context)
 
        # Route based on classification
        if classification["classification"] in ("CRITICAL", "WARNING"):
            await send_to_slack(alert, classification)
            # For CRITICAL, you'd also call PagerDuty here
 
        results.append({
            "alert": alert.labels.get("alertname"),
            "classification": classification["classification"],
            "confidence": classification.get("confidence")
        })
 
    return {"processed": len(results), "results": results}
 
 
@app.get("/health")
async def health():
    return {"status": "ok"}
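
The webhook handler leaves the PagerDuty call for CRITICAL alerts as a comment. Below is a minimal sketch of that step, assuming PagerDuty's Events API v2 and a hypothetical `PAGERDUTY_ROUTING_KEY` environment variable; neither appears in the service above, and the `build_pagerduty_event` and `send_pagerduty_event` helpers are illustrative names. It uses only the standard library so it can be tried in isolation; inside the service you would more likely reuse the async httpx client.

```python
import json
import os
import urllib.request

# Events API v2 endpoint; the routing key comes from a PagerDuty service integration.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_pagerduty_event(labels: dict, classification: dict, routing_key: str) -> dict:
    """Build an Events API v2 'trigger' event from alert labels and the LLM verdict."""
    alertname = labels.get("alertname", "Alert")
    namespace = labels.get("namespace", "unknown")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Assumption: the same alert in the same namespace maps to one incident.
        "dedup_key": f"{alertname}/{namespace}",
        "payload": {
            "summary": f"{alertname} in {namespace}: {classification.get('reasoning', '')}"[:1024],
            "source": labels.get("pod", namespace),
            "severity": "critical",
            "custom_details": {
                "recommended_action": classification.get("recommended_action", ""),
                "ai_confidence": classification.get("confidence", "LOW"),
            },
        },
    }


def send_pagerduty_event(event: dict, timeout: float = 5.0) -> int:
    """POST the event and return the HTTP status code (202 means accepted)."""
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    event = build_pagerduty_event(
        {"alertname": "KubePodCrashLooping", "namespace": "production", "pod": "api-server-abc123"},
        {"reasoning": "Pod is crash looping.", "recommended_action": "Check logs", "confidence": "HIGH"},
        routing_key=os.environ.get("PAGERDUTY_ROUTING_KEY", "example-key"),
    )
    # Inspect the payload first; call send_pagerduty_event(event) to actually page someone.
    print(json.dumps(event, indent=2))
```

Splitting payload construction from the network call keeps the part that is easy to get wrong (field names and severity) testable without paging anyone.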

Step 2: Dockerfile

dockerfile
FROM python:3.11-slim
 
WORKDIR /app
 
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
 
COPY . .
 
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]

txt
# requirements.txt
fastapi==0.115.0
uvicorn==0.30.0
anthropic==0.40.0
httpx==0.27.0
pydantic==2.8.0

Step 3: Deploy to Kubernetes

The Deployment reads both credentials from a Secret named ai-classifier-secrets (keys anthropic-api-key and slack-webhook) in the monitoring namespace, so create that Secret before applying the manifest.

yaml
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alert-classifier
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: alert-classifier
  template:
    metadata:
      labels:
        app: alert-classifier
    spec:
      containers:
        - name: classifier
          image: myregistry/alert-classifier:v1.0.0
          ports:
            - containerPort: 8080
          env:
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: ai-classifier-secrets
                  key: anthropic-api-key
            - name: SLACK_WEBHOOK_URL
              valueFrom:
                secretKeyRef:
                  name: ai-classifier-secrets
                  key: slack-webhook
            - name: PROMETHEUS_URL
              value: "http://prometheus-operated:9090"
          resources:
            requests:
              memory: 256Mi
              cpu: 100m
            limits:
              memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: alert-classifier
  namespace: monitoring
spec:
  selector:
    app: alert-classifier
  ports:
    - port: 8080
      targetPort: 8080

Step 4: Configure Alertmanager to Send to Your Classifier

yaml
# alertmanager.yaml
global:
  resolve_timeout: 5m
 
route:
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: ai-classifier
 
receivers:
  - name: ai-classifier
    webhook_configs:
      - url: 'http://alert-classifier.monitoring:8080/webhook'
        send_resolved: false
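        # timeout belongs to the webhook config itself (recent Alertmanager releases),
        # not to http_config; drop it if your Alertmanager version rejects the field
        timeout: 30s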

If you installed kube-prometheus-stack with Helm, set the equivalent config in values.yaml:

yaml
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      receiver: ai-classifier
    receivers:
      - name: ai-classifier
        webhook_configs:
          - url: 'http://alert-classifier.monitoring:8080/webhook'

Step 5: Test It

bash
# Send a test alert to your classifier
# (run the service locally or port-forward the Service to localhost:8080 first)
curl -X POST http://localhost:8080/webhook \
  -H 'Content-Type: application/json' \
  -d '{
    "alerts": [{
      "status": "firing",
      "labels": {
        "alertname": "KubePodCrashLooping",
        "namespace": "production",
        "pod": "api-server-abc123",
        "severity": "critical"
      },
      "annotations": {
        "summary": "Pod api-server-abc123 is crash looping",
        "description": "Pod has restarted 5 times in the last 15 minutes"
      },
      "startsAt": "2026-05-19T10:00:00Z"
    }],
    "groupLabels": {},
    "commonLabels": {},
    "commonAnnotations": {}
  }'

Expected response (the exact classification and confidence depend on the model's answer):

json
{
  "processed": 1,
  "results": [{
    "alert": "KubePodCrashLooping",
    "classification": "CRITICAL",
    "confidence": "HIGH"
  }]
}
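
The JSON-extraction step inside classify_alert can also be exercised offline, without calling the API. The sketch below repeats that logic in a standalone helper (`parse_classification` is a hypothetical name) with one added check, rejecting unknown labels, which is an assumption rather than something the original service does. The sample response strings are invented for illustration.

```python
import json
import re

VALID_LABELS = {"CRITICAL", "WARNING", "NOISE"}

# Same fallback the service returns when the model output cannot be parsed.
FALLBACK = {
    "classification": "WARNING",
    "reasoning": "Could not parse LLM response",
    "recommended_action": "Manually review alert",
    "confidence": "LOW",
}


def parse_classification(response_text: str) -> dict:
    """Pull the JSON object out of a model response, or return the fallback."""
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if not match:
        return dict(FALLBACK)
    try:
        parsed = json.loads(match.group())
    except json.JSONDecodeError:
        return dict(FALLBACK)
    # Assumption: anything outside the three known labels is treated as unparseable.
    if parsed.get("classification") not in VALID_LABELS:
        return dict(FALLBACK)
    return parsed


if __name__ == "__main__":
    wrapped = (
        'Here is my analysis:\n'
        '{"classification": "NOISE", "reasoning": "Transient spike.", '
        '"recommended_action": "None needed", "confidence": "HIGH"}'
    )
    print(parse_classification(wrapped)["classification"])
    print(parse_classification("I am not sure.")["classification"])
    print(parse_classification("{not valid json}")["classification"])
```

The three sample inputs cover a well-formed answer wrapped in prose, an answer with no JSON at all, and a brace-delimited string that is not valid JSON; the last two should both land on the WARNING fallback.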

Results You Can Expect

In practice, a classifier like this can:

  • Reduce Slack alert noise by 60–80%
  • Cut on-call pages by ~50% by suppressing transient alerts
  • Provide context that reduces MTTR (mean time to resolve)

The LLM is especially good at distinguishing:

  • CPUThrottling that's minor vs severe
  • Pod restarts that are scheduled vs unexpected
  • Memory pressure during batch jobs vs genuine leaks

Affiliate note: Anthropic Claude API powers the classification — fast, reliable, and cost-effective at $3/million input tokens for claude-sonnet-4-6. For monitoring infrastructure, Grafana Cloud offers a generous free tier with Prometheus and Alertmanager included.
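
To put the per-token price in perspective, here is a back-of-the-envelope estimate using the 40 alerts per day from the introduction and the $3 per million input tokens quoted above. The 600 input tokens per alert is an assumed figure for the prompt plus metric context, not a measurement, and output tokens are billed separately and are not included. `daily_input_cost_usd` is a hypothetical helper name.

```python
def daily_input_cost_usd(alerts_per_day: int, input_tokens_per_alert: int,
                         usd_per_million_input_tokens: float) -> float:
    """Estimate the daily spend on input tokens only."""
    total_tokens = alerts_per_day * input_tokens_per_alert
    return total_tokens / 1_000_000 * usd_per_million_input_tokens


if __name__ == "__main__":
    # 40 alerts/day and $3/M tokens come from the article; 600 tokens/alert is an assumption.
    cost = daily_input_cost_usd(40, 600, 3.0)
    print(f"~${cost:.3f} per day, ~${cost * 30:.2f} per month (input tokens only)")
```

With these assumptions the input side comes to roughly seven cents a day, so at this alert volume the cost is likely to be dominated by how much metric context you attach to each prompt.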
