Build an AI Alert Classifier for Grafana Using LLMs (2026)
Tired of noisy Grafana alerts that wake you up for nothing? Build an AI layer that classifies incoming alerts as actionable or noise, enriches them with context, and routes them intelligently — using Claude or GPT-4 as the reasoning engine.
Your Grafana alerts fire 40 times a day. Maybe 5 of them actually matter.
The rest are flapping metrics, transient spikes, or conditions that auto-resolved before anyone looked. But you can't turn them off because occasionally they're real.
This is the alert fatigue problem. LLMs can help solve it.
In this post, we'll build an AI alert classifier — a service that sits between Grafana Alertmanager and your notification channels. It receives alerts, uses an LLM to classify them, enriches them with context, and routes only the ones that need human attention.
Architecture
Grafana Alertmanager
        │
        ▼
AI Alert Classifier (FastAPI)
    ├── Receives Alertmanager webhook
    ├── Fetches recent metric context from Prometheus
    ├── Asks LLM: is this actionable?
    ├── Classifies: CRITICAL / WARNING / NOISE
    └── Routes:
        ├── CRITICAL → PagerDuty + Slack
        ├── WARNING → Slack only
        └── NOISE → Logged, suppressed
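
The routing branch at the bottom of the diagram can be sketched as a small lookup table. This is a minimal illustration, not the service's actual code: the `ROUTES` table, the channel names, and the `route_for` helper are hypothetical, and an unrecognized label is assumed to fall back to the WARNING route so that a malformed model response is never silently suppressed.

```python
# Hypothetical routing table mirroring the architecture diagram.
ROUTES = {
    "CRITICAL": ["pagerduty", "slack"],
    "WARNING": ["slack"],
    "NOISE": [],  # logged only, no notification
}


def route_for(classification: str) -> list[str]:
    """Return the notification channels for a classification label."""
    label = classification.strip().upper()
    # Assumption: unknown labels are treated as WARNING rather than dropped.
    return ROUTES.get(label, ROUTES["WARNING"])


if __name__ == "__main__":
    for label in ("CRITICAL", "WARNING", "NOISE", "banana"):
        print(label, "->", route_for(label))
```

Keeping the mapping in data rather than in `if` branches makes it easy to add a channel later without touching the webhook handler.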
Prerequisites
- Kubernetes cluster with Prometheus + Grafana (kube-prometheus-stack)
- Python 3.11+
- Anthropic API key (Claude) or OpenAI API key
- Slack webhook URL
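
Before starting the service it can help to confirm these prerequisites are in place. The sketch below is a small preflight check under stated assumptions: the helper name `missing_settings` is hypothetical, and the two variable names are the ones the service in Step 1 reads from the environment.

```python
import os
import sys

# Environment variables the classifier service reads at startup (see Step 1).
REQUIRED_ENV = ("ANTHROPIC_API_KEY", "SLACK_WEBHOOK_URL")


def missing_settings(env: dict) -> list:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_ENV if not env.get(name)]


if __name__ == "__main__":
    if sys.version_info < (3, 11):
        sys.exit("Python 3.11+ is required")
    missing = missing_settings(dict(os.environ))
    if missing:
        sys.exit("Missing environment variables: " + ", ".join(missing))
    print("Prerequisites look OK")
```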
Step 1: The Alert Classifier Service
# classifier/main.py
import json
import os
from typing import List, Optional

import anthropic
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="AI Alert Classifier")
anthropic_client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
PROMETHEUS_URL = os.environ.get("PROMETHEUS_URL", "http://prometheus:9090")
SLACK_WEBHOOK = os.environ["SLACK_WEBHOOK_URL"]


class Alert(BaseModel):
    status: str
    labels: dict
    annotations: dict
    startsAt: str
    endsAt: Optional[str] = None


class AlertmanagerPayload(BaseModel):
    alerts: List[Alert]
    groupLabels: dict
    commonLabels: dict
    commonAnnotations: dict

async def get_metric_context(alert: Alert) -> str:
    """Fetch recent metric values from Prometheus for context."""
    alertname = alert.labels.get("alertname", "")
    namespace = alert.labels.get("namespace", "")
    pod = alert.labels.get("pod", "")

    context_queries = []
    if "CPUThrottling" in alertname or "CPU" in alertname:
        context_queries.append(
            f'rate(container_cpu_usage_seconds_total{{namespace="{namespace}",pod=~"{pod}.*"}}[5m])'
        )
    elif "Memory" in alertname or "OOM" in alertname:
        context_queries.append(
            f'container_memory_working_set_bytes{{namespace="{namespace}",pod=~"{pod}.*"}}'
        )
    elif "Pod" in alertname or "Restart" in alertname:
        context_queries.append(
            f'kube_pod_container_status_restarts_total{{namespace="{namespace}",pod=~"{pod}.*"}}'
        )

    results = []
    async with httpx.AsyncClient() as client:
        for query in context_queries:
            try:
                resp = await client.get(
                    f"{PROMETHEUS_URL}/api/v1/query",
                    params={"query": query},
                    timeout=5.0,
                )
                data = resp.json()
                if data.get("data", {}).get("result"):
                    results.append(f"Query: {query}")
                    for r in data["data"]["result"][:3]:  # Max 3 results
                        results.append(f"  Value: {r.get('value', ['', 'N/A'])[1]}")
            except Exception:
                # Context is best-effort: a Prometheus failure must not block classification
                pass

    return "\n".join(results) if results else "No metric context available"

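
For reference, the loop above relies on the shape of a Prometheus instant-query response, where each series carries a `[timestamp, value]` pair with the value encoded as a string. The standalone sketch below is not part of main.py: `summarize_result` is a hypothetical helper and the sample payload is made up for illustration, but the field layout follows the Prometheus HTTP API.

```python
def summarize_result(query: str, data: dict, limit: int = 3) -> list:
    """Turn an /api/v1/query response body into short context lines."""
    series = data.get("data", {}).get("result") or []
    if not series:
        return []
    lines = [f"Query: {query}"]
    for item in series[:limit]:
        # An instant vector sample is [unix_timestamp, "value-as-string"].
        value = item.get("value", ["", "N/A"])[1]
        lines.append(f"  Value: {value}")
    return lines


# Made-up sample in the instant-query response format.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod": "api-server-abc123"}, "value": [1747648800, "7"]},
        ],
    },
}

if __name__ == "__main__":
    print("\n".join(summarize_result("kube_pod_container_status_restarts_total", sample)))
```

Pulling the parsing into a pure function like this also makes it testable without a running Prometheus.
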
async def classify_alert(alert: Alert, metric_context: str) -> dict:
    """Use Claude to classify if the alert is actionable."""
    import asyncio
    import re

    prompt = f"""You are an expert SRE (Site Reliability Engineer) analyzing a Kubernetes alert.

ALERT DETAILS:
- Alert Name: {alert.labels.get('alertname', 'Unknown')}
- Severity: {alert.labels.get('severity', 'Unknown')}
- Namespace: {alert.labels.get('namespace', 'Unknown')}
- Pod/Service: {alert.labels.get('pod', alert.labels.get('service', 'Unknown'))}
- Status: {alert.status}
- Summary: {alert.annotations.get('summary', 'No summary')}
- Description: {alert.annotations.get('description', 'No description')}
- Started At: {alert.startsAt}

RECENT METRIC CONTEXT:
{metric_context}

Based on this information, classify this alert:
1. CLASSIFICATION: Choose one:
- CRITICAL: Requires immediate human action, service is degraded or down
- WARNING: Should be investigated soon but not urgent, may auto-resolve
- NOISE: Likely transient, flapping, or informational; does not require action
2. REASONING: 2-3 sentences explaining your classification
3. RECOMMENDED_ACTION: What the on-call engineer should do (or "None needed" for NOISE)
4. CONFIDENCE: HIGH / MEDIUM / LOW

Respond in JSON format:
{{
  "classification": "CRITICAL|WARNING|NOISE",
  "reasoning": "...",
  "recommended_action": "...",
  "confidence": "HIGH|MEDIUM|LOW"
}}"""

    # The Anthropic client used here is synchronous, so run the call in a
    # worker thread to avoid blocking FastAPI's event loop.
    message = await asyncio.to_thread(
        anthropic_client.messages.create,
        model="claude-sonnet-4-6",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    response_text = message.content[0].text

    # Extract JSON from response
    json_match = re.search(r'\{.*\}', response_text, re.DOTALL)
    if json_match:
        try:
            return json.loads(json_match.group())
        except json.JSONDecodeError:
            pass  # Malformed JSON: fall through to the safe default below

    return {
        "classification": "WARNING",
        "reasoning": "Could not parse LLM response",
        "recommended_action": "Manually review alert",
        "confidence": "LOW"
    }

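
Because the model's reply is free text, it is worth validating the parsed fields as well as extracting them. The sketch below is one possible hardening step, separate from main.py: `parse_classification` and the allowed-value sets are hypothetical names, and falling back to a low-confidence WARNING mirrors the default used above.

```python
import json
import re

ALLOWED_LABELS = {"CRITICAL", "WARNING", "NOISE"}
ALLOWED_CONFIDENCE = {"HIGH", "MEDIUM", "LOW"}

FALLBACK = {
    "classification": "WARNING",
    "reasoning": "Could not parse LLM response",
    "recommended_action": "Manually review alert",
    "confidence": "LOW",
}


def parse_classification(response_text: str) -> dict:
    """Extract and validate the JSON object from a model reply."""
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if not match:
        return dict(FALLBACK)
    try:
        parsed = json.loads(match.group())
    except json.JSONDecodeError:
        return dict(FALLBACK)
    if not isinstance(parsed, dict):
        return dict(FALLBACK)

    label = str(parsed.get("classification", "")).upper()
    confidence = str(parsed.get("confidence", "")).upper()
    if label not in ALLOWED_LABELS:
        # An unexpected label is treated the same as an unparseable reply.
        return dict(FALLBACK)
    return {
        "classification": label,
        "reasoning": str(parsed.get("reasoning", "")),
        "recommended_action": str(parsed.get("recommended_action", "")),
        "confidence": confidence if confidence in ALLOWED_CONFIDENCE else "LOW",
    }
```

Failing toward WARNING rather than NOISE means a parsing problem produces an extra Slack message instead of a suppressed incident.
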
async def send_to_slack(alert: Alert, classification: dict):
    """Send classified alert to Slack with enriched context."""
    emoji_map = {
        "CRITICAL": "🔴",
        "WARNING": "🟡",
        "NOISE": "⚪"
    }
    # Use .get() throughout: the model may omit a field from its JSON reply
    label = classification.get("classification", "WARNING")
    emoji = emoji_map.get(label, "⚪")
    confidence = classification.get("confidence", "MEDIUM")
    reasoning = classification.get("reasoning", "N/A")
    action = classification.get("recommended_action", "N/A")

    payload = {
        "blocks": [
            {
                "type": "header",
                "text": {
                    "type": "plain_text",
                    "text": f"{emoji} {label}: {alert.labels.get('alertname', 'Alert')}"
                }
            },
            {
                "type": "section",
                "fields": [
                    {"type": "mrkdwn", "text": f"*Namespace:*\n{alert.labels.get('namespace', 'N/A')}"},
                    {"type": "mrkdwn", "text": f"*Severity:*\n{alert.labels.get('severity', 'N/A')}"},
                    {"type": "mrkdwn", "text": f"*AI Confidence:*\n{confidence}"},
                    {"type": "mrkdwn", "text": f"*Status:*\n{alert.status}"}
                ]
            },
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*AI Analysis:*\n{reasoning}"
                }
            },
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*Recommended Action:*\n{action}"
                }
            }
        ]
    }

    async with httpx.AsyncClient() as client:
        await client.post(SLACK_WEBHOOK, json=payload)

@app.post("/webhook")
async def receive_alert(payload: AlertmanagerPayload):
    """Main webhook endpoint receiving Alertmanager notifications."""
    results = []
    for alert in payload.alerts:
        # Skip resolved alerts
        if alert.status == "resolved":
            continue

        # Get metric context from Prometheus
        metric_context = await get_metric_context(alert)

        # Classify with LLM
        classification = await classify_alert(alert, metric_context)
        # A reply without a "classification" key is treated as WARNING
        # instead of raising KeyError and failing the whole webhook call
        label = classification.get("classification", "WARNING")

        # Route based on classification
        if label in ("CRITICAL", "WARNING"):
            await send_to_slack(alert, classification)
        # For CRITICAL, you'd also call PagerDuty here

        results.append({
            "alert": alert.labels.get("alertname"),
            "classification": label,
            "confidence": classification.get("confidence")
        })

    return {"processed": len(results), "results": results}


@app.get("/health")
async def health():
    return {"status": "ok"}

Step 2: Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]

# requirements.txt
fastapi==0.115.0
uvicorn==0.30.0
anthropic==0.40.0
httpx==0.27.0
pydantic==2.8.0

Step 3: Deploy to Kubernetes
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alert-classifier
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: alert-classifier
  template:
    metadata:
      labels:
        app: alert-classifier
    spec:
      containers:
        - name: classifier
          image: myregistry/alert-classifier:v1.0.0
          ports:
            - containerPort: 8080
          env:
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: ai-classifier-secrets
                  key: anthropic-api-key
            - name: SLACK_WEBHOOK_URL
              valueFrom:
                secretKeyRef:
                  name: ai-classifier-secrets
                  key: slack-webhook
            - name: PROMETHEUS_URL
              value: "http://prometheus-operated:9090"
          resources:
            requests:
              memory: 256Mi
              cpu: 100m
            limits:
              memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: alert-classifier
  namespace: monitoring
spec:
  selector:
    app: alert-classifier
  ports:
    - port: 8080
      targetPort: 8080

Step 4: Configure Alertmanager to Send to Your Classifier
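# alertmanager.yaml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: ai-classifier
receivers:
  - name: ai-classifier
    webhook_configs:
      - url: 'http://alert-classifier.monitoring:8080/webhook'
        send_resolved: false
        # The timeout belongs on the webhook config itself; http_config has no
        # timeout field. Assumption: this option needs a recent Alertmanager
        # release, so drop the line if your version rejects it.
        timeout: 30s

In kube-prometheus-stack values.yaml: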
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      receiver: ai-classifier
    receivers:
      - name: ai-classifier
        webhook_configs:
          - url: 'http://alert-classifier.monitoring:8080/webhook'

Step 5: Test It
# Send a test alert to your classifier
# (assumes the service is reachable on localhost:8080, e.g. run locally or via kubectl port-forward)
curl -X POST http://localhost:8080/webhook \
  -H 'Content-Type: application/json' \
  -d '{
    "alerts": [{
      "status": "firing",
      "labels": {
        "alertname": "KubePodCrashLooping",
        "namespace": "production",
        "pod": "api-server-abc123",
        "severity": "critical"
      },
      "annotations": {
        "summary": "Pod api-server-abc123 is crash looping",
        "description": "Pod has restarted 5 times in the last 15 minutes"
      },
      "startsAt": "2026-05-19T10:00:00Z"
    }],
    "groupLabels": {},
    "commonLabels": {},
    "commonAnnotations": {}
  }'

Expected response (the classification and confidence come from the model, so they can vary):
{
  "processed": 1,
  "results": [{
    "alert": "KubePodCrashLooping",
    "classification": "CRITICAL",
    "confidence": "HIGH"
  }]
}

Results You Can Expect
In practice this kind of system:
- Reduces Slack alert noise by 60–80%
- Cuts on-call pages by ~50% by suppressing transient alerts
- Provides context that reduces MTTR (mean time to resolve)
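
These figures will vary by environment, so it is worth measuring them on your own alerts. As a rough sketch, the share of suppressed alerts can be computed from the `results` list that the `/webhook` endpoint returns; `suppression_rate` is a hypothetical helper and the sample data is invented for illustration.

```python
from collections import Counter


def suppression_rate(results: list) -> float:
    """Fraction of processed alerts classified as NOISE (0.0 if empty)."""
    if not results:
        return 0.0
    counts = Counter(r.get("classification") for r in results)
    return counts["NOISE"] / len(results)


# Invented sample shaped like the "results" field of the /webhook response.
sample = [
    {"alert": "KubePodCrashLooping", "classification": "CRITICAL"},
    {"alert": "CPUThrottlingHigh", "classification": "NOISE"},
    {"alert": "KubeMemoryOvercommit", "classification": "WARNING"},
    {"alert": "CPUThrottlingHigh", "classification": "NOISE"},
]

if __name__ == "__main__":
    print(f"Suppressed: {suppression_rate(sample):.0%}")  # 2 of 4 alerts
```

Comparing this number against a sample of alerts that humans have labeled by hand is a simple way to check whether suppressed alerts were truly noise.
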
The LLM is especially good at distinguishing:
- CPUThrottling that's minor vs severe
- Pod restarts that are scheduled vs unexpected
- Memory pressure during batch jobs vs genuine leaks
Affiliate note: Anthropic Claude API powers the classification — fast, reliable, and cost-effective at $3/million input tokens for claude-sonnet-4-6. For monitoring infrastructure, Grafana Cloud offers a generous free tier with Prometheus and Alertmanager included.