
AI-Powered Kubernetes Anomaly Detection: Beyond Static Thresholds

Static alerts miss 40% of real incidents. Learn how AI and ML-based anomaly detection — using tools like Prometheus + ML, Dynatrace, and custom LLM runbooks — catches what thresholds can't.

DevOpsBoys · Mar 30, 2026 · 5 min read

It's 3 AM and your Prometheus alerts are silent. CPU is at 78%. The threshold is 80%. Nothing fires. But your 99th-percentile latency has been slowly degrading for six hours, and a cascade failure is 20 minutes away.

Static thresholds don't catch this. AI anomaly detection does.

The Problem with Alert Fatigue and Threshold Gaps

Traditional monitoring has two failure modes:

  1. Too many alerts: Every metric gets a threshold, every threshold fires, on-call engineers develop alert blindness
  2. Too few alerts: Real problems that don't match threshold patterns slip through entirely

The core issue: static thresholds assume you know in advance what "bad" looks like. In distributed systems, you often don't. Anomalies are contextual — 500 RPS at 2 PM is normal, 500 RPS at 2 AM is suspicious.
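
That context-dependence is easy to operationalize. A minimal sketch (a hypothetical helper, not tied to any specific tool): score the current value against the history of the same hour of day, rather than against one fixed line.

```python
import numpy as np

def is_contextual_anomaly(history, current_value, current_hour, z_threshold=3.0):
    """Flag a value that is unusual *for this hour of day*.

    history: iterable of (hour_of_day, value) samples covering a week or more.
    """
    same_hour = np.array([v for h, v in history if h == current_hour])
    if len(same_hour) < 10:
        return False  # not enough context to judge
    mean, std = same_hour.mean(), same_hour.std()
    if std == 0:
        return bool(current_value != mean)
    return bool(abs(current_value - mean) / std > z_threshold)

# 500 RPS is normal at 14:00 but wildly off-baseline at 02:00
history = [(14, 480), (14, 510), (14, 495)] * 5 + [(2, 40), (2, 55), (2, 48)] * 5
print(is_contextual_anomaly(history, 500, current_hour=14))  # False
print(is_contextual_anomaly(history, 500, current_hour=2))   # True
```

The same 500 RPS reading produces opposite answers depending on when it occurs, which is exactly what a single static threshold cannot express.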

What AI Anomaly Detection Does Differently

ML-based anomaly detection learns what "normal" looks like for YOUR system, then flags deviations. No thresholds to tune. No need to predict failure modes in advance.

Approaches:

1. Statistical Baselines (Prophet, SARIMA)

Model time-series seasonality (daily, weekly patterns). Flag values that deviate from the predicted range.
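
Prophet and SARIMA both fit these seasonal shapes and flag points outside the model's prediction interval. The underlying idea can be sketched without either library (a toy numpy version using a robust per-slot baseline, not a stand-in for a fitted model):

```python
import numpy as np

def seasonal_anomalies(series, period, k=5.0):
    """Toy seasonal baseline: split the series into cycles of `period` samples,
    compute a robust center (median) and spread (MAD) for each slot across
    cycles, and return indices of points more than k robust deviations out."""
    series = np.asarray(series, dtype=float)
    n_cycles = len(series) // period
    grid = series[:n_cycles * period].reshape(n_cycles, period)
    med = np.median(grid, axis=0)                          # per-slot baseline
    mad = 1.4826 * np.median(np.abs(grid - med), axis=0)   # robust std estimate
    z = np.abs(grid - med) / (mad + 1e-9)
    rows, cols = np.where(z > k)
    return sorted((rows * period + cols).tolist())

# A daily pattern repeated over 7 "days" of 24 samples, with one injected spike
rng = np.random.default_rng(0)
day = 100 + 50 * np.sin(np.linspace(0, 2 * np.pi, 24))
series = np.tile(day, 7) + rng.normal(0, 2, size=7 * 24)
series[80] += 60  # anomalous spike in the 4th cycle, slot 8
print(seasonal_anomalies(series, period=24))  # includes index 80
```

The spike is far outside its slot's normal spread even though its absolute value may be unremarkable elsewhere in the day.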

2. Unsupervised ML (Isolation Forest, Autoencoders)

Learn the multi-dimensional "normal state" of your system. Flag combinations of metrics that together look unusual even if each individual metric looks fine.
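
A small illustration of that point (synthetic data, not output from any tool above): train an Isolation Forest on two metrics that normally move together, then score a point whose coordinates are each individually in range but whose combination is not.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Normal behavior: CPU tracks request rate, and both vary widely on their own
rps = rng.uniform(100, 1000, size=500)          # requests/sec
cpu = 0.05 * rps + rng.normal(0, 3, size=500)   # CPU % roughly follows load
train = np.column_stack([rps, cpu])

model = IsolationForest(contamination=0.01, random_state=42).fit(train)

# 45% CPU is a normal reading, and 150 RPS is a normal reading -- but
# 45% CPU *at* 150 RPS is not (we'd expect ~7.5% CPU at that load)
suspicious = np.array([[150.0, 45.0]])
typical = np.array([[900.0, 44.0]])             # 44% CPU at 900 RPS: expected

print(model.predict(suspicious))  # -1 means anomaly
print(model.predict(typical))     # +1 means normal
```

Per-metric thresholds would pass both points; only the joint model sees that the first one lives in an empty region of the metric space.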

3. LLM-Based Correlation

Give an LLM your recent metrics, logs, and events and ask "what's happening?" — correlate across data sources that no static alert can connect.

Option 1: Prometheus + Alertmanager with ML Scoring

You can add ML scoring on top of existing Prometheus metrics without replacing your stack.

Install prometheus-anomaly-detector:

```bash
# Run alongside your Prometheus stack
docker run -d \
  -e PROMETHEUS_URL=http://prometheus:9090 \
  -e METRIC_CHUNK_SIZE=100 \
  -p 8080:8080 \
  quay.io/thoth-station/prometheus-anomaly-detector:latest
```

This runs Facebook Prophet against your Prometheus metrics and exposes anomaly scores back as metrics — which you can alert on.

Alert on anomaly score, not raw values:

```yaml
# Prometheus alerting rule (evaluated by Prometheus, routed via Alertmanager)
groups:
- name: anomaly-detection
  rules:
  - alert: AnomalyDetectedHighScore
    expr: predicted_anomaly_score > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Anomaly detected in {{ $labels.metric_name }}"
      description: "Score {{ $value }} - values deviating from historical baseline"
```

Learn MLOps and AI-powered observability: KodeKloud has courses on Prometheus, Grafana, and advanced monitoring patterns.

Option 2: Grafana with ML Forecasting Plugin

Grafana (10 and later) supports ML-based forecasting via the grafana-ml plugin:

```bash
# Install ML plugin
grafana-cli plugins install grafana-ml-app
```

Then in Grafana, create a "Forecasting" panel — it uses Prophet to predict future values and draw confidence bands. Values outside the band trigger alerts.

This is zero-code ML anomaly detection with a UI your team already knows.

Option 3: Custom Python Anomaly Detector with Kubernetes Metrics

For more control, build a lightweight anomaly detector that reads from Prometheus:

```python
import requests
import numpy as np
from sklearn.ensemble import IsolationForest
from datetime import datetime, timedelta

PROMETHEUS_URL = "http://prometheus-server:9090"

def fetch_metric(query, hours=24):
    """Pull one series from Prometheus as a numpy array (60s resolution)."""
    end = datetime.now()
    start = end - timedelta(hours=hours)
    response = requests.get(f"{PROMETHEUS_URL}/api/v1/query_range", params={
        "query": query,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": "60"
    })
    data = response.json()["data"]["result"]
    if not data:
        return np.array([])
    values = [float(v[1]) for v in data[0]["values"]]
    return np.array(values)

def detect_anomalies(values, contamination=0.05):
    """Return indices of anomalous points according to an Isolation Forest."""
    if len(values) < 10:
        return []
    model = IsolationForest(contamination=contamination, random_state=42)
    reshaped = values.reshape(-1, 1)
    predictions = model.fit_predict(reshaped)
    return np.where(predictions == -1)[0].tolist()

def run_detection():
    metrics = {
        "cpu": 'rate(container_cpu_usage_seconds_total{namespace="production"}[5m])',
        "memory": 'container_memory_working_set_bytes{namespace="production"}',
        "latency": 'histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))',
        "error_rate": 'rate(http_requests_total{status=~"5.."}[5m])',
    }

    for metric_name, query in metrics.items():
        values = fetch_metric(query)
        if len(values) == 0:
            continue
        anomaly_indices = detect_anomalies(values)
        if anomaly_indices:
            # Only alert when the most recent anomaly falls within the last
            # 5 samples, i.e. the last ~5 minutes at 60s resolution
            recent = anomaly_indices[-1] > len(values) - 5
            if recent:
                print(f"ANOMALY: {metric_name} showing unusual pattern. "
                      f"Score: {len(anomaly_indices)/len(values):.2f}")

if __name__ == "__main__":
    # Run once and exit; the Kubernetes CronJob handles the schedule
    run_detection()
```

Deploy this as a Kubernetes CronJob:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: anomaly-detector
  namespace: monitoring
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: detector
            image: your-registry/anomaly-detector:latest
            env:
            - name: PROMETHEUS_URL
              value: "http://prometheus-server.monitoring:9090"
          restartPolicy: OnFailure
```
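
The image name above (your-registry/anomaly-detector) is a placeholder for something you build yourself. A minimal Dockerfile sketch, assuming the Python script is saved as detector.py:

```dockerfile
FROM python:3.12-slim
WORKDIR /app
RUN pip install --no-cache-dir requests numpy scikit-learn
COPY detector.py .
# PROMETHEUS_URL is injected by the CronJob's env block
CMD ["python", "detector.py"]
```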

Option 4: LLM-Powered Incident Correlation

The latest approach: give an LLM recent metrics, logs, and events and let it explain what's happening in natural language.

```python
import anthropic
import requests
from datetime import datetime, timedelta

client = anthropic.Anthropic()

def get_recent_context():
    # Fetch recent high-level metrics summary
    prom_url = "http://prometheus:9090"

    # Get error rate
    err = requests.get(f"{prom_url}/api/v1/query",
        params={"query": "rate(http_requests_total{status=~'5..'}[10m])"}).json()

    # Get p99 latency
    lat = requests.get(f"{prom_url}/api/v1/query",
        params={"query": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[10m]))"}).json()

    return {
        "error_rate": err["data"]["result"][0]["value"][1] if err["data"]["result"] else "N/A",
        "p99_latency_seconds": lat["data"]["result"][0]["value"][1] if lat["data"]["result"] else "N/A",
        "timestamp": datetime.now().isoformat()
    }

def analyze_with_llm(metrics, recent_logs):
    context = f"""
    Current Kubernetes cluster metrics (last 10 minutes):
    - HTTP Error Rate: {metrics['error_rate']}
    - P99 Latency: {metrics['p99_latency_seconds']}s

    Recent log samples:
    {recent_logs[:3000]}

    Kubernetes events (last 5 min):
    [Paste kubectl get events output here]
    """

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"You are an SRE. Analyze this Kubernetes cluster telemetry and identify any anomalies, potential root causes, and recommended actions:\n\n{context}"
        }]
    )
    return response.content[0].text

metrics = get_recent_context()
analysis = analyze_with_llm(metrics, recent_logs="...")
print(analysis)
```

This approach doesn't replace traditional alerting but augments it — especially useful during active incidents when you need to correlate across many signals quickly.

Commercial Tools Worth Evaluating

If you don't want to build custom:

  • Dynatrace: Full-stack AI anomaly detection (Davis AI engine), automatic baseline learning
  • Datadog Watchdog: Automated anomaly detection across APM, infrastructure, logs
  • New Relic Applied Intelligence: NRAI detects anomalies and correlates incidents automatically
  • Coralogix: LLM-powered log anomaly detection with Loggregation

These are expensive but save significant engineering time for large teams.

Building the Right Detection Pipeline

```
Raw Metrics → Prometheus
       ↓
ML Baseline Layer (Prophet/Isolation Forest)
       ↓
Anomaly Score Metrics → Prometheus
       ↓
Alertmanager (score > threshold)
       ↓
PagerDuty/Slack
       ↓
LLM Correlation (during incident)
       ↓
Runbook Action
```

The key insight: anomaly detection tells you that something is wrong; LLMs tell you why, and what to do about it. They are complementary, not competing, approaches.
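
One concrete way to wire the two together (a sketch; the payload shape follows Alertmanager's standard webhook format, but the helper itself is hypothetical): have Alertmanager POST firing alerts to a small service that condenses them into text and feeds that into the LLM analysis from Option 4.

```python
def summarize_alertmanager_payload(payload):
    """Condense an Alertmanager webhook payload into a text summary that can
    be prepended to the LLM prompt built in analyze_with_llm (Option 4)."""
    lines = []
    for alert in payload.get("alerts", []):
        status = alert.get("status", "unknown")
        name = alert.get("labels", {}).get("alertname", "unknown")
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append(f"- [{status}] {name}: {summary}")
    return "\n".join(lines)

# Abbreviated example of Alertmanager's webhook payload format
payload = {
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "AnomalyDetectedHighScore"},
        "annotations": {"summary": "Anomaly detected in p99_latency"},
    }]
}
print(summarize_alertmanager_payload(payload))
```

The resulting lines slot naturally into the context string assembled in Option 4, so the LLM sees which anomaly alerts triggered the investigation alongside raw metrics and logs.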

Getting Started Today

  1. Install Grafana ML plugin — zero code, 15 minutes, immediate value
  2. Add Prophet-based anomaly scoring to your top 5 most critical metrics
  3. Set up an LLM-powered Slack bot that can answer "what's happening in production?" on demand
  4. Gradually replace pure threshold alerts with anomaly score alerts for noisy metrics

Building AI-powered DevOps tooling? Start with the Claude API for LLM-based incident correlation — powerful, fast, and easy to integrate with existing Prometheus/Grafana stacks.
