How to Set Up AIOps-Powered Alerting with Grafana Machine Learning in 2026

Step-by-step guide to setting up Grafana's machine learning features for anomaly detection, predictive alerting, and intelligent noise reduction. Stop alert fatigue with AI.

DevOpsBoys · Mar 22, 2026 · 5 min read

Static thresholds are broken. Setting an alert for "CPU > 80%" sounds reasonable until your batch job legitimately uses 95% CPU every night at 2 AM and pages your on-call engineer for no reason.

The fix isn't better thresholds — it's machine learning that learns what "normal" looks like for each metric and alerts only on real anomalies.

Grafana Cloud now has ML-powered features built in. This guide walks you through setting up anomaly detection, predictive alerting, and intelligent alert grouping from scratch.

What Grafana ML Can Do

Grafana's machine learning features include:

  1. Anomaly Detection — automatically learns normal patterns and flags deviations
  2. Metric Forecasting — predicts future values (disk full in 3 days, etc.)
  3. Adaptive Alerting — dynamic thresholds that adjust to your data patterns
  4. Sift — AI-powered investigation that finds related changes when anomalies occur
  5. Alert Correlation — groups related alerts into single incidents

Prerequisites

  • Grafana Cloud account (ML features require Pro or Enterprise plan)
  • Prometheus or Grafana Mimir as data source
  • At least 2 weeks of historical metric data (ML needs training data)
  • Grafana Cloud Alerting enabled

If you're self-hosting Grafana, ML features require the Grafana ML plugin (paid).

Step 1 — Enable Machine Learning in Grafana Cloud

Navigate to Grafana Cloud → Machine Learning in your Grafana instance.

If you don't see the ML section:

  1. Go to Administration → Plugins
  2. Search for "Machine Learning"
  3. Install and enable the Grafana ML plugin
  4. Restart Grafana (or reload if using Grafana Cloud)

Step 2 — Create Your First Anomaly Detection Job

Anomaly detection learns the normal pattern of a metric and creates an anomaly score.

Via the UI

  1. Go to Machine Learning → Anomaly Detection → New Job
  2. Configure the job:
    • Name: api-latency-anomaly
    • Data source: Your Prometheus instance
    • Query: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="api"}[5m]))
    • Training window: 14 days (minimum 7 days)
    • Detection sensitivity: Medium (start here, tune later)
  3. Click Create and wait for the initial training (5-15 minutes)

Via the API

```bash
curl -X POST https://your-grafana.grafana.net/api/plugins/grafana-ml-app/resources/api/v1/anomaly-detection/jobs \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "api-latency-anomaly",
    "datasourceUid": "prometheus-uid",
    "query": {
      "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service=\"api\"}[5m]))",
      "refId": "A"
    },
    "trainingWindow": 1209600,
    "interval": 300
  }'
```
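The API takes durations in seconds, which makes the payload values easy to misread. This quick sketch shows how the numbers used in this guide are derived (plain arithmetic, no Grafana assumptions):

```python
DAY = 86400  # seconds in a day

training_window_anomaly = 14 * DAY   # anomaly job: 14-day training window
interval = 5 * 60                    # evaluate every 5 minutes
training_window_forecast = 30 * DAY  # forecast job in step 5: 30 days
forecast_horizon = 7 * DAY           # forecast job in step 5: 7 days ahead

print(training_window_anomaly)   # 1209600
print(interval)                  # 300
print(training_window_forecast)  # 2592000
print(forecast_horizon)          # 604800
```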

Step 3 — Understand the Anomaly Score

The ML model outputs an anomaly score between 0 and 1:

  • 0.0 - 0.3: Normal behavior
  • 0.3 - 0.7: Unusual but possibly expected (deploy spike, etc.)
  • 0.7 - 1.0: Significant anomaly — likely a real issue
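If you route scores in your own tooling, the bands above translate directly into code. A minimal sketch (the function name and thresholds mirror the bands described here, nothing more):

```python
def classify_anomaly(score: float) -> str:
    """Map an anomaly score in [0, 1] to the bands described above."""
    if score < 0.3:
        return "normal"
    if score < 0.7:
        return "unusual"   # possibly expected: deploy spike, etc.
    return "anomaly"       # likely a real issue

print(classify_anomaly(0.12))  # normal
print(classify_anomaly(0.55))  # unusual
print(classify_anomaly(0.91))  # anomaly
```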

You can query the anomaly score like any Prometheus metric:

```promql
# Anomaly score for the job
grafana_ml_anomaly_score{job_name="api-latency-anomaly"}

# Predicted (expected) value
grafana_ml_anomaly_predicted{job_name="api-latency-anomaly"}

# Actual value
grafana_ml_anomaly_actual{job_name="api-latency-anomaly"}
```
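Because these are ordinary series, you can also combine them. For example, subtracting predicted from actual gives a residual that is positive when the metric runs above the learned baseline (assuming both series carry matching labels, as in the queries above):

```promql
# Residual: how far the actual value sits above the learned baseline
grafana_ml_anomaly_actual{job_name="api-latency-anomaly"}
  - grafana_ml_anomaly_predicted{job_name="api-latency-anomaly"}
```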

Step 4 — Create ML-Powered Alerts

Now create alerts based on anomaly scores instead of static thresholds:

Create an Alert Rule

  1. Go to Alerting → Alert Rules → New Alert Rule
  2. Set the query:
    ```promql
    grafana_ml_anomaly_score{job_name="api-latency-anomaly"} > 0.8
    ```
  3. Configure evaluation:
    • Evaluate every: 1 minute
    • For: 5 minutes (avoid alerting on brief spikes)
  4. Set labels and notifications:
    • Severity: warning (for 0.7-0.85), critical (for 0.85+)
    • Notification channel: Slack, PagerDuty, etc.
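The severity tiers above can also live in routing code if you post-process alerts outside Grafana. A sketch of that mapping (the function name is hypothetical; the cutoffs are the 0.7-0.85 warning and 0.85+ critical tiers from this step):

```python
from typing import Optional

def severity_for(score: float) -> Optional[str]:
    """Map an anomaly score to the alert severity tiers used in this guide."""
    if score >= 0.85:
        return "critical"
    if score >= 0.7:
        return "warning"
    return None  # below the alerting range

print(severity_for(0.92))  # critical
print(severity_for(0.78))  # warning
```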

Multi-Tier Anomaly Alerting

```yaml
# Grafana alerting rule (provisioning file)
apiVersion: 1
groups:
  - orgId: 1
    name: ML Anomaly Alerts
    folder: Infrastructure
    interval: 1m
    rules:
      - uid: api-anomaly-warning
        title: "API Latency Anomaly Detected"
        condition: B
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              expr: "grafana_ml_anomaly_score{job_name='api-latency-anomaly'}"
          - refId: B
            datasourceUid: __expr__
            model:
              type: threshold
              expression: A   # the threshold expression evaluates query A
              conditions:
                - evaluator:
                    type: gt
                    params: [0.75]
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual API latency pattern detected"
          description: "Anomaly score: {{ $values.A }}. This doesn't match the learned normal pattern."

      - uid: api-anomaly-critical
        title: "API Latency Critical Anomaly"
        condition: B
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              expr: "grafana_ml_anomaly_score{job_name='api-latency-anomaly'}"
          - refId: B
            datasourceUid: __expr__
            model:
              type: threshold
              expression: A   # the threshold expression evaluates query A
              conditions:
                - evaluator:
                    type: gt
                    params: [0.9]
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Critical API latency anomaly"
          description: "Anomaly score: {{ $values.A }}. Immediate investigation required."
```

Step 5 — Set Up Metric Forecasting

Predict future values to catch issues before they happen.

Disk Space Prediction

```bash
curl -X POST https://your-grafana.grafana.net/api/plugins/grafana-ml-app/resources/api/v1/forecast/jobs \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "disk-usage-forecast",
    "datasourceUid": "prometheus-uid",
    "query": {
      "expr": "100 - (node_filesystem_avail_bytes{mountpoint=\"/\"} / node_filesystem_size_bytes{mountpoint=\"/\"} * 100)",
      "refId": "A"
    },
    "trainingWindow": 2592000,
    "forecastHorizon": 604800
  }'
```

This creates a forecast 7 days into the future based on 30 days of training data.

Predictive Alert

```promql
# Alert if disk will be full within 3 days
grafana_ml_forecast_predicted{job_name="disk-usage-forecast", forecast_horizon="3d"} > 90
```

This fires 3 days before the disk fills up, giving you plenty of time to clean up or expand storage.
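To build intuition for what the forecast is doing, here is a back-of-the-envelope linear version (hypothetical numbers; the real model also accounts for seasonality and trend changes):

```python
def days_until_threshold(current_pct: float,
                         growth_pct_per_day: float,
                         threshold_pct: float = 90.0):
    """Linear extrapolation: days until usage crosses the threshold."""
    if growth_pct_per_day <= 0:
        return None  # not growing, never crosses
    return max(0.0, (threshold_pct - current_pct) / growth_pct_per_day)

# Disk at 84% and growing 2 percentage points per day:
print(days_until_threshold(84.0, 2.0))  # 3.0 -> about 3 days of headroom
```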

Step 6 — Build an ML-Powered Dashboard

Create a dashboard that shows anomaly detection in context:

Panel 1 — Actual vs Predicted (Time Series)

```promql
# Actual value
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="api"}[5m]))

# ML predicted value (normal band)
grafana_ml_anomaly_predicted{job_name="api-latency-anomaly"}
```

Display both on the same graph. The gap between actual and predicted is the anomaly.

Panel 2 — Anomaly Score (Gauge)

```promql
grafana_ml_anomaly_score{job_name="api-latency-anomaly"}
```

Set thresholds: green (0-0.3), yellow (0.3-0.7), red (0.7-1.0).

Panel 3 — Forecast (Time Series)

```promql
# Current disk usage
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)

# Forecasted disk usage
grafana_ml_forecast_predicted{job_name="disk-usage-forecast"}
```

Show a dashed line extending into the future. Add a horizontal threshold at 90% to visualize when it'll hit critical.

Step 7 — Tune Sensitivity

After running for a week, tune your anomaly detection:

Too Many Alerts (False Positives)

  • Increase detection threshold from 0.75 to 0.85
  • Increase the for duration from 5m to 10m
  • Exclude known patterns (scheduled jobs, deployments) using recording rules
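One way to exclude a known nightly batch window is a Prometheus recording rule that drops the score during that window, then alert on the recorded series instead. A sketch under assumed names (the rule and record names are illustrative; `hour()` is UTC):

```yaml
# prometheus rule file (illustrative names)
groups:
  - name: ml-anomaly-masking
    rules:
      - record: job:anomaly_score:outside_batch_window
        expr: |
          grafana_ml_anomaly_score{job_name="api-latency-anomaly"}
            unless on() (hour() >= 2 and hour() < 3)
```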

Missing Real Issues (False Negatives)

  • Lower detection threshold to 0.6
  • Decrease the for duration to 3m
  • Create separate anomaly jobs for different time periods (business hours vs off-hours)

Handling Deployments

Deployments cause expected metric changes. Suppress anomaly alerts during deploy windows:

```promql
# Only alert when there's no active deployment
grafana_ml_anomaly_score{job_name="api-latency-anomaly"} > 0.8
  unless on() (deploy_in_progress == 1)
```
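The `deploy_in_progress` series has to come from somewhere. One common option (assumed here, not prescribed by Grafana) is pushing a flag metric to a Prometheus Pushgateway from your deploy pipeline, and deleting it when the deploy finishes:

```shell
# Build the exposition-format payload for the flag metric
payload="deploy_in_progress 1"
echo "$payload"

# Push at deploy start (hypothetical Pushgateway host; uncomment when reachable):
# echo "$payload" | curl --data-binary @- http://pushgateway:9091/metrics/job/deploys
# Delete at deploy end so the suppression lifts:
# curl -X DELETE http://pushgateway:9091/metrics/job/deploys
```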

Common Anomaly Detection Jobs to Create

Metric           | Why                                 | Sensitivity
-----------------|-------------------------------------|------------
API P95 latency  | Catch performance regressions       | Medium
Error rate       | Detect outages before users notice  | High
Request rate     | Spot traffic anomalies (DDoS, bots) | Low
Memory usage     | Catch leaks early                   | Medium
Disk I/O         | Predict disk issues                 | Medium
Pod restart rate | Catch crashloop patterns            | High
Queue depth      | Detect processing bottlenecks       | Medium
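If you're creating several of these jobs, scripting the payloads beats clicking through the UI. This sketch only builds request bodies matching the shape of the step 2 curl example (the queries and job names here are illustrative, and the API shape is assumed from that example):

```python
import json

# Illustrative (name, PromQL expr) pairs drawn from the table above
JOBS = {
    "error-rate-anomaly": 'sum(rate(http_requests_total{status=~"5.."}[5m]))',
    "memory-usage-anomaly": "container_memory_working_set_bytes",
}

def job_payload(name: str, expr: str, training_days: int = 14) -> str:
    """JSON body mirroring the anomaly-detection job example in step 2."""
    return json.dumps({
        "name": name,
        "datasourceUid": "prometheus-uid",
        "query": {"expr": expr, "refId": "A"},
        "trainingWindow": training_days * 86400,  # API expects seconds
        "interval": 300,
    })

for name, expr in JOBS.items():
    print(job_payload(name, expr))
```

Each payload would then be POSTed to the same endpoint as in step 2 with your API key.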

Wrapping Up

ML-powered alerting isn't about replacing your existing monitoring — it's about making it smarter. Instead of guessing thresholds and waking up at 2 AM for false alarms, let the ML model learn what normal looks like and only alert when something genuinely unusual happens.

Start with 2-3 critical metrics, let the model train for a week, then gradually expand. The first time it catches a real anomaly that a static threshold would have missed, you'll never go back.

Want to master Prometheus, Grafana, and production monitoring with hands-on labs? The KodeKloud observability course covers monitoring, alerting, and the complete observability stack. For hosting your monitoring infrastructure, DigitalOcean offers reliable infrastructure with built-in monitoring.
