How to Set Up AIOps-Powered Alerting with Grafana Machine Learning in 2026
Step-by-step guide to setting up Grafana's machine learning features for anomaly detection, predictive alerting, and intelligent noise reduction. Stop alert fatigue with AI.
Static thresholds are broken. Setting an alert for "CPU > 80%" sounds reasonable until your batch job legitimately uses 95% CPU every night at 2 AM and pages your on-call engineer for no reason.
The fix isn't better thresholds — it's machine learning that learns what "normal" looks like for each metric and alerts only on real anomalies.
Grafana Cloud now has ML-powered features built in. This guide walks you through setting up anomaly detection, predictive alerting, and intelligent alert grouping from scratch.
What Grafana ML Can Do
Grafana's machine learning features include:
- Anomaly Detection — automatically learns normal patterns and flags deviations
- Metric Forecasting — predicts future values (disk full in 3 days, etc.)
- Adaptive Alerting — dynamic thresholds that adjust to your data patterns
- Sift — AI-powered investigation that finds related changes when anomalies occur
- Alert Correlation — groups related alerts into single incidents
Prerequisites
- Grafana Cloud account (ML features require Pro or Enterprise plan)
- Prometheus or Grafana Mimir as data source
- At least 2 weeks of historical metric data (ML needs training data)
- Grafana Cloud Alerting enabled
If you're self-hosting Grafana, ML features require the Grafana ML plugin (paid).
Step 1 — Enable Machine Learning in Grafana Cloud
Navigate to Grafana Cloud → Machine Learning in your Grafana instance.
If you don't see the ML section:
- Go to Administration → Plugins
- Search for "Machine Learning"
- Install and enable the Grafana ML plugin
- Restart Grafana (or reload if using Grafana Cloud)
Step 2 — Create Your First Anomaly Detection Job
Anomaly detection learns the normal pattern of a metric and creates an anomaly score.
Via the UI
- Go to Machine Learning → Anomaly Detection → New Job
- Configure the job:
- Name: api-latency-anomaly
- Data source: your Prometheus instance
- Query: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="api"}[5m]))
- Training window: 14 days (minimum 7 days)
- Detection sensitivity: Medium (start here, tune later)
- Click Create and wait for the initial training (5-15 minutes)
Via the API
curl -X POST https://your-grafana.grafana.net/api/plugins/grafana-ml-app/resources/api/v1/anomaly-detection/jobs \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "api-latency-anomaly",
    "datasourceUid": "prometheus-uid",
    "query": {
      "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service=\"api\"}[5m]))",
      "refId": "A"
    },
    "trainingWindow": 1209600,
    "interval": 300
  }'

Step 3 — Understand the Anomaly Score
The ML model outputs an anomaly score between 0 and 1:
- 0.0 - 0.3: Normal behavior
- 0.3 - 0.7: Unusual but possibly expected (deploy spike, etc.)
- 0.7 - 1.0: Significant anomaly — likely a real issue
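As an illustration, the banding above maps to severity tiers like this (a toy sketch; the 0.3 and 0.7 cut points are the ones listed, and you should tune them per metric):

```python
def severity(score: float) -> str:
    """Map a Grafana ML anomaly score (0-1) to a rough tier.

    Cut points mirror the bands above; tune them per metric.
    """
    if score < 0.3:
        return "normal"
    if score < 0.7:
        return "unusual"
    return "anomaly"

print(severity(0.15))  # normal behavior
print(severity(0.50))  # unusual but possibly expected
print(severity(0.92))  # significant anomaly
```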
You can query the anomaly score like any Prometheus metric:
# Anomaly score for the job
grafana_ml_anomaly_score{job_name="api-latency-anomaly"}
# Predicted (expected) value
grafana_ml_anomaly_predicted{job_name="api-latency-anomaly"}
# Actual value
grafana_ml_anomaly_actual{job_name="api-latency-anomaly"}

Step 4 — Create ML-Powered Alerts
Now create alerts based on anomaly scores instead of static thresholds:
Create an Alert Rule
- Go to Alerting → Alert Rules → New Alert Rule
- Set the query:
grafana_ml_anomaly_score{job_name="api-latency-anomaly"} > 0.8
- Configure evaluation:
- Evaluate every: 1 minute
- For: 5 minutes (avoid alerting on brief spikes)
- Set labels and notifications:
- Severity: warning (for 0.7-0.85), critical (for 0.85+)
- Notification channel: Slack, PagerDuty, etc.
Multi-Tier Anomaly Alerting
# Grafana alerting rule (provisioning file)
apiVersion: 1
groups:
  - orgId: 1
    name: ML Anomaly Alerts
    folder: Infrastructure
    interval: 1m
    rules:
      - uid: api-anomaly-warning
        title: "API Latency Anomaly Detected"
        condition: B
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              expr: "grafana_ml_anomaly_score{job_name='api-latency-anomaly'}"
          - refId: B
            datasourceUid: __expr__
            model:
              type: threshold
              conditions:
                - evaluator:
                    type: gt
                    params: [0.75]
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual API latency pattern detected"
          description: "Anomaly score: {{ $values.A }}. This doesn't match the learned normal pattern."
      - uid: api-anomaly-critical
        title: "API Latency Critical Anomaly"
        condition: B
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              expr: "grafana_ml_anomaly_score{job_name='api-latency-anomaly'}"
          - refId: B
            datasourceUid: __expr__
            model:
              type: threshold
              conditions:
                - evaluator:
                    type: gt
                    params: [0.9]
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Critical API latency anomaly"
          description: "Anomaly score: {{ $values.A }}. Immediate investigation required."

Step 5 — Set Up Metric Forecasting
Predict future values to catch issues before they happen.
Disk Space Prediction
curl -X POST https://your-grafana.grafana.net/api/plugins/grafana-ml-app/resources/api/v1/forecast/jobs \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "disk-usage-forecast",
    "datasourceUid": "prometheus-uid",
    "query": {
      "expr": "100 - (node_filesystem_avail_bytes{mountpoint=\"/\"} / node_filesystem_size_bytes{mountpoint=\"/\"} * 100)",
      "refId": "A"
    },
    "trainingWindow": 2592000,
    "forecastHorizon": 604800
  }'

This creates a forecast 7 days into the future based on 30 days of training data.
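The trainingWindow and forecastHorizon fields are plain seconds. A quick sanity check of the numbers in the payload above:

```python
DAY = 86400  # seconds per day

training_window = 30 * DAY   # 30 days of history
forecast_horizon = 7 * DAY   # predict 7 days ahead

print(training_window)   # 2592000, matches the payload
print(forecast_horizon)  # 604800, matches the payload
```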
Predictive Alert
# Alert if disk will be full within 3 days
grafana_ml_forecast_predicted{job_name="disk-usage-forecast", forecast_horizon="3d"} > 90

This fires 3 days before the disk fills up, giving you plenty of time to clean up or expand storage.
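For intuition about what the forecast is doing, the simplest version is a straight-line projection (PromQL's predict_linear works the same way; the ML job additionally models seasonality). A toy sketch with made-up growth numbers:

```python
def days_until_limit(current_pct: float, growth_pct_per_day: float,
                     limit_pct: float = 90.0) -> float:
    """Linearly project when disk usage crosses the alert limit."""
    if growth_pct_per_day <= 0:
        return float("inf")  # flat or shrinking usage never crosses
    return (limit_pct - current_pct) / growth_pct_per_day

# At 75% used and growing 5 points/day, the 90% limit is 3 days out --
# exactly the lead time the predictive alert above is meant to give you.
print(days_until_limit(75.0, 5.0))  # 3.0
```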
Step 6 — Build an ML-Powered Dashboard
Create a dashboard that shows anomaly detection in context:
Panel 1 — Actual vs Predicted (Time Series)
# Actual value
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="api"}[5m]))
# ML predicted value (normal band)
grafana_ml_anomaly_predicted{job_name="api-latency-anomaly"}

Display both on the same graph. The gap between actual and predicted is the anomaly.
Panel 2 — Anomaly Score (Gauge)
grafana_ml_anomaly_score{job_name="api-latency-anomaly"}

Set thresholds: green (0-0.3), yellow (0.3-0.7), red (0.7-1.0).
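If you provision dashboards as code, the gauge thresholds look roughly like this in panel JSON (a trimmed sketch following Grafana's dashboard schema; in practice, configure the panel in the UI and copy its JSON rather than hand-writing it):

```json
{
  "type": "gauge",
  "title": "Anomaly Score",
  "targets": [
    { "expr": "grafana_ml_anomaly_score{job_name=\"api-latency-anomaly\"}", "refId": "A" }
  ],
  "fieldConfig": {
    "defaults": {
      "min": 0,
      "max": 1,
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 0.3 },
          { "color": "red", "value": 0.7 }
        ]
      }
    }
  }
}
```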
Panel 3 — Forecast (Time Series)
# Current disk usage
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)
# Forecasted disk usage
grafana_ml_forecast_predicted{job_name="disk-usage-forecast"}

Show a dashed line extending into the future. Add a horizontal threshold at 90% to visualize when it'll hit critical.
Step 7 — Tune Sensitivity
After running for a week, tune your anomaly detection:
Too Many Alerts (False Positives)
- Increase detection threshold from 0.75 to 0.85
- Increase the "for" duration from 5m to 10m
- Exclude known patterns (scheduled jobs, deployments) using recording rules
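As one way to exclude a known pattern, a Prometheus recording rule can blank out the score during a scheduled window so alerts built on it can't fire then (the rule name and the 02:00 UTC batch window below are hypothetical; adjust to your own jobs):

```yaml
groups:
  - name: ml-anomaly-tuning
    rules:
      # The score series is dropped entirely during 02:00-02:59 UTC,
      # the (hypothetical) nightly batch window.
      - record: api:anomaly_score:outside_batch_window
        expr: |
          grafana_ml_anomaly_score{job_name="api-latency-anomaly"}
            unless on() (hour() == 2)
```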
Missing Real Issues (False Negatives)
- Lower detection threshold to 0.6
- Decrease the "for" duration to 3m
- Create separate anomaly jobs for different time periods (business hours vs off-hours)
Handling Deployments
Deployments cause expected metric changes. Suppress anomaly alerts during deploy windows:
# Only alert when there's no active deployment
grafana_ml_anomaly_score{job_name="api-latency-anomaly"} > 0.8
unless on() (deploy_in_progress == 1)

Common Anomaly Detection Jobs to Create
| Metric | Why | Sensitivity |
|---|---|---|
| API P95 latency | Catch performance regressions | Medium |
| Error rate | Detect outages before users notice | High |
| Request rate | Spot traffic anomalies (DDoS, bot) | Low |
| Memory usage | Catch leaks early | Medium |
| Disk I/O | Predict disk issues | Medium |
| Pod restart rate | Catch crashloop patterns | High |
| Queue depth | Detect processing bottlenecks | Medium |
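Creating each of these by hand gets tedious; a small script can build the job payloads and POST them to the anomaly-detection endpoint from Step 2. The metric names and queries below are illustrative placeholders, and the payload fields are the same ones shown in the Step 2 API call:

```python
import json

# Illustrative metric -> PromQL map (placeholders; substitute your own queries)
JOBS = {
    "error-rate-anomaly": 'sum(rate(http_requests_total{code=~"5.."}[5m]))',
    "memory-usage-anomaly": "sum(container_memory_working_set_bytes)",
    "queue-depth-anomaly": "sum(queue_depth)",
}

def build_payload(name: str, expr: str) -> dict:
    """Build a job payload with the same fields as the Step 2 API call."""
    return {
        "name": name,
        "datasourceUid": "prometheus-uid",  # replace with your datasource UID
        "query": {"expr": expr, "refId": "A"},
        "trainingWindow": 14 * 86400,  # 14 days, in seconds
        "interval": 300,
    }

for name, expr in JOBS.items():
    # POST each payload to .../api/v1/anomaly-detection/jobs as in Step 2
    print(json.dumps(build_payload(name, expr)))
```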
Wrapping Up
ML-powered alerting isn't about replacing your existing monitoring — it's about making it smarter. Instead of guessing thresholds and waking up at 2 AM for false alarms, let the ML model learn what normal looks like and only alert when something genuinely unusual happens.
Start with 2-3 critical metrics, let the model train for a week, then gradually expand. The first time it catches a real anomaly that a static threshold would have missed, you'll never go back.
Want to master Prometheus, Grafana, and production monitoring with hands-on labs? The KodeKloud observability course covers monitoring, alerting, and the complete observability stack. For hosting your monitoring infrastructure, DigitalOcean offers reliable infrastructure with built-in monitoring.