How to Set Up AIOps-Powered Alerting with Grafana Machine Learning in 2026
Step-by-step guide to setting up Grafana's machine learning features for anomaly detection, predictive alerting, and intelligent noise reduction. Stop alert fatigue with AI.
Static thresholds are broken. Setting an alert for "CPU > 80%" sounds reasonable until your batch job legitimately uses 95% CPU every night at 2 AM and pages your on-call engineer for no reason.
The fix isn't better thresholds — it's machine learning that learns what "normal" looks like for each metric and alerts only on real anomalies.
Grafana Cloud now has ML-powered features built in. This guide walks you through setting up anomaly detection, predictive alerting, and intelligent alert grouping from scratch.
What Grafana ML Can Do
Grafana's machine learning features include:
- Anomaly Detection — automatically learns normal patterns and flags deviations
- Metric Forecasting — predicts future values (disk full in 3 days, etc.)
- Adaptive Alerting — dynamic thresholds that adjust to your data patterns
- Sift — AI-powered investigation that finds related changes when anomalies occur
- Alert Correlation — groups related alerts into single incidents
Prerequisites
- Grafana Cloud account (ML features require Pro or Enterprise plan)
- Prometheus or Grafana Mimir as data source
- At least 2 weeks of historical metric data (ML needs training data)
- Grafana Cloud Alerting enabled
If you're self-hosting Grafana, ML features require the Grafana ML plugin (paid).
Step 1 — Enable Machine Learning in Grafana Cloud
Navigate to Grafana Cloud → Machine Learning in your Grafana instance.
If you don't see the ML section:
- Go to Administration → Plugins
- Search for "Machine Learning"
- Install and enable the Grafana ML plugin
- Restart Grafana (or reload if using Grafana Cloud)
Step 2 — Create Your First Anomaly Detection Job
Anomaly detection learns the normal pattern of a metric and creates an anomaly score.
Via the UI
- Go to Machine Learning → Anomaly Detection → New Job
- Configure the job:
- Name: api-latency-anomaly
- Data source: your Prometheus instance
- Query: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="api"}[5m]))
- Training window: 14 days (minimum 7 days)
- Detection sensitivity: Medium (start here, tune later)
- Click Create and wait for the initial training (5-15 minutes)
Via the API
curl -X POST https://your-grafana.grafana.net/api/plugins/grafana-ml-app/resources/api/v1/anomaly-detection/jobs \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "api-latency-anomaly",
    "datasourceUid": "prometheus-uid",
    "query": {
      "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service=\"api\"}[5m]))",
      "refId": "A"
    },
    "trainingWindow": 1209600,
    "interval": 300
  }'

Step 3 — Understand the Anomaly Score
The ML model outputs an anomaly score between 0 and 1:
- 0.0 - 0.3: Normal behavior
- 0.3 - 0.7: Unusual but possibly expected (deploy spike, etc.)
- 0.7 - 1.0: Significant anomaly — likely a real issue
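As an illustration, the banding above maps to severity tiers like this (a toy sketch; the 0.3 and 0.7 cut points are the ones listed, and you should tune them per metric):

```python
def severity(score: float) -> str:
    """Map a Grafana ML anomaly score (0-1) to a rough tier.

    Cut points mirror the bands above; tune them per metric.
    """
    if score < 0.3:
        return "normal"
    if score < 0.7:
        return "unusual"
    return "anomaly"

print(severity(0.15))  # normal behavior
print(severity(0.50))  # unusual but possibly expected
print(severity(0.92))  # significant anomaly
```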
You can query the anomaly score like any Prometheus metric:
# Anomaly score for the job
grafana_ml_anomaly_score{job_name="api-latency-anomaly"}
# Predicted (expected) value
grafana_ml_anomaly_predicted{job_name="api-latency-anomaly"}
# Actual value
grafana_ml_anomaly_actual{job_name="api-latency-anomaly"}

Step 4 — Create ML-Powered Alerts
Now create alerts based on anomaly scores instead of static thresholds:
Create an Alert Rule
- Go to Alerting → Alert Rules → New Alert Rule
- Set the query:
grafana_ml_anomaly_score{job_name="api-latency-anomaly"} > 0.8
- Configure evaluation:
- Evaluate every: 1 minute
- For: 5 minutes (avoid alerting on brief spikes)
- Set labels and notifications:
- Severity: warning (for 0.7-0.85), critical (for 0.85+)
- Notification channel: Slack, PagerDuty, etc.
Multi-Tier Anomaly Alerting
# Grafana alerting rule (provisioning file)
apiVersion: 1
groups:
  - orgId: 1
    name: ML Anomaly Alerts
    folder: Infrastructure
    interval: 1m
    rules:
      - uid: api-anomaly-warning
        title: "API Latency Anomaly Detected"
        condition: B
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              expr: "grafana_ml_anomaly_score{job_name='api-latency-anomaly'}"
          - refId: B
            datasourceUid: __expr__
            model:
              type: threshold
              conditions:
                - evaluator:
                    type: gt
                    params: [0.75]
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual API latency pattern detected"
          description: "Anomaly score: {{ $values.A }}. This doesn't match the learned normal pattern."
      - uid: api-anomaly-critical
        title: "API Latency Critical Anomaly"
        condition: B
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              expr: "grafana_ml_anomaly_score{job_name='api-latency-anomaly'}"
          - refId: B
            datasourceUid: __expr__
            model:
              type: threshold
              conditions:
                - evaluator:
                    type: gt
                    params: [0.9]
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Critical API latency anomaly"
          description: "Anomaly score: {{ $values.A }}. Immediate investigation required."

Step 5 — Set Up Metric Forecasting
Predict future values to catch issues before they happen.
Disk Space Prediction
curl -X POST https://your-grafana.grafana.net/api/plugins/grafana-ml-app/resources/api/v1/forecast/jobs \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "disk-usage-forecast",
    "datasourceUid": "prometheus-uid",
    "query": {
      "expr": "100 - (node_filesystem_avail_bytes{mountpoint=\"/\"} / node_filesystem_size_bytes{mountpoint=\"/\"} * 100)",
      "refId": "A"
    },
    "trainingWindow": 2592000,
    "forecastHorizon": 604800
  }'

This creates a forecast 7 days into the future based on 30 days of training data.
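The trainingWindow and forecastHorizon fields are plain seconds. A quick sanity check of the numbers in the payload above:

```python
DAY = 86400  # seconds per day

training_window = 30 * DAY   # 30 days of history
forecast_horizon = 7 * DAY   # predict 7 days ahead

print(training_window)   # 2592000, matches the payload
print(forecast_horizon)  # 604800, matches the payload
```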
Predictive Alert
# Alert if disk will be full within 3 days
grafana_ml_forecast_predicted{job_name="disk-usage-forecast", forecast_horizon="3d"} > 90

This fires 3 days before the disk fills up, giving you plenty of time to clean up or expand storage.
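For intuition about what the forecast is doing, the simplest version is a straight-line projection (PromQL's predict_linear works the same way; the ML job additionally models seasonality). A toy sketch with made-up growth numbers:

```python
def days_until_limit(current_pct: float, growth_pct_per_day: float,
                     limit_pct: float = 90.0) -> float:
    """Linearly project when disk usage crosses the alert limit."""
    if growth_pct_per_day <= 0:
        return float("inf")  # flat or shrinking usage never crosses
    return (limit_pct - current_pct) / growth_pct_per_day

# At 75% used and growing 5 points/day, the 90% limit is 3 days out --
# exactly the lead time the predictive alert above is meant to give you.
print(days_until_limit(75.0, 5.0))  # 3.0
```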
Step 6 — Build an ML-Powered Dashboard
Create a dashboard that shows anomaly detection in context:
Panel 1 — Actual vs Predicted (Time Series)
# Actual value
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="api"}[5m]))
# ML predicted value (normal band)
grafana_ml_anomaly_predicted{job_name="api-latency-anomaly"}

Display both on the same graph. The gap between actual and predicted is the anomaly.
Panel 2 — Anomaly Score (Gauge)
grafana_ml_anomaly_score{job_name="api-latency-anomaly"}

Set thresholds: green (0-0.3), yellow (0.3-0.7), red (0.7-1.0).
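If you provision dashboards as code, the gauge thresholds look roughly like this in panel JSON (a trimmed sketch following Grafana's dashboard schema; in practice, configure the panel in the UI and copy its JSON rather than hand-writing it):

```json
{
  "type": "gauge",
  "title": "Anomaly Score",
  "targets": [
    { "expr": "grafana_ml_anomaly_score{job_name=\"api-latency-anomaly\"}", "refId": "A" }
  ],
  "fieldConfig": {
    "defaults": {
      "min": 0,
      "max": 1,
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 0.3 },
          { "color": "red", "value": 0.7 }
        ]
      }
    }
  }
}
```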
Panel 3 — Forecast (Time Series)
# Current disk usage
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)
# Forecasted disk usage
grafana_ml_forecast_predicted{job_name="disk-usage-forecast"}

Show a dashed line extending into the future. Add a horizontal threshold at 90% to visualize when it'll hit critical.
Step 7 — Tune Sensitivity
After running for a week, tune your anomaly detection:
Too Many Alerts (False Positives)
- Increase detection threshold from 0.75 to 0.85
- Increase the "for" duration from 5m to 10m
- Exclude known patterns (scheduled jobs, deployments) using recording rules
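As one way to exclude a known pattern, a Prometheus recording rule can blank out the score during a scheduled window so alerts built on it can't fire then (the rule name and the 02:00 UTC batch window below are hypothetical; adjust to your own jobs):

```yaml
groups:
  - name: ml-anomaly-tuning
    rules:
      # The score series is dropped entirely during 02:00-02:59 UTC,
      # the (hypothetical) nightly batch window.
      - record: api:anomaly_score:outside_batch_window
        expr: |
          grafana_ml_anomaly_score{job_name="api-latency-anomaly"}
            unless on() (hour() == 2)
```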
Missing Real Issues (False Negatives)
- Lower detection threshold to 0.6
- Decrease the "for" duration to 3m
- Create separate anomaly jobs for different time periods (business hours vs off-hours)
Handling Deployments
Deployments cause expected metric changes. Suppress anomaly alerts during deploy windows:
# Only alert when there's no active deployment
grafana_ml_anomaly_score{job_name="api-latency-anomaly"} > 0.8
unless on() (deploy_in_progress == 1)

Common Anomaly Detection Jobs to Create
| Metric | Why | Sensitivity |
|---|---|---|
| API P95 latency | Catch performance regressions | Medium |
| Error rate | Detect outages before users notice | High |
| Request rate | Spot traffic anomalies (DDoS, bot) | Low |
| Memory usage | Catch leaks early | Medium |
| Disk I/O | Predict disk issues | Medium |
| Pod restart rate | Catch crashloop patterns | High |
| Queue depth | Detect processing bottlenecks | Medium |
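Creating each of these by hand gets tedious; a small script can build the job payloads and POST them to the anomaly-detection endpoint from Step 2. The metric names and queries below are illustrative placeholders, and the payload fields are the same ones shown in the Step 2 API call:

```python
import json

# Illustrative metric -> PromQL map (placeholders; substitute your own queries)
JOBS = {
    "error-rate-anomaly": 'sum(rate(http_requests_total{code=~"5.."}[5m]))',
    "memory-usage-anomaly": "sum(container_memory_working_set_bytes)",
    "queue-depth-anomaly": "sum(queue_depth)",
}

def build_payload(name: str, expr: str) -> dict:
    """Build a job payload with the same fields as the Step 2 API call."""
    return {
        "name": name,
        "datasourceUid": "prometheus-uid",  # replace with your datasource UID
        "query": {"expr": expr, "refId": "A"},
        "trainingWindow": 14 * 86400,  # 14 days, in seconds
        "interval": 300,
    }

for name, expr in JOBS.items():
    # POST each payload to .../api/v1/anomaly-detection/jobs as in Step 2
    print(json.dumps(build_payload(name, expr)))
```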
Wrapping Up
ML-powered alerting isn't about replacing your existing monitoring — it's about making it smarter. Instead of guessing thresholds and waking up at 2 AM for false alarms, let the ML model learn what normal looks like and only alert when something genuinely unusual happens.
Start with 2-3 critical metrics, let the model train for a week, then gradually expand. The first time it catches a real anomaly that a static threshold would have missed, you'll never go back.
Want to master Prometheus, Grafana, and production monitoring with hands-on labs? The KodeKloud observability course covers monitoring, alerting, and the complete observability stack. For hosting your monitoring infrastructure, DigitalOcean offers reliable infrastructure with built-in monitoring.