
Prometheus Alerts Not Firing: Every Cause and Fix

Your Prometheus alert should have fired 30 minutes ago but nothing happened. Here's every reason alerts silently fail — routing, inhibition, receivers, and rule syntax.

DevOpsBoys · May 11, 2026 · 5 min read

You wrote an alert rule. The condition is clearly met. But no notification came. No PagerDuty, no Slack, nothing.

This is one of the most frustrating problems in observability — silent alert failures. Here's every cause and how to diagnose each one.


The Alert Pipeline

Before debugging, understand the full path an alert takes:

Prometheus Rule → PENDING → FIRING
                               ↓
                        Alertmanager
                               ↓
                    Route matching → Receiver
                               ↓
                    Slack / PagerDuty / Email

Failure can happen at any stage. Work through them in order.


Step 1 — Is the Alert Even Firing in Prometheus?

Go to your Prometheus UI → Alerts tab.

Three states:

  • Inactive — condition not met (check your PromQL)
  • Pending — condition met, but the for duration hasn't elapsed yet
  • Firing — alert is firing, Alertmanager should have received it
bash
# Port-forward Prometheus UI locally
kubectl port-forward svc/prometheus-operated 9090:9090 -n monitoring

If Inactive: your PromQL expression is wrong. Test it in the Graph tab:

promql
# Example alert rule
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
  for: 5m

Test the expression rate(http_requests_total{status=~"5.."}[5m]) > 0.05 directly in the Graph tab. If it returns no results, the condition isn't met or the metric doesn't exist.

bash
# Check if metric exists
curl -s http://localhost:9090/api/v1/label/__name__/values | jq '.data[]' | grep http_requests
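If the metric exists and the expression returns data but the rule still never shows up under the Alerts tab, the rule file itself may be malformed or not loaded. You can validate it with promtool, assuming your rules live in a file named rules.yml (adjust the path to match your setup):

bash
# Validate rule file syntax (catches YAML and PromQL errors)
promtool check rules rules.yml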

Step 2 — Alert Is Pending But Never Fires

If an alert is stuck in Pending, the for duration hasn't elapsed yet. This is normal — it's the grace period to avoid flapping alerts.

Problem: with for: 1h, the alert fires only after the condition has held continuously for a full hour. If the condition briefly recovers and re-triggers, the clock resets to zero.

Fix: Reduce the for duration for critical alerts:

yaml
- alert: PodCrashLoopBackOff
  expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
  for: 5m      # Not 1h
  labels:
    severity: critical
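You can verify the for behavior offline with promtool's rule unit tests. A minimal sketch, assuming the rule above is saved in rules.yml (the test file name tests.yml is arbitrary):

yaml
# tests.yml (run with: promtool test rules tests.yml)
rule_files:
  - rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Condition held continuously from t=0
      - series: 'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}'
        values: '1x10'
    alert_rule_test:
      - eval_time: 6m       # Past the 5m for duration, so the alert fires
        alertname: PodCrashLoopBackOff
        exp_alerts:
          - exp_labels:
              reason: CrashLoopBackOff
              severity: critical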

Step 3 — Alert Is Firing But Alertmanager Didn't Receive It

Check Alertmanager is receiving alerts from Prometheus:

bash
# Check Alertmanager targets in Prometheus
# Go to: Prometheus UI → Status → Targets
# Look for alertmanager target — it should be UP
 
# Or check via API
curl http://localhost:9090/api/v1/alertmanagers

Common issue: Alertmanager URL misconfigured in Prometheus config:

yaml
# prometheus.yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093   # Must match the service name and port
bash
# Verify alertmanager service
kubectl get svc -n monitoring | grep alertmanager
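If the target is DOWN, a common culprit is that the hostname in the alerting block doesn't resolve inside the cluster. A quick sketch using a throwaway pod to test DNS (the pod name and image here are arbitrary):

bash
# Resolve the Alertmanager service from inside the cluster
kubectl run dns-test --rm -it --restart=Never --image=busybox -- \
  nslookup alertmanager.monitoring.svc.cluster.local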

Step 4 — Alertmanager Received Alert But No Notification

Check the Alertmanager UI:

bash
kubectl port-forward svc/alertmanager-operated 9093:9093 -n monitoring

Go to http://localhost:9093. Under Alerts, you should see incoming alerts.
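You can also list what Alertmanager currently holds via its v2 API. If your alert appears here, Prometheus delivered it and the problem is downstream (routing, silences, inhibition, or the receiver):

bash
# List alerts currently known to Alertmanager
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname'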

Cause A: Route Not Matching

Your alert labels don't match any route's match conditions.

yaml
# alertmanager.yaml
route:
  receiver: 'null'         # Default: a receiver with no integrations, so alerts are dropped
  routes:
    - match:
        severity: critical   # Only routes alerts with severity=critical
      receiver: pagerduty

If your alert doesn't have severity: critical, it goes to the null receiver (dropped).

Fix: Check your alert labels match the route:

yaml
# In your Prometheus rule:
labels:
  severity: critical        # Must match route's match condition
 
# In alertmanager route:
routes:
  - match:
      severity: critical
    receiver: pagerduty
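Rather than eyeballing the routing tree, you can ask amtool which receiver a given label set would reach. A sketch, assuming your config is saved as alertmanager.yaml:

bash
# Show which receiver an alert with these labels routes to
amtool config routes test --config.file=alertmanager.yaml severity=critical
# Expected: pagerduty
# With no matching labels it falls through to the default receiver: null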

Cause B: Alert Is Silenced

Check for active silences in the Alertmanager UI under Silences. Someone may have silenced the alert (common after an incident).

bash
# List silences via API
curl http://localhost:9093/api/v2/silences | jq '.[] | select(.status.state == "active")'
 
# Expire a specific silence by ID (use with caution)
curl -X DELETE http://localhost:9093/api/v2/silence/<silence-id>
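amtool offers the same operations with less typing (assuming the port-forward from above is still running):

bash
# List active silences
amtool silence query --alertmanager.url=http://localhost:9093
 
# Expire a specific silence
amtool silence expire <silence-id> --alertmanager.url=http://localhost:9093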

Cause C: Alert Is Inhibited

Inhibition rules suppress alerts when a higher-severity alert is firing.

yaml
# alertmanager.yaml
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning     # Suppresses all warnings when a critical alert fires
    equal: ['alertname', 'cluster']

If a critical alert is firing for the same cluster, all warning alerts are suppressed.

Fix: Check if any inhibition rules are matching:

bash
# In Alertmanager UI → Alerts tab
# Inhibited alerts show with a yellow indicator
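The v2 API exposes this too; each alert's status records which alerts are inhibiting it:

bash
# Show alerts currently suppressed by an inhibition rule
curl -s http://localhost:9093/api/v2/alerts | \
  jq '.[] | select(.status.inhibitedBy | length > 0) | .labels.alertname'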

Cause D: Receiver Misconfigured

Your route matches but the receiver has a bad config (wrong webhook URL, wrong Slack channel).

bash
# Check Alertmanager logs for errors
kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager | grep -i error
 
# Common errors:
# "connection refused" → wrong webhook URL
# "invalid_auth" → wrong Slack token
# "channel not found" → wrong Slack channel name

Fix — Test your Slack webhook manually:

bash
curl -X POST -H 'Content-type: application/json' \
  --data '{"text":"Test alert from Alertmanager"}' \
  https://hooks.slack.com/services/YOUR/WEBHOOK/URL
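After fixing the receiver, validate the config before reloading it. amtool catches syntax and schema errors, assuming your config file is alertmanager.yaml:

bash
# Validate Alertmanager config before applying it
amtool check-config alertmanager.yaml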

Step 5 — Group Wait / Group Interval Delaying Alerts

Alertmanager batches alerts before sending:

yaml
route:
  group_wait: 30s        # Wait 30s before sending first notification
  group_interval: 5m     # Wait 5m between notifications for same group
  repeat_interval: 4h    # Repeat notification every 4h if still firing

When a new alert group forms, group_wait: 30s delays the first notification by 30 seconds; any other alerts that fire within that window are batched into the same notification.

For critical alerts, reduce group_wait:

yaml
routes:
  - match:
      severity: critical
    receiver: pagerduty
    group_wait: 0s        # Send immediately
    group_interval: 1m

Full Diagnostic Checklist

bash
# 1. Alert visible in Prometheus UI → Alerts tab?
kubectl port-forward svc/prometheus-operated 9090:9090 -n monitoring
 
# 2. Alertmanager receiving alerts?
curl http://localhost:9090/api/v1/alertmanagers
 
# 3. Alert visible in Alertmanager UI?
kubectl port-forward svc/alertmanager-operated 9093:9093 -n monitoring
 
# 4. Any active silences?
curl http://localhost:9093/api/v2/silences
 
# 5. Alertmanager logs showing errors?
kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager --tail=50
 
# 6. Receiver config valid? (send test notification)
curl -H "Content-Type: application/json" -d \
  '[{"labels":{"alertname":"TestAlert","severity":"critical"}}]' \
  http://localhost:9093/api/v2/alerts   # use v2; the v1 alerts API was removed in recent versions
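As an alternative to raw curl for step 6, amtool can inject a synthetic alert end to end:

bash
# Fire a synthetic alert through the full pipeline
amtool alert add alertname=TestAlert severity=critical \
  --alertmanager.url=http://localhost:9093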

Prevention: Test Your Alert Pipeline

Never trust alert configs without testing them. Add an always-firing test alert:

yaml
- alert: AlertmanagerPipelineTest
  expr: vector(1)      # Always returns 1, so the alert always fires
  for: 1m
  labels:
    severity: info
  annotations:
    summary: "Test alert — Alertmanager pipeline is working"

If you get this notification, your entire pipeline is healthy.
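To keep the test alert from paging anyone, route it to a low-noise receiver. A sketch, assuming a Slack receiver named slack-pipeline-tests is defined in your config:

yaml
# alertmanager.yaml
routes:
  - match:
      alertname: AlertmanagerPipelineTest
    receiver: slack-pipeline-tests
    repeat_interval: 24h     # One heartbeat notification per day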

For hands-on Prometheus and Alertmanager labs, KodeKloud has dedicated monitoring courses with real cluster exercises.
