Prometheus High Cardinality Causing OOM — How to Find and Fix It (2026)

Prometheus is OOMKilled. Or it's running fine but consuming 20GB of RAM on a cluster with only 10 nodes. Or queries are taking 30 seconds when they used to take 1.

The culprit is almost always high cardinality metrics.

Here's how to diagnose and fix it.

What Is High Cardinality?

Every unique combination of label values in Prometheus creates a new time series. Cardinality = total number of unique time series.

A metric like this is fine:

http_requests_total{method="GET", status="200"}
http_requests_total{method="POST", status="404"}

2 time series. No problem.

A metric like this will destroy your Prometheus:

http_requests_total{user_id="user-a1b2c3", request_id="req-xyz789"}

If 100,000 users make 1,000,000 requests in total, every request carries its own request_id, so this single metric produces at least 1,000,000 time series, each of which only ever counts one request.
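To make the arithmetic concrete, here is a minimal Python sketch. The function names and sample data are illustrative and not part of Prometheus. It contrasts the worst-case series count (the product of distinct values per label) with the number of label combinations actually observed.

```python
from math import prod


def worst_case_series(values_per_label):
    """Upper bound: every label value combined with every other label value."""
    return prod(values_per_label.values())


def observed_series(label_sets):
    """Actual cardinality: number of distinct label combinations seen."""
    return len({tuple(sorted(labels.items())) for labels in label_sets})


# Hypothetical inputs for illustration only.
low_cardinality_bound = worst_case_series({"method": 5, "status": 10})

samples = [
    {"method": "GET", "status": "200"},
    {"method": "POST", "status": "404"},
    {"method": "GET", "status": "200"},  # same series as the first entry
]
```

Real metrics usually land somewhere between the observed count and the worst-case bound, which is why one unbounded label is enough to dominate the total.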

Symptoms of High Cardinality

Prometheus pod OOMKilled repeatedly
kubectl top pod prometheus-0 showing 10GB+ RAM
Queries time out or return slowly
TSDB compaction taking forever
/api/v1/status/tsdb shows millions of series

Step 1: Find the High Cardinality Metrics

Option A — Prometheus TSDB status page:

Go to http://your-prometheus:9090/tsdb-status

This shows:

Top 10 series count by metric names: your main offenders
Top 10 label names with value count: which labels have the most distinct values
Top 10 series count by label value pairs: the specific label=value pairs attached to the most series

This is the fastest way to find the problem.
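The same statistics are available as JSON from /api/v1/status/tsdb, which is handy for scripting. The sketch below assumes the response carries a seriesCountByMetricName list of name/value pairs, as current Prometheus versions return; check the field names against your version. The sample payload and its numbers are invented for illustration.

```python
import json
from urllib.request import urlopen


def fetch_tsdb_status(base_url, timeout=10):
    """GET /api/v1/status/tsdb and return the parsed JSON body."""
    url = f"{base_url.rstrip('/')}/api/v1/status/tsdb"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def top_metrics(status, n=10):
    """Return [(metric_name, series_count)], highest series count first."""
    rows = status.get("data", {}).get("seriesCountByMetricName", [])
    ranked = sorted(rows, key=lambda row: row["value"], reverse=True)
    return [(row["name"], row["value"]) for row in ranked[:n]]


# Hypothetical payload shaped like the API response (values are made up).
sample_status = {
    "status": "success",
    "data": {
        "seriesCountByMetricName": [
            {"name": "go_goroutines", "value": 40},
            {"name": "http_requests_total", "value": 900000},
        ]
    },
}
```

Against a live server you would call top_metrics(fetch_tsdb_status("http://your-prometheus:9090")).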

Option B — PromQL queries:

promql

# Top metrics by series count
topk(10, count by (__name__)({__name__=~".+"}))
 
# Total series count
count({__name__=~".+"})
 
# Series per job
count by (job)({__name__=~".+"})

Option C — Cardinality Explorer (Grafana):

In Grafana → Explore → select the Prometheus data source → open the "Metrics browser" to inspect which labels and label values a metric carries.

Step 2: Identify Which Labels Are Exploding

Once you know the metric (e.g., http_requests_total), find which label has high cardinality:

promql

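# Number of distinct values each label takes on this metric
# (the inner count groups series per value, the outer count counts the groups)
count(count by (user_id)(http_requests_total))
count(count by (request_id)(http_requests_total))
count(count by (url)(http_requests_total))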
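If you would rather check every label at once, the series metadata endpoint (/api/v1/series with a match[] selector) returns one label set per series. A minimal sketch of the counting step, assuming you have already loaded those label sets into a list of dicts; the sample data is invented:

```python
from collections import defaultdict


def distinct_values_per_label(series):
    """Return [(label_name, distinct_value_count)], most values first."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    counts = ((name, len(seen)) for name, seen in values.items())
    # Sort by count descending, then by name for a stable order.
    return sorted(counts, key=lambda item: (-item[1], item[0]))


# Hypothetical label sets, shaped like the "data" array of /api/v1/series.
sample_series = [
    {"__name__": "http_requests_total", "method": "GET", "user_id": "u1"},
    {"__name__": "http_requests_total", "method": "GET", "user_id": "u2"},
    {"__name__": "http_requests_total", "method": "POST", "user_id": "u3"},
]
```

The label at the top of the result is the first candidate to drop or normalize.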

Common high-cardinality labels to watch for:

user_id, customer_id, account_id
request_id, trace_id, span_id
pod (can be high in large clusters)
url or path (with query parameters in the value)
error_message (free-form text)
job_id, build_id

Step 3: Fix High Cardinality

Fix 1: Drop the Label at Scrape Time

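If the label provides no useful aggregation value, drop it. Every series in a scrape must still have a unique label set after the drop, so this works best for labels that are redundant with the ones you keep: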

yaml

# prometheus.yml scrape config
scrape_configs:
  - job_name: myapp
    static_configs:
      - targets: ['myapp:8080']
    metric_relabel_configs:
      # labeldrop matches the regex against label names and accepts no other fields
      - action: labeldrop
        regex: request_id|user_id
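Before rolling out a labeldrop rule, it can help to estimate what it does to a sample of your series. The sketch below is a rough simulation, not Prometheus code: it uses Python's re.fullmatch to approximate Prometheus's anchored regex matching on label names, and it reports how many label sets remain and how many of them were produced by more than one original series.

```python
import re
from collections import Counter


def simulate_labeldrop(series, regex):
    """Return (remaining_series, colliding_label_sets) after dropping labels."""
    pattern = re.compile(regex)
    remaining = Counter()
    for labels in series:
        kept = {k: v for k, v in labels.items() if not pattern.fullmatch(k)}
        remaining[tuple(sorted(kept.items()))] += 1
    collisions = sum(1 for count in remaining.values() if count > 1)
    return len(remaining), collisions


# Hypothetical series from a single scrape.
scraped = [
    {"endpoint": "/api/data", "user_id": "u1", "request_id": "r1"},
    {"endpoint": "/api/data", "user_id": "u2", "request_id": "r2"},
    {"endpoint": "/api/login", "user_id": "u1", "request_id": "r3"},
]
result = simulate_labeldrop(scraped, "request_id|user_id")
```

A non-zero collision count means several exposed series would share one identity within a scrape. In that case, removing the label in the application (Fix 3) is the cleaner route.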

Fix 2: Replace High-Cardinality Labels with Buckets

Instead of exact values, such as numeric IDs embedded in URL paths, rewrite the label to a normalized value:

yaml

metric_relabel_configs:
  # Replace exact URL paths with normalized ones
  - source_labels: [url]
    regex: '/api/users/[0-9]+'
    target_label: url
    replacement: '/api/users/:id'

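This turns millions of unique values such as /api/users/123456 into a single label value, /api/users/:id. Prometheus anchors the regex at both ends, so deeper paths such as /api/users/123456/orders need their own rule. Series that end up with identical labels within one scrape collide, so treat this as a stopgap until the application emits the normalized path itself.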
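You can test a normalization rule offline before touching the Prometheus config. This is a minimal sketch of the replace action for a single source label. It assumes a literal replacement string (no $1 capture references) and uses re.fullmatch because Prometheus requires the regex to match the whole value.

```python
import re


def relabel_replace(value, regex, replacement):
    """Return the replacement if the whole value matches, else the value unchanged."""
    if re.fullmatch(regex, value):
        return replacement
    return value


USER_PATH = r"/api/users/[0-9]+"

normalized = relabel_replace("/api/users/123456", USER_PATH, "/api/users/:id")
nested = relabel_replace("/api/users/123456/orders", USER_PATH, "/api/users/:id")
```

Here normalized becomes /api/users/:id, while nested is left untouched because the pattern does not cover the trailing /orders segment.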

Fix 3: Fix the Instrumentation

The real fix is in your application code. Don't use high-cardinality values as label values:

python

# Bad — creates a new series per user
requests_counter.labels(user_id=user.id, endpoint="/api/data").inc()
 
# Good — drop user_id entirely or use a low-cardinality grouping
requests_counter.labels(endpoint="/api/data", tier="premium").inc()
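One way to keep a label low-cardinality is to pass every value through an allow-list before it reaches the metric. The helper below is a hypothetical sketch (bounded_label and ALLOWED_TIERS are not part of any Prometheus client library). Anything outside the known set collapses into a single fallback value.

```python
# Hypothetical set of values the "tier" label is allowed to take.
ALLOWED_TIERS = frozenset({"free", "premium", "enterprise"})


def bounded_label(value, allowed, fallback="other"):
    """Return value if it is in the allow-list, otherwise the fallback."""
    return value if value in allowed else fallback


def max_label_cardinality(allowed):
    """The label can never exceed the allow-list size plus the fallback."""
    return len(allowed) + 1
```

In the counter call above you would then write tier=bounded_label(user.tier, ALLOWED_TIERS), which caps the label at four values regardless of what the application passes in.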

Fix 4: Drop Entire Metrics

If you don't need a metric at all, drop it:

yaml

metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'go_gc_.*|process_.*'
    action: drop

This is useful for Go runtime and process metrics that every target exports but that you never query.

Fix 5: Use Recording Rules to Pre-aggregate

If you need the data but not at full cardinality, pre-aggregate with recording rules:

yaml

groups:
  - name: aggregations
    rules:
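      - record: job_status:http_requests:rate5m
        expr: sum by (job, status)(rate(http_requests_total[5m]))

Then query the recorded series instead of the raw metric. Queries touch far fewer series. Recording rules do not shrink the head block on their own, because the raw series are still ingested, so use them alongside the fixes above rather than instead of them.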
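To see why the aggregated result is so much smaller, here is a small Python sketch of what sum by (job, status) does to a set of per-series values. It is an illustration of the grouping only, not how Prometheus evaluates rules, and the sample rates are invented.

```python
from collections import defaultdict


def sum_by(samples, keep):
    """Group (labels, value) pairs by the kept labels and sum the values."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels.get(name, "")) for name in keep)
        totals[key] += value
    return dict(totals)


# Hypothetical per-series request rates, one series per user.
rates = [
    ({"job": "myapp", "status": "200", "user_id": "u1"}, 1.0),
    ({"job": "myapp", "status": "200", "user_id": "u2"}, 2.0),
    ({"job": "myapp", "status": "500", "user_id": "u1"}, 0.5),
    ({"job": "myapp", "status": "500", "user_id": "u3"}, 0.25),
]
aggregated = sum_by(rates, ("job", "status"))
```

Four input series collapse into two output series, one per (job, status) pair. The output size depends only on the labels you keep, not on how many users exist.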

Step 4: Right-Size Prometheus Memory

Once cardinality is under control, set appropriate resource limits:

yaml

# prometheus values.yaml (kube-prometheus-stack)
prometheus:
  prometheusSpec:
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
    retention: 15d
    retentionSize: 20GB
    # Cap the number of samples accepted per scrape
    enforcedSampleLimit: 100000

enforcedSampleLimit: a per-scrape cap on samples. If a target returns more than N samples, the whole scrape is treated as failed, which protects Prometheus against runaway metrics.
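One rough way to pick the memory numbers is to measure bytes per series on your own server, for example by dividing process_resident_memory_bytes by prometheus_tsdb_head_series, and scaling to the series count you expect. The sketch below assumes memory grows linearly with series count, which ignores query load and series churn, and the 1.5 headroom factor is an arbitrary assumption rather than a recommendation.

```python
def estimate_memory_bytes(target_series, measured_rss_bytes, measured_series,
                          headroom=1.5):
    """Scale measured bytes-per-series to a target series count, plus headroom."""
    if measured_series <= 0:
        raise ValueError("measured_series must be positive")
    bytes_per_series = measured_rss_bytes / measured_series
    return round(target_series * bytes_per_series * headroom)


# Hypothetical measurement: 4 GB resident with 1M head series,
# and a target of 500K series once cardinality is fixed.
suggested_limit = estimate_memory_bytes(500_000, 4_000_000_000, 1_000_000)
```

Treat the result as a starting point for the limit and adjust it after watching real usage for a few days.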

Step 5: Prevent Future Cardinality Issues

Add cardinality alerts:

yaml

groups:
  - name: cardinality
    rules:
      - alert: PrometheusHighCardinality
        expr: count({__name__=~".+"}) > 1000000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus has > 1M series — investigate high cardinality"
 
      - alert: MetricHighSeriesCount
        expr: topk(1, count by (__name__)({__name__=~".+"})) > 50000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "A single metric has > 50K series"

Use Prometheus limits in scrape configs:

yaml

scrape_configs:
  - job_name: myapp
    sample_limit: 10000  # Reject scrape if > 10K samples returned
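To choose a sensible sample_limit, it helps to know how many samples a target returns today. A minimal sketch that counts sample lines in a /metrics response body follows. It assumes the plain-text exposition format and counts before any metric relabeling, whereas Prometheus applies the limit after relabeling, so treat the number as an upper bound.

```python
def count_samples(exposition_text):
    """Count sample lines, skipping blank lines and # HELP / # TYPE comments."""
    count = 0
    for line in exposition_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            count += 1
    return count


def exceeds_sample_limit(exposition_text, limit):
    """True if the scrape body holds more samples than the limit allows."""
    return count_samples(exposition_text) > limit


# Hypothetical scrape body for illustration.
body = """\
# HELP http_requests_total Total requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="POST",status="404"} 3

process_cpu_seconds_total 12.5
"""
```

Setting the limit somewhat above the current count leaves room for normal growth while still catching a label explosion.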

Summary

Problem	Fix
Label with user/request IDs	Drop label with `metric_relabel_configs`
URL paths with IDs	Normalize with regex relabeling
Metrics you don't need	Drop metric entirely
Need data but less granular	Recording rules to pre-aggregate
Runaway scrape target	`sample_limit` to cap

High cardinality is the #1 reason Prometheus runs out of memory. Fix the labels, not the RAM limit.

Useful Tools

Grafana Mimirtool — Analyze cardinality, find unused metrics
Prometheus TSDB status page — Built-in cardinality stats
pint — PromQL linter that catches cardinality issues in rules
