
Prometheus High Cardinality Causing OOM — How to Find and Fix It (2026)

Prometheus is crashing with OOMKilled or running out of memory. The culprit is almost always high cardinality metrics — labels with thousands of unique values. Here's how to find which metrics are killing your Prometheus and exactly how to fix it.

DevOpsBoys | May 19, 2026 | 4 min read

Prometheus is OOMKilled. Or it's running fine but consuming 20GB of RAM on a cluster with only 10 nodes. Or queries are taking 30 seconds when they used to take 1.

The culprit is almost always high cardinality metrics.

Here's how to diagnose and fix it.


What Is High Cardinality?

Every unique combination of label values in Prometheus creates a new time series. Cardinality = total number of unique time series.

A metric like this is fine:

http_requests_total{method="GET", status="200"}
http_requests_total{method="POST", status="404"}

2 time series. No problem.

A metric like this will destroy your Prometheus:

http_requests_total{user_id="user-a1b2c3", request_id="req-xyz789"}

If you have 100,000 users making 1,000,000 requests — that's 1,000,000+ time series from a single metric.
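
To make the arithmetic concrete, here is a minimal Python sketch (not Prometheus code; the label names and counts are illustrative assumptions). It computes the worst-case series count for one metric as the product of the distinct values of each label.

```python
from math import prod

def max_series(label_value_counts):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())

# Illustrative numbers only
low = max_series({"method": 5, "status": 10})
high = max_series({"method": 5, "status": 10, "user_id": 100_000})
print(low)   # 50
print(high)  # 5000000
```

The product is only an upper bound. The real count is the number of label combinations your targets actually expose, but adding a single unbounded label is enough to multiply it by orders of magnitude.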


Symptoms of High Cardinality

  • Prometheus pod OOMKilled repeatedly
  • kubectl top pod prometheus-0 showing 10GB+ RAM
  • Queries timeout or return slowly
  • TSDB compaction taking forever
  • /api/v1/status/tsdb shows millions of series

Step 1: Find the High Cardinality Metrics

Option A — Prometheus TSDB status page:

Go to http://your-prometheus:9090/tsdb-status

This shows:

  • Top 10 metrics by series count: your main offenders
  • Top 10 label names by value count: the labels with the most distinct values
  • Top 10 label-value pairs by series count: the specific values that create the most series

This is the fastest way to find the problem.
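
If you'd rather script this than click through the UI, the same data comes from the /api/v1/status/tsdb endpoint mentioned earlier. The sketch below assumes the response carries a seriesCountByMetricName list of name/value objects, as recent Prometheus versions return; verify that against your version. The function names are my own.

```python
import json
import urllib.request

def fetch_tsdb_status(base_url):
    """GET /api/v1/status/tsdb and return the decoded 'data' object."""
    with urllib.request.urlopen(f"{base_url}/api/v1/status/tsdb", timeout=10) as resp:
        return json.load(resp)["data"]

def top_metrics(tsdb_data, n=10):
    """Return (metric_name, series_count) pairs, largest first."""
    rows = tsdb_data.get("seriesCountByMetricName", [])
    ranked = sorted(rows, key=lambda r: r["value"], reverse=True)
    return [(r["name"], r["value"]) for r in ranked[:n]]

# Usage against a live server (the URL is a placeholder):
#   for name, count in top_metrics(fetch_tsdb_status("http://your-prometheus:9090")):
#       print(f"{count:>10}  {name}")
```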

Option B — PromQL queries:

promql
# Top metrics by series count
topk(10, count by (__name__)({__name__=~".+"}))
 
# Total series count
count({__name__=~".+"})
 
# Series per job
count by (job)({__name__=~".+"})

Option C — Cardinality Explorer (Grafana):

In Grafana → Explore → select Prometheus → use the "Metrics browser" to see series count per metric.


Step 2: Identify Which Labels Are Exploding

Once you know the metric (e.g., http_requests_total), find which label has high cardinality:

promql
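# Count distinct values of each label on a metric.
# The inner count groups series by label value; the outer count counts those groups.
count(count by (user_id)(http_requests_total))
count(count by (request_id)(http_requests_total))
count(count by (url)(http_requests_total))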

Common high-cardinality labels to watch for:

  • user_id, customer_id, account_id
  • request_id, trace_id, span_id
  • pod (can be high in large clusters)
  • url or path (with query parameters in the value)
  • error_message (free-form text)
  • job_id, build_id
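
The same check can be run offline on an export of label sets. This is a rough sketch, assuming the input is a list of label maps shaped like the entries returned by /api/v1/series; the sample data is made up.

```python
from collections import defaultdict

def distinct_values_per_label(series):
    """Map each label name to the number of distinct values seen across the series."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}

# Hypothetical label sets
sample = [
    {"__name__": "http_requests_total", "method": "GET", "user_id": "u1"},
    {"__name__": "http_requests_total", "method": "GET", "user_id": "u2"},
    {"__name__": "http_requests_total", "method": "POST", "user_id": "u3"},
]
counts = distinct_values_per_label(sample)
worst = max((n for n in counts if n != "__name__"), key=counts.get)
print(worst, counts[worst])  # user_id 3
```

The label whose distinct-value count grows with traffic or users, rather than staying flat, is the one to fix.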

Step 3: Fix High Cardinality

Fix 1: Drop the Label at Scrape Time

If the label provides no useful aggregation value, drop it. Relabeling does not aggregate, so make sure the series are still uniquely labeled once the label is removed. If they are not, prefer Fix 3 or Fix 5.

yaml
# prometheus.yml scrape config
scrape_configs:
  - job_name: myapp
    static_configs:
      - targets: ['myapp:8080']
    metric_relabel_configs:
      # labeldrop matches the regex against label names and takes no source_labels
      - action: labeldrop
        regex: request_id|user_id

Fix 2: Replace High-Cardinality Labels with Buckets

Instead of exact IDs inside a label value, normalize them to a fixed placeholder:

yaml
metric_relabel_configs:
  # Replace exact URL paths with normalized ones
  - source_labels: [url]
    regex: '/api/users/[0-9]+'
    target_label: url
    replacement: '/api/users/:id'

This turns millions of unique /api/users/123456 into one label value /api/users/:id.
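
One detail worth testing before you deploy a rule like this: Prometheus anchors relabel regexes at both ends, so the pattern must match the whole label value. The sketch below mimics that with Python's re.fullmatch. It is a simplification: it handles only a literal replacement (no $1 capture references), and Python's regex engine is not RE2. The helper name is hypothetical.

```python
import re

def relabel_replace(value, regex, replacement):
    """Mimic a 'replace' relabel rule with a literal replacement.

    Prometheus anchors relabel regexes at both ends, so use fullmatch.
    """
    return replacement if re.fullmatch(regex, value) else value

rule = (r"/api/users/[0-9]+", "/api/users/:id")
print(relabel_replace("/api/users/123456", *rule))     # /api/users/:id
print(relabel_replace("/api/users/42/orders", *rule))  # /api/users/42/orders (no match)
```

Because of the anchoring, nested paths such as /api/users/42/orders pass through unchanged and need their own rule.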

Fix 3: Fix the Instrumentation

The real fix is in your application code. Don't use high-cardinality values as label values:

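python
from prometheus_client import Counter

# Bad: creates a new series per user
bad_counter = Counter("http_requests_by_user_total", "Requests", ["user_id", "endpoint"])
bad_counter.labels(user_id="user-a1b2c3", endpoint="/api/data").inc()

# Good: drop user_id entirely or use a low-cardinality grouping
requests_counter = Counter("http_requests_total", "Requests", ["endpoint", "tier"])
requests_counter.labels(endpoint="/api/data", tier="premium").inc()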

Fix 4: Drop Entire Metrics

If you don't need a metric at all, drop it:

yaml
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'go_gc_.*|process_.*'
    action: drop

This is useful for Go runtime and process metrics, which add dozens of series per target, when you don't actually use them.
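
Before shipping a drop rule, it helps to check which metric names it would actually remove. Here is a small sketch using the same pattern as the rule above, with Python's re.fullmatch standing in for Prometheus's fully anchored matching. The metric list is illustrative, and the last name is a hypothetical application metric.

```python
import re

DROP_REGEX = r"go_gc_.*|process_.*"  # same pattern as the rule above

def kept_after_drop(metric_names, regex=DROP_REGEX):
    """Return the names that survive a 'drop' rule on __name__ (fully anchored match)."""
    return [n for n in metric_names if not re.fullmatch(regex, n)]

names = [
    "go_gc_duration_seconds",
    "process_cpu_seconds_total",
    "go_goroutines",
    "myapp_process_queue_length",  # hypothetical application metric
]
print(kept_after_drop(names))  # ['go_goroutines', 'myapp_process_queue_length']
```

Note that process_.* also removes process_cpu_seconds_total and similar metrics, so confirm that no dashboard or alert depends on them first.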

Fix 5: Use Recording Rules to Pre-aggregate

If you need the data but not at full cardinality, pre-aggregate with recording rules:

yaml
groups:
  - name: aggregations
    rules:
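      - record: job_status:http_requests:rate5m
        expr: sum by (job, status)(rate(http_requests_total[5m]))

Then query the recorded series instead of the raw metric, so dashboards and alerts touch far fewer series. The raw series are still ingested, so if memory is the problem, pair this with one of the fixes above.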
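
To see why pre-aggregation shrinks the series count, here is a toy model of sum by (job, status) in Python. It is only an illustration of the grouping, not how Prometheus evaluates rules, and the input data is invented.

```python
from collections import defaultdict

def sum_by(series, keep):
    """Aggregate (labels, value) pairs like PromQL's sum by (...)."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple((name, labels.get(name, "")) for name in keep)
        totals[key] += value
    return dict(totals)

# Hypothetical per-user request rates
raw = [
    ({"job": "myapp", "status": "200", "user_id": "u1"}, 1.5),
    ({"job": "myapp", "status": "200", "user_id": "u2"}, 2.5),
    ({"job": "myapp", "status": "500", "user_id": "u1"}, 0.5),
]
agg = sum_by(raw, keep=("job", "status"))
print(len(raw), "->", len(agg))  # 3 -> 2
```

The output size depends only on the labels you keep, so it stays flat no matter how many users show up in the raw data.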


Step 4: Right-Size Prometheus Memory

Once cardinality is under control, set appropriate resource limits:

yaml
# prometheus values.yaml (kube-prometheus-stack)
prometheus:
  prometheusSpec:
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
    retention: 15d
    retentionSize: 20GB
    # Cap samples per scrape for every ServiceMonitor/PodMonitor
    enforcedSampleLimit: 100000

enforcedSampleLimit: an operator-level cap on samples per scrape. A scrape that returns more than N samples is rejected as a whole, which protects against runaway metrics.


Step 5: Prevent Future Cardinality Issues

Add cardinality alerts:

yaml
groups:
  - name: cardinality
    rules:
      - alert: PrometheusHighCardinality
        expr: prometheus_tsdb_head_series > 1000000  # built-in gauge; cheaper than count({__name__=~".+"})
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus has > 1M series — investigate high cardinality"
 
      - alert: MetricHighSeriesCount
        expr: topk(1, count by (__name__)({__name__=~".+"})) > 50000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "A single metric has > 50K series"

Use Prometheus limits in scrape configs:

yaml
scrape_configs:
  - job_name: myapp
    sample_limit: 10000  # Reject scrape if > 10K samples returned
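
The limit is all-or-nothing, which is easy to get wrong when picking a value. This sketch models the documented behavior, where 0 means no limit and exceeding the limit fails the entire scrape. The function name is my own.

```python
def scrape_accepted(sample_count, sample_limit):
    """Return True if a scrape would be ingested under the given sample_limit.

    A limit of 0 means no limit; over the limit, the whole scrape is rejected.
    """
    return sample_limit == 0 or sample_count <= sample_limit

print(scrape_accepted(9_500, 10_000))   # True
print(scrape_accepted(10_001, 10_000))  # False: nothing from this scrape is stored
```

The count is taken after metric relabeling, so dropped metrics don't count against the limit. Leave headroom above a target's normal sample count, or a small legitimate increase will blank out all of its metrics.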

Summary

Problem | Fix
Label with user/request IDs | Drop label with metric_relabel_configs
URL paths with IDs | Normalize with regex relabeling
Metrics you don't need | Drop metric entirely
Need data but less granular | Recording rules to pre-aggregate
Runaway scrape target | sample_limit to cap
High cardinality is the #1 reason Prometheus runs out of memory. Fix the labels, not the RAM limit.


