Prometheus High Cardinality Causing OOM — How to Find and Fix It (2026)
Prometheus is crashing with OOMKilled or running out of memory. The culprit is almost always high cardinality metrics — labels with thousands of unique values. Here's how to find which metrics are killing your Prometheus and exactly how to fix it.
Prometheus is OOMKilled. Or it's running fine but consuming 20GB of RAM on a cluster with only 10 nodes. Or queries are taking 30 seconds when they used to take 1.
The culprit is almost always high cardinality metrics.
Here's how to diagnose and fix it.
What Is High Cardinality?
Every unique combination of label values in Prometheus creates a new time series. Cardinality = total number of unique time series.
A metric like this is fine:
http_requests_total{method="GET", status="200"}
http_requests_total{method="POST", status="404"}
2 time series. No problem.
A metric like this will destroy your Prometheus:
http_requests_total{user_id="user-a1b2c3", request_id="req-xyz789"}
If you have 100,000 users making 1,000,000 requests, every request carries its own request_id, so this single metric produces at least 1,000,000 time series.
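The arithmetic behind this can be sketched in a few lines. The worst case for one metric is the product of the number of distinct values of each label; the label counts below (5 methods, 10 status codes, 100,000 users) are illustrative assumptions, not measurements.

```python
from math import prod


def worst_case_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_value_counts.values())


# Assumed counts, for illustration only.
low = worst_case_series({"method": 5, "status": 10})
high = worst_case_series({"method": 5, "status": 10, "user_id": 100_000})

print(low)   # 50
print(high)  # 5000000
```

Real series counts are usually below this bound because not every combination occurs, but the bound shows why one unbounded label dominates everything else.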
Symptoms of High Cardinality
- Prometheus pod OOMKilled repeatedly
- kubectl top pod prometheus-0 showing 10GB+ RAM
- Queries time out or return slowly
- TSDB compaction taking forever
- /api/v1/status/tsdb shows millions of series
Step 1: Find the High Cardinality Metrics
Option A — Prometheus TSDB status page:
Go to http://your-prometheus:9090/tsdb-status
This shows:
- Top 10 metric names by series count: your main offenders
- Top 10 label names by number of distinct values: which labels are exploding
- Top 10 label name/value pairs by series count: the specific values attached to the most series
This is the fastest way to find the problem.
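The same data is available as JSON from /api/v1/status/tsdb, which is handy for scripting. Below is a minimal sketch, assuming the response carries a seriesCountByMetricName list of name/value entries under data; the base URL and the sample payload are made up for illustration.

```python
import json
from urllib.request import urlopen


def fetch_tsdb_status(base_url: str) -> dict:
    """GET /api/v1/status/tsdb and return the 'data' object."""
    with urlopen(f"{base_url}/api/v1/status/tsdb", timeout=10) as resp:
        return json.load(resp)["data"]


def top_offenders(data: dict, n: int = 5) -> list[tuple[str, int]]:
    """Rank metric names by series count, highest first."""
    rows = data.get("seriesCountByMetricName", [])
    ranked = sorted(rows, key=lambda r: r["value"], reverse=True)
    return [(r["name"], r["value"]) for r in ranked[:n]]


# Hypothetical payload shaped like the API response (values invented).
sample = {
    "seriesCountByMetricName": [
        {"name": "up", "value": 40},
        {"name": "http_requests_total", "value": 950_000},
        {"name": "node_cpu_seconds_total", "value": 3_200},
    ]
}
print(top_offenders(sample, n=2))
```

Against a live server you would call top_offenders(fetch_tsdb_status("http://your-prometheus:9090")) instead of using the sample dictionary.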
Option B — PromQL queries:
# Top metrics by series count
topk(10, count by (__name__)({__name__=~".+"}))
# Total series count
count({__name__=~".+"})
# Series per job
count by (job)({__name__=~".+"})
Option C — Cardinality Explorer (Grafana):
In Grafana → Explore → select Prometheus → use the "Metrics browser" to see series count per metric.
Step 2: Identify Which Labels Are Exploding
Once you know the metric (e.g., http_requests_total), find which label has high cardinality:
# Count the distinct values of each label on a metric
# (the inner count groups series per value, the outer count counts the groups)
count(count by (user_id)(http_requests_total))
count(count by (request_id)(http_requests_total))
count(count by (url)(http_requests_total))
Common high-cardinality labels to watch for:
- user_id, customer_id, account_id
- request_id, trace_id, span_id
- pod (can be high in large clusters)
- url or path (with query parameters in the value)
- error_message (free-form text)
- job_id, build_id
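If you already have a dump of label sets for a metric (for example, exported from the series API), the same check can be done offline. This is a rough sketch; the function names and the threshold are hypothetical, and the sample series are invented.

```python
from collections import defaultdict


def distinct_values_per_label(series: list[dict[str, str]]) -> dict[str, int]:
    """Count how many distinct values each label takes across the series."""
    values: dict[str, set[str]] = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


def suspects(series: list[dict[str, str]], threshold: int) -> list[str]:
    """Label names with more than `threshold` distinct values, worst first."""
    counts = distinct_values_per_label(series)
    flagged = [name for name, count in counts.items() if count > threshold]
    return sorted(flagged, key=lambda name: -counts[name])


# Invented label sets for one metric.
series = [
    {"method": "GET", "user_id": "u1"},
    {"method": "GET", "user_id": "u2"},
    {"method": "POST", "user_id": "u3"},
]
print(distinct_values_per_label(series))
print(suspects(series, threshold=2))
```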
Step 3: Fix High Cardinality
Fix 1: Drop the Label at Scrape Time
If the label provides no useful aggregation value, drop it:
# prometheus.yml scrape config
scrape_configs:
  - job_name: myapp
    static_configs:
      - targets: ['myapp:8080']
    metric_relabel_configs:
      # labeldrop matches its regex against label names and takes no source_labels.
      # The remaining labels must still identify each series uniquely.
      - action: labeldrop
        regex: request_id|user_id
Fix 2: Replace High-Cardinality Labels with Buckets
Instead of exact values, rewrite the label into a small set of normalized values:
metric_relabel_configs:
  # Replace exact URL paths with normalized ones
  - source_labels: [url]
    regex: '/api/users/[0-9]+'
    target_label: url
    replacement: '/api/users/:id'
This turns millions of unique values such as /api/users/123456 into one label value, /api/users/:id.
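Relabel regexes are fully anchored, so the pattern has to match the whole label value, not just a prefix. The sketch below mimics that behavior with Python's re.fullmatch. It is a simplification: relabel_replace is a hypothetical helper, it ignores capture-group expansion, and Python's regex dialect differs from the RE2 syntax Prometheus uses (this particular pattern behaves the same in both).

```python
import re


def relabel_replace(value: str, regex: str, replacement: str) -> str:
    """Mimic a 'replace' rule whose source and target are the same label:
    rewrite the value only when the regex matches the entire string."""
    if re.fullmatch(regex, value):
        return replacement
    return value


paths = [
    "/api/users/123456",
    "/api/users/789",
    "/api/users/42/orders",  # not rewritten: the regex must match the full value
    "/health",
]
normalized = [relabel_replace(p, r"/api/users/[0-9]+", "/api/users/:id") for p in paths]
print(normalized)
print(len(set(paths)), "->", len(set(normalized)))
```

Note the third path: nested routes need their own rule, or a broader regex, before they collapse.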
Fix 3: Fix the Instrumentation
The real fix is in your application code. Don't use high-cardinality values as label values:
# Bad: creates a new series per user
requests_counter.labels(user_id=user.id, endpoint="/api/data").inc()
# Good: drop user_id entirely or use a low-cardinality grouping
requests_counter.labels(endpoint="/api/data", tier="premium").inc()
Fix 4: Drop Entire Metrics
If you don't need a metric at all, drop it:
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'go_gc_.*|process_.*'
    action: drop
This is useful for Go runtime and process metrics (dozens of series per target) that you don't actually use.
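Before rolling out a drop rule, it is worth checking which metric names it would actually remove. A quick sketch, again using re.fullmatch to approximate the anchored matching; the list of metric names is just an example.

```python
import re

# Same pattern as the drop rule above.
DROP_REGEX = re.compile(r"go_gc_.*|process_.*")


def kept(metric_names: list[str]) -> list[str]:
    """Return the metric names that survive the drop rule."""
    return [name for name in metric_names if not DROP_REGEX.fullmatch(name)]


names = [
    "go_gc_duration_seconds",
    "process_cpu_seconds_total",
    "http_requests_total",
    "go_goroutines",
]
print(kept(names))  # ['http_requests_total', 'go_goroutines']
```

Here go_goroutines survives because the pattern only covers the go_gc_ prefix, which may or may not be what you want.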
Fix 5: Use Recording Rules to Pre-aggregate
If you need the data but not at full cardinality, pre-aggregate with recording rules:
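groups:
  - name: aggregations
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum by (job, status)(rate(http_requests_total[5m]))
Then query the recording rule instead of the raw metric. The recorded result has far fewer series, so dashboards and alerts evaluate faster. Note that the raw series are still ingested, so combine this with one of the fixes above if memory is the problem.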
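To see why the aggregated result is so much smaller, here is a toy model of what sum by (job, status) does: every series that shares the same job and status collapses into one. The sum_by helper and the sample values are invented for illustration; this is not how Prometheus implements it internally.

```python
from collections import defaultdict


def sum_by(samples: list[tuple[dict[str, str], float]], keep: tuple[str, ...]) -> dict:
    """Group samples by the kept labels and sum their values,
    discarding every other label."""
    totals: dict[tuple, float] = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels.get(name, "")) for name in keep)
        totals[key] += value
    return dict(totals)


# Invented per-instance, per-path request rates.
raw = [
    ({"job": "api", "status": "200", "instance": "a", "path": "/x"}, 1.5),
    ({"job": "api", "status": "200", "instance": "b", "path": "/y"}, 2.5),
    ({"job": "api", "status": "500", "instance": "a", "path": "/x"}, 0.5),
]
aggregated = sum_by(raw, ("job", "status"))
print(len(raw), "->", len(aggregated))  # 3 -> 2
```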
Step 4: Right-Size Prometheus Memory
Once cardinality is under control, set appropriate resource limits:
# prometheus values.yaml (kube-prometheus-stack)
prometheus:
  prometheusSpec:
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
    retention: 15d
    retentionSize: 20GB
    # Limit samples per scrape
    enforcedSampleLimit: 100000
enforcedSampleLimit caps the number of samples accepted per scrape: a scrape that returns more than N samples is rejected. This protects against runaway metrics.
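One detail that is easy to miss is that the limit is all-or-nothing: a scrape over the limit is not truncated, it is rejected as a whole and treated as a failed scrape. A tiny model of that behavior, with accept_scrape as a hypothetical helper and 0 meaning "no limit":

```python
def accept_scrape(samples: list, sample_limit: int) -> list:
    """Return the samples if the scrape is within the limit.
    A scrape over the limit is rejected entirely (nothing is stored).
    A limit of 0 disables the check."""
    if sample_limit > 0 and len(samples) > sample_limit:
        return []
    return samples


print(len(accept_scrape(list(range(5)), sample_limit=10)))   # 5
print(len(accept_scrape(list(range(11)), sample_limit=10)))  # 0
print(len(accept_scrape(list(range(11)), sample_limit=0)))   # 11
```

Because a rejected scrape leaves a gap for every metric from that target, set the limit comfortably above normal volume and alert on failed scrapes.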
Step 5: Prevent Future Cardinality Issues
Add cardinality alerts:
groups:
  - name: cardinality
    rules:
      - alert: PrometheusHighCardinality
        expr: count({__name__=~".+"}) > 1000000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus has > 1M series, investigate high cardinality"
      - alert: MetricHighSeriesCount
        expr: topk(1, count by (__name__)({__name__=~".+"})) > 50000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "A single metric has > 50K series"
Use Prometheus limits in scrape configs:
scrape_configs:
  - job_name: myapp
    sample_limit: 10000  # Reject scrape if > 10K samples returned
Summary
| Problem | Fix |
|---|---|
| Label with user/request IDs | Drop label with metric_relabel_configs |
| URL paths with IDs | Normalize with regex relabeling |
| Metrics you don't need | Drop metric entirely |
| Need data but less granular | Recording rules to pre-aggregate |
| Runaway scrape target | sample_limit to cap |
High cardinality is the #1 reason Prometheus runs out of memory. Fix the labels, not the RAM limit.
Useful Tools
- Grafana Mimirtool — Analyze cardinality, find unused metrics
- Prometheus Cardinality Explorer — Built-in TSDB stats
- pint — PromQL linter that catches cardinality issues in rules
Affiliate note: Need a managed Prometheus that handles cardinality automatically? Grafana Cloud and Datadog both offer intelligent metric aggregation at scale — free tiers available.