Kafka Consumer Lag Keeps Growing — How to Diagnose and Fix It
Consumer lag climbing and you don't know why? Here's a step-by-step way to find whether it's slow processing, rebalancing storms, or partition skew — and how to actually fix each cause.
Consumer lag is the gap between the latest offset produced to a topic and the offset your consumer group has actually processed. A small, stable lag is normal. A lag that keeps climbing means your consumers can't keep up — and if you don't catch it, you'll eventually see message loss policies kick in or processing delays severe enough that "real-time" stops meaning anything.
Step 1: Confirm It's Actually Growing, Not Just High
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--describe --group my-consumer-groupRun this twice, a minute apart. If LAG is climbing between runs, you have a real problem. If it's high but flat, your consumers are keeping pace with a backlog — different issue, different fix.
Step 2: Check Which Partitions Are Lagging
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--describe --group my-consumer-group | sort -k6 -n -rThree patterns to look for:
All partitions lagging evenly → your consumers are simply too slow for the incoming rate. This is a throughput problem.
One or two partitions lagging badly, others fine → partition skew. Your partition key is producing a hot partition, and adding more consumers won't help because Kafka can only assign one consumer per partition within a group.
Lag spikes then drops then spikes again → rebalancing storms, usually from consumers dying and rejoining repeatedly.
Fix 1: Slow Processing (Even Lag Across Partitions)
Check your actual per-message processing time first — don't guess.
import time
def process_message(msg):
start = time.time()
# your processing logic
handle(msg)
duration = time.time() - start
if duration > 0.1: # log anything over 100ms
logger.warning(f"slow message processing: {duration:.3f}s, key={msg.key}")If processing is genuinely slow:
- Increase
max.poll.recordsso each poll does more work per network round trip, but only if your processing can batch (e.g., batch DB writes instead of one write per message) - Scale consumers up to partition count. Adding a consumer beyond the number of partitions does nothing — Kafka caps consumers-per-group at partition count.
- If you've maxed out partition-bound parallelism, increase partition count on the topic (requires care — it changes key-to-partition mapping for existing keys) and scale consumers accordingly.
# Check current partition count
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my-topic
# Increase partitions (cannot decrease later)
kafka-topics.sh --bootstrap-server localhost:9092 \
--alter --topic my-topic --partitions 12Fix 2: Hot Partition (Skewed Lag)
This is a key design problem, not a scaling problem. Check your producer's partitioning key:
# Bad — if user_id is unevenly distributed (e.g. one tenant dominates),
# this creates a permanent hot partition
producer.send('events', key=user_id, value=event)
# Better — combine with a high-cardinality secondary field,
# or use round-robin if ordering per-key doesn't matter
producer.send('events', key=f"{user_id}:{event_id[:4]}", value=event)If you can't change the key without breaking downstream ordering guarantees, the practical workaround is giving the consumer handling that partition dedicated resources (run it as a separate deployment with more CPU/memory) rather than scaling the whole group.
Fix 3: Rebalancing Storms
Check consumer group logs for repeated JoinGroup / LeaveGroup events. Common causes:
# If session.timeout.ms is too low relative to actual processing time,
# consumers get kicked out of the group mid-batch, triggering a rebalance
session.timeout.ms=10000 # too aggressive for slow consumers
max.poll.interval.ms=300000 # must exceed worst-case processing time per poll
# Fix: give consumers enough time to actually finish a poll batch
session.timeout.ms=30000
max.poll.interval.ms=600000
heartbeat.interval.ms=10000 # roughly session.timeout.ms / 3Also check for OOM kills or liveness probe failures restarting consumer pods — that looks identical to a rebalancing storm from Kafka's side but the real fix is in Kubernetes, not Kafka config.
kubectl get pods -n kafka-consumers -o wide
kubectl describe pod <pod-name> | grep -A5 "Last State"Monitoring So You Catch This Before It's an Incident
# Prometheus alert — catches sustained lag growth, not just absolute lag
- alert: KafkaConsumerLagGrowing
expr: |
deriv(kafka_consumergroup_lag[10m]) > 0
and kafka_consumergroup_lag > 1000
for: 15m
labels:
severity: warning
annotations:
summary: "Consumer lag for {{ $labels.consumergroup }} has grown for 15m straight"Alerting on the rate of change, not just the absolute lag number, catches the problem while it's still small — before it becomes a 2 AM page.
Set up full Kafka observability: Grafana Loki Log Aggregation Guide
Today I Fixed
Short real fixes from production — posted daily
Stay ahead of the curve
Get the latest DevOps, Kubernetes, AWS, and AI/ML guides delivered straight to your inbox. No spam — just practical engineering content.
Related Articles
Datadog Agent Not Sending Metrics — Diagnosis and Fix Guide
Datadog dashboards show no data, hosts appear offline, or custom metrics aren't showing up. Here's how to systematically diagnose and fix Datadog agent issues on Kubernetes and VMs.
Prometheus AlertManager Silence Not Working? Here's Why (and the Fix)
You created a silence in AlertManager but alerts are still firing. Here are the 6 most common reasons silences fail and exactly how to fix each one.
Prometheus High Cardinality Causing OOM — How to Find and Fix It (2026)
Prometheus is crashing with OOMKilled or running out of memory. The culprit is almost always high cardinality metrics — labels with thousands of unique values. Here's how to find which metrics are killing your Prometheus and exactly how to fix it.