🎉 DevOps Interview Prep Bundle is live — 1000+ Q&A across 20 topicsGet it →
All Articles

Kafka Consumer Lag Keeps Growing — How to Diagnose and Fix It

Consumer lag climbing and you don't know why? Here's a step-by-step way to find whether it's slow processing, rebalancing storms, or partition skew — and how to actually fix each cause.

DevOpsBoysJun 15, 20264 min read
Share:Tweet

Consumer lag is the gap between the latest offset produced to a topic and the offset your consumer group has actually processed. A small, stable lag is normal. A lag that keeps climbing means your consumers can't keep up — and if you don't catch it, you'll eventually see message loss policies kick in or processing delays severe enough that "real-time" stops meaning anything.

Step 1: Confirm It's Actually Growing, Not Just High

bash
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group

Run this twice, a minute apart. If LAG is climbing between runs, you have a real problem. If it's high but flat, your consumers are keeping pace with a backlog — different issue, different fix.

Step 2: Check Which Partitions Are Lagging

bash
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group | sort -k6 -n -r

Three patterns to look for:

All partitions lagging evenly → your consumers are simply too slow for the incoming rate. This is a throughput problem.

One or two partitions lagging badly, others fine → partition skew. Your partition key is producing a hot partition, and adding more consumers won't help because Kafka can only assign one consumer per partition within a group.

Lag spikes then drops then spikes again → rebalancing storms, usually from consumers dying and rejoining repeatedly.

Fix 1: Slow Processing (Even Lag Across Partitions)

Check your actual per-message processing time first — don't guess.

python
import time
 
def process_message(msg):
    start = time.time()
    # your processing logic
    handle(msg)
    duration = time.time() - start
    if duration > 0.1:  # log anything over 100ms
        logger.warning(f"slow message processing: {duration:.3f}s, key={msg.key}")

If processing is genuinely slow:

  • Increase max.poll.records so each poll does more work per network round trip, but only if your processing can batch (e.g., batch DB writes instead of one write per message)
  • Scale consumers up to partition count. Adding a consumer beyond the number of partitions does nothing — Kafka caps consumers-per-group at partition count.
  • If you've maxed out partition-bound parallelism, increase partition count on the topic (requires care — it changes key-to-partition mapping for existing keys) and scale consumers accordingly.
bash
# Check current partition count
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my-topic
 
# Increase partitions (cannot decrease later)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --alter --topic my-topic --partitions 12

Fix 2: Hot Partition (Skewed Lag)

This is a key design problem, not a scaling problem. Check your producer's partitioning key:

python
# Bad — if user_id is unevenly distributed (e.g. one tenant dominates),
# this creates a permanent hot partition
producer.send('events', key=user_id, value=event)
 
# Better — combine with a high-cardinality secondary field,
# or use round-robin if ordering per-key doesn't matter
producer.send('events', key=f"{user_id}:{event_id[:4]}", value=event)

If you can't change the key without breaking downstream ordering guarantees, the practical workaround is giving the consumer handling that partition dedicated resources (run it as a separate deployment with more CPU/memory) rather than scaling the whole group.

Fix 3: Rebalancing Storms

Check consumer group logs for repeated JoinGroup / LeaveGroup events. Common causes:

properties
# If session.timeout.ms is too low relative to actual processing time,
# consumers get kicked out of the group mid-batch, triggering a rebalance
session.timeout.ms=10000          # too aggressive for slow consumers
max.poll.interval.ms=300000       # must exceed worst-case processing time per poll
 
# Fix: give consumers enough time to actually finish a poll batch
session.timeout.ms=30000
max.poll.interval.ms=600000
heartbeat.interval.ms=10000       # roughly session.timeout.ms / 3

Also check for OOM kills or liveness probe failures restarting consumer pods — that looks identical to a rebalancing storm from Kafka's side but the real fix is in Kubernetes, not Kafka config.

bash
kubectl get pods -n kafka-consumers -o wide
kubectl describe pod <pod-name> | grep -A5 "Last State"

Monitoring So You Catch This Before It's an Incident

yaml
# Prometheus alert — catches sustained lag growth, not just absolute lag
- alert: KafkaConsumerLagGrowing
  expr: |
    deriv(kafka_consumergroup_lag[10m]) > 0
    and kafka_consumergroup_lag > 1000
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Consumer lag for {{ $labels.consumergroup }} has grown for 15m straight"

Alerting on the rate of change, not just the absolute lag number, catches the problem while it's still small — before it becomes a 2 AM page.

Set up full Kafka observability: Grafana Loki Log Aggregation Guide

🔧

Today I Fixed

Short real fixes from production — posted daily

Browse fixes
Newsletter

Stay ahead of the curve

Get the latest DevOps, Kubernetes, AWS, and AI/ML guides delivered straight to your inbox. No spam — just practical engineering content.

Related Articles

Comments