Kafka Consumer Lag Keeps Growing — How to Diagnose and Fix It

Consumer lag climbing and you don't know why? Here's a step-by-step way to find whether it's slow processing, rebalancing storms, or partition skew — and how to actually fix each cause.

Consumer lag is the gap between the latest offset produced to a topic and the offset your consumer group has actually processed. A small, stable lag is normal. A lag that keeps climbing means your consumers can't keep up — and if you don't catch it, you'll eventually see message loss policies kick in or processing delays severe enough that "real-time" stops meaning anything.

Step 1: Confirm It's Actually Growing, Not Just High

bash

kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group

Run this twice, a minute apart. If LAG is climbing between runs, you have a real problem. If it's high but flat, your consumers are keeping pace with a backlog — different issue, different fix.

Step 2: Check Which Partitions Are Lagging

bash

kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group | sort -k6 -n -r

Three patterns to look for:

All partitions lagging evenly → your consumers are simply too slow for the incoming rate. This is a throughput problem.

One or two partitions lagging badly, others fine → partition skew. Your partition key is producing a hot partition, and adding more consumers won't help because Kafka can only assign one consumer per partition within a group.

Lag spikes then drops then spikes again → rebalancing storms, usually from consumers dying and rejoining repeatedly.

Fix 1: Slow Processing (Even Lag Across Partitions)

Check your actual per-message processing time first — don't guess.

python

import time
 
def process_message(msg):
    start = time.time()
    # your processing logic
    handle(msg)
    duration = time.time() - start
    if duration > 0.1:  # log anything over 100ms
        logger.warning(f"slow message processing: {duration:.3f}s, key={msg.key}")

If processing is genuinely slow:

Increase max.poll.records so each poll does more work per network round trip, but only if your processing can batch (e.g., batch DB writes instead of one write per message)
Scale consumers up to partition count. Adding a consumer beyond the number of partitions does nothing — Kafka caps consumers-per-group at partition count.
If you've maxed out partition-bound parallelism, increase partition count on the topic (requires care — it changes key-to-partition mapping for existing keys) and scale consumers accordingly.

bash

# Check current partition count
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my-topic
 
# Increase partitions (cannot decrease later)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --alter --topic my-topic --partitions 12

Fix 2: Hot Partition (Skewed Lag)

This is a key design problem, not a scaling problem. Check your producer's partitioning key:

python

# Bad — if user_id is unevenly distributed (e.g. one tenant dominates),
# this creates a permanent hot partition
producer.send('events', key=user_id, value=event)
 
# Better — combine with a high-cardinality secondary field,
# or use round-robin if ordering per-key doesn't matter
producer.send('events', key=f"{user_id}:{event_id[:4]}", value=event)

If you can't change the key without breaking downstream ordering guarantees, the practical workaround is giving the consumer handling that partition dedicated resources (run it as a separate deployment with more CPU/memory) rather than scaling the whole group.

Fix 3: Rebalancing Storms

Check consumer group logs for repeated JoinGroup / LeaveGroup events. Common causes:

properties

# If session.timeout.ms is too low relative to actual processing time,
# consumers get kicked out of the group mid-batch, triggering a rebalance
session.timeout.ms=10000          # too aggressive for slow consumers
max.poll.interval.ms=300000       # must exceed worst-case processing time per poll
 
# Fix: give consumers enough time to actually finish a poll batch
session.timeout.ms=30000
max.poll.interval.ms=600000
heartbeat.interval.ms=10000       # roughly session.timeout.ms / 3

Also check for OOM kills or liveness probe failures restarting consumer pods — that looks identical to a rebalancing storm from Kafka's side but the real fix is in Kubernetes, not Kafka config.

bash

kubectl get pods -n kafka-consumers -o wide
kubectl describe pod <pod-name> | grep -A5 "Last State"

Monitoring So You Catch This Before It's an Incident

yaml

# Prometheus alert — catches sustained lag growth, not just absolute lag
- alert: KafkaConsumerLagGrowing
  expr: |
    deriv(kafka_consumergroup_lag[10m]) > 0
    and kafka_consumergroup_lag > 1000
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Consumer lag for {{ $labels.consumergroup }} has grown for 15m straight"

Alerting on the rate of change, not just the absolute lag number, catches the problem while it's still small — before it becomes a 2 AM page.

Set up full Kafka observability: Grafana Loki Log Aggregation Guide

Kafka Consumer Lag Keeps Growing — How to Diagnose and Fix It

Step 1: Confirm It's Actually Growing, Not Just High

Step 2: Check Which Partitions Are Lagging

Fix 1: Slow Processing (Even Lag Across Partitions)

Fix 2: Hot Partition (Skewed Lag)

Fix 3: Rebalancing Storms

Monitoring So You Catch This Before It's an Incident

Stay ahead of the curve

Related Articles

Datadog Agent Not Sending Metrics — Diagnosis and Fix Guide

Prometheus AlertManager Silence Not Working? Here's Why (and the Fix)

Prometheus High Cardinality Causing OOM — How to Find and Fix It (2026)

Comments