Kubernetes Liveness Probe Failing Randomly — Fix

Your pods randomly restart with liveness probe failures, even when the app is healthy. Here's every reason this happens and how to tune probes correctly.

DevOpsBoys · May 31, 2026 · 4 min read

Your app is working fine. Logs show normal requests. But every few hours, a pod restarts with Liveness probe failed. Users see brief errors. The app comes back, works fine again.

This is one of the most common Kubernetes misconfigurations. Here's how to fix it.


Diagnose First

bash
# Check probe failure details
kubectl describe pod <pod-name> -n <namespace>
 
# Look for:
# Warning  Unhealthy  Liveness probe failed: HTTP probe failed with statuscode: 500
# Warning  Unhealthy  Liveness probe failed: context deadline exceeded
# Normal   Killing    Container <container-name> failed liveness probe, will be restarted
 
# Check the restart count (first container in the pod)
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].restartCount}'
 
# Watch probe events in real time
kubectl get events -n <namespace> --field-selector reason=Unhealthy -w

Case 1: Probe Timeout Too Short

The most common cause. Your app is healthy but responds slowly during GC, high load, or a slow database query. Each slow response counts as a probe timeout, and after failureThreshold consecutive failures the kubelet restarts the container.

Default probe settings are too aggressive for most apps:

yaml
# Default (too tight):
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 0   # fires immediately!
  periodSeconds: 10        # every 10 seconds
  timeoutSeconds: 1        # 1 second to respond!
  failureThreshold: 3      # 3 failures = restart

Fix — tune based on your app's actual behavior:

yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30   # wait for app to start
  periodSeconds: 15         # check every 15s
  timeoutSeconds: 5         # 5 seconds to respond
  failureThreshold: 3       # 3 consecutive failures = restart
  successThreshold: 1       # 1 success to become healthy
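
To choose timeoutSeconds from measurements rather than a guess, it can help to sample the health endpoint's latency first. The following is a minimal Go sketch and not part of the original setup: the URL http://localhost:8080/health (for example behind kubectl port-forward), the sample count, and the helper names sampleLatency and percentile are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"math"
	"net/http"
	"sort"
	"time"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100) of ms,
// a slice of latencies in milliseconds. It returns 0 for an empty slice.
func percentile(ms []float64, p float64) float64 {
	if len(ms) == 0 {
		return 0
	}
	sorted := append([]float64(nil), ms...) // copy so the caller's slice is untouched
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// sampleLatency sends n GET requests to url and records, in milliseconds,
// how long each call took to return. Failed requests are recorded as well,
// because a probe would also experience them as slow or failed.
func sampleLatency(client *http.Client, url string, n int, pause time.Duration) []float64 {
	out := make([]float64, 0, n)
	for i := 0; i < n; i++ {
		start := time.Now()
		resp, err := client.Get(url)
		if err == nil {
			resp.Body.Close()
		}
		out = append(out, float64(time.Since(start))/float64(time.Millisecond))
		time.Sleep(pause)
	}
	return out
}

func main() {
	client := &http.Client{Timeout: 30 * time.Second}
	// Assumed URL: adjust to your own health path and port.
	ms := sampleLatency(client, "http://localhost:8080/health", 100, 100*time.Millisecond)
	fmt.Printf("p50=%.1fms p99=%.1fms max=%.1fms\n",
		percentile(ms, 50), percentile(ms, 99), percentile(ms, 100))
}
```

A timeoutSeconds comfortably above the observed p99 (and ideally above the max seen under load) is a reasonable starting point.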

Case 2: Health Endpoint Checks Too Much

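Your /health endpoint is doing too much: checking DB connections, external APIs, and the cache. If any of those is slow or down, the probe fails and the container gets restarted, even though a restart cannot fix a dependency that lives outside the pod.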

Bad health endpoint:

go
// Checks everything — too fragile
func healthHandler(w http.ResponseWriter, r *http.Request) {
    if err := db.Ping(); err != nil {  // DB might be slow!
        http.Error(w, "DB unhealthy", 503)
        return
    }
    if err := redis.Ping(); err != nil {  // Redis might be slow!
        http.Error(w, "Cache unhealthy", 503)
        return
    }
    w.WriteHeader(200)
}

Fix — separate liveness from readiness:

go
// Liveness: is the process alive? (bare minimum)
func livenessHandler(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(200)  // If this endpoint responds, the app is alive
    w.Write([]byte(`{"status":"alive"}`))
}
 
// Readiness: can this pod accept traffic? (check dependencies)
func readinessHandler(w http.ResponseWriter, r *http.Request) {
    if err := db.Ping(); err != nil {
        http.Error(w, "DB not ready", 503)
        return
    }
    w.WriteHeader(200)
    w.Write([]byte(`{"status":"ready"}`))
}

yaml
livenessProbe:
  httpGet:
    path: /health/live    # Simple alive check
    port: 8080
 
readinessProbe:
  httpGet:
    path: /health/ready   # Full dependency check
    port: 8080

Liveness failure = the kubelet restarts the container. Readiness failure = the pod is removed from Service endpoints, so it stops receiving traffic (no restart).
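
A readiness check that calls a dependency should also be bounded in time, so that a hung dependency yields a quick 503 rather than a probe timeout. Below is a minimal, self-contained Go sketch of that idea. The names pingFunc, readinessStatus, healthyPing, failingPing, and hungPing are hypothetical, the 2-second budget is an assumption, and the stub pings stand in for a real check such as a database ping.

```go
package main

import (
	"context"
	"errors"
	"log"
	"net/http"
	"time"
)

// pingFunc is a stand-in for a dependency check such as a database ping.
type pingFunc func(ctx context.Context) error

// Hypothetical stubs used for illustration only.
func healthyPing(ctx context.Context) error { return nil }
func failingPing(ctx context.Context) error { return errors.New("dependency unavailable") }
func hungPing(ctx context.Context) error    { <-ctx.Done(); return ctx.Err() }

// readinessStatus runs ping with a deadline and maps the outcome to an HTTP
// status code. The ping runs in a goroutine so that a check which ignores
// the context still cannot hold the handler past the timeout.
// (A real handler could derive the context from r.Context() instead.)
func readinessStatus(ping pingFunc, timeout time.Duration) int {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	done := make(chan error, 1) // buffered: the goroutine never blocks on send
	go func() { done <- ping(ctx) }()

	select {
	case err := <-done:
		if err != nil {
			return http.StatusServiceUnavailable
		}
		return http.StatusOK
	case <-ctx.Done():
		return http.StatusServiceUnavailable
	}
}

// livenessHandler performs no dependency checks: answering at all shows
// that the process is alive and serving HTTP.
func livenessHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
}

func readinessHandler(ping pingFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(readinessStatus(ping, 2*time.Second))
	}
}

func main() {
	http.HandleFunc("/health/live", livenessHandler)
	// Swap healthyPing for a real dependency check in your own service.
	http.HandleFunc("/health/ready", readinessHandler(healthyPing))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Keeping the readiness budget below the probe's timeoutSeconds means the kubelet sees an explicit 503 instead of a timeout, which makes the events easier to read.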


Case 3: Java / JVM Apps — Startup GC Pauses

JVM apps have garbage collection pauses. During a full GC, your app can be unresponsive for 1–5 seconds. If your probe timeout is 1s, this triggers a false failure.

Fix:

yaml
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 60    # JVM takes time to start
  periodSeconds: 20
  timeoutSeconds: 10         # Allow for GC pauses
  failureThreshold: 5        # Be lenient
 
startupProbe:               # Handle slow startup separately
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  failureThreshold: 30      # 30 * 10s = 5 minutes max startup time
  periodSeconds: 10

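Liveness and readiness probes do not run until the startupProbe succeeds, which makes it a good fit for slow-starting JVM apps. If the startupProbe has not succeeded within failureThreshold × periodSeconds, the container is restarted.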
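
The arithmetic behind these settings can be sketched in a few lines of Go. This is an approximation under stated assumptions, not kubelet code: it assumes the app is continuously unresponsive and that periodSeconds is at least timeoutSeconds. The type Probe and the functions restartWindow and startupBudget are hypothetical names.

```go
package main

import "fmt"

// Probe mirrors the timing fields of a Kubernetes probe, all in seconds.
type Probe struct {
	PeriodSeconds    int
	TimeoutSeconds   int
	FailureThreshold int
}

// restartWindow approximates how long the app must stay unresponsive before
// FailureThreshold consecutive probes have failed: the gaps between the
// probes plus the timeout of the last one.
func restartWindow(p Probe) int {
	return (p.FailureThreshold-1)*p.PeriodSeconds + p.TimeoutSeconds
}

// startupBudget is the longest a startupProbe keeps trying before the
// container is restarted.
func startupBudget(p Probe) int {
	return p.FailureThreshold * p.PeriodSeconds
}

func main() {
	defaults := Probe{PeriodSeconds: 10, TimeoutSeconds: 1, FailureThreshold: 3}
	tunedJVM := Probe{PeriodSeconds: 20, TimeoutSeconds: 10, FailureThreshold: 5}
	startup := Probe{PeriodSeconds: 10, TimeoutSeconds: 1, FailureThreshold: 30}

	fmt.Println("default liveness window (s):", restartWindow(defaults))   // 21
	fmt.Println("tuned JVM liveness window (s):", restartWindow(tunedJVM)) // 90
	fmt.Println("startup budget (s):", startupBudget(startup))             // 300
}
```

Under these assumptions the default settings restart a container after roughly 21 seconds of unresponsiveness, while the tuned JVM settings tolerate about 90 seconds.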


Case 4: Exec Probe Spawning Too Many Processes

If you use command-based probes:

yaml
# Problematic — spawns a new process every 10 seconds
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - "curl -f http://localhost:8080/health || exit 1"

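Every probe run forks a shell and a curl process inside the container, and that work counts against the container's own CPU and memory limits. Under load, slow probe processes can pile up and starve the main process, which slows it down further and triggers more probe failures.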

Fix: Use httpGet instead of exec for HTTP checks:

yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  # The kubelet makes the HTTP request itself, so nothing is spawned inside the container

Case 5: Resource Pressure Causing Slow Responses

A CPU limit that is too low gets the container throttled, so the app (health endpoint included) responds slowly and the probe times out.

bash
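# Check current CPU and memory usage (usage close to the limit suggests throttling)
kubectl top pod <pod-name> -n <namespace> --containers

# Confirm throttling from the container's cgroup counters (cgroup v2 path; see nr_throttled)
kubectl exec <pod-name> -n <namespace> -- cat /sys/fs/cgroup/cpu.stat

# Check whether the last restart was an OOMKill (memory pressure)
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Last State"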
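
CPU usage alone does not show throttling. The kernel's cpu.stat counters do: nr_throttled divided by nr_periods is the share of scheduling periods in which the container hit its CPU limit. The Go sketch below parses that file from standard input. It assumes a cgroup v2 node, where the file is /sys/fs/cgroup/cpu.stat inside the container, and throttledRatio is a hypothetical helper name.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strconv"
	"strings"
)

// throttledRatio parses cpu.stat content and returns nr_throttled / nr_periods.
// ok is false when either counter is missing or nr_periods is zero.
func throttledRatio(cpuStat string) (ratio float64, ok bool) {
	var periods, throttled float64
	var havePeriods, haveThrottled bool

	sc := bufio.NewScanner(strings.NewReader(cpuStat))
	for sc.Scan() {
		fields := strings.Fields(sc.Text()) // each line is "<key> <value>"
		if len(fields) != 2 {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue
		}
		switch fields[0] {
		case "nr_periods":
			periods, havePeriods = v, true
		case "nr_throttled":
			throttled, haveThrottled = v, true
		}
	}
	if !havePeriods || !haveThrottled || periods == 0 {
		return 0, false
	}
	return throttled / periods, true
}

func main() {
	// Reads cpu.stat content from stdin so it can be piped in from kubectl exec.
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read input:", err)
		os.Exit(1)
	}
	if ratio, ok := throttledRatio(string(data)); ok {
		fmt.Printf("throttled in %.1f%% of periods\n", ratio*100)
	} else {
		fmt.Println("no throttling counters found (is a CPU limit set?)")
	}
}
```

One way to use it, assuming the file is saved as throttle.go: kubectl exec <pod-name> -n <namespace> -- cat /sys/fs/cgroup/cpu.stat | go run throttle.go. The counters are cumulative since the container started, so comparing two readings taken a minute apart gives a better picture of current throttling.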

Fix: raise the CPU limit (or reduce the app's CPU demand) so the container is no longer throttled:

yaml
resources:
  requests:
    cpu: "100m"
    memory: "256Mi"
  limits:
    cpu: "500m"    # Increase if app is CPU throttled
    memory: "512Mi"

Starting Points by Stack

yaml
# Node.js / Go (fast startup)
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3
 
# Java / Spring Boot (slow startup)
startupProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 20
  timeoutSeconds: 10
  failureThreshold: 5
 
# Python / Django
livenessProbe:
  httpGet:
    path: /health/
    port: 8000
  initialDelaySeconds: 20
  periodSeconds: 15
  timeoutSeconds: 8
  failureThreshold: 3

Random probe failures are almost always a configuration issue, not an app bug. Tune initialDelaySeconds, timeoutSeconds, and failureThreshold based on your app's actual behavior — not the defaults.

