Kubernetes Liveness Probe Failing Randomly — Fix

Your pods randomly restart with liveness probe failures, even when the app is healthy. Here's every reason this happens and how to tune probes correctly.

Your app is working fine. Logs show normal requests. But every few hours, a pod restarts with Liveness probe failed. Users see brief errors. The app comes back, works fine again.

This is one of the most common Kubernetes misconfigurations. Here's how to fix it.

Diagnose First

bash

# Check probe failure details
kubectl describe pod <pod-name> -n <namespace>
 
# Look for:
# Warning  Unhealthy  Liveness probe failed: HTTP probe failed with statuscode: 500
# Warning  Unhealthy  Liveness probe failed: context deadline exceeded
# Normal   Killing    Container <container-name> failed liveness probe, will be restarted
 
# Check restart history
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].restartCount}'
 
# Watch probe events in real time
kubectl get events -n <namespace> --field-selector reason=Unhealthy -w

Case 1: Probe Timeout Too Short

The most common cause. Your app is healthy but responds slowly during GC, high load, or a database query. The probe times out and Kubernetes kills the pod.

Default probe settings are too aggressive for many apps. If you omit the timing fields, Kubernetes uses these values:

yaml

# Default (too tight):
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 0   # fires immediately!
  periodSeconds: 10        # every 10 seconds
  timeoutSeconds: 1        # 1 second to respond!
  failureThreshold: 3      # 3 failures = restart

Fix — tune based on your app's actual behavior:

yaml

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30   # wait for app to start
  periodSeconds: 15         # check every 15s
  timeoutSeconds: 5         # 5 seconds to respond
  failureThreshold: 3       # 3 consecutive failures = restart
  successThreshold: 1       # 1 success to become healthy

Case 2: Health Endpoint Checks Too Much

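Your /health endpoint is doing too much: checking DB connections, external APIs, cache. If any of those is slow, the probe fails and Kubernetes restarts a container that was never broken. Restarting the app does not make a slow dependency faster.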

Bad health endpoint:

// Checks everything — too fragile
func healthHandler(w http.ResponseWriter, r *http.Request) {
    if err := db.Ping(); err != nil {  // DB might be slow!
        http.Error(w, "DB unhealthy", 503)
        return
    }
    if err := redis.Ping(); err != nil {  // Redis might be slow!
        http.Error(w, "Cache unhealthy", 503)
        return
    }
    w.WriteHeader(200)
}

Fix — separate liveness from readiness:

// Liveness: is the process alive? (bare minimum)
func livenessHandler(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(200)  // If this endpoint responds, the app is alive
    w.Write([]byte(`{"status":"alive"}`))
}
 
// Readiness: can this pod accept traffic? (check dependencies)
func readinessHandler(w http.ResponseWriter, r *http.Request) {
    if err := db.Ping(); err != nil {
        http.Error(w, "DB not ready", 503)
        return
    }
    w.WriteHeader(200)
    w.Write([]byte(`{"status":"ready"}`))
}

yaml

livenessProbe:
  httpGet:
    path: /health/live    # Simple alive check
    port: 8080
 
readinessProbe:
  httpGet:
    path: /health/ready   # Full dependency check
    port: 8080

Liveness failure = the container is restarted. Readiness failure = the pod is removed from Service endpoints and stops receiving traffic (no restart).
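
Even a readiness check should answer within the probe's timeoutSeconds, so one option is to put a deadline on each dependency call. The following is a minimal sketch under assumptions: `pinger`, `fakeDep`, and `status` are hypothetical helpers introduced for this example (a `*sql.DB` would satisfy `pinger` through its `PingContext` method), and the 50ms deadline is an arbitrary value for the demo.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// pinger is a hypothetical interface for any dependency with a
// context-aware ping; *sql.DB satisfies it via PingContext.
type pinger interface {
	PingContext(ctx context.Context) error
}

// livenessHandler touches no dependencies: if it answers, the process is alive.
func livenessHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
}

// readinessHandler pings dep with a deadline so the handler itself never
// hangs longer than timeout.
func readinessHandler(dep pinger, timeout time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), timeout)
		defer cancel()
		if err := dep.PingContext(ctx); err != nil {
			http.Error(w, "dependency not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// fakeDep simulates a dependency that takes delay to answer a ping.
type fakeDep struct{ delay time.Duration }

func (f fakeDep) PingContext(ctx context.Context) error {
	select {
	case <-time.After(f.delay):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// status runs a handler against an in-memory request and returns the status code.
func status(h http.HandlerFunc) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Code
}

func main() {
	slow := fakeDep{delay: 200 * time.Millisecond}
	fmt.Println(status(livenessHandler))                             // 200
	fmt.Println(status(readinessHandler(slow, 50*time.Millisecond))) // 503
}
```

With the slow dependency, liveness still reports 200 while readiness reports 503, so the pod is taken out of rotation instead of being restarted.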

Case 3: Java / JVM Apps — Startup GC Pauses

JVM apps have garbage collection pauses. During a full GC, your app can be unresponsive for 1–5 seconds. If your probe timeout is 1s, this triggers a false failure.

Fix:

yaml

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 60    # JVM takes time to start (largely redundant when the startupProbe below is set)
  periodSeconds: 20
  timeoutSeconds: 10         # Allow for GC pauses
  failureThreshold: 5        # Be lenient
 
startupProbe:               # Handle slow startup separately
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  failureThreshold: 30      # 30 * 10s = 5 minutes max startup time
  periodSeconds: 10

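Liveness and readiness checks are held off until the startupProbe succeeds. If it does not succeed within failureThreshold * periodSeconds, the container is restarted. This makes it well suited to slow-starting JVM apps.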
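
The startup window is failureThreshold * periodSeconds, so the threshold can be derived from the slowest startup you have observed. A small sketch of that arithmetic follows; `startupFailureThreshold` is an illustrative helper name, and you would normally add some margin on top of its result.

```go
package main

import "fmt"

// startupFailureThreshold returns the smallest failureThreshold whose
// window (failureThreshold * periodSeconds) covers maxStartupSeconds.
// It returns 0 for a non-positive period, which is not a valid probe setting.
func startupFailureThreshold(maxStartupSeconds, periodSeconds int) int {
	if periodSeconds <= 0 {
		return 0
	}
	// Integer ceiling division.
	return (maxStartupSeconds + periodSeconds - 1) / periodSeconds
}

func main() {
	fmt.Println(startupFailureThreshold(300, 10)) // 30
	fmt.Println(startupFailureThreshold(95, 10))  // 10
}
```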

Case 4: Exec Probe Spawning Too Many Processes

If you use command-based probes:

yaml

# Problematic — spawns a new process every 10 seconds
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - "curl -f http://localhost:8080/health || exit 1"

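Each probe run forks sh and curl inside the container, and those processes share the container's CPU, memory, and PID limits. Under load they can pile up if they hang, slowing the main process and triggering more probe failures. The probe also fails outright if curl is not in the image.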

Fix: Use httpGet instead of exec for HTTP checks:

yaml

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  # Much lighter than spawning curl
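
To reproduce what an httpGet check sees without waiting for the kubelet, you can approximate a single probe attempt yourself: a GET with a hard timeout, where any status from 200 to 399 counts as success. This is a simplified sketch, not the kubelet's implementation; `probeOnce` and `fakeApp` are hypothetical names, and `fakeApp` is a local test server standing in for your application.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// probeOnce performs one GET with a hard timeout and treats any status
// outside 200-399 as a failure.
func probeOnce(url string, timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return fmt.Errorf("probe failed: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 400 {
		return fmt.Errorf("probe failed with statuscode: %d", resp.StatusCode)
	}
	return nil
}

// fakeApp starts a local test server that waits delay before answering
// with the given status code. It returns the server URL and a stop function.
func fakeApp(delay time.Duration, code int) (url string, stop func()) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(delay)
		w.WriteHeader(code)
	}))
	return srv.URL, srv.Close
}

func main() {
	url, stop := fakeApp(300*time.Millisecond, http.StatusOK)
	defer stop()

	fmt.Println(probeOnce(url, 100*time.Millisecond) != nil) // true: timed out
	fmt.Println(probeOnce(url, time.Second) == nil)          // true: answered in time
}
```

Pointing `probeOnce` at a port-forwarded pod with the same timeout as your probe is one way to check whether the endpoint is merely slow rather than broken.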

Case 5: Resource Pressure Causing Slow Responses

CPU throttling due to low CPU limits causes the app to respond slowly. The probe times out.

bash

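# Check current CPU usage (requires metrics-server); compare it with the CPU limit
kubectl top pod <pod-name> -n <namespace>

# Check the actual throttling counters (nr_periods, nr_throttled) inside the container
# Path assumes cgroup v2; on cgroup v1 it is typically /sys/fs/cgroup/cpu/cpu.stat
kubectl exec <pod-name> -n <namespace> -- cat /sys/fs/cgroup/cpu.stat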
 
# Check events for OOMKilled (memory pressure)
kubectl describe pod <pod-name> | grep -A5 "Last State"

Fix: raise the CPU limit (or reduce the app's CPU demand) so the container is not throttled, and leave enough memory headroom to avoid OOM kills:

yaml

resources:
  requests:
    cpu: "100m"
    memory: "256Mi"
  limits:
    cpu: "500m"    # Increase if app is CPU throttled
    memory: "512Mi"
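
CPU usage alone does not show throttling; the kernel records it in the container's cgroup `cpu.stat` file as `nr_periods` and `nr_throttled`. As a rough sketch, the share of throttled scheduling periods can be computed from that file's text. `throttledRatio` is an illustrative helper, and the sample values in `main` are made up.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// throttledRatio parses the text of a cgroup cpu.stat file and returns
// nr_throttled / nr_periods, the fraction of CPU periods in which the
// container hit its limit.
func throttledRatio(cpuStat string) (float64, error) {
	vals := map[string]float64{}
	sc := bufio.NewScanner(strings.NewReader(cpuStat))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue // skip lines that are not "key value"
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue // skip non-numeric values
		}
		vals[fields[0]] = v
	}
	periods, ok := vals["nr_periods"]
	if !ok || periods == 0 {
		return 0, fmt.Errorf("nr_periods missing or zero")
	}
	return vals["nr_throttled"] / periods, nil
}

func main() {
	// Made-up sample in the cgroup v2 cpu.stat layout.
	sample := "usage_usec 5000000\nnr_periods 1000\nnr_throttled 250\nthrottled_usec 900000\n"
	ratio, err := throttledRatio(sample)
	fmt.Println(ratio, err) // 0.25 <nil>
}
```

A ratio that climbs whenever probe failures appear suggests the CPU limit, not the probe timing, is the thing to change.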

Recommended Probe Config by App Type

yaml

# Node.js / Go (fast startup)
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3
 
# Java / Spring Boot (slow startup)
startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 20
  timeoutSeconds: 10
  failureThreshold: 5
 
# Python / Django
livenessProbe:
  httpGet:
    path: /health/
    port: 8000
  initialDelaySeconds: 20
  periodSeconds: 15
  timeoutSeconds: 8
  failureThreshold: 3

Random probe failures are almost always a configuration issue, not an app bug. Tune initialDelaySeconds, timeoutSeconds, and failureThreshold based on your app's actual behavior — not the defaults.

