Kubernetes Liveness Probe Failing Randomly — Fix
Your pods randomly restart with liveness probe failures, even when the app is healthy. Here's every reason this happens and how to tune probes correctly.
Your app is working fine. Logs show normal requests. But every few hours, a pod restarts with Liveness probe failed. Users see brief errors. The app comes back, works fine again.
This is one of the most common Kubernetes misconfigurations. Here's how to fix it.
Diagnose First
# Check probe failure details
kubectl describe pod <pod-name> -n <namespace>

# Look for events like:
#   Warning  Unhealthy  Liveness probe failed: HTTP probe failed with statuscode: 500
#   Warning  Unhealthy  Liveness probe failed: context deadline exceeded
#   Normal   Killing    Container <container-name> failed liveness probe, will be restarted

# Check restart history
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].restartCount}'

# Watch probe events in real time
kubectl get events -n <namespace> --field-selector reason=Unhealthy -w

Case 1: Probe Timeout Too Short
The most common cause. Your app is healthy but responds slowly during GC, high load, or a database query. The probe times out and Kubernetes kills the pod.
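One way to ground the probe's timeoutSeconds in data is to record how long the health endpoint takes to respond over a busy period and size the timeout from a high percentile. The sketch below assumes you already have such samples. The sample values, the helper names, and the "twice the p99" headroom rule are illustrative assumptions, not a Kubernetes recommendation.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0 < p <= 100) of the samples using
// the nearest-rank method. It returns 0 for an empty slice.
func percentile(samples []time.Duration, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...) // copy; leave input untouched
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// suggestTimeoutSeconds doubles the p99 latency (an assumed headroom factor)
// and rounds up to whole seconds, with a floor of 1 second.
func suggestTimeoutSeconds(samples []time.Duration) int {
	p99 := percentile(samples, 99)
	secs := int(math.Ceil(2 * p99.Seconds()))
	if secs < 1 {
		secs = 1
	}
	return secs
}

// sampleLatencies returns made-up /health response times for the demo.
func sampleLatencies() []time.Duration {
	return []time.Duration{
		40 * time.Millisecond, 55 * time.Millisecond, 60 * time.Millisecond,
		80 * time.Millisecond, 120 * time.Millisecond, 300 * time.Millisecond,
		900 * time.Millisecond, 1800 * time.Millisecond,
	}
}

func main() {
	samples := sampleLatencies()
	fmt.Println("p50:", percentile(samples, 50))
	fmt.Println("p99:", percentile(samples, 99))
	fmt.Println("suggested timeoutSeconds:", suggestTimeoutSeconds(samples))
}
```

With these made-up samples the slowest response takes 1.8 seconds, so a 1-second timeout would fail on it even though the app is healthy.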
Default probe settings are too aggressive for most apps:
# Default (too tight):
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 0   # fires immediately!
  periodSeconds: 10        # every 10 seconds
  timeoutSeconds: 1        # 1 second to respond!
  failureThreshold: 3      # 3 failures = restart

Fix: tune based on your app's actual behavior:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30  # wait for app to start
  periodSeconds: 15        # check every 15s
  timeoutSeconds: 5        # 5 seconds to respond
  failureThreshold: 3      # 3 consecutive failures = restart
  successThreshold: 1      # 1 success to become healthy

Case 2: Health Endpoint Checks Too Much
Your /health endpoint is doing too much — checking DB connections, external APIs, cache. Any of those being slow causes the probe to fail.
Bad health endpoint:
// Checks everything: too fragile
func healthHandler(w http.ResponseWriter, r *http.Request) {
	if err := db.Ping(); err != nil { // DB might be slow!
		http.Error(w, "DB unhealthy", http.StatusServiceUnavailable)
		return
	}
	if err := redis.Ping(); err != nil { // Redis might be slow!
		http.Error(w, "Cache unhealthy", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

Fix: separate liveness from readiness:
// Liveness: is the process alive? (bare minimum)
func livenessHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK) // If this endpoint responds, the app is alive
	w.Write([]byte(`{"status":"alive"}`))
}

// Readiness: can this pod accept traffic? (check dependencies)
func readinessHandler(w http.ResponseWriter, r *http.Request) {
	if err := db.Ping(); err != nil {
		http.Error(w, "DB not ready", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
	w.Write([]byte(`{"status":"ready"}`))
}

Then point each probe at its own endpoint:

livenessProbe:
  httpGet:
    path: /health/live    # Simple alive check
    port: 8080
readinessProbe:
  httpGet:
    path: /health/ready   # Full dependency check
    port: 8080

A liveness failure restarts the container. A readiness failure only stops traffic from being sent to the pod; nothing is restarted.
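To make the split concrete, here is a minimal, self-contained sketch of both endpoints on one mux, with the readiness check bounded by a timeout so a slow dependency cannot hang the handler. The pinger interface, the upDB and downDB stand-ins, the statusOf helper, and the 2-second bound are all assumptions for illustration. In a real service the pinger would typically be a *sql.DB, which provides PingContext.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// pinger is the one method this sketch needs from a dependency client.
// *sql.DB satisfies it through its PingContext method.
type pinger interface {
	PingContext(ctx context.Context) error
}

// newHealthMux registers the liveness and readiness endpoints.
func newHealthMux(db pinger) *http.ServeMux {
	mux := http.NewServeMux()

	// Liveness: no dependency calls, so it only fails if the process is stuck.
	mux.HandleFunc("/health/live", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: dependency check, bounded by an assumed 2-second timeout.
	mux.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "DB not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// downDB is a hypothetical stand-in for an unreachable database.
type downDB struct{}

func (downDB) PingContext(context.Context) error { return errors.New("connection refused") }

// upDB is a hypothetical stand-in for a healthy database.
type upDB struct{}

func (upDB) PingContext(context.Context) error { return nil }

// statusOf sends an in-memory GET to the handler and returns the status code.
func statusOf(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newHealthMux(downDB{})
	fmt.Println("liveness status with DB down: ", statusOf(mux, "/health/live"))
	fmt.Println("readiness status with DB down:", statusOf(mux, "/health/ready"))
}
```

With the database down, the liveness endpoint still answers 200 while readiness answers 503, so the pod is taken out of rotation without being restarted.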
Case 3: Java / JVM Apps — Startup GC Pauses
JVM apps have garbage collection pauses. During a full GC, your app can be unresponsive for 1–5 seconds. If your probe timeout is 1s, this triggers a false failure.
Fix:
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 60  # JVM takes time to start
  periodSeconds: 20
  timeoutSeconds: 10       # Allow for GC pauses
  failureThreshold: 5      # Be lenient
startupProbe:              # Handle slow startup separately
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  failureThreshold: 30     # 30 * 10s = 5 minutes max startup time
  periodSeconds: 10

The startupProbe disables liveness and readiness checks until the app has started. Perfect for slow-starting JVM apps.
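The "30 * 10s" arithmetic in the startupProbe comment works for any probe and shows what a lenient setting costs. The sketch below multiplies failureThreshold by periodSeconds for the two probes above. The probe struct and failureWindow method are hypothetical names, and the result is a rough figure that ignores timeoutSeconds and initialDelaySeconds.

```go
package main

import (
	"fmt"
	"time"
)

// probe holds the two timing fields of a Kubernetes probe used in this estimate.
type probe struct {
	PeriodSeconds    int
	FailureThreshold int
}

// failureWindow approximates how long a probe must keep failing before the
// kubelet acts on it: failureThreshold * periodSeconds.
func (p probe) failureWindow() time.Duration {
	return time.Duration(p.PeriodSeconds*p.FailureThreshold) * time.Second
}

func main() {
	startup := probe{PeriodSeconds: 10, FailureThreshold: 30}
	liveness := probe{PeriodSeconds: 20, FailureThreshold: 5}

	fmt.Println("startup budget: ", startup.failureWindow())  // 5m0s
	fmt.Println("liveness window:", liveness.failureWindow()) // 1m40s
}
```

The liveness figure is the price of leniency. With these settings a truly hung JVM keeps failing for roughly 100 seconds before it is restarted, so raise failureThreshold only as far as your GC pauses require.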
Case 4: Exec Probe Spawning Too Many Processes
If you use command-based probes:
# Problematic: spawns a new process every 10 seconds
livenessProbe:
  exec:
    command:
    - sh
    - -c
    - "curl -f http://localhost:8080/health || exit 1"

Under load, these processes pile up and can exhaust system resources, causing the main process to slow down and triggering more probe failures.
Fix: Use httpGet instead of exec for HTTP checks:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  # Much lighter than spawning curl

Case 5: Resource Pressure Causing Slow Responses
CPU throttling due to low CPU limits causes the app to respond slowly. The probe times out.
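The kernel records throttling in the container's cgroup cpu.stat file: nr_periods counts CPU enforcement periods and nr_throttled counts the periods in which the container hit its limit. The following sketch turns the file contents into a throttled fraction. The function name and the sample values are assumptions. On cgroup v2 the file is usually /sys/fs/cgroup/cpu.stat inside the container.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// throttledFraction parses the text of a cgroup cpu.stat file and returns
// nr_throttled / nr_periods. It returns an error if nr_periods is missing.
func throttledFraction(cpuStat string) (float64, error) {
	var periods, throttled uint64
	sawPeriods := false

	sc := bufio.NewScanner(strings.NewReader(cpuStat))
	for sc.Scan() {
		fields := strings.Fields(sc.Text()) // each line is "<key> <value>"
		if len(fields) != 2 {
			continue
		}
		n, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue // skip lines whose value is not a plain counter
		}
		switch fields[0] {
		case "nr_periods":
			periods, sawPeriods = n, true
		case "nr_throttled":
			throttled = n
		}
	}
	if !sawPeriods {
		return 0, errors.New("nr_periods not found in cpu.stat")
	}
	if periods == 0 {
		return 0, nil // no enforcement periods recorded yet
	}
	return float64(throttled) / float64(periods), nil
}

func main() {
	// Made-up cpu.stat contents for the demo.
	sample := "usage_usec 9000000\nnr_periods 200\nnr_throttled 50\nthrottled_usec 1200000\n"
	frac, err := throttledFraction(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("throttled in %.0f%% of CPU periods\n", frac*100)
}
```

A fraction that stays well above zero while probes are timing out points at the CPU limit rather than the application.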
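# Compare CPU usage with the container's CPU limit (usage pinned at the limit suggests throttling)
kubectl top pod <pod-name> -n <namespace>

# Confirm throttling from the cgroup counters (cgroup v2 path; nr_throttled keeps rising while throttled)
kubectl exec <pod-name> -n <namespace> -- cat /sys/fs/cgroup/cpu.stat

# Check whether the last restart was an OOMKill (memory limit too low)
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Last State"

Fix: raise the CPU limit if the container is being throttled, and the memory limit if it is being OOMKilled: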
resources:
  requests:
    cpu: "100m"
    memory: "256Mi"
  limits:
    cpu: "500m"       # Increase if app is CPU throttled
    memory: "512Mi"

Recommended Probe Config by App Type
# Node.js / Go (fast startup)
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3

# Java / Spring Boot (slow startup)
startupProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 20
  timeoutSeconds: 10
  failureThreshold: 5

# Python / Django
livenessProbe:
  httpGet:
    path: /health/
    port: 8000
  initialDelaySeconds: 20
  periodSeconds: 15
  timeoutSeconds: 8
  failureThreshold: 3

Random probe failures are almost always a configuration issue, not an app bug. Tune initialDelaySeconds, timeoutSeconds, and failureThreshold based on your app's actual behavior, not the defaults.