Kubernetes Troubleshooting Guide 2026: Fix Every Common Problem
The most complete Kubernetes troubleshooting guide for 2026. Learn how to diagnose and fix Pod crashes, ImagePullBackOff, OOMKilled, CrashLoopBackOff, networking issues, PVC problems, node NotReady, and more — with exact kubectl commands.
Kubernetes is powerful — but when something breaks at 2 AM, you need answers fast.
This guide covers every common Kubernetes problem you'll encounter in production, with exact diagnostic commands and step-by-step fixes. Bookmark it. You'll need it.
How to Read This Guide
Each section follows the same pattern:
- What the error means — understand it before you fix it
- How to diagnose — exact `kubectl` commands
- How to fix — step-by-step solution
1. Pod Stuck in Pending
What It Means
A Pending Pod hasn't been scheduled onto any node yet. The scheduler is trying to find a suitable node but can't.
Why It Happens
- Not enough CPU or memory on any node
- Node selectors / affinity rules don't match any node
- Tolerations missing for tainted nodes
- No PersistentVolume available to bind the PVC
How to Diagnose
```shell
kubectl describe pod <pod-name> -n <namespace>
```
Look at the Events section at the bottom. The scheduler writes its reason there.
Common messages:
- `0/3 nodes are available: 3 Insufficient cpu` → not enough CPU
- `0/3 nodes are available: 3 node(s) had taint...` → missing toleration
- `0/3 nodes are available: 3 pod has unbound PVC` → PVC not bound
Check node resource availability:
```shell
kubectl describe nodes | grep -A 5 "Allocated resources"
```
How to Fix
CPU/Memory shortage — either reduce resource requests, or add more nodes:
```yaml
resources:
  requests:
    cpu: "100m"        # reduce from a higher value
    memory: "128Mi"
```
Taint issues — add the correct toleration to your Pod spec:
```yaml
tolerations:
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Exists"
    effect: "NoSchedule"
```
PVC not bound — check PVC status:
```shell
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
```
2. CrashLoopBackOff
What It Means
Your container keeps starting, crashing, and Kubernetes keeps restarting it. After a few cycles, it enters CrashLoopBackOff — Kubernetes backs off exponentially before retrying.
The container itself is the problem, not Kubernetes.
How to Diagnose
Get the crash logs:
```shell
kubectl logs <pod-name> -n <namespace>
```
If the container already restarted, get logs from the previous run:
```shell
kubectl logs <pod-name> -n <namespace> --previous
```
Check the exit code:
```shell
kubectl describe pod <pod-name> -n <namespace>
```
Look for Last State → Exit Code. Common codes:
- `1` → application error (check app logs)
- `137` → OOMKilled (out of memory) or SIGKILL
- `139` → segmentation fault
- `143` → SIGTERM (graceful shutdown, usually fine)
- `255` → process failed to start (check command/entrypoint)
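The signal-related codes follow a simple rule: an exit code above 128 means the process died from a signal, and the code is 128 plus the signal number. A quick local demonstration of that arithmetic (nothing Kubernetes-specific, runs in any POSIX shell):

```shell
# 137 = 128 + 9  (SIGKILL — what the OOM killer sends)
# 143 = 128 + 15 (SIGTERM — the normal shutdown signal)
sh -c 'kill -9 $$' || code=$?   # child shell kills itself with SIGKILL
echo "exit code: $code"
```

So when kubectl reports 137, subtract 128 and you get signal 9: the container was hard-killed, most often by the OOM killer.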
How to Fix
Application crashing — read the logs. It's almost always a missing env var, config file, or failed health check.
Exit code 137 (OOM) — increase memory limit:
```yaml
resources:
  limits:
    memory: "512Mi"    # increase this
  requests:
    memory: "256Mi"
```
Wrong command/entrypoint — run an interactive shell to debug:
```shell
# Override the entrypoint temporarily
kubectl run debug --image=your-image:tag --rm -it --restart=Never -- /bin/sh
```
Liveness probe too aggressive — the probe kills the container before it's ready:
```yaml
livenessProbe:
  initialDelaySeconds: 30   # give the app time to start
  periodSeconds: 10
  failureThreshold: 5       # allow more failures
```
3. ImagePullBackOff / ErrImagePull
What It Means
Kubernetes can't pull the container image. It tried, failed, and now it's backing off before retrying.
ErrImagePull = first failure. ImagePullBackOff = repeated failures.
Why It Happens
- Image tag doesn't exist in the registry
- Wrong registry URL
- Missing imagePullSecret for private registries
- Network can't reach the registry from the node
- Rate limiting (Docker Hub without auth)
How to Diagnose
```shell
kubectl describe pod <pod-name> -n <namespace>
```
Events will show exactly what failed:
- `Failed to pull image "myapp:latest": rpc error...` → image doesn't exist
- `unauthorized: authentication required` → missing or wrong credentials
- `429 Too Many Requests` → Docker Hub rate limit
How to Fix
Wrong image tag — verify the image exists:
```shell
# Check from inside a node
docker pull your-image:tag
# Or use crane/skopeo
crane ls your-registry/your-image
```
Private registry missing secret — create the pull secret:
```shell
kubectl create secret docker-registry regcred \
  --docker-server=<registry-url> \
  --docker-username=<username> \
  --docker-password=<password> \
  --docker-email=<email> \
  -n <namespace>
```
Then reference it in your Pod spec:
```yaml
spec:
  imagePullSecrets:
    - name: regcred
  containers:
    - name: myapp
      image: private-registry.com/myapp:v1.0
```
Docker Hub rate limiting — authenticate with Docker Hub:
```shell
kubectl create secret docker-registry dockerhub-creds \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=<your-dockerhub-username> \
  --docker-password=<your-dockerhub-token>
```
4. OOMKilled (Out of Memory)
What It Means
Your container exceeded its memory limit and the kernel killed it. This shows as exit code 137.
How to Diagnose
```shell
kubectl describe pod <pod-name> -n <namespace>
```
You'll see:
```
Last State:  Terminated
  Reason:    OOMKilled
  Exit Code: 137
```
Check current memory usage before deciding the limit:
```shell
kubectl top pod <pod-name> -n <namespace>
kubectl top nodes
```
How to Fix
Increase the memory limit — but first understand why it's using so much:
```yaml
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "1Gi"    # increase the limit
```
Fix a memory leak — if the container's memory keeps growing indefinitely, it's a code bug. Add monitoring (Prometheus + Grafana) to graph memory over time.
Java applications — JVM doesn't respect container limits by default. Set heap size explicitly:
```yaml
env:
  - name: JAVA_OPTS
    value: "-Xms256m -Xmx512m"
```
Or use JVM container-awareness flags:
```yaml
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:MaxRAMPercentage=75.0"
```
5. Pod Stuck in Terminating
What It Means
The Pod received a delete signal but isn't shutting down. It's stuck waiting for something — usually a finalizer or a slow graceful shutdown.
How to Diagnose
```shell
kubectl describe pod <pod-name> -n <namespace>
```
Check for Finalizers in the metadata section. If a finalizer is set but the controller managing it is gone, the Pod is stuck forever.
How to Fix
Force delete — use this only when you're sure the Pod is truly stuck:
```shell
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0
```
Remove stuck finalizers:
```shell
kubectl patch pod <pod-name> -n <namespace> \
  -p '{"metadata":{"finalizers":[]}}' \
  --type=merge
```
Slow graceful shutdown — if your app takes too long to stop, increase the termination grace period:
```yaml
spec:
  terminationGracePeriodSeconds: 120   # default is 30
```
Also make sure your app handles SIGTERM correctly and shuts down within the grace period.
6. Service Not Routing Traffic
What It Means
Your Pod is Running, your Service exists, but traffic isn't reaching the container.
How to Diagnose
Step 1 — Check if Endpoints exist:
```shell
kubectl get endpoints <service-name> -n <namespace>
```
If ENDPOINTS shows `<none>`, the Service selector doesn't match any Pod labels.
Step 2 — Verify selector matches Pod labels:
```shell
kubectl get pods -n <namespace> --show-labels
kubectl describe svc <service-name> -n <namespace>
```
Compare the Selector in the Service with the Pod labels. They must match exactly.
Step 3 — Test from inside the cluster:
```shell
kubectl run test --image=busybox --rm -it --restart=Never -- \
  wget -O- http://<service-name>.<namespace>.svc.cluster.local:<port>
```
Step 4 — Check NetworkPolicy:
```shell
kubectl get networkpolicies -n <namespace>
```
A NetworkPolicy might be blocking traffic.
How to Fix
Selector mismatch — fix the label:
```yaml
# Service selector
selector:
  app: myapp       # must match exactly

# Pod label
metadata:
  labels:
    app: myapp     # same key and value
```
Wrong port — port is what the Service exposes, targetPort is what the container listens on:
```yaml
spec:
  ports:
    - port: 80           # Service port (what clients use)
      targetPort: 8080   # container port (what the app listens on)
```
NetworkPolicy blocking — add an allow rule:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-traffic
spec:
  podSelector:
    matchLabels:
      app: myapp
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: frontend
      ports:
        - port: 8080
```
7. Ingress Not Working
What It Means
You have an Ingress resource but traffic isn't reaching your Service from outside the cluster.
How to Diagnose
Check if Ingress controller is running:
```shell
kubectl get pods -n ingress-nginx
# or
kubectl get pods -n kube-system | grep ingress
```
Check the Ingress resource:
```shell
kubectl describe ingress <ingress-name> -n <namespace>
```
Look for Address — if empty, the Ingress controller hasn't assigned an IP yet.
Check Ingress controller logs:
```shell
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller
```
How to Fix
No address assigned — check if the LoadBalancer Service for the Ingress controller has an External IP:
```shell
kubectl get svc -n ingress-nginx
```
If it shows `<pending>`, your cloud provider isn't provisioning a load balancer (check cloud quota/config).
TLS certificate issues — check cert-manager:
```shell
kubectl describe certificate <cert-name> -n <namespace>
kubectl describe certificaterequest -n <namespace>
```
Path not matching — be explicit about pathType:
```yaml
rules:
  - host: myapp.example.com
    http:
      paths:
        - path: /
          pathType: Prefix   # use Prefix, not Exact, for most cases
          backend:
            service:
              name: myapp-svc
              port:
                number: 80
```
8. Node in NotReady State
What It Means
A node has stopped reporting to the control plane. Pods on that node will eventually be evicted and rescheduled elsewhere.
How to Diagnose
```shell
kubectl get nodes
kubectl describe node <node-name>
```
In Conditions, look at the Ready condition. Common reasons:
- `KubeletNotReady` → kubelet stopped
- `NetworkPluginNotReady` → CNI plugin issue
- `DiskPressure` → node running out of disk
- `MemoryPressure` → node running out of memory
- `PIDPressure` → too many processes
Check node conditions in detail:
```shell
kubectl get node <node-name> -o json | jq '.status.conditions'
```
How to Fix
SSH into the node (if accessible) and check kubelet:
```shell
systemctl status kubelet
journalctl -u kubelet -n 100
```
Kubelet not running — restart it:
```shell
systemctl restart kubelet
```
Disk pressure — clean up unused images and containers:
```shell
# On the node
docker system prune -a
# Or crictl (for containerd)
crictl rmi --prune
```
Out of inodes (often confused with disk space):
```shell
df -i    # check inode usage
```
Memory pressure — find memory-hungry processes:
```shell
ps aux --sort=-%mem | head -20
```
9. PersistentVolumeClaim Stuck in Pending
What It Means
Your PVC can't find a PersistentVolume to bind to. No storage has been provisioned.
How to Diagnose
```shell
kubectl describe pvc <pvc-name> -n <namespace>
```
Common messages:
- `no persistent volumes available...` → no PV matches
- `storageclass.storage.k8s.io "fast" not found` → wrong StorageClass name
- `waiting for a volume to be created...` → dynamic provisioner is working (just wait)
Check available StorageClasses:
```shell
kubectl get storageclass
```
Check existing PVs:
```shell
kubectl get pv
```
How to Fix
Wrong StorageClass name — use the exact name from `kubectl get storageclass`:
```yaml
spec:
  storageClassName: standard   # must match exactly
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```
No default StorageClass — set one as default:
```shell
kubectl patch storageclass <name> \
  -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
```
AccessMode mismatch — EBS volumes only support ReadWriteOnce. NFS supports ReadWriteMany:
```yaml
accessModes:
  - ReadWriteOnce      # for cloud block storage (AWS EBS, GCE PD)
  # - ReadWriteMany    # for NFS or cloud file storage
```
10. DNS Resolution Failures
What It Means
Pods can't resolve service names. myservice.default.svc.cluster.local returns no answer.
How to Diagnose
Run a DNS test Pod:
```shell
kubectl run dnstest --image=busybox --rm -it --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local
```
Check CoreDNS is running:
```shell
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
```
Check the CoreDNS ConfigMap:
```shell
kubectl get configmap coredns -n kube-system -o yaml
```
How to Fix
CoreDNS Pod not running — restart it:
```shell
kubectl rollout restart deployment/coredns -n kube-system
```
Custom DNS config in Pod — sometimes apps override DNS. Ensure your Pod uses cluster DNS:
```yaml
spec:
  dnsPolicy: ClusterFirst   # default; ensures cluster DNS is used
```
ndots setting — if short names don't resolve, check /etc/resolv.conf inside the Pod:
```shell
kubectl exec <pod-name> -- cat /etc/resolv.conf
```
11. HorizontalPodAutoscaler Not Scaling
What It Means
Your HPA exists but the replica count isn't changing even under load.
How to Diagnose
```shell
kubectl describe hpa <hpa-name> -n <namespace>
```
Look for:
- `unable to get metrics for resource cpu` → metrics-server not installed
- `the HPA was unable to compute the replica count` → metric fetch failed
- Current vs desired replica count
Check metrics-server:
```shell
kubectl get pods -n kube-system | grep metrics-server
kubectl top pods -n <namespace>
```
How to Fix
Metrics server not installed — install it:
```shell
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```
Metrics server failing on TLS — add the `--kubelet-insecure-tls` flag (dev only):
```yaml
args:
  - --kubelet-insecure-tls
```
No resource requests set — HPA needs `requests.cpu` to calculate utilization:
```yaml
resources:
  requests:
    cpu: "100m"   # REQUIRED for CPU-based HPA
```
HPA and Deployment replicas conflict — if you set replicas in the Deployment AND use HPA, the two will fight over the replica count. Remove replicas from the Deployment manifest once the HPA is active.
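For reference, here is a minimal autoscaling/v2 HPA manifest with the pieces above in place (names and thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp              # must match the Deployment name exactly
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU passes 70% of requests
```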
12. RBAC Permission Denied
What It Means
A Pod or user is getting 403 Forbidden or Error from server (Forbidden) when trying to access the Kubernetes API.
How to Diagnose
```shell
# Check what a ServiceAccount can do
kubectl auth can-i list pods \
  --as=system:serviceaccount:<namespace>:<serviceaccount-name> \
  -n <namespace>

# List all permissions of a role
kubectl describe role <role-name> -n <namespace>
kubectl describe clusterrole <clusterrole-name>

# Check what's bound to a ServiceAccount
kubectl get rolebindings,clusterrolebindings -A | grep <serviceaccount-name>
```
How to Fix
Create a Role with the required permissions:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]   # the log subresource is "pods/log", not "pods/logs"
    verbs: ["get", "list", "watch"]
```
Bind the Role to a ServiceAccount:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: myapp-sa
    namespace: production
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```
Never use wildcard verbs in production (`verbs: ["*"]`) — grant only what the application actually needs.
Quick Reference: kubectl Diagnostic Commands
Here's a cheat sheet for when you're in the middle of an incident:
```shell
# Pod status and events
kubectl get pods -n <ns> -o wide
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous

# Node status
kubectl get nodes -o wide
kubectl describe node <node>
kubectl top nodes

# Service and endpoints
kubectl get svc,endpoints -n <ns>

# Check all resources in a namespace
kubectl get all -n <ns>

# Follow logs in real time
kubectl logs -f <pod> -n <ns>

# Execute commands inside a running container
kubectl exec -it <pod> -n <ns> -- /bin/sh

# Port-forward to debug locally
kubectl port-forward pod/<pod> 8080:8080 -n <ns>

# Get events sorted by time
kubectl get events -n <ns> --sort-by='.lastTimestamp'

# Check resource usage
kubectl top pods -n <ns> --sort-by=memory

# Dry run — test manifests without applying
kubectl apply -f deployment.yaml --dry-run=client

# View the diff before applying changes
kubectl diff -f deployment.yaml
```
Debugging Strategy: The 5-Step Method
When something breaks and you don't know where to start:
Step 1 — What's broken?
```shell
kubectl get pods,svc,ingress -n <namespace>
```
Find what's in a bad state.
Step 2 — What's the error?
```shell
kubectl describe <resource> <name> -n <namespace>
```
Read the Events section. The answer is usually there.
Step 3 — What do the logs say?
```shell
kubectl logs <pod> -n <namespace> --previous
```
Application logs contain the real error.
Step 4 — Can I reach it?
```shell
kubectl port-forward pod/<pod> 8080:8080 -n <namespace>
curl localhost:8080/health
```
Test connectivity directly.
Step 5 — What changed?
```shell
kubectl rollout history deployment/<name> -n <namespace>
kubectl rollout undo deployment/<name> -n <namespace>
```
If it was working before, roll back.
Prevention: The 5 Things That Prevent 90% of Incidents
Most Kubernetes incidents are caused by the same mistakes:
- Always set resource requests and limits — without them, Pods steal resources and cause cascading failures
- Use liveness and readiness probes — Kubernetes can't heal what it can't detect
- Never use `:latest` in production — pin image versions for reproducible deployments
- Set Pod Disruption Budgets — prevent all replicas from being evicted at once during node drains
- Monitor with Prometheus + Grafana — know about problems before your users do
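A minimal PodDisruptionBudget covering the fourth point above, as a sketch (name and selector are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 2        # voluntary disruptions (e.g. node drains) must leave >= 2 Pods running
  selector:
    matchLabels:
      app: myapp         # must match the Pods you want protected
```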
Kubernetes troubleshooting is a skill you build over time. Every incident teaches you something. Save this guide, run the commands, read the errors carefully — and most problems reveal themselves within minutes.
Got a specific error not covered here? Reach out at hello@devopsboys.com.
Recommended Course
If you want to go deeper on Kubernetes troubleshooting, networking internals, and real production scenarios — KodeKloud has hands-on labs where you debug broken clusters in a real environment. It is the closest thing to real on-call experience you can get without being on-call.