
Kubernetes Node NotReady — How to Debug and Fix It (2026)

Your Kubernetes node is showing NotReady status. Here are the most common causes and the exact fix for each one.

DevOpsBoys · Apr 7, 2026 · 3 min read

A node in NotReady state means Kubernetes can't use it — pods won't schedule there, and existing pods may be evicted. Here's how to diagnose and fix it fast.


Step 1: Check Node Status

bash
kubectl get nodes
kubectl describe node <node-name>

Look at the Conditions section:

Conditions:
  Type              Status
  MemoryPressure    False
  DiskPressure      False
  PIDPressure       False
  Ready             False   ← this is the problem

Check Events at the bottom for clues.
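On a large cluster it helps to filter for only the unhealthy nodes. A minimal sketch, shown here against a captured sample of `kubectl get nodes` output — on a real cluster, pipe the live command into the same awk filter:

```bash
# Captured sample output (illustrative) — on a real cluster run:
#   kubectl get nodes | awk 'NR>1 && $2 != "Ready" {print $1, $2}'
sample='NAME     STATUS     ROLES    AGE   VERSION
node-a   Ready      <none>   90d   v1.29.2
node-b   NotReady   <none>   90d   v1.29.2'

# Skip the header row, print any node whose STATUS column is not exactly Ready
echo "$sample" | awk 'NR>1 && $2 != "Ready" {print $1, $2}'
# → node-b NotReady
```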


Cause 1: kubelet Not Running

The kubelet is the agent on each node that reports to the control plane. If it stops, the node goes NotReady.

bash
# SSH into the node
systemctl status kubelet
journalctl -u kubelet -n 50 --no-pager

Fix:

bash
systemctl restart kubelet
systemctl enable kubelet

If kubelet fails to start, check logs for the root cause — usually misconfiguration or certificate issues.
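A quick way to triage those logs is to grep for the usual failure signatures. The log lines below are illustrative samples — on a real node, pipe `journalctl -u kubelet -n 200 --no-pager` into the same grep:

```bash
# Sample kubelet error lines (illustrative, not from a real node)
sample_logs='E0407 kubelet.go:132 failed to run Kubelet: unable to load bootstrap kubeconfig
E0407 bootstrap.go:240 part of the existing bootstrap client certificate is expired
E0407 server.go:145 failed to parse kubelet flag: unknown flag: --feature-gate'

# Signatures that usually point at the root cause: bad certs or bad config
echo "$sample_logs" | grep -Ei 'certificate|expired|unable to load|unknown flag'
```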


Cause 2: Disk Pressure

Node disk is nearly full — kubelet reports DiskPressure and marks itself NotReady.

bash
df -h                           # check disk usage
du -sh /var/lib/containerd/*    # containerd taking space
du -sh /var/lib/docker/*        # docker taking space (Docker runtime only)
du -sh /var/log/pods/*          # pod logs taking space

Fix:

bash
# Remove stopped containers and unused images (Docker runtime)
docker system prune -af

# On containerd-based nodes, prune unused images instead
crictl rmi --prune

# Remove old pod logs
find /var/log/pods -name "*.log" -mtime +7 -delete

Long-term: add a larger disk or configure log rotation.
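For log rotation and image cleanup, the kubelet itself can do both. A sketch of the relevant KubeletConfiguration fields — the path is the kubeadm default and the thresholds are example values, tune them for your disks:

```yaml
# /var/lib/kubelet/config.yaml (kubeadm default location) — restart kubelet after editing
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 50Mi        # rotate each container log at 50 MiB
containerLogMaxFiles: 3          # keep at most 3 rotated files per container
imageGCHighThresholdPercent: 80  # start image garbage collection at 80% disk usage
imageGCLowThresholdPercent: 70   # garbage-collect down to 70%
```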


Cause 3: Memory Pressure

Node has less free memory than the kubelet's eviction threshold (default: memory.available < 100Mi).

bash
free -h
kubectl describe node <node-name> | grep -A 5 "Allocatable"

Fix: Kill memory-hungry pods or add more nodes. Set proper resource limits on all pods to prevent runaway memory usage.
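Requests and limits look like this in a pod spec — the pod name, image, and values are illustrative, tune them per workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app        # hypothetical pod name
spec:
  containers:
  - name: app
    image: nginx:1.27
    resources:
      requests:
        memory: "256Mi"    # what the scheduler reserves on the node
        cpu: "250m"
      limits:
        memory: "512Mi"    # container is OOM-killed above this
        cpu: "500m"
```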


Cause 4: CNI Plugin Not Running

If the network plugin (Calico, Cilium, Flannel) crashes, pod networking breaks, the kubelet reports "container runtime network not ready", and the node goes NotReady.

bash
kubectl get pods -n kube-system
kubectl logs -n kube-system <calico-node-pod>

Fix: Restart the CNI pods:

bash
kubectl rollout restart daemonset/calico-node -n kube-system
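If the CNI pods look healthy but the node still reports a network problem, check that the plugin actually wrote its config onto the node. A quick sketch — /etc/cni/net.d is the standard CNI config directory:

```bash
# The kubelet reports "container runtime network not ready" when this
# directory is empty — it's where the CNI plugin drops its config files
CNI_DIR=/etc/cni/net.d
if ls "$CNI_DIR"/*.conf* >/dev/null 2>&1; then
  echo "CNI config present:"
  ls "$CNI_DIR"
else
  echo "no CNI config in $CNI_DIR — plugin not (re)installed on this node"
fi
```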

Cause 5: Node Certificate Expired

kubelet uses TLS certificates to communicate with the API server. Expired certs = NotReady.

bash
# On the node
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates

Fix: Renew certificates (for kubeadm clusters, run on a control-plane node):

bash
kubeadm certs renew all
systemctl restart kubelet
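To turn the date check into a pass/fail alert, `openssl x509 -checkend N` exits non-zero if the cert expires within N seconds. Demonstrated here on a throwaway self-signed cert — on a real node, point CERT at /var/lib/kubelet/pki/kubelet-client-current.pem instead:

```bash
# Generate a throwaway self-signed cert valid 365 days, purely for illustration
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo" -days 365 \
  -keyout /tmp/demo-key.pem -out /tmp/demo-cert.pem 2>/dev/null

CERT=/tmp/demo-cert.pem   # on a node: /var/lib/kubelet/pki/kubelet-client-current.pem
# -checkend 604800 → non-zero exit if the cert expires within the next 7 days
if openssl x509 -in "$CERT" -noout -checkend 604800 >/dev/null; then
  echo "cert valid for at least 7 more days"
else
  echo "cert expires within 7 days — renew now"
fi
```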

Cause 6: Network Connectivity Issue

Node can't reach the API server — firewall change, network partition, or cloud provider issue.

bash
# From the node — any HTTP response (even 401) proves the network path works;
# a timeout or "connection refused" means connectivity is broken
curl -k --max-time 5 https://<api-server-ip>:6443/healthz

Fix: Check security groups (AWS), firewall rules, or VPN connectivity.


Cause 7: Kernel OOM Kill

If a container ate all memory, the kernel OOM killer may have killed critical system processes.

bash
# On the node — kernel logs say "Out of memory: Killed process ..."
dmesg | grep -iE "out of memory|oom-kill"
journalctl -k | grep -i oom

Fix: Identify and fix the memory-leaking pod. Set memory limits on all pods.


Quick Recovery Commands

bash
# Check node conditions
kubectl describe node <node> | grep -A 20 Conditions
 
# Check kubelet logs
journalctl -u kubelet -f
 
# Force drain node (move pods elsewhere)
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
 
# Make the node schedulable again after the fix
kubectl uncordon <node>

Prevention

  1. Set resource requests and limits on ALL pods
  2. Monitor node disk, memory, PID with Prometheus node-exporter
  3. Alert at 80% disk/memory (before eviction threshold)
  4. Use cluster autoscaler — add nodes before existing ones hit pressure
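Point 3 can be implemented as Prometheus alerting rules on standard node-exporter metrics — a sketch with example thresholds; the alert names, group name, and labels are illustrative:

```yaml
groups:
- name: node-pressure                    # illustrative group name
  rules:
  - alert: NodeDiskAlmostFull
    expr: |
      (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
        / node_filesystem_size_bytes) < 0.20
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Disk on {{ $labels.instance }} is over 80% full"
  - alert: NodeMemoryAlmostFull
    expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.20
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Memory on {{ $labels.instance }} is over 80% used"
```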