AWS EKS Worker Nodes Not Joining the Cluster: Complete Fix Guide

EKS worker nodes stuck in NotReady or not appearing at all? Here are all the causes and step-by-step fixes for node bootstrap failures.

DevOpsBoys · Apr 21, 2026 · 6 min read

You create an EKS node group. The EC2 instances launch. But kubectl get nodes shows nothing — or the nodes show up as NotReady and stay that way.

This is one of the most frustrating EKS issues because the failure happens on the node, not in the control plane, and the errors can come from IAM, networking, bootstrap scripts, or AMI issues.

Here's every cause and how to fix it.


Quick Diagnosis First

bash
# Check if nodes appear at all
kubectl get nodes
 
# If nodes show but are NotReady
kubectl describe node <node-name>
# Look for: Conditions → Ready: False, and Events section
 
# Check node group status in AWS console
aws eks describe-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --query 'nodegroup.status'
 
# Check EC2 instances
aws ec2 describe-instances \
  --filters "Name=tag:eks:cluster-name,Values=my-cluster" \
  --query 'Reservations[].Instances[].{ID:InstanceId,State:State.Name,Status:NetworkInterfaces[0].Status}'

If nodes don't appear in kubectl get nodes at all, the problem is registration: the node can't reach or authenticate with the API server.

If nodes appear but are NotReady, the node joined but the kubelet is unhealthy.


Cause 1 — Missing IAM Role Permissions (Most Common)

Worker nodes need an IAM role with three specific policies to join EKS.

Diagnose

bash
# Check node role
aws iam list-attached-role-policies \
  --role-name my-eks-node-role \
  --query 'AttachedPolicies[].PolicyName'

Fix — Attach Required Policies

bash
# Required policies for EKS worker nodes
aws iam attach-role-policy \
  --role-name my-eks-node-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
 
aws iam attach-role-policy \
  --role-name my-eks-node-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
 
aws iam attach-role-policy \
  --role-name my-eks-node-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly

Also add the node IAM role to the aws-auth ConfigMap:

bash
# Check aws-auth ConfigMap
kubectl get configmap aws-auth -n kube-system -o yaml

If the node role isn't in aws-auth, nodes can't authenticate with the API server:

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::123456789012:role/my-eks-node-role
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes

bash
kubectl apply -f aws-auth.yaml
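
A quick way to sanity-check the mapping is to grep the rendered ConfigMap for the node role ARN. The contents below are a trimmed, illustrative sample; in practice, save the real output of kubectl get configmap aws-auth -n kube-system -o yaml and grep that instead:

```shell
# Illustrative aws-auth contents (sample only). In practice:
#   kubectl get configmap aws-auth -n kube-system -o yaml > /tmp/aws-auth.yaml
cat > /tmp/aws-auth.yaml <<'EOF'
data:
  mapRoles: |
    - rolearn: arn:aws:iam::123456789012:role/my-eks-node-role
      username: system:node:{{EC2PrivateDNSName}}
EOF

if grep -q 'role/my-eks-node-role' /tmp/aws-auth.yaml; then
  result="node role present in aws-auth"
else
  result="node role MISSING - nodes cannot authenticate"
fi
echo "$result"
```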

Cause 2 — Node in Wrong Subnet or Security Group

Nodes must be able to reach the EKS API server endpoint on port 443.

Diagnose

SSH into the failing node (or use SSM):

bash
# Connect via SSM (no SSH key needed)
aws ssm start-session --target <instance-id>
 
# Test API server connectivity
curl -k https://<api-server-endpoint>:443
# Should return: 403 Forbidden (not a timeout)

If you get a timeout, the security group or route table is blocking the connection.

Fix — Security Group Rules

The node security group needs:

Outbound:
  Port 443 (HTTPS) → API server security group or CIDR
  Port 10250 (kubelet) → within cluster
  All traffic → 0.0.0.0/0 (for NAT/internet access)

Inbound:
  Port 443 → from control plane
  Port 10250 → from control plane
  All traffic → within VPC CIDR
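
To confirm the egress rule actually exists, grep the security-group JSON for a port-443 rule. The JSON below is a trimmed, hypothetical sample of what aws ec2 describe-security-groups returns; substitute the real output:

```shell
# Hypothetical, trimmed describe-security-groups output (sample only)
cat > /tmp/sg.json <<'EOF'
{"SecurityGroups":[{"IpPermissionsEgress":[
  {"IpProtocol":"tcp","FromPort":443,"ToPort":443}]}]}
EOF

if grep -q '"FromPort": *443' /tmp/sg.json; then
  sg_check="egress 443 rule found"
else
  sg_check="no egress 443 rule - nodes cannot reach the API server"
fi
echo "$sg_check"
```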

In Terraform:

hcl
resource "aws_security_group_rule" "node_to_control_plane" {
  type                     = "egress"
  from_port                = 443
  to_port                  = 443
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.eks_control_plane.id
  security_group_id        = aws_security_group.eks_nodes.id
}

Cause 3 — Private Nodes Can't Reach API Server

If your nodes are in private subnets and the EKS endpoint is public-only:

bash
# Check endpoint access config
aws eks describe-cluster --name my-cluster \
  --query 'cluster.resourcesVpcConfig.{public:endpointPublicAccess,private:endpointPrivateAccess}'
# {"public": true, "private": false}

Nodes in private subnets need either:

  1. NAT Gateway to reach the public endpoint
  2. Private endpoint enabled
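
The decision between those two fixes can be scripted off the endpoint config. A minimal sketch, with the JSON hard-coded as a sample (in practice, capture it from the describe-cluster query shown above):

```shell
# Sample endpoint config; in practice:
#   cfg=$(aws eks describe-cluster --name my-cluster \
#     --query 'cluster.resourcesVpcConfig.{public:endpointPublicAccess,private:endpointPrivateAccess}')
cfg='{"public": true, "private": false}'

case "$cfg" in
  *'"private": false'*) decision="private endpoint disabled: enable it, or route nodes through a NAT Gateway" ;;
  *)                    decision="private endpoint enabled: check routing and security groups instead" ;;
esac
echo "$decision"
```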

Fix Option 1 — Enable Private Endpoint

bash
aws eks update-cluster-config \
  --name my-cluster \
  --resources-vpc-config endpointPrivateAccess=true,endpointPublicAccess=true

Fix Option 2 — Add NAT Gateway for Private Nodes

If your private subnets don't have a NAT Gateway route:

hcl
resource "aws_nat_gateway" "nat" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public.id
}
 
resource "aws_route" "private_nat_route" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.nat.id
}

Cause 4 — Kubelet Bootstrap Failure

The node joins but kubelet fails to start. Check the bootstrap log:

bash
# On the node (via SSM)
sudo journalctl -u kubelet --since "10 minutes ago"
sudo cat /var/log/cloud-init-output.log
sudo cat /var/log/user-data.log    # if using custom launch template

Common kubelet errors:

"Failed to get node info: nodes not found"

The node can reach the API server but can't register. Usually an IAM-policy or aws-auth mapping issue. See Cause 1.

"certificate signed by unknown authority"

x509: certificate signed by unknown authority

The node can't verify the API server certificate. Usually happens with:

  • Wrong cluster endpoint in bootstrap script
  • Self-signed certs in private clusters

Fix — check the --apiserver-endpoint and --b64-cluster-ca in your bootstrap or launch template user data.

"failed to reserve container name"

The Docker or containerd socket isn't ready. Usually means the AMI is wrong or the node rebooted before setup finished.
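
The three kubelet errors above can be triaged mechanically. A sketch that maps a log line to the likely cause; the sample line is hard-coded for illustration, and in practice you'd feed lines from journalctl -u kubelet:

```shell
# Sample kubelet log line (hard-coded for illustration)
line='x509: certificate signed by unknown authority'

case "$line" in
  *"nodes not found"*)             triage="registration failure: check IAM role and aws-auth (Cause 1)" ;;
  *"unknown authority"*)           triage="CA/endpoint mismatch: check --apiserver-endpoint and --b64-cluster-ca" ;;
  *"failed to reserve container"*) triage="container runtime not ready: check AMI and reboot timing" ;;
  *)                               triage="unclassified: read the full journalctl output" ;;
esac
echo "$triage"
```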


Cause 5 — Wrong AMI for EKS Version

Each EKS cluster version requires a matching EKS-optimized AMI. Using an AMI built for EKS 1.27 on a 1.30 cluster breaks bootstrap.

Fix — Use the Correct AMI

bash
# Get the correct AMI for your EKS version and region
aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.30/amazon-linux-2/recommended/image_id \
  --region ap-south-1 \
  --query Parameter.Value \
  --output text
# ami-0abc123def456
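
The SSM parameter path is predictable, so you can template it on the cluster version. A small helper; the function name is ours, not an AWS tool, and the path format follows AWS's documented convention for Amazon Linux 2 AMIs:

```shell
# Build the SSM parameter path for the EKS-optimized Amazon Linux 2 AMI
# (helper name is illustrative)
eks_ami_param() {
  echo "/aws/service/eks/optimized-ami/$1/amazon-linux-2/recommended/image_id"
}

eks_ami_param 1.30
# Then: aws ssm get-parameter --name "$(eks_ami_param 1.30)" \
#         --query Parameter.Value --output text
```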

In eksctl cluster.yaml:

yaml
nodeGroups:
  - name: workers
    instanceType: t3.medium
    amiFamily: AmazonLinux2    # Let eksctl pick the right AMI automatically

Cause 6 — Max Pods Exceeded

Nodes join but refuse to schedule pods. Check:

bash
kubectl describe node <node-name> | grep "max-pods"
kubectl describe node <node-name> | grep -A5 "Allocatable"

AWS VPC CNI limits pods per node based on instance type (secondary IPs per ENI). A t3.medium supports only 17 pods.
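
That limit comes from ENI arithmetic: max pods = ENIs x (IPv4 addresses per ENI - 1) + 2, where one IP per ENI is the primary address and the +2 accounts for the host-networked daemonset pods. A quick check of the numbers used in this section:

```shell
# max pods without prefix delegation: enis * (ips_per_eni - 1) + 2
max_pods() { echo $(( $1 * ($2 - 1) + 2 )); }

max_pods 3 6    # t3.medium: 3 ENIs x 6 IPs  -> 17
max_pods 4 15   # m5.xlarge: 4 ENIs x 15 IPs -> 58
```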

Fix — Use Prefix Delegation or Larger Instances

Enable prefix delegation (more IPs per ENI) for higher pod density:

bash
kubectl set env daemonset aws-node \
  -n kube-system \
  ENABLE_PREFIX_DELEGATION=true

Or use m5.xlarge which supports up to 58 pods.


Cause 7 — Clock Skew on Nodes

If the node's system clock is significantly off from the API server's, certificate validation fails:

x509: certificate has expired or is not yet valid

Fix

The EKS-optimized AMI syncs time via chrony. If you're using a custom AMI:

bash
# On the node
sudo timedatectl status
sudo chronyc sources
sudo chronyc tracking
 
# Force sync
sudo chronyc makestep
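
If you want to automate the check, compare the node's epoch time against a trusted reference and alert beyond a threshold. A minimal sketch; the 2-second offset is a hard-coded sample, and in practice the reference would come from chronyc tracking or an NTP server:

```shell
# Flag clock skew beyond a threshold (TLS tolerates only small skew)
node_epoch=$(date +%s)
ref_epoch=$(( node_epoch + 2 ))   # illustrative 2s offset; use a real NTP reference
skew=$(( ref_epoch - node_epoch ))
abs_skew=${skew#-}

if [ "$abs_skew" -le 300 ]; then
  skew_msg="skew ${skew}s: OK"
else
  skew_msg="skew ${skew}s: run chronyc makestep"
fi
echo "$skew_msg"
```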

Step-by-Step Debugging Checklist

bash
# 1. Does the node appear?
kubectl get nodes
 
# 2. Check node group status
aws eks describe-nodegroup --cluster-name <name> --nodegroup-name <name>
 
# 3. Check EC2 instance state
aws ec2 describe-instance-status --instance-ids <id>
 
# 4. SSH/SSM into node and check kubelet
sudo systemctl status kubelet
sudo journalctl -u kubelet -n 100
 
# 5. Check API server connectivity from node
curl -k https://<api-endpoint>:443
 
# 6. Verify IAM role policies
aws iam list-attached-role-policies --role-name <node-role>
 
# 7. Check aws-auth ConfigMap
kubectl get configmap aws-auth -n kube-system -o yaml
 
# 8. Check security groups allow port 443 outbound
aws ec2 describe-security-groups --group-ids <sg-id>

Summary

| Cause | Symptom | Fix |
|---|---|---|
| Missing IAM policies | Nodes don't appear | Attach 3 required policies + update aws-auth |
| Wrong security group | Nodes don't appear | Allow port 443 outbound to API server |
| Private nodes, no NAT | Nodes don't appear | Enable private endpoint or add NAT Gateway |
| Kubelet bootstrap error | Node joins but NotReady | Check /var/log/user-data.log + journalctl |
| Wrong AMI | Bootstrap failure | Use SSM to find the correct EKS-optimized AMI |
| Max pods exceeded | Pods won't schedule | Enable prefix delegation or use a larger instance |
| Clock skew | Certificate errors | Sync time with chrony |

IAM and networking (security groups + VPC routing) cause 90% of node join failures. Start there.

Build and test your EKS setup on DigitalOcean Kubernetes with $200 free credit, or practice EKS troubleshooting on KodeKloud hands-on labs.
