The Problem
EKS pods were stuck in ImagePullBackOff. Running kubectl describe pod showed:
Failed to pull image "123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:latest":
rpc error: code = Unknown
desc = failed to pull and unpack image:
failed to resolve reference "...":
unexpected status code 401 Unauthorized
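To see how many pods are affected, the pod list can be filtered by status. A minimal sketch, assuming the default column layout of kubectl get pods -A --no-headers (namespace, name, ready, status, and so on); the function name filter_pull_failures is made up for illustration:

```shell
# Print namespace/name for every pod whose STATUS column reports an
# image pull failure. Reads `kubectl get pods -A --no-headers` on stdin.
filter_pull_failures() {
  awk '$4 ~ /ImagePullBackOff|ErrImagePull/ { print $1 "/" $2 }'
}
```

It would be used as: kubectl get pods -A --no-headers | filter_pull_failures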
What Happened
The ECR authentication token expired. ECR auth tokens are valid for only 12 hours. The nodes had authenticated at startup, but after 12 hours the token expired and new image pulls started failing.
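The 12-hour lifetime makes the failure window easy to work out. A small sketch of the arithmetic, assuming the time the token was issued is known as a Unix timestamp; the function name and its arguments are illustrative and not part of any AWS tooling:

```shell
# ECR authorization tokens are valid for 12 hours (43200 seconds).
ECR_TOKEN_LIFETIME=43200

# Print the seconds of validity left for a token issued at $1, as of time $2.
# Both arguments are Unix timestamps; a result of 0 or less means expired.
ecr_token_seconds_left() {
  issued=$1
  now=$2
  echo $(( ECR_TOKEN_LIFETIME - (now - issued) ))
}
```

For example: ecr_token_seconds_left "$issued" "$(date +%s)". The expiry of a freshly issued token is also reported in the expiresAt field returned by aws ecr get-authorization-token.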
The Fix
Option 1: Restart the node group (forces fresh token)
Not ideal, but works in an emergency:
# Force nodes to re-authenticate
aws ec2 reboot-instances --instance-ids i-xxxxx
Option 2: The proper fix (use the ECR credential helper)
On EKS, the nodes should automatically refresh ECR credentials if the node IAM role has the right permissions.
# Check node IAM role
aws eks describe-nodegroup \
--cluster-name my-cluster \
--nodegroup-name my-nodes \
--query 'nodegroup.nodeRole'
# The role needs this policy attached:
# AmazonEC2ContainerRegistryReadOnly
# Attach the policy if missing
aws iam attach-role-policy \
--role-name eks-node-role \
--policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
Option 3: For non-EKS clusters (create a refresh CronJob)
# Create/refresh the ECR secret
kubectl create secret docker-registry ecr-secret \
--docker-server=123456789.dkr.ecr.us-east-1.amazonaws.com \
--docker-username=AWS \
--docker-password="$(aws ecr get-login-password --region us-east-1)" \
--dry-run=client -o yaml | kubectl apply -f -
Set this as a CronJob that runs every 6 hours to keep the token fresh.
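One possible shape for that CronJob, as a sketch rather than a drop-in manifest. The image name, the ecr-refresher service account (which would need RBAC permission to create and update secrets), and the way the job obtains AWS credentials are all assumptions not covered in this post; the registry and region are the ones used above. The function only prints the manifest so it can be reviewed before applying:

```shell
# Print a CronJob manifest that recreates ecr-secret every 6 hours.
# HYPOTHETICAL: the image and service account names are placeholders.
render_ecr_refresh_cronjob() {
  cat <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ecr-secret-refresh
spec:
  schedule: "0 */6 * * *"   # every 6 hours, half the 12-hour token lifetime
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: ecr-refresher   # hypothetical; needs RBAC on secrets
          restartPolicy: OnFailure
          containers:
            - name: refresh
              image: example.com/aws-cli-kubectl:latest   # hypothetical image with aws and kubectl
              command: ["/bin/sh", "-c"]
              args:
                - |
                  kubectl create secret docker-registry ecr-secret \
                    --docker-server=123456789.dkr.ecr.us-east-1.amazonaws.com \
                    --docker-username=AWS \
                    --docker-password="$(aws ecr get-login-password --region us-east-1)" \
                    --dry-run=client -o yaml | kubectl apply -f -
EOF
}
```

After review it could be applied with: render_ecr_refresh_cronjob | kubectl apply -f -. Pods then reference ecr-secret through imagePullSecrets.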
Root Cause
EKS nodes with the correct IAM role automatically handle ECR auth refresh. The issue was that the node group had been created with a custom IAM role that was missing AmazonEC2ContainerRegistryReadOnly. Once the policy was attached, new nodes picked it up and the problem was solved permanently.
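To check another cluster for the same misconfiguration, the policies attached to the node role can be listed and searched. A sketch, assuming the role name eks-node-role used earlier; has_ecr_readonly is a hypothetical helper that reads policy names on stdin, and it only matches the managed policy by name, so a custom policy granting equivalent permissions would not be detected:

```shell
# Succeed if AmazonEC2ContainerRegistryReadOnly appears among the
# whitespace-separated policy names read from stdin.
has_ecr_readonly() {
  tr -s '[:space:]' '\n' | grep -qx 'AmazonEC2ContainerRegistryReadOnly'
}
```

It could be fed from the IAM API like this: aws iam list-attached-role-policies --role-name eks-node-role --query 'AttachedPolicies[].PolicyName' --output text | has_ecr_readonly && echo "policy attached" || echo "policy missing"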