AWS EKS Cluster Autoscaler Not Scaling — Every Fix (2026)

Your EKS Cluster Autoscaler isn't scaling up, scale-down isn't working, or nodes spin up but stay empty. Here's every cause and the exact fix.

DevOpsBoys · May 26, 2026 · 5 min read

Cluster Autoscaler on EKS feels like it should just work, but misconfigured IAM, wrong ASG tags, or pod annotations can silently prevent scaling for hours. The sections below go through each symptom in turn.


How EKS Cluster Autoscaler Works

The Cluster Autoscaler (CA) watches for:

  • Pending pods — pods that can't be scheduled because no node has enough resources
  • Underutilized nodes — nodes whose pods could all be rescheduled onto other existing nodes

It communicates with AWS Auto Scaling Groups (ASGs) to add or remove nodes.

The full flow for scale-up:

Pod pending → CA detects → selects ASG → increases desired count → ASG launches EC2 → kubelet joins → pod scheduled
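
The "selects ASG" step in that flow can be pictured with a small sketch. This is a deliberately simplified model, not CA's actual algorithm: the real autoscaler simulates scheduling and chooses between candidate groups with a configurable expander. The `NodeGroup` class, its fields, and the sample sizes are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeGroup:
    """Hypothetical, simplified view of one ASG-backed node group."""
    name: str
    cpu_millicores: int  # allocatable CPU of a single node in this group
    memory_mib: int      # allocatable memory of a single node in this group
    desired: int         # current ASG desired capacity
    max_size: int        # ASG max size


def pick_node_group(groups: list[NodeGroup], pod_cpu: int, pod_mem: int) -> Optional[NodeGroup]:
    """Return the first group whose node type fits the pod and that can still grow."""
    for group in groups:
        fits = pod_cpu <= group.cpu_millicores and pod_mem <= group.memory_mib
        has_headroom = group.desired < group.max_size
        if fits and has_headroom:
            return group
    return None


groups = [
    NodeGroup("small", cpu_millicores=1900, memory_mib=3300, desired=3, max_size=3),
    NodeGroup("large", cpu_millicores=7900, memory_mib=30000, desired=1, max_size=5),
]

# A pending pod requesting 2 CPU / 4 GiB does not fit on "small" nodes.
chosen = pick_node_group(groups, pod_cpu=2000, pod_mem=4096)
print(chosen.name if chosen else "no candidates for scale up")
```

If no group both fits the pod and has room below its max size, nothing is scaled, which is the situation most of the checks below are trying to explain.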

Symptom 1: Pods Pending but No Scale-Up

Check 1: Is CA Running?

bash
kubectl get pods -n kube-system | grep cluster-autoscaler
 
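# Check CA logs
# (if the selector matches nothing, your install may label the pod differently;
#  check with: kubectl get pods -n kube-system --show-labels | grep cluster-autoscaler)
kubectl logs -n kube-system \
  -l app.kubernetes.io/name=cluster-autoscaler \
  --tail=100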

Look for lines like:

I0526 No candidates for scale up
W0526 Failed to get node group size

Check 2: Pod Actually Pending?

CA only scales up for pods that are Pending because no existing node can fit them. It does not react to pods that are Pending for other reasons.

bash
# Check pod status
kubectl describe pod <pending-pod>
 
# Look for the Events section in the output:
# Events:
#   Warning  FailedScheduling  0/3 nodes available: 3 Insufficient cpu.
#     -> CA will scale up for this
#
#   Warning  FailedScheduling  pod has unbound PVC
#     -> CA will NOT scale up for this (storage issue, not resources)

CA scales for: insufficient CPU or memory, an unsatisfied node selector, or no node with a GPU (provided some node group could supply a matching node).

CA does NOT scale for: PVC binding failures, image pull errors, RBAC issues.
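
As a rough triage aid, the distinction can be sketched as a function over the event message. The substrings here are illustrative assumptions taken from the two example events in this section; they are not an exhaustive or official list of scheduler messages, and the function name is hypothetical.

```python
# Illustrative substrings only, based on the example events in this section.
RESOURCE_HINTS = ("insufficient cpu", "insufficient memory")
BLOCKED_HINTS = ("unbound",)  # e.g. "pod has unbound PVC"


def ca_can_help(message: str) -> str:
    """Classify a FailedScheduling message as 'likely', 'no', or 'unknown'."""
    lowered = message.lower()
    # A storage problem blocks scheduling even if resources are also short.
    if any(hint in lowered for hint in BLOCKED_HINTS):
        return "no"
    if any(hint in lowered for hint in RESOURCE_HINTS):
        return "likely"
    return "unknown"


print(ca_can_help("0/3 nodes available: 3 Insufficient cpu."))
print(ca_can_help("pod has unbound PVC"))
```

Anything that comes back as "unknown" still needs a manual read of the event.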


Check 3: ASG Tags Are Missing or Wrong

When it runs with auto-discovery (the --node-group-auto-discovery flag), CA finds ASGs by looking for specific AWS tags. If your node group's ASG doesn't have these tags, CA ignores it.

Required ASG tags:

k8s.io/cluster-autoscaler/<cluster-name>   = owned
k8s.io/cluster-autoscaler/enabled          = true

Check your ASG:

bash
# Get your node group ASG name
aws eks describe-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --query "nodegroup.resources.autoScalingGroups"
 
# Check tags on the ASG
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-cluster-nodegroup-xxx \
  --query "AutoScalingGroups[].Tags"
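
To avoid reading the tag list by eye, the JSON from the second command can be checked with a few lines of Python. This is a minimal sketch: it assumes the output is a list of tag lists (one per ASG), checks tag keys only, and uses a made-up sample payload and a hypothetical helper name.

```python
import json


def missing_ca_tags(asg_tags: list[dict], cluster_name: str) -> list[str]:
    """Return the auto-discovery tag keys that are absent from one ASG's tags."""
    required = {
        f"k8s.io/cluster-autoscaler/{cluster_name}",
        "k8s.io/cluster-autoscaler/enabled",
    }
    present = {tag["Key"] for tag in asg_tags}
    return sorted(required - present)


# Made-up sample shaped like the --query "AutoScalingGroups[].Tags" output.
raw = """
[[
  {"Key": "k8s.io/cluster-autoscaler/enabled", "Value": "true"},
  {"Key": "Name", "Value": "my-node"}
]]
"""

results = [missing_ca_tags(tags, "my-cluster") for tags in json.loads(raw)]
print(results)
```

An empty list for an ASG means both tag keys are present.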

Fix — add tags if missing:

bash
aws autoscaling create-or-update-tags \
  --tags \
    ResourceId=my-asg-name,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/my-cluster,Value=owned,PropagateAtLaunch=false \
    ResourceId=my-asg-name,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=false

Check 4: IAM Permissions Missing

CA needs permissions to describe and update ASGs.

bash
# Check CA's service account has IAM role
kubectl describe serviceaccount cluster-autoscaler -n kube-system
 
# Look for:
# Annotations:  eks.amazonaws.com/role-arn: arn:aws:iam::...

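If the annotation is missing, CA is not using an IRSA role. Unless its permissions come from somewhere else (for example the node instance role or EKS Pod Identity), its AWS API calls will be denied.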

Required IAM policy:

json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeScalingActivities",
        "autoscaling:DescribeTags",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "ec2:DescribeImages",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeLaunchTemplateVersions",
        "ec2:GetInstanceTypesFromInstanceRequirements",
        "eks:DescribeNodegroup"
      ],
      "Resource": "*"
    }
  ]
}
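
If you already have a policy attached and want to compare it against the list above, a small checker helps. This is a sketch under stated assumptions: it only looks at `Allow` statements and `Action` wildcards, and ignores `Deny`, `Resource`, and `Condition`, so treat a clean result as a hint rather than proof. The function name and the sample policy are hypothetical.

```python
from fnmatch import fnmatchcase

REQUIRED_ACTIONS = [
    "autoscaling:DescribeAutoScalingGroups",
    "autoscaling:DescribeAutoScalingInstances",
    "autoscaling:DescribeLaunchConfigurations",
    "autoscaling:DescribeScalingActivities",
    "autoscaling:DescribeTags",
    "autoscaling:SetDesiredCapacity",
    "autoscaling:TerminateInstanceInAutoScalingGroup",
    "ec2:DescribeImages",
    "ec2:DescribeInstanceTypes",
    "ec2:DescribeLaunchTemplateVersions",
    "ec2:GetInstanceTypesFromInstanceRequirements",
    "eks:DescribeNodegroup",
]


def missing_actions(policy: dict, required: list[str]) -> list[str]:
    """Return required actions not covered by any Allow statement in the policy."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]
    allowed: list[str] = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        allowed.extend([actions] if isinstance(actions, str) else actions)
    # IAM action names are case-insensitive and may use * and ? wildcards.
    return sorted(
        action for action in required
        if not any(fnmatchcase(action.lower(), pattern.lower()) for pattern in allowed)
    )


sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["autoscaling:*", "ec2:Describe*"], "Resource": "*"}
    ],
}
gaps = missing_actions(sample_policy, REQUIRED_ACTIONS)
print(gaps)
```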

Fix — attach policy and annotate service account:

bash
# Create IAM role with IRSA
eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace kube-system \
  --name cluster-autoscaler \
  --attach-policy-arn arn:aws:iam::123456789012:policy/ClusterAutoscalerPolicy \
  --override-existing-serviceaccounts \
  --approve

Check 5: CA Version Doesn't Match Cluster Version

CA version must match your Kubernetes minor version.

bash
# Check k8s version (read "Server Version"; the old --short flag was removed in kubectl 1.28)
kubectl version
 
# Check CA version
kubectl describe deployment cluster-autoscaler -n kube-system | grep Image
 
# CA 1.29 for k8s 1.29, CA 1.30 for k8s 1.30, etc.
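
The comparison itself is easy to script once you have both strings. The sketch below assumes the CA image ends in a `:vMAJOR.MINOR.PATCH` tag (it does not handle digest-pinned images), and the sample version strings and function names are made up for illustration.

```python
import re


def minor_version(version: str) -> tuple[int, int]:
    """Extract (major, minor) from strings like 'v1.30.4-eks-a737599'."""
    match = re.search(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"no version found in {version!r}")
    return int(match.group(1)), int(match.group(2))


def versions_match(server_version: str, ca_image: str) -> bool:
    """True when the CA image tag and the cluster share the same major.minor."""
    tag = ca_image.rsplit(":", 1)[-1]  # text after the last colon is the image tag
    return minor_version(server_version) == minor_version(tag)


image = "registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.3"
print(versions_match("v1.30.4-eks-a737599", image))
print(versions_match("v1.29.8-eks-a737599", image))
```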

Fix — update CA deployment:

bash
# Replace with correct version
kubectl set image deployment/cluster-autoscaler \
  cluster-autoscaler=registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.3 \
  -n kube-system

Find correct versions at: https://github.com/kubernetes/autoscaler/releases


Symptom 2: CA Scales Up but Nodes Stay NotReady or Empty

Cause: Node Labels Don't Match Pod nodeSelector

bash
# Check what labels new nodes get
kubectl get node -l eks.amazonaws.com/nodegroup=my-nodegroup --show-labels
 
# Check what your pod requires
kubectl describe pod <pending-pod> | grep -A5 "Node-Selectors"

If the pod requires node-type=gpu but nodes launch with node-type=standard, the pod won't schedule even after scale-up.
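
The matching rule is a plain key/value comparison, which this sketch makes explicit. It covers `nodeSelector` only, not node affinity expressions, and the function name and label values are illustrative.

```python
def unmet_selector_terms(node_labels: dict[str, str], node_selector: dict[str, str]) -> dict[str, str]:
    """Return the nodeSelector entries that the node's labels do not satisfy."""
    return {
        key: wanted
        for key, wanted in node_selector.items()
        if node_labels.get(key) != wanted
    }


node_labels = {
    "eks.amazonaws.com/nodegroup": "my-nodegroup",
    "node-type": "standard",
}
pod_selector = {"node-type": "gpu"}

unmet = unmet_selector_terms(node_labels, pod_selector)
print(unmet or "node satisfies the selector")
```

Every entry in the selector has to match; one wrong or missing label is enough to keep the pod Pending.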

Fix — add label to launch template:

bash
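# In your EKS node group launch template user data
# (Amazon Linux 2 bootstrap script), pass the label to the kubelet:
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--node-labels=node-type=gpu'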

Cause: Taints Not Matching Tolerations

bash
# Check node taints
kubectl describe node <new-node> | grep Taints
 
# Check pod tolerations
kubectl describe pod <pending-pod> | grep -A5 Tolerations

If the node has a taint the pod doesn't tolerate, the pod won't schedule.
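
For reference, here is a minimal sketch of how a toleration is matched against a taint. It covers the `Equal` and `Exists` operators and treats only `NoSchedule` and `NoExecute` as blocking; the dictionaries mirror the pod and node spec fields, while the function names and sample taint are hypothetical.

```python
def tolerates(taint: dict, toleration: dict) -> bool:
    """True when a single toleration matches a single taint."""
    # An empty toleration effect matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key")
    if not key:
        # An empty key is only valid with Exists and then matches every taint.
        return operator == "Exists"
    if key != taint.get("key"):
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def blocking_taints(taints: list[dict], tolerations: list[dict]) -> list[dict]:
    """Return the taints that would keep the pod off the node."""
    return [
        taint for taint in taints
        if taint.get("effect") in ("NoSchedule", "NoExecute")
        and not any(tolerates(taint, toleration) for toleration in tolerations)
    ]


node_taints = [{"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}]
print(blocking_taints(node_taints, []))
print(blocking_taints(node_taints, [{"key": "dedicated", "operator": "Exists"}]))
```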


Symptom 3: Scale-Down Not Working

Check: Pod Disruption Budgets (PDB)

bash
# Check PDBs
kubectl get pdb -A
 
# Check if any PDB is blocking drain
kubectl describe pdb my-pdb

A PDB with minAvailable: 1 on a workload with only 1 replica allows zero disruptions, so CA can never drain the node that pod runs on and scale-down is blocked permanently. Make sure your PDBs allow at least 1 pod to be disrupted.
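
The arithmetic behind that is short enough to sketch. This version handles integer `minAvailable` / `maxUnavailable` only (percentages are left out), and the function and parameter names are illustrative rather than taken from any API.

```python
from typing import Optional


def disruptions_allowed(
    current_healthy: int,
    expected_pods: int,
    min_available: Optional[int] = None,
    max_unavailable: Optional[int] = None,
) -> int:
    """How many pods may be evicted right now without violating the PDB."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = max(expected_pods - max_unavailable, 0)
    return max(current_healthy - desired_healthy, 0)


# One replica with minAvailable: 1 can never be evicted.
print(disruptions_allowed(current_healthy=1, expected_pods=1, min_available=1))
# Three replicas with minAvailable: 2 leave room for one eviction.
print(disruptions_allowed(current_healthy=3, expected_pods=3, min_available=2))
```

`kubectl get pdb -A` shows the same number in its ALLOWED DISRUPTIONS column; a PDB sitting at 0 there is the one to look at.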


Check: Annotations Preventing Eviction

Some pods have annotations that tell CA not to evict them:

yaml
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

Find pods that carry this annotation:

bash
# Find pods with safe-to-evict=false
kubectl get pods -A -o json | \
  python3 -c "
import sys, json
data = json.load(sys.stdin)
for item in data['items']:
    ann = item['metadata'].get('annotations', {})
    if ann.get('cluster-autoscaler.kubernetes.io/safe-to-evict') == 'false':
        print(item['metadata']['namespace'], item['metadata']['name'])
"

These pods keep nodes alive. Remove the annotation if eviction is safe.


Check: Scale-Down Delay

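CA waits before scaling down. The defaults are:

  • scale-down-delay-after-add: 10 minutes after a node was added before scale-down is considered again
  • scale-down-unneeded-time: 10 minutes a node must be unneeded before removal
  • scale-down-utilization-threshold: 0.5 (the node's requested CPU and memory, relative to allocatable, must both be below 50%; this is based on pod requests, not live usage)

This is by design. Wait 15 minutes after a node becomes idle before you treat it as a problem.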
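
How the three settings combine can be sketched as a single predicate. This is a simplified model under stated assumptions: it ignores PDBs, annotations, and every other scale-down blocker, and the function names and sample numbers are hypothetical.

```python
from datetime import datetime, timedelta


def node_utilization(cpu_requested: float, cpu_allocatable: float,
                     mem_requested: float, mem_allocatable: float) -> float:
    """Utilization from pod requests: the larger of the CPU and memory ratios."""
    return max(cpu_requested / cpu_allocatable, mem_requested / mem_allocatable)


def can_scale_down(now: datetime, last_scale_up: datetime, unneeded_since: datetime,
                   utilization: float,
                   delay_after_add: timedelta = timedelta(minutes=10),
                   unneeded_time: timedelta = timedelta(minutes=10),
                   threshold: float = 0.5) -> bool:
    """True when the node passes all three timing and utilization checks."""
    if utilization >= threshold:
        return False  # node is still considered busy
    if now - last_scale_up < delay_after_add:
        return False  # still in the cooldown after the last scale-up
    return now - unneeded_since >= unneeded_time


now = datetime(2026, 5, 26, 12, 0)
util = node_utilization(cpu_requested=500, cpu_allocatable=2000,
                        mem_requested=1024, mem_allocatable=8192)

print(util)
print(can_scale_down(now, now - timedelta(minutes=30), now - timedelta(minutes=12), util))
print(can_scale_down(now, now - timedelta(minutes=30), now - timedelta(minutes=5), util))
```

A node with low real CPU usage but large pod requests can sit above the threshold indefinitely, which is a common reason for nodes that look idle but never go away.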


Quick Diagnostics Script

bash
#!/bin/bash
echo "=== CA Pod Status ==="
kubectl get pods -n kube-system -l app.kubernetes.io/name=cluster-autoscaler
 
echo ""
echo "=== Pending Pods ==="
kubectl get pods -A --field-selector=status.phase=Pending
 
echo ""
echo "=== Node Capacity ==="
kubectl describe nodes | grep -A6 "Allocated resources"
 
echo ""
echo "=== CA Recent Logs ==="
kubectl logs -n kube-system \
  -l app.kubernetes.io/name=cluster-autoscaler \
  --tail=30 | grep -iE "scale|error|warn|failed"

Consider Karpenter Instead

If you're fighting CA consistently, consider migrating to Karpenter. It provisions nodes faster, handles Spot instances more flexibly, and doesn't require pre-defined ASG node groups.

Karpenter typically has a node ready in about 45 seconds versus 2-3 minutes with CA, and can provision any EC2 instance type that satisfies your pod requirements.

See: Karpenter vs Cluster Autoscaler

