
AWS ECS Task Keeps Stopping — How to Fix It (2026)

Your ECS task starts and then immediately stops or keeps restarting. Here's every reason this happens and how to debug and fix it.

DevOpsBoys · Apr 5, 2026 · 3 min read

You deploy to ECS and your task immediately goes into STOPPED state. Or it runs for 30 seconds and then dies. The service keeps trying to restart it, hitting the circuit breaker, and eventually gives up.

Here's how to find and fix the root cause.


Step 1: Find the Stop Reason

The most important first step is to check why the task stopped:

bash
# AWS CLI
aws ecs describe-tasks \
  --cluster your-cluster \
  --tasks <task-arn> \
  --query 'tasks[0].{status:lastStatus,stopCode:stopCode,reason:stoppedReason}'

Or in the console: ECS → Cluster → Service → Tasks → Stopped tab → click the task → check Stopped reason.

Common stop reasons:

  • Essential container in task exited — your app container crashed
  • Task failed ELB health checks — health check failing
  • CannotPullContainerError — image pull failed
  • OutOfMemoryError: Container killed — OOM
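If you find yourself triaging these often, the stop reason string can be bucketed into a likely cause with a small script. This is a sketch: the `classify_stop_reason` helper is hypothetical, and the patterns it matches are just the messages listed above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map an ECS stoppedReason string to a likely root cause.
classify_stop_reason() {
  case "$1" in
    CannotPullContainerError*)              echo "image-pull" ;;
    OutOfMemoryError*)                      echo "oom" ;;
    "Essential container in task exited"*)  echo "app-crash" ;;
    "Task failed ELB health checks"*)       echo "health-check" ;;
    *)                                      echo "other" ;;
  esac
}

classify_stop_reason "OutOfMemoryError: Container killed"   # prints "oom"
```

Feed it the `stoppedReason` from `describe-tasks` to jump straight to the right section below.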

Cause 1: Application Crash on Startup

Most common cause. The container starts but the app crashes immediately.

Check container logs:

bash
# If using CloudWatch Logs
aws logs get-log-events \
  --log-group-name /ecs/your-service \
  --log-stream-name <log-stream-id> \
  --limit 50
 
# Or in console: ECS → Task → Logs tab

Look for: stack traces, missing environment variables, database connection failures, port conflicts.

Fix: Reproduce locally first:

bash
docker run --rm \
  -e DATABASE_URL=your-db-url \
  -e PORT=8080 \
  your-image:tag

If it crashes locally too — fix the app, not ECS.
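When the container does exit locally, the exit code narrows things down. A quick decoder (the `explain_exit` helper is an illustration, not an ECS or Docker tool; the 128 + signal convention is standard):

```bash
#!/usr/bin/env bash
# Map a container exit code to its conventional meaning (128 + signal number).
explain_exit() {
  case "$1" in
    0)   echo "clean exit" ;;
    137) echo "SIGKILL (128 + 9) - often the OOM killer" ;;
    139) echo "SIGSEGV (128 + 11) - segfault" ;;
    143) echo "SIGTERM (128 + 15) - graceful stop requested" ;;
    *)   echo "application error (code $1)" ;;
  esac
}

explain_exit 137   # prints "SIGKILL (128 + 9) - often the OOM killer"
```

Usage: `docker run --rm your-image:tag; explain_exit $?` right after the container dies.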


Cause 2: Missing or Wrong Environment Variables

App starts fine locally but fails in ECS because a required env var is missing.

bash
# Check task definition env vars
aws ecs describe-task-definition \
  --task-definition your-task-def \
  --query 'taskDefinition.containerDefinitions[0].environment'

Fix: Add the missing variables in your task definition. For secrets, use AWS Secrets Manager:

json
"secrets": [
  {
    "name": "DATABASE_PASSWORD",
    "valueFrom": "arn:aws:secretsmanager:us-east-1:123:secret:prod/db-password"
  }
]
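Note that when a task definition references a secret, the task execution role also needs permission to read it, or the task fails to start with a `ResourceInitializationError`. A sketch of the policy statement (the secret ARN is a placeholder matching the example above):

```json
{
  "Effect": "Allow",
  "Action": ["secretsmanager:GetSecretValue"],
  "Resource": "arn:aws:secretsmanager:us-east-1:123:secret:prod/db-password*"
}
```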

Cause 3: Health Check Failing

ECS or your load balancer is killing the task because it's not passing health checks.

bash
# Check target group health
aws elbv2 describe-target-health \
  --target-group-arn your-tg-arn

Fix the ECS health check (in task definition):

json
"healthCheck": {
  "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
  "interval": 30,
  "timeout": 10,
  "retries": 3,
  "startPeriod": 60
}

startPeriod is critical: it gives slow-starting apps a grace period before failed health checks count against them. There is no grace period by default, so an app that isn't serving its health endpoint when checks begin can be killed before it finishes booting.
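With the settings above you can estimate the worst case before ECS marks the container unhealthy, assuming the grace period elapses and then every check fails consecutively:

```bash
# Values from the healthCheck example above
startPeriod=60; interval=30; retries=3

# Rough worst case: grace period, then `retries` consecutive failed checks
# spaced `interval` seconds apart.
echo $(( startPeriod + interval * retries ))   # prints 150
```

If your app reliably takes longer than that to become healthy, raise startPeriod rather than retries.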


Cause 4: OOM — Container Killed

The task is using more memory than its limit.

bash
aws ecs describe-tasks \
  --cluster your-cluster \
  --tasks <task-arn> \
  --query 'tasks[0].stoppedReason'
# "OutOfMemoryError: Container killed"

Fix: Increase memory in task definition or reduce app memory usage:

json
{
  "memory": 1024,
  "memoryReservation": 512
}
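If the app's runtime manages its own heap, cap it below the container limit too; otherwise the runtime can grow past the hard `memory` limit and get killed anyway. A sketch for a Node.js app (the `NODE_OPTIONS` value is an example; a common rule of thumb is roughly 75% of the hard limit):

```json
"environment": [
  { "name": "NODE_OPTIONS", "value": "--max-old-space-size=768" }
]
```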

Add CloudWatch Container Insights to track memory usage over time.


Cause 5: Image Pull Failure

ECS can't pull the Docker image from ECR or Docker Hub.

bash
# Example stoppedReason for a pull failure:
# "CannotPullContainerError: ref pull has been retried 5 time(s)"

Common causes:

  • ECR repository doesn't exist
  • Task IAM role doesn't have ECR pull permissions
  • Image tag doesn't exist
  • ECR login expired (for cross-account)

Fix for ECR permissions — add to your task execution role:

json
{
  "Effect": "Allow",
  "Action": [
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage"
  ],
  "Resource": "*"
}

Cause 6: Port Conflict

If you're using host network mode, another task might already be using the port.

Fix: Use awsvpc network mode (required on Fargate, recommended for ECS on EC2):

json
"networkMode": "awsvpc"

Each task gets its own ENI and private IP. No port conflicts possible.
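With awsvpc, the service (or `run-task` call) must also supply a network configuration so ECS knows where to attach the ENI. A minimal sketch; the subnet and security group IDs are placeholders:

```json
"networkConfiguration": {
  "awsvpcConfiguration": {
    "subnets": ["subnet-0abc123"],
    "securityGroups": ["sg-0abc123"],
    "assignPublicIp": "DISABLED"
  }
}
```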


Enable CloudWatch Container Insights

Turn this on — it gives you CPU, memory, network metrics per task:

bash
aws ecs update-cluster-settings \
  --cluster your-cluster \
  --settings name=containerInsights,value=enabled

Quick Debug Checklist

| Symptom | Check | Fix |
| --- | --- | --- |
| Immediate exit | Container logs | Fix app crash |
| Missing env var | Task def env section | Add variable |
| Health check fail | Target group health | Fix endpoint + startPeriod |
| OOM killed | Stopped reason | Increase memory limit |
| Can't pull image | ECR permissions | Fix task execution role IAM |
| Port in use | Network mode | Switch to awsvpc |