
AWS ECS Service Discovery Not Working — Every Fix (2026)

Your ECS services can't find each other. Service Connect or Cloud Map DNS isn't resolving. Here's every cause — wrong namespace, missing IAM, wrong DNS config, VPC resolver issues — and exactly how to fix each one.

DevOpsBoys | May 22, 2026 | 4 min read

ECS services need to talk to each other. When service discovery breaks, whether you're using Service Connect, Cloud Map, or plain DNS, requests between services fail and the error messages rarely point to the real cause.

Here's every cause and the exact fix.


Which Service Discovery Are You Using?

First, identify your setup:

  1. ECS Service Connect — Newer AWS feature (2022+). Uses Envoy sidecar proxy. Configured in service definition.
  2. AWS Cloud Map + ECS — Cloud Map registers service instances. Services discover via DNS (myservice.mynamespace).
  3. Internal ALB — Services communicate through an Application Load Balancer.
  4. Plain EC2/Task IP — Services hardcode IPs or use environment variables.

Service Connect is the newer option and is built on top of Cloud Map namespaces. Cloud Map DNS service discovery is the older approach and is still supported.


Service Connect Issues

Error: "Connection refused" or "No such host"

Diagnosis:

bash
# Check if Service Connect proxy is running in the task
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks <task-arn> \
  --query 'tasks[0].containers[*].{name:name,status:lastStatus}'

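You should see your app container AND a Service Connect proxy container whose name starts with ecs-service-connect (for example ecs-service-connect-agent; the exact suffix can vary).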
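
The check above can be wrapped in a small helper so it is easy to repeat across tasks. This is a minimal sketch, not an official tool: the function name `has_service_connect_proxy` is made up for illustration, and it assumes the proxy container's name starts with `ecs-service-connect`.

```shell
# Sketch: report whether a task is running a Service Connect proxy container.
# Assumption: the proxy container's name starts with "ecs-service-connect".
has_service_connect_proxy() {
  # $1 = cluster name, $2 = task ARN or task ID
  names=$(aws ecs describe-tasks \
    --cluster "$1" \
    --tasks "$2" \
    --query 'tasks[0].containers[].name' \
    --output text) || return 2
  if printf '%s\n' "$names" | grep -q 'ecs-service-connect'; then
    echo "proxy present"
  else
    echo "proxy missing"
    return 1
  fi
}
```

Usage: `has_service_connect_proxy my-cluster <task-arn>`. A "proxy missing" result suggests the service was deployed without Service Connect enabled.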

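Cause 1: No default Service Connect namespace on the cluster

bash
# Check cluster Service Connect default
aws ecs describe-clusters \
  --clusters my-cluster \
  --query 'clusters[0].serviceConnectDefaults'

Fix: set a default namespace on the cluster. This only sets the default; each service must still have Service Connect enabled in its own serviceConnectConfiguration: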

bash
aws ecs update-cluster \
  --cluster my-cluster \
  --service-connect-defaults namespace=my-namespace

Cause 2: Service name mismatch

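The portName must exactly match the name of a port mapping in the server's task definition, and clients must connect using the dnsName and port from clientAliases:

json
// Server service definition
"serviceConnectConfiguration": {
  "enabled": true,
  "namespace": "my-namespace",
  "services": [{
    "portName": "api",          // ← Must match a portMappings name in the task definition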
    "clientAliases": [{
      "port": 8080,
      "dnsName": "api-service"  // ← DNS name clients use
    }]
  }]
}

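The client calls http://api-service:8080. If the dnsName or port doesn't match, the call fails. The client service must also have Service Connect enabled in the same namespace, and client tasks that were started before the server service was deployed need a redeploy before they can resolve the new endpoint.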
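
To compare what clients are calling against what the server actually exposes, it can help to list the server's client aliases. The helper below is a sketch under assumptions: `list_client_aliases` is a made-up name, and it assumes the deployment you care about is the first one returned by describe-services.

```shell
# Sketch: print each Service Connect client alias of a service as "dnsName:port".
# Assumption: the deployment of interest is listed first in the response.
list_client_aliases() {
  # $1 = cluster name, $2 = service name
  aws ecs describe-services \
    --cluster "$1" \
    --services "$2" \
    --query 'services[0].deployments[0].serviceConnectConfiguration.services[].clientAliases[].[dnsName,port]' \
    --output text |
    awk -F '\t' 'NF >= 2 { print $1 ":" $2 }'
}
```

Usage: `list_client_aliases my-cluster my-service` prints one line per alias, for example `api-service:8080` for the configuration shown above. No output means no client aliases were found for that deployment.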

Cause 3: Missing namespace

bash
# Check Cloud Map namespace exists
aws servicediscovery list-namespaces \
  --query 'Namespaces[*].{Name:Name,Id:Id,Type:Type}'

If the namespace doesn't exist, Service Connect can't register services:

bash
aws servicediscovery create-private-dns-namespace \
  --name my-namespace \
  --vpc vpc-xxxxxxxx

Error: Service Connect Agent Failing to Start

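bash
# Check agent logs (only available if logConfiguration is set in
# serviceConnectConfiguration; the stream name depends on your awslogs-stream-prefix)
aws logs get-log-events \
  --log-group-name /ecs/my-service \
  --log-stream-name ecs-service-connect-agent/<task-id>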

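Common cause: IAM permissions

Registering instances in Cloud Map requires the actions below. ECS normally performs them through its service-linked role (AWSServiceRoleForECS), so first check that this role exists. Any custom role used for registration needs the same permissions: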

json
{
  "Effect": "Allow",
  "Action": [
    "servicediscovery:RegisterInstance",
    "servicediscovery:DeregisterInstance",
    "servicediscovery:DiscoverInstances",
    "servicediscovery:Get*",
    "servicediscovery:List*",
    "route53:GetHealthCheck",
    "route53:CreateHealthCheck",
    "route53:UpdateHealthCheck",
    "route53:DeleteHealthCheck",
    "route53:ChangeResourceRecordSets"
  ],
  "Resource": "*"
}

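AWS provides a managed policy for the execution role, AmazonECSTaskExecutionRolePolicy. It covers pulling images from ECR and writing to CloudWatch Logs, but it does not include the servicediscovery actions above, so make sure it's attached to your execution role and verify the registration permissions separately.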
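
Whether the managed policy is attached can be checked from the CLI. A minimal sketch: `has_execution_policy` is a made-up helper name and `ecsTaskExecutionRole` in the usage line is only a conventional example role name. The check covers attached managed policies only, not inline policies.

```shell
# Sketch: check that a role has the AmazonECSTaskExecutionRolePolicy
# managed policy attached (inline policies are not inspected).
has_execution_policy() {
  # $1 = IAM role name
  attached=$(aws iam list-attached-role-policies \
    --role-name "$1" \
    --query 'AttachedPolicies[].PolicyName' \
    --output text) || return 2
  for policy in $attached; do
    if [ "$policy" = "AmazonECSTaskExecutionRolePolicy" ]; then
      echo "attached"
      return 0
    fi
  done
  echo "not attached"
  return 1
}
```

Usage: `has_execution_policy ecsTaskExecutionRole`.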


Cloud Map DNS Issues

Error: DNS Name Not Resolving

bash
# Test from inside the VPC (e.g., from a bastion or debug container)
nslookup myservice.my-namespace
# or
dig myservice.my-namespace

Cause 1: Wrong DNS suffix

For Cloud Map private DNS namespaces, the full DNS name is: <service-name>.<namespace-name>

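The namespace type also determines where that name resolves:

  • Private DNS namespace → resolves only inside the VPC associated with the namespace
  • Public DNS namespace → resolves publicly (not for internal comms)
  • HTTP namespace → creates no DNS records; instances are discoverable only through the DiscoverInstances API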
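
The naming rule above can be turned into a quick resolution check to run from inside the VPC. This is a sketch: both function names are made up, and it assumes `nslookup` is installed and exits non-zero when a name cannot be resolved, which common implementations do.

```shell
# Sketch: build the Cloud Map DNS name (<service-name>.<namespace-name>)
# and check whether it resolves from the current host.
cloudmap_dns_name() {
  # $1 = service name, $2 = namespace name
  if [ -z "${1:-}" ] || [ -z "${2:-}" ]; then
    echo "usage: cloudmap_dns_name <service> <namespace>" >&2
    return 2
  fi
  printf '%s.%s\n' "$1" "$2"
}

check_cloudmap_dns() {
  # $1 = service name, $2 = namespace name
  name=$(cloudmap_dns_name "${1:-}" "${2:-}") || return 2
  # Assumption: nslookup exits non-zero when the lookup fails.
  if nslookup "$name" >/dev/null 2>&1; then
    echo "$name resolves"
  else
    echo "$name does NOT resolve"
    return 1
  fi
}
```

Usage: `check_cloudmap_dns myservice my-namespace`. A failure from a host outside the namespace's VPC is expected for private DNS namespaces.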

Cause 2: VPC DNS resolution not enabled

bash
# Check VPC DNS settings (these attributes are not returned by describe-vpcs)
aws ec2 describe-vpc-attribute \
  --vpc-id vpc-xxxxxxxx \
  --attribute enableDnsSupport
aws ec2 describe-vpc-attribute \
  --vpc-id vpc-xxxxxxxx \
  --attribute enableDnsHostnames

Both must be true. Fix:

bash
aws ec2 modify-vpc-attribute --vpc-id vpc-xxxxxxxx --enable-dns-support '{"Value":true}'
aws ec2 modify-vpc-attribute --vpc-id vpc-xxxxxxxx --enable-dns-hostnames '{"Value":true}'

Cause 3: Security group blocking DNS

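DNS uses port 53 (UDP and TCP). Security groups do not filter traffic to the Amazon-provided VPC resolver, so this cause applies only when your tasks use custom DNS servers (for example, through a DHCP options set). In that case, a security group that blocks outbound port 53 breaks resolution: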

bash
# Check security group outbound rules
aws ec2 describe-security-groups \
  --group-ids sg-xxxxxxxx \
  --query 'SecurityGroups[0].IpPermissionsEgress'

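Add outbound rules for DNS (both UDP and TCP):

bash
aws ec2 authorize-security-group-egress \
  --group-id sg-xxxxxxxx \
  --protocol udp \
  --port 53 \
  --cidr 0.0.0.0/0
aws ec2 authorize-security-group-egress \
  --group-id sg-xxxxxxxx \
  --protocol tcp \
  --port 53 \
  --cidr 0.0.0.0/0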

Error: Service Registered but Not Resolving

bash
# Check Cloud Map service instances
aws servicediscovery discover-instances \
  --namespace-name my-namespace \
  --service-name my-service \
  --max-results 10
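
When many instances are registered, a per-status count is easier to read than the raw listing. The sketch below uses a made-up helper name, `instance_health_summary`, and passes `--health-status ALL` so that unhealthy instances are included in the count.

```shell
# Sketch: count Cloud Map instances of one service by health status.
instance_health_summary() {
  # $1 = namespace name, $2 = service name
  aws servicediscovery discover-instances \
    --namespace-name "$1" \
    --service-name "$2" \
    --health-status ALL \
    --query 'Instances[].[InstanceId,HealthStatus]' \
    --output text |
    awk -F '\t' '
      NF >= 2 { count[$2]++; total++ }
      END {
        printf "total=%d healthy=%d unhealthy=%d unknown=%d\n",
          total, count["HEALTHY"], count["UNHEALTHY"], count["UNKNOWN"]
      }'
}
```

Usage: `instance_health_summary my-namespace my-service`. A total of 0 points to a registration problem; a non-zero total with healthy=0 points to health checks.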

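Cause: Health check failing

Cloud Map stops returning instances whose health checks fail. Route 53 health checks are used only with public DNS namespaces; services in private namespaces rely on custom health checks reported by ECS instead. To inspect Route 53 health checks:

bash
# List health checks and their type
aws route53 list-health-checks \
  --query 'HealthChecks[*].{Id:Id,Type:HealthCheckConfig.Type}'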
 
aws route53 get-health-check-status \
  --health-check-id <id>

Fix: configure the correct health check port and path, or for internal services use a custom health check, which lets ECS report task health instead of Route 53:

json
"HealthCheckCustomConfig": {
  "FailureThreshold": 1
}

Task Can't Reach Another Task

Cause: Security Group Rules

ECS tasks communicate over their task IP (awsvpc mode). The source task's security group must be allowed in the destination task's security group:

bash
# Task A tries to reach Task B on port 8080
# Task B's security group must allow inbound from Task A's SG
 
aws ec2 authorize-security-group-ingress \
  --group-id <task-b-sg> \
  --protocol tcp \
  --port 8080 \
  --source-group <task-a-sg>

bash
# Verify security group rules
aws ec2 describe-security-groups \
  --group-ids <task-b-sg> \
  --query 'SecurityGroups[0].IpPermissions'
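
Reading the raw rule list by eye gets error-prone once a group has many rules. The helper below is a rough sketch of automating the check: `sg_allows_from` is a made-up name, it only considers rules that reference the source security group directly (not CIDR or prefix-list rules), and it treats protocol `-1` as "all traffic".

```shell
# Sketch: succeed (exit 0) if the destination SG has an inbound rule that
# allows the source SG on the given TCP port; exit 1 otherwise.
# Only security-group-referencing rules are checked, not CIDR rules.
sg_allows_from() {
  # $1 = destination SG id, $2 = source SG id, $3 = TCP port
  aws ec2 describe-security-groups \
    --group-ids "$1" \
    --query "SecurityGroups[0].IpPermissions[].[IpProtocol,FromPort,ToPort,join(',',UserIdGroupPairs[].GroupId)]" \
    --output text |
    awk -F '\t' -v src="$2" -v port="$3" '
      {
        # Columns: protocol, from-port, to-port, comma-separated source SG ids
        proto_ok = ($1 == "tcp" || $1 == "-1")
        port_ok = ($1 == "-1") || ($2 + 0 <= port + 0 && port + 0 <= $3 + 0)
        n = split($4, groups, ",")
        for (i = 1; i <= n; i++)
          if (proto_ok && port_ok && groups[i] == src) found = 1
      }
      END { exit(found ? 0 : 1) }'
}
```

Usage: `sg_allows_from <task-b-sg> <task-a-sg> 8080 && echo "allowed" || echo "blocked"`.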

Cause: Bridge Networking Mode

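If tasks use bridge networking (not awsvpc), they share the EC2 host's network and port mapping conflicts can occur. Cloud Map DNS discovery in bridge mode also supports only SRV records (A records require awsvpc), so clients that look up a plain A record get no answer. Prefer awsvpc mode for isolation and proper service discovery.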


Debugging Checklist

bash
# 1. Is the namespace correct?
aws servicediscovery list-namespaces
 
# 2. Is the service registered?
aws servicediscovery list-services \
  --filters Name=NAMESPACE_ID,Values=<namespace-id>,Condition=EQ
 
# 3. Are instances healthy?
aws servicediscovery discover-instances \
  --namespace-name my-namespace \
  --service-name my-service
 
# 4. Can the task resolve DNS? (exec into the task; requires ECS Exec to be enabled on the service)
aws ecs execute-command \
  --cluster my-cluster \
  --task <task-arn> \
  --container my-container \
  --interactive \
  --command "/bin/sh"
# Then inside: nslookup my-service.my-namespace
 
# 5. Is the port correct? (Service Connect config is reported per deployment)
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[0].deployments[0].serviceConnectConfiguration'

Related: AWS EKS Pods Stuck Pending Fix | AWS VPC Networking Guide
