AWS ECS Service Discovery Not Working — Every Fix (2026)
Your ECS services can't find each other. Service Connect or Cloud Map DNS isn't resolving. Here's every cause — wrong namespace, missing IAM, wrong DNS config, VPC resolver issues — and exactly how to fix each one.
ECS services need to talk to each other. When service discovery breaks — whether you're using Service Connect, Cloud Map, or plain DNS — nothing works and the error messages are often cryptic.
Here's every cause and the exact fix.
Which Service Discovery Are You Using?
First, identify your setup:
- ECS Service Connect: Newer AWS feature (2022+). Uses an Envoy sidecar proxy. Configured in the service definition.
- AWS Cloud Map + ECS: Cloud Map registers service instances. Services discover each other via DNS (myservice.mynamespace).
- Internal ALB: Services communicate through an Application Load Balancer.
- Plain EC2/Task IP: Services hardcode IPs or use environment variables.
Most modern ECS setups use Service Connect. Cloud Map is the previous approach.
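If you're not sure which of these a given service uses, its describe-services output usually tells you. The sketch below is one way to check, assuming the AWS CLI is installed and configured. The function name discovery_mode and the labels it prints are made up for this example, and it only looks at the first deployment in the list.

```shell
#!/bin/sh
# Sketch: guess which discovery mechanism an ECS service uses.
# discovery_mode and its output labels are illustrative, not AWS terms.
discovery_mode() {
  cluster="$1"
  service="$2"

  # Service Connect settings are reported on the service's deployments.
  sc=$(aws ecs describe-services --cluster "$cluster" --services "$service" \
    --query 'services[0].deployments[0].serviceConnectConfiguration.enabled' \
    --output text)
  # Cloud Map registrations appear under serviceRegistries.
  registry=$(aws ecs describe-services --cluster "$cluster" --services "$service" \
    --query 'services[0].serviceRegistries[0].registryArn' --output text)
  # Load balancer attachments appear under loadBalancers.
  lb=$(aws ecs describe-services --cluster "$cluster" --services "$service" \
    --query 'services[0].loadBalancers[0].targetGroupArn' --output text)

  # With --output text, a missing value prints as "None" and true as "True".
  if [ "$sc" = "True" ]; then
    echo "service-connect"
  elif [ -n "$registry" ] && [ "$registry" != "None" ]; then
    echo "cloud-map"
  elif [ -n "$lb" ] && [ "$lb" != "None" ]; then
    echo "load-balancer"
  else
    echo "none-detected"
  fi
}

# Usage: discovery_mode my-cluster my-service
```

A result of none-detected usually points at the last option in the list: hardcoded IPs or environment variables.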
Service Connect Issues
Error: "Connection refused" or "No such host"
Diagnosis:
# Check if Service Connect proxy is running in the task
aws ecs describe-tasks \
--cluster my-cluster \
--tasks <task-arn> \
--query 'tasks[0].containers[*].{name:name,status:lastStatus}'
You should see your app container and a Service Connect proxy container whose name starts with ecs-service-connect.
Cause 1: Service Connect not enabled on the cluster
# Check cluster Service Connect default
aws ecs describe-clusters \
--clusters my-cluster \
--query 'clusters[0].serviceConnectDefaults'
Fix (set a default Service Connect namespace on the cluster):
aws ecs update-cluster \
--cluster my-cluster \
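--service-connect-defaults namespace=my-namespace
The cluster default only supplies a namespace. Each service still needs "enabled": true in its own serviceConnectConfiguration.
Cause 2: Service name mismatch
The portName must match a named port mapping in the server's task definition, and the client must call the exact dnsName and port from the server's clientAliases: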
// Server service definition
"serviceConnectConfiguration": {
"enabled": true,
"namespace": "my-namespace",
"services": [{
"portName": "api", // ← This exact name
"clientAliases": [{
"port": 8080,
"dnsName": "api-service" // ← DNS name clients use
}]
}]
}
The client calls http://api-service:8080. If the dnsName or port doesn't match, the call fails.
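To compare what the server actually publishes with what the client is configured to call, you can list the aliases from the running service. This is a minimal sketch that assumes the Service Connect configuration is read from the first deployment returned by describe-services. The function names are hypothetical helpers, not part of the AWS CLI.

```shell
#!/bin/sh
# Print "dnsName:port" for every client alias a service publishes.
# Aliases without an explicit dnsName print as "None:<port>".
list_client_endpoints() {
  aws ecs describe-services --cluster "$1" --services "$2" \
    --query 'services[0].deployments[0].serviceConnectConfiguration.services[].clientAliases[].[dnsName,port]' \
    --output text | awk 'NF == 2 { print $1 ":" $2 }'
}

# Succeed only if the endpoint the client uses is one of the published ones.
endpoint_is_published() {
  list_client_endpoints "$1" "$2" | grep -qxF "$3"
}

# Usage:
#   list_client_endpoints my-cluster my-service
#   endpoint_is_published my-cluster my-service api-service:8080 || echo "mismatch"
```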
Cause 3: Missing namespace
# Check Cloud Map namespace exists
aws servicediscovery list-namespaces \
--query 'Namespaces[*].{Name:Name,Id:Id,Type:Type}'
If the namespace doesn't exist, Service Connect can't register services. Create it:
aws servicediscovery create-private-dns-namespace \
--name my-namespace \
--vpc vpc-xxxxxxxx
Error: Service Connect Agent Failing to Start
# Check agent logs
aws logs get-log-events \
--log-group-name /ecs/my-service \
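--log-stream-name ecs-service-connect-agent/<task-id>
The agent writes logs only when logConfiguration is set in the service's serviceConnectConfiguration, and the stream name depends on your awslogs-stream-prefix.
Common cause: IAM permissions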
The task execution role needs:
{
"Effect": "Allow",
"Action": [
"servicediscovery:RegisterInstance",
"servicediscovery:DeregisterInstance",
"servicediscovery:DiscoverInstances",
"servicediscovery:Get*",
"servicediscovery:List*",
"route53:GetHealthCheck",
"route53:CreateHealthCheck",
"route53:UpdateHealthCheck",
"route53:DeleteHealthCheck",
"route53:ChangeResourceRecordSets"
],
"Resource": "*"
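}
AWS also provides the managed policy AmazonECSTaskExecutionRolePolicy. Make sure it's attached to your execution role, but note that it covers image pulls and CloudWatch Logs delivery and does not include the Cloud Map or Route 53 actions above, so grant those separately if your role needs them.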
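A quick way to confirm what is attached to a role is to list its managed policies. The helper below is a sketch: role_has_policy is a hypothetical name, and ecsTaskExecutionRole in the usage line is only the conventional role name, so substitute your own.

```shell
#!/bin/sh
# Succeed if the named managed policy is attached to the role.
# Only attached managed policies are checked, not inline policies.
role_has_policy() {
  aws iam list-attached-role-policies --role-name "$1" \
    --query 'AttachedPolicies[].PolicyName' --output text |
    tr '\t' '\n' | grep -qxF "$2"
}

# Usage:
#   role_has_policy ecsTaskExecutionRole AmazonECSTaskExecutionRolePolicy \
#     || echo "managed policy not attached"
```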
Cloud Map DNS Issues
Error: DNS Name Not Resolving
# Test from inside the VPC (e.g., from a bastion or debug container)
nslookup myservice.my-namespace
# or
dig myservice.my-namespace
Cause 1: Wrong DNS suffix
For Cloud Map private DNS namespaces, the full DNS name is:
<service-name>.<namespace-name>
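The namespace type also determines where the name resolves:
- Private DNS namespace → resolves only inside the VPC associated with the namespace
- Public DNS namespace → resolves publicly (not for internal comms)
- HTTP namespace → no DNS records at all; instances are discoverable only through the DiscoverInstances API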
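Minimal container images often ship without nslookup or dig. As a rough alternative, the sketch below builds the expected name and resolves it through the system resolver. It assumes getent is present in the image, which is not guaranteed, and both function names are illustrative.

```shell
#!/bin/sh
# Build the name a private DNS namespace serves: <service>.<namespace>
cloudmap_fqdn() {
  printf '%s.%s\n' "$1" "$2"
}

# Resolve a name with the system resolver and print the addresses returned.
# Prints nothing (and getent exits non-zero) when the name does not resolve.
resolve_name() {
  getent hosts "$1" | awk '{ print $1 }'
}

# Usage (inside the task):
#   resolve_name "$(cloudmap_fqdn myservice my-namespace)"
```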
Cause 2: VPC DNS resolution not enabled
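# Check VPC DNS settings (describe-vpcs does not return these attributes)
aws ec2 describe-vpc-attribute --vpc-id vpc-xxxxxxxx \
--attribute enableDnsSupport --query 'EnableDnsSupport.Value'
aws ec2 describe-vpc-attribute --vpc-id vpc-xxxxxxxx \
--attribute enableDnsHostnames --query 'EnableDnsHostnames.Value'
Both must be true. Fix (each attribute needs its own call and an explicit value):
aws ec2 modify-vpc-attribute --vpc-id vpc-xxxxxxxx --enable-dns-support "{\"Value\":true}"
aws ec2 modify-vpc-attribute --vpc-id vpc-xxxxxxxx --enable-dns-hostnames "{\"Value\":true}"
Cause 3: Security group blocking DNS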
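DNS uses port 53 (UDP and TCP). Security groups don't filter traffic to the Amazon-provided VPC resolver, so this cause applies mainly when tasks use custom DNS servers (for example, through a DHCP options set). In that case, a task security group that blocks outbound port 53 breaks resolution: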
# Check security group outbound rules
aws ec2 describe-security-groups \
--group-ids sg-xxxxxxxx \
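--query 'SecurityGroups[0].IpPermissionsEgress'
Add outbound rules for DNS (UDP, plus TCP for large responses):
aws ec2 authorize-security-group-egress \
--group-id sg-xxxxxxxx \
--protocol udp \
--port 53 \
--cidr 0.0.0.0/0
aws ec2 authorize-security-group-egress \
--group-id sg-xxxxxxxx \
--protocol tcp \
--port 53 \
--cidr 0.0.0.0/0
Error: Service Registered but Not Resolving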
# Check Cloud Map service instances
aws servicediscovery discover-instances \
--namespace-name my-namespace \
--service-name my-service \
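--max-results 10
Cause: Health check failing
Cloud Map leaves unhealthy instances out of DNS answers. Route 53 health checks (HealthCheckConfig) aren't available in private DNS namespaces; there, health comes from custom health checks that ECS reports on your behalf, based on task state and container health checks. For namespaces that do use Route 53 health checks:
# Check Route 53 health checks and their current status
aws route53 list-health-checks \
--query 'HealthChecks[*].{Id:Id,Type:HealthCheckConfig.Type}'
aws route53 get-health-check-status \
--health-check-id <id>
Fix: for Route 53 checks, configure the correct health check port and path. For internal services in a private namespace, use a custom health check in the Cloud Map service definition and make sure the container health check passes:
"HealthCheckCustomConfig": {
"FailureThreshold": 1
}
Task Can't Reach Another Task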
Cause: Security Group Rules
ECS tasks communicate over their task IP (awsvpc mode). The source task's security group must be allowed in the destination task's security group:
# Task A tries to reach Task B on port 8080
# Task B's security group must allow inbound from Task A's SG
aws ec2 authorize-security-group-ingress \
--group-id <task-b-sg> \
--protocol tcp \
--port 8080 \
--source-group <task-a-sg>
# Verify security group rules
aws ec2 describe-security-groups \
--group-ids <task-b-sg> \
--query 'SecurityGroups[0].IpPermissions'
Cause: Bridge Networking Mode
If tasks use bridge networking (not awsvpc), they share the EC2 host's network. Port mapping conflicts can occur. Prefer awsvpc mode for isolation and proper service discovery.
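To see which mode a task definition uses, you can read its networkMode field. The following is a small sketch with hypothetical helper names. A task definition that never set the field prints None here, which on Linux means the default bridge mode.

```shell
#!/bin/sh
# Print the network mode recorded in a task definition.
task_network_mode() {
  aws ecs describe-task-definition --task-definition "$1" \
    --query 'taskDefinition.networkMode' --output text
}

# Warn (and return non-zero) when the task definition is not using awsvpc.
require_awsvpc() {
  mode=$(task_network_mode "$1")
  if [ "$mode" = "awsvpc" ]; then
    echo "ok: $1 uses awsvpc"
  else
    echo "warning: $1 uses '$mode' networking" >&2
    return 1
  fi
}

# Usage: require_awsvpc my-task-family
```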
Debugging Checklist
# 1. Is the namespace correct?
aws servicediscovery list-namespaces
# 2. Is the service registered?
aws servicediscovery list-services \
--filters Name=NAMESPACE_ID,Values=<namespace-id>,Condition=EQ
# 3. Are instances healthy?
aws servicediscovery discover-instances \
--namespace-name my-namespace \
--service-name my-service
# 4. Can the task resolve DNS? (exec into the task)
aws ecs execute-command \
--cluster my-cluster \
--task <task-arn> \
--container my-container \
--interactive \
--command "/bin/sh"
# Then inside: nslookup my-service.my-namespace
# 5. Is the port correct?
aws ecs describe-services \
--cluster my-cluster \
--services my-service \
--query 'services[0].deployments[0].serviceConnectConfiguration'