AWS ALB 504 Gateway Timeout — Every Cause and Fix (2026)

Your ALB returns 504 Gateway Timeout but the app seems fine. Here's every reason this happens — backend timeouts, keepalive mismatches, health check failures — and exactly how to fix each one.

DevOpsBoys · Apr 9, 2026 · 6 min read

Your Application Load Balancer returns 504 Gateway Timeout to clients. Your EC2 or ECS container looks healthy. Logs show requests arriving — but no response. Users are angry.

Here's every reason this happens and exactly how to fix it.


What 504 Means at the ALB Layer

An ALB 504 means the load balancer forwarded the request to a target but didn't receive a response within the timeout window. The ALB gave up waiting.

This is different from:

  • 502 — target returned an invalid/empty response
  • 503 — no healthy targets registered
  • 504 — target is alive but too slow to respond

The ALB has an idle timeout (default: 60 seconds). If your backend doesn't respond within that window, you get 504.


Case 1: Backend Processing Time Exceeds ALB Idle Timeout

The most common cause. Your app takes longer than 60 seconds to process a request (large file upload, slow DB query, heavy computation).

Check it:

bash
# Check current ALB idle timeout
aws elbv2 describe-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:region:account:loadbalancer/app/my-alb/xxx \
  --query 'Attributes[?Key==`idle_timeout.timeout_seconds`]'

Fix — increase ALB idle timeout:

bash
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:region:account:loadbalancer/app/my-alb/xxx \
  --attributes Key=idle_timeout.timeout_seconds,Value=120

Or in Terraform:

hcl
resource "aws_lb" "main" {
  name               = "my-alb"
  internal           = false
  load_balancer_type = "application"
 
  idle_timeout = 120  # increase from default 60
 
  # ... other config
}

Also fix your backend — don't just raise the timeout as a band-aid. Optimize slow queries, add pagination, use async processing for heavy jobs.
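For the async route, the accept-then-poll pattern is a common shape: return a job id immediately (typically with HTTP 202) and run the heavy work in the background. A minimal sketch; the in-memory jobs store and submit helper are illustrative only, and in production you'd hand the work to SQS, Celery, or similar:

```python
import threading
import uuid

# Hypothetical in-memory job store -- in production use SQS, Celery, etc.
jobs = {}

def submit(task, *args):
    """Accept the work, return a job id immediately (HTTP 202 pattern),
    and run the task on a background thread."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending", "result": None}

    def run():
        jobs[job_id]["result"] = task(*args)
        jobs[job_id]["status"] = "done"

    threading.Thread(target=run, daemon=True).start()
    return job_id
```

The request handler calls submit() and returns the job id right away; a separate status endpoint reads the store, so no single request ever approaches the ALB idle timeout.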


Case 2: Target Group Health Check Failing Silently

Targets appear registered but fail health checks. When every target in the group is unhealthy, the ALB fails open and routes requests to all of them anyway, then times out.

Check it:

bash
# Check health of all targets in target group
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/my-tg/xxx

Look for State: unhealthy or State: draining. Also check:

bash
# What does the health check look like?
aws elbv2 describe-target-groups \
  --target-group-arns arn:aws:elasticloadbalancing:region:account:targetgroup/my-tg/xxx \
  --query 'TargetGroups[0].{Path:HealthCheckPath,Port:HealthCheckPort,Protocol:HealthCheckProtocol,Threshold:HealthyThresholdCount}'
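If you have many targets, eyeballing that JSON gets tedious. A small sketch that pulls out the non-healthy targets along with the ALB's stated reason (it assumes the standard describe-target-health response shape):

```python
import json

def unhealthy_targets(health_json):
    """List targets that aren't healthy, with the ALB's stated reason.
    Expects the JSON printed by `aws elbv2 describe-target-health`."""
    doc = json.loads(health_json)
    bad = []
    for desc in doc.get("TargetHealthDescriptions", []):
        health = desc.get("TargetHealth", {})
        if health.get("State") != "healthy":
            bad.append({
                "target": desc.get("Target", {}).get("Id"),
                "state": health.get("State"),
                "reason": health.get("Reason"),
            })
    return bad
```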

Common issues:

  • Health check path returns 404 (app doesn't have /health endpoint)
  • Health check hits wrong port
  • App not ready at startup but ALB sends traffic immediately

Fix:

hcl
resource "aws_lb_target_group" "app" {
  name     = "app-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id
 
  health_check {
    enabled             = true
    path                = "/health"    # must return 200
    port                = "traffic-port"
    protocol            = "HTTP"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 10
    matcher             = "200"
  }
}
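The /health endpoint itself can be trivial. A sketch using Python's standard library; your framework almost certainly has an equivalent one-liner:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # Keep this cheap: no DB calls, no external dependencies,
            # or one slow dependency marks the whole target unhealthy.
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # silence per-request logging

def serve(port=8080):
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Keep the handler dependency-free; a health check should answer "is this process alive and able to serve", nothing more.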

Case 3: Keepalive Mismatch Between ALB and Backend

This is a subtle one. ALB keeps HTTP connections alive to reuse them. If your backend closes connections faster than the ALB expects, the ALB sends a request on a dead connection and waits for a response that never comes.

The rule: your backend's keepalive timeout must be longer than ALB's idle timeout.

If ALB idle timeout = 60s but nginx keepalive = 30s, nginx closes the connection at 30s. ALB tries to reuse it at second 45 — gets nothing back — returns 504.

Fix for Nginx:

nginx
# nginx.conf
http {
  keepalive_timeout 75s;  # must be > ALB idle timeout (60s)
  keepalive_requests 1000;
}

Fix for Node.js:

javascript
const server = app.listen(8080);
server.keepAliveTimeout = 75000;  // 75 seconds in ms
server.headersTimeout = 76000;    // must be > keepAliveTimeout

Fix for Python (gunicorn):

bash
gunicorn app:app \
  --bind 0.0.0.0:8080 \
  --keepalive 75 \
  --timeout 120
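If you're not sure what your backend's effective keepalive timeout actually is, you can measure it. A sketch that makes one request on a persistent connection and times how long the server holds the idle socket open (measure_server_keepalive is a hypothetical helper, not an AWS tool):

```python
import socket
import time

def measure_server_keepalive(host, port, path="/", max_wait=120):
    """Make one keep-alive request, then time how long the server
    holds the idle connection open before closing it."""
    s = socket.create_connection((host, port), timeout=5)
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: keep-alive\r\n\r\n"
    ).encode()
    s.sendall(request)
    s.recv(65536)  # naively drain the response (fine for small bodies)
    start = time.monotonic()
    s.settimeout(max_wait)
    try:
        # recv() returning b"" means the server closed the connection
        while s.recv(4096):
            pass
    except socket.timeout:
        return None  # server held the connection longer than max_wait
    finally:
        s.close()
    return time.monotonic() - start
```

If the result is below your ALB idle timeout (60s by default), you've found the mismatch.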

Case 4: Security Group Blocking Return Traffic

Your app receives the request but the response packet is blocked by a security group rule. ALB never gets the response.

Check it:

bash
# Check security group on targets — must allow traffic FROM ALB SG
aws ec2 describe-security-groups --group-ids sg-xxxxxx \
  --query 'SecurityGroups[0].IpPermissions'

The target security group must allow inbound from the ALB security group on the app port — not just a CIDR range.

Correct pattern:

hcl
# ALB security group
resource "aws_security_group" "alb" {
  name   = "alb-sg"
  vpc_id = var.vpc_id
 
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
 
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
 
# App/EC2 security group
resource "aws_security_group" "app" {
  name   = "app-sg"
  vpc_id = var.vpc_id
 
  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]  # ALB SG, not CIDR
  }
 
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
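To verify the rule programmatically, a sketch that checks the JSON from aws ec2 describe-security-groups for SG-based ingress on the app port (allows_ingress_from_sg is an illustrative helper):

```python
import json

def allows_ingress_from_sg(sg_json, source_sg_id, port):
    """Check whether a security group (as printed by
    `aws ec2 describe-security-groups`) allows TCP ingress on `port`
    from another security group rather than a CIDR."""
    doc = json.loads(sg_json)
    for perm in doc["SecurityGroups"][0].get("IpPermissions", []):
        if perm.get("IpProtocol") not in ("tcp", "-1"):
            continue
        # "-1" (all traffic) rules carry no FromPort/ToPort
        from_p = perm.get("FromPort", 0)
        to_p = perm.get("ToPort", 65535)
        if not (from_p <= port <= to_p):
            continue
        for pair in perm.get("UserIdGroupPairs", []):
            if pair.get("GroupId") == source_sg_id:
                return True
    return False
```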

Case 5: Target Group Deregistration Delay Too High

When deploying new versions (rolling update), old targets enter draining state. The ALB stops sending new requests to a draining target but keeps it registered until in-flight requests complete or the deregistration delay expires (default: 300s). If your app has already shut down, those in-flight requests hang until they time out, and every deploy stalls for up to 5 minutes waiting out the drain.

Check it:

bash
aws elbv2 describe-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/my-tg/xxx \
  --query 'Attributes[?Key==`deregistration_delay.timeout_seconds`]'

Fix:

hcl
resource "aws_lb_target_group" "app" {
  # ...
 
  deregistration_delay = 30  # reduce from default 300s if app shuts down faster
}

Also add graceful shutdown in your app — handle SIGTERM, finish in-flight requests, then exit cleanly.


Case 6: ECS Task Stopping Mid-Request

In ECS, if a task is stopped (scale-in, deployment, OOM kill) while handling a request, ALB gets a broken connection — 504.

Fix — ECS task graceful shutdown:

In your task_definition, add a stopTimeout:

hcl
resource "aws_ecs_task_definition" "app" {
  family = "app"
 
  container_definitions = jsonencode([{
    name  = "app"
    image = "myapp:v1.0"
    portMappings = [{
      containerPort = 8080
      protocol      = "tcp"
    }]
    stopTimeout = 30  # give container 30s to finish requests before SIGKILL
  }])
}

Your app should also handle SIGTERM:

python
import signal
import sys
 
def graceful_shutdown(signum, frame):
    print("Received SIGTERM, finishing in-flight requests...")
    # stop accepting new requests
    # wait for active requests to complete
    sys.exit(0)
 
signal.signal(signal.SIGTERM, graceful_shutdown)

Quick Diagnosis Checklist

When you see ALB 504, check in this order:

  • ALB idle timeout: aws elbv2 describe-load-balancer-attributes
  • Target health: aws elbv2 describe-target-health
  • Backend response time: CloudWatch → ALB → TargetResponseTime metric
  • Security group rules: check app SG allows ALB SG
  • Backend keepalive config: check nginx/node/gunicorn settings
  • Deregistration delay: check target group attributes

Reading ALB Access Logs

Enable access logs to see the actual error details:

hcl
resource "aws_lb" "main" {
  access_logs {
    bucket  = aws_s3_bucket.alb_logs.bucket
    prefix  = "alb"
    enabled = true
  }
}

In the logs, 504s appear with an elb_status_code of 504 and a target_status_code of "-" (the target never completed a response). A simplified excerpt:

http 2026-04-09T12:00:00.000000Z app/my-alb/xxx 203.0.113.10:54321 10.0.1.5:8080 0.001 -1 -1 504 - 112 288 "GET /api/slow HTTP/1.1" ...

The target_processing_time field (the first -1 above) tells you how long the backend took to start responding; -1 means the ALB gave up before the target sent anything back.
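To scan a log file for these entries in bulk, a small parser helps. A sketch; the field positions follow the documented ALB access log format, so adjust the indices if your lines differ:

```python
import shlex

def find_504s(log_lines):
    """Pull the 504 entries out of ALB access log lines and report
    the timing fields. Positions per the ALB access log format:
    target:port is field 5, target_processing_time field 7,
    elb_status_code field 9, "request" field 13 (1-indexed)."""
    hits = []
    for line in log_lines:
        fields = shlex.split(line)  # handles the quoted "GET / HTTP/1.1" field
        if len(fields) > 12 and fields[8] == "504":
            hits.append({
                "target": fields[4],
                "target_processing_time": fields[6],
                "request": fields[12],
            })
    return hits
```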


Summary

  • Backend too slow: increase ALB idle timeout + optimize app
  • Unhealthy targets: fix health check path/port
  • Keepalive mismatch: set backend keepalive > ALB idle timeout
  • Security group blocking: allow ALB SG → app SG
  • High deregistration delay: reduce to 30s + add graceful shutdown
  • ECS task stopping mid-request: set stopTimeout + handle SIGTERM

ALB 504s almost always come down to one of these six causes. Work through the checklist, check CloudWatch metrics for TargetResponseTime, and you'll find the root cause within 10 minutes.


Want to go deeper? Check our AWS VPC Networking Complete Guide and AWS CloudWatch Monitoring Guide.
