What is Rate Limiting? A Clear Explanation for DevOps Engineers

Rate limiting protects your APIs and infrastructure from overload and abuse. Here's what it is, how it works, and how to implement it in Nginx, Kubernetes, and code.

Rate limiting is one of the most fundamental protection mechanisms for web services. Without it, a single misbehaving client can take down your API for everyone.

What is Rate Limiting?

Rate limiting restricts how many requests a client can make in a given time window.

Client makes 1000 requests/second
Rate limit: 100 requests/second per IP

→ First 100: allowed
→ Requests 101-1000: rejected with HTTP 429 (Too Many Requests)

It protects against:

DDoS / Brute force attacks — automated tools sending thousands of requests
API abuse — scrapers hitting your API without restraint
Runaway clients — a bug in a client causing infinite retry loops
Uneven load — one customer consuming all capacity

Rate Limiting Algorithms

Token Bucket (Most Common)

Imagine a bucket that fills with tokens at a fixed rate. Each request costs one token. If the bucket is empty, the request is rejected.

Bucket capacity: 100 tokens
Fill rate: 10 tokens/second

Client sends 100 requests instantly → all allowed (uses all 100 tokens)
Client sends 1 more request → rejected (bucket empty)
10 seconds later → 100 tokens refilled, client can make 100 more requests

This allows short bursts (emptying the bucket) while maintaining a steady average rate.
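As a rough single-process sketch, here is the token bucket above in Python with the same capacity (100) and fill rate (10 tokens/second). The class name TokenBucket and the optional explicit `now` timestamps are illustrative choices, not taken from any library. Passing timestamps in keeps the replay of the scenario deterministic.

```python
import time
from typing import Optional


class TokenBucket:
    """Token bucket: refills at `fill_rate` tokens/second, up to `capacity`."""

    def __init__(self, capacity: float, fill_rate: float, now: Optional[float] = None):
        self.capacity = capacity
        self.fill_rate = fill_rate
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None, cost: float = 1.0) -> bool:
        now = time.monotonic() if now is None else now
        # Add tokens for the time elapsed since the last call, capped at capacity
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.fill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Replay the scenario from the text with explicit timestamps
bucket = TokenBucket(capacity=100, fill_rate=10, now=0.0)
burst = [bucket.allow(now=0.0) for _ in range(100)]   # 100 instant requests
extra = bucket.allow(now=0.0)                          # the 101st request
later = [bucket.allow(now=10.0) for _ in range(100)]   # 10 seconds later
print(all(burst), extra, all(later))  # True False True
```

In a real service you would keep one bucket per client (for example in a dict keyed by API key) and add a lock if several threads share it.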

Fixed Window Counter

Count requests in a fixed time window (e.g., per minute):

Window: 0-60 seconds → limit 100 requests
Window: 60-120 seconds → limit 100 requests (counter resets)

Simple, but it has a boundary problem: a client can make 100 requests at 59 seconds and 100 more at 61 seconds, effectively 200 requests in 2 seconds.
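The boundary problem is easy to reproduce. This minimal sketch (the class FixedWindowCounter is hypothetical, and timestamps in seconds are passed explicitly) buckets each request into window number int(now // window) and replays the 59 s / 61 s scenario:

```python
from collections import defaultdict


class FixedWindowCounter:
    """Allows `limit` requests per fixed window of `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # (client_id, window index) -> count

    def allow(self, client_id: str, now: float) -> bool:
        key = (client_id, int(now // self.window))  # which window `now` falls in
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


limiter = FixedWindowCounter(limit=100, window=60)
at_59 = sum(limiter.allow("client", now=59) for _ in range(100))  # window 0
at_61 = sum(limiter.allow("client", now=61) for _ in range(100))  # window 1
print(at_59 + at_61)  # 200
```

Old window counters are never evicted in this sketch. A Redis version would typically use INCR on a per-window key with an expiry.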

Sliding Window

Counts requests in the N seconds before each request (here N = 60) instead of in fixed windows, so there is no boundary at which the counter suddenly resets:

Request at 14:00:59 → count requests from 14:00:00 to 14:00:59 → 95 (allowed)
Request at 14:01:01 → count requests from 14:00:02 to 14:01:01 → 98 (allowed)
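A minimal sliding-window-log sketch, assuming timestamps in seconds: it keeps each client's allowed request times in a deque and counts only those in the interval (now - window, now], which matches the 14:00:02 to 14:01:01 example above. The class name is illustrative. Replaying the fixed-window boundary scenario shows the second burst being rejected:

```python
from collections import defaultdict, deque


class SlidingWindowLog:
    """Allows at most `limit` requests in any `window`-second interval."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.log = defaultdict(deque)  # client_id -> timestamps of allowed requests

    def allow(self, client_id: str, now: float) -> bool:
        timestamps = self.log[client_id]
        # Evict timestamps at or before now - window (they are outside the window)
        while timestamps and timestamps[0] <= now - self.window:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False
        timestamps.append(now)
        return True


limiter = SlidingWindowLog(limit=100, window=60)
at_59 = sum(limiter.allow("client", now=59) for _ in range(100))
at_61 = sum(limiter.allow("client", now=61) for _ in range(100))
print(at_59, at_61)  # 100 0
```

The trade-off is memory: the log stores up to `limit` timestamps per client, whereas a fixed window stores a single counter.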

Rate Limiting in Nginx

nginx

# nginx.conf
http {
  # Define a zone: 10MB storage for IPs, 10 req/second limit
  limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
  
  server {
    location /api/ {
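      # Queue up to 20 excess requests; nodelay serves them immediately
      # instead of spacing them out, and anything beyond the burst is rejected
      limit_req zone=api burst=20 nodelay;
      limit_req_status 429;   # nginx returns 503 by default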
      
      proxy_pass http://backend;
    }
  }
}

On Kubernetes with Nginx Ingress Controller:

yaml

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-api
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "10"          # requests per second
    nginx.ingress.kubernetes.io/limit-rpm: "300"         # requests per minute
    nginx.ingress.kubernetes.io/limit-burst-multiplier: "3"  # allow burst 3x
    nginx.ingress.kubernetes.io/limit-connections: "20"  # concurrent connections
spec:
  ingressClassName: nginx   # IngressClass of your ingress-nginx controller
  rules:
    - host: api.myapp.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-api
                port:
                  number: 80

Rate Limiting in Traefik

yaml

# Traefik middleware
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: rate-limit
  namespace: default
spec:
  rateLimit:
    average: 100    # requests per second (average)
    burst: 200      # max burst
    period: 1s      # measurement period
    sourceCriterion:
      ipStrategy:
        depth: 1    # client IP = rightmost X-Forwarded-For entry (one proxy in front)
---
# Apply to IngressRoute
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: my-api
spec:
  routes:
    - match: Host(`api.myapp.com`)
      kind: Rule
      middlewares:
        - name: rate-limit
      services:
        - name: my-api
          port: 80

Rate Limiting in FastAPI / Python

python

import time
import uuid
from collections import defaultdict

import redis.asyncio as redis  # async client so Redis calls don't block the event loop
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
 
app = FastAPI()
 
# Simple in-memory sliding-window limiter (single instance only: state is not shared)
class InMemoryRateLimiter:
    def __init__(self, requests_per_minute: int = 60):
        self.requests_per_minute = requests_per_minute
        self.requests = defaultdict(list)
    
    def is_allowed(self, client_id: str) -> bool:
        now = time.time()
        window_start = now - 60  # 1-minute sliding window
        
        # Drop timestamps that have fallen out of the window
        self.requests[client_id] = [
            req_time for req_time in self.requests[client_id]
            if req_time > window_start
        ]
        
        if len(self.requests[client_id]) >= self.requests_per_minute:
            return False
        
        self.requests[client_id].append(now)
        return True
 
 
# Redis-backed sliding-window limiter (shared across app instances)
class RedisRateLimiter:
    def __init__(self, redis_client: redis.Redis, requests_per_minute: int = 60):
        self.redis = redis_client
        self.limit = requests_per_minute
    
    async def is_allowed(self, client_id: str) -> tuple[bool, int]:
        key = f"rate_limit:{client_id}"
        now = time.time()
        window_start = now - 60
        
        # MULTI/EXEC: trim old entries, count, record this request, refresh the TTL.
        # Rejected requests are recorded too, so a client that keeps retrying stays blocked.
        async with self.redis.pipeline(transaction=True) as pipe:
            pipe.zremrangebyscore(key, "-inf", window_start)
            pipe.zcard(key)
            # Unique member: with str(int(now)) requests in the same second would collapse into one
            pipe.zadd(key, {f"{now}:{uuid.uuid4().hex}": now})
            pipe.expire(key, 61)
            results = await pipe.execute()
        
        current_count = results[1]  # count before this request was added
        allowed = current_count < self.limit
        remaining = max(0, self.limit - current_count - 1)
        return allowed, remaining
 
 
limiter = RedisRateLimiter(redis.Redis(host="redis"), requests_per_minute=100)
 
 
@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    # Identify the client by API key, falling back to IP. Validate the key first,
    # otherwise clients can dodge the limit by sending random keys.
    client_host = request.client.host if request.client else "unknown"
    client_id = request.headers.get("X-API-Key") or client_host
    
    allowed, remaining = await limiter.is_allowed(client_id)
    limit = str(limiter.limit)
    
    if not allowed:
        return JSONResponse(
            status_code=429,
            content={"error": "Rate limit exceeded. Try again in 60 seconds."},
            headers={
                "X-RateLimit-Limit": limit,
                "X-RateLimit-Remaining": "0",
                "Retry-After": "60",
            },
        )
    
    response = await call_next(request)
    response.headers["X-RateLimit-Limit"] = limit
    response.headers["X-RateLimit-Remaining"] = str(remaining)
    return response

Rate Limit Headers

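Good APIs tell clients about their rate limits in response headers. Retry-After is a standard HTTP header, while the X-RateLimit-* headers are a widely used convention rather than a formal standard. Here X-RateLimit-Reset is the Unix timestamp at which the current window resets:

http

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 73
X-RateLimit-Reset: 1750000000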

When rate limited:

http

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
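On the client side, a minimal sketch of honoring these headers. Retry-After may be either a number of seconds or an HTTP date (RFC 9110), so the hypothetical helper retry_delay_seconds handles both and falls back to a default when the header is missing or malformed:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_delay_seconds(
    headers: Mapping[str, str],
    default: float = 1.0,
    now: Optional[datetime] = None,
) -> float:
    """Return how many seconds to wait before retrying a 429 response."""
    value = headers.get("Retry-After")
    if value is None:
        return default
    value = value.strip()
    if value.isascii() and value.isdigit():  # delta-seconds form, e.g. "30"
        return float(value)
    try:  # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):  # older Pythons raise TypeError on bad input
        return default
    if retry_at.tzinfo is None:  # treat naive dates as UTC
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (retry_at - now).total_seconds())


print(retry_delay_seconds({"Retry-After": "30"}))  # 30.0
```

With a plain dict the header lookup is case-sensitive; response objects from HTTP clients such as requests or httpx expose case-insensitive header mappings, which fit the same Mapping interface.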

Different Limits for Different Users

Limits are often tiered by plan, so paying customers get more headroom:

python

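RATE_LIMITS = {
    "free": 60,          # 60 req/min
    "pro": 600,          # 600 req/min
    "enterprise": 6000,  # 6000 req/min
}
 
def get_rate_limit(api_key: str) -> int:
    # lookup_api_key_tier is a placeholder for your own API-key store (database, cache)
    tier = lookup_api_key_tier(api_key)
    return RATE_LIMITS.get(tier, RATE_LIMITS["free"])  # unknown tiers get the free limit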
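One way to wire tiers into a limiter is to look up the caller's limit on every request. This sketch stands in for the undefined lookup_api_key_tier with a hypothetical in-memory API_KEY_TIERS dict (in practice a database or cache lookup) and enforces the per-tier limit with a one-minute sliding window:

```python
import time
from collections import defaultdict, deque
from typing import Optional

RATE_LIMITS = {"free": 60, "pro": 600, "enterprise": 6000}  # requests per minute

# Hypothetical key store standing in for a real database lookup
API_KEY_TIERS = {"key-free-123": "free", "key-pro-456": "pro"}


def get_rate_limit(api_key: str) -> int:
    tier = API_KEY_TIERS.get(api_key, "free")  # unknown keys fall back to the free tier
    return RATE_LIMITS.get(tier, RATE_LIMITS["free"])


class TieredRateLimiter:
    """Sliding one-minute window whose limit depends on the caller's tier."""

    def __init__(self, window: float = 60.0):
        self.window = window
        self.log = defaultdict(deque)  # api_key -> timestamps of allowed requests

    def allow(self, api_key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        timestamps = self.log[api_key]
        # Drop timestamps that are outside the window
        while timestamps and timestamps[0] <= now - self.window:
            timestamps.popleft()
        if len(timestamps) >= get_rate_limit(api_key):
            return False
        timestamps.append(now)
        return True


limiter = TieredRateLimiter()
free_ok = sum(limiter.allow("key-free-123", now=0.0) for _ in range(100))
pro_ok = sum(limiter.allow("key-pro-456", now=0.0) for _ in range(1000))
print(free_ok, pro_ok)  # 60 600
```

Looking the limit up per request means a plan upgrade takes effect immediately; caching the tier for a short time keeps the lookup cheap.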

Key Points to Remember

Rate limit by API key, not IP when possible — users behind NAT share an IP
Return a proper 429 status with a Retry-After header so clients know when to retry
Use a shared store such as Redis in multi-instance deployments (in-memory counters are not shared between instances)
Allow bursts — token bucket lets clients batch requests without constant rejection
Rate limit at the gateway level (Nginx, Traefik) to protect before requests hit your app

Rate limiting is one of those things that seems optional until you get your first DDoS or runaway client. Add it before you need it.

Tools: Nginx rate limiting | Redis | FastAPI
