LLM Gateway in Production: Multi-Provider Routing + Fallbacks with LiteLLM

Running one LLM provider in production is a single point of failure. Here's how to build an LLM gateway with LiteLLM that routes traffic, handles fallbacks, enforces cost limits, and gives you observability.

DevOpsBoys | Jun 12, 2026 | 5 min read

Your LLM application has a problem you probably haven't thought about yet: you're 100% dependent on one API provider.

Anthropic goes down. OpenAI rate-limits you during a traffic spike. Your Azure OpenAI deployment runs out of capacity. Any of these events takes your entire product offline.

The solution is an LLM gateway: a proxy layer that sits between your application and your LLM providers and handles routing, fallbacks, cost controls, and observability. LiteLLM is a widely used open source option for this, and this post covers how to deploy it properly in production.

What an LLM Gateway Does

Think of it like an API gateway (Nginx, Kong, APIM) but purpose-built for LLMs:

  • Unified API — your application calls one endpoint, the gateway translates to provider-specific formats
  • Load balancing — distribute traffic across providers or deployments
  • Fallback chains — if provider A fails, automatically retry with provider B
  • Cost guardrails — enforce spend limits per team, per key, per model
  • Rate limiting — prevent a single service from consuming all your quota
  • Observability — log every request, track cost per model, latency per provider
  • Caching — return cached responses for identical prompts (cuts cost dramatically)
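
To make the fallback idea concrete, here is a minimal sketch of what "try provider A, then provider B" means in code. It is not LiteLLM's implementation. The names `call_with_fallbacks`, `AllProvidersFailed`, and `fake_send` are hypothetical, and `fake_send` stands in for a real provider call.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when the primary and every fallback have failed."""


def call_with_fallbacks(
    models: Sequence[str],
    send: Callable[[str], str],
) -> tuple[str, str]:
    """Try each model in order; return (model_used, response_text)."""
    errors: list[str] = []
    for model in models:
        try:
            return model, send(model)
        except Exception as exc:  # a real gateway would only catch retryable errors
            errors.append(f"{model}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def fake_send(model: str) -> str:
    """Hypothetical provider call: the primary is 'down', the others answer."""
    if model == "claude-sonnet":
        raise TimeoutError("provider unavailable")
    return f"hello from {model}"


if __name__ == "__main__":
    print(call_with_fallbacks(["claude-sonnet", "gpt-4o", "azure-gpt4"], fake_send))
```

The caller only sees which model answered. The failure of the primary is absorbed by the loop, which is the behavior the gateway provides for every request.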

LiteLLM Proxy Architecture

Application Code
      │
      │ (OpenAI-compatible API)
      ▼
  LiteLLM Proxy
      │
  ┌───┴──────────────────────────────────┐
  │   Router: Load Balance + Fallback    │
  └───┬──────────┬──────────┬────────────┘
      │          │          │
  Anthropic   OpenAI    Azure OpenAI
  Claude     GPT-4o     GPT-4o

Your application code does not change when you add or switch providers. It always talks to one OpenAI-compatible endpoint, for example http://litellm-proxy:4000/v1/chat/completions, using the OpenAI request format.
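
As a rough illustration of what "one OpenAI-compatible endpoint" means, the sketch below builds such a request with only the standard library. The helper name `build_chat_request`, the host `litellm-proxy:4000`, and the key `sk-your-key` are assumptions for illustration. In practice you would more likely use an SDK.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request aimed at the gateway."""
    payload = {
        "model": model,  # a model_name known to the gateway, not a provider model ID
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_chat_request("http://litellm-proxy:4000", "sk-your-key", "claude-sonnet", "Hello")
    print(req.get_method(), req.full_url)
    # Sending it requires a reachable proxy:
    # with urllib.request.urlopen(req, timeout=60) as resp:
    #     print(json.load(resp)["choices"][0]["message"]["content"])
```

Switching providers then means changing only the `model` string, or nothing at all if the gateway remaps the name behind the scenes.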

Step 1: LiteLLM Config

yaml
# litellm_config.yaml
model_list:
  # Primary: Anthropic Claude
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 50           # Requests per minute limit
      tpm: 100000       # Tokens per minute limit
 
  # Primary: OpenAI GPT-4o
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 100
      tpm: 200000
 
  # Fallback: Azure OpenAI
  - model_name: azure-gpt4
    litellm_params:
      model: azure/gpt-4o
      api_base: os.environ/AZURE_OPENAI_ENDPOINT
      api_key: os.environ/AZURE_OPENAI_API_KEY
      api_version: "2024-02-01"
      rpm: 60
 
  # Fast cheap model for simple tasks
  - model_name: claude-haiku
    litellm_params:
      model: anthropic/claude-haiku-4-5-20251001
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 200
      tpm: 500000
 
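  # Alias: production-llm is served by the same Claude deployment as claude-sonnet
  # and gets its own fallback chain under router_settings
  - model_name: production-llm
    litellm_params:
      model: anthropic/claude-sonnet-4-6   # must be a provider model ID, not another model_name
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      id: production-primary
 
router_settings:
  # Fallback chain: if claude-sonnet fails, try gpt-4o, then azure-gpt4
  fallbacks:
    - {"claude-sonnet": ["gpt-4o", "azure-gpt4"]}
    - {"gpt-4o": ["azure-gpt4", "claude-sonnet"]}
    - {"production-llm": ["gpt-4o", "azure-gpt4"]}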
  
  # Retry configuration
  num_retries: 3
  retry_after: 5  # minimum seconds to wait before a retry
  
  # Routing strategy (applies when several deployments share one model_name)
  routing_strategy: "usage-based-routing-v2"  # Picks the deployment with the lowest current TPM usage
  
  # Content policy fallbacks
  content_policy_fallbacks:
    - {"claude-sonnet": ["gpt-4o"]}
 
litellm_settings:
  # Global settings
  drop_params: true        # Remove unsupported params instead of erroring
  request_timeout: 60      # Timeout per request
  
  # Caching - saves cost on repeated prompts
  cache: true
  cache_params:
    type: "redis"
    host: os.environ/REDIS_HOST
    port: 6379
    ttl: 3600              # Cache for 1 hour
  
  # Callbacks for observability
  success_callback: ["langfuse"]
  failure_callback: ["langfuse", "slack"]
 
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL  # PostgreSQL for spend tracking
  
  # Slack alerting (reads the webhook from SLACK_WEBHOOK_URL)
  alerting: ["slack"]
  alerting_threshold: 300  # Seconds: alert when a request hangs or runs longer than this
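
The `cache` settings in this config return a stored response when an identical prompt arrives within the TTL. The sketch below shows that idea with an in-memory dictionary standing in for Redis. The class `PromptCache` and its key derivation are assumptions for illustration; LiteLLM's actual cache key may be built differently.

```python
import hashlib
import json
import time
from typing import Optional


class PromptCache:
    """In-memory stand-in for a Redis-backed response cache with a TTL."""

    def __init__(self, ttl_seconds: float = 3600, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def key(model: str, messages: list[dict]) -> str:
        # Identical model + messages always produce the same key.
        canonical = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get(self, model: str, messages: list[dict]) -> Optional[str]:
        entry = self._store.get(self.key(model, messages))
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            return None  # expired: the caller should hit the provider again
        return response

    def put(self, model: str, messages: list[dict], response: str) -> None:
        expires_at = self._clock() + self.ttl_seconds
        self._store[self.key(model, messages)] = (expires_at, response)
```

A cache hit costs nothing at the provider, so the savings depend entirely on how often your traffic repeats the exact same model and messages.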

Step 2: Deploy on Kubernetes

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
  namespace: platform
spec:
  replicas: 3
  selector:
    matchLabels:
      app: litellm-proxy
  template:
    metadata:
      labels:
        app: litellm-proxy
    spec:
      containers:
      - name: litellm
        image: ghcr.io/berriai/litellm:main-latest
        ports:
        - containerPort: 4000
        args: ["--config", "/app/config/litellm_config.yaml", "--port", "4000"]
        env:
        - name: ANTHROPIC_API_KEY
          valueFrom:
            secretKeyRef:
              name: llm-secrets
              key: anthropic-api-key
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: llm-secrets
              key: openai-api-key
        - name: LITELLM_MASTER_KEY
          valueFrom:
            secretKeyRef:
              name: llm-secrets
              key: litellm-master-key
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: llm-secrets
              key: database-url
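        # Azure credentials referenced by litellm_config.yaml
        # (secret key names here are assumed; match them to your Secret)
        - name: AZURE_OPENAI_ENDPOINT
          valueFrom:
            secretKeyRef:
              name: llm-secrets
              key: azure-openai-endpoint
        - name: AZURE_OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: llm-secrets
              key: azure-openai-api-key
        - name: REDIS_HOST
          value: "redis-service"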
        volumeMounts:
        - name: config
          mountPath: /app/config
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        readinessProbe:
          httpGet:
            path: /health/readiness
            port: 4000
          initialDelaySeconds: 10
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health/liveliness
            port: 4000
          initialDelaySeconds: 30
          periodSeconds: 30
      volumes:
      - name: config
        configMap:
          name: litellm-config
---
apiVersion: v1
kind: Service
metadata:
  name: litellm-proxy
  namespace: platform
spec:
  selector:
    app: litellm-proxy
  ports:
  - port: 4000
    targetPort: 4000
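
After the rollout, a quick smoke test can confirm that the Service answers on the same readiness path the probe uses. This is a minimal sketch: `is_ready` and `http_status` are hypothetical helpers, and the fetch function is injectable so the logic can be exercised without a cluster.

```python
import urllib.request
from typing import Callable


def is_ready(base_url: str, fetch_status: Callable[[str], int]) -> bool:
    """Return True when the gateway's readiness endpoint answers HTTP 200."""
    url = f"{base_url.rstrip('/')}/health/readiness"
    try:
        return fetch_status(url) == 200
    except OSError:  # covers URLError, HTTPError, timeouts, connection refusals
        return False


def http_status(url: str) -> int:
    """Fetch a URL with the standard library and return its status code."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.status


if __name__ == "__main__":
    # Run from a pod or a port-forward that can reach the Service.
    print(is_ready("http://litellm-proxy.platform:4000", http_status))
```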

Step 3: Virtual Keys for Team Budget Control

Create separate API keys for each team with individual spend limits:

bash
# Create a key for the search team with $500/month limit
curl -X POST http://litellm-proxy:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "search-team",
    "team_id": "search",
    "max_budget": 500,
    "budget_duration": "30d",
    "models": ["claude-sonnet", "claude-haiku"],
    "tpm_limit": 50000,
    "rpm_limit": 100
  }'

The search team gets its own key. Once that key's spend passes $500 within a 30-day budget window, the proxy rejects further requests on that key only. Other teams are unaffected.
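
The per-key isolation can be pictured as one small spend tracker per virtual key. The sketch below is a simplified model of that behavior, not LiteLLM's implementation. `KeyBudget` and the second team, `ads-team`, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KeyBudget:
    """Spend tracker for one virtual key over a fixed-length budget window."""

    max_budget: float          # dollars allowed per window
    window_seconds: float      # e.g. 30 days = 30 * 86400
    window_start: float = 0.0
    spend: float = 0.0

    def allow(self, now: float) -> bool:
        """Reset the window if it has elapsed, then check remaining budget."""
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.spend = 0.0
        return self.spend < self.max_budget

    def record(self, cost: float) -> None:
        """Add the cost of a completed request to the current window."""
        self.spend += cost


budgets = {
    "search-team": KeyBudget(max_budget=500, window_seconds=30 * 86400),
    "ads-team": KeyBudget(max_budget=500, window_seconds=30 * 86400),  # hypothetical second team
}
```

Because each key carries its own counter, exhausting one budget has no effect on any other key until that key's own window resets.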

Step 4: Using the Gateway in Application Code

Your application logic doesn't change. Swap the provider API key for a LiteLLM virtual key and point the base URL at the gateway:

python
from anthropic import Anthropic
 
# Before: direct to Anthropic
client = Anthropic(api_key="sk-ant-...")
 
# After: through LiteLLM gateway (still uses Anthropic SDK!)
client = Anthropic(
    api_key="sk-your-litellm-virtual-key",
    base_url="http://litellm-proxy.platform:4000"
)
 
# Code is identical from here
response = client.messages.create(
    model="claude-sonnet",  # LiteLLM routes this to the right provider
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this log file..."}]
)

Step 5: Observability Dashboard

LiteLLM exposes metrics in Prometheus format once the prometheus callback is enabled (callbacks: ["prometheus"] under litellm_settings):

yaml
# Add to your Prometheus scrape config
- job_name: litellm
  static_configs:
    - targets: ['litellm-proxy.platform:4000']
  metrics_path: /metrics

Key metrics to dashboard:

  • litellm_llm_api_latency_metric — p50/p95/p99 latency per model
  • litellm_requests_metric — request count per model, per team
  • litellm_spend_metric — cost per model, per virtual key
  • litellm_remaining_requests_metric — how much rate limit remains
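
For a quick look at these numbers without a dashboard, the Prometheus text format is simple enough to aggregate by hand. The sketch below sums one metric by one label. The helper `sum_by_label`, the sample values, and the label names `model` and `team` are assumptions; check your proxy's actual `/metrics` output, where counter names may carry a `_total` suffix.

```python
import re
from collections import defaultdict

SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def sum_by_label(metrics_text: str, metric: str, label: str) -> dict[str, float]:
    """Sum one metric's samples, grouped by the value of one label."""
    totals: dict[str, float] = defaultdict(float)
    for line in metrics_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        totals[labels.get(label, "unknown")] += float(match.group("value"))
    return dict(totals)


# Made-up sample in Prometheus text exposition format.
SAMPLE = """\
# HELP litellm_spend_metric Total spend
# TYPE litellm_spend_metric counter
litellm_spend_metric{model="claude-sonnet",team="search"} 12.5
litellm_spend_metric{model="gpt-4o",team="search"} 3.0
litellm_spend_metric{model="claude-sonnet",team="ads"} 1.5
litellm_requests_metric{model="claude-sonnet",team="search"} 40
"""

if __name__ == "__main__":
    print(sum_by_label(SAMPLE, "litellm_spend_metric", "model"))
    print(sum_by_label(SAMPLE, "litellm_spend_metric", "team"))
```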

Fallback Testing

Test that your fallback chain actually works:

python
import httpx
import asyncio
 
async def test_fallback():
    # mock_testing_fallbacks makes LiteLLM fail the primary call on purpose,
    # so the request should be served by the first healthy fallback
    async with httpx.AsyncClient(timeout=60) as client:
        response = await client.post(
            "http://litellm-proxy:4000/v1/chat/completions",
            headers={"Authorization": "Bearer sk-your-key"},
            json={
                "model": "claude-sonnet",
                "messages": [{"role": "user", "content": "Hello"}],
                "mock_testing_fallbacks": True  # LiteLLM test parameter
            }
        )
        response.raise_for_status()  # surface auth or routing errors early
        data = response.json()
        # Check which model actually responded
        print(f"Model used: {data.get('model')}")
 
asyncio.run(test_fallback())
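
A live test only covers the chain you exercise. A cheap static check is to confirm that every model named in a fallback list is actually defined in `model_list`. The sketch below does that over a Python dict that mirrors the relevant parts of the config, since the standard library has no YAML parser. The function name `undefined_fallback_models` is hypothetical.

```python
def undefined_fallback_models(config: dict) -> set[str]:
    """Return model names used in fallbacks that model_list does not define."""
    defined = {entry["model_name"] for entry in config.get("model_list", [])}
    referenced: set[str] = set()
    router = config.get("router_settings", {})
    for setting in ("fallbacks", "content_policy_fallbacks"):
        for mapping in router.get(setting, []):
            for source, targets in mapping.items():
                referenced.add(source)
                referenced.update(targets)
    return referenced - defined


# The relevant parts of litellm_config.yaml, written out as a Python dict.
config = {
    "model_list": [
        {"model_name": "claude-sonnet"},
        {"model_name": "gpt-4o"},
        {"model_name": "azure-gpt4"},
        {"model_name": "claude-haiku"},
    ],
    "router_settings": {
        "fallbacks": [
            {"claude-sonnet": ["gpt-4o", "azure-gpt4"]},
            {"gpt-4o": ["azure-gpt4", "claude-sonnet"]},
        ],
        "content_policy_fallbacks": [{"claude-sonnet": ["gpt-4o"]}],
    },
}

if __name__ == "__main__":
    missing = undefined_fallback_models(config)
    print("undefined fallback targets:", sorted(missing) or "none")
```

Running a check like this in CI catches a renamed or deleted model before it turns a fallback into a second failure.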

What This Buys You

A single Anthropic API key is:

  • A single point of failure
  • Invisible (no observability)
  • Uncontrolled (any team can spend without limit)
  • Locked in (switching providers requires code changes)

An LLM gateway fixes all four. The extra infrastructure cost (one small deployment, one Redis, one PostgreSQL) is trivial compared to the operational control you gain.

For any team running LLMs in production at scale — this is table stakes. Not optional.
