LLM Gateway in Production: Multi-Provider Routing + Fallbacks with LiteLLM
Running one LLM provider in production is a single point of failure. Here's how to build an LLM gateway with LiteLLM that routes traffic, handles fallbacks, enforces cost limits, and gives you observability.
Your LLM application has a problem you probably haven't thought about yet: you're 100% dependent on one API provider.
Anthropic goes down. OpenAI rate-limits you during a traffic spike. Your Azure OpenAI deployment runs out of capacity. Any of these events takes your entire product offline.
The solution is an LLM Gateway — a proxy layer that sits between your application and your LLM providers, handling routing, fallbacks, cost controls, and observability. LiteLLM is the best open source option for this, and deploying it properly in production is what this post is about.
What an LLM Gateway Does
Think of it like an API gateway (Nginx, Kong, APIM) but purpose-built for LLMs:
- Unified API: your application calls one endpoint, and the gateway translates to provider-specific formats
- Load balancing: distribute traffic across providers or deployments
- Fallback chains: if provider A fails, automatically retry with provider B
- Cost guardrails: enforce spend limits per team, per key, per model
- Rate limiting: prevent a single service from consuming all your quota
- Observability: log every request, track cost per model and latency per provider
- Caching: return cached responses for identical prompts, so repeated requests skip the provider call and its cost
LiteLLM Proxy Architecture
Application Code
    │
    │  (OpenAI-compatible API)
    ▼
LiteLLM Proxy
    │
┌───┴──────────────────────────────────┐
│  Router: Load Balance + Fallback     │
└───┬──────────┬──────────┬────────────┘
    │          │          │
Anthropic   OpenAI   Azure OpenAI
  Claude    GPT-4o      GPT-4o
Your application code never changes when you add or switch providers. It always talks to http://litellm-proxy/v1/chat/completions with the OpenAI SDK format.
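To make the "one endpoint, one format" point concrete, here is a small sketch that builds (but does not send) an OpenAI-format chat request using only the standard library. The `build_chat_request` helper and the `sk-your-key` placeholder are assumptions for illustration; the URL is the one quoted above.

```python
import json
import urllib.request

# Gateway address as quoted in the text; substitute your own service URL.
GATEWAY_URL = "http://litellm-proxy/v1/chat/completions"


def build_chat_request(model: str, prompt: str, api_key: str,
                       url: str = GATEWAY_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-format chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Switching providers changes only the model name, never the request shape.
claude_req = build_chat_request("claude-sonnet", "Hello", "sk-your-key")
gpt_req = build_chat_request("gpt-4o", "Hello", "sk-your-key")
print(claude_req.full_url == gpt_req.full_url)
```

Sending either request is then a matter of `urllib.request.urlopen(claude_req, timeout=60)`, or the equivalent call in whichever HTTP client or SDK you already use.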
Step 1: LiteLLM Config
# litellm_config.yaml
model_list:
  # Primary: Anthropic Claude
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 50        # Requests per minute limit
      tpm: 100000    # Tokens per minute limit
  # Primary: OpenAI GPT-4o
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 100
      tpm: 200000
  # Fallback: Azure OpenAI
  - model_name: azure-gpt4
    litellm_params:
      model: azure/gpt-4o
      api_base: os.environ/AZURE_OPENAI_ENDPOINT
      api_key: os.environ/AZURE_OPENAI_API_KEY
      api_version: "2024-02-01"
      rpm: 60
  # Fast cheap model for simple tasks
  - model_name: claude-haiku
    litellm_params:
      model: anthropic/claude-haiku-4-5-20251001
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 200
      tpm: 500000
  # Router group: production-llm with fallback chain
  - model_name: production-llm
    litellm_params:
      model: anthropic/claude-sonnet-4-6   # must be a provider model string, not another model_name
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      id: production-primary
router_settings:
  # Fallback chain: if claude-sonnet fails, try gpt-4o, then azure-gpt4
  fallbacks:
    - {"claude-sonnet": ["gpt-4o", "azure-gpt4"]}
    - {"gpt-4o": ["azure-gpt4", "claude-sonnet"]}
    - {"production-llm": ["gpt-4o", "azure-gpt4"]}
  # Retry configuration
  num_retries: 3
  retry_after: 5   # seconds between retries
  # Routing strategy
  routing_strategy: "usage-based-routing-v2"   # Routes to the deployment with the lowest current usage
  # Content policy fallbacks
  content_policy_fallbacks:
    - {"claude-sonnet": ["gpt-4o"]}
litellm_settings:
  # Global settings
  drop_params: true     # Remove unsupported params instead of erroring
  request_timeout: 60   # Timeout per request, in seconds
  # Caching - saves cost on repeated prompts
  cache: true
  cache_params:
    type: "redis"
    host: os.environ/REDIS_HOST
    port: 6379
    ttl: 3600   # Cache for 1 hour
  # Callbacks for observability
  success_callback: ["langfuse"]
  failure_callback: ["langfuse", "slack"]
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL   # PostgreSQL for spend tracking
  # Slack alerting
  alerting: ["slack"]
  alerting_threshold: 300   # Seconds before a slow or hanging request triggers an alert; spend limits are set per key in Step 3
Step 2: Deploy on Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
  namespace: platform
spec:
  replicas: 3
  selector:
    matchLabels:
      app: litellm-proxy
  template:
    metadata:
      labels:
        app: litellm-proxy
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest   # pin a specific tag for reproducible rollouts
          ports:
            - containerPort: 4000
          # Add "--detailed_debug" only while troubleshooting; it logs verbosely
          args: ["--config", "/app/config/litellm_config.yaml", "--port", "4000"]
          env:
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: llm-secrets
                  key: anthropic-api-key
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: llm-secrets
                  key: openai-api-key
            # The Azure variables are referenced in litellm_config.yaml;
            # the secret key names below are assumed to follow the same pattern
            - name: AZURE_OPENAI_ENDPOINT
              valueFrom:
                secretKeyRef:
                  name: llm-secrets
                  key: azure-openai-endpoint
            - name: AZURE_OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: llm-secrets
                  key: azure-openai-api-key
            - name: LITELLM_MASTER_KEY
              valueFrom:
                secretKeyRef:
                  name: llm-secrets
                  key: litellm-master-key
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: llm-secrets
                  key: database-url
            - name: REDIS_HOST
              value: "redis-service"
            # The langfuse and slack callbacks also need their credentials
            # (e.g. LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, SLACK_WEBHOOK_URL) supplied the same way
          volumeMounts:
            - name: config
              mountPath: /app/config
          resources:
            requests:
              memory: "256Mi"
              cpu: "200m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 4000
            initialDelaySeconds: 10
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health/liveliness
              port: 4000
            initialDelaySeconds: 30
            periodSeconds: 30
      volumes:
        - name: config
          configMap:
            name: litellm-config   # must contain the litellm_config.yaml from Step 1
---
apiVersion: v1
kind: Service
metadata:
  name: litellm-proxy
  namespace: platform
spec:
  selector:
    app: litellm-proxy
  ports:
    - port: 4000
      targetPort: 4000
Step 3: Virtual Keys for Team Budget Control
Create separate API keys for each team with individual spend limits:
# Create a key for the search team with a $500 limit per 30 days
curl -X POST http://litellm-proxy:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "search-team",
    "team_id": "search",
    "max_budget": 500,
    "budget_duration": "30d",
    "models": ["claude-sonnet", "claude-haiku"],
    "tpm_limit": 50000,
    "rpm_limit": 100
  }'
The search team gets their own key. If they exceed $500 within the 30-day budget window, requests start failing, but only for their key. Other teams are unaffected.
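The per-key isolation described above can be sketched in a few lines. This is a conceptual model only, not LiteLLM's implementation: the `KeyBudget` class, its fields, and the `other-team` key are hypothetical, and a real gateway would persist spend in its database rather than in memory.

```python
from dataclasses import dataclass


@dataclass
class KeyBudget:
    """Hypothetical per-key budget tracker (not LiteLLM's implementation)."""
    max_budget: float          # dollars allowed per window
    window_seconds: float      # length of the budget window, e.g. 30 days
    window_start: float = 0.0  # timestamp when the current window began
    spend: float = 0.0         # dollars spent in the current window

    def allow(self, now: float) -> bool:
        """Reset the window if it has expired, then check remaining budget."""
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.spend = 0.0
        return self.spend < self.max_budget

    def record(self, cost: float) -> None:
        """Add the cost of a completed request to the current window."""
        self.spend += cost


THIRTY_DAYS = 30 * 24 * 3600
budgets = {
    "search-team": KeyBudget(max_budget=500, window_seconds=THIRTY_DAYS),
    "other-team": KeyBudget(max_budget=500, window_seconds=THIRTY_DAYS),
}

budgets["search-team"].record(500.0)               # search team uses up its budget
print(budgets["search-team"].allow(now=1000.0))    # False: this key is blocked
print(budgets["other-team"].allow(now=1000.0))     # True: other keys are unaffected
```

Because each key carries its own counter and window, one team exhausting its budget cannot starve another, and the block lifts on its own when the window rolls over.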
Step 4: Using the Gateway in Application Code
Your application logic doesn't change. Just point the base URL at LiteLLM and swap the provider key for a virtual key:
from anthropic import Anthropic

# Before: direct to Anthropic
client = Anthropic(api_key="sk-ant-...")

# After: through the LiteLLM gateway (still the Anthropic SDK, which calls the proxy's /v1/messages route)
client = Anthropic(
    api_key="sk-your-litellm-virtual-key",
    base_url="http://litellm-proxy.platform:4000",
)

# Code is identical from here
response = client.messages.create(
    model="claude-sonnet",  # LiteLLM routes this to the right provider
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this log file..."}],
)
Step 5: Observability Dashboard
LiteLLM exposes metrics in Prometheus format at /metrics once the prometheus callback is enabled in litellm_settings:
# Add to your Prometheus scrape config
- job_name: litellm
  static_configs:
    - targets: ['litellm-proxy.platform:4000']
  metrics_path: /metrics
Key metrics to dashboard:
- litellm_llm_api_latency_metric: p50/p95/p99 latency per model
- litellm_requests_metric: request count per model, per team
- litellm_spend_metric: cost per model, per virtual key
- litellm_remaining_requests_metric: how much of the rate limit remains
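For a quick look at these numbers without a full dashboard, the Prometheus text format is simple enough to parse by hand. The sketch below is a simplified parser, not a complete implementation of the exposition format. The sample payload is made up, and the `model` and `team` label names are assumptions, so check the labels your proxy version actually emits.

```python
import re
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]

# name{label="value",...} value   (labels are optional)
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> List[Sample]:
    """Parse Prometheus text exposition lines into (name, labels, value)."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _LINE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def total_by_label(samples: List[Sample], metric: str, label: str) -> Dict[str, float]:
    """Sum a metric's samples, grouped by one label."""
    totals: Dict[str, float] = {}
    for name, labels, value in samples:
        # Counter samples may be exposed with a "_total" suffix.
        if name in (metric, metric + "_total") and label in labels:
            totals[labels[label]] = totals.get(labels[label], 0.0) + value
    return totals


# Illustrative payload only: values and label names are invented.
SAMPLE = '''
# TYPE litellm_spend_metric counter
litellm_spend_metric{model="claude-sonnet",team="search"} 12.5
litellm_spend_metric{model="gpt-4o",team="search"} 3.25
litellm_requests_metric{model="claude-sonnet",team="search"} 840
'''

print(total_by_label(parse_metrics(SAMPLE), "litellm_spend_metric", "model"))
```

In production you would let Prometheus do the scraping and aggregation; a script like this is mainly useful for spot checks against the /metrics endpoint.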
Fallback Testing
Test that your fallback chain actually works:
import asyncio

import httpx


async def test_fallback():
    # mock_testing_fallbacks makes LiteLLM fail the primary call on purpose,
    # so the response should come from a model in the fallback chain
    async with httpx.AsyncClient(timeout=60) as client:
        response = await client.post(
            "http://litellm-proxy:4000/v1/chat/completions",
            headers={"Authorization": "Bearer sk-your-key"},
            json={
                "model": "claude-sonnet",
                "messages": [{"role": "user", "content": "Hello"}],
                "mock_testing_fallbacks": True,  # LiteLLM test parameter
            },
        )
        response.raise_for_status()
        data = response.json()
        # Check which model actually responded
        print(f"Model used: {data.get('model')}")


asyncio.run(test_fallback())
What This Buys You
A single Anthropic API key is:
- A single point of failure
- Invisible (no observability)
- Uncontrolled (any team can spend without limit)
- Locked in (switching providers requires code changes)
An LLM gateway fixes all four. The extra infrastructure cost (one small deployment, one Redis, one PostgreSQL) is trivial compared to the operational control you gain.
For any team running LLMs in production at scale, this is table stakes, not optional.