Deploy LiteLLM Gateway on Kubernetes — Route OpenAI, Claude, Ollama
LiteLLM gives you one API endpoint to route between OpenAI, Anthropic Claude, Ollama, and 100+ other LLMs. Here's how to deploy it on Kubernetes with load balancing and cost tracking.
You're using OpenAI in production. You want to add Claude as a fallback. And Ollama for cost-sensitive requests. But each has a different API — different endpoints, different request formats, different response schemas.
LiteLLM solves this: one OpenAI-compatible API that routes to any LLM backend.
What LiteLLM Does
Your App (OpenAI-compatible API calls)
↓
LiteLLM Gateway (port 4000)
├── /chat/completions → Claude Sonnet (for complex tasks)
├── /chat/completions → GPT-4o (fallback)
└── /chat/completions → Ollama Mistral (for simple tasks)
Your app never changes. You swap models at the gateway level.
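To make the "fallback" arrow in the diagram concrete, here is a minimal sketch of what falling back at the gateway means conceptually. It is not LiteLLM's implementation: `call_with_fallback`, `BackendError`, and the two stand-in backends are hypothetical names used only for illustration.

```python
from typing import Callable, Sequence


class BackendError(Exception):
    """Raised by a backend that cannot serve the request (hypothetical)."""


def call_with_fallback(
    prompt: str,
    backends: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, backend) pair in order.

    Returns (name, reply) from the first backend that succeeds and
    raises BackendError if every backend fails.
    """
    errors = []
    for name, backend in backends:
        try:
            return name, backend(prompt)
        except BackendError as exc:
            errors.append(f"{name}: {exc}")
    raise BackendError("all backends failed: " + "; ".join(errors))


# Stand-in backends for illustration; real ones would make HTTP calls.
def flaky_claude(prompt: str) -> str:
    raise BackendError("rate limited")


def healthy_gpt4o(prompt: str) -> str:
    return f"gpt-4o reply to: {prompt}"


used, reply = call_with_fallback(
    "Explain Kubernetes",
    [("claude-sonnet", flaky_claude), ("gpt-4o", healthy_gpt4o)],
)
print(used)  # gpt-4o
```

The caller only sees a reply. Which backend produced it is decided by the ordered list, which is the part that lives in the gateway instead of in your app.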
Kubernetes Deployment
Step 1: Create Config
# litellm-config.yaml (becomes a ConfigMap)
model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: ollama-mistral
    litellm_params:
      model: ollama/mistral
      api_base: http://ollama-service:11434
  # Router: entries sharing a model_name are load balanced as one pool
  - model_name: production-llm
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      id: claude-primary
  - model_name: production-llm
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      id: gpt4o-fallback
router_settings:
  routing_strategy: least-busy
  num_retries: 3
  timeout: 30
  fallbacks:
    - {"claude-sonnet": ["gpt-4o"]} # fallback if Claude fails
litellm_settings:
  success_callback: ["langfuse"] # observability; needs LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY in the environment
  drop_params: true # ignore unsupported params
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL # PostgreSQL for logging
Step 2: Kubernetes Manifests
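Before pasting the config into the ConfigMap, one cheap sanity check is that every name used under `fallbacks` matches a `model_name` in `model_list`, since a typo there is easy to miss. The sketch below is a hypothetical helper (`undefined_fallback_targets` is not part of LiteLLM) and assumes the config has already been parsed into a dict; loading the YAML file itself is left out to keep the example dependency-free.

```python
def undefined_fallback_targets(config: dict) -> list[str]:
    """Return names used in router_settings.fallbacks that are not
    defined as a model_name in model_list (in order of first appearance)."""
    defined = {entry["model_name"] for entry in config.get("model_list", [])}
    missing: list[str] = []
    for rule in config.get("router_settings", {}).get("fallbacks", []):
        for source, targets in rule.items():
            for name in [source, *targets]:
                if name not in defined and name not in missing:
                    missing.append(name)
    return missing


# The relevant parts of the Step 1 config, written as a Python dict.
config = {
    "model_list": [
        {"model_name": "claude-sonnet"},
        {"model_name": "gpt-4o"},
        {"model_name": "ollama-mistral"},
        {"model_name": "production-llm"},
        {"model_name": "production-llm"},
    ],
    "router_settings": {"fallbacks": [{"claude-sonnet": ["gpt-4o"]}]},
}

print(undefined_fallback_targets(config))  # []
```

An empty list means every fallback source and target is defined. A misspelled target such as `gpt4o` would be returned by name.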
# litellm-deployment.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: litellm-config
  namespace: llm-gateway
data:
  config.yaml: |
    # paste litellm-config.yaml content here
---
apiVersion: v1
kind: Secret
metadata:
  name: litellm-secrets
  namespace: llm-gateway
type: Opaque
stringData:
  ANTHROPIC_API_KEY: "sk-ant-..."
  OPENAI_API_KEY: "sk-..."
  LITELLM_MASTER_KEY: "sk-litellm-master-key-change-me"
  DATABASE_URL: "postgresql://user:pass@postgres:5432/litellm"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm
  namespace: llm-gateway
spec:
  replicas: 2
  selector:
    matchLabels:
      app: litellm
  template:
    metadata:
      labels:
        app: litellm
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest
          ports:
            - containerPort: 4000
          args:
            - "--config"
            - "/app/config.yaml"
            - "--port"
            - "4000"
            - "--num_workers"
            - "4"
          envFrom:
            - secretRef:
                name: litellm-secrets
          volumeMounts:
            - name: config
              mountPath: /app/config.yaml
              subPath: config.yaml
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          livenessProbe:
            httpGet:
              # Unauthenticated liveness endpoint. Plain /health requires the
              # master key and calls every configured model, so it is a poor probe.
              path: /health/liveliness
              port: 4000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 4000
            initialDelaySeconds: 20
      volumes:
        - name: config
          configMap:
            name: litellm-config
---
apiVersion: v1
kind: Service
metadata:
  name: litellm
  namespace: llm-gateway
spec:
  selector:
    app: litellm
  ports:
    - port: 4000
      targetPort: 4000
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: litellm
  namespace: llm-gateway
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts: [llm.yourcompany.com]
      secretName: litellm-tls
  rules:
    - host: llm.yourcompany.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: litellm
                port:
                  number: 4000
Step 3: Deploy
kubectl create namespace llm-gateway
kubectl apply -f litellm-deployment.yaml
# Verify
kubectl get pods -n llm-gateway
kubectl logs -n llm-gateway deployment/litellm
Using the Gateway
Your apps use the standard OpenAI SDK — just change the base URL:
from openai import OpenAI
# Point to LiteLLM gateway
client = OpenAI(
    api_key="sk-litellm-master-key-change-me",
    base_url="http://litellm.llm-gateway.svc.cluster.local:4000/v1",
)
# Use Claude via OpenAI SDK
response = client.chat.completions.create(
    model="claude-sonnet",  # model alias from config
    messages=[{"role": "user", "content": "Explain Kubernetes"}],
)
# Use the load-balanced pool
response = client.chat.completions.create(
    model="production-llm",  # routes based on least-busy strategy
    messages=[{"role": "user", "content": "Write a Dockerfile"}],
)
Virtual Keys Per Team
LiteLLM supports virtual API keys with rate limits and budgets per team:
# Create a key for the data team with a $100 budget that resets every 30 days
curl -X POST https://llm.yourcompany.com/key/generate \
  -H "Authorization: Bearer sk-litellm-master-key-change-me" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "data-team",
    "max_budget": 100,
    "budget_duration": "30d",
    "tpm_limit": 100000,
    "rpm_limit": 1000,
    "models": ["claude-sonnet", "ollama-mistral"]
  }'
# Returns: "key": "sk-virtual-key-data-team-xxx"
Each team gets its own key. LiteLLM tracks usage per key and enforces the budget and rate limits.
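If several teams need keys, scripting the call can be easier than repeating curl. Here is a minimal sketch using only the standard library, assuming the same `/key/generate` endpoint and JSON fields as the curl example. `build_key_request` is a hypothetical helper, and it only builds the request, so it can be inspected without a running gateway.

```python
import json
import urllib.request


def build_key_request(
    base_url: str,
    master_key: str,
    alias: str,
    max_budget: float,
    models: list[str],
) -> urllib.request.Request:
    """Build (but do not send) a POST request to the gateway's /key/generate."""
    payload = {"key_alias": alias, "max_budget": max_budget, "models": models}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/key/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_key_request(
    base_url="https://llm.yourcompany.com",
    master_key="sk-litellm-master-key-change-me",
    alias="data-team",
    max_budget=100,
    models=["claude-sonnet", "ollama-mistral"],
)
print(req.get_method(), req.full_url)
# POST https://llm.yourcompany.com/key/generate
```

Sending it is a matter of passing `req` to `urllib.request.urlopen` and reading the `key` field from the JSON response. In practice the master key would come from an environment variable, not a literal in the script.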
Cost Dashboard
LiteLLM serves a built-in UI under /ui on the same port as the API (4000). Through the Ingress from Step 2, that is:
https://llm.yourcompany.com/ui
It shows requests by model, cost per team, error rates, and latency.
LiteLLM is the most practical LLM infrastructure tool for teams running multiple models. One deployment, one API, one place to manage costs and routing.
Store your API keys securely with AWS Secrets Manager — integrate with LiteLLM's environment variable config for zero-secret Kubernetes deployments.