
Deploy LiteLLM Gateway on Kubernetes — Route OpenAI, Claude, Ollama

LiteLLM gives you one API endpoint to route between OpenAI, Anthropic Claude, Ollama, and 100+ other LLMs. Here's how to deploy it on Kubernetes with load balancing and cost tracking.

DevOpsBoys · May 30, 2026 · 3 min read

You're using OpenAI in production. You want to add Claude as a fallback and Ollama for cost-sensitive requests. But each has a different API: different endpoints, different request formats, different response schemas.

LiteLLM solves this: one OpenAI-compatible API that routes to any LLM backend.
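
To make the mismatch concrete, here is a rough sketch of what one chat turn looks like against each provider's native API. Nothing is sent over the network; `build_request` is a hypothetical helper that only assembles the URL, headers, and body, and the placeholder keys are not real values.

```python
def build_request(provider: str, prompt: str) -> dict:
    """Return the URL, headers, and JSON body each provider expects for one chat turn."""
    messages = [{"role": "user", "content": prompt}]
    if provider == "openai":
        # Reply text comes back under choices[0].message.content
        return {
            "url": "https://api.openai.com/v1/chat/completions",
            "headers": {"Authorization": "Bearer <OPENAI_API_KEY>"},
            "json": {"model": "gpt-4o", "messages": messages},
        }
    if provider == "anthropic":
        # Different auth header, a required max_tokens, reply under content[0].text
        return {
            "url": "https://api.anthropic.com/v1/messages",
            "headers": {
                "x-api-key": "<ANTHROPIC_API_KEY>",
                "anthropic-version": "2023-06-01",
            },
            "json": {"model": "claude-sonnet-4-6", "max_tokens": 1024, "messages": messages},
        }
    if provider == "ollama":
        # No auth by default, reply under message.content
        return {
            "url": "http://ollama-service:11434/api/chat",
            "headers": {},
            "json": {"model": "mistral", "messages": messages, "stream": False},
        }
    raise ValueError(f"unknown provider: {provider}")


for name in ("openai", "anthropic", "ollama"):
    request = build_request(name, "Explain Kubernetes")
    print(name, request["url"], sorted(request["json"]))
```

Three providers means three code paths in every app that talks to them. The gateway collapses these into the first shape only.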


What LiteLLM Does

Your App (OpenAI-compatible API calls)
    ↓
LiteLLM Gateway (port 4000)
    ├── /chat/completions → Claude Sonnet (for complex tasks)
    ├── /chat/completions → GPT-4o (fallback)
    └── /chat/completions → Ollama Mistral (for simple tasks)

Your app never changes. You swap models at the gateway level.
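
The fallback branch in the diagram can be pictured with a toy model. This is only a conceptual sketch, not LiteLLM's internal code: `call_with_fallback`, `flaky_claude`, and `gpt4o` are made-up names, and the backends are plain functions standing in for real API calls.

```python
from typing import Callable


def call_with_fallback(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    prompt: str,
    num_retries: int = 3,
) -> str:
    """Try the primary backend (one attempt plus retries), then the fallback once."""
    for _attempt in range(1 + num_retries):
        try:
            return primary(prompt)
        except RuntimeError:
            continue  # simulated transient failure: try again
    return fallback(prompt)


def flaky_claude(prompt: str) -> str:
    # Stand-in for a backend that is currently down
    raise RuntimeError("simulated outage")


def gpt4o(prompt: str) -> str:
    # Stand-in for a healthy backend
    return f"gpt-4o answered: {prompt}"


print(call_with_fallback(flaky_claude, gpt4o, "Explain Kubernetes"))
```

The caller passes one prompt and gets one answer. Which backend produced it is decided behind the gateway, which is why application code can stay the same.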


Kubernetes Deployment

Step 1: Create Config

yaml
# litellm-config.yaml (becomes a ConfigMap)
model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
      
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      
  - model_name: ollama-mistral
    litellm_params:
      model: ollama/mistral
      api_base: http://ollama-service:11434
 
  # Load-balanced pool: entries that share a model_name are routed as one group
  - model_name: production-llm
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      id: claude-primary
 
  - model_name: production-llm
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      id: gpt4o-fallback
 
router_settings:
  routing_strategy: least-busy
  num_retries: 3
  timeout: 30
  fallbacks:
    - {"claude-sonnet": ["gpt-4o"]}  # fallback if Claude fails
 
litellm_settings:
  success_callback: ["langfuse"]      # observability; needs LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY in the pod env
  drop_params: true                   # ignore unsupported params
  
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL  # PostgreSQL for logging
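
Two things in this config are easy to get wrong: entries that share a `model_name` form one pool, and every name used in `fallbacks` has to exist in `model_list`. A small sanity check, sketched here against a hand-copied Python dict of the config above (API keys omitted), shows both. The helper names are illustrative and not part of LiteLLM.

```python
config = {
    "model_list": [
        {"model_name": "claude-sonnet", "litellm_params": {"model": "anthropic/claude-sonnet-4-6"}},
        {"model_name": "gpt-4o", "litellm_params": {"model": "openai/gpt-4o"}},
        {"model_name": "ollama-mistral", "litellm_params": {"model": "ollama/mistral"}},
        {"model_name": "production-llm", "litellm_params": {"model": "anthropic/claude-sonnet-4-6"}},
        {"model_name": "production-llm", "litellm_params": {"model": "openai/gpt-4o"}},
    ],
    "router_settings": {"fallbacks": [{"claude-sonnet": ["gpt-4o"]}]},
}


def group_deployments(cfg: dict) -> dict:
    """Map each model_name alias to the list of backend models behind it."""
    groups = {}
    for entry in cfg["model_list"]:
        groups.setdefault(entry["model_name"], []).append(entry["litellm_params"]["model"])
    return groups


def undefined_fallbacks(cfg: dict) -> list:
    """Return every alias used in a fallback rule that model_list does not define."""
    known = set(group_deployments(cfg))
    missing = []
    for rule in cfg.get("router_settings", {}).get("fallbacks", []):
        for source, targets in rule.items():
            missing += [name for name in [source, *targets] if name not in known]
    return missing


print(group_deployments(config)["production-llm"])  # two backends behind one alias
print(undefined_fallbacks(config))                  # empty list means the rules are consistent
```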

Step 2: Kubernetes Manifests

yaml
# litellm-deployment.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: litellm-config
  namespace: llm-gateway
data:
  config.yaml: |
    # paste litellm-config.yaml content here
---
apiVersion: v1
kind: Secret
metadata:
  name: litellm-secrets
  namespace: llm-gateway
type: Opaque
stringData:
  ANTHROPIC_API_KEY: "sk-ant-..."
  OPENAI_API_KEY: "sk-..."
  LITELLM_MASTER_KEY: "sk-litellm-master-key-change-me"
  DATABASE_URL: "postgresql://user:pass@postgres:5432/litellm"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm
  namespace: llm-gateway
spec:
  replicas: 2
  selector:
    matchLabels:
      app: litellm
  template:
    metadata:
      labels:
        app: litellm
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest  # mutable tag; pin a specific release in production
          ports:
            - containerPort: 4000
          args:
            - "--config"
            - "/app/config.yaml"
            - "--port"
            - "4000"
            - "--num_workers"
            - "4"
          envFrom:
            - secretRef:
                name: litellm-secrets
          volumeMounts:
            - name: config
              mountPath: /app/config.yaml
              subPath: config.yaml
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          livenessProbe:
            httpGet:
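              # /health needs the master key and calls every model; use the unauthenticated liveness route
              path: /health/liveliness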
              port: 4000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 4000
            initialDelaySeconds: 20
      volumes:
        - name: config
          configMap:
            name: litellm-config
---
apiVersion: v1
kind: Service
metadata:
  name: litellm
  namespace: llm-gateway
spec:
  selector:
    app: litellm
  ports:
    - port: 4000
      targetPort: 4000
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: litellm
  namespace: llm-gateway
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts: [llm.yourcompany.com]
      secretName: litellm-tls
  rules:
    - host: llm.yourcompany.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: litellm
                port:
                  number: 4000
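
Pasting the config into the ConfigMap by hand is where indentation mistakes creep in. As a sketch of what "paste here" means, the hypothetical helper below indents the file's text under `config.yaml: |` and fills in the same name and namespace used above.

```python
import textwrap


def render_configmap(
    config_text: str,
    name: str = "litellm-config",
    namespace: str = "llm-gateway",
) -> str:
    """Embed a config file's text under data.config.yaml as a YAML block scalar."""
    # Every non-blank line must sit deeper than the "config.yaml" key (4 spaces here)
    body = textwrap.indent(config_text.rstrip("\n"), " " * 4)
    return (
        "apiVersion: v1\n"
        "kind: ConfigMap\n"
        "metadata:\n"
        f"  name: {name}\n"
        f"  namespace: {namespace}\n"
        "data:\n"
        "  config.yaml: |\n"
        f"{body}\n"
    )


sample = "model_list:\n  - model_name: gpt-4o\n"
print(render_configmap(sample))
```

If you would rather not template it yourself, `kubectl create configmap litellm-config --from-file=config.yaml=litellm-config.yaml -n llm-gateway --dry-run=client -o yaml` should produce an equivalent manifest.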

Step 3: Deploy

bash
kubectl create namespace llm-gateway
kubectl apply -f litellm-deployment.yaml
 
# Verify
kubectl get pods -n llm-gateway
kubectl logs -n llm-gateway deployment/litellm
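
Pods showing `Running` does not prove the gateway is answering. A quick smoke test, sketched below, polls the same `/health/readiness` path the readiness probe uses. It assumes you have a port-forward open (`kubectl port-forward -n llm-gateway svc/litellm 4000:4000`). The `fetch` parameter is a hypothetical hook that exists only so the function can be tested without a network.

```python
import urllib.error
import urllib.request


def is_ready(base_url: str, fetch=urllib.request.urlopen, timeout: float = 5.0) -> bool:
    """Return True when GET <base_url>/health/readiness answers with HTTP 200."""
    url = base_url.rstrip("/") + "/health/readiness"
    try:
        with fetch(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx HTTP status
        return False


if __name__ == "__main__":
    # Assumes the port-forward above is running on localhost:4000
    print("ready:", is_ready("http://localhost:4000"))
```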

Using the Gateway

Your apps use the standard OpenAI SDK — just change the base URL:

python
from openai import OpenAI
 
# Point to LiteLLM gateway
client = OpenAI(
    api_key="sk-litellm-master-key-change-me",  # fine for a first test; give real apps a per-team virtual key
    base_url="http://litellm.llm-gateway.svc.cluster.local:4000/v1"
)
 
# Use Claude via OpenAI SDK
response = client.chat.completions.create(
    model="claude-sonnet",   # model alias from config
    messages=[{"role": "user", "content": "Explain Kubernetes"}]
)
 
# Use the load-balanced pool
response = client.chat.completions.create(
    model="production-llm",  # routes based on least-busy strategy
    messages=[{"role": "user", "content": "Write a Dockerfile"}]
)
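
For intuition about what "least-busy" means for the `production-llm` pool, here is a toy simulation. It is a conceptual sketch only, not LiteLLM's router code: it tracks how many requests are in flight per deployment id and always picks the smallest count.

```python
def pick_least_busy(in_flight: dict) -> str:
    """Return the deployment id with the fewest requests currently in flight."""
    return min(in_flight, key=in_flight.get)


# Deployment ids taken from model_info in the config; the counts are simulated
in_flight = {"claude-primary": 0, "gpt4o-fallback": 0}
assignments = []

for request_no in range(4):
    target = pick_least_busy(in_flight)
    in_flight[target] += 1
    assignments.append(target)
    if request_no == 1:
        in_flight["claude-primary"] -= 1  # simulate one Claude request finishing

print(assignments)
```

Ties go to whichever deployment is listed first in this sketch. A real router may break ties differently.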

Virtual Keys Per Team

LiteLLM supports virtual API keys with rate limits and budgets per team:

bash
# Create a key for the data team with a $100 budget that resets every 30 days
curl -X POST https://llm.yourcompany.com/key/generate \
  -H "Authorization: Bearer sk-litellm-master-key-change-me" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "data-team",
    "max_budget": 100,
    "budget_duration": "30d",
    "tpm_limit": 100000,
    "rpm_limit": 1000,
    "models": ["claude-sonnet", "ollama-mistral"]
  }'
 
# The JSON response includes the new key: "key": "sk-..."

Each team gets their own key. LiteLLM tracks usage and enforces limits.
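
If you provision keys from a script rather than by hand, the same call can be made from Python. The sketch below uses only the standard library and stops short of sending anything. `build_key_request` is a hypothetical helper, and `budget_duration` is included on the assumption that you want the budget to reset periodically rather than act as a lifetime cap.

```python
import json
import urllib.request


def build_key_request(base_url: str, master_key: str, payload: dict) -> urllib.request.Request:
    """Assemble the POST /key/generate request without sending it."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/key/generate",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
    )


payload = {
    "key_alias": "data-team",
    "max_budget": 100,
    "budget_duration": "30d",  # assumption: reset window; drop it for a lifetime budget
    "tpm_limit": 100000,
    "rpm_limit": 1000,
    "models": ["claude-sonnet", "ollama-mistral"],
}
request = build_key_request(
    "https://llm.yourcompany.com", "sk-litellm-master-key-change-me", payload
)
print(request.get_method(), request.full_url)

# To actually send it (needs network access to the gateway):
# with urllib.request.urlopen(request) as response:
#     virtual_key = json.load(response)["key"]
```

The team then passes the returned key as `api_key` in the OpenAI client instead of the master key.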


Cost Dashboard

LiteLLM serves a built-in UI on the same port as the API (4000), under /ui. Through the Ingress above, that is:

https://llm.yourcompany.com/ui

Shows: requests by model, cost per team, error rates, latency.


LiteLLM is a practical piece of LLM infrastructure for teams running multiple models: one deployment, one API, one place to manage costs and routing.

Store your API keys securely with AWS Secrets Manager and feed them into LiteLLM's environment-variable config, so no secret values live in your Kubernetes manifests.
