LLM A/B Testing and Shadow Deployments in Production
How to safely test new LLM models and prompts in production using A/B testing, shadow mode, and traffic splitting — without risking user experience.
Switching from GPT-4 to Claude, or updating a prompt that's been running for months — these feel risky because they are. One bad prompt change can silently degrade your product for hours before you notice. Here's how to test LLM changes safely in production.
The Problem with LLM Deployments
Traditional software can be covered by unit tests with deterministic pass/fail results. LLM outputs resist that kind of testing because they are:
- Non-deterministic (different outputs for the same input)
- Subjective (quality is hard to measure)
- Context-dependent (edge cases appear at scale, not in dev)
You can't test your way to confidence in staging. You need production traffic — but you need it safely.
Three Strategies
1. Shadow Mode (Zero Risk)
Send every request to both the old and new model, but only return the old model's response to the user. Log both responses for offline comparison.
import asyncio
import logging
from datetime import datetime, timezone

import aiohttp
import anthropic
import openai

old_client = openai.AsyncOpenAI()
new_client = anthropic.AsyncAnthropic()

# Keep references to background tasks so they are not garbage-collected mid-flight
_background_tasks: set[asyncio.Task] = set()


async def shadow_request(user_message: str, request_id: str) -> str:
    """Return the old model's response; call the new model off the critical path."""
    old_result = await old_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_message}],
    )
    old_response = old_result.choices[0].message.content

    # Run the shadow call and logging in the background so a slow or failing
    # new model cannot delay or break the user-facing response
    task = asyncio.create_task(run_shadow(request_id, user_message, old_response))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)

    # Always return old model's response
    return old_response


async def run_shadow(request_id: str, prompt: str, old_response: str) -> None:
    """Call the new model and log both responses; never raise to the caller."""
    try:
        new_result = await new_client.messages.create(
            model="claude-opus-4-8",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        await log_shadow(request_id, prompt, old_response, new_result.content[0].text)
    except Exception:
        logging.exception("Shadow request %s failed", request_id)


async def log_shadow(request_id: str, prompt: str, old_response: str, new_response: str) -> None:
    """Store shadow results for offline analysis."""
    payload = {
        "request_id": request_id,
        "prompt": prompt,
        "old_response": old_response,
        "new_response": new_response,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    # Send to your logging service
    async with aiohttp.ClientSession() as session:
        async with session.post("http://llm-logger/shadow", json=payload) as resp:
            resp.raise_for_status()

Shadow mode lets you collect thousands of real request/response pairs with zero user impact, then run your evaluation suite offline. The trade-off is cost: every request is now paid for twice.
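Before spending judge or reviewer time on every logged pair, a cheap model-free pass can show how far the new model diverges and which pairs deserve a closer look. The following is a minimal sketch, assuming the shadow logs have been loaded as a list of dicts with the `request_id`, `old_response`, and `new_response` fields logged above. The function name `summarize_shadow_logs` and the 0.5 review threshold are illustrative assumptions, not part of any library.

```python
from difflib import SequenceMatcher
from statistics import mean


def summarize_shadow_logs(records: list[dict], review_below: float = 0.5) -> dict:
    """Summarize shadow pairs with text similarity and length ratio.

    review_below is an assumed cut-off: pairs whose similarity falls under it
    are flagged for human or LLM-judge review.
    """
    if not records:
        return {"pairs": 0, "mean_similarity": None,
                "mean_length_ratio": None, "needs_review": []}

    # Character-level similarity in [0, 1]; 1.0 means identical text
    similarities = [
        SequenceMatcher(None, r["old_response"], r["new_response"]).ratio()
        for r in records
    ]
    # > 1.0 means the new model is more verbose than the old one
    length_ratios = [
        len(r["new_response"]) / max(len(r["old_response"]), 1)
        for r in records
    ]
    needs_review = [
        r["request_id"]
        for r, similarity in zip(records, similarities)
        if similarity < review_below
    ]
    return {
        "pairs": len(records),
        "mean_similarity": mean(similarities),
        "mean_length_ratio": mean(length_ratios),
        "needs_review": needs_review,
    }


# Hypothetical records, shaped like the shadow log payload
sample = [
    {"request_id": "a", "old_response": "hello world", "new_response": "hello world"},
    {"request_id": "b", "old_response": "yes",
     "new_response": "The deployment failed because of a quota error."},
]
print(summarize_shadow_logs(sample)["needs_review"])
```

Text similarity says nothing about which response is better; it only helps prioritize. A low score marks a pair where the two models disagree enough that quality scoring is worth the cost.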
2. A/B Testing (Controlled Risk)
Split traffic between old and new versions. Route users consistently (same user always gets same version) to avoid confusion.
import hashlib
import time
from enum import Enum


class ModelVariant(Enum):
    CONTROL = "gpt-4o"
    TREATMENT = "claude-opus-4-8"


def get_variant(user_id: str, experiment_id: str, treatment_percentage: int = 10) -> ModelVariant:
    """Deterministically assign user to variant."""
    # Hashing experiment_id with user_id reshuffles users for each experiment
    hash_input = f"{experiment_id}:{user_id}"
    hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    bucket = hash_value % 100
    if bucket < treatment_percentage:
        return ModelVariant.TREATMENT
    return ModelVariant.CONTROL


# call_openai, call_claude, and log_event are application helpers not shown here.
# They are assumed to return a normalized response exposing .content and
# .usage.input_tokens / .usage.output_tokens (the raw OpenAI SDK names these
# fields prompt_tokens / completion_tokens).
async def handle_request(user_id: str, message: str):
    variant = get_variant(user_id, experiment_id="exp-claude-migration-v1", treatment_percentage=10)

    # Measure latency here; neither SDK response carries a latency attribute
    started = time.perf_counter()
    if variant == ModelVariant.CONTROL:
        response = await call_openai(message)
    else:
        response = await call_claude(message)
    latency_ms = (time.perf_counter() - started) * 1000

    # Log variant with every request for analysis
    await log_event({
        "user_id": user_id,
        "variant": variant.value,
        "prompt_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "latency_ms": latency_ms,
        "response": response.content,
    })
    return response

3. Canary Deployment (Progressive Rollout)
Gradually increase the percentage sent to the new model, watching metrics at each step.
from datetime import datetime, timezone

ROLLOUT_STAGES = [
    {"percentage": 1, "duration_hours": 2, "auto_advance": True},
    {"percentage": 5, "duration_hours": 4, "auto_advance": True},
    {"percentage": 10, "duration_hours": 8, "auto_advance": False},  # manual gate
    {"percentage": 25, "duration_hours": 24, "auto_advance": False},
    {"percentage": 50, "duration_hours": 48, "auto_advance": False},
    {"percentage": 100, "duration_hours": None, "auto_advance": False},
]


class CanaryController:
    def __init__(self):
        self.current_stage = 0
        self.stage_start = datetime.now(timezone.utc)

    def get_treatment_percentage(self) -> int:
        return ROLLOUT_STAGES[self.current_stage]["percentage"]

    def advance_if_ready(self, metrics: dict) -> bool:
        """Roll back on a guardrail breach; otherwise advance once the stage has run its course."""
        # Check guardrails first so a bad stage rolls back immediately,
        # including stages that sit behind a manual gate
        if metrics["error_rate"] > 0.02:  # > 2% errors
            self.rollback()
            return False
        if metrics["p99_latency_ms"] > 5000:  # > 5s p99
            self.rollback()
            return False

        stage = ROLLOUT_STAGES[self.current_stage]
        if not stage["auto_advance"]:
            return False  # waits for a manual decision
        elapsed = (datetime.now(timezone.utc) - self.stage_start).total_seconds() / 3600
        if elapsed < stage["duration_hours"]:
            return False

        self.current_stage += 1
        self.stage_start = datetime.now(timezone.utc)
        return True

    def rollback(self):
        print(f"Rolling back from stage {self.current_stage}")
        self.current_stage = 0
        self.stage_start = datetime.now(timezone.utc)

What Metrics to Track
For each variant, track:
metrics_to_compare = {
    # Reliability
    "error_rate": "% of requests that failed",
    "timeout_rate": "% that exceeded timeout",
    # Performance
    "p50_latency_ms": "median response time",
    "p99_latency_ms": "tail latency",
    "tokens_per_second": "throughput",
    # Quality (requires human evaluation or LLM-as-judge)
    "relevance_score": "0-1 relevance to prompt",
    "faithfulness_score": "does response stick to facts?",
    "user_rating": "explicit thumbs up/down",
    # Cost
    "cost_per_request_usd": "model + infrastructure cost",
    "tokens_used": "input + output tokens",
}

LLM-as-Judge for Quality Scoring
When you have shadow logs, use Claude to compare responses:
import json


async def judge_response_pair(prompt: str, old_response: str, new_response: str) -> dict:
    judge_prompt = f"""Compare these two AI responses to the same prompt.
Prompt: {prompt}
Response A: {old_response}
Response B: {new_response}
Rate each response on:
- Accuracy (0-10)
- Clarity (0-10)
- Completeness (0-10)
- Overall preference (A or B)
Respond in JSON: {{"a_scores": {{"accuracy": 0, "clarity": 0, "completeness": 0}}, "b_scores": {{...}}, "winner": "A|B|tie", "reasoning": "..."}}"""
    # new_client is the anthropic.AsyncAnthropic() client created in the shadow mode example
    result = await new_client.messages.create(
        model="claude-haiku-4-5-20251001",  # cheap model for judging
        max_tokens=500,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    # json.loads raises ValueError if the judge wraps its JSON in prose or code fences
    return json.loads(result.content[0].text)

Kubernetes Traffic Splitting with Istio
For service-level A/B testing:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: llm-service
spec:
  hosts:
    - llm-service
  http:
    - match:
        - headers:
            x-experiment: { exact: "claude-v1" }
      route:
        - destination:
            host: llm-service-claude
    - route:
        - destination:
            host: llm-service-gpt4o
          weight: 90
        - destination:
            host: llm-service-claude
          weight: 10

Decision Framework
After collecting data:
| Metric | Control (GPT-4o) | Treatment (Claude) | Decision |
|---|---|---|---|
| Error rate | 0.5% | 0.3% | ✅ Treatment better |
| P99 latency | 3.2s | 2.8s | ✅ Treatment better |
| LLM judge score | 7.2/10 | 7.8/10 | ✅ Treatment better |
| Cost/request | $0.045 | $0.031 | ✅ Treatment better |
When treatment wins on all dimensions: full rollout. When mixed: weigh by what matters most for your use case.
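A lower number in one column is only meaningful if the sample behind it is large enough. As one way to check, here is a minimal sketch of a two-proportion z-test on error rates using only the standard library. The request counts are hypothetical (9,000 control and 1,000 treatment requests, matching a 90/10 split), chosen so that the failure counts reproduce the 0.5% and 0.3% error rates in the table; the function name is an illustrative assumption.

```python
import math


def two_proportion_z_test(fail_a: int, n_a: int, fail_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two error rates."""
    rate_a, rate_b = fail_a / n_a, fail_b / n_b
    # Pooled rate under the null hypothesis that both variants fail equally often
    pooled = (fail_a + fail_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # no failures (or only failures) in both groups
    z = (rate_a - rate_b) / std_err
    # Two-sided tail probability of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 45/9000 = 0.5% (control), 3/1000 = 0.3% (treatment)
z, p_value = two_proportion_z_test(fail_a=45, n_a=9000, fail_b=3, n_b=1000)
print(f"z={z:.2f}, p={p_value:.2f}")
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```

With these assumed counts the p-value comes out well above 0.05, so a 0.5% versus 0.3% gap at this traffic volume would not by itself justify a rollout. Holding the split longer or raising the treatment percentage gives the comparison more power.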
Tools used: Anthropic SDK, OpenAI SDK, Istio, Prometheus, Grafana
LiteLLM is also excellent for abstracting multi-provider routing.