
LLM A/B Testing and Shadow Deployments in Production

How to safely test new LLM models and prompts in production using A/B testing, shadow mode, and traffic splitting — without risking user experience.

DevOpsBoys · 5 min read

Switching from GPT-4 to Claude, or updating a prompt that's been running for months — these feel risky because they are. One bad prompt change can silently degrade your product for hours before you notice. Here's how to test LLM changes safely in production.

The Problem with LLM Deployments

Traditional software can be pinned down with unit tests. LLM outputs are harder to test because they are:

  • Non-deterministic (different outputs for the same input)
  • Subjective (quality is hard to measure)
  • Context-dependent (edge cases appear at scale, not in dev)

You can't test your way to confidence in staging. You need production traffic — but you need it safely.

Three Strategies

1. Shadow Mode (Zero Risk)

Send every request to both the old and new model, but only return the old model's response to the user. Log both responses for offline comparison.

python
import asyncio
from datetime import datetime, timezone

import aiohttp
import anthropic
import openai

old_client = openai.AsyncOpenAI()
new_client = anthropic.AsyncAnthropic()

# Keep references to background tasks so they aren't garbage-collected mid-flight
_background_tasks: set = set()


async def shadow_request(user_message: str, request_id: str) -> str:
    """Send to both models, return only old model's response."""

    # Start the new model in the background: the user never waits on it,
    # and a failure there can't break the user-facing response
    new_task = asyncio.create_task(new_client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        messages=[{"role": "user", "content": user_message}],
    ))

    old_result = await old_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_message}],
    )
    old_response = old_result.choices[0].message.content

    # Log both for comparison (async, don't block the response)
    log_task = asyncio.create_task(log_shadow(
        request_id=request_id,
        prompt=user_message,
        old_response=old_response,
        new_task=new_task,
    ))
    _background_tasks.add(log_task)
    log_task.add_done_callback(_background_tasks.discard)

    # Always return old model's response
    return old_response


async def log_shadow(request_id: str, prompt: str, old_response: str, new_task: asyncio.Task):
    """Wait for the shadow call, then store both results for offline analysis."""
    try:
        new_result = await new_task
        new_response, new_error = new_result.content[0].text, None
    except Exception as exc:  # record shadow failures instead of raising
        new_response, new_error = None, repr(exc)

    payload = {
        "request_id": request_id,
        "prompt": prompt,
        "old_response": old_response,
        "new_response": new_response,
        "new_error": new_error,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    # Send to your logging service
    async with aiohttp.ClientSession() as session:
        await session.post("http://llm-logger/shadow", json=payload)

Shadow mode lets you collect thousands of real request/response pairs without changing what users see. The trade-off is cost: every shadowed request is billed by both providers. You can then run your evaluation suite offline.
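Before running a full evaluation suite, a quick pass over the shadow logs can show how often the two models diverge. The following is a minimal sketch, assuming each stored record has `old_response` and `new_response` keys and that `new_response` is `None` when the shadow call failed. The function name and the chosen statistics are illustrative, not part of any logging service.

```python
from statistics import mean


def summarize_shadow_logs(records: list[dict]) -> dict:
    """Compute coarse comparison stats over shadow records.

    Assumes "new_response" is None when the shadow call failed.
    """
    total = len(records)
    if total == 0:
        return {"total": 0}

    # Only pairs where the new model actually answered can be compared
    completed = [r for r in records if r.get("new_response") is not None]
    identical = sum(
        1 for r in completed
        if r["old_response"].strip() == r["new_response"].strip()
    )
    # Length ratio > 1 means the new model is more verbose on average
    length_ratios = [
        len(r["new_response"]) / len(r["old_response"])
        for r in completed
        if r["old_response"]
    ]
    return {
        "total": total,
        "shadow_failure_rate": (total - len(completed)) / total,
        "identical_rate": identical / len(completed) if completed else 0.0,
        "mean_length_ratio": mean(length_ratios) if length_ratios else None,
    }


# Hand-written sample records, for illustration only
sample = [
    {"old_response": "Paris", "new_response": "Paris"},
    {"old_response": "Use kubectl get pods", "new_response": "Run kubectl get pods -A to list pods"},
    {"old_response": "42", "new_response": None},
]
summary = summarize_shadow_logs(sample)
```

These numbers say nothing about quality on their own, but a high failure rate or a large shift in response length is a cheap early signal that something needs a closer look.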

2. A/B Testing (Controlled Risk)

Split traffic between old and new versions. Route users consistently (same user always gets same version) to avoid confusion.

python
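import hashlib
import time
from enum import Enum

import anthropic
import openai

old_client = openai.AsyncOpenAI()
new_client = anthropic.AsyncAnthropic()


class ModelVariant(Enum):
    CONTROL = "gpt-4o"
    TREATMENT = "claude-opus-4-8"


def get_variant(user_id: str, experiment_id: str, treatment_percentage: int = 10) -> ModelVariant:
    """Deterministically assign user to variant."""
    hash_input = f"{experiment_id}:{user_id}"
    hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    bucket = hash_value % 100

    if bucket < treatment_percentage:
        return ModelVariant.TREATMENT
    return ModelVariant.CONTROL


async def call_openai(message: str) -> dict:
    """Call the control model and normalize the fields we log."""
    result = await old_client.chat.completions.create(
        model=ModelVariant.CONTROL.value,
        messages=[{"role": "user", "content": message}],
    )
    return {
        "content": result.choices[0].message.content,
        "input_tokens": result.usage.prompt_tokens,
        "output_tokens": result.usage.completion_tokens,
    }


async def call_claude(message: str) -> dict:
    """Call the treatment model and normalize the fields we log."""
    result = await new_client.messages.create(
        model=ModelVariant.TREATMENT.value,
        max_tokens=1024,
        messages=[{"role": "user", "content": message}],
    )
    return {
        "content": result.content[0].text,
        "input_tokens": result.usage.input_tokens,
        "output_tokens": result.usage.output_tokens,
    }


async def handle_request(user_id: str, message: str) -> dict:
    variant = get_variant(user_id, experiment_id="exp-claude-migration-v1", treatment_percentage=10)
    call_model = call_openai if variant is ModelVariant.CONTROL else call_claude

    # Measure latency ourselves; neither SDK response carries it
    start = time.perf_counter()
    response = await call_model(message)
    latency_ms = (time.perf_counter() - start) * 1000

    # Log variant with every request (log_event is your own analytics hook, not defined here)
    await log_event({
        "user_id": user_id,
        "variant": variant.value,
        "prompt_tokens": response["input_tokens"],
        "output_tokens": response["output_tokens"],
        "latency_ms": latency_ms,
        "response": response["content"],
    })
    return response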
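It is worth checking that hash-based assignment behaves as intended before relying on it. The sketch below mirrors the bucketing logic above in two small helpers (`bucket_for` and `in_treatment` are names introduced here for illustration) and checks it against synthetic user IDs: the split should land near the target percentage, and raising the percentage should only move users from control to treatment, never back.

```python
import hashlib
from collections import Counter


def bucket_for(user_id: str, experiment_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) using the same MD5 scheme."""
    digest = hashlib.md5(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def in_treatment(user_id: str, experiment_id: str, treatment_percentage: int) -> bool:
    """True when the user's bucket falls below the treatment threshold."""
    return bucket_for(user_id, experiment_id) < treatment_percentage


# Synthetic user IDs, used only to check the split
users = [f"user-{i}" for i in range(10_000)]
experiment = "exp-claude-migration-v1"

counts = Counter(in_treatment(u, experiment, 10) for u in users)
treatment_share = counts[True] / len(users)

# Everyone in treatment at 10% must still be in treatment at 25%
stayed = all(
    in_treatment(u, experiment, 25)
    for u in users
    if in_treatment(u, experiment, 10)
)
```

Because the experiment ID is part of the hash input, a new experiment reshuffles users into different buckets, so the same users are not always the first to see every change.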

3. Canary Deployment (Progressive Rollout)

Gradually increase the percentage sent to the new model, watching metrics at each step.

python
from datetime import datetime, timezone

ROLLOUT_STAGES = [
    {"percentage": 1,   "duration_hours": 2,   "auto_advance": True},
    {"percentage": 5,   "duration_hours": 4,   "auto_advance": True},
    {"percentage": 10,  "duration_hours": 8,   "auto_advance": False},  # manual gate
    {"percentage": 25,  "duration_hours": 24,  "auto_advance": False},
    {"percentage": 50,  "duration_hours": 48,  "auto_advance": False},
    {"percentage": 100, "duration_hours": None, "auto_advance": False},
]


class CanaryController:
    def __init__(self):
        self.current_stage = 0
        self.stage_start = datetime.now(timezone.utc)
        self.rolled_back = False

    def get_treatment_percentage(self) -> int:
        # After a rollback, send no traffic to the new model
        if self.rolled_back:
            return 0
        return ROLLOUT_STAGES[self.current_stage]["percentage"]

    def advance_if_ready(self, metrics: dict) -> bool:
        """Roll back on a guardrail breach; otherwise advance if the stage is done."""
        if self.rolled_back:
            return False

        # Check guardrails first so a breach triggers rollback at any stage,
        # including manual-gate stages and stages still inside their soak time
        if metrics["error_rate"] > 0.02:  # > 2% errors
            self.rollback()
            return False

        if metrics["p99_latency_ms"] > 5000:  # > 5s p99
            self.rollback()
            return False

        stage = ROLLOUT_STAGES[self.current_stage]
        if not stage["auto_advance"]:
            return False  # needs manual approval

        elapsed = (datetime.now(timezone.utc) - self.stage_start).total_seconds() / 3600
        if elapsed < stage["duration_hours"]:
            return False

        self.current_stage += 1
        self.stage_start = datetime.now(timezone.utc)
        return True

    def rollback(self):
        print(f"Rolling back from stage {self.current_stage}")
        self.current_stage = 0
        self.rolled_back = True
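The controller expects a `metrics` dict with `error_rate` and `p99_latency_ms`. In practice these would usually come from your monitoring stack. As a rough sketch of what they mean, here is one way to derive them from raw per-request records. The record shape (`ok`, `latency_ms`) and the helper names are assumptions made for this example, and the percentile uses the simple nearest-rank method.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_requests(requests: list[dict]) -> dict:
    """Build the guardrail metrics from per-request records.

    Assumes each record looks like {"ok": bool, "latency_ms": float}.
    """
    latencies = [r["latency_ms"] for r in requests]
    errors = sum(1 for r in requests if not r["ok"])
    return {
        "error_rate": errors / len(requests),
        "p99_latency_ms": percentile(latencies, 99),
    }


# 100 synthetic requests: latencies 10, 20, ..., 1000 ms, with 3 failures
window = [{"ok": i not in (7, 42, 99), "latency_ms": 10.0 * i} for i in range(1, 101)]
metrics = summarize_requests(window)
```

With this synthetic window the error rate is 3%, which is above the 2% guardrail, so passing `metrics` to the controller would trigger a rollback.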

What Metrics to Track

For each variant, track:

python
metrics_to_compare = {
    # Reliability
    "error_rate": "% of requests that failed",
    "timeout_rate": "% that exceeded timeout",
    
    # Performance  
    "p50_latency_ms": "median response time",
    "p99_latency_ms": "tail latency",
    "tokens_per_second": "throughput",
    
    # Quality (requires human evaluation or LLM-as-judge)
    "relevance_score": "0-1 relevance to prompt",
    "faithfulness_score": "does response stick to facts?",
    "user_rating": "explicit thumbs up/down",
    
    # Cost
    "cost_per_request_usd": "model + infrastructure cost",
    "tokens_used": "input + output tokens",
}
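Two of these metrics are simple arithmetic over values you already log. The sketch below shows one way to compute them. The per-million-token prices are placeholder arguments, not real list prices, so substitute your providers' current rates. Infrastructure cost is left out.

```python
def cost_per_request_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Model cost of one request, given prices in USD per million tokens."""
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000


def tokens_per_second(output_tokens: int, latency_ms: float) -> float:
    """Output throughput for one request."""
    return output_tokens / (latency_ms / 1000)


# Placeholder prices (3.0 and 15.0 USD per million tokens), for illustration only
example_cost = cost_per_request_usd(
    input_tokens=1200,
    output_tokens=350,
    input_price_per_mtok=3.0,
    output_price_per_mtok=15.0,
)
example_throughput = tokens_per_second(output_tokens=350, latency_ms=2800.0)
```

Output tokens are typically priced higher than input tokens, so a more verbose model can cost more per request even when its listed prices are lower.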

LLM-as-Judge for Quality Scoring

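Once you have shadow logs, use an LLM to compare responses. This example uses Claude as the judge. Since Claude is also the treatment here, spot-check its verdicts against human ratings in case it favors its own outputs: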

python
import json

import anthropic

judge_client = anthropic.AsyncAnthropic()


async def judge_response_pair(prompt: str, old_response: str, new_response: str) -> dict:
    judge_prompt = f"""Compare these two AI responses to the same prompt.

Prompt: {prompt}

Response A: {old_response}

Response B: {new_response}

Rate each response on:
- Accuracy (0-10)
- Clarity (0-10)
- Completeness (0-10)
- Overall preference (A or B)

Respond with JSON only: {{"a_scores": {{"accuracy": 0, "clarity": 0, "completeness": 0}}, "b_scores": {{"accuracy": 0, "clarity": 0, "completeness": 0}}, "winner": "A|B|tie", "reasoning": "..."}}"""

    result = await judge_client.messages.create(
        model="claude-haiku-4-5-20251001",  # cheap model for judging
        max_tokens=500,
        messages=[{"role": "user", "content": judge_prompt}]
    )

    # The model may wrap the JSON in prose or a code fence; parse only the object
    text = result.content[0].text
    return json.loads(text[text.index("{"): text.rindex("}") + 1])
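One known weakness of pairwise judging is position bias: a judge may favor whichever response is listed first. The judge above always shows the old model as Response A. A common mitigation is to randomize the order per pair and translate the verdict back afterwards. The helpers below are a sketch of that idea; their names are made up for this example and they are not part of any SDK.

```python
import random


def build_judge_pair(old_response: str, new_response: str, swap: bool) -> tuple[str, str]:
    """Return (response_a, response_b); when swap is True the new model is shown as A."""
    return (new_response, old_response) if swap else (old_response, new_response)


def resolve_winner(judge_winner: str, swap: bool) -> str:
    """Translate the judge's "A"/"B"/"tie" verdict back to "old"/"new"/"tie"."""
    if judge_winner == "tie":
        return "tie"
    a_is_new = swap
    picked_a = judge_winner == "A"
    # The new model wins when the judge picked the slot the new model was in
    return "new" if picked_a == a_is_new else "old"


# Flip a coin per pair so neither model always appears first
swap = random.random() < 0.5
response_a, response_b = build_judge_pair("old answer", "new answer", swap)
```

Store `swap` alongside the judge's output so the verdict can be resolved later, and compare win rates for swapped and unswapped pairs to see how strong the position effect is on your data.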

Kubernetes Traffic Splitting with Istio

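For service-level A/B testing, split traffic at the mesh layer. Weighted routes are chosen per request, so the split alone does not keep a user on one variant. The `x-experiment` header match lets a caller pin a request to the Claude backend: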

yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: llm-service
spec:
  hosts:
    - llm-service
  http:
    - match:
        - headers:
            x-experiment: { exact: "claude-v1" }
      route:
        - destination:
            host: llm-service-claude
    - route:
        - destination:
            host: llm-service-gpt4o
          weight: 90
        - destination:
            host: llm-service-claude
          weight: 10
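To exercise the header-matched route from a client, the request only needs to carry the `x-experiment` header. Here is a minimal standard-library sketch. The URL path and request body are assumptions for illustration, and the request is built but not sent.

```python
import urllib.request


def build_pinned_request(url: str, body: bytes, experiment: str = "claude-v1") -> urllib.request.Request:
    """Build a POST request carrying the x-experiment header matched by the route above."""
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "x-experiment": experiment},
    )


# Hypothetical in-mesh URL; the path is an assumption
request = build_pinned_request("http://llm-service/v1/chat", b'{"message": "hello"}')
# urllib.request.urlopen(request) would send it from inside the mesh
```

Requests without the header fall through to the weighted route, so this is also a convenient way for internal testers to try the new backend before it receives any share of general traffic.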

Decision Framework

After collecting data:

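| Metric          | Control (GPT-4o) | Treatment (Claude) | Decision            |
|-----------------|------------------|--------------------|---------------------|
| Error rate      | 0.5%             | 0.3%               | ✅ Treatment better |
| P99 latency     | 3.2s             | 2.8s               | ✅ Treatment better |
| LLM judge score | 7.2/10           | 7.8/10             | ✅ Treatment better |
| Cost/request    | $0.045           | $0.031             | ✅ Treatment better |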

When treatment wins on all dimensions: full rollout. When mixed: weigh by what matters most for your use case.
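The rule above can be written down as a small comparison helper. This is a sketch using the figures from the table. The metric names, the set of lower-is-better metrics, and the "keep control" and "no difference" outcomes are choices made for this example.

```python
# Metrics where a smaller value is the better one
LOWER_IS_BETTER = {"error_rate", "p99_latency_s", "cost_per_request_usd"}


def compare_variants(control: dict, treatment: dict) -> dict:
    """Label each shared metric with the variant that did better."""
    verdicts = {}
    for name in control.keys() & treatment.keys():
        c, t = control[name], treatment[name]
        if c == t:
            verdicts[name] = "tie"
        elif (t < c) == (name in LOWER_IS_BETTER):
            verdicts[name] = "treatment"
        else:
            verdicts[name] = "control"
    return verdicts


def recommend(verdicts: dict) -> str:
    """Turn per-metric verdicts into a rollout recommendation."""
    outcomes = set(verdicts.values()) - {"tie"}
    if not outcomes:
        return "no difference"
    if outcomes == {"treatment"}:
        return "full rollout"
    if outcomes == {"control"}:
        return "keep control"
    return "mixed: weigh by priority"


control = {"error_rate": 0.005, "p99_latency_s": 3.2, "judge_score": 7.2, "cost_per_request_usd": 0.045}
treatment = {"error_rate": 0.003, "p99_latency_s": 2.8, "judge_score": 7.8, "cost_per_request_usd": 0.031}

verdicts = compare_variants(control, treatment)
decision = recommend(verdicts)
```

This compares point values only. It ignores sample size and noise, so a small difference measured on little traffic should not be treated as a win.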

Tools used: Anthropic SDK, OpenAI SDK, Istio, Prometheus, Grafana
LiteLLM is also excellent for abstracting multi-provider routing.
