LLM Routing: Automatically Select the Right Model in Production
Build a model router in Python that picks cheap vs expensive LLMs based on query complexity. Covers cost-based routing, latency fallbacks, LiteLLM router, and tracking routing decisions with the Anthropic SDK.
Running a single LLM for all production traffic is like using a forklift to move a coffee mug. Simple queries (lookup, reformatting, classification) can run on a cheap, fast model. Complex reasoning, code generation, and multi-step analysis need a more powerful one. A model router makes this decision automatically and can cut inference costs substantially: the worked example at the end of this article saves about 15% against an all-Sonnet baseline, and far more against an all-Opus baseline, while sending the hardest queries to a stronger model.
The Core Idea
Every incoming query gets evaluated for complexity. Based on that evaluation, it is routed to:
- Haiku — fast, cheap, good for classification, summarization, simple Q&A
- Sonnet — balanced, good for code review, document analysis, moderate reasoning
- Opus — most powerful, reserved for complex reasoning, long-context synthesis, agentic tasks
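The router runs before your LLM call. A heuristic check is only a handful of regex matches, so its overhead is small next to the model call itself. The savings come from avoiding unnecessary Opus calls: at the input prices used in this article, 1,000 input tokens cost $0.015 on Opus versus $0.0008 on Haiku, roughly a 19× difference.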
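To make the price gap between tiers concrete, here is a minimal sketch of per-request cost by tier. The prices are the illustrative per-1,000-input-token figures used in this article, and the 5× output multiplier mirrors the logging helper shown later. `request_cost` and the token counts are hypothetical, chosen only for illustration.

```python
# Illustrative USD prices per 1,000 input tokens (same figures as this article's cost table).
PRICE_PER_1K_INPUT = {"haiku": 0.0008, "sonnet": 0.003, "opus": 0.015}
OUTPUT_MULTIPLIER = 5  # assumption: output tokens cost ~5x input tokens


def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request on the given tier."""
    input_price = PRICE_PER_1K_INPUT[tier]
    input_cost = (input_tokens / 1000) * input_price
    output_cost = (output_tokens / 1000) * input_price * OUTPUT_MULTIPLIER
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical request: 800 input tokens, 200 output tokens.
    for tier in PRICE_PER_1K_INPUT:
        print(f"{tier:>6}: ${request_cost(tier, 800, 200):.5f}")
```

Because the same multiplier applies to every tier, the ratio between tiers is the same for any request shape. What changes your bill is how many requests land on each tier.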
Complexity Classification
The simplest router uses heuristics — no ML needed at this stage:
import re
from dataclasses import dataclass
from enum import Enum


class ModelTier(Enum):
    # Model IDs change over time; verify them against Anthropic's current model list.
    CHEAP = "claude-haiku-4-5-20251001"
    BALANCED = "claude-sonnet-4-5-20250929"
    POWERFUL = "claude-opus-4-5-20251101"


@dataclass
class RoutingDecision:
    model: str
    tier: ModelTier
    reason: str
    estimated_cost_per_1k_tokens: float  # input-token price in USD


# Illustrative USD prices per 1,000 input tokens; check current pricing before relying on them.
COST_PER_1K_INPUT = {
    ModelTier.CHEAP: 0.00080,
    ModelTier.BALANCED: 0.00300,
    ModelTier.POWERFUL: 0.01500,
}
def classify_complexity(prompt: str, context: dict = None) -> RoutingDecision:
    context = context or {}
    prompt_lower = prompt.lower()
    word_count = len(prompt.split())

    # P1: Simple classification / extraction → Haiku
    # Checked first, so a short prompt that matches both pattern sets goes to Haiku.
    simple_patterns = [
        r"\b(classify|categorize|label|tag|extract|parse)\b",
        r"\b(yes or no|true or false|is this a)\b",
        r"\b(summarize in one|give me a one.line)\b",
        r"\b(what is the|what are the|list the)\b",
    ]
    simple_match = any(re.search(p, prompt_lower) for p in simple_patterns)
    is_short = word_count < 50

    if simple_match and is_short:
        tier = ModelTier.CHEAP
        reason = "Simple pattern match, short prompt"
        return RoutingDecision(
            model=tier.value,
            tier=tier,
            reason=reason,
            estimated_cost_per_1k_tokens=COST_PER_1K_INPUT[tier],
        )

    # P2: Complex reasoning signals → Opus
    complex_patterns = [
        r"\b(analyze|reason|evaluate|compare|critique|debate)\b",
        r"\b(step by step|think through|consider all|weigh the)\b",
        r"\b(write a complete|implement|architect|design a system)\b",
        r"\b(multi.step|chain of thought|agentic|autonomous)\b",
    ]
    complex_match = any(re.search(p, prompt_lower) for p in complex_patterns)
    is_long = word_count > 200
    has_code_task = any(kw in prompt_lower for kw in ["refactor", "debug", "optimize", "implement"])

    if complex_match or (is_long and has_code_task):
        reason = f"Complex reasoning detected (word_count={word_count}, complex_match={complex_match})"
        # Context override: callers may pin a tier by name, e.g. {"force_tier": "BALANCED"}
        if context.get("force_tier"):
            tier = ModelTier[context["force_tier"]]
            reason += f", tier forced to {tier.name} by context"
        else:
            tier = ModelTier.POWERFUL
        return RoutingDecision(
            model=tier.value,
            tier=tier,
            reason=reason,
            estimated_cost_per_1k_tokens=COST_PER_1K_INPUT[tier],
        )

    # Default: Sonnet
    tier = ModelTier.BALANCED
    return RoutingDecision(
        model=tier.value,
        tier=tier,
        reason="Default balanced routing",
        estimated_cost_per_1k_tokens=COST_PER_1K_INPUT[tier],
    )

The Router with Fallback Chain
If the primary model is slow or returns an error, the router falls back to the next tier:
import logging
import time

import anthropic

logger = logging.getLogger(__name__)
# Note: the SDK retries some failed requests on its own before raising.
client = anthropic.Anthropic()

FALLBACK_CHAIN = {
    ModelTier.POWERFUL: [ModelTier.BALANCED, ModelTier.CHEAP],
    ModelTier.BALANCED: [ModelTier.CHEAP],
    ModelTier.CHEAP: [],
}

LATENCY_THRESHOLD_MS = 8000  # Fall back if a model takes longer than 8s to respond


def route_and_call(
    prompt: str,
    system: str = None,
    context: dict = None,
    max_tokens: int = 1024,
) -> dict:
    decision = classify_complexity(prompt, context)
    models_to_try = [decision.tier] + FALLBACK_CHAIN.get(decision.tier, [])

    for tier in models_to_try:
        model = tier.value
        start = time.time()
        try:
            messages_args = {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
                # Enforce the latency threshold as a per-request timeout (in seconds)
                "timeout": LATENCY_THRESHOLD_MS / 1000,
            }
            if system:
                messages_args["system"] = system

            response = client.messages.create(**messages_args)
            latency_ms = (time.time() - start) * 1000

            log_routing_decision(
                original_tier=decision.tier,
                used_tier=tier,
                latency_ms=latency_ms,
                prompt_tokens=response.usage.input_tokens,
                output_tokens=response.usage.output_tokens,
                reason=decision.reason,
            )

            return {
                "content": response.content[0].text,
                "model_used": model,
                "latency_ms": latency_ms,
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
                "routing_reason": decision.reason,
                "fallback_used": tier != decision.tier,
            }
        except (anthropic.APIStatusError, anthropic.APIConnectionError) as e:
            # APIConnectionError also covers APITimeoutError, i.e. the latency threshold above
            latency_ms = (time.time() - start) * 1000
            status = getattr(e, "status_code", "no response")
            logger.warning(f"Model {model} failed ({status}), latency={latency_ms:.0f}ms, trying fallback")
            continue

    raise RuntimeError("All models in fallback chain failed")


def log_routing_decision(original_tier, used_tier, latency_ms, prompt_tokens, output_tokens, reason):
    cost = (prompt_tokens / 1000) * COST_PER_1K_INPUT[used_tier]
    cost += (output_tokens / 1000) * (COST_PER_1K_INPUT[used_tier] * 5)  # output ~5x input cost
    logger.info(
        "routing_decision",
        extra={
            "original_model": original_tier.value,
            "used_model": used_tier.value,
            "latency_ms": round(latency_ms, 2),
            "prompt_tokens": prompt_tokens,
            "output_tokens": output_tokens,
            "estimated_cost_usd": round(cost, 6),
            "reason": reason,
            "fallback": original_tier != used_tier,
        },
    )

Using LiteLLM Router for Multi-Provider
If you want to route across providers (not just Anthropic model tiers), LiteLLM's built-in router handles provider failover, rate limit retries, and load balancing:
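import asyncio

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "fast-model",
            # The "anthropic/" prefix tells LiteLLM which provider to use
            "litellm_params": {"model": "anthropic/claude-haiku-4-5-20251001", "api_key": "os.environ/ANTHROPIC_API_KEY"},
            "tpm": 100000,
            "rpm": 100,
        },
        {
            "model_name": "balanced-model",
            "litellm_params": {"model": "anthropic/claude-sonnet-4-5-20250929", "api_key": "os.environ/ANTHROPIC_API_KEY"},
            "tpm": 50000,
            "rpm": 50,
        },
        {
            "model_name": "powerful-model",
            "litellm_params": {"model": "anthropic/claude-opus-4-5-20251101", "api_key": "os.environ/ANTHROPIC_API_KEY"},
            "tpm": 10000,
            "rpm": 10,
        },
    ],
    routing_strategy="latency-based-routing",
    fallbacks=[{"fast-model": ["balanced-model"]}, {"balanced-model": ["powerful-model"]}],
    retry_after=10,
)


async def main():
    log_line = "ERROR connection refused by upstream"  # sample input for illustration
    response = await router.acompletion(
        model="fast-model",
        messages=[{"role": "user", "content": "Classify this log line as error/warn/info: " + log_line}],
    )
    print(response.choices[0].message.content)


asyncio.run(main())

With latency-based routing, LiteLLM tracks latency per deployment and, when several deployments share a model_name, routes to the fastest one that is under its rate limits. In the configuration above each model_name has a single deployment, so the practical benefit comes from retries and fallbacks. Fallbacks trigger when a call still fails, for example on rate limit errors (429) or server errors (500+).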
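The fallback behaviour described above can be sketched without any provider SDK. This is not LiteLLM's implementation, only a minimal illustration of the idea: try each model group in order and move on when a call fails with a retryable status. `ProviderError`, `is_retryable`, and `call_with_fallbacks` are hypothetical names, and the fake callables stand in for real API clients.

```python
from typing import Callable, List, Tuple


class ProviderError(Exception):
    """Hypothetical error type carrying an HTTP-like status code."""

    def __init__(self, status_code: int):
        super().__init__(f"provider returned {status_code}")
        self.status_code = status_code


def is_retryable(status_code: int) -> bool:
    # Rate limits (429) and server errors (500+) are worth trying elsewhere.
    # Other 4xx errors (bad request, auth) would likely fail on every model.
    return status_code == 429 or status_code >= 500


def call_with_fallbacks(prompt: str, chain: List[Tuple[str, Callable[[str], str]]]):
    """Try each (name, callable) pair in order; return (name, result) of the first success."""
    last_error = None
    for name, call in chain:
        try:
            return name, call(prompt)
        except ProviderError as err:
            if not is_retryable(err.status_code):
                raise  # non-retryable: surface the error immediately
            last_error = err
    raise RuntimeError("All models in fallback chain failed") from last_error


def rate_limited(prompt: str) -> str:
    """Fake model that is always rate limited."""
    raise ProviderError(429)


def healthy(prompt: str) -> str:
    """Fake model that always succeeds."""
    return f"handled: {prompt}"


if __name__ == "__main__":
    used, answer = call_with_fallbacks(
        "classify this", [("fast-model", rate_limited), ("balanced-model", healthy)]
    )
    print(used, answer)
```

Separating retryable from non-retryable failures matters in practice: a malformed request should fail fast instead of burning a call on every tier in the chain.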
A/B Testing Model Choices
Send a small share of cheap-routed queries to a stronger model and record the results for both groups, so you can validate that routing is accurate:
import random


def route_with_ab_test(prompt: str, ab_test_ratio: float = 0.05) -> dict:
    decision = classify_complexity(prompt)
    # By default, 5% of cheap-routed queries get sent to Sonnet for quality comparison
    if decision.tier == ModelTier.CHEAP and random.random() < ab_test_ratio:
        override_tier = ModelTier.BALANCED
        logger.info(f"A/B test: overriding {decision.tier.value} → {override_tier.value}")
        # route_and_call() classifies the prompt again and would discard an override
        # made here, so the B group calls the overridden model directly.
        response = client.messages.create(
            model=override_tier.value,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return {
            "content": response.content[0].text,
            "model_used": override_tier.value,
            "routing_reason": decision.reason + " [A/B test override]",
            "ab_test": True,
        }
    return {**route_and_call(prompt), "ab_test": False}

Compare quality scores between the A (cheap) and B (balanced) groups. If the scores are similar, keep the cheap routing. If B scores significantly higher, update the classifier to route those query patterns to the higher tier.
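One way to make "significantly higher" concrete is a permutation test on the two groups' quality scores. The sketch below is an assumption about how scores might be compared, not part of the router itself. `compare_groups` is a hypothetical helper, the scores are made-up numbers on a 0 to 1 scale, and how you obtain them (human ratings, an LLM judge, task metrics) is up to you.

```python
import random
from statistics import mean


def compare_groups(a_scores, b_scores, n_permutations=10_000, seed=0):
    """Return (observed_diff, p_value) for mean(B) - mean(A), one-sided permutation test."""
    rng = random.Random(seed)
    observed = mean(b_scores) - mean(a_scores)
    pooled = list(a_scores) + list(b_scores)
    n_a = len(a_scores)
    at_least_as_large = 0
    for _ in range(n_permutations):
        # Shuffle group labels and recompute the difference in means.
        rng.shuffle(pooled)
        diff = mean(pooled[n_a:]) - mean(pooled[:n_a])
        if diff >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / n_permutations


if __name__ == "__main__":
    # Made-up quality scores for illustration only.
    a_scores = [0.72, 0.80, 0.75, 0.78, 0.74]  # group A: cheap model
    b_scores = [0.76, 0.79, 0.81, 0.77, 0.80]  # group B: balanced model
    diff, p_value = compare_groups(a_scores, b_scores)
    print(f"mean(B) - mean(A) = {diff:.3f}, p = {p_value:.3f}")
```

A small p-value suggests the B group really does score higher on those prompts. With only a 5% sample, expect to collect data for a while before the result is stable.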
Cost Savings in Practice
In a real production system handling 100,000 requests/day with this distribution:
| Query Type | Volume | Model Before | Model After |
|---|---|---|---|
| Classification | 40% | Sonnet | Haiku |
| Summarization | 30% | Sonnet | Haiku |
| Code review | 20% | Sonnet | Sonnet |
| Complex analysis | 10% | Sonnet | Opus |
Without routing: 100k requests × Sonnet cost ≈ $300/day
With routing: ~$45/day for Haiku queries + $60/day Sonnet + $150/day Opus ≈ $255/day
That is a 15% saving while improving quality on the complex 10%. Over a year: $16,425 saved.
If your baseline was using Opus for everything (about $1,500/day at the 5× Opus-to-Sonnet price ratio used here), routing to appropriate tiers saves over 80%.
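The arithmetic behind these figures is easy to check. The sketch below uses only the daily costs quoted above. The all-Opus baseline additionally assumes Opus costs 5× Sonnet per token, which matches the input prices used earlier in this article (0.015 vs 0.003 per 1,000 tokens). Variable names are illustrative.

```python
# Daily costs in USD, taken from the example above.
baseline_all_sonnet = 300.0
routed = {"haiku": 45.0, "sonnet": 60.0, "opus": 150.0}

routed_total = sum(routed.values())
daily_saving = baseline_all_sonnet - routed_total
saving_pct = 100 * daily_saving / baseline_all_sonnet
annual_saving = daily_saving * 365

# Assumption: Opus is priced at 5x Sonnet, so an all-Opus baseline is 5x the all-Sonnet one.
baseline_all_opus = baseline_all_sonnet * 5
saving_vs_opus_pct = 100 * (baseline_all_opus - routed_total) / baseline_all_opus

print(f"routed total:       ${routed_total:.0f}/day")
print(f"saving vs Sonnet:   ${daily_saving:.0f}/day ({saving_pct:.0f}%)")
print(f"annual saving:      ${annual_saving:,.0f}")
print(f"saving vs all-Opus: {saving_vs_opus_pct:.0f}%")
```

Under these assumptions the all-Opus comparison works out to roughly 83%. The exact figure depends on your real token mix and prices.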
Summary
A model router is production engineering applied to LLM inference. The pattern:
- Classify query complexity with heuristics (or a classifier model)
- Route cheap queries to Haiku, complex queries to Opus, everything else to Sonnet
- Add latency-based fallback chains so slow or failing models degrade gracefully
- Log every routing decision with cost and latency data
- A/B test routing decisions to validate accuracy
Start with heuristics. Graduate to a trained classifier once you have enough labeled routing data from your production logs.
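Before investing in a trained classifier, it helps to measure how often the heuristic already agrees with labeled data. The sketch below assumes you have exported log rows as (routed_tier, correct_tier) pairs. That format, the `agreement_by_tier` helper, and the sample rows are all hypothetical.

```python
from collections import Counter


def agreement_by_tier(rows):
    """rows: iterable of (routed_tier, correct_tier) pairs.

    Returns {correct_tier: fraction of those rows the router sent to that tier}.
    """
    totals = Counter()
    hits = Counter()
    for routed, correct in rows:
        totals[correct] += 1
        if routed == correct:
            hits[correct] += 1
    return {tier: hits[tier] / totals[tier] for tier in totals}


if __name__ == "__main__":
    # Made-up labeled rows for illustration only.
    sample_rows = [
        ("CHEAP", "CHEAP"),
        ("CHEAP", "BALANCED"),  # router under-provisioned this query
        ("BALANCED", "BALANCED"),
        ("POWERFUL", "POWERFUL"),
    ]
    for tier, rate in agreement_by_tier(sample_rows).items():
        print(f"{tier}: {rate:.0%} routed correctly")
```

Tiers with low agreement show where the heuristic patterns need work, or where a trained classifier is most likely to pay off.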