LLM Cost Optimization in Production — Caching, Batching, Quantization 2026
LLM API bills spiral fast. Here's every technique to cut your LLM costs in production without sacrificing quality — prompt caching, request batching, model routing, and quantization.
You deployed an LLM-powered feature. Traffic grows. The monthly bill hits $10,000. Management asks why.
The techniques below cut LLM costs systematically, from quick wins to architectural changes.
Understand Where Your Money Goes First
Before optimizing, know your cost breakdown:
# Track cost per request in your app
import time

import anthropic

client = anthropic.Anthropic()

def call_llm_with_tracking(prompt: str, system: str = "") -> dict:
    start = time.time()
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": prompt}],
    )
    latency = time.time() - start
    # Log for cost tracking
    cost_data = {
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "latency_ms": latency * 1000,
        "model": "claude-sonnet-4-6",
        # Claude Sonnet: $3/M input, $15/M output tokens
        "cost_usd": (response.usage.input_tokens * 3 + response.usage.output_tokens * 15) / 1_000_000,
    }
    return {"response": response.content[0].text, "cost": cost_data}
Aggregate these logs to find which features cost the most, which prompts are expensive, and what your average tokens per request is.
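As a rough sketch of that aggregation, assume each logged record is the `cost_data` dict plus a `feature` tag. The `feature` field is hypothetical: your application would have to add it, since it is not part of the API response.

```python
from collections import defaultdict


def summarize_costs(records: list[dict]) -> dict[str, dict]:
    """Group cost records by feature and report totals and per-request averages.

    Each record is assumed to carry the cost_data keys plus a "feature" tag
    (a hypothetical field added by the caller).
    """
    totals: dict[str, dict] = defaultdict(
        lambda: {"requests": 0, "cost_usd": 0.0, "tokens": 0}
    )
    for rec in records:
        bucket = totals[rec["feature"]]
        bucket["requests"] += 1
        bucket["cost_usd"] += rec["cost_usd"]
        bucket["tokens"] += rec["input_tokens"] + rec["output_tokens"]

    summary = {}
    for feature, bucket in totals.items():
        summary[feature] = {
            **bucket,
            "avg_tokens_per_request": bucket["tokens"] / bucket["requests"],
        }
    return summary


def most_expensive(summary: dict[str, dict]) -> list[str]:
    """Feature names sorted by total cost, highest first."""
    return sorted(summary, key=lambda f: summary[f]["cost_usd"], reverse=True)
```

Running `most_expensive(summarize_costs(records))` over a day of logs gives a ranked list of where to optimize first.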
Technique 1: Prompt Caching (Biggest Win)
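If your prompts share a large system prompt or context, cache it. With Anthropic's prompt caching, cache reads are billed at 10% of the base input price, a 90% saving on the cached tokens. Cache writes carry a 25% premium over the base price with the default 5-minute lifetime, and each cache hit refreshes that lifetime. Prompts shorter than the model's minimum cacheable length are not cached, so check that limit in the docs.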
# Without caching: pay full price for the system prompt every time
# Large system prompt (2000 tokens) × 1000 requests/day = 2M tokens/day = $6/day
# With caching: pay 10% on cache hits after the first request
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": your_large_system_prompt,  # 2000 tokens
            "cache_control": {"type": "ephemeral"},  # Cache this!
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)
# Check cache performance
print(response.usage.cache_creation_input_tokens)  # Non-zero on the first request (cache write)
print(response.usage.cache_read_input_tokens)  # Non-zero on subsequent requests (cache hit)
When it helps: Any app with a large system prompt (instructions, context, documentation) that's the same across many requests.
Savings: If your system prompt is 1000 tokens and you get 10,000 requests/day:
- Without cache: 1000 × 10,000 × $3/M = $30/day
- With cache: 1000 × 10,000 × $0.30/M = ~$3/day in cache reads (90% savings), plus a small write premium each time the cache is created or expires
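The same arithmetic can be written out as a short calculation. This is a sketch that assumes Sonnet's $3/M input price, cache reads at 10% of it, and cache writes at 1.25× the base price (the premium for the default 5-minute cache). The `cache_writes_per_day` value is a guess at how often the cache expires and has to be rebuilt; measure it from your own traffic.

```python
def daily_prefix_cost(
    prefix_tokens: int,
    requests_per_day: int,
    input_price_per_m: float = 3.00,  # assumed Sonnet input price, USD per 1M tokens
    cache_read_ratio: float = 0.10,   # cache reads billed at 10% of base
    cache_write_ratio: float = 1.25,  # cache writes billed at 125% of base
    cache_writes_per_day: int = 0,    # 0 means caching is disabled
) -> float:
    """Daily cost in USD of sending the shared prompt prefix."""
    per_token = input_price_per_m / 1_000_000
    if cache_writes_per_day == 0:
        return prefix_tokens * requests_per_day * per_token
    writes = min(cache_writes_per_day, requests_per_day)
    reads = requests_per_day - writes
    write_cost = prefix_tokens * writes * per_token * cache_write_ratio
    read_cost = prefix_tokens * reads * per_token * cache_read_ratio
    return write_cost + read_cost


uncached = daily_prefix_cost(1000, 10_000)
cached = daily_prefix_cost(1000, 10_000, cache_writes_per_day=50)
print(f"uncached ${uncached:.2f}/day, cached ${cached:.2f}/day")
```

With steady traffic the write premium is negligible; with bursty traffic that lets the cache expire between requests, it grows and the savings shrink.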
Technique 2: Model Routing
Not every request needs GPT-4 or Claude Opus. Route by complexity:
# model_router.py
def route_to_model(task_type: str, complexity_score: float) -> str:
    """Route requests to appropriate model based on task."""
    # Simple classification/extraction → cheap model
    if task_type in ["classification", "extraction", "summarization"] and complexity_score < 0.5:
        return "claude-haiku-4-5-20251001"  # $1/M input, 3x cheaper than Sonnet
    # Medium complexity → Sonnet
    elif complexity_score < 0.8:
        return "claude-sonnet-4-6"  # $3/M input
    # Complex reasoning, coding → Opus
    else:
        return "claude-opus-4-7"  # Most expensive tier; confirm the current rate on the pricing page
Estimate complexity before routing. The keyword heuristic below is a starting point; embedding similarity is a better fit for production:
def estimate_complexity(prompt: str) -> float:
    """Simple heuristic; use embedding similarity for production."""
    # Keywords suggesting complex tasks
    complex_indicators = ["analyze", "reason", "compare", "design", "architect", "debug"]
    simple_indicators = ["summarize", "classify", "extract", "list", "format"]
    prompt_lower = prompt.lower()
    complex_count = sum(1 for kw in complex_indicators if kw in prompt_lower)
    simple_count = sum(1 for kw in simple_indicators if kw in prompt_lower)
    if complex_count + simple_count == 0:
        return 0.5  # Default to medium
    return complex_count / (complex_count + simple_count)
Technique 3: Response Caching
Cache identical (or semantically similar) requests:
import hashlib
import json

import redis

redis_client = redis.Redis(host='localhost', port=6379, decode_responses=True)

def cached_llm_call(prompt: str, system: str = "", ttl: int = 3600) -> str:
    # Create cache key from a hash of system prompt + user prompt
    cache_key = f"llm:{hashlib.sha256(f'{system}:{prompt}'.encode()).hexdigest()}"
    # Check cache (compare against None so an empty cached string still counts as a hit)
    cached = redis_client.get(cache_key)
    if cached is not None:
        return json.loads(cached)
    # Call LLM
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": prompt}],
    )
    result = response.content[0].text
    # Cache result
    redis_client.setex(cache_key, ttl, json.dumps(result))
    return result
For semantic caching (similar questions get the same answer):
# Use embeddings to find similar cached queries
# If cosine similarity > 0.95, return cached response
# Libraries: GPTCache, Redis with vector search
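A minimal sketch of that similarity lookup, without any caching library. The `embed` callable is a placeholder for whatever embedding model you use (an assumption, not an Anthropic API), and the linear scan only suits small caches; a vector index would replace it at scale.

```python
import math
from typing import Callable, Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """In-memory cache keyed by embedding similarity instead of exact text."""

    def __init__(self, embed: Callable[[str], list[float]], threshold: float = 0.95):
        self.embed = embed          # hypothetical embedding function supplied by the caller
        self.threshold = threshold  # minimum cosine similarity to count as a hit
        self.entries: list[tuple[list[float], str]] = []

    def get(self, prompt: str) -> Optional[str]:
        """Return the cached response for the most similar prompt, if similar enough."""
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        """Store a response under the embedding of its prompt."""
        self.entries.append((self.embed(prompt), response))
```

A hit skips the LLM call entirely, so the threshold trades savings against the risk of answering a subtly different question. GPTCache packages the same idea with pluggable vector stores: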
pip install gptcache
Technique 4: Batching Requests
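Instead of one API call per item, group several items into one call. This amortizes the shared instructions and per-request overhead; the document tokens themselves still cost the same: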
import re

# Bad: 100 API calls for 100 documents
for doc in documents:
    summary = call_llm(f"Summarize: {doc}")

def parse_numbered_response(text: str) -> list[str]:
    """Minimal helper (an assumption): split '1. ...', '2. ...' into one string per item."""
    parts = re.split(r"(?m)^\s*\d+\.\s+", text)
    return [part.strip() for part in parts[1:] if part.strip()]

# Better: batch several documents into a single call
def batch_summarize(documents: list[str], batch_size: int = 5) -> list[str]:
    summaries = []
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        # Format as numbered list
        prompt = "Summarize each of these documents in 2 sentences each:\n\n"
        for j, doc in enumerate(batch, 1):
            prompt += f"{j}. {doc}\n\n"
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=500 * len(batch),
            messages=[{"role": "user", "content": prompt}],
        )
        # Parse numbered responses
        summaries.extend(parse_numbered_response(response.content[0].text))
    return summaries
For Anthropic, use the Batch API for a 50% cost reduction on non-realtime tasks:
# Create batch
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"req-{i}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 500,
                "messages": [{"role": "user", "content": doc}],
            },
        }
        for i, doc in enumerate(documents)
    ]
)
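Submitting the batch only queues the work. A sketch of collecting the output, assuming the SDK's `messages.batches.retrieve` and `messages.batches.results` methods (older SDK versions expose them under `client.beta.messages.batches`) and a polling interval you would tune yourself:

```python
import time


def wait_for_batch(client, batch_id: str, poll_seconds: float = 60.0) -> dict[str, str]:
    """Poll until the batch ends, then map custom_id -> response text.

    `client` is an anthropic.Anthropic instance; entries that did not
    succeed (errored, canceled, expired) are skipped.
    """
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            break
        time.sleep(poll_seconds)

    texts: dict[str, str] = {}
    for entry in client.messages.batches.results(batch_id):
        if entry.result.type == "succeeded":
            texts[entry.custom_id] = entry.result.message.content[0].text
    return texts
```

Calling `wait_for_batch(client, batch.id)` returns the summaries keyed by the `custom_id` values set at submission, so results can be matched back to documents regardless of completion order.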
Batch results are available within 24 hours at 50% of the standard price.
Technique 5: Max Token Optimization
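Set max_tokens to the smallest value that fits the expected output. You pay only for tokens actually generated, so the cap doesn't make short replies cheaper; it bounds the worst case when a response runs long:
# Bad: Always allow 4096 tokens output
response = client.messages.create(model="claude-sonnet-4-6", max_tokens=4096, messages=messages)
# Good: Set based on expected output
task_max_tokens = {
    "classification": 10,  # Just "positive" or "negative"
    "extraction": 100,  # A few fields
    "summarization": 300,  # Short summary
    "code_generation": 1000,  # Reasonable for most snippets
}
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=task_max_tokens["classification"],
    messages=messages,
)
Also use streaming with early stopping: stop generating once you have what you need.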
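One way to sketch early stopping is a small helper that consumes streamed text chunks and returns as soon as a stop marker appears. The helper and its marker are illustrative assumptions; the commented wiring uses the SDK's `messages.stream` helper and its `text_stream` iterator.

```python
from typing import Iterable


def collect_until(chunks: Iterable[str], stop_marker: str) -> str:
    """Accumulate streamed text and stop as soon as stop_marker appears.

    Returns the text before the marker. Returning early stops consuming
    the stream; the caller is responsible for closing it.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        index = buffer.find(stop_marker)
        if index != -1:
            return buffer[:index]
    return buffer


# Hypothetical wiring with the Anthropic SDK's streaming helper:
#
# with client.messages.stream(
#     model="claude-sonnet-4-6",
#     max_tokens=300,
#     messages=[{"role": "user", "content": prompt}],
# ) as stream:
#     first_paragraph = collect_until(stream.text_stream, "\n\n")
# # Leaving the `with` block closes the connection.
```

When the stop condition is a fixed string, the `stop_sequences` request parameter achieves the same effect server-side. How promptly generation and billing stop after a client-side disconnect is worth verifying against your own usage logs.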
Technique 6: Self-Hosted Models for High Volume
At high, steady volume, self-hosting can beat API pricing:
| Volume | Cheapest Option |
|---|---|
| < 1M tokens/day | API (Haiku/Groq) |
| 1M–10M tokens/day | API + caching |
| 10M+ tokens/day | Self-hosted (vLLM + Mistral 7B) |
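On a g4dn.xlarge ($380/month), Mistral 7B handles ~40M tokens/day at constant load. API equivalent at Haiku 4.5 input pricing ($1/M): 40M × $1/M = $40/day, or about $1,200/month, and more once output tokens ($5/M) are included. Actual throughput depends on batch size, sequence length, and quantization, so benchmark before committing.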
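To see where the break-even falls, the comparison can be reduced to two small functions. This sketch compares only the instance rental against a flat per-token API price; it ignores output-token pricing, engineering time, and quality differences between models, and the three prices in the loop are illustrative rather than quotes.

```python
def monthly_api_cost(tokens_per_day: float, price_per_m: float, days: int = 30) -> float:
    """API spend in USD per month for a given daily token volume."""
    return tokens_per_day / 1_000_000 * price_per_m * days


def break_even_tokens_per_day(instance_monthly_usd: float, price_per_m: float, days: int = 30) -> float:
    """Daily token volume above which a fixed-cost instance is cheaper than the API."""
    return instance_monthly_usd / (price_per_m * days) * 1_000_000


INSTANCE_MONTHLY_USD = 380.0  # g4dn.xlarge figure from the text

for price in (0.25, 1.00, 3.00):  # illustrative USD-per-1M-token prices
    api = monthly_api_cost(40_000_000, price)
    threshold = break_even_tokens_per_day(INSTANCE_MONTHLY_USD, price)
    print(f"${price:.2f}/M: API ${api:,.0f}/month, break-even {threshold / 1e6:.1f}M tokens/day")
```

The cheaper the API model you would otherwise use, the higher the volume needed before self-hosting pays for itself.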
Cost Monitoring Dashboard
Track these metrics in Grafana/Datadog:
# Prometheus metrics to expose
from prometheus_client import Counter, Histogram

llm_requests_total = Counter('llm_requests_total', 'Total LLM requests', ['model', 'task_type'])
llm_cost_total = Counter('llm_cost_usd_total', 'Total LLM cost in USD', ['model'])
# The "type" label is either "input" or "output"
llm_tokens_total = Counter('llm_tokens_total', 'Total tokens', ['model', 'type'])
llm_cache_hits = Counter('llm_cache_hits_total', 'Cache hits', ['cache_type'])
llm_latency = Histogram('llm_request_duration_seconds', 'Request latency', ['model'])
Alert when daily cost exceeds budget:
# Prometheus alert rule (sum across the per-model series so the budget applies to total spend)
- alert: LLMCostHigh
  expr: sum(increase(llm_cost_usd_total[24h])) > 100
  annotations:
    summary: "LLM costs exceeded $100 in the last 24 hours"
Combine prompt caching, model routing, and response caching, and many teams can expect a 60–80% cost reduction with little or no loss in response quality. Start with prompt caching: it's a one-line change and has the highest ROI.
For building cost-effective LLMOps pipelines, the Anthropic API docs have detailed pricing and caching guides.