
LLM Cost Optimization in Production — Caching, Batching, Quantization 2026

LLM API bills spiral fast. Here's every technique to cut your LLM costs in production without sacrificing quality — prompt caching, request batching, model routing, and quantization.

DevOpsBoys | May 29, 2026 | 5 min read

You deployed an LLM-powered feature. Traffic grows. The monthly bill hits $10,000. Management asks why.

Here's every technique to cut LLM costs systematically — from quick wins to architectural changes.


Understand Where Your Money Goes First

Before optimizing, know your cost breakdown:

python
# Track cost per request in your app
import anthropic
import time
 
client = anthropic.Anthropic()
 
def call_llm_with_tracking(prompt: str, system: str = "") -> dict:
    start = time.time()
    
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": prompt}]
    )
    
    latency = time.time() - start
    
    # Log for cost tracking
    cost_data = {
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "latency_ms": latency * 1000,
        "model": "claude-sonnet-4-6",
        # Claude Sonnet: $3/M input, $15/M output tokens
        "cost_usd": (response.usage.input_tokens * 3 + response.usage.output_tokens * 15) / 1_000_000
    }
    
    return {"response": response.content[0].text, "cost": cost_data}

Aggregate these logs to find which features cost the most, which prompts are expensive, and your average tokens per request.
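A minimal sketch of that aggregation. It assumes each logged record also carries a `feature` tag, which is a hypothetical field (the tracking function above does not record one), alongside the token counts and `cost_usd`:

```python
from collections import defaultdict


def summarize_costs(records: list[dict]) -> dict[str, dict]:
    """Group cost records by feature and compute totals and averages.

    Each record is assumed to look like the cost_data dict above, plus a
    hypothetical "feature" key naming the product feature that made the call.
    """
    totals: dict[str, dict] = defaultdict(
        lambda: {"requests": 0, "cost_usd": 0.0, "tokens": 0}
    )
    for rec in records:
        bucket = totals[rec["feature"]]
        bucket["requests"] += 1
        bucket["cost_usd"] += rec["cost_usd"]
        bucket["tokens"] += rec["input_tokens"] + rec["output_tokens"]

    # Add the per-request average once all records are counted
    return {
        feature: {**bucket, "avg_tokens_per_request": bucket["tokens"] / bucket["requests"]}
        for feature, bucket in totals.items()
    }
```

Sorting the result by `cost_usd` in descending order gives a quick list of which features to optimize first.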


Technique 1: Prompt Caching (Biggest Win)

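If your prompts share a large system prompt or context, cache it. With Anthropic's prompt caching, cached tokens are read at 10% of the base input price, a 90% saving on that part of the prompt. Writing to the cache costs slightly more than normal input (25% extra for the default 5-minute cache), and prompts shorter than the model's minimum cacheable length are not cached at all.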

python
# Without caching: pay full price for system prompt every time
# Large system prompt (2000 tokens) × 1000 requests/day = 2M tokens/day = $6/day
 
# With caching: pay 10% on cache hits after first request
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": your_large_system_prompt,  # 2000 tokens
            "cache_control": {"type": "ephemeral"}  # Cache this!
        }
    ],
    messages=[{"role": "user", "content": user_message}]
)
 
# Check cache performance
print(response.usage.cache_creation_input_tokens)  # First request
print(response.usage.cache_read_input_tokens)       # Subsequent requests

When it helps: Any app with a large system prompt (instructions, context, documentation) that's the same across many requests.

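Savings: If your system prompt is 2,000 tokens and you get 10,000 requests/day on Sonnet ($3/M input):

  • Without cache: 2,000 × 10,000 × $3/M = $60/day
  • With cache: one cache write (under $0.01) + 2,000 × 10,000 × $0.30/M ≈ $6/day (90% savings), assuming requests arrive often enough that the cache never expires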
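The arithmetic can be checked with a small calculator. This is a sketch under stated assumptions: the 1.25x write and 0.10x read multipliers are taken as the rates for the default 5-minute cache, the function name and parameters are illustrative, and only the shared prompt prefix is costed (user messages and output tokens are excluded). It uses a 2,000-token prompt like the one in the caching snippet above.

```python
def daily_prompt_cost(
    prompt_tokens: int,
    requests_per_day: int,
    input_price_per_m: float,
    cache_writes_per_day: int = 0,
    write_multiplier: float = 1.25,  # assumed premium for a cache write
    read_multiplier: float = 0.10,   # assumed price of a cache read vs. base input
) -> float:
    """Daily USD cost of the shared prompt prefix only.

    With cache_writes_per_day=0 nothing is cached and every request pays full
    price. Otherwise that many requests pay the write premium and the rest
    are billed as cache reads.
    """
    per_token = input_price_per_m / 1_000_000
    if cache_writes_per_day == 0:
        return prompt_tokens * requests_per_day * per_token
    writes = min(cache_writes_per_day, requests_per_day)
    reads = requests_per_day - writes
    return prompt_tokens * per_token * (writes * write_multiplier + reads * read_multiplier)


uncached = daily_prompt_cost(2000, 10_000, 3.0)
warm_cache = daily_prompt_cost(2000, 10_000, 3.0, cache_writes_per_day=1)
print(f"uncached: ${uncached:.2f}/day, warm cache: ${warm_cache:.2f}/day")
```

Raising `cache_writes_per_day` models bursty traffic where the cache expires between requests. The savings shrink as writes become a larger share of calls.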

Technique 2: Model Routing

Not every request needs GPT-4 or Claude Opus. Route by complexity:

python
# model_router.py
def route_to_model(task_type: str, complexity_score: float) -> str:
    """Route requests to appropriate model based on task."""
    
    # Simple classification/extraction → cheap model
    if task_type in ["classification", "extraction", "summarization"] and complexity_score < 0.5:
        return "claude-haiku-4-5-20251001"  # $1/M input, about 3x cheaper than Sonnet
    
    # Medium complexity → Sonnet
    elif complexity_score < 0.8:
        return "claude-sonnet-4-6"  # $3/M input
    
    # Complex reasoning, coding → Opus
    else:
        return "claude-opus-4-7"  # $15/M input

Estimate complexity before routing. A keyword heuristic like the one below is a reasonable starting point; embedding similarity is more robust in production:

python
def estimate_complexity(prompt: str) -> float:
    """Simple heuristic — use embedding similarity for production."""
    # Keywords suggesting complex tasks
    complex_indicators = ["analyze", "reason", "compare", "design", "architect", "debug"]
    simple_indicators = ["summarize", "classify", "extract", "list", "format"]
    
    prompt_lower = prompt.lower()
    complex_count = sum(1 for kw in complex_indicators if kw in prompt_lower)
    simple_count = sum(1 for kw in simple_indicators if kw in prompt_lower)
    
    if complex_count + simple_count == 0:
        return 0.5  # Default to medium
    
    return complex_count / (complex_count + simple_count)

Technique 3: Response Caching

Cache identical (or semantically similar) requests:

python
import redis
import hashlib
import json
 
redis_client = redis.Redis(host='localhost', port=6379, decode_responses=True)
 
def cached_llm_call(prompt: str, system: str = "", ttl: int = 3600) -> str:
    # Create cache key from prompt + system prompt hash
    cache_key = f"llm:{hashlib.md5(f'{system}:{prompt}'.encode()).hexdigest()}"
    
    # Check cache
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    
    # Call LLM
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": prompt}]
    )
    result = response.content[0].text
    
    # Cache result
    redis_client.setex(cache_key, ttl, json.dumps(result))
    
    return result

For semantic caching (similar questions get same answer):

python
# Use embeddings to find similar cached queries
# If cosine similarity > 0.95, return cached response
# Libraries: GPTCache, Redis with vector search
# Install GPTCache from a shell: pip install gptcache
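The idea in those comments can be sketched without any library. This is a minimal in-memory version, not how GPTCache or Redis implement it. `SemanticCache`, `cosine_similarity`, and `toy_embed` are hypothetical names, and the embedder is injected so that a real embedding model can replace the toy one.

```python
import math
from typing import Callable, Optional, Sequence

Embedder = Callable[[str], Sequence[float]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Linear scan over stored (embedding, response) pairs."""

    def __init__(self, embed: Embedder, threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self._entries: list[tuple[Sequence[float], str]] = []

    def get(self, prompt: str) -> Optional[str]:
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self._entries:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        # Only return a hit when the closest entry clears the threshold
        return best_response if best_score > self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        self._entries.append((self.embed(prompt), response))


def toy_embed(text: str) -> list[float]:
    """Stand-in embedder (word counts over a tiny vocabulary). Not for real use."""
    vocab = ["reset", "password", "refund", "order"]
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in vocab]


cache = SemanticCache(toy_embed)
cache.put("How do I reset my password?", "Go to Settings > Security.")
print(cache.get("reset password how?"))   # similar enough: cached answer
print(cache.get("Where is my refund?"))   # unrelated: None, so call the LLM
```

A linear scan is fine for a few thousand entries. Beyond that, a vector index is the usual choice. The threshold deserves care: set it too low and users get answers to questions they did not ask.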

Technique 4: Batching Requests

Instead of one API call per item, batch them:

python
# Bad: 100 API calls for 100 documents
for doc in documents:
    summary = call_llm(f"Summarize: {doc}")
 
# Better: Batch in single call
def batch_summarize(documents: list[str], batch_size: int = 5) -> list[str]:
    summaries = []
    
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i+batch_size]
        
        # Format as numbered list
        prompt = "Summarize each of these documents in 2 sentences each:\n\n"
        for j, doc in enumerate(batch, 1):
            prompt += f"{j}. {doc}\n\n"
        
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=500 * len(batch),
            messages=[{"role": "user", "content": prompt}]
        )
        
        # Parse numbered responses
        summaries.extend(parse_numbered_response(response.content[0].text))
    
    return summaries
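The snippet above calls `parse_numbered_response` without defining it. One possible implementation is sketched below. It assumes the model starts each item on its own line with `1.` or `1)`, which is a formatting convention you would need to request in the prompt, not something the API guarantees.

```python
import re


def parse_numbered_response(text: str) -> list[str]:
    """Split a reply formatted as '1. ...', '2. ...' into a list of items.

    Lines that do not start a new item are appended to the current one, so
    multi-line summaries survive. Text before the first item is ignored.
    """
    items: list[str] = []
    for line in text.splitlines():
        match = re.match(r"^\s*(\d+)[.)]\s+(.*)$", line)
        if match:
            items.append(match.group(2).strip())
        elif line.strip() and items:
            items[-1] += " " + line.strip()
    return items
```

Callers should check that the number of parsed items equals the batch size and fall back to per-document calls when it does not, since a model can merge or skip items.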

For Anthropic, use the Batch API for 50% cost reduction on non-realtime tasks:

python
# Create batch (the Message Batches API is available on the main client)
batch = client.messages.batches.create(
    requests=[
        {"custom_id": f"req-{i}", "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 500,
            "messages": [{"role": "user", "content": doc}]
        }}
        for i, doc in enumerate(documents)
    ]
)
# Results available within 24 hours at 50% cost
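Submitting the batch is only half the workflow. The sketch below polls for completion and collects the outputs. It assumes the Anthropic Python SDK's `messages.batches.retrieve` and `messages.batches.results` methods and the `processing_status` / `result.type` fields as documented at the time of writing, so verify them against your SDK version. `collect_batch_results` is a hypothetical helper name.

```python
import time


def collect_batch_results(client, batch_id: str, poll_seconds: float = 60.0) -> dict[str, str]:
    """Wait for a message batch to finish, then map custom_id -> response text.

    Requests that errored, were canceled, or expired are skipped; callers can
    compare the returned keys with the submitted custom_ids and retry the rest.
    """
    while client.messages.batches.retrieve(batch_id).processing_status != "ended":
        time.sleep(poll_seconds)

    outputs: dict[str, str] = {}
    for entry in client.messages.batches.results(batch_id):
        if entry.result.type == "succeeded":
            outputs[entry.custom_id] = entry.result.message.content[0].text
    return outputs
```

In a real pipeline this would usually run as a scheduled job, not as a blocking loop, since a batch can take hours to complete.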

Technique 5: Max Token Optimization

Set max_tokens as tight as possible:

python
# Bad: Always allow 4096 tokens output
response = client.messages.create(model="claude-sonnet-4-6", max_tokens=4096, messages=[{"role": "user", "content": prompt}])
 
# Good: Set based on expected output
task_max_tokens = {
    "classification": 10,       # Just "positive" or "negative"
    "extraction": 100,          # A few fields
    "summarization": 300,       # Short summary
    "code_generation": 1000,    # Reasonable for most snippets
}
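A tight limit is only safe if you notice when it cuts a response short. The sketch below shows one way to apply the per-task limits and detect truncation. `max_tokens_for`, `was_truncated`, and the fallback of 512 are illustrative assumptions; the `stop_reason == "max_tokens"` check relies on the stop reason the Messages API reports when the limit is hit.

```python
TASK_MAX_TOKENS = {
    "classification": 10,
    "extraction": 100,
    "summarization": 300,
    "code_generation": 1000,
}
DEFAULT_MAX_TOKENS = 512  # assumed fallback for task types not listed above


def max_tokens_for(task_type: str) -> int:
    """Look up the output budget for a task, with a conservative default."""
    return TASK_MAX_TOKENS.get(task_type, DEFAULT_MAX_TOKENS)


def was_truncated(response) -> bool:
    """True when generation stopped because it ran into max_tokens."""
    return getattr(response, "stop_reason", None) == "max_tokens"
```

Logging the truncation rate per task type shows where a limit is too tight. A task that truncates often needs a larger budget or a prompt that asks for shorter output.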

Also use streaming with early stopping: close the stream as soon as you have what you need, or pass stop_sequences so generation ends at a known delimiter.


Technique 6: Self-Hosted Models for High Volume

At scale, self-hosted beats API:

Volume | Cheapest Option
< 1M tokens/day | API (Haiku/Groq)
1M–10M tokens/day | API + caching
10M+ tokens/day | Self-hosted (vLLM + Mistral 7B)

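On a g4dn.xlarge (about $380/month), Mistral 7B handles ~40M tokens/day at constant load. The same 40M tokens/day through an API at $1/M input tokens (Haiku 4.5) costs $40/day, or about $1,200/month, before the higher output-token rate is counted. At lower volumes or lower per-token prices the API stays cheaper, so run the numbers for your own traffic.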
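A break-even estimate makes the comparison concrete. This is a rough sketch that prices every token at a single blended rate and ignores engineering time, idle capacity, and quality differences between models. The function names are illustrative, and the per-million prices are inputs you should replace with current list prices.

```python
def monthly_api_cost(tokens_per_day: float, price_per_m: float, days: int = 30) -> float:
    """Monthly USD cost of sending tokens_per_day through an API."""
    return tokens_per_day / 1_000_000 * price_per_m * days


def break_even_tokens_per_day(server_cost_per_month: float, price_per_m: float, days: int = 30) -> float:
    """Daily token volume above which a fixed-cost server is cheaper than the API."""
    return server_cost_per_month / (price_per_m * days) * 1_000_000


# $380/month is the instance cost quoted above; the prices are example inputs
for price in (0.25, 1.00, 3.00):
    threshold = break_even_tokens_per_day(380, price)
    print(f"${price:.2f}/M tokens -> break-even at {threshold / 1e6:.1f}M tokens/day")
```

The break-even point moves a lot with the per-token price, so the volume bands in the table above are best treated as rough guides.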


Cost Monitoring Dashboard

Track these metrics in Grafana/Datadog:

python
# Prometheus metrics to expose
from prometheus_client import Counter, Histogram
 
llm_requests_total = Counter('llm_requests_total', 'Total LLM requests', ['model', 'task_type'])
llm_cost_total = Counter('llm_cost_usd_total', 'Total LLM cost in USD', ['model'])
llm_tokens_total = Counter('llm_tokens_total', 'Total tokens', ['model', 'type'])  # 'type' label is "input" or "output"
llm_cache_hits = Counter('llm_cache_hits_total', 'Cache hits', ['cache_type'])
llm_latency = Histogram('llm_request_duration_seconds', 'Request latency', ['model'])

Alert when daily cost exceeds budget:

yaml
# Prometheus alert rule (sum across models so the threshold applies to total spend)
- alert: LLMCostHigh
  expr: sum(increase(llm_cost_usd_total[24h])) > 100
  annotations:
    summary: "LLM costs exceeded $100 in the last 24 hours"

Combine prompt caching + model routing + response caching and most teams see 60–80% cost reduction without touching response quality. Start with prompt caching — it's one line of code and has the highest ROI.

For building cost-effective LLMOps pipelines, the Anthropic API docs have detailed pricing and caching guides.
