LLM Cost Optimization in Production — Caching, Batching, Quantization 2026

LLM API bills spiral fast. Here's every technique to cut your LLM costs in production without sacrificing quality — prompt caching, request batching, model routing, and quantization.

You deployed an LLM-powered feature. Traffic grows. The monthly bill hits $10,000. Management asks why.

The techniques below are ordered from quick wins to architectural changes, so you can cut costs systematically.

Understand Where Your Money Goes First

Before optimizing, know your cost breakdown:

python

# Track cost per request in your app
import time

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

MODEL = "claude-sonnet-4-6"
# Claude Sonnet: $3/M input, $15/M output tokens
INPUT_USD_PER_M, OUTPUT_USD_PER_M = 3.00, 15.00

def call_llm_with_tracking(prompt: str, system: str = "") -> dict:
    start = time.perf_counter()  # monotonic clock, suited to measuring durations

    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": prompt}]
    )

    latency = time.perf_counter() - start
    usage = response.usage

    # Log for cost tracking
    cost_data = {
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        "latency_ms": latency * 1000,
        "model": MODEL,
        "cost_usd": (usage.input_tokens * INPUT_USD_PER_M
                     + usage.output_tokens * OUTPUT_USD_PER_M) / 1_000_000,
    }

    return {"response": response.content[0].text, "cost": cost_data}

Aggregate these logs to find which features cost the most, which prompts are expensive, and how many tokens an average request uses.
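As a rough sketch of that aggregation, the function below groups logged cost records by feature. It assumes each record carries a `feature` field next to the token counts and `cost_usd`; that field is hypothetical and is not part of the `cost_data` dict above, so you would need to add it when logging.

```python
from collections import defaultdict

def summarize_costs(records: list[dict]) -> dict[str, dict]:
    """Group cost records by feature; return totals and per-request averages."""
    totals = defaultdict(lambda: {"requests": 0, "cost_usd": 0.0, "tokens": 0})
    for rec in records:
        bucket = totals[rec["feature"]]  # "feature" is an assumed, hypothetical field
        bucket["requests"] += 1
        bucket["cost_usd"] += rec["cost_usd"]
        bucket["tokens"] += rec["input_tokens"] + rec["output_tokens"]

    summary = {
        feature: {**b, "avg_tokens_per_request": b["tokens"] / b["requests"]}
        for feature, b in totals.items()
    }
    # Most expensive feature first
    return dict(sorted(summary.items(), key=lambda kv: kv[1]["cost_usd"], reverse=True))

# Example records, shaped like cost_data plus the assumed "feature" field
records = [
    {"feature": "chat", "input_tokens": 1000, "output_tokens": 500, "cost_usd": 0.0105},
    {"feature": "search", "input_tokens": 200, "output_tokens": 50, "cost_usd": 0.00135},
    {"feature": "chat", "input_tokens": 3000, "output_tokens": 500, "cost_usd": 0.0165},
]
summary = summarize_costs(records)
```

Sorting by total cost puts the feature most worth optimizing at the top of the report.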

Technique 1: Prompt Caching (Biggest Win)

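If your prompts share a large system prompt or context, cache it. With Anthropic's prompt caching, tokens read from the cache are billed at 10% of the normal input price, a 90% saving on the cached portion. Writing to the cache costs slightly more than a normal input token (25% more with the default 5-minute lifetime), so caching pays off once the same prefix is reused.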

python

# Without caching: pay full price for system prompt every time
# Large system prompt (2000 tokens) × 1000 requests/day = 2M tokens/day = $6/day
 
# With caching: pay 10% on cache hits after first request
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": your_large_system_prompt,  # 2000 tokens
            "cache_control": {"type": "ephemeral"}  # Cache this!
        }
    ],
    messages=[{"role": "user", "content": user_message}]
)
 
# Check cache performance
print(response.usage.cache_creation_input_tokens)  # First request
print(response.usage.cache_read_input_tokens)       # Subsequent requests

When it helps: Any app with a large system prompt (instructions, context, documentation) that's the same across many requests.

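Savings: If your system prompt is 1000 tokens and you get 10,000 requests/day:

Without cache: 1000 × 10,000 × $3/M = $30/day
With cache: 1000 × 10,000 × $0.30/M = ~$3/day (90% savings), plus a fraction of a cent each time the cache has to be written

Anthropic enforces a minimum cacheable prompt length that depends on the model, so check that your shared prefix is long enough to qualify.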
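The arithmetic can be checked with a small calculator. This is a sketch under stated assumptions: it covers only the shared prompt prefix, uses a $3/M base input price, bills cache reads at 10% of that price and cache writes at 125%, and takes the number of cache writes per day as an input you estimate from your traffic pattern.

```python
def daily_prompt_cost(
    prompt_tokens: int,
    requests_per_day: int,
    price_per_m: float = 3.00,       # assumed base input price, USD per million tokens
    cache_writes_per_day: int = 1,   # how often the cache must be (re)written
    read_multiplier: float = 0.10,   # cache reads billed at 10% of base price
    write_multiplier: float = 1.25,  # cache writes billed at 125% of base price
) -> tuple[float, float]:
    """Return (uncached_usd, cached_usd) per day for the shared prompt prefix."""
    per_token = price_per_m / 1_000_000
    uncached = prompt_tokens * requests_per_day * per_token
    reads = requests_per_day - cache_writes_per_day
    cached = prompt_tokens * per_token * (
        cache_writes_per_day * write_multiplier + reads * read_multiplier
    )
    return uncached, cached

# Best case: steady traffic keeps the cache warm, so it is written once
uncached, cached = daily_prompt_cost(1000, 10_000, cache_writes_per_day=1)

# Pessimistic case: the cache expires and is rewritten every five minutes
_, cached_with_expiry = daily_prompt_cost(1000, 10_000, cache_writes_per_day=288)
```

Even in the pessimistic case the cached cost stays far below the uncached one, because reads dominate the request count.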

Technique 2: Model Routing

Not every request needs GPT-4 or Claude Opus. Route by complexity:

python

# model_router.py
def route_to_model(task_type: str, complexity_score: float) -> str:
    """Route requests to appropriate model based on task."""
    
    # Simple classification/extraction → cheap model
    if task_type in ["classification", "extraction", "summarization"] and complexity_score < 0.5:
        return "claude-haiku-4-5-20251001"  # $1/M input, 3x cheaper than Sonnet
    
    # Medium complexity → Sonnet
    elif complexity_score < 0.8:
        return "claude-sonnet-4-6"  # $3/M input
    
    # Complex reasoning, coding → Opus
    else:
        return "claude-opus-4-7"  # $15/M input

Estimate complexity before routing. A keyword heuristic is enough to get started; for production, classify with embedding similarity instead:

python

import re

# Keywords suggesting complex vs. simple tasks
COMPLEX_INDICATORS = ["analyze", "reason", "compare", "design", "architect", "debug"]
SIMPLE_INDICATORS = ["summarize", "classify", "extract", "list", "format"]

def _count_hits(keywords: list[str], text: str) -> int:
    # \b anchors each keyword to the start of a word, so "format" still matches
    # "formatting" but no longer fires on "information"
    return sum(1 for kw in keywords if re.search(rf"\b{kw}", text))

def estimate_complexity(prompt: str) -> float:
    """Simple keyword heuristic; use embedding similarity for production."""
    prompt_lower = prompt.lower()
    complex_count = _count_hits(COMPLEX_INDICATORS, prompt_lower)
    simple_count = _count_hits(SIMPLE_INDICATORS, prompt_lower)

    if complex_count + simple_count == 0:
        return 0.5  # Default to medium

    return complex_count / (complex_count + simple_count)
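One way the embedding-based variant could look is sketched below. It scores a prompt by how close it sits to a few labeled example prompts. The `embed` callable, the example lists, and `toy_embed` are all illustrative assumptions: `toy_embed` is a word-count stand-in that keeps the sketch runnable, and in practice you would pass a function that calls a real embedding model and precompute the exemplar vectors once.

```python
import math
from typing import Callable, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def embedding_complexity(
    prompt: str,
    embed: Callable[[str], Sequence[float]],
    simple_examples: list[str],
    complex_examples: list[str],
) -> float:
    """Return a score in [0, 1]; higher means closer to the complex exemplars."""
    vec = embed(prompt)
    sim_simple = max(cosine(vec, embed(e)) for e in simple_examples)
    sim_complex = max(cosine(vec, embed(e)) for e in complex_examples)
    # Shift the similarity gap so "equally close" maps to 0.5, then clamp to [0, 1]
    return min(1.0, max(0.0, 0.5 + (sim_complex - sim_simple) / 2))

# Toy stand-in for a real embedding model: counts of a few vocabulary words
VOCAB = ["classify", "extract", "design", "debug"]

def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

SIMPLE = ["classify this ticket", "extract the invoice fields"]
COMPLEX = ["design a sharded database", "debug this race condition"]

score = embedding_complexity("debug the failing deploy", toy_embed, SIMPLE, COMPLEX)
```

The resulting score can be passed to `route_to_model` as `complexity_score` in place of the keyword heuristic.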

Technique 3: Response Caching

Cache identical (or semantically similar) requests:

python

import redis
import hashlib
import json
 
redis_client = redis.Redis(host='localhost', port=6379, decode_responses=True)
 
MODEL = "claude-sonnet-4-6"

def cached_llm_call(prompt: str, system: str = "", ttl: int = 3600) -> str:
    # Key on model + system + prompt; JSON-encoding the parts avoids collisions
    # between inputs that differ only in where the ":" separator would fall
    key_material = json.dumps([MODEL, system, prompt])
    cache_key = f"llm:{hashlib.sha256(key_material.encode()).hexdigest()}"

    # Check cache (compare with None so any stored value counts as a hit)
    cached = redis_client.get(cache_key)
    if cached is not None:
        return json.loads(cached)

    # Call LLM
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": prompt}]
    )
    result = response.content[0].text

    # Cache result with an expiry so stale answers age out
    redis_client.setex(cache_key, ttl, json.dumps(result))

    return result

For semantic caching (similar questions get the same answer), embed each incoming query and compare it with the embeddings of previously cached queries. If the cosine similarity is above a threshold such as 0.95, return the cached response. Libraries that implement this include GPTCache and Redis with vector search:

bash

pip install gptcache
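To make the mechanism concrete, here is a minimal in-memory sketch of a semantic cache. It is not how GPTCache or Redis implement it: the `SemanticCache` class, the `embed` callable, and the `toy_embed` stand-in are all assumptions for illustration. A real deployment would use an embedding model and a vector index rather than a linear scan.

```python
import math
from typing import Callable, Optional, Sequence

class SemanticCache:
    """In-memory cache that matches prompts by embedding similarity."""

    def __init__(self, embed: Callable[[str], Sequence[float]], threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self._entries: list[tuple[list[float], str]] = []

    @staticmethod
    def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get(self, prompt: str) -> Optional[str]:
        """Return the response of the most similar cached prompt, or None on a miss."""
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vec, response in self._entries:  # linear scan; fine for a sketch only
            score = self._cosine(query, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        self._entries.append((list(self.embed(prompt)), response))

# Toy stand-in for a real embedding model: counts of a few topic words
TOPICS = ["reset", "password", "refund", "order"]

def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(t)) for t in TOPICS]

cache = SemanticCache(toy_embed)
cache.put("how do i reset my password", "Use the 'Forgot password' link.")
hit = cache.get("reset password please")   # similar wording: served from cache
miss = cache.get("refund my order")        # unrelated: falls through to the LLM
```

The threshold is the main tuning knob: set it too low and users get answers to questions they did not ask, so validate it against real query pairs before relying on it.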

Technique 4: Batching Requests

Instead of one API call per item, batch them:

python

# Bad: 100 API calls for 100 documents
for doc in documents:
    summary = call_llm(f"Summarize: {doc}")
 
# Better: Batch in single call
def batch_summarize(documents: list[str], batch_size: int = 5) -> list[str]:
    summaries = []
    
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i+batch_size]
        
        # Format as numbered list
        prompt = "Summarize each of these documents in 2 sentences each:\n\n"
        for j, doc in enumerate(batch, 1):
            prompt += f"{j}. {doc}\n\n"
        
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=500 * len(batch),
            messages=[{"role": "user", "content": prompt}]
        )
        
        # Parse numbered responses
        summaries.extend(parse_numbered_response(response.content[0].text))
    
    return summaries
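The `parse_numbered_response` helper is not defined above. One possible implementation, assuming the model answers with items marked `1.`, `2.` (or `1)`, `2)`) at the start of a line, is sketched here. Model output does not always follow the requested format, so treat this as a starting point.

```python
import re

# An item runs from its "N." or "N)" marker to the next marker or the end of the text
_ITEM = re.compile(
    r"^\s*(\d+)[.)]\s+(.*?)(?=^\s*\d+[.)]\s|\Z)",
    re.MULTILINE | re.DOTALL,
)

def parse_numbered_response(text: str) -> list[str]:
    """Split a numbered-list reply into item strings, ordered by item number."""
    items = [(int(num), body.strip()) for num, body in _ITEM.findall(text)]
    return [body for _, body in sorted(items)]

reply = "1. First summary.\nSecond sentence.\n\n2. Another summary.\n3) Third."
parsed = parse_numbered_response(reply)
```

Before trusting the result, check that the number of parsed items equals the batch size, and fall back to per-document calls when it does not.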

For Anthropic, use the Batch API for 50% cost reduction on non-realtime tasks:

python

# Create batch
batch = client.messages.batches.create(
    requests=[
        {"custom_id": f"req-{i}", "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 500,
            "messages": [{"role": "user", "content": doc}]
        }}
        for i, doc in enumerate(documents)
    ]
)
# Results available within 24 hours at 50% cost
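Creating the batch is only half the job; the results have to be collected once processing ends. The sketch below polls and then gathers successful outputs. It assumes `client` is an `anthropic.Anthropic` instance whose SDK version exposes `retrieve` and `results` under `client.messages.batches`; the helper name `collect_batch_results` and the polling interval are illustrative choices.

```python
import time

def collect_batch_results(client, batch_id: str, poll_seconds: float = 60.0) -> dict[str, str]:
    """Wait until the batch has ended, then map custom_id -> response text."""
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            break
        time.sleep(poll_seconds)  # batches can take hours; poll sparingly

    outputs: dict[str, str] = {}
    for entry in client.messages.batches.results(batch_id):
        if entry.result.type == "succeeded":
            outputs[entry.custom_id] = entry.result.message.content[0].text
        # Entries that errored, were canceled, or expired are skipped here;
        # production code should log or retry them.
    return outputs
```

Typical usage would be `summaries = collect_batch_results(client, batch.id)`, run from a background job rather than a request handler.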

Technique 5: Max Token Optimization

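Set max_tokens as tight as possible. You are billed for the tokens actually generated, not for the limit itself, so a tight cap saves money by cutting off runaway or needlessly verbose outputs: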

python

# Bad: Always allow 4096 tokens output
response = client.messages.create(model="claude-sonnet-4-6", max_tokens=4096, messages=messages)
 
# Good: Set based on expected output
task_max_tokens = {
    "classification": 10,       # Just "positive" or "negative"
    "extraction": 100,          # A few fields
    "summarization": 300,       # Short summary
    "code_generation": 1000,    # Reasonable for most snippets
}

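Also use streaming with early stopping: read the response as it arrives and close the stream once you have what you need, or pass stop_sequences so generation ends at a known marker.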
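The early-stopping idea can be sketched independently of any SDK. The helper below consumes an iterator of text chunks and returns as soon as a stop marker appears. The name `collect_until` and its parameters are illustrative. With the Anthropic SDK the chunks would typically come from `stream.text_stream` inside a `with client.messages.stream(...) as stream:` block; leaving that block early should close the connection, but verify how your SDK version bills a stream that is closed mid-generation.

```python
from typing import Iterable

def collect_until(chunks: Iterable[str], stop_marker: str, max_chars: int = 2000) -> str:
    """Accumulate streamed text until stop_marker appears or max_chars is reached."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        idx = buffer.find(stop_marker)
        if idx != -1:
            return buffer[: idx + len(stop_marker)]  # keep text through the marker
        if len(buffer) >= max_chars:
            return buffer[:max_chars]  # hard ceiling in case the marker never shows up
    return buffer

# Simulated stream: the JSON answer is complete after the second chunk
chunks = iter(['{"label": "pos', 'itive"}', " Explanation: the tone is upbeat.", " More text."])
result = collect_until(chunks, "}")
unread = list(chunks)  # chunks that were never pulled from the stream
```

When the stopping point is a fixed string, passing it via `stop_sequences` is simpler because the API ends generation itself and no client-side logic is needed.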

Technique 6: Self-Hosted Models for High Volume

At scale, self-hosted beats API:

Volume	Cheapest Option
< 1M tokens/day	API (Haiku/Groq)
1M–10M tokens/day	API + caching
10M+ tokens/day	Self-hosted (vLLM + Mistral 7B)

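On a g4dn.xlarge ($380/month), Mistral 7B handles ~40M tokens/day at constant load. API equivalent for input tokens alone: 40M × $1/M (Haiku 4.5) = $40/day, or about $1,200/month. Output tokens are billed at a higher rate, which widens the gap further.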
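A quick way to sanity-check this comparison for your own numbers is a break-even calculation. The sketch below is deliberately simplified: it uses a single blended price per million tokens, a 30-day month, and a fixed server cost, and it ignores engineering time, redundancy, and quality differences between models. The function names and the example prices are assumptions for illustration.

```python
def monthly_api_cost(tokens_per_day: float, usd_per_million: float, days: int = 30) -> float:
    """API spend per month at a flat per-million-token price."""
    return tokens_per_day / 1_000_000 * usd_per_million * days

def break_even_tokens_per_day(
    server_usd_per_month: float, usd_per_million: float, days: int = 30
) -> float:
    """Daily token volume at which a fixed-cost server matches the API bill."""
    return server_usd_per_month / (usd_per_million * days) * 1_000_000

# Example inputs: 40M tokens/day, a $380/month GPU instance, an assumed $1/M API price
api_cost = monthly_api_cost(40_000_000, usd_per_million=1.00)
threshold = break_even_tokens_per_day(380, usd_per_million=1.00)

print(f"API: ${api_cost:,.0f}/month; self-hosting breaks even near {threshold:,.0f} tokens/day")
```

Below the break-even volume the API is cheaper; above it, self-hosting wins only if the server can actually sustain that throughput.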

Cost Monitoring Dashboard

Track these metrics in Grafana/Datadog:

python

# Prometheus metrics to expose
from prometheus_client import Counter, Histogram

llm_requests_total = Counter('llm_requests_total', 'Total LLM requests', ['model', 'task_type'])
llm_cost_total = Counter('llm_cost_usd_total', 'Total LLM cost in USD', ['model'])
# The "type" label is either "input" or "output"
llm_tokens_total = Counter('llm_tokens_total', 'Total tokens', ['model', 'type'])
llm_cache_hits = Counter('llm_cache_hits_total', 'Cache hits', ['cache_type'])
llm_latency = Histogram('llm_request_duration_seconds', 'Request latency', ['model'])

Alert when daily cost exceeds budget:

yaml

# Prometheus alert rule
- alert: LLMCostHigh
  # sum() adds up all models; without it the rule fires per model label
  expr: sum(increase(llm_cost_usd_total[24h])) > 100
  annotations:
    summary: "LLM costs exceeded $100 in the last 24 hours"

Combine prompt caching, model routing, and response caching, and most teams see 60–80% cost reduction with little or no loss in response quality. Start with prompt caching: it takes one added field on the request and has the highest ROI.

For building cost-effective LLMOps pipelines, the Anthropic API docs have detailed pricing and caching guides.
