LLM Context Window Management in Production: What Nobody Tells You

How to manage LLM context windows in production systems — token budgeting, conversation compression, RAG vs context stuffing, and real strategies for keeping your LLM application fast and cheap at scale.

By Shubham · 7 min read

Every LLM application hits the same wall eventually: the context window.

You start simple. A chatbot that answers DevOps questions. Works great. Then someone has a long conversation — 30 turns, lots of technical detail — and suddenly responses slow down, costs spike, and the model starts forgetting things it said three messages ago.

Context window management is one of those topics that only shows up once you're in production with real users. By then, fixing it is expensive. This guide covers what I've learned from building LLM-powered tools in production.

Why Context Windows Matter More Than You Think

Most developers understand context windows theoretically: "the model can only see X tokens at a time." What they don't fully internalize is the practical impact:

Cost scales with context. Every input token costs money. A 200K-token context on Claude costs roughly 100 times as much in input tokens as a 2K-token context. If you resend the entire conversation history with every request, the cost of each request grows linearly with conversation length, so the total cost of a conversation grows roughly quadratically with the number of turns.

Latency scales with context. More tokens = more compute = slower responses. A chatbot that feels instant at turn 3 might feel sluggish at turn 20 if you're not managing context.

Quality degrades with irrelevant context. This surprises people. More context is not always better. If your context is filled with tangential earlier messages, the model spends "attention" on them instead of the current question. Focused, relevant context often produces better answers than massive unfocused context.
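
To make the cost point concrete, here is a back-of-the-envelope sketch. It assumes every turn adds a fixed number of tokens and that the full history is resent on each request. The function name and the 500-tokens-per-turn figure are illustrative assumptions, not measurements.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int = 500) -> int:
    """Total input tokens billed over a conversation that resends full history.

    Request k carries k turns of history, so the total is
    tokens_per_turn * (1 + 2 + ... + turns).
    """
    return tokens_per_turn * turns * (turns + 1) // 2


# Doubling the number of turns roughly quadruples the total input tokens
for turns in (10, 20, 40):
    print(turns, cumulative_input_tokens(turns))
```

Under these assumptions, a 10-turn conversation bills 27,500 input tokens in total, while a 40-turn conversation bills 410,000.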

The Four Main Strategies

1. Sliding Window (Naive but Common)

The simplest approach: keep only the last N turns of conversation.

python
def get_context_messages(conversation_history: list, max_turns: int = 10) -> list:
    """Keep only the most recent N turns."""
    return conversation_history[-max_turns * 2:]  # *2 for user + assistant pairs

This works for simple chatbots. The problem: if something important was said in turn 2 and you're now on turn 15, that information is gone. The user gets confused when the model "forgets."
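
One way to soften the forgetting problem is to let the application flag messages that must survive the window. The sketch below is a minimal illustration: the `pinned` key is a hypothetical application-level flag (not an API field), and the function name is made up for this example.

```python
def sliding_window_with_pins(
    history: list[dict], max_turns: int = 10
) -> tuple[list[dict], list[str]]:
    """Return (recent_messages, pinned_notes).

    recent_messages: the last N turns, with the bookkeeping flag stripped.
    pinned_notes: content of older messages flagged with the hypothetical
    'pinned' key, so the caller can carry them forward separately.
    """
    cutoff = max(len(history) - max_turns * 2, 0)
    pinned_notes = [m["content"] for m in history[:cutoff] if m.get("pinned")]
    recent = [{"role": m["role"], "content": m["content"]} for m in history[cutoff:]]
    return recent, pinned_notes
```

The pinned notes are returned as plain text rather than spliced back into the message list, so the recent messages keep their user/assistant alternation. A caller could append the notes to the system prompt.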

2. Token Budget Management

Instead of counting turns, count tokens and trim when you exceed your budget:

python
import anthropic
 
client = anthropic.Anthropic()
 
 
def count_tokens(messages: list, model: str = "claude-sonnet-5") -> int:
    """Count tokens for a message list using Anthropic's token counting API."""
    response = client.messages.count_tokens(
        model=model,
        messages=messages
    )
    return response.input_tokens
 
 
def trim_to_token_budget(
    messages: list,
    system_prompt: str,
    max_tokens: int = 150_000,
    model: str = "claude-sonnet-5"
) -> list:
    """
    Trim conversation history to stay within token budget.
    Always preserves the system prompt and the most recent user message.
    """
    # Approximate the system prompt's size by counting it as a single user message
    system_tokens = count_tokens([{"role": "user", "content": system_prompt}], model)
    available = max_tokens - system_tokens - 2000  # 2K buffer for response
 
    # Start with all messages, remove oldest pairs until we fit
    trimmed = messages.copy()
    while len(trimmed) > 2:  # Keep at least the last exchange
        current_tokens = count_tokens(trimmed, model)
        if current_tokens <= available:
            break
        # Remove the oldest user+assistant pair
        trimmed = trimmed[2:]
 
    return trimmed

This is more accurate than turn counting because token count varies by message length.
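
The token-counting call is a network request, so a trimming loop that recounts after every removal can add latency. A possible variation, sketched here, is to make the counter pluggable and use a cheap local estimate inside the loop. The 4-characters-per-token ratio is a rough rule of thumb for English text, and `estimate_tokens` and `trim_with_counter` are hypothetical helper names. The sketch also assumes each message's `content` is a plain string.

```python
from typing import Callable


def estimate_tokens(messages: list[dict]) -> int:
    """Rough local estimate: about 4 characters per token (heuristic, not exact)."""
    return sum(len(m["content"]) for m in messages) // 4


def trim_with_counter(
    messages: list[dict],
    budget: int,
    count_fn: Callable[[list[dict]], int] = estimate_tokens,
) -> list[dict]:
    """Drop the oldest user+assistant pairs until count_fn reports we fit the budget."""
    trimmed = list(messages)
    while len(trimmed) > 2 and count_fn(trimmed) > budget:
        trimmed = trimmed[2:]  # remove the oldest exchange
    return trimmed
```

Because the estimate is approximate, it makes sense to keep a safety margin in the budget, or to confirm the final list with one exact count before sending.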

3. Conversation Summarization

Rather than dropping old messages entirely, compress them into a summary:

python
def summarize_conversation(messages: list, model: str = "claude-haiku-4-5-20251001") -> str:
    """
    Compress older conversation turns into a concise summary.
    Uses a faster/cheaper model for the summarization step.
    """
    client = anthropic.Anthropic()
 
    conversation_text = "\n".join([
        f"{m['role'].upper()}: {m['content']}"
        for m in messages
    ])
 
    response = client.messages.create(
        model=model,  # Use Haiku for cheap summarization
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Summarize this conversation history concisely.
Preserve: key decisions made, important facts stated, user preferences, any errors encountered.
Discard: pleasantries, repeated information, resolved issues.
 
CONVERSATION:
{conversation_text}
 
Provide a 3-5 sentence factual summary."""
        }]
    )
 
    return response.content[0].text
 
 
def manage_context_with_summary(
    messages: list,
    system_prompt: str,
    token_threshold: int = 80_000
) -> tuple[list, str]:
    """
    When context exceeds threshold, summarize old messages.
    Returns (trimmed_messages, updated_system_prompt_with_summary).
    """
    current_tokens = count_tokens(messages)
 
    if current_tokens < token_threshold:
        return messages, system_prompt
 
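    # Split: keep roughly the last 6 exchanges fresh, summarize everything before
    split = max(len(messages) - 12, 0)
    # Start the kept window on a user message so the trimmed history
    # still opens with a user turn
    while split < len(messages) and messages[split]["role"] != "user":
        split += 1
    recent_messages = messages[split:]
    old_messages = messages[:split]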
 
    if not old_messages:
        return messages, system_prompt
 
    summary = summarize_conversation(old_messages)
 
    # Inject summary into system prompt
    updated_system = f"""{system_prompt}
 
## Conversation History Summary
{summary}"""
 
    return recent_messages, updated_system

This is the strategy I recommend for production chatbots. Users rarely notice the summarization, and you maintain continuity while controlling costs.

4. RAG Instead of Context Stuffing

For applications where users reference documents or large knowledge bases, don't put the whole document in context. Use retrieval instead.

python
from anthropic import Anthropic
 
client = Anthropic()
 
 
def rag_query(
    user_question: str,
    document_chunks: list[str],
    top_k: int = 5
) -> str:
    """
    Simple semantic search + retrieval before sending to Claude.
    In production, replace simple_search with a proper vector store.
    """
    # Retrieve only relevant chunks
    relevant_chunks = simple_search(document_chunks, user_question, top_k)
 
    context = "\n\n---\n\n".join(relevant_chunks)
 
    response = client.messages.create(
        model="claude-sonnet-5",
        max_tokens=1000,
        system=f"""Answer questions based on the provided context.
If the context doesn't contain the answer, say so clearly.
 
CONTEXT:
{context}""",
        messages=[{
            "role": "user",
            "content": user_question
        }]
    )
 
    return response.content[0].text
 
 
def simple_search(chunks: list[str], query: str, top_k: int) -> list[str]:
    """
    Placeholder — replace with embeddings + vector search in production.
    (ChromaDB, Pinecone, pgvector, etc.)
    """
    # Simple keyword overlap for illustration
    scored = [(chunk, sum(1 for word in query.lower().split()
                         if word in chunk.lower())) for chunk in chunks]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in scored[:top_k]]

RAG keeps your context window small and focused regardless of how large your knowledge base is.
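
The example above takes `document_chunks` as given. As a minimal sketch of how such chunks might be produced, the helper below splits text into overlapping word-based chunks. The function name, the word-based splitting, and the default sizes are all assumptions for illustration. Real pipelines often chunk by tokens, sentences, or document structure instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

The resulting list can be passed straight to `rag_query` as `document_chunks`.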

Choosing the Right Strategy

Scenario                               Recommended Strategy
Simple chatbot, short conversations    Sliding window (last 10 turns)
Customer support bot, long sessions    Summarization + token budget
Document Q&A system                    RAG
Agent with long task context           Token budget + selective retention
High-volume production API             Token budget + cheap model for summarization

Token Budget Calculator

A quick formula for sizing your context budget:

Total context budget = model context window × 0.7  (leave headroom)
System prompt        = ~500-2000 tokens
Conversation budget  = total context budget - system prompt - max output tokens

For Claude with a 200K context window:

  • Budget: 140K tokens (200K × 0.7)
  • System prompt: ~1K tokens
  • Max output: 10K tokens
  • Available for conversation: ~129K tokens (140K - 1K - 10K)

At ~750 tokens per page of text, that's about 170 pages of conversation before you need to compress. Most users never get there in a single session, but your long-running agents will.
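
The same arithmetic as a small helper, in case you want to compute it per model or per deployment. The function name and parameter names are illustrative, and the 0.7 headroom factor is simply the one used in the formula above.

```python
def conversation_budget(
    context_window: int,
    system_prompt_tokens: int,
    max_output_tokens: int,
    headroom: float = 0.7,
) -> int:
    """Tokens left for conversation history after reserving headroom,
    the system prompt, and the response."""
    total_budget = round(context_window * headroom)
    return total_budget - system_prompt_tokens - max_output_tokens


budget = conversation_budget(200_000, system_prompt_tokens=1_000, max_output_tokens=10_000)
print(budget)          # tokens available for conversation
print(budget // 750)   # approximate pages at ~750 tokens per page
```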

Monitoring Context Usage in Production

Don't fly blind. Log token usage per request:

python
import logging
 
from anthropic import Anthropic
 
logger = logging.getLogger(__name__)
 
 
def call_claude_with_monitoring(
    messages: list,
    system: str,
    model: str = "claude-sonnet-5"
) -> str:
    client = Anthropic()
 
    response = client.messages.create(
        model=model,
        max_tokens=2000,
        system=system,
        messages=messages
    )
 
    # Log for monitoring
    logger.info({
        "event": "llm_call",
        "model": model,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "total_tokens": response.usage.input_tokens + response.usage.output_tokens,
        "estimated_cost_usd": _estimate_cost(response.usage, model)
    })
 
    return response.content[0].text
 
 
def _estimate_cost(usage, model: str) -> float:
    """Rough cost estimation — check Anthropic pricing for current rates."""
    pricing = {
        "claude-sonnet-5": {"input": 3.0, "output": 15.0},   # per million tokens
        "claude-haiku-4-5-20251001": {"input": 1.0, "output": 5.0},
    }
    p = pricing.get(model, {"input": 3.0, "output": 15.0})
    return (usage.input_tokens * p["input"] + usage.output_tokens * p["output"]) / 1_000_000

Ship this from day one. Without token monitoring, you'll only find out about runaway costs when the bill arrives.
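
Per-request logs are most useful when you can also see totals per conversation. The sketch below is a minimal in-memory tracker that accumulates usage and signals when a single request's input size crosses a warning line. The class name, method names, and threshold are hypothetical. In production you would more likely emit these numbers to your metrics system than hold them in process memory.

```python
from collections import defaultdict


class ContextUsageTracker:
    """In-memory per-conversation token totals (illustrative helper, not part of any SDK)."""

    def __init__(self, warn_at_input_tokens: int = 100_000):
        self.warn_at_input_tokens = warn_at_input_tokens
        self.totals = defaultdict(
            lambda: {"requests": 0, "input_tokens": 0, "output_tokens": 0}
        )

    def record(self, conversation_id: str, input_tokens: int, output_tokens: int) -> bool:
        """Add one request's usage; return True if its input size reached the warning line."""
        totals = self.totals[conversation_id]
        totals["requests"] += 1
        totals["input_tokens"] += input_tokens
        totals["output_tokens"] += output_tokens
        return input_tokens >= self.warn_at_input_tokens
```

A caller could pass `response.usage.input_tokens` and `response.usage.output_tokens` to `record` after each call and treat a `True` result as the cue to summarize or trim before the next request.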

The Practical Takeaway

Most LLM applications in production should use a combination:

  1. Token budget check on every request
  2. Summarization when budget is exceeded (use Haiku to save money)
  3. RAG for any large knowledge base or document reference
  4. Monitoring on every call from day one

The teams that get into trouble are the ones who build everything with "just use the full context" thinking and then can't understand why their application is slow and expensive at scale.

Context management isn't glamorous, but it's what separates a prototype from a production LLM application.

