LLM Context Window Management in Production: What Nobody Tells You
How to manage LLM context windows in production systems — token budgeting, conversation compression, RAG vs context stuffing, and real strategies for keeping your LLM application fast and cheap at scale.
Every LLM application hits the same wall eventually: the context window.
You start simple. A chatbot that answers DevOps questions. Works great. Then someone has a long conversation — 30 turns, lots of technical detail — and suddenly responses slow down, costs spike, and the model starts forgetting things it said three messages ago.
Context window management is one of those topics that only shows up once you're in production with real users. By then, fixing it is expensive. This guide covers what I've learned from building LLM-powered tools in production.
Why Context Windows Matter More Than You Think
Most developers understand context windows theoretically: "the model can only see X tokens at a time." What they don't fully internalize is the practical impact:
Cost scales with context. Every input token costs money. At a flat per-token price, a 200K-token context costs about 100 times as much in input tokens as a 2K-token context. If you resend the entire conversation history with every request, each request is larger than the last, so the total input cost of a conversation grows roughly quadratically with the number of turns.
Latency scales with context. More tokens = more compute = slower responses. A chatbot that feels instant at turn 3 might feel sluggish at turn 20 if you're not managing context.
Quality degrades with irrelevant context. This surprises people. More context is not always better. If your context is filled with tangential earlier messages, the model spends "attention" on them instead of the current question. Focused, relevant context often produces better answers than massive unfocused context.
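To make the cost point concrete, here is a rough back-of-the-envelope sketch. It assumes every turn adds the same number of tokens and that the full history is resent on each request; both are simplifications, and the function name `cumulative_input_tokens` is made up for this illustration.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed across a conversation that resends full history.

    Request k carries k turns' worth of tokens, so the total is
    tokens_per_turn * (1 + 2 + ... + turns).
    """
    return tokens_per_turn * turns * (turns + 1) // 2


for turns in (5, 10, 20, 40):
    total = cumulative_input_tokens(turns, tokens_per_turn=500)
    print(f"{turns:>2} turns -> {total:,} input tokens billed in total")
```

Under these assumptions, doubling the conversation length roughly quadruples the total input tokens you pay for.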
The Four Main Strategies
1. Sliding Window (Naive but Common)
The simplest approach: keep only the last N turns of conversation.

def get_context_messages(conversation_history: list, max_turns: int = 10) -> list:
    """Keep only the most recent N turns."""
    return conversation_history[-max_turns * 2:]  # *2 for user + assistant pairs

This works for simple chatbots. The problem: if something important was said in turn 2 and you're now on turn 15, that information is gone. The user gets confused when the model "forgets."
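One detail worth guarding against: if the history ends with the user's newest message, its length is odd and an even-sized slice starts on an assistant message. Chat APIs that expect a conversation to open with a user message may reject or misread such a window. The sketch below is one possible guard; `sliding_window` and `max_messages` are names chosen for this example, and it assumes each message is a dict with a `role` key.

```python
def sliding_window(history: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep the most recent messages, then drop any leading non-user messages."""
    if max_messages <= 0:
        return []
    window = history[-max_messages:]
    # Advance the start of the window until it opens on a user message.
    while window and window[0]["role"] != "user":
        window = window[1:]
    return window
```

The window may come back one message shorter than requested, which is usually a better trade than sending a conversation that starts mid-exchange.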
2. Token Budget Management
Instead of counting turns, count tokens and trim when you exceed your budget:

import anthropic

client = anthropic.Anthropic()


def count_tokens(messages: list, model: str = "claude-sonnet-5") -> int:
    """Count tokens for a message list using Anthropic's token counting API."""
    response = client.messages.count_tokens(
        model=model,
        messages=messages
    )
    return response.input_tokens


def trim_to_token_budget(
    messages: list,
    system_prompt: str,
    max_tokens: int = 150_000,
    model: str = "claude-sonnet-5"
) -> list:
    """
    Trim conversation history to stay within token budget.
    The system prompt is not part of the returned list; its size is only
    subtracted from the budget. The most recent exchange is always kept.
    """
    # Approximate the system prompt's size by counting it as a user message
    system_tokens = count_tokens([{"role": "user", "content": system_prompt}], model)
    available = max_tokens - system_tokens - 2000  # 2K buffer for response

    # Start with all messages, remove oldest pairs until we fit.
    # Note: each pass through the loop makes one token-counting API call.
    trimmed = messages.copy()
    while len(trimmed) > 2:  # Keep at least the last exchange
        current_tokens = count_tokens(trimmed, model)
        if current_tokens <= available:
            break
        # Remove the oldest user+assistant pair
        trimmed = trimmed[2:]
    return trimmed

This is more accurate than turn counting because token counts vary widely with message length.
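The loop above asks the API for a fresh count on every pass, which adds latency on long histories. One alternative, sketched here under loud assumptions, is to trim with a local estimate first and confirm with the real count afterwards. The helper `estimate_tokens` uses a crude "about 4 characters per token" rule of thumb rather than a real tokenizer, both function names are invented for this example, and message content is assumed to be a plain string.

```python
def estimate_tokens(text: str) -> int:
    """Very rough heuristic (about 4 characters per token); not a tokenizer."""
    return max(1, len(text) // 4)


def trim_with_local_estimate(messages: list[dict], budget: int) -> list[dict]:
    """Walk from newest to oldest, keeping messages until the estimated budget is spent."""
    kept: list[dict] = []
    used = 0
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        # Always keep the newest message, even if it alone exceeds the budget.
        if kept and used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # Restore chronological order
    return kept
```

Because the estimate can be off, leave a generous margin in `budget` and treat the API count as the source of truth before sending.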
3. Conversation Summarization
Rather than dropping old messages entirely, compress them into a summary:

def summarize_conversation(messages: list, model: str = "claude-haiku-4-5-20251001") -> str:
    """
    Compress older conversation turns into a concise summary.
    Uses a faster/cheaper model for the summarization step.
    """
    client = anthropic.Anthropic()
    conversation_text = "\n".join([
        f"{m['role'].upper()}: {m['content']}"
        for m in messages
    ])
    response = client.messages.create(
        model=model,  # Use Haiku for cheap summarization
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Summarize this conversation history concisely.
Preserve: key decisions made, important facts stated, user preferences, any errors encountered.
Discard: pleasantries, repeated information, resolved issues.

CONVERSATION:
{conversation_text}

Provide a 3-5 sentence factual summary."""
        }]
    )
    return response.content[0].text


def manage_context_with_summary(
    messages: list,
    system_prompt: str,
    token_threshold: int = 80_000
) -> tuple[list, str]:
    """
    When context exceeds threshold, summarize old messages.
    Returns (trimmed_messages, updated_system_prompt_with_summary).
    """
    current_tokens = count_tokens(messages)
    if current_tokens < token_threshold:
        return messages, system_prompt

    # Split: keep the last 6 exchanges fresh, summarize everything before
    recent_messages = messages[-12:]  # Last 6 exchanges (user + assistant)
    old_messages = messages[:-12]
    if not old_messages:
        return messages, system_prompt

    summary = summarize_conversation(old_messages)

    # Inject summary into system prompt
    updated_system = f"""{system_prompt}

## Conversation History Summary
{summary}"""
    return recent_messages, updated_system

This is the strategy I recommend for production chatbots. Users don't notice the summarization, and you maintain continuity while controlling costs.
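In a long session the summarization step will fire more than once, so the earlier summary needs to be carried forward rather than overwritten. The following is a minimal sketch of that idea, not part of the code above: `roll_summary` is a hypothetical helper, and the summarizer is passed in as a callable so the function can be exercised without an API call.

```python
from typing import Callable


def roll_summary(
    previous_summary: str,
    messages: list[dict],
    summarize: Callable[[list[dict]], str],
    keep_recent: int = 12,
) -> tuple[str, list[dict]]:
    """Fold messages older than the recent window into the running summary."""
    if keep_recent <= 0:
        raise ValueError("keep_recent must be positive")
    if len(messages) <= keep_recent:
        return previous_summary, messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    if previous_summary:
        # Present the earlier summary as context so its facts are not lost.
        old = [{"role": "user", "content": f"Earlier summary: {previous_summary}"}] + old
    return summarize(old), recent
```

In practice you would pass a function such as `summarize_conversation` as the `summarize` argument and store the returned summary alongside the session.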
4. RAG Instead of Context Stuffing
For applications where users reference documents or large knowledge bases, don't put the whole document in context. Use retrieval instead.

from anthropic import Anthropic

client = Anthropic()


def rag_query(
    user_question: str,
    document_chunks: list[str],
    top_k: int = 5
) -> str:
    """
    Retrieve the most relevant chunks, then send only those to Claude.
    In production, replace simple_search with a proper vector store.
    """
    # Retrieve only relevant chunks
    relevant_chunks = simple_search(document_chunks, user_question, top_k)
    context = "\n\n---\n\n".join(relevant_chunks)

    response = client.messages.create(
        model="claude-sonnet-5",
        max_tokens=1000,
        system=f"""Answer questions based on the provided context.
If the context doesn't contain the answer, say so clearly.

CONTEXT:
{context}""",
        messages=[{
            "role": "user",
            "content": user_question
        }]
    )
    return response.content[0].text


def simple_search(chunks: list[str], query: str, top_k: int) -> list[str]:
    """
    Placeholder: replace with embeddings + vector search in production.
    (ChromaDB, Pinecone, pgvector, etc.)
    """
    # Simple keyword overlap for illustration (substring match, not semantic)
    scored = [(chunk, sum(1 for word in query.lower().split()
                          if word in chunk.lower())) for chunk in chunks]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in scored[:top_k]]

RAG keeps your context window small and focused regardless of how large your knowledge base is.
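Retrieval needs chunks to search over, and how `document_chunks` gets produced is left open here. As one possible starting point, this sketch splits text into overlapping word windows. The function name `chunk_text` and the default sizes are assumptions for illustration; real systems often split on headings, paragraphs, or token counts instead.

```python
def chunk_text(text: str, chunk_words: int = 200, overlap_words: int = 40) -> list[str]:
    """Split text into chunks of chunk_words words, each overlapping the previous one."""
    if chunk_words <= 0 or not 0 <= overlap_words < chunk_words:
        raise ValueError("need chunk_words > 0 and 0 <= overlap_words < chunk_words")
    words = text.split()
    step = chunk_words - overlap_words
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        # Stop once a chunk has reached the end of the text.
        if start + chunk_words >= len(words):
            break
    return chunks
```

The overlap reduces the chance that a sentence needed to answer a question is cut in half at a chunk boundary.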
Choosing the Right Strategy
| Scenario | Recommended Strategy |
|---|---|
| Simple chatbot, short conversations | Sliding window (last 10 turns) |
| Customer support bot, long sessions | Summarization + token budget |
| Document Q&A system | RAG |
| Agent with long task context | Token budget + selective retention |
| High-volume production API | Token budget + cheap model for summarization |
Token Budget Calculator
A quick formula for sizing your context budget:
Total context budget = model context window × 0.7 (leave headroom)
System prompt = ~500-2000 tokens
Conversation budget = total context budget - system prompt - max output tokens

For Claude with a 200K context window:
- Total budget: 140K tokens (200K × 0.7)
- System prompt: ~1K tokens
- Max output: 10K tokens
- Available for conversation: ~129K tokens (140K - 1K - 10K)
At ~750 tokens per page of text, that's about 170 pages of conversation before you need to compress. Most users never get there in a single session, but your long-running agents will.
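The same arithmetic as a small helper, in case you want it in code rather than on a napkin. The function name `context_budget` and its parameter names are illustrative, and the 0.7 headroom factor is the rule of thumb from above, not a hard limit.

```python
def context_budget(
    context_window: int,
    system_prompt_tokens: int,
    max_output_tokens: int,
    headroom: float = 0.7,
) -> dict[str, int]:
    """Apply the sizing formula: usable budget minus system prompt and output reserve."""
    total = round(context_window * headroom)
    conversation = total - system_prompt_tokens - max_output_tokens
    return {"total_budget": total, "conversation_budget": max(conversation, 0)}


budget = context_budget(200_000, system_prompt_tokens=1_000, max_output_tokens=10_000)
print(budget)
print(budget["conversation_budget"] // 750, "pages at ~750 tokens per page")
```

With the numbers used above, this gives a 129K-token conversation budget, in line with the page estimate.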
Monitoring Context Usage in Production
Don't fly blind. Log token usage per request:

import logging

from anthropic import Anthropic

logger = logging.getLogger(__name__)


def call_claude_with_monitoring(
    messages: list,
    system: str,
    model: str = "claude-sonnet-5"
) -> str:
    client = Anthropic()
    response = client.messages.create(
        model=model,
        max_tokens=2000,
        system=system,
        messages=messages
    )

    # Log for monitoring
    logger.info({
        "event": "llm_call",
        "model": model,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "total_tokens": response.usage.input_tokens + response.usage.output_tokens,
        "estimated_cost_usd": _estimate_cost(response.usage, model)
    })
    return response.content[0].text


def _estimate_cost(usage, model: str) -> float:
    """Rough cost estimation. Check Anthropic pricing for current rates."""
    pricing = {
        # USD per million tokens; illustrative values, verify before relying on them
        "claude-sonnet-5": {"input": 3.0, "output": 15.0},
        "claude-haiku-4-5-20251001": {"input": 1.0, "output": 5.0},
    }
    # Unknown models fall back to the Sonnet-level rate
    p = pricing.get(model, {"input": 3.0, "output": 15.0})
    return (usage.input_tokens * p["input"] + usage.output_tokens * p["output"]) / 1_000_000

Ship this from day one. Without token monitoring, you'll only find out about runaway costs when the bill arrives.
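Per-request logs tell you about individual calls; runaway costs usually come from a single session that keeps growing. A minimal in-memory sketch of per-session tracking follows. `SessionTokenTracker` and its threshold are hypothetical, and a real deployment would more likely push these numbers to a metrics system than hold them in a dict.

```python
from collections import defaultdict


class SessionTokenTracker:
    """Accumulate token usage per session and flag sessions that pass a threshold."""

    def __init__(self, warn_after_input_tokens: int = 500_000) -> None:
        self.warn_after = warn_after_input_tokens
        self.input_tokens: dict[str, int] = defaultdict(int)
        self.output_tokens: dict[str, int] = defaultdict(int)

    def record(self, session_id: str, input_tokens: int, output_tokens: int) -> bool:
        """Add one call's usage; return True once the session exceeds the threshold."""
        self.input_tokens[session_id] += input_tokens
        self.output_tokens[session_id] += output_tokens
        return self.input_tokens[session_id] > self.warn_after
```

You could call `record` with the `usage` fields from each response and trigger summarization, or an alert, when it returns `True`.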
The Practical Takeaway
Most LLM applications in production should use a combination:
- Token budget check on every request
- Summarization when budget is exceeded (use Haiku to save money)
- RAG for any large knowledge base or document reference
- Monitoring on every call from day one
The teams that get into trouble are the ones who build everything with "just use the full context" thinking and then can't understand why their application is slow and expensive at scale.
Context management isn't glamorous, but it's what separates a prototype from a production LLM application.
Building LLM-powered DevOps tools? Check out Build an AI deployment health checker with Claude API and our RAG in production guide.