LLM Context Window Management for Long Documents in Production
When your documents exceed the context window, you need chunking, summarization, and retrieval strategies. Here's how to handle long context in production LLM apps.
Context windows have grown dramatically (Claude supports 200K tokens, GPT-4 Turbo supports 128K), but real production workloads still hit limits. Runbooks, codebases, support ticket histories, log files: all of these run long. This guide covers five strategies for handling them.
The Core Problem
200K tokens ≈ 150,000 words ≈ 500 pages of text
But:
- Full codebase: millions of tokens
- 6-month support ticket history: millions of tokens
- Server logs for debugging: millions of tokens
- All of the above together: impossible
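The conversion above is simple arithmetic. Here is a rough sketch, assuming the same ratio of about 0.75 words per token. Real tokenizers vary with language and content, so treat the result as an estimate rather than a count. The names `estimate_tokens` and `fits_in_window`, and the 4,000-token reserve, are illustrative choices.

```python
WORDS_PER_TOKEN = 0.75  # assumption: 200K tokens is roughly 150,000 words


def estimate_tokens(text: str) -> int:
    """Estimate the token count from the whitespace-separated word count."""
    return round(len(text.split()) / WORDS_PER_TOKEN)


def fits_in_window(text: str, window_tokens: int = 200_000, reserve_tokens: int = 4_000) -> bool:
    """Check whether text fits while leaving reserve_tokens for the prompt and the reply."""
    return estimate_tokens(text) + reserve_tokens <= window_tokens


doc = "word " * 150_000          # a synthetic 150,000-word document
print(estimate_tokens(doc))      # 200000
print(fits_in_window(doc))       # False: nothing left over for the reply
```

For billing or hard limits, count tokens with the provider's own tokenizer rather than a word-based estimate.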
Even if your document fits in the context window, stuffing everything in has costs:
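- Latency grows with context length, because every input token has to be processed before the first output token
- Cost grows in proportion to input tokens, and you pay it on every request
- Model attention gets diluted over long contexts, so details buried mid-document are easier to miss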
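To make the cost point concrete, here is a small sketch comparing a full-context call with a trimmed one. The price is a placeholder, not a real rate, so substitute your provider's current pricing. `input_cost` is an illustrative helper.

```python
def input_cost(tokens_per_call: int, calls: int, price_per_million: float) -> float:
    """Total input cost in dollars when every call sends tokens_per_call tokens."""
    return tokens_per_call * calls * price_per_million / 1_000_000


PRICE = 3.0  # hypothetical dollars per million input tokens

full = input_cost(150_000, calls=1_000, price_per_million=PRICE)
trimmed = input_cost(5_000, calls=1_000, price_per_million=PRICE)

print(f"full context:      ${full:,.2f}")
print(f"retrieved context: ${trimmed:,.2f}")
print(f"ratio:             {full / trimmed:.0f}x")
```

Whatever the actual price, the ratio between the two stays the same: sending 30 times as many input tokens costs 30 times as much.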
Strategy 1: Retrieval Augmented Generation (RAG)
Instead of sending the full document, find and send only the relevant sections.
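The retrieval step can be illustrated without a vector database. This is a minimal sketch, assuming a toy bag-of-words vector in place of a real embedding model. `toy_embed`, `cosine`, and `top_k_chunks` are hypothetical helpers, not part of ChromaDB or Sentence Transformers. It ranks chunks by cosine similarity to the question and keeps the best k.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question, best first."""
    q = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "restart the nginx service with systemctl restart nginx",
    "rotate database credentials every 90 days",
    "nginx returns 502 when the upstream service is down",
]
print(top_k_chunks("why does nginx return 502", chunks, k=1))
```

A production pipeline swaps the toy vectors for a real embedding model and the list scan for a vector store: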
from anthropic import Anthropic
import chromadb
from sentence_transformers import SentenceTransformer

client = Anthropic()
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
chroma_client = chromadb.HttpClient(host="chromadb", port=8000)
collection = chroma_client.get_collection("runbooks")

def answer_with_rag(question: str, top_k: int = 5) -> str:
    """Answer a question using RAG over runbook documents."""
    # 1. Embed the question
    question_embedding = embedding_model.encode(question).tolist()

    # 2. Find relevant chunks
    results = collection.query(
        query_embeddings=[question_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )

    # 3. Build context from top-k chunks
    context_parts = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0]
    ):
        if dist < 0.8:  # relevance threshold
            source = meta.get("source", "unknown")
            section = meta.get("section", "")
            context_parts.append(f"[Source: {source} / {section}]\n{doc}")

    context = "\n\n---\n\n".join(context_parts)
    if not context:
        return "No relevant information found in the knowledge base."

    # 4. Ask Claude with the retrieved context
    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1500,
        system="You are a helpful SRE assistant. Answer questions based ONLY on the provided context. If the answer isn't in the context, say so.",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )
    return response.content[0].text

Strategy 2: Sliding Window with Overlap
For sequential documents where order matters (logs, conversation history), use a sliding window:
def sliding_window_chunks(text: str, window_size: int = 2000, overlap: int = 200) -> list[dict]:
    """Split text into overlapping chunks. Sizes are in characters, not tokens."""
    if overlap >= window_size:
        # Otherwise the window would never advance and the loop would not terminate
        raise ValueError("overlap must be smaller than window_size")

    chunks = []
    start = 0
    chunk_idx = 0
    while start < len(text):
        end = min(start + window_size, len(text))  # clamp so end_char stays in range
        chunks.append({
            "content": text[start:end],
            "chunk_id": chunk_idx,
            "start_char": start,
            "end_char": end,
        })
        if end == len(text):
            break  # last window reached; avoids a trailing chunk that is pure overlap
        start += window_size - overlap  # move forward, keeping overlap
        chunk_idx += 1
    return chunks

def analyze_logs_sliding(log_content: str, question: str) -> str:
    """Analyze logs using sliding window when they exceed context."""
    chunks = sliding_window_chunks(log_content, window_size=8000, overlap=500)
    insights = []
    for i, chunk in enumerate(chunks):
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",  # use cheap model for chunks
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": f"Log section {i+1}/{len(chunks)}:\n\n{chunk['content']}\n\nQuestion: {question}\n\nExtract only the relevant information (or 'nothing relevant' if this section doesn't help):"
            }]
        )
        insight = response.content[0].text
        if "nothing relevant" not in insight.lower():
            insights.append(f"[Section {i+1}]: {insight}")

    if not insights:
        return "No relevant information found in logs."

    # Synthesize with the full model
    synthesis_prompt = f"""I've analyzed {len(chunks)} sections of logs for: "{question}"

Here are the relevant findings:
{chr(10).join(insights)}

Synthesize these findings into a clear answer:"""
    final_response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1000,
        messages=[{"role": "user", "content": synthesis_prompt}]
    )
    return final_response.content[0].text

Strategy 3: Hierarchical Summarization
For very long documents, summarize first, then answer:
def hierarchical_summarize(text: str, max_summary_length: int = 3000) -> str:
    """Summarize a long document hierarchically."""
    # Split into sections
    sections = split_into_sections(text, max_chars=4000)

    # Summarize each section with the cheaper model
    section_summaries = []
    for section in sections:
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"Summarize this section concisely (max 200 words), preserving key technical details:\n\n{section}"
            }]
        )
        section_summaries.append(response.content[0].text)

    # Combine summaries
    combined = "\n\n".join(section_summaries)

    # If still too long, summarize again
    if len(combined) > max_summary_length:
        response = client.messages.create(
            model="claude-opus-4-8",
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": f"Combine these section summaries into a single coherent summary:\n\n{combined}"
            }]
        )
        return response.content[0].text
    return combined

def split_into_sections(text: str, max_chars: int = 4000) -> list[str]:
    """Split text on double newlines, respecting max size.

    A single paragraph longer than max_chars is kept whole.
    """
    paragraphs = text.split("\n\n")
    sections = []
    current = ""
    for para in paragraphs:
        if len(current) + len(para) > max_chars and current:
            sections.append(current.strip())
            current = para
        else:
            # Join with a blank line, but don't start a section with one
            current = f"{current}\n\n{para}" if current else para
    if current.strip():
        sections.append(current.strip())
    return sections

Strategy 4: Smart Context Selection
For code analysis, select only the relevant files/functions:
import os

def extract_relevant_code(repo_path: str, question: str, max_chars: int = 20000) -> str:
    """Extract relevant code files based on the question."""
    # Get all Python files, skipping VCS, cache, and dependency directories
    all_files = []
    for root, dirs, files in os.walk(repo_path):
        dirs[:] = [d for d in dirs if d not in ['.git', '__pycache__', 'node_modules', '.venv']]
        for f in files:
            if f.endswith('.py'):
                all_files.append(os.path.join(root, f))

    # Quick relevance filter using keyword matching (a simplified stand-in for embeddings)
    relevant_keywords = question.lower().split()
    scored_files = []
    for filepath in all_files:
        # Tolerate files that are not valid UTF-8 instead of crashing mid-scan
        with open(filepath, encoding="utf-8", errors="ignore") as f:
            content = f.read()
        content_lower = content.lower()
        score = sum(1 for kw in relevant_keywords if kw in content_lower)
        if score > 0:
            scored_files.append((score, filepath, content))

    # Sort by relevance, take top files up to max_chars
    scored_files.sort(key=lambda item: item[0], reverse=True)
    selected_content = ""
    for score, filepath, content in scored_files:
        relative_path = os.path.relpath(filepath, repo_path)
        file_block = f"\n# File: {relative_path}\n{content}\n"
        if len(selected_content) + len(file_block) > max_chars:
            break
        selected_content += file_block
    return selected_content

Strategy 5: Conversation History Compression
For multi-turn conversations, old turns eat your context budget:
class ConversationManager:
    def __init__(self, max_messages: int = 20, compression_threshold: int = 15):
        self.messages = []
        self.max_messages = max_messages
        self.compression_threshold = compression_threshold

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) >= self.compression_threshold:
            self._compress_history()

    def _compress_history(self):
        """Summarize old messages to free up context."""
        if len(self.messages) <= 6:
            return

        # Keep the last 4 messages (5 if needed, so the kept tail starts on a
        # user turn and roles still alternate after the summary pair below)
        keep = 4
        if self.messages[-keep]["role"] != "user":
            keep += 1
        old_messages = self.messages[:-keep]
        recent_messages = self.messages[-keep:]

        summary_prompt = "Summarize this conversation history concisely, preserving key facts and decisions:\n\n"
        for msg in old_messages:
            summary_prompt += f"{msg['role'].upper()}: {msg['content']}\n\n"

        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=500,
            messages=[{"role": "user", "content": summary_prompt}]
        )
        summary = response.content[0].text

        # Replace old history with summary
        self.messages = [
            {"role": "user", "content": f"[Previous conversation summary: {summary}]"},
            {"role": "assistant", "content": "I understand the conversation history."},
        ] + recent_messages

    def get_messages(self) -> list:
        return self.messages

Choosing the Right Strategy
| Situation | Best Strategy |
|---|---|
| Large knowledge base, specific questions | RAG |
| Sequential logs, finding patterns | Sliding window |
| Single long document, need overview | Hierarchical summarization |
| Codebase analysis, targeted questions | Smart file selection |
| Long multi-turn chat | History compression |
| Short documents (< 100K tokens) | Full context (simplest) |
The rule: don't add complexity until the simple approach fails. Start with full context if it fits. Add RAG when it doesn't.
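The table and the rule can be folded into a small dispatch function. This is a sketch only: the `kind` labels and strategy names are hypothetical identifiers chosen for illustration, and the 100K-token cutoff is the one from the table.

```python
# Hypothetical labels mapping each situation in the table to its strategy
STRATEGY_BY_KIND = {
    "knowledge_base": "rag",
    "logs": "sliding_window",
    "single_document": "hierarchical_summarization",
    "codebase": "smart_file_selection",
    "chat": "history_compression",
}


def choose_strategy(kind: str, estimated_tokens: int, full_context_limit: int = 100_000) -> str:
    """Prefer full context when the input is small; otherwise pick by document kind."""
    if estimated_tokens < full_context_limit:
        return "full_context"  # simplest option, used whenever the input fits
    # Unknown kinds fall back to RAG, following the rule above
    return STRATEGY_BY_KIND.get(kind, "rag")


print(choose_strategy("logs", 40_000))       # full_context
print(choose_strategy("logs", 2_000_000))    # sliding_window
print(choose_strategy("wiki", 2_000_000))    # rag
```

Checking size first keeps the simple path as the default, and the more involved strategies only come into play once the input no longer fits.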
Tools: LangChain | ChromaDB | Sentence Transformers