LLM Context Window Management for Long Documents in Production

When your documents exceed the context window, you need chunking, summarization, and retrieval strategies. Here's how to handle long context in production LLM apps.

Context windows have grown dramatically (Claude supports 200K tokens, GPT-4 Turbo supports 128K), but real production workloads still hit limits. Runbooks, codebases, support ticket histories, and log files are all long. The strategies below cover the cases that come up most often.

The Core Problem

200K tokens ≈ 150,000 words ≈ 500 pages of text

But:
- Full codebase: millions of tokens
- 6-month support ticket history: millions of tokens
- Server logs for debugging: millions of tokens
- All of the above together: impossible
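A quick way to check whether a document is likely to fit is to estimate its token count from its word count. The sketch below uses the ratio implied above (200K tokens for roughly 150,000 words, so about 0.75 words per token). The function names, the 0.75 default, and the 4,000-token reserve are illustrative assumptions; a real tokenizer will give different numbers, especially for code and logs.

```python
def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough token estimate from the whitespace-separated word count."""
    word_count = len(text.split())
    return round(word_count / words_per_token)


def fits_in_window(text: str, window_tokens: int = 200_000, reserve_tokens: int = 4_000) -> bool:
    """True if the estimate leaves room for instructions and the model's reply.

    reserve_tokens is an assumed safety margin, not a documented value.
    """
    return estimate_tokens(text) + reserve_tokens <= window_tokens
```

Treat the result as a first filter only: when the estimate lands close to the limit, count tokens with the provider's own tooling before deciding.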

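Even if your document fits in the context window, stuffing everything in has costs:

- Latency grows with context length, because every input token has to be processed before the first output token appears
- Cost grows in proportion to input tokens, and you pay it again on every request
- Model attention gets diluted over long contexts, so a detail buried among irrelevant text is easier to miss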
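To make the cost point concrete, here is a back-of-the-envelope comparison between sending a whole document and sending only a retrieved subset. The price used is a placeholder chosen for the arithmetic, not a real quote, and the function names are hypothetical.

```python
def input_cost_usd(input_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-side cost of one request at a given per-million-token price."""
    return input_tokens / 1_000_000 * usd_per_million_tokens


def daily_input_cost(tokens_per_request: int, requests_per_day: int,
                     usd_per_million_tokens: float) -> float:
    """Input-side cost of a day's traffic, assuming every request is the same size."""
    return requests_per_day * input_cost_usd(tokens_per_request, usd_per_million_tokens)


PLACEHOLDER_PRICE = 3.00  # assumed USD per million input tokens; check your provider's pricing

full_context = daily_input_cost(150_000, 1_000, PLACEHOLDER_PRICE)
retrieved_only = daily_input_cost(5_000, 1_000, PLACEHOLDER_PRICE)
print(f"full context: ${full_context:,.2f}/day, retrieved only: ${retrieved_only:,.2f}/day")
```

Because cost scales with input tokens, the ratio between the two figures is simply the ratio of the context sizes, whatever the actual price is.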

Strategy 1: Retrieval Augmented Generation (RAG)

Instead of sending the full document, find and send only the relevant sections.

python

from anthropic import Anthropic
import chromadb
from sentence_transformers import SentenceTransformer
 
client = Anthropic()
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
chroma_client = chromadb.HttpClient(host="chromadb", port=8000)
collection = chroma_client.get_collection("runbooks")
 
 
def answer_with_rag(question: str, top_k: int = 5) -> str:
    """Answer a question using RAG over runbook documents."""
    
    # 1. Embed the question
    question_embedding = embedding_model.encode(question).tolist()
    
    # 2. Find relevant chunks
    results = collection.query(
        query_embeddings=[question_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )
    
    # 3. Build context from top-k chunks
    context_parts = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0], 
        results["distances"][0]
    ):
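        # Relevance threshold: a lower distance means a closer match. The right cutoff
        # depends on the collection's distance metric and the embedding model, so treat
        # 0.8 as a starting point and tune it on your own data.
        if dist < 0.8: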
            source = meta.get("source", "unknown")
            section = meta.get("section", "")
            context_parts.append(f"[Source: {source} / {section}]\n{doc}")
    
    context = "\n\n---\n\n".join(context_parts)
    
    if not context:
        return "No relevant information found in the knowledge base."
    
    # 4. Ask Claude with the retrieved context
    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1500,
        system="You are a helpful SRE assistant. Answer questions based ONLY on the provided context. If the answer isn't in the context, say so.",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )
    
    return response.content[0].text
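The query side above assumes the collection already holds chunks with `source` and `section` metadata. A minimal sketch of preparing those records might look like the following. The function name `build_chunk_records`, the choice of a truncated SHA-1 of `source/section` as the id, and the input shape (a dict of section title to text) are all assumptions made for illustration.

```python
import hashlib


def build_chunk_records(source: str, sections: dict[str, str]) -> dict:
    """Turn one document's sections into parallel lists of ids, documents, and metadatas."""
    ids, documents, metadatas = [], [], []
    for section, text in sections.items():
        text = text.strip()
        if not text:
            continue  # skip empty sections so they never show up as retrieval hits
        # Deterministic id: re-indexing the same section yields the same id
        digest = hashlib.sha1(f"{source}/{section}".encode("utf-8")).hexdigest()[:16]
        ids.append(digest)
        documents.append(text)
        metadatas.append({"source": source, "section": section})
    return {"ids": ids, "documents": documents, "metadatas": metadatas}


records = build_chunk_records(
    "runbooks/postgres.md",
    {"Failover": "Promote the replica, then repoint the service.", "Notes": "   "},
)
print(records["metadatas"])
```

With the `collection` and `embedding_model` objects from the example above, the records could then be stored with something like `collection.add(embeddings=embedding_model.encode(records["documents"]).tolist(), **records)`. Check that call against the ChromaDB version you run.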

Strategy 2: Sliding Window with Overlap

For sequential documents where order matters (logs, conversation history), use a sliding window:

python

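def sliding_window_chunks(text: str, window_size: int = 2000, overlap: int = 200) -> list[dict]:
    """Split text into overlapping character windows for better context preservation."""
    if not 0 <= overlap < window_size:
        # Otherwise the window would never advance and the loop would not terminate
        raise ValueError("overlap must be >= 0 and smaller than window_size")

    chunks = []
    start = 0
    chunk_idx = 0

    while start < len(text):
        end = min(start + window_size, len(text))  # clamp so end_char stays in range

        chunks.append({
            "content": text[start:end],
            "chunk_id": chunk_idx,
            "start_char": start,
            "end_char": end,
        })

        if end == len(text):
            break  # this window reached the end; skip a trailing chunk that is pure overlap

        start += window_size - overlap  # move forward, keeping overlap
        chunk_idx += 1

    return chunks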
 
 
def analyze_logs_sliding(log_content: str, question: str) -> str:
    """Analyze logs using sliding window when they exceed context."""
    chunks = sliding_window_chunks(log_content, window_size=8000, overlap=500)
    
    insights = []
    for i, chunk in enumerate(chunks):
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",  # use cheap model for chunks
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": f"Log section {i+1}/{len(chunks)}:\n\n{chunk['content']}\n\nQuestion: {question}\n\nExtract only the relevant information (or 'nothing relevant' if this section doesn't help):"
            }]
        )
        
        insight = response.content[0].text
        if "nothing relevant" not in insight.lower():
            insights.append(f"[Section {i+1}]: {insight}")
    
    if not insights:
        return "No relevant information found in logs."
    
    # Synthesize with the full model
    synthesis_prompt = f"""I've analyzed {len(chunks)} sections of logs for: "{question}"
    
Here are the relevant findings:
{chr(10).join(insights)}
 
Synthesize these findings into a clear answer:"""
 
    final_response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1000,
        messages=[{"role": "user", "content": synthesis_prompt}]
    )
    
    return final_response.content[0].text
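Character windows can cut a log line in half, which makes timestamps and stack traces harder to read at chunk boundaries. One possible variation, sketched here with hypothetical names and defaults, splits on line boundaries and carries the last few lines of each chunk into the next one as the overlap.

```python
def line_window_chunks(log_text: str, max_chars: int = 8000, overlap_lines: int = 5) -> list[str]:
    """Group whole lines into chunks of about max_chars, repeating the last
    overlap_lines lines of each chunk at the start of the next."""
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for line in log_text.splitlines():
        line_len = len(line) + 1  # +1 for the newline that join() puts back
        if current and current_len + line_len > max_chars:
            chunks.append("\n".join(current))
            # Carry the tail forward; carried lines count toward the next chunk's budget
            current = current[-overlap_lines:] if overlap_lines > 0 else []
            current_len = sum(len(kept) + 1 for kept in current)
        current.append(line)
        current_len += line_len

    if current:
        chunks.append("\n".join(current))
    return chunks


sample = "\n".join(f"line {i}" for i in range(10))
for chunk in line_window_chunks(sample, max_chars=30, overlap_lines=1):
    print(repr(chunk))
```

A chunk can exceed `max_chars` when a single line, or the carried-over tail, is longer than the budget. The sketch accepts that rather than splitting a line.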

Strategy 3: Hierarchical Summarization

For very long documents, summarize first, then answer:

python

def hierarchical_summarize(text: str, max_summary_length: int = 3000) -> str:
    """Summarize a long document hierarchically."""
    # Split into sections
    sections = split_into_sections(text, max_chars=4000)
    
    # Summarize each section
    section_summaries = []
    for i, section in enumerate(sections):
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"Summarize this section concisely (max 200 words), preserving key technical details:\n\n{section}"
            }]
        )
        section_summaries.append(response.content[0].text)
    
    # Combine summaries
    combined = "\n\n".join(section_summaries)
    
    # If still too long, summarize again
    if len(combined) > max_summary_length:
        response = client.messages.create(
            model="claude-opus-4-8",
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": f"Combine these section summaries into a single coherent summary:\n\n{combined}"
            }]
        )
        return response.content[0].text
    
    return combined
 
 
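def split_into_sections(text: str, max_chars: int = 4000) -> list[str]:
    """Split text on double newlines, grouping paragraphs into sections of up to max_chars.

    A single paragraph longer than max_chars is kept whole as its own section.
    """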
    paragraphs = text.split("\n\n")
    sections = []
    current = ""
    
    for para in paragraphs:
        if len(current) + len(para) > max_chars and current:
            sections.append(current.strip())
            current = para
        else:
            current += "\n\n" + para
    
    if current.strip():
        sections.append(current.strip())
    
    return sections
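The function above does a single extra pass when the combined summaries are too long. For very large inputs, that combined text may itself be too big for one request. One way to handle this, sketched below under assumptions, is to keep reducing in groups until the result fits. The `summarize` parameter is a hypothetical callable (in practice it would wrap a `client.messages.create` call), and `group_size` and `max_rounds` are illustrative defaults.

```python
from typing import Callable


def reduce_summaries(
    summaries: list[str],
    summarize: Callable[[str], str],
    max_chars: int = 3000,
    group_size: int = 5,
    max_rounds: int = 5,
) -> str:
    """Repeatedly summarize groups of summaries until the joined text fits max_chars."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each round shrinks the list")

    current = list(summaries)
    for _ in range(max_rounds):
        combined = "\n\n".join(current)
        if len(combined) <= max_chars or len(current) <= 1:
            return combined
        # One level of the hierarchy: each group of summaries becomes one summary
        groups = [current[i:i + group_size] for i in range(0, len(current), group_size)]
        current = [summarize("\n\n".join(group)) for group in groups]

    # Round limit reached: return what we have rather than looping forever
    return "\n\n".join(current)


# Stand-in summarizer for demonstration only: keeps the first 10 characters
demo = reduce_summaries(["x" * 100] * 10, summarize=lambda text: text[:10], max_chars=50)
print(len(demo))
```

Passing the summarizer in as a callable keeps the reduction logic testable without network calls. The round limit guards against a summarizer that never shortens its input.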

Strategy 4: Smart Context Selection

For code analysis, select only the relevant files/functions:

python

import os
 
def extract_relevant_code(repo_path: str, question: str, max_chars: int = 20000) -> str:
    """Extract relevant code files based on the question."""
    
    # Get all Python files
    all_files = []
    for root, dirs, files in os.walk(repo_path):
        dirs[:] = [d for d in dirs if d not in ['.git', '__pycache__', 'node_modules', '.venv']]
        for f in files:
            if f.endswith('.py'):
                all_files.append(os.path.join(root, f))
    
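    # Quick relevance filter: count how many question keywords appear in each file.
    # This is a simple stand-in for embedding-based ranking.
    relevant_keywords = set(question.lower().split())
    scored_files = []

    for filepath in all_files:
        # Ignore undecodable bytes so one odd file doesn't abort the whole scan
        with open(filepath, encoding="utf-8", errors="ignore") as f:
            content = f.read()

        content_lower = content.lower()
        score = sum(1 for kw in relevant_keywords if kw in content_lower)
        if score > 0:
            scored_files.append((score, filepath, content))

    # Sort by relevance, take top files up to max_chars
    scored_files.sort(key=lambda item: item[0], reverse=True)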
    
    selected_content = ""
    for score, filepath, content in scored_files:
        relative_path = os.path.relpath(filepath, repo_path)
        file_block = f"\n# File: {relative_path}\n{content}\n"
        
        if len(selected_content) + len(file_block) > max_chars:
            continue  # this file doesn't fit, but a smaller one further down still might
        selected_content += file_block
    
    return selected_content
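Whole files are a coarse unit: one matching function can drag in thousands of unrelated lines. A finer-grained option, sketched here with the standard-library `ast` module, is to pull out only the top-level functions and classes whose source mentions a keyword. The function name and the matching rule (case-insensitive substring match) are illustrative assumptions.

```python
import ast


def extract_matching_definitions(source: str, keywords: list[str]) -> list[str]:
    """Return the source of top-level functions/classes that mention any keyword."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return []  # not valid Python for this interpreter; the caller can fall back to the whole file

    lowered = [kw.lower() for kw in keywords]
    matches = []
    for node in tree.body:  # top level only, so methods are not returned twice
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment and any(kw in segment.lower() for kw in lowered):
                matches.append(segment)
    return matches


example_source = (
    "import os\n"
    "\n"
    "def retry_request(url):\n"
    "    return url\n"
    "\n"
    "def unrelated():\n"
    "    return 1\n"
)
print(extract_matching_definitions(example_source, ["retry"]))
```

The extracted segments could replace `content` when building each file block, which lets more files fit under the same `max_chars` budget.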

Strategy 5: Conversation History Compression

For multi-turn conversations, old turns eat your context budget:

python

class ConversationManager:
    def __init__(self, max_messages: int = 20, compression_threshold: int = 15):
        self.messages = []
        self.max_messages = max_messages
        self.compression_threshold = compression_threshold
    
    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        
        if len(self.messages) >= self.compression_threshold:
            self._compress_history()
    
    def _compress_history(self):
        """Summarize old messages to free up context."""
        if len(self.messages) <= 6:
            return
        
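        # Keep about the last 4 messages and compress the rest. Start the kept tail on a
        # user turn so it follows the synthetic assistant reply added below.
        split = len(self.messages) - 4
        while split > 0 and self.messages[split]["role"] != "user":
            split -= 1
        if split == 0:
            return  # nothing older than the kept tail to summarize
        old_messages = self.messages[:split]
        recent_messages = self.messages[split:]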
        
        summary_prompt = "Summarize this conversation history concisely, preserving key facts and decisions:\n\n"
        for msg in old_messages:
            summary_prompt += f"{msg['role'].upper()}: {msg['content']}\n\n"
        
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=500,
            messages=[{"role": "user", "content": summary_prompt}]
        )
        
        summary = response.content[0].text
        
        # Replace old history with summary
        self.messages = [
            {"role": "user", "content": f"[Previous conversation summary: {summary}]"},
            {"role": "assistant", "content": "I understand the conversation history."},
        ] + recent_messages
    
    def get_messages(self) -> list:
        return self.messages
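Counting messages is a rough proxy for context usage, since one pasted stack trace can outweigh twenty short turns. A possible refinement, sketched below, decides what to keep by size instead. The function name and the character budget are hypothetical, and characters stand in for tokens here.

```python
def split_by_budget(messages: list[dict], recent_budget_chars: int = 4000) -> tuple[list[dict], list[dict]]:
    """Split history into (old, recent), keeping the newest messages that fit the budget.

    The newest message is always kept, even if it alone exceeds the budget.
    """
    used = 0
    split_at = len(messages)
    for i in range(len(messages) - 1, -1, -1):  # walk from newest to oldest
        used += len(messages[i]["content"])
        if used > recent_budget_chars and split_at < len(messages):
            break
        split_at = i
    return messages[:split_at], messages[split_at:]


history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": "x" * 10}
    for i in range(5)
]
old, recent = split_by_budget(history, recent_budget_chars=25)
print(len(old), len(recent))
```

The `old` list would then go to the summarization call and `recent` would be kept verbatim, in the same way `_compress_history` treats its two slices.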

Choosing the Right Strategy

| Situation | Best Strategy |
| --- | --- |
| Large knowledge base, specific questions | RAG |
| Sequential logs, finding patterns | Sliding window |
| Single long document, need overview | Hierarchical summarization |
| Codebase analysis, targeted questions | Smart file selection |
| Long multi-turn chat | History compression |
| Short documents (< 100K tokens) | Full context (simplest) |

The rule: don't add complexity until the simple approach fails. Start with full context if it fits. Add RAG when it doesn't.
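The decision can be written down as a small lookup, which is handy when one service handles several kinds of input. This is only a sketch of the guidance above: the workload labels and the function name are made up for illustration, and the 100K-token cutoff is taken from the "short documents" case.

```python
def choose_strategy(estimated_tokens: int, workload: str, full_context_limit: int = 100_000) -> str:
    """Pick a context strategy: full context if it fits, otherwise by workload type."""
    if estimated_tokens < full_context_limit:
        return "full context"  # simplest option; use it whenever the input fits

    by_workload = {
        "knowledge_base": "RAG",
        "logs": "sliding window",
        "single_document": "hierarchical summarization",
        "codebase": "smart file selection",
        "chat": "history compression",
    }
    # Unknown workloads fall back to RAG, the first step up from full context
    return by_workload.get(workload, "RAG")


print(choose_strategy(20_000, "logs"))
print(choose_strategy(2_000_000, "logs"))
```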

Tools: LangChain | ChromaDB | Sentence Transformers
