
LLM Context Window Management for Long Documents in Production

When your documents exceed the context window, you need chunking, summarization, and retrieval strategies. Here's how to handle long context in production LLM apps.

DevOpsBoys | 6 min read

Context windows have grown dramatically (Claude supports 200K tokens, GPT-4 Turbo supports 128K), but real production workloads still hit limits. Runbooks, codebases, support ticket histories, and log files are all long. Here's how to handle them properly.

The Core Problem

200K tokens ≈ 150,000 words ≈ 500 pages of text

But:
- Full codebase: millions of tokens
- 6-month support ticket history: millions of tokens
- Server logs for debugging: millions of tokens
- All of the above together: impossible

Even if your document fits in the context window, stuffing everything in has costs:

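  • Latency grows with context length, because the model has to process every input token before it can start answering
  • Cost scales with the number of input tokens, since APIs bill per token
  • Model attention gets diluted over long contexts, so details buried in the middle are easier to miss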
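
To check whether a document is likely to fit before choosing a strategy, a rough estimate is often enough. The sketch below uses the words-per-token ratio implied by the figures above (150,000 words per 200K tokens). The function names and the `reserved_for_output` parameter are illustrative assumptions, and a real tokenizer will give different counts, especially for code and logs.

```python
import math

# Ratio implied by "200K tokens ≈ 150,000 words"; an approximation, not a tokenizer.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Rough token estimate based on whitespace-separated word count."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


def fits_in_context(text: str, context_window: int = 200_000,
                    reserved_for_output: int = 1_500) -> bool:
    """True if the estimated prompt size leaves room for the response."""
    return estimate_tokens(text) <= context_window - reserved_for_output
```

If the estimate is anywhere near the limit, treat it as "does not fit" and count tokens with the provider's own tooling before relying on it.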

Strategy 1: Retrieval Augmented Generation (RAG)

Instead of sending the full document, find and send only the relevant sections.

python
from anthropic import Anthropic
import chromadb
from sentence_transformers import SentenceTransformer
 
client = Anthropic()
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
chroma_client = chromadb.HttpClient(host="chromadb", port=8000)
collection = chroma_client.get_collection("runbooks")
 
 
def answer_with_rag(question: str, top_k: int = 5) -> str:
    """Answer a question using RAG over runbook documents."""
    
    # 1. Embed the question
    question_embedding = embedding_model.encode(question).tolist()
    
    # 2. Find relevant chunks
    results = collection.query(
        query_embeddings=[question_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )
    
    # 3. Build context from top-k chunks
    context_parts = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0], 
        results["distances"][0]
    ):
        if dist < 0.8:  # relevance threshold; tune it for your collection's distance metric
            source = meta.get("source", "unknown")
            section = meta.get("section", "")
            context_parts.append(f"[Source: {source} / {section}]\n{doc}")
    
    context = "\n\n---\n\n".join(context_parts)
    
    if not context:
        return "No relevant information found in the knowledge base."
    
    # 4. Ask Claude with the retrieved context
    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1500,
        system="You are a helpful SRE assistant. Answer questions based ONLY on the provided context. If the answer isn't in the context, say so.",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )
    
    return response.content[0].text
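
The `0.8` cutoff in step 3 only means something relative to the distance metric the collection uses. As a minimal, dependency-free sketch (toy 2-D vectors stand in for real embeddings, and all function names here are hypothetical), this shows that for unit-length vectors the squared L2 distance is twice the cosine distance, so the same threshold filters very differently under each metric:

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; assumes a and b are already unit length."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def squared_l2_distance(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def keep_relevant(query, chunks, distance_fn, threshold):
    """Return the texts of chunks whose distance to the query is below threshold."""
    return [text for text, vec in chunks if distance_fn(query, vec) < threshold]


query = normalize([1.0, 0.0])
chunks = [
    ("restart procedure", normalize([1.0, 0.2])),
    ("loosely related", normalize([0.5, 1.0])),
    ("unrelated", normalize([-1.0, 0.1])),
]

by_cosine = keep_relevant(query, chunks, cosine_distance, 0.8)
by_l2 = keep_relevant(query, chunks, squared_l2_distance, 0.8)
print(by_cosine)  # the "loosely related" chunk passes under cosine distance
print(by_l2)      # but is filtered out under squared L2
```

Before settling on a threshold, check which metric your collection was created with and inspect the distances returned for a few known-good and known-bad queries.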

Strategy 2: Sliding Window with Overlap

For sequential documents where order matters (logs, conversation history), use a sliding window:

python
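def sliding_window_chunks(text: str, window_size: int = 2000, overlap: int = 200) -> list[dict]:
    """Split text into overlapping chunks for better context preservation."""
    if overlap >= window_size:
        # Otherwise the window would never move forward
        raise ValueError("overlap must be smaller than window_size")
    
    chunks = []
    start = 0
    chunk_idx = 0
    
    while start < len(text):
        end = min(start + window_size, len(text))  # clamp so end_char stays in range
        chunk = text[start:end]
        
        chunks.append({
            "content": chunk,
            "chunk_id": chunk_idx,
            "start_char": start,
            "end_char": end,
        })
        
        if end == len(text):
            break  # last chunk reached; avoid a trailing chunk made only of overlap
        
        start += (window_size - overlap)  # move forward, keeping overlap
        chunk_idx += 1
    
    return chunks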
 
 
def analyze_logs_sliding(log_content: str, question: str) -> str:
    """Analyze logs using sliding window when they exceed context."""
    chunks = sliding_window_chunks(log_content, window_size=8000, overlap=500)
    
    insights = []
    for i, chunk in enumerate(chunks):
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",  # use cheap model for chunks
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": f"Log section {i+1}/{len(chunks)}:\n\n{chunk['content']}\n\nQuestion: {question}\n\nExtract only the relevant information (or 'nothing relevant' if this section doesn't help):"
            }]
        )
        
        insight = response.content[0].text
        if "nothing relevant" not in insight.lower():
            insights.append(f"[Section {i+1}]: {insight}")
    
    if not insights:
        return "No relevant information found in logs."
    
    # Synthesize with the full model
    synthesis_prompt = f"""I've analyzed {len(chunks)} sections of logs for: "{question}"
    
Here are the relevant findings:
{chr(10).join(insights)}
 
Synthesize these findings into a clear answer:"""
 
    final_response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1000,
        messages=[{"role": "user", "content": synthesis_prompt}]
    )
    
    return final_response.content[0].text
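
Windows defined by character offsets can cut a log line in half, which hides the timestamp or the error message from the model. One possible variant is to window over whole lines instead. This is a sketch only: `sliding_window_lines`, `max_chars`, and `overlap_lines` are hypothetical names, not part of the code above.

```python
def sliding_window_lines(text: str, max_chars: int = 8000,
                         overlap_lines: int = 5) -> list[str]:
    """Split text into overlapping chunks that always end on a line boundary."""
    lines = text.splitlines()
    chunks = []
    start = 0

    while start < len(lines):
        size = 0
        end = start
        # Greedily add whole lines until the budget is used up.
        # The first line is always taken, even if it alone exceeds max_chars.
        while end < len(lines) and (end == start or size + len(lines[end]) + 1 <= max_chars):
            size += len(lines[end]) + 1  # +1 for the newline
            end += 1

        chunks.append("\n".join(lines[start:end]))

        if end == len(lines):
            break  # everything is covered

        # Step back by the overlap, but always advance by at least one line
        start = max(end - overlap_lines, start + 1)

    return chunks


sample = "\n".join(f"line {i}" for i in range(10))
windows = sliding_window_lines(sample, max_chars=21, overlap_lines=1)
```

With the sample above, each window holds three short lines and repeats the last line of the previous window, so an event that spans a window boundary still appears with some of its context.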

Strategy 3: Hierarchical Summarization

For very long documents, summarize first, then answer:

python
def hierarchical_summarize(text: str, max_summary_length: int = 3000) -> str:
    """Summarize a long document hierarchically."""
    # Split into sections
    sections = split_into_sections(text, max_chars=4000)
    
    # Summarize each section
    section_summaries = []
    for i, section in enumerate(sections):
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"Summarize this section concisely (max 200 words), preserving key technical details:\n\n{section}"
            }]
        )
        section_summaries.append(response.content[0].text)
    
    # Combine summaries
    combined = "\n\n".join(section_summaries)
    
    # If still too long, summarize again
    if len(combined) > max_summary_length:
        response = client.messages.create(
            model="claude-opus-4-8",
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": f"Combine these section summaries into a single coherent summary:\n\n{combined}"
            }]
        )
        return response.content[0].text
    
    return combined
 
 
def split_into_sections(text: str, max_chars: int = 4000) -> list[str]:
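    """Split text on double newlines, packing paragraphs into sections of about max_chars.

    A single paragraph longer than max_chars is kept whole, not split.
    """
    paragraphs = text.split("\n\n")
    sections = []
    current = ""
    
    for para in paragraphs:
        if len(current) + len(para) > max_chars and current:
            sections.append(current.strip())
            current = para
        else:
            # Avoid a leading separator when starting a new section
            current = f"{current}\n\n{para}" if current else para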
    
    if current.strip():
        sections.append(current.strip())
    
    return sections
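
The function above stops after one extra pass, so a very large document can still produce a combined summary that is too long for a single call. A sketch of a multi-level reduce step is shown below. It takes the summarizer as a plain callable so it can be tried without an API key; `reduce_summaries`, `summarize`, and `group_size` are hypothetical names introduced for illustration.

```python
from typing import Callable


def reduce_summaries(summaries: list[str], summarize: Callable[[str], str],
                     max_chars: int = 3000, group_size: int = 5) -> str:
    """Merge summaries in groups, level by level, until the result fits max_chars."""
    if group_size < 2:
        # Groups of one would never shrink the list
        raise ValueError("group_size must be at least 2")

    current = list(summaries)
    while len(current) > 1 and len("\n\n".join(current)) > max_chars:
        merged = []
        for i in range(0, len(current), group_size):
            group = "\n\n".join(current[i:i + group_size])
            merged.append(summarize(group))
        current = merged

    # If a single summary is still longer than max_chars, it is returned as-is
    return "\n\n".join(current)


# Stand-in summarizer for demonstration: keeps the first 10 characters
calls = []

def fake_summarize(text: str) -> str:
    calls.append(text)
    return text[:10]

result = reduce_summaries(["a" * 100] * 25, fake_summarize, max_chars=50)
```

In real use, `summarize` would wrap a model call like the ones in `hierarchical_summarize`. Each level loses detail, so keep the number of levels as small as the context window allows.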

Strategy 4: Smart Context Selection

For code analysis, select only the relevant files/functions:

python
import os
 
def extract_relevant_code(repo_path: str, question: str, max_chars: int = 20000) -> str:
    """Extract relevant code files based on the question."""
    
    # Get all Python files
    all_files = []
    for root, dirs, files in os.walk(repo_path):
        dirs[:] = [d for d in dirs if d not in ['.git', '__pycache__', 'node_modules', '.venv']]
        for f in files:
            if f.endswith('.py'):
                all_files.append(os.path.join(root, f))
    
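    # Quick relevance filter using keyword matching (a simplified stand-in for embeddings)
    relevant_keywords = question.lower().split()
    scored_files = []
    
    for filepath in all_files:
        # Ignore undecodable bytes so one odd file doesn't abort the scan
        with open(filepath, encoding="utf-8", errors="ignore") as f:
            content = f.read()
        
        content_lower = content.lower()
        score = sum(1 for kw in relevant_keywords if kw in content_lower)
        if score > 0:
            scored_files.append((score, filepath, content))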
    
    # Sort by relevance, take top files up to max_chars
    scored_files.sort(reverse=True)
    
    selected_content = ""
    for score, filepath, content in scored_files:
        relative_path = os.path.relpath(filepath, repo_path)
        file_block = f"\n# File: {relative_path}\n{content}\n"
        
        if len(selected_content) + len(file_block) > max_chars:
            break
        selected_content += file_block
    
    return selected_content
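
Selecting whole files can still waste budget when only one function matters. A possible refinement is to narrow the selection to individual functions using the standard library `ast` module. This is a sketch under simple assumptions: `extract_relevant_functions` is a hypothetical helper, matching is plain substring search, and nested functions will appear both on their own and inside their parent.

```python
import ast


def extract_relevant_functions(source: str, keywords: list[str]) -> list[str]:
    """Return the source of every function whose text mentions any keyword."""
    tree = ast.parse(source)
    selected = []

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # get_source_segment returns None if location info is missing
            segment = ast.get_source_segment(source, node)
            if segment and any(kw.lower() in segment.lower() for kw in keywords):
                selected.append(segment)

    return selected


sample_source = '''
def restart_service(name):
    return f"restarting {name}"


def parse_config(path):
    return path
'''

matches = extract_relevant_functions(sample_source, ["restart"])
```

Files that fail to parse raise `SyntaxError` from `ast.parse`, so in a real scan you would catch that and fall back to whole-file selection.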

Strategy 5: Conversation History Compression

For multi-turn conversations, old turns eat your context budget:

python
class ConversationManager:
    def __init__(self, max_messages: int = 20, compression_threshold: int = 15):
        self.messages = []
        self.max_messages = max_messages
        self.compression_threshold = compression_threshold
    
    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        
        if len(self.messages) >= self.compression_threshold:
            self._compress_history()
    
    def _compress_history(self):
        """Summarize old messages to free up context."""
        if len(self.messages) <= 6:
            return
        
        # Keep the last 4 messages, compress the rest
        old_messages = self.messages[:-4]
        recent_messages = self.messages[-4:]
        
        summary_prompt = "Summarize this conversation history concisely, preserving key facts and decisions:\n\n"
        for msg in old_messages:
            summary_prompt += f"{msg['role'].upper()}: {msg['content']}\n\n"
        
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=500,
            messages=[{"role": "user", "content": summary_prompt}]
        )
        
        summary = response.content[0].text
        
        # Replace old history with summary
        self.messages = [
            {"role": "user", "content": f"[Previous conversation summary: {summary}]"},
            {"role": "assistant", "content": "I understand the conversation history."},
        ] + recent_messages
    
    def get_messages(self) -> list:
        return self.messages
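
One detail to watch in `_compress_history`: the summary is injected as a user/assistant pair, so the kept tail reads most naturally if it begins with a user turn. With a fixed `[-4:]` slice, the tail can start with an assistant message. A sketch of a split that moves the cut back to the nearest user turn is shown below; `split_for_compression` and `keep_recent` are hypothetical names, and whether your chat API needs strictly alternating roles is an assumption to verify.

```python
def split_for_compression(messages: list[dict], keep_recent: int = 4) -> tuple[list[dict], list[dict]]:
    """Split history into (old, recent) so that recent starts with a user turn."""
    if len(messages) <= keep_recent:
        return [], list(messages)

    cut = len(messages) - keep_recent
    # Move the cut earlier until the kept tail begins with a user message
    while cut > 0 and messages[cut]["role"] != "user":
        cut -= 1

    return messages[:cut], messages[cut:]


# 15 alternating turns, starting and ending with the user
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(15)
]
old, recent = split_for_compression(history, keep_recent=4)
```

Here the naive cut would land on an assistant turn, so the split keeps five recent messages instead of four and summarizes the first ten.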

Choosing the Right Strategy

Situation | Best Strategy
Large knowledge base, specific questions | RAG
Sequential logs, finding patterns | Sliding window
Single long document, need overview | Hierarchical summarization
Codebase analysis, targeted questions | Smart file selection
Long multi-turn chat | History compression
Short documents (< 100K tokens) | Full context (simplest)

The rule: don't add complexity until the simple approach fails. Start with full context if it fits. Add RAG when it doesn't.
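
The table and the rule can be read as a small decision function. This is only a sketch: the `kind` labels are made-up names for the rows above, the 100K threshold comes from the last row, and falling back to RAG for unknown kinds follows the rule rather than anything more rigorous.

```python
def choose_strategy(estimated_tokens: int, kind: str,
                    full_context_limit: int = 100_000) -> str:
    """Pick a strategy name from the input size and the type of workload."""
    # Simplest option first: send everything if it comfortably fits
    if estimated_tokens < full_context_limit:
        return "full context"

    by_kind = {
        "knowledge_base": "RAG",
        "logs": "sliding window",
        "single_document": "hierarchical summarization",
        "codebase": "smart file selection",
        "chat": "history compression",
    }
    # Unknown workloads fall back to RAG
    return by_kind.get(kind, "RAG")
```

For example, `choose_strategy(50_000, "logs")` picks full context, while the same logs at two million tokens map to the sliding window.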

Tools: LangChain | ChromaDB | Sentence Transformers
