
RAG in Production — Chunking, Embedding Models, and Retrieval Tuning

Building a RAG system that actually works in production requires the right chunking strategy, embedding model, and retrieval tuning. Here's what works, what doesn't, and real configuration examples.

DevOpsBoys · Jun 6, 2026 · 4 min read

Most RAG tutorials show you the happy path — load documents, chunk them, embed them, retrieve them, done. Production RAG is different. Bad chunking kills retrieval. Wrong embedding model breaks semantic search. Naive retrieval returns garbage. Here's what actually matters.


The RAG Pipeline

Documents → Chunking → Embedding → Vector Store
                                        ↓
User Query → Query Embedding → Similarity Search → Top-K Chunks
                                                        ↓
                                              LLM + Context → Answer

Each step can silently degrade quality. The LLM looks competent even when retrieval is broken — it just hallucinates from bad context.


Chunking Strategies

Fixed-Size Chunking (naive)

python
from langchain_text_splitters import RecursiveCharacterTextSplitter  # current home of the splitters
 
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_text(document)

Problem: Chunk boundaries fall wherever the character budget runs out, often mid-sentence. A 512-character limit has no notion of the document's semantic units (sections, steps, paragraphs).

When to use: Structured logs, code, CSV data where character count matters.
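
To see the failure mode concretely, here is a minimal pure-Python sketch of character-window chunking with overlap. It is not LangChain's implementation (RecursiveCharacterTextSplitter tries its separators before falling back to raw characters); `fixed_size_chunks` and the sample text are illustrative names only.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Slide a chunk_size window over text, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "Restart the pod. Then check the logs for OOMKilled events."
sample_chunks = fixed_size_chunks(sample, chunk_size=24, overlap=4)
for piece in sample_chunks:
    print(repr(piece))
```

The first chunk ends in the middle of the word "check". The overlap repeats the last four characters at the start of the next chunk, which softens the cut but does not restore the sentence.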


Semantic Chunking (better)

python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
 
splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,   # higher = fewer, larger chunks
)
chunks = splitter.split_text(document)

Splits on semantic boundaries, that is, topic shifts in the text. Retrieval accuracy is usually better, but every sentence has to be embedded at chunking time, so indexing costs more.
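
The percentile threshold is easier to reason about with a toy version. The sketch below is an assumption-laden illustration of the idea, not SemanticChunker's code: `toy_embed` is a bag-of-words stand-in for a real embedding model, and `threshold_pct` plays the role of `breakpoint_threshold_amount`.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values: list[float], pct: float) -> float:
    """Linear-interpolation percentile, pct in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_chunks(sentences: list[str], threshold_pct: float) -> list[str]:
    """Start a new chunk wherever adjacent sentences are unusually dissimilar."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [toy_embed(s) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    cutoff = percentile(distances, threshold_pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > cutoff:  # distance above the cutoff marks a topic shift
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "The pod restarts when memory exceeds the limit.",
    "Raise the memory limit to stop the pod restarts.",
    "Billing invoices are emailed monthly.",
    "Invoices list monthly billing totals.",
]
print(semantic_chunks(sentences, threshold_pct=50))
```

With the cutoff at the 50th percentile, only the jump from Kubernetes to billing exceeds it, so the text splits into two chunks. Raising the percentile raises the cutoff, which is why higher values give fewer, larger chunks.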


Document-Aware Chunking (production standard)

python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
 
# Parse document preserving structure
elements = partition("runbook.pdf")
 
# Chunk respecting headings, sections
chunks = chunk_by_title(
    elements,
    max_characters=1500,
    new_after_n_chars=1000,
    combine_text_under_n_chars=200,
)
 
# Each chunk knows its section context
for chunk in chunks:
    print(chunk.metadata.page_number, chunk.text[:100])

Preserves document structure. Section headers become chunk metadata. Dramatically improves retrieval for technical docs.
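
If your docs are already Markdown, the same idea can be sketched without a parsing library. This is a simplified stand-in for what a structure-aware chunker does, not the `unstructured` implementation; `chunk_by_heading` and the sample runbook are hypothetical.

```python
import re


def chunk_by_heading(markdown_text: str) -> list[dict]:
    """Split on Markdown headings and attach each heading to its chunk as metadata."""
    sections = []  # each entry is [heading, lines]
    current = ["(preamble)", []]
    for line in markdown_text.splitlines():
        match = re.match(r"^#{1,6}\s+(.+)", line)
        if match:
            sections.append(current)
            current = [match.group(1).strip(), []]
        else:
            current[1].append(line)
    sections.append(current)
    # Drop sections with no body text (for example an empty preamble).
    return [
        {"section": heading, "text": "\n".join(lines).strip()}
        for heading, lines in sections
        if "\n".join(lines).strip()
    ]


runbook = """# Database Failover
Promote the replica with pg_ctl promote.

## Rollback
Re-point the app at the old primary.
"""
for chunk in chunk_by_heading(runbook):
    print(chunk["section"], "->", chunk["text"])
```

A known limitation of this sketch: a line starting with `#` inside a code block would be misread as a heading, which is one reason to prefer a real document parser in production.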


Chunking Rules for DevOps Docs

| Document type | Strategy | Chunk size |
|---|---|---|
| Runbooks | By section heading | 800–1500 chars |
| API docs | By endpoint | Variable |
| Log files | Fixed-size | 256–512 chars |
| Incident reports | By paragraph | 500–1000 chars |
| Kubernetes manifests | By resource | Variable |
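
"By resource" for Kubernetes manifests can be as simple as splitting a multi-document YAML file on its `---` separators. The sketch below uses regular expressions rather than a YAML parser, so treat it as an approximation; `split_manifests` and the sample manifests are hypothetical, and a parser such as PyYAML's `safe_load_all` would be more robust.

```python
import re


def split_manifests(multi_doc_yaml: str) -> list[dict]:
    """Split a multi-document YAML string on '---' lines, one chunk per resource."""
    chunks = []
    for doc in re.split(r"^---[ \t]*$", multi_doc_yaml, flags=re.MULTILINE):
        doc = doc.strip()
        if not doc:
            continue
        kind = re.search(r"^kind:[ \t]*(\S+)", doc, flags=re.MULTILINE)
        # First indented "name:" line, which is normally metadata.name.
        name = re.search(r"^[ \t]+name:[ \t]*(\S+)", doc, flags=re.MULTILINE)
        chunks.append({
            "kind": kind.group(1) if kind else None,
            "name": name.group(1) if name else None,
            "text": doc,
        })
    return chunks


manifests = """apiVersion: v1
kind: Service
metadata:
  name: web
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
"""
resources = split_manifests(manifests)
for res in resources:
    print(res["kind"], res["name"], len(res["text"]))
```

Storing `kind` and `name` in the payload lets you filter retrieval to, say, Deployments only.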

Embedding Models

Comparison (2026)

| Model | Dimensions | Cost | Best for |
|---|---|---|---|
| text-embedding-3-small | 1536 | $0.02/1M tokens | General, low cost |
| text-embedding-3-large | 3072 | $0.13/1M tokens | Higher accuracy |
| nomic-embed-text | 768 | Free (self-hosted) | Privacy-sensitive |
| BAAI/bge-large-en-v1.5 | 1024 | Free (self-hosted) | Technical docs |
| mxbai-embed-large | 1024 | Free (self-hosted) | Strong open-source |

python
# Self-hosted embedding with sentence-transformers
from sentence_transformers import SentenceTransformer
 
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
embeddings = model.encode(chunks, normalize_embeddings=True)

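Rule: Use the same embedding model at index time and query time. Vectors from different models (or different versions of one model) live in different spaces, so similarity scores between them are meaningless.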
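
One way to enforce the rule is to record which model built the index and check it before every query path goes live. This is a minimal sketch with hypothetical names (`EmbeddingSpec`, `assert_compatible`); where you persist the spec (collection metadata, config file) is up to you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    model_name: str
    dimensions: int


def assert_compatible(index_spec: EmbeddingSpec, query_spec: EmbeddingSpec) -> None:
    """Raise if the query-time model differs from the one used to build the index."""
    if index_spec != query_spec:
        raise ValueError(
            f"index built with {index_spec}, but queries use {query_spec}; "
            "re-index or switch models"
        )


# Assumed to be saved alongside the index when it was built.
INDEX_SPEC = EmbeddingSpec("BAAI/bge-large-en-v1.5", 1024)

assert_compatible(INDEX_SPEC, EmbeddingSpec("BAAI/bge-large-en-v1.5", 1024))  # passes silently
```

A dimension mismatch usually fails loudly at the vector store. A model swap with the same dimensions does not, which is the case this check is meant to catch.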


Vector Store Setup (Qdrant on Kubernetes)

yaml
# qdrant-deployment.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: qdrant
  namespace: ai-infra
spec:
  serviceName: qdrant
  replicas: 1
  selector:
    matchLabels:
      app: qdrant
  template:
    metadata:
      labels:
        app: qdrant  # must match spec.selector.matchLabels
    spec:
      containers:
      - name: qdrant
        image: qdrant/qdrant:latest  # pin a specific version tag in production
        ports:
        - containerPort: 6333
        - containerPort: 6334
        volumeMounts:
        - name: storage
          mountPath: /qdrant/storage
        resources:
          requests:
            memory: "2Gi"
            cpu: "500m"
          limits:
            memory: "4Gi"
  volumeClaimTemplates:
  - metadata:
      name: storage
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 50Gi
python
# Indexing pipeline
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
 
client = QdrantClient(host="qdrant.ai-infra.svc", port=6333)
 
# Create collection
client.create_collection(
    collection_name="devops_docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)
 
# Index chunks (metadata is assumed to be a list of dicts parallel to chunks)
points = [
    PointStruct(
        id=i,
        vector=embeddings[i].tolist(),
        payload={
            "text": chunks[i],
            "source": "runbook",
            "section": metadata[i]["section"],
            "doc_id": metadata[i]["doc_id"],
        }
    )
    for i in range(len(chunks))
]
 
client.upsert(collection_name="devops_docs", points=points)
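
One caveat with `id=i`: the counter restarts at 0 for every batch, so indexing a second document would overwrite the first document's points. Qdrant accepts UUIDs as point IDs, so a possible fix is to derive a stable ID from the document and chunk position. The sketch below is an assumption, not the article's pipeline; `NAMESPACE` and `point_id` are hypothetical names.

```python
import uuid

# Hypothetical project namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "devops-docs.example.internal")


def point_id(doc_id: str, chunk_index: int) -> str:
    """Derive a stable UUID from the document id and the chunk's position."""
    return str(uuid.uuid5(NAMESPACE, f"{doc_id}:{chunk_index}"))


print(point_id("runbook-42", 0))
print(point_id("runbook-42", 1))
```

Because the same input always yields the same ID, re-running the indexer on an updated document upserts in place instead of duplicating. If a document shrinks, delete its old points by `doc_id` first so trailing chunks do not linger.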

Retrieval Tuning

Basic Semantic Search (baseline)

python
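def retrieve(query: str, top_k: int = 5) -> list[str]:
    """Embed the query and return the text of the top_k nearest chunks."""
    query_embedding = model.encode(query, normalize_embeddings=True)
    # query_points supersedes client.search, which is deprecated in recent qdrant-client releases
    response = client.query_points(
        collection_name="devops_docs",
        query=query_embedding.tolist(),
        limit=top_k,
    )
    return [point.payload["text"] for point in response.points]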
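
It helps to know what the vector store is approximating. The sketch below is an exhaustive cosine search in plain Python, with hypothetical names and toy 2-dimensional vectors; on a small sample of your corpus it can serve as ground truth for checking the recall of the approximate index.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def brute_force_top_k(
    query: list[float], corpus: dict[str, list[float]], k: int
) -> list[tuple[str, float]]:
    """Exhaustive nearest-neighbour search; on unit vectors the dot product is the cosine."""
    q = normalize(query)
    scored = [
        (doc_id, sum(a * b for a, b in zip(q, normalize(vec))))
        for doc_id, vec in corpus.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


corpus = {
    "restart-pod": [1.0, 0.0],
    "rotate-certs": [0.0, 1.0],
    "scale-deploy": [1.0, 1.0],
}
hits = brute_force_top_k([0.9, 0.1], corpus, k=2)
print(hits)
```

This is also why `normalize_embeddings=True` matters: once every vector has unit length, cosine similarity and dot product rank results identically.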

Hybrid Search (semantic + keyword)

Pure semantic search misses exact terms. Hybrid adds BM25 keyword matching:

python
from qdrant_client.models import Prefetch, FusionQuery, Fusion
 
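# Sparse + dense hybrid search.
# Assumes a collection with named vectors "dense" and "sparse"; the single-vector
# collection created earlier would have to be recreated with both.
# dense_vector, sparse_vector, and top_k are supplied by the caller.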
results = client.query_points(
    collection_name="devops_docs",
    prefetch=[
        Prefetch(query=dense_vector, using="dense", limit=20),
        Prefetch(query=sparse_vector, using="sparse", limit=20),
    ],
    query=FusionQuery(fusion=Fusion.RRF),  # Reciprocal Rank Fusion
    limit=top_k,
)

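Hybrid typically improves retrieval accuracy by 10–20% on technical content, where exact identifiers (error codes, CLI flags, resource names) carry meaning that embeddings tend to blur. Verify the gain against your own evaluation set before committing to the extra index.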
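
Reciprocal Rank Fusion itself is only a few lines. The sketch below shows the standard formula, score(d) = Σ 1 / (k + rank), over two ranked lists of document IDs. The IDs are made up, and k = 60 is the conventional default from the RRF literature; this sketch does not assume Qdrant uses the same constant internally.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_hits = ["oom-runbook", "hpa-guide", "node-drain"]
sparse_hits = ["exit-137-faq", "oom-runbook", "hpa-guide"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

RRF uses only ranks, never raw scores, so it needs no calibration between BM25 scores and cosine similarities. Documents that appear in both lists rise to the top.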

Re-ranking

python
from sentence_transformers import CrossEncoder
 
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
 
def retrieve_and_rerank(query: str, top_k: int = 5):
    # Get more candidates first
    candidates = retrieve(query, top_k=20)
 
    # Re-rank with cross-encoder (more accurate, slower)
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)
 
    # Return top-k after re-ranking
    # Sort on the score only, so ties never fall back to comparing documents
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]

Evaluation Metrics

Track these in production to detect retrieval drift:

python
# RAGAS evaluation (test_cases is assumed to be a prepared evaluation dataset)
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
 
results = evaluate(
    dataset=test_cases,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
 
# Key metrics:
# faithfulness: is the answer grounded in retrieved context?
# context_precision: are the relevant chunks ranked near the top of the retrieved set?
# answer_relevancy: does the answer actually answer the question?
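
RAGAS metrics rely on an LLM judge. For a cheaper, retrieval-only signal you can also track hit rate and mean reciprocal rank against a small labeled set. The sketch below is not part of RAGAS; the function names and the sample data are hypothetical.

```python
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant id in the top k results."""
    if not retrieved:
        return 0.0
    hits = sum(1 for docs, rel in zip(retrieved, relevant) if rel & set(docs[:k]))
    return hits / len(retrieved)


def mean_reciprocal_rank(retrieved: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1 / rank of the first relevant result (0 when none is retrieved)."""
    if not retrieved:
        return 0.0
    total = 0.0
    for docs, rel in zip(retrieved, relevant):
        for rank, doc_id in enumerate(docs, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(retrieved)


# One list of retrieved chunk ids per query, best first, plus the labeled relevant ids.
retrieved_ids = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant_ids = [{"a"}, {"f"}, {"z"}]

print(hit_rate_at_k(retrieved_ids, relevant_ids, k=2))
print(mean_reciprocal_rank(retrieved_ids, relevant_ids))
```

Run this on every index rebuild or model change. A drop in either number flags a retrieval regression before users see wrong answers.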

Common Production Issues

| Issue | Symptom | Fix |
|---|---|---|
| Retrieval returns wrong docs | LLM gives confident wrong answers | Add hybrid search + reranking |
| Slow query latency | >500ms per query | Use smaller embedding model or HNSW tuning |
| Context window overflow | LLM cuts off or errors | Reduce top_k or chunk size |
| Stale embeddings | New docs not appearing | Add incremental indexing pipeline |
| Chunk splits mid-table | Table data in answers is wrong | Use structured chunking, keep tables as single chunks |
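
For context window overflow, a fixed `top_k` is a blunt tool because chunk sizes vary. An alternative is to pack ranked chunks until a token budget is spent. The sketch below approximates tokens with whitespace-separated words, which is an assumption; in practice you would count with the target model's tokenizer. `pack_context` is a hypothetical helper.

```python
def pack_context(ranked_chunks: list[str], max_tokens: int) -> list[str]:
    """Keep chunks in ranked order until the approximate token budget is spent."""
    packed, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())  # crude proxy: whitespace-separated words
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit, preserving rank order
        packed.append(chunk)
        used += cost
    return packed


ranked_chunks = [
    "restart the pod",
    "check logs for OOMKilled events",
    "rotate certs",
]
print(pack_context(ranked_chunks, max_tokens=8))
```

Stopping at the first chunk that does not fit keeps the context a strict prefix of the ranking. Skipping ahead to smaller chunks would use the budget more fully, but it lets lower-ranked material displace higher-ranked material.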

Build evaluation into your pipeline from day one. A RAG system that can't measure its own retrieval quality will silently degrade over time.
