RAG in Production — Chunking, Embedding Models, and Retrieval Tuning

Building a RAG system that actually works in production requires the right chunking strategy, embedding model, and retrieval tuning. Here's what works, what doesn't, and real configuration examples.

Most RAG tutorials show you the happy path — load documents, chunk them, embed them, retrieve them, done. Production RAG is different. Bad chunking kills retrieval. Wrong embedding model breaks semantic search. Naive retrieval returns garbage. Here's what actually matters.

The RAG Pipeline

Documents → Chunking → Embedding → Vector Store
                                        ↓
User Query → Query Embedding → Similarity Search → Top-K Chunks
                                                        ↓
                                              LLM + Context → Answer

Each step can silently degrade quality. The LLM looks competent even when retrieval is broken — it just hallucinates from bad context.

Chunking Strategies

Fixed-Size Chunking (naive)

python

from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are measured in characters by default
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""],  # tried in order, coarsest first
)
chunks = splitter.split_text(document)

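Problem: Chunk boundaries ignore meaning. The splitter prefers paragraph and line breaks, but when a passage is longer than 512 characters it falls back to spaces and then to raw characters, so a chunk can end mid-sentence and separate a step from its explanation.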

When to use: Structured logs, code, CSV data where character count matters.
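To make the size and overlap parameters concrete, here is a minimal pure-Python sketch of a sliding character window. The function name `chunk_fixed` is made up for illustration, and this is not how `RecursiveCharacterTextSplitter` works internally: it ignores separators entirely, which is the most naive form of the idea.

```python
def chunk_fixed(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "Restart the pod. Then check the logs for OOMKilled events."
    for piece in chunk_fixed(sample, chunk_size=24, overlap=6):
        print(repr(piece))
```

Running it on a short sentence shows windows ending in the middle of words, which is the failure mode described above.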

Semantic Chunking (better)

python

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
 
splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,   # higher = fewer, larger chunks
)
chunks = splitter.split_text(document)

Splits on semantic boundaries — topic shifts in the text. Better retrieval accuracy, higher compute cost.
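The percentile threshold can be pictured as follows: embed each sentence, measure the distance between neighbours, and start a new chunk wherever that distance is above the chosen percentile of all distances. The sketch below is an illustration of that idea under stated assumptions, not LangChain's implementation; the function names and the toy two-dimensional vectors are invented, and the vectors are assumed to be non-zero.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation, pct in the range 0..100."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_split(sentences: list[str], vectors: list[list[float]],
                   pct: float = 95.0) -> list[str]:
    """Start a new chunk where the distance to the previous sentence exceeds the percentile."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []

    distances = [cosine_distance(vectors[i], vectors[i + 1])
                 for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)

    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:          # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

A higher percentile raises the threshold, so fewer neighbours count as a topic shift and the chunks get larger, which matches the comment on `breakpoint_threshold_amount` above.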

Document-Aware Chunking (production standard)

python

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
 
# Parse document preserving structure
elements = partition("runbook.pdf")
 
# Chunk respecting headings, sections
chunks = chunk_by_title(
    elements,
    max_characters=1500,
    new_after_n_chars=1000,
    combine_text_under_n_chars=200,
)
 
# Each chunk keeps metadata from its source elements, such as the page number
for chunk in chunks:
    print(chunk.metadata.page_number, chunk.text[:100])

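Preserves document structure. A new chunk starts at each section title, so a chunk does not straddle two sections, and metadata such as the page number travels with it. For technical docs with clear headings, this usually retrieves better than size-based splitting.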
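For sources that are already Markdown, the same idea can be sketched without a parsing library. The following is a hypothetical helper, not part of `unstructured`: it assumes ATX-style `#` headings, never lets a chunk cross a heading, and splits long sections on blank lines.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str, max_chars: int = 1500) -> list[dict]:
    """Split Markdown into chunks that stay inside one section each."""
    chunks: list[dict] = []
    section = "(no heading)"
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered lines as one or more chunks for the current section.
        body = "\n".join(buffer).strip()
        buffer.clear()
        if not body:
            return
        current = ""
        for para in body.split("\n\n"):
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append({"section": section, "text": current})
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        # A single paragraph longer than max_chars is kept whole in this sketch.
        chunks.append({"section": section, "text": current})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()                        # close the previous section first
            section = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return chunks
```

Each returned dict carries the heading it belongs to, which can go straight into the vector store payload as a `section` field.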

Chunking Rules for DevOps Docs

Document type	Strategy	Chunk size
Runbooks	By section heading	800–1500 chars
API docs	By endpoint	Variable
Log files	Fixed-size	256–512 chars
Incident reports	By paragraph	500–1000 chars
Kubernetes manifests	By resource	Variable

Embedding Models

Comparison (2026)

Model	Dimensions	Cost	Best for
`text-embedding-3-small`	1536	$0.02/1M tokens	General, low cost
`text-embedding-3-large`	3072	$0.13/1M tokens	Higher accuracy
`nomic-embed-text`	768	Free (self-hosted)	Privacy-sensitive
`BAAI/bge-large-en-v1.5`	1024	Free (self-hosted)	Technical docs
`mxbai-embed-large`	1024	Free (self-hosted)	Strong open-source

python

# Self-hosted embedding with sentence-transformers
from sentence_transformers import SentenceTransformer
 
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
embeddings = model.encode(chunks, normalize_embeddings=True)

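Rule: Use the same embedding model, with the same normalization setting, at index time and at query time. Vectors produced by different models live in different spaces, so similarity scores between them are meaningless.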
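One way to enforce this rule is to store a small description of the embedder next to the index and compare it before every query. This is a sketch only; `EmbeddingSpec` and `check_query_spec` are hypothetical names, and where the spec is persisted (collection metadata, a config file) is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    model_name: str
    dimensions: int
    normalized: bool


def check_query_spec(index_spec: EmbeddingSpec, query_spec: EmbeddingSpec) -> None:
    """Raise if the query-side embedder differs from the one used to build the index."""
    if index_spec != query_spec:
        raise ValueError(
            f"Embedding mismatch: index built with {index_spec}, "
            f"query uses {query_spec}"
        )


if __name__ == "__main__":
    index_spec = EmbeddingSpec("BAAI/bge-large-en-v1.5", 1024, True)
    check_query_spec(index_spec, EmbeddingSpec("BAAI/bge-large-en-v1.5", 1024, True))
    print("query embedder matches the index")
```

Failing loudly here is cheaper than debugging a retriever that returns plausible but unrelated chunks after a model upgrade.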

Vector Store Setup (Qdrant on Kubernetes)

yaml

# qdrant-deployment.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: qdrant
  namespace: ai-infra
spec:
  serviceName: qdrant
  replicas: 1
  selector:
    matchLabels:
      app: qdrant
  template:
    metadata:
      labels:
        app: qdrant   # must match spec.selector.matchLabels
    spec:
      containers:
      - name: qdrant
        image: qdrant/qdrant:latest   # pin a specific version tag in production
        ports:
        - containerPort: 6333
        - containerPort: 6334
        volumeMounts:
        - name: storage
          mountPath: /qdrant/storage
        resources:
          requests:
            memory: "2Gi"
            cpu: "500m"
          limits:
            memory: "4Gi"
  volumeClaimTemplates:
  - metadata:
      name: storage
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 50Gi

python

# Indexing pipeline
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
 
client = QdrantClient(host="qdrant.ai-infra.svc", port=6333)
 
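# Create collection; size must match the embedding model's output
# (1024 for BAAI/bge-large-en-v1.5)
client.create_collection(
    collection_name="devops_docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

# Index chunks; `metadata` is assumed to be a list of dicts parallel to `chunks`
points = [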
    PointStruct(
        id=i,
        vector=embeddings[i].tolist(),
        payload={
            "text": chunks[i],
            "source": "runbook",
            "section": metadata[i]["section"],
            "doc_id": metadata[i]["doc_id"],
        }
    )
    for i in range(len(chunks))
]
 
client.upsert(collection_name="devops_docs", points=points)
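Note that `id=i` in the indexing pipeline above restarts at 0 on every run, so indexing a second document would overwrite the points of the first. A possible fix, sketched here with hypothetical names, is to derive a stable UUID from the document ID and chunk position; Qdrant accepts UUID strings as point IDs, and re-indexing the same chunk then updates it in place.

```python
import uuid

# Hypothetical namespace for this sketch; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/devops_docs")


def point_id(doc_id: str, chunk_index: int) -> str:
    """Deterministic point ID: the same document and chunk position always map to the same UUID."""
    return str(uuid.uuid5(NAMESPACE, f"{doc_id}:{chunk_index}"))


if __name__ == "__main__":
    print(point_id("runbook-42", 0))
    print(point_id("runbook-42", 1))
```

In the pipeline this would replace `id=i` with something like `id=point_id(metadata[i]["doc_id"], i)`, assuming `doc_id` is unique per document.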

Retrieval Tuning

Basic Semantic Search (baseline)

python

def retrieve(query: str, top_k: int = 5) -> list[str]:
    # Embed the query with the same model used at index time
    query_embedding = model.encode(query, normalize_embeddings=True)
    response = client.query_points(
        collection_name="devops_docs",
        query=query_embedding.tolist(),
        limit=top_k,
    )
    return [point.payload["text"] for point in response.points]

Hybrid Search (semantic + keyword)

Pure semantic search misses exact terms such as error codes, flag names, and resource names. Hybrid adds BM25 keyword matching:

python

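from qdrant_client.models import Prefetch, FusionQuery, Fusion

# Sparse + dense hybrid search.
# Assumes the collection was created with named vectors "dense" and "sparse"
# (the single unnamed vector created earlier does not support this), and that
# dense_vector, sparse_vector, and top_k are already defined.
results = client.query_points(
    collection_name="devops_docs",
    prefetch=[
        Prefetch(query=dense_vector, using="dense", limit=20),
        Prefetch(query=sparse_vector, using="sparse", limit=20),
    ],
    query=FusionQuery(fusion=Fusion.RRF),  # Reciprocal Rank Fusion
    limit=top_k,
)
hits = [point.payload["text"] for point in results.points]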

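Hybrid typically improves retrieval accuracy by 10–20% on technical content, where queries often contain exact identifiers. Measure the gain on your own evaluation set before relying on that figure.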
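Reciprocal Rank Fusion itself is simple enough to write out. Each document scores 1 / (k + rank) in every ranked list it appears in, and the scores are summed. The sketch below is a standalone illustration, not Qdrant's code; `k=60` is the commonly cited constant and is an assumption here, as is breaking ties alphabetically.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60,
                           top_k: int = 5) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_k]]


if __name__ == "__main__":
    dense = ["a", "b", "c"]    # order from semantic search
    sparse = ["b", "d", "a"]   # order from keyword search
    print(reciprocal_rank_fusion([dense, sparse]))
```

Because only ranks are used, the dense and sparse scores never have to be put on a common scale, which is the main reason RRF is a convenient default.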

Re-ranking

python

from sentence_transformers import CrossEncoder
 
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
 
def retrieve_and_rerank(query: str, top_k: int = 5):
    # Get more candidates first
    candidates = retrieve(query, top_k=20)
 
    # Re-rank with cross-encoder (more accurate, slower)
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)
 
    # Return top-k after re-ranking; sort on the score only so ties never compare documents
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]

Evaluation Metrics

Track these in production to detect retrieval drift:

python

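# RAGAS evaluation
# test_cases is assumed to be a prepared evaluation dataset of questions, generated
# answers, retrieved contexts, and reference answers; check the RAGAS docs for the
# column names your installed version expects.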
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
 
results = evaluate(
    dataset=test_cases,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
 
# Key metrics:
# faithfulness: is the answer grounded in retrieved context?
# context_precision: were the right chunks retrieved?
# answer_relevancy: does the answer actually answer the question?
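The RAGAS metrics above rely on an LLM judge. For a cheaper signal that isolates the retriever, a small labelled set of queries with known relevant chunk IDs is enough to compute hit rate and mean reciprocal rank. The sketch below is separate from RAGAS, and the function names and data layout are assumptions made for illustration.

```python
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[set[str]],
                  k: int = 5) -> float:
    """Fraction of queries with at least one relevant chunk in the top k results."""
    if not retrieved or len(retrieved) != len(relevant):
        raise ValueError("need one non-empty, equally long entry list per query")
    hits = sum(1 for got, want in zip(retrieved, relevant) if want & set(got[:k]))
    return hits / len(retrieved)


def mean_reciprocal_rank(retrieved: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant chunk; a query with no hit contributes 0."""
    if not retrieved or len(retrieved) != len(relevant):
        raise ValueError("need one non-empty, equally long entry list per query")
    total = 0.0
    for got, want in zip(retrieved, relevant):
        for rank, chunk_id in enumerate(got, start=1):
            if chunk_id in want:
                total += 1.0 / rank
                break
    return total / len(retrieved)


if __name__ == "__main__":
    retrieved = [["a", "b", "c"], ["x", "y", "z"], ["m", "n"]]
    relevant = [{"b"}, {"q"}, {"m"}]
    print(hit_rate_at_k(retrieved, relevant, k=2))
    print(mean_reciprocal_rank(retrieved, relevant))
```

Tracking these two numbers on a fixed query set after every re-index gives an early warning when a chunking or model change hurts retrieval.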

Common Production Issues

Issue	Symptom	Fix
Retrieval returns wrong docs	LLM gives confident wrong answers	Add hybrid search + reranking
Slow query latency	>500ms per query	Use a smaller embedding model or tune HNSW parameters
Context window overflow	LLM cuts off or errors	Reduce top_k or chunk size
Stale embeddings	New docs not appearing	Add incremental indexing pipeline
Chunk splits mid-table	Table data in answers is wrong	Use structured chunking, keep tables as single chunks
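
For the context window overflow row in the table above, an alternative to a fixed `top_k` is to pack ranked chunks until a token budget is spent. This is a rough sketch: `pack_context` is a hypothetical helper, and the four-characters-per-token estimate is only a heuristic for English text, so a real tokenizer should replace it in production.

```python
def pack_context(chunks: list[str], max_tokens: int = 3000,
                 chars_per_token: float = 4.0) -> list[str]:
    """Keep ranked chunks, in order, until the estimated token budget is used up."""
    selected: list[str] = []
    used = 0
    for chunk in chunks:
        cost = int(len(chunk) / chars_per_token) + 1  # crude, slightly pessimistic estimate
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit, preserving rank order
        selected.append(chunk)
        used += cost
    return selected


if __name__ == "__main__":
    ranked = ["a" * 40, "b" * 40, "c" * 40]
    print(len(pack_context(ranked, max_tokens=25)))
```

Stopping at the first chunk that does not fit keeps the highest-ranked material and avoids filling the prompt with lower-ranked chunks that happen to be short.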

Build evaluation into your pipeline from day one. A RAG system that can't measure its own retrieval quality will silently degrade over time.
