RAG in Production — Chunking, Embedding Models, and Retrieval Tuning
Building a RAG system that actually works in production requires the right chunking strategy, embedding model, and retrieval tuning. Here's what works, what doesn't, and real configuration examples.
Most RAG tutorials show you the happy path — load documents, chunk them, embed them, retrieve them, done. Production RAG is different. Bad chunking kills retrieval. Wrong embedding model breaks semantic search. Naive retrieval returns garbage. Here's what actually matters.
The RAG Pipeline
```text
Documents → Chunking → Embedding → Vector Store
                                        ↓
User Query → Query Embedding → Similarity Search → Top-K Chunks
                                                        ↓
                                                  LLM + Context → Answer
```
Each step can silently degrade quality. The LLM looks competent even when retrieval is broken — it just hallucinates from bad context.
Chunking Strategies
Fixed-Size Chunking (naive)
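```python
# Newer LangChain releases ship splitters in the langchain-text-splitters package;
# older releases exposed the same class as langchain.text_splitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,     # maximum characters per chunk
    chunk_overlap=50,   # characters shared between neighbouring chunks
    separators=["\n\n", "\n", " ", ""],  # tried in order, coarsest first
)
chunks = splitter.split_text(document)  # document: the raw text as a str
```
Problem: chunk boundaries are driven by a character budget, so a 512-char split can land mid-sentence and does not respect the semantic units of the document.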
When to use: Structured logs, code, CSV data where character count matters.
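To make the failure mode concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. The function name `fixed_size_chunks` and the sample sentence are illustrative assumptions, not part of LangChain; the point is only to show how a pure character budget cuts text mid-word.

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Hypothetical sample text; small sizes make the mid-word cut easy to see.
sample = "Restart the ingress controller. Then verify that all pods are Ready."
pieces = fixed_size_chunks(sample, chunk_size=40, overlap=10)
for piece in pieces:
    print(repr(piece))
```

The first window ends in the middle of "verify", which is exactly the kind of boundary that hurts retrieval on prose.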
Semantic Chunking (better)
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),  # reads OPENAI_API_KEY from the environment
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,  # higher = fewer, larger chunks
)
chunks = splitter.split_text(document)
```
Splits on semantic boundaries, i.e. topic shifts in the text. Better retrieval accuracy, higher compute cost.
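The percentile threshold is easier to reason about with a toy version. The sketch below illustrates the general idea (break where the distance between adjacent sentence embeddings exceeds a percentile of all adjacent distances); it is not LangChain's implementation. The helper names and the 2-D vectors are made up for the example, and a real pipeline would get the vectors from an embedding model.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (pct / 100.0) * (len(ordered) - 1)
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_chunks(sentences: list[str], vectors: list[list[float]], pct: float = 95.0) -> list[str]:
    """Group consecutive sentences, breaking where adjacent distance exceeds the threshold."""
    if len(sentences) != len(vectors):
        raise ValueError("need exactly one vector per sentence")
    if len(sentences) < 2:
        return list(sentences)
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy data: two sentences about pods, then two about billing.
sentences = ["Pods crash on start.", "Check the pod logs.", "Invoices are monthly.", "Billing uses UTC."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(semantic_chunks(sentences, vectors))
```

Raising `pct` raises the threshold, so fewer gaps qualify as breakpoints and chunks get larger, which matches the comment on `breakpoint_threshold_amount` above.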
Document-Aware Chunking (production standard)
```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# Parse the document, preserving structure (titles, paragraphs, tables)
elements = partition(filename="runbook.pdf")

# Chunk along headings and sections
chunks = chunk_by_title(
    elements,
    max_characters=1500,             # hard cap per chunk
    new_after_n_chars=1000,          # soft cap: prefer starting a new chunk past this size
    combine_text_under_n_chars=200,  # merge very small sections into a neighbour
)

# Each chunk carries metadata from its source elements
for chunk in chunks:
    print(chunk.metadata.page_number, chunk.text[:100])
```
Preserves document structure. Section headings define chunk boundaries, and each chunk keeps metadata such as its page number. Dramatically improves retrieval for technical docs.
Chunking Rules for DevOps Docs
| Document type | Strategy | Chunk size |
|---|---|---|
| Runbooks | By section heading | 800–1500 chars |
| API docs | By endpoint | Variable |
| Log files | Fixed-size | 256–512 chars |
| Incident reports | By paragraph | 500–1000 chars |
| Kubernetes manifests | By resource | Variable |
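As an illustration of the "by resource" row, a multi-document manifest file can be split on its `---` separators so that each Kubernetes resource becomes one chunk. This is a rough sketch using only the standard library: `split_manifests` is a hypothetical helper, and the regex-based `kind`/`name` extraction is a heuristic that a real YAML parser would do more robustly.

```python
import re


def split_manifests(text: str) -> list[dict]:
    """Return one chunk per YAML document, tagged with the resource kind and name."""
    chunks = []
    for doc in re.split(r"(?m)^---[ \t]*$", text):
        doc = doc.strip()
        if not doc:
            continue  # skip empty documents around separators
        kind = re.search(r"(?m)^kind:\s*(\S+)", doc)
        # Heuristic: the first indented `name:` line is usually metadata.name.
        name = re.search(r"(?m)^[ \t]+name:\s*(\S+)", doc)
        chunks.append({
            "kind": kind.group(1) if kind else None,
            "name": name.group(1) if name else None,
            "text": doc,
        })
    return chunks


MANIFESTS = """\
apiVersion: v1
kind: Service
metadata:
  name: qdrant
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: qdrant
"""

for chunk in split_manifests(MANIFESTS):
    print(chunk["kind"], chunk["name"], len(chunk["text"]))
```

Storing `kind` and `name` in the chunk payload lets a query like "qdrant StatefulSet memory limit" be filtered or boosted on metadata rather than relying on embeddings alone.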
Embedding Models
Comparison (2026)
| Model | Dimensions | Cost | Best for |
|---|---|---|---|
| text-embedding-3-small | 1536 | $0.02/1M tokens | General, low cost |
| text-embedding-3-large | 3072 | $0.13/1M tokens | Higher accuracy |
| nomic-embed-text | 768 | Free (self-hosted) | Privacy-sensitive |
| BAAI/bge-large-en-v1.5 | 1024 | Free (self-hosted) | Technical docs |
| mxbai-embed-large | 1024 | Free (self-hosted) | Strong open-source |
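```python
# Self-hosted embedding with sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
# chunks: a list of strings; normalize_embeddings=True yields unit-length vectors
embeddings = model.encode(chunks, normalize_embeddings=True)
```
Rule: always use the same embedding model at index time and query time. Vectors produced by different models live in different spaces and cannot be compared.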
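One way to enforce that rule is to record which model built the index and compare it before every query. The sketch below is an assumption about how you might do this, not something any vector store provides out of the box; `EmbeddingSpec` and `assert_same_embedding_space` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    """Describes the embedding space a vector lives in."""
    model_name: str
    dimensions: int
    normalized: bool


def assert_same_embedding_space(index_spec: EmbeddingSpec, query_spec: EmbeddingSpec) -> None:
    """Fail fast if query vectors would not be comparable with indexed vectors."""
    if index_spec != query_spec:
        raise ValueError(
            f"embedding mismatch: index built with {index_spec}, query uses {query_spec}"
        )


# In practice the index spec would be stored next to the collection (for example in
# a config file or a metadata record); both are hard-coded here for illustration.
INDEX_SPEC = EmbeddingSpec("BAAI/bge-large-en-v1.5", 1024, True)
QUERY_SPEC = EmbeddingSpec("BAAI/bge-large-en-v1.5", 1024, True)
assert_same_embedding_space(INDEX_SPEC, QUERY_SPEC)  # passes silently
```

A mismatch usually does not raise any error on its own if the dimensions happen to agree; retrieval just gets quietly worse, which is why an explicit check is worth having.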
Vector Store Setup (Qdrant on Kubernetes)
```yaml
# qdrant-deployment.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: qdrant
  namespace: ai-infra
spec:
  serviceName: qdrant
  replicas: 1
  selector:
    matchLabels:
      app: qdrant
  template:
    metadata:
      labels:
        app: qdrant  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: qdrant
          image: qdrant/qdrant:latest  # pin a specific version tag in production
          ports:
            - containerPort: 6333  # HTTP API
            - containerPort: 6334  # gRPC API
          volumeMounts:
            - name: storage
              mountPath: /qdrant/storage
          resources:
            requests:
              memory: "2Gi"
              cpu: "500m"
            limits:
              memory: "4Gi"
  volumeClaimTemplates:
    - metadata:
        name: storage
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
```
```python
# Indexing pipeline
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(host="qdrant.ai-infra.svc", port=6333)

# Create collection; size must match the embedding model (1024 for bge-large-en-v1.5)
client.create_collection(
    collection_name="devops_docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

# Index chunks; metadata is assumed to be a list of dicts aligned with chunks
points = [
    PointStruct(
        id=i,
        vector=embeddings[i].tolist(),
        payload={
            "text": chunks[i],
            "source": "runbook",
            "section": metadata[i]["section"],
            "doc_id": metadata[i]["doc_id"],
        },
    )
    for i in range(len(chunks))
]
client.upsert(collection_name="devops_docs", points=points)
```
Retrieval Tuning
Basic Semantic Search (baseline)
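```python
def retrieve(query: str, top_k: int = 5) -> list[str]:
    # Embed the query with the same model used at index time
    query_embedding = model.encode(query, normalize_embeddings=True)
    # query_points supersedes the deprecated client.search in recent qdrant-client releases
    response = client.query_points(
        collection_name="devops_docs",
        query=query_embedding.tolist(),
        limit=top_k,
    )
    return [point.payload["text"] for point in response.points]
```
Hybrid Search (semantic + keyword)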
Pure semantic search misses exact terms. Hybrid adds BM25 keyword matching:
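```python
from qdrant_client.models import Prefetch, FusionQuery, Fusion

# Sparse + dense hybrid search.
# Assumes the collection defines named vectors "dense" and "sparse" (the
# single-vector collection created earlier does not), and that dense_vector,
# sparse_vector, and top_k are supplied by the caller.
results = client.query_points(
    collection_name="devops_docs",
    prefetch=[
        Prefetch(query=dense_vector, using="dense", limit=20),
        Prefetch(query=sparse_vector, using="sparse", limit=20),
    ],
    query=FusionQuery(fusion=Fusion.RRF),  # Reciprocal Rank Fusion
    limit=top_k,
)
```
Hybrid typically improves retrieval accuracy by 10–20% on technical content.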
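Reciprocal Rank Fusion itself is simple enough to sketch in a few lines. The version below is the textbook formula, where each document scores the sum of `1 / (k + rank)` over every ranking it appears in; `k = 60` is the commonly used constant, and Qdrant's internal implementation may differ in details. The function name and the toy document IDs are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60, top_k: int = 5) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_k]


dense_hits = ["a", "b", "c"]   # ranking from semantic search
sparse_hits = ["c", "a", "d"]  # ranking from keyword search
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

Because only ranks are used, the dense and sparse scores never have to be put on a common scale, which is the main reason RRF is a convenient default for hybrid search.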
Re-ranking
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, top_k: int = 5) -> list[str]:
    # Get more candidates first
    candidates = retrieve(query, top_k=20)
    # Re-rank with cross-encoder (more accurate, slower)
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)
    # Return top-k after re-ranking; sort on the score only, not the text
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]
```
Evaluation Metrics
Track these in production to detect retrieval drift:
```python
# RAGAS evaluation
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# test_cases: an evaluation dataset prepared beforehand (not shown here)
results = evaluate(
    dataset=test_cases,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
# Key metrics:
# faithfulness: is the answer grounded in retrieved context?
# context_precision: were the right chunks retrieved?
# answer_relevancy: does the answer actually answer the question?
```
Common Production Issues
| Issue | Symptom | Fix |
|---|---|---|
| Retrieval returns wrong docs | LLM gives confident wrong answers | Add hybrid search + reranking |
| Slow query latency | >500ms per query | Use smaller embedding model or HNSW tuning |
| Context window overflow | LLM cuts off or errors | Reduce top_k or chunk size |
| Stale embeddings | New docs not appearing | Add incremental indexing pipeline |
| Chunk splits mid-table | Table data in answers is wrong | Use structured chunking, keep tables as single chunks |
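For the stale-embeddings row, one possible shape for an incremental indexing step is to derive each point ID from the chunk's content, so unchanged chunks are skipped and removed chunks can be deleted. This is a sketch under assumptions: the helper names are hypothetical, and the deterministic `uuid5` IDs are chosen because Qdrant accepts UUID strings as point IDs.

```python
import uuid


def chunk_point_id(doc_id: str, text: str) -> str:
    """Deterministic ID: the same document and text always map to the same UUID."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_id}\n{text}"))


def plan_incremental_update(
    indexed_ids: set[str], current_chunks: list[tuple[str, str]]
) -> tuple[dict[str, tuple[str, str]], set[str]]:
    """Return (chunks to embed and upsert keyed by ID, IDs to delete from the index)."""
    current = {chunk_point_id(doc_id, text): (doc_id, text) for doc_id, text in current_chunks}
    to_upsert = {pid: chunk for pid, chunk in current.items() if pid not in indexed_ids}
    to_delete = set(indexed_ids) - set(current)
    return to_upsert, to_delete


# Hypothetical runbook before and after an edit to its second chunk.
v1 = [("runbook-1", "Restart the pod."), ("runbook-1", "Check the logs.")]
v2 = [("runbook-1", "Restart the pod."), ("runbook-1", "Check the logs with kubectl.")]

first_upsert, first_delete = plan_incremental_update(set(), v1)
second_upsert, second_delete = plan_incremental_update(set(first_upsert), v2)
print(len(first_upsert), len(second_upsert), len(second_delete))
```

Only the chunks in `to_upsert` need to be re-embedded, which keeps the cost of frequent re-indexing proportional to what actually changed.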
Build evaluation into your pipeline from day one. A RAG system that can't measure its own retrieval quality will silently degrade over time.
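A lightweight starting point, before wiring in a full framework, is to track retrieval-only metrics over a small labeled set. The sketch below computes recall@k and mean reciprocal rank in plain Python; the function names, chunk IDs, and test cases are invented for illustration, and a real test set would come from your own labeled queries.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(cases: list[tuple[list[str], set[str]]], k: int = 5) -> dict[str, float]:
    """Average both metrics over (retrieved IDs, relevant IDs) pairs."""
    if not cases:
        return {"recall_at_k": 0.0, "mrr": 0.0}
    return {
        "recall_at_k": sum(recall_at_k(r, rel, k) for r, rel in cases) / len(cases),
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in cases) / len(cases),
    }


# Hypothetical labeled cases: what the retriever returned vs. what it should have found.
cases = [
    (["c1", "c7", "c3"], {"c3"}),
    (["c9", "c2", "c4"], {"c2", "c8"}),
]
metrics = evaluate_retrieval(cases, k=3)
print(metrics)
```

Running this on a schedule against a fixed query set gives a simple drift signal: if recall@k drops after a re-index or a chunking change, retrieval regressed even if the LLM's answers still sound fluent.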