Build a DevOps RAG Pipeline with LlamaIndex on Kubernetes (2026)
Build a Retrieval-Augmented Generation (RAG) pipeline that answers questions from your runbooks, Confluence docs, and incident history, deployed on Kubernetes with LlamaIndex, Ollama, and the Qdrant vector database.
Your team has 500 runbooks, 10 years of incident post-mortems, and architecture docs scattered across Confluence and GitHub. When an alert fires at 2am, nobody can find the right runbook fast enough.
This guide builds a RAG (Retrieval-Augmented Generation) pipeline that ingests all your documentation and lets engineers query it in plain English — deployed on Kubernetes.
What We're Building
Engineer asks: "How do we fix OOMKilled pods in the payments namespace?"
        │
        ▼
Query → Embedding Model → Vector Search (Qdrant)
        │
        ▼
Relevant runbook chunks
        │
        ▼
LLM (Ollama/Claude) ← retrieved context
        │
        ▼
"Add memory limits to the payments deployment. Check Grafana for
 historical usage. See runbook #42."
Stack:
- LlamaIndex — RAG framework (ingestion, retrieval, synthesis)
- Qdrant — Vector database (stores embeddings, runs on K8s)
- Ollama — Self-hosted LLM inference (no API costs)
- FastAPI — Serve the RAG pipeline as an API
- Kubernetes — Deploy everything
Project Structure
devops-rag/
├── app/
│ ├── main.py # FastAPI app
│ ├── ingest.py # Document ingestion script
│ ├── rag.py # LlamaIndex RAG pipeline
│ └── requirements.txt
├── docs/ # Your runbooks, post-mortems, etc.
│ ├── runbooks/
│ └── incidents/
├── Dockerfile
└── k8s/
├── qdrant.yaml
├── ollama.yaml
└── rag-app.yaml
Step 1: Deploy Qdrant on Kubernetes
Qdrant is a high-performance vector database that stores document embeddings and serves similarity searches over them.
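Under the hood, the "vector search" step is just nearest-neighbor ranking by a distance metric (this guide's collection uses cosine distance). A minimal pure-Python sketch of the idea, with made-up 3-dimensional vectors standing in for real 768-dimensional embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for three runbook chunks (real ones have 768 dims).
chunks = {
    "oom-runbook":    [0.9, 0.1, 0.0],
    "dns-runbook":    [0.1, 0.9, 0.1],
    "deploy-runbook": [0.2, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # embedding of "How do I debug OOMKilled pods?"

# Rank chunks by similarity to the query, highest first.
ranked = sorted(chunks, key=lambda name: cosine_similarity(query, chunks[name]), reverse=True)
print(ranked[0])  # → oom-runbook
```

Qdrant does the same ranking, just over millions of vectors with an approximate-nearest-neighbor index instead of a linear scan.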
# k8s/qdrant.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qdrant
  namespace: devops-rag
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qdrant
  template:
    metadata:
      labels:
        app: qdrant
    spec:
      containers:
        - name: qdrant
          image: qdrant/qdrant:v1.9.0
          ports:
            - containerPort: 6333
              name: http
            - containerPort: 6334
              name: grpc
          env:
            - name: QDRANT__SERVICE__API_KEY
              valueFrom:
                secretKeyRef:
                  name: rag-secrets
                  key: qdrant-api-key
          volumeMounts:
            - name: qdrant-storage
              mountPath: /qdrant/storage
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
      volumes:
        - name: qdrant-storage
          persistentVolumeClaim:
            claimName: qdrant-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qdrant-pvc
  namespace: devops-rag
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: Service
metadata:
  name: qdrant
  namespace: devops-rag
spec:
  selector:
    app: qdrant
  ports:
    - name: http
      port: 6333
      targetPort: 6333

Create the namespace and the API-key Secret (referenced above and by the app later), then apply:

kubectl create namespace devops-rag
kubectl create secret generic rag-secrets -n devops-rag \
  --from-literal=qdrant-api-key="your-qdrant-api-key"
kubectl apply -f k8s/qdrant.yaml

Step 2: Deploy Ollama on Kubernetes
Ollama runs LLMs locally. We'll use it for both embeddings and generation.
# k8s/ollama.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: devops-rag
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          volumeMounts:
            - name: ollama-models
              mountPath: /root/.ollama
          resources:
            requests:
              memory: "4Gi"
              cpu: "2000m"
            limits:
              memory: "8Gi"
              cpu: "4000m"
      volumes:
        - name: ollama-models
          persistentVolumeClaim:
            claimName: ollama-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: devops-rag
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 30Gi
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: devops-rag
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434

kubectl apply -f k8s/ollama.yaml

# Pull models into Ollama
kubectl exec -n devops-rag deploy/ollama -- ollama pull nomic-embed-text  # embeddings
kubectl exec -n devops-rag deploy/ollama -- ollama pull qwen2.5:7b        # generation

Step 3: Build the RAG Pipeline
# app/requirements.txt
llama-index-core==0.12.0
llama-index-vector-stores-qdrant==0.4.0
llama-index-embeddings-ollama==0.5.0
llama-index-llms-ollama==0.5.0
qdrant-client==1.9.0
fastapi==0.115.0
uvicorn==0.30.0
python-multipart==0.0.9

# app/rag.py
from llama_index.core import VectorStoreIndex, Settings
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
from qdrant_client import QdrantClient
import os

QDRANT_URL = os.getenv("QDRANT_URL", "http://qdrant:6333")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY", "your-qdrant-api-key")
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://ollama:11434")
COLLECTION_NAME = "devops-docs"


def get_settings():
    """Configure global LlamaIndex settings: Ollama for both embeddings and generation."""
    Settings.embed_model = OllamaEmbedding(
        model_name="nomic-embed-text",
        base_url=OLLAMA_URL,
    )
    Settings.llm = Ollama(
        model="qwen2.5:7b",
        base_url=OLLAMA_URL,
        request_timeout=120.0,
    )
    Settings.chunk_size = 512
    Settings.chunk_overlap = 50


def get_qdrant_client() -> QdrantClient:
    return QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)


def get_query_engine():
    get_settings()
    client = get_qdrant_client()
    vector_store = QdrantVectorStore(
        client=client,
        collection_name=COLLECTION_NAME,
    )
    index = VectorStoreIndex.from_vector_store(vector_store)
    return index.as_query_engine(
        similarity_top_k=5,
        response_mode="compact",
    )
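The chunk_size=512 / chunk_overlap=50 settings control how documents are split before embedding: overlapping windows, so context that straddles a chunk boundary isn't lost. LlamaIndex splits on tokens and sentence boundaries; the character-based `chunk_text` below is a simplified hypothetical sketch of the same idea, not part of the LlamaIndex API:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows that overlap by `overlap` characters."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 1,000-character toy document.
doc = "".join(chr(65 + i % 26) for i in range(1000))
chunks = chunk_text(doc)
print(len(chunks))      # → 3
print(len(chunks[0]))   # → 512
# Adjacent chunks share their last/first 50 characters:
print(chunks[0][-50:] == chunks[1][:50])  # → True
```

Larger chunks give the LLM more context per hit but make retrieval less precise; 512/50 is a reasonable starting point for runbook-sized documents.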
# app/ingest.py
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client.models import Distance, VectorParams
from app.rag import get_settings, get_qdrant_client, COLLECTION_NAME
import sys


def ingest_documents(docs_path: str):
    get_settings()
    client = get_qdrant_client()

    # Create the collection if it doesn't exist.
    # Size 768 matches the output dimension of nomic-embed-text.
    existing = [c.name for c in client.get_collections().collections]
    if COLLECTION_NAME not in existing:
        client.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(size=768, distance=Distance.COSINE),
        )
        print(f"Created collection: {COLLECTION_NAME}")

    # Load documents (.md, .txt, and .rst files, recursively)
    print(f"Loading documents from {docs_path}...")
    documents = SimpleDirectoryReader(
        docs_path,
        recursive=True,
        required_exts=[".md", ".txt", ".rst"],
    ).load_data()
    print(f"Loaded {len(documents)} documents. Generating embeddings...")

    vector_store = QdrantVectorStore(
        client=client,
        collection_name=COLLECTION_NAME,
    )
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    # This chunks the documents, generates embeddings, and stores them in Qdrant
    VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        show_progress=True,
    )
    print(f"Ingestion complete. {len(documents)} documents indexed.")


if __name__ == "__main__":
    docs_path = sys.argv[1] if len(sys.argv) > 1 else "./docs"
    ingest_documents(docs_path)
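If you later want queries scoped by service (see the metadata-filtering idea under Production Improvements), each document needs a service tag at ingestion time. SimpleDirectoryReader accepts a `file_metadata` callable for this; `service_metadata` below is a hypothetical helper that derives tags from the directory layout, assuming paths like docs/runbooks/payments/oom.md:

```python
from pathlib import Path

def service_metadata(file_path: str) -> dict:
    """Derive document metadata from a path like docs/runbooks/payments/oom.md."""
    parts = Path(file_path).parts
    # Assume the directory right under runbooks/ or incidents/ names the service.
    for marker in ("runbooks", "incidents"):
        if marker in parts:
            idx = parts.index(marker)
            if idx + 2 < len(parts):  # marker/<service>/<file>
                return {"doc_type": marker, "service": parts[idx + 1]}
            return {"doc_type": marker}
    return {}

print(service_metadata("docs/runbooks/payments/oom.md"))
# → {'doc_type': 'runbooks', 'service': 'payments'}
```

Pass it as `SimpleDirectoryReader(docs_path, file_metadata=service_metadata, ...)` and the metadata is stored alongside each chunk in Qdrant, ready for filtered retrieval.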
# app/main.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from app.rag import get_query_engine
import logging

logger = logging.getLogger(__name__)

app = FastAPI(title="DevOps RAG API")

# Cache the query engine across requests
_query_engine = None


def get_engine():
    global _query_engine
    if _query_engine is None:
        _query_engine = get_query_engine()
    return _query_engine


class QueryRequest(BaseModel):
    question: str


class QueryResponse(BaseModel):
    answer: str
    sources: list[str]


@app.get("/health")
async def health():
    return {"status": "ok"}


@app.post("/query", response_model=QueryResponse)
async def query(request: QueryRequest):
    if not request.question.strip():
        raise HTTPException(status_code=400, detail="Question cannot be empty")
    try:
        engine = get_engine()
        response = engine.query(request.question)
        sources = []
        if hasattr(response, "source_nodes"):
            for node in response.source_nodes:
                fname = node.metadata.get("file_name", "unknown")
                if fname not in sources:
                    sources.append(fname)
        return QueryResponse(
            answer=str(response),
            sources=sources,
        )
    except Exception as e:
        logger.error(f"Query failed: {e}")
        raise HTTPException(status_code=500, detail=str(e))

Step 4: Dockerfile
FROM python:3.12-slim

WORKDIR /app

COPY app/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app/ ./app/
COPY docs/ ./docs/

# Note: ingesting at build time (RUN python -m app.ingest ./docs) only works if
# Qdrant and Ollama are reachable from the build environment, which they usually
# aren't. Run ingestion as a Kubernetes Job instead (shown below).

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

docker build -t devops-rag:latest .
docker tag devops-rag:latest your-registry/devops-rag:latest
docker push your-registry/devops-rag:latest

Step 5: Deploy the RAG App
# k8s/rag-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: devops-rag
  namespace: devops-rag
spec:
  replicas: 2
  selector:
    matchLabels:
      app: devops-rag
  template:
    metadata:
      labels:
        app: devops-rag
    spec:
      containers:
        - name: devops-rag
          image: your-registry/devops-rag:latest
          ports:
            - containerPort: 8000
          env:
            - name: QDRANT_URL
              value: "http://qdrant:6333"
            - name: QDRANT_API_KEY
              valueFrom:
                secretKeyRef:
                  name: rag-secrets
                  key: qdrant-api-key
            - name: OLLAMA_URL
              value: "http://ollama:11434"
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: devops-rag
  namespace: devops-rag
spec:
  selector:
    app: devops-rag
  ports:
    - port: 80
      targetPort: 8000

kubectl apply -f k8s/rag-app.yaml

Step 6: Test It
# Port-forward to test locally
kubectl port-forward -n devops-rag svc/devops-rag 8000:80

# Query the RAG pipeline
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I debug OOMKilled pods?"}'

Response:

{
  "answer": "OOMKilled pods occur when a container exceeds its memory limit. Steps to debug: 1) Check limits with kubectl describe pod <pod-name>. 2) Review Grafana dashboard for memory usage trends. 3) Increase memory limits in the deployment spec. See runbook: kubernetes-oom-runbook.md",
  "sources": ["runbooks/kubernetes-oom-runbook.md", "incidents/2025-11-payment-oom.md"]
}

Add New Documents Without Rebuilding
Instead of building into the Docker image, use a Kubernetes Job for ingestion:
apiVersion: batch/v1
kind: Job
metadata:
name: rag-ingest
namespace: devops-rag
spec:
template:
spec:
containers:
- name: ingest
image: your-registry/devops-rag:latest
command: ["python", "-m", "app.ingest", "/docs"]
volumeMounts:
- name: docs
mountPath: /docs
env:
- name: QDRANT_URL
value: "http://qdrant:6333"
volumes:
- name: docs
configMap:
name: runbooks-configmap
restartPolicy: NeverRun this job whenever you update your runbooks. The app pods don't need to restart.
Production Improvements

- Use the Claude API instead of Ollama for better quality (set ANTHROPIC_API_KEY and swap the LLM)
- Add a Slack bot frontend so engineers can query from the #incidents channel
- Automatic re-ingestion via CI/CD when runbooks change in Git
- Metadata filtering — tag docs by service so queries can scope to service=payments
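The CI/CD re-ingestion bullet can be as simple as a workflow that refreshes the docs ConfigMap and re-runs the ingest Job whenever anything under docs/ changes. A hedged GitHub Actions sketch — the workflow path, secret name, and kubeconfig handling are assumptions to adapt to your CI system:

```yaml
# .github/workflows/reingest.yml (hypothetical; adapt to your CI)
name: Re-ingest runbooks
on:
  push:
    branches: [main]
    paths: ["docs/**"]

jobs:
  reingest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure cluster access
        run: echo "${{ secrets.KUBECONFIG_B64 }}" | base64 -d > kubeconfig
      - name: Refresh docs ConfigMap and re-run the ingest Job
        env:
          KUBECONFIG: kubeconfig
        run: |
          # Note: --from-file on a directory is not recursive; flatten or
          # create one ConfigMap per subdirectory if your docs are nested.
          kubectl create configmap runbooks-configmap -n devops-rag \
            --from-file=docs/ --dry-run=client -o yaml | kubectl apply -f -
          kubectl delete job rag-ingest -n devops-rag --ignore-not-found
          kubectl apply -f k8s/ingest-job.yaml
```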
Learn More
For building production RAG systems, the LlamaIndex documentation is excellent and kept up to date. For the ML/AI foundations, LLM Engineering on Udemy covers RAG patterns in depth.
A DevOps RAG pipeline that actually knows your infrastructure turns 2am incidents from stressful guesswork into structured problem-solving. Start with 20–30 of your most-used runbooks and expand from there.