
Build a DevOps RAG Pipeline with LlamaIndex on Kubernetes (2026)

Build a Retrieval-Augmented Generation (RAG) pipeline that answers questions from your runbooks, Confluence docs, and incident history. Deploy it on Kubernetes with LlamaIndex, Ollama, and the Qdrant vector database.

DevOpsBoys · Apr 23, 2026 · 6 min read

Your team has 500 runbooks, 10 years of incident post-mortems, and architecture docs scattered across Confluence and GitHub. When an alert fires at 2am, nobody can find the right runbook fast enough.

This guide builds a RAG (Retrieval-Augmented Generation) pipeline that ingests all your documentation and lets engineers query it in plain English — deployed on Kubernetes.


What We're Building

Engineer asks: "How do we fix OOMKilled pods in the payments namespace?"
    │
    ▼
Query → Embedding Model → Vector Search (Qdrant)
                                    │
                              Relevant runbook chunks
                                    │
                              LLM (Ollama/Claude) ← context
                                    │
                          "Add memory limits to the payments 
                           deployment. Check grafana for 
                           historical usage. See runbook #42."

Stack:

  • LlamaIndex — RAG framework (ingestion, retrieval, synthesis)
  • Qdrant — Vector database (stores embeddings, runs on K8s)
  • Ollama — Self-hosted LLM inference (no API costs)
  • FastAPI — Serve the RAG pipeline as an API
  • Kubernetes — Deploy everything
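The retrieval step in the diagram above is, at its core, nearest-neighbor search over embedding vectors. A toy sketch of the idea, using hand-made 3-dimensional vectors as stand-ins for real embeddings (which are hundreds of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Pretend embeddings for three runbook chunks (real ones are ~768-dim).
chunks = {
    "oom-runbook":   [0.9, 0.1, 0.0],
    "dns-runbook":   [0.1, 0.8, 0.2],
    "certs-runbook": [0.0, 0.2, 0.9],
}
query_vec = [0.85, 0.15, 0.05]  # pretend embedding of the OOMKilled question

# Rank chunks by similarity to the query, highest first.
ranked = sorted(
    chunks,
    key=lambda name: cosine_similarity(query_vec, chunks[name]),
    reverse=True,
)
print(ranked[0])  # → oom-runbook
```

Qdrant does exactly this ranking, but over millions of vectors with indexes that avoid comparing against every chunk.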

Project Structure

devops-rag/
├── app/
│   ├── main.py          # FastAPI app
│   ├── ingest.py        # Document ingestion script
│   ├── rag.py           # LlamaIndex RAG pipeline
│   └── requirements.txt
├── docs/                # Your runbooks, post-mortems, etc.
│   ├── runbooks/
│   └── incidents/
├── Dockerfile
└── k8s/
    ├── qdrant.yaml
    ├── ollama.yaml
    └── rag-app.yaml

Step 1: Deploy Qdrant on Kubernetes

Qdrant is a high-performance vector database that stores document embeddings.

yaml
# k8s/qdrant.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qdrant
  namespace: devops-rag
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qdrant
  template:
    metadata:
      labels:
        app: qdrant
    spec:
      containers:
      - name: qdrant
        image: qdrant/qdrant:v1.9.0
        ports:
        - containerPort: 6333
          name: http
        - containerPort: 6334
          name: grpc
        env:
        - name: QDRANT__SERVICE__API_KEY
          valueFrom:
            secretKeyRef:
              name: rag-secrets
              key: qdrant-api-key
        volumeMounts:
        - name: qdrant-storage
          mountPath: /qdrant/storage
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
      volumes:
      - name: qdrant-storage
        persistentVolumeClaim:
          claimName: qdrant-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qdrant-pvc
  namespace: devops-rag
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: Service
metadata:
  name: qdrant
  namespace: devops-rag
spec:
  selector:
    app: qdrant
  ports:
  - name: http
    port: 6333
    targetPort: 6333
bash
kubectl create namespace devops-rag
# Store the Qdrant API key in a Secret (replace with a strong value)
kubectl create secret generic rag-secrets -n devops-rag \
  --from-literal=qdrant-api-key="your-qdrant-api-key"
kubectl apply -f k8s/qdrant.yaml

Step 2: Deploy Ollama on Kubernetes

Ollama runs LLMs locally. We'll use it for both embeddings and generation.

yaml
# k8s/ollama.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: devops-rag
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest  # pin a specific tag in production
        ports:
        - containerPort: 11434
        volumeMounts:
        - name: ollama-models
          mountPath: /root/.ollama
        resources:
          requests:
            memory: "4Gi"
            cpu: "2000m"
          limits:
            memory: "8Gi"
            cpu: "4000m"
      volumes:
      - name: ollama-models
        persistentVolumeClaim:
          claimName: ollama-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: devops-rag
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 30Gi
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: devops-rag
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
bash
kubectl apply -f k8s/ollama.yaml
 
# Pull models into Ollama
kubectl exec -n devops-rag deploy/ollama -- ollama pull nomic-embed-text  # embeddings
kubectl exec -n devops-rag deploy/ollama -- ollama pull qwen2.5:7b         # generation

Step 3: Build the RAG Pipeline

text
# app/requirements.txt
llama-index-core==0.12.0
llama-index-vector-stores-qdrant==0.4.0
llama-index-embeddings-ollama==0.5.0
llama-index-llms-ollama==0.5.0
qdrant-client==1.9.0
fastapi==0.115.0
uvicorn==0.30.0
python-multipart==0.0.9
python
# app/rag.py
from llama_index.core import VectorStoreIndex, Settings
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
from qdrant_client import QdrantClient
import os
 
QDRANT_URL = os.getenv("QDRANT_URL", "http://qdrant:6333")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY", "your-qdrant-api-key")
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://ollama:11434")
COLLECTION_NAME = "devops-docs"
 
 
def get_settings():
    Settings.embed_model = OllamaEmbedding(
        model_name="nomic-embed-text",
        base_url=OLLAMA_URL,
    )
    Settings.llm = Ollama(
        model="qwen2.5:7b",
        base_url=OLLAMA_URL,
        request_timeout=120.0,
    )
    Settings.chunk_size = 512
    Settings.chunk_overlap = 50
 
 
def get_qdrant_client() -> QdrantClient:
    return QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
 
 
def get_query_engine():
    get_settings()
    client = get_qdrant_client()
    
    vector_store = QdrantVectorStore(
        client=client,
        collection_name=COLLECTION_NAME,
    )
    
    index = VectorStoreIndex.from_vector_store(vector_store)
    
    return index.as_query_engine(
        similarity_top_k=5,
        response_mode="compact",
    )
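The chunk_size and chunk_overlap settings control how documents are split before embedding. A character-based sketch of overlapping chunking, just to show the idea (LlamaIndex's real splitter is token-aware and respects sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters, each window
    overlapping the previous one by `overlap` characters so that context
    isn't lost at chunk boundaries."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "x" * 1000
chunks = chunk_text(doc, chunk_size=512, overlap=50)
print(len(chunks), [len(c) for c in chunks])  # → 3 [512, 512, 76]
```

Smaller chunks retrieve more precisely; larger chunks give the LLM more surrounding context. 512 tokens with 50 of overlap is a reasonable starting point for runbook-style prose.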
python
# app/ingest.py
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client.models import Distance, VectorParams
from app.rag import get_settings, get_qdrant_client, COLLECTION_NAME
import sys
 
 
def ingest_documents(docs_path: str):
    get_settings()
    client = get_qdrant_client()
    
    # Create collection if it doesn't exist
    existing = [c.name for c in client.get_collections().collections]
    if COLLECTION_NAME not in existing:
        client.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(size=768, distance=Distance.COSINE),  # nomic-embed-text produces 768-dim vectors
        )
        print(f"Created collection: {COLLECTION_NAME}")
    
    # Load Markdown, plain-text, and reStructuredText documents
    print(f"Loading documents from {docs_path}...")
    documents = SimpleDirectoryReader(
        docs_path,
        recursive=True,
        required_exts=[".md", ".txt", ".rst"],
    ).load_data()
    
    print(f"Loaded {len(documents)} documents. Generating embeddings...")
    
    vector_store = QdrantVectorStore(
        client=client,
        collection_name=COLLECTION_NAME,
    )
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    
    # This generates embeddings and stores them in Qdrant
    VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        show_progress=True,
    )
    
    print(f"Ingestion complete. {len(documents)} documents indexed.")
 
 
if __name__ == "__main__":
    docs_path = sys.argv[1] if len(sys.argv) > 1 else "./docs"
    ingest_documents(docs_path)
python
# app/main.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from app.rag import get_query_engine
import logging
 
logger = logging.getLogger(__name__)
app = FastAPI(title="DevOps RAG API")
 
# Cache the query engine
_query_engine = None
 
 
def get_engine():
    global _query_engine
    if _query_engine is None:
        _query_engine = get_query_engine()
    return _query_engine
 
 
class QueryRequest(BaseModel):
    question: str
 
 
class QueryResponse(BaseModel):
    answer: str
    sources: list[str]
 
 
@app.get("/health")
async def health():
    return {"status": "ok"}
 
 
@app.post("/query", response_model=QueryResponse)
async def query(request: QueryRequest):
    if not request.question.strip():
        raise HTTPException(status_code=400, detail="Question cannot be empty")
    
    try:
        engine = get_engine()
        response = engine.query(request.question)
        
        sources = []
        if hasattr(response, "source_nodes"):
            for node in response.source_nodes:
                fname = node.metadata.get("file_name", "unknown")
                if fname not in sources:
                    sources.append(fname)
        
        return QueryResponse(
            answer=str(response),
            sources=sources,
        )
    except Exception as e:
        logger.error(f"Query failed: {e}")
        raise HTTPException(status_code=500, detail=str(e))

Step 4: Dockerfile

dockerfile
FROM python:3.12-slim
 
WORKDIR /app
 
COPY app/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
 
COPY app/ ./app/
COPY docs/ ./docs/
 
# Ingestion needs network access to Qdrant and Ollama, which are not
# reachable during `docker build`. Run ingestion as a Kubernetes Job
# or init container instead (see "Add New Documents Without Rebuilding" below).
 
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
bash
docker build -t devops-rag:latest .
docker tag devops-rag:latest your-registry/devops-rag:latest
docker push your-registry/devops-rag:latest

Step 5: Deploy the RAG App

yaml
# k8s/rag-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: devops-rag
  namespace: devops-rag
spec:
  replicas: 2
  selector:
    matchLabels:
      app: devops-rag
  template:
    metadata:
      labels:
        app: devops-rag
    spec:
      containers:
      - name: devops-rag
        image: your-registry/devops-rag:latest
        ports:
        - containerPort: 8000
        env:
        - name: QDRANT_URL
          value: "http://qdrant:6333"
        - name: QDRANT_API_KEY
          valueFrom:
            secretKeyRef:
              name: rag-secrets
              key: qdrant-api-key
        - name: OLLAMA_URL
          value: "http://ollama:11434"
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: devops-rag
  namespace: devops-rag
spec:
  selector:
    app: devops-rag
  ports:
  - port: 80
    targetPort: 8000
bash
kubectl apply -f k8s/rag-app.yaml

Step 6: Test It

bash
# Port-forward to test locally
kubectl port-forward -n devops-rag svc/devops-rag 8000:80
 
# Query the RAG pipeline
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I debug OOMKilled pods?"}'

Response:

json
{
  "answer": "OOMKilled pods occur when a container exceeds its memory limit. Steps to debug: 1) Check limits with kubectl describe pod <pod-name>. 2) Review Grafana dashboard for memory usage trends. 3) Increase memory limits in the deployment spec. See runbook: kubernetes-oom-runbook.md",
  "sources": ["runbooks/kubernetes-oom-runbook.md", "incidents/2025-11-payment-oom.md"]
}

Add New Documents Without Rebuilding

Instead of building into the Docker image, use a Kubernetes Job for ingestion:

yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: rag-ingest
  namespace: devops-rag
spec:
  template:
    spec:
      containers:
      - name: ingest
        image: your-registry/devops-rag:latest
        command: ["python", "-m", "app.ingest", "/docs"]
        volumeMounts:
        - name: docs
          mountPath: /docs
        env:
        - name: QDRANT_URL
          value: "http://qdrant:6333"
        - name: QDRANT_API_KEY
          valueFrom:
            secretKeyRef:
              name: rag-secrets
              key: qdrant-api-key
      volumes:
      - name: docs
        configMap:
          name: runbooks-configmap
      restartPolicy: Never

Run this job whenever you update your runbooks. The app pods don't need to restart.
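For example, refreshing the ConfigMap from a local docs directory and re-running the job might look like this (the manifest filename `k8s/ingest-job.yaml` is illustrative; the ConfigMap and Job names match the manifest above):

```shell
# Rebuild the ConfigMap from the local runbooks directory
kubectl create configmap runbooks-configmap -n devops-rag \
  --from-file=docs/runbooks/ \
  --dry-run=client -o yaml | kubectl apply -f -

# Jobs are immutable once created, so delete before re-applying
kubectl delete job rag-ingest -n devops-rag --ignore-not-found
kubectl apply -f k8s/ingest-job.yaml
```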


Production Improvements

  1. Use Claude API instead of Ollama for better quality (set ANTHROPIC_API_KEY and swap the LLM)
  2. Add Slack bot frontend so engineers can query from the #incidents channel
  3. Automatic re-ingestion via CI/CD when runbooks change in Git
  4. Metadata filtering — tag docs by service so queries can scope to service=payments
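For improvement #1, the deployment-side change is just supplying the key as a Secret-backed env var; a sketch (the secret name and key here are illustrative), with the code change being to point `Settings.llm` at Claude instead of Ollama:

```yaml
# Added to the devops-rag container's env in k8s/rag-app.yaml
- name: ANTHROPIC_API_KEY
  valueFrom:
    secretKeyRef:
      name: rag-secrets
      key: anthropic-api-key
```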

Learn More

For building production RAG systems, LlamaIndex documentation is excellent and up-to-date. For the ML/AI foundations, LLM Engineering on Udemy covers RAG patterns in depth.

A DevOps RAG pipeline that actually knows your infrastructure turns 2am incidents from stressful guesswork into structured problem-solving. Start with 20–30 of your most-used runbooks and expand from there.
