Set Up Qdrant Vector Database on Kubernetes for RAG Applications
Qdrant is a fast open-source vector database for RAG pipelines. Here's how to deploy it on Kubernetes with persistent storage, store embeddings in a collection, and connect it to LangChain.
If you're building RAG (Retrieval-Augmented Generation) applications with LLMs, you need a vector database. Qdrant is fast, open source, and runs well on Kubernetes. Here's the full setup.
What is Qdrant?
Qdrant stores vectors (floating-point arrays that represent text/image embeddings) and lets you search for semantically similar items.
In a RAG pipeline:
Your docs → Embedding model → Vectors → Qdrant
User query → Embedding model → Query vector → Qdrant similarity search → Relevant docs → LLM
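The flow above boils down to nearest-neighbor search over embeddings. As a rough illustration only, here is a brute-force version in plain Python. The three-dimensional vectors and document ids are made up for the example, and Qdrant itself uses an approximate index (HNSW) rather than scanning every vector, so treat this as a sketch of the idea, not of Qdrant's internals.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        vectors,
        key=lambda doc_id: cosine_similarity(query, vectors[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical 3-dimensional "embeddings"; real models produce hundreds of dimensions.
stored = {
    "k8s-runbook": [0.9, 0.1, 0.0],
    "terraform-guide": [0.1, 0.9, 0.0],
    "cooking-blog": [0.0, 0.1, 0.9],
}
query_vector = [0.8, 0.2, 0.0]

print(top_k(query_vector, stored, k=2))  # ['k8s-runbook', 'terraform-guide']
```

The ids that come back are what your application would use to fetch the matching text and hand it to the LLM as context.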
Qdrant vs alternatives:
- Qdrant: Rust-based, fast queries, straightforward to self-host
- Pinecone: Fully managed, no self-hosting option
- Weaviate: Feature-rich but heavier
- Chroma: Simple, Python-native, great for dev but not production-grade
Deploy on Kubernetes
Option 1: Helm Chart (Recommended)
helm repo add qdrant https://qdrant.github.io/qdrant-helm
helm repo update
helm install qdrant qdrant/qdrant \
  --namespace qdrant \
  --create-namespace \
  --set replicaCount=1 \
  --set persistence.size=10Gi \
  --set persistence.storageClassName=gp3
Option 2: Custom Manifests (More Control)
# qdrant-deployment.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: qdrant
  namespace: qdrant
spec:
  serviceName: qdrant
  replicas: 1
  selector:
    matchLabels:
      app: qdrant
  template:
    metadata:
      labels:
        app: qdrant
    spec:
      containers:
        - name: qdrant
          image: qdrant/qdrant:v1.9.0
          ports:
            - containerPort: 6333 # HTTP REST API
            - containerPort: 6334 # gRPC
          env:
            - name: QDRANT__SERVICE__API_KEY
              valueFrom:
                secretKeyRef:
                  name: qdrant-secret
                  key: api-key
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "2"
          volumeMounts:
            - name: qdrant-storage
              mountPath: /qdrant/storage
          readinessProbe:
            httpGet:
              path: /healthz
              port: 6333
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /healthz
              port: 6333
            initialDelaySeconds: 30
            periodSeconds: 30
  volumeClaimTemplates:
    - metadata:
        name: qdrant-storage
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: gp3
        resources:
          requests:
            storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: qdrant
  namespace: qdrant
spec:
  selector:
    app: qdrant
  ports:
    - name: http
      port: 6333
      targetPort: 6333
    - name: grpc
      port: 6334
      targetPort: 6334
  type: ClusterIP
# Create the namespace first if you skipped the Helm install
kubectl create namespace qdrant
# Create API key secret
kubectl create secret generic qdrant-secret \
  --from-literal=api-key=your-strong-api-key-here \
  -n qdrant
kubectl apply -f qdrant-deployment.yaml
Verify Qdrant is Running
# Port-forward to test locally
kubectl port-forward svc/qdrant 6333:6333 -n qdrant
# Check health
curl http://localhost:6333/healthz
# Check collections (should be empty initially)
curl http://localhost:6333/collections \
  -H "api-key: your-strong-api-key-here"
Connect Your RAG Application
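The LangChain example in this section splits documents with chunk_size=500 and chunk_overlap=50 before embedding them. To make those two parameters concrete, here is a simplified sketch of overlapping chunking in plain Python. `chunk_text` is a hypothetical helper written for this illustration, not part of LangChain, and it uses fixed-size character windows, whereas RecursiveCharacterTextSplitter also tries to break on separators such as paragraph and line breaks.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# A 1200-character sample string stands in for a real document.
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)

print([len(c) for c in chunks])           # [500, 500, 300]
print(chunks[0][-50:] == chunks[1][:50])  # True: neighbors share 50 characters
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing a few more vectors.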
With LangChain
# rag_pipeline.py
from langchain_community.vectorstores import Qdrant
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_anthropic import ChatAnthropic
from langchain.chains import RetrievalQA
from qdrant_client import QdrantClient
import os

QDRANT_URL = "http://qdrant.qdrant.svc.cluster.local:6333"
QDRANT_API_KEY = os.environ["QDRANT_API_KEY"]
COLLECTION_NAME = "devops-docs"

# Initialize embedding model
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Connect to Qdrant
qdrant_client = QdrantClient(
    url=QDRANT_URL,
    api_key=QDRANT_API_KEY
)


def ingest_documents(documents: list[str], source_names: list[str]):
    """Split and embed documents, store in Qdrant"""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50
    )
    texts = []
    metadatas = []
    for doc, source in zip(documents, source_names):
        chunks = splitter.split_text(doc)
        texts.extend(chunks)
        metadatas.extend([{"source": source, "chunk": i}
                          for i in range(len(chunks))])
    # Store in Qdrant
    vectorstore = Qdrant.from_texts(
        texts=texts,
        embedding=embeddings,
        metadatas=metadatas,
        url=QDRANT_URL,
        api_key=QDRANT_API_KEY,
        collection_name=COLLECTION_NAME
    )
    print(f"Ingested {len(texts)} chunks into Qdrant")
    return vectorstore


def create_qa_chain():
    """Create a RAG QA chain using Qdrant + Claude"""
    vectorstore = Qdrant(
        client=qdrant_client,
        collection_name=COLLECTION_NAME,
        embeddings=embeddings
    )
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 4}
    )
    llm = ChatAnthropic(
        model="claude-sonnet-4-6",
        api_key=os.environ["ANTHROPIC_API_KEY"]
    )
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True
    )
    return qa_chain


# Usage
if __name__ == "__main__":
    # Ingest your DevOps runbooks/docs
    docs = [
        "To debug a Kubernetes pod: kubectl describe pod <name>...",
        "Terraform plan shows unexpected destroy when...",
    ]
    sources = ["k8s-runbook.md", "terraform-guide.md"]
    ingest_documents(docs, sources)
    qa = create_qa_chain()
    result = qa.invoke({"query": "How do I debug a crashing Kubernetes pod?"})
    print("Answer:", result["result"])
    print("\nSources:")
    for doc in result["source_documents"]:
        print(f" - {doc.metadata['source']}")
Production Configuration
Enable Qdrant Cluster Mode (Multiple Replicas)
# For production: 3-node Qdrant cluster
# --reuse-values keeps the persistence settings from the original install
helm upgrade qdrant qdrant/qdrant \
  --namespace qdrant \
  --reuse-values \
  --set replicaCount=3 \
  --set config.cluster.enabled=true \
  --set config.cluster.p2p.port=6335 \
  --set service.type=ClusterIP
Add Ingress for External Access
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: qdrant-ingress
  namespace: qdrant
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
  rules:
    - host: qdrant.internal.mycompany.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: qdrant
                port:
                  number: 6333
Backup Collections to S3
import os

import boto3
import requests
from qdrant_client import QdrantClient

QDRANT_URL = "http://qdrant.qdrant.svc.cluster.local:6333"
QDRANT_API_KEY = os.environ["QDRANT_API_KEY"]


def backup_qdrant_to_s3(collection_name: str, s3_bucket: str):
    client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
    # Create snapshot
    snapshot_info = client.create_snapshot(collection_name=collection_name)
    snapshot_name = snapshot_info.name
    # Download snapshot via the REST API, streaming the response body
    response = requests.get(
        f"{QDRANT_URL}/collections/{collection_name}/snapshots/{snapshot_name}",
        headers={"api-key": QDRANT_API_KEY},
        stream=True,
        timeout=300,
    )
    response.raise_for_status()
    # Upload to S3 without loading the whole snapshot into memory
    s3 = boto3.client("s3")
    key = f"qdrant-backups/{collection_name}/{snapshot_name}"
    s3.upload_fileobj(response.raw, s3_bucket, key)
    print(f"Backup complete: s3://{s3_bucket}/{key}")
Sizing Guide
| Use Case | Vectors | RAM Needed | Storage |
|---|---|---|---|
| Dev/testing | <100K | 512Mi | 2Gi |
| Small app | 100K–1M | 2Gi | 10Gi |
| Production | 1M–10M | 4–8Gi | 50Gi |
| Large scale | 10M+ | 16Gi+ | 200Gi+ |
Vector size matters: stored as 32-bit floats, all-MiniLM-L6-v2 (384 dims) uses ~1.5KB per vector, while OpenAI text-embedding-3-small (1536 dims) uses ~6KB per vector. Both figures exclude index and payload overhead.
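To see where those per-vector numbers come from, and how they scale, here is a small back-of-the-envelope calculator. It assumes 4 bytes per dimension (float32) and a 1.5x overhead multiplier for the index and metadata. That multiplier is a rule-of-thumb assumption, and the function names are made up for this sketch, so measure real memory usage before committing to resource limits.

```python
def raw_vector_bytes(dimensions: int, bytes_per_value: int = 4) -> int:
    """Size of one float32 vector, excluding index and payload overhead."""
    return dimensions * bytes_per_value


def estimate_ram_gib(num_vectors: int, dimensions: int, overhead: float = 1.5) -> float:
    """Rough RAM estimate in GiB; the 1.5x overhead multiplier is an assumption."""
    return num_vectors * raw_vector_bytes(dimensions) * overhead / 1024**3


print(raw_vector_bytes(384))    # 1536 bytes, about 1.5 KB
print(raw_vector_bytes(1536))   # 6144 bytes, about 6 KB

# One million vectors held fully in memory
print(round(estimate_ram_gib(1_000_000, 384), 2))    # 2.15
print(round(estimate_ram_gib(1_000_000, 1536), 2))   # 8.58
```

With larger embeddings the estimate climbs quickly, so if it lands above your node size, Qdrant's on-disk vector storage and quantization options are worth a look before adding RAM.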
Qdrant on Kubernetes gives you a production-grade vector database that you fully control — no per-query pricing, no vendor lock-in.