Run DeepSeek R1 on Kubernetes — Self-Hosted LLM Guide (2026)
Deploy DeepSeek R1 on your own Kubernetes cluster using Ollama or vLLM. Includes GPU node setup, Helm deployment, persistent model storage, and an OpenAI-compatible API.
DeepSeek R1 is one of the best open-source reasoning models available. Here's how to self-host it on Kubernetes — full GPU setup, persistent storage, and an OpenAI-compatible API endpoint.
Prerequisites
- Kubernetes cluster with NVIDIA GPU nodes (an A10G handles the 7B/8B models; A100 or H100 class GPUs are needed for 70B)
- NVIDIA GPU Operator (installed in Step 1 if you don't have it yet)
- Enough VRAM for your target model: roughly 8GB for DeepSeek R1 7B, 80GB for 70B (see the sizing table at the end)
Step 1: Install NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true
# Verify GPU nodes are ready (GPU Feature Discovery labels them nvidia.com/gpu.present=true)
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe node <gpu-node> | grep nvidia
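Before deploying anything, it's worth confirming the driver and container toolkit work end to end. A quick throwaway pod that requests one GPU and prints nvidia-smi output (the CUDA image tag is only an example; use any CUDA base image your cluster can pull):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.4.1-base-ubuntu22.04 # example tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
Apply it and check the logs:
kubectl apply -f gpu-smoke-test.yaml
kubectl logs gpu-smoke-test # should print the GPU table once the pod completes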
Step 2: Deploy DeepSeek R1 with Ollama
Ollama is the simplest way to run DeepSeek R1 on Kubernetes.
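All of the manifests in this guide target an ai namespace, which isn't created anywhere above; create it before applying anything (assuming you don't already have a namespace you'd rather use):
kubectl create namespace ai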
PersistentVolumeClaim for model storage:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: ai
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi # DeepSeek R1 7B needs ~4.7GB; increase for larger models
  storageClassName: gp3 # Use your storage class
Ollama Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        env:
        - name: OLLAMA_MODELS
          value: /models
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: models
          mountPath: /models
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: ollama-models
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
Pull DeepSeek R1:
kubectl apply -f ollama.yaml
# Wait for the pod to be ready
kubectl rollout status deployment/ollama -n ai
# Pull the model (runs inside the pod)
kubectl exec -it deployment/ollama -n ai -- ollama pull deepseek-r1:7b
# For the 70B model (needs 80GB+ VRAM):
# kubectl exec -it deployment/ollama -n ai -- ollama pull deepseek-r1:70b
Test it:
kubectl exec -it deployment/ollama -n ai -- \
  ollama run deepseek-r1:7b "Explain Kubernetes resource limits in simple terms"
Step 3: OpenAI-Compatible API
Ollama exposes an OpenAI-compatible API at /v1. Add an Ingress or use port-forward:
kubectl port-forward svc/ollama -n ai 11434:11434
# Test OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1:7b",
    "messages": [
      {"role": "user", "content": "Write a Kubernetes readiness probe for a Python Flask app"}
    ]
  }'
Point any OpenAI SDK to http://ollama.ai.svc.cluster.local:11434/v1 for cluster-internal access.
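For example, with the official openai Python package (v1+) from another pod in the cluster; the model name matches what was pulled above, and the api_key value is arbitrary because Ollama ignores it:
from openai import OpenAI

# Point the client at the in-cluster Ollama Service instead of api.openai.com
client = OpenAI(
    base_url="http://ollama.ai.svc.cluster.local:11434/v1",
    api_key="ollama",  # required by the SDK, ignored by Ollama
)

response = client.chat.completions.create(
    model="deepseek-r1:7b",
    messages=[{"role": "user", "content": "Summarize what a Kubernetes PVC does."}],
)
print(response.choices[0].message.content)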
Step 4: Deploy with vLLM (Better for Production)
vLLM gives better throughput and supports continuous batching:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-vllm
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-vllm
  template:
    metadata:
      labels:
        app: deepseek-vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model
        - deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
        - --tensor-parallel-size
        - "1"
        - --gpu-memory-utilization
        - "0.9"
        - --max-model-len
        - "8192"
        ports:
        - containerPort: 8000
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-token
              key: token
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "32Gi"
          requests:
            nvidia.com/gpu: "1"
            memory: "16Gi"
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: hf-model-cache
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
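The Deployment above mounts a hf-model-cache PVC that hasn't been defined yet, and nothing exposes port 8000 inside the cluster. A minimal sketch of both, assuming the same gp3 storage class used earlier (the 7B distill weights are roughly 15GB, so leave headroom):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hf-model-cache
  namespace: ai
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi # Hugging Face cache; ~15GB for the 7B distill plus headroom
  storageClassName: gp3
---
apiVersion: v1
kind: Service
metadata:
  name: deepseek-vllm
  namespace: ai
spec:
  selector:
    app: deepseek-vllm
  ports:
  - port: 8000
    targetPort: 8000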
Create HuggingFace token secret:
kubectl create secret generic huggingface-token \
  --from-literal=token=hf_your_token_here \
  -n ai
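Once the secret exists and the pod is running (the first start downloads the model weights from Hugging Face, which takes a while), you can smoke-test the endpoint through a port-forward. The Service name matches the sketch above, and vLLM serves the model under the exact name passed to --model:
kubectl port-forward svc/deepseek-vllm -n ai 8000:8000
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'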
Step 5: Add Ingress with Auth
Don't expose your LLM API publicly without authentication:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: deepseek-ingress
  namespace: ai
  annotations:
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: ollama-basic-auth
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
  - host: llm.your-domain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ollama
            port:
              number: 11434
# Create basic auth
htpasswd -c auth admin
kubectl create secret generic ollama-basic-auth \
  --from-file=auth -n ai
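With DNS for llm.your-domain.com pointed at your ingress controller, you can verify the API now requires credentials (this assumes a recent Ollama version, which lists models at /v1/models; curl prompts for the password you set with htpasswd):
# Without -u this should return 401
curl -u admin http://llm.your-domain.com/v1/models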
GPU Cost Optimization
Self-hosting LLMs can get expensive. A few ways to keep costs down:
# Use Spot instances for non-critical inference (EKS)
# Add a node group with GPU Spot instances
eksctl create nodegroup \
  --cluster my-cluster \
  --name gpu-spot \
  --instance-types p3.2xlarge,g4dn.xlarge \
  --spot \
  --nodes-min 0 \
  --nodes-max 3
# Scale to zero when not in use with KEDA, scaling up on request volume
# (see the ScaledObject sketch after the list below)
Cheaper alternatives for testing:
- deepseek-r1:1.5b runs on CPU and is decent for testing
- RunPod or Lambda Labs for on-demand GPU inference
- Groq API for hosted DeepSeek R1 inference (not self-hosted, but very fast)
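For the scale-to-zero idea mentioned above, here is a minimal KEDA sketch that scales the Ollama Deployment on ingress request rate. It assumes KEDA and Prometheus are installed and that NGINX Ingress Controller metrics are being scraped; the query and its labels are examples, so point it at whatever request metric you actually collect. Keep in mind that after scaling to zero, the first request waits for a pod to start and the model to load:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ollama-scaler
  namespace: ai
spec:
  scaleTargetRef:
    name: ollama # the Deployment from Step 2
  minReplicaCount: 0 # scale to zero when idle
  maxReplicaCount: 2
  cooldownPeriod: 600 # wait 10 minutes of no traffic before scaling down
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      # requests per second hitting the LLM ingress (example metric and labels)
      query: sum(rate(nginx_ingress_controller_requests{exported_namespace="ai"}[2m]))
      threshold: "1"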
DeepSeek R1 model sizes and requirements:
| Model | VRAM | Storage |
|---|---|---|
| R1 1.5B | 2GB | 1GB |
| R1 7B | 8GB | 4.7GB |
| R1 8B | 10GB | 5GB |
| R1 14B | 14GB | 9GB |
| R1 32B | 32GB | 20GB |
| R1 70B | 80GB | 43GB |