Run DeepSeek R1 on Kubernetes — Self-Hosted LLM Guide (2026)
Deploy DeepSeek R1 on your own Kubernetes cluster using Ollama or vLLM. Includes GPU node setup, Helm deployment, persistent model storage, and an OpenAI-compatible API.
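DeepSeek R1 is one of the best open-source reasoning models available. This guide shows how to self-host it on Kubernetes, covering GPU setup, persistent storage, and an OpenAI-compatible API endpoint. The 1.5B to 70B sizes used here are the distilled R1 variants, which fit on a single GPU; the full 671B model is not covered.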
Prerequisites
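- Kubernetes cluster with NVIDIA GPU nodes (A10G is enough for 7B/8B; 70B needs an 80GB A100 or H100)
- kubectl and Helm installed locally
- NVIDIA GPU Operator installed (Step 1 covers this)
- At least 16GB VRAM for DeepSeek R1 7B at FP16 (as vLLM serves it in Step 4), or about 8GB for Ollama's 4-bit build; 80GB for the 4-bit 70B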
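The VRAM figures above follow from simple arithmetic: weight memory is roughly the parameter count times the bits per weight, divided by 8. The helper below is a rough sketch of that estimate. The name estimate_weights_gb is made up for this guide, and the result is a lower bound, because the KV cache, activations, and quantization metadata all add overhead on top.

```shell
# Rough lower bound for model weight memory in GB.
# Usage: estimate_weights_gb <params_in_billions> <bits_per_weight>
# Hypothetical helper for illustration; not part of Ollama or vLLM.
estimate_weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}
```

For example, estimate_weights_gb 7 16 prints 14.0, which is why a 7B model served at FP16 wants a 16GB card, while estimate_weights_gb 7 4 prints 3.5, in line with the few-gigabyte 4-bit builds Ollama pulls by default.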
Step 1: Install NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true
# Verify GPU nodes are ready (the operator labels them nvidia.com/gpu.present=true)
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe node <gpu-node> | grep nvidia

Step 2: Deploy DeepSeek R1 with Ollama
Ollama is the simplest way to run DeepSeek R1 on Kubernetes.
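Every manifest in this guide targets a namespace called ai, which has to exist before anything is applied. A minimal sketch for creating it idempotently follows; the function name ensure_namespace is only a label for this guide, while the kubectl invocations themselves are standard.

```shell
# Create a namespace if it is missing; safe to re-run.
# ensure_namespace is a hypothetical wrapper name.
ensure_namespace() {
  ns="${1:-ai}"
  # Render the Namespace object client-side, then apply it so that an
  # existing namespace is left untouched instead of raising an error.
  kubectl create namespace "$ns" --dry-run=client -o yaml | kubectl apply -f -
}
```

Running ensure_namespace with no arguments creates ai; pass another name if your manifests use a different namespace.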
Persistent Volume for model storage:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: ai
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi # DeepSeek R1 7B needs ~4.7GB, increase for larger models
  storageClassName: gp3 # Use your storage class

Ollama Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          env:
            - name: OLLAMA_MODELS
              value: /models
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: models
              mountPath: /models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434

Pull DeepSeek R1:
# Save the PVC, Deployment, and Service manifests above as ollama.yaml
kubectl apply -f ollama.yaml
# Wait for pod to be ready
kubectl rollout status deployment/ollama -n ai
# Pull the model (run inside the pod)
kubectl exec -it deployment/ollama -n ai -- ollama pull deepseek-r1:7b
# For the 70B model (needs 80GB+ VRAM):
# kubectl exec -it deployment/ollama -n ai -- ollama pull deepseek-r1:70b

Test it:
kubectl exec -it deployment/ollama -n ai -- \
  ollama run deepseek-r1:7b "Explain Kubernetes resource limits in simple terms"

Step 3: OpenAI-Compatible API
Ollama exposes an OpenAI-compatible API at /v1. Add an Ingress or use port-forward:
kubectl port-forward svc/ollama -n ai 11434:11434
# port-forward stays in the foreground; test the OpenAI-compatible endpoint from a second terminal
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1:7b",
    "messages": [
      {"role": "user", "content": "Write a Kubernetes readiness probe for a Python Flask app"}
    ]
  }'

Point any OpenAI SDK to http://ollama.ai.svc.cluster.local:11434/v1 for cluster-internal access.
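Most OpenAI SDKs and tools read their endpoint from environment variables, so pointing them at Ollama is usually a matter of two exports. The sketch below assumes the client honors OPENAI_BASE_URL (the official Python and Node SDKs do) and that a placeholder key is acceptable, since Ollama does not check it. The list_models helper is a hypothetical name used only for a quick smoke test.

```shell
# Cluster-internal endpoint from this guide; use http://localhost:11434/v1
# instead when going through kubectl port-forward.
export OPENAI_BASE_URL="http://ollama.ai.svc.cluster.local:11434/v1"
# Ollama ignores the key, but most SDKs refuse to start without one.
export OPENAI_API_KEY="ollama"

# List the models the server currently has available.
list_models() {
  curl -s "$OPENAI_BASE_URL/models" -H "Authorization: Bearer $OPENAI_API_KEY"
}
```

Calling list_models from a pod inside the cluster should return a JSON list that includes deepseek-r1:7b once the model has been pulled.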
Step 4: Deploy with vLLM (Better for Production)
vLLM delivers higher throughput than Ollama under concurrent load because it batches incoming requests continuously:
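# Cache volume the Deployment mounts as hf-model-cache.
# Size and storage class are assumptions; the FP16 7B weights take roughly 15GB.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hf-model-cache
  namespace: ai
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: gp3 # Use your storage class
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-vllm
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-vllm
  template:
    metadata:
      labels:
        app: deepseek-vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model
            - deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
            - --tensor-parallel-size
            - "1"
            - --gpu-memory-utilization
            - "0.9"
            - --max-model-len
            - "8192"
          ports:
            - containerPort: 8000
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-token
                  key: token
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "32Gi"
            requests:
              nvidia.com/gpu: "1"
              memory: "16Gi"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: hf-model-cache
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
# Service so other pods can reach vLLM at http://deepseek-vllm.ai.svc.cluster.local:8000/v1
apiVersion: v1
kind: Service
metadata:
  name: deepseek-vllm
  namespace: ai
spec:
  selector:
    app: deepseek-vllm
  ports:
    - port: 8000
      targetPort: 8000

Create the HuggingFace token secret (before applying the Deployment):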
kubectl create secret generic huggingface-token \
  --from-literal=token=hf_your_token_here \
  -n ai

Step 5: Add Ingress with Auth
Don't expose your LLM API publicly without authentication:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: deepseek-ingress
  namespace: ai
  annotations:
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: ollama-basic-auth
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
    - host: llm.your-domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ollama
                port:
                  number: 11434

# Create basic auth
htpasswd -c auth admin
kubectl create secret generic ollama-basic-auth \
  --from-file=auth -n ai

GPU Cost Optimization
Self-hosting LLMs can get expensive. Reduce costs:
# Use Spot instances for non-critical inference (EKS)
# Add a node group with GPU Spot instances
eksctl create nodegroup \
  --cluster my-cluster \
  --name gpu-spot \
  --instance-types p3.2xlarge,g4dn.xlarge \
  --spot \
  --nodes-min 0 \
  --nodes-max 3
# Scale to zero when not in use with KEDA
# Scale up based on HTTP request queue

Cheaper alternatives for testing:
- deepseek-r1:1.5b: runs on CPU, decent for testing
- RunPod or Lambda Labs for on-demand GPU inference
- Groq API for very fast hosted DeepSeek R1 inference (not self-hosted)
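Short of a full KEDA setup, the scale-to-zero idea mentioned above can be approximated by hand. Scaling the Deployment to zero releases the GPU, and if a cluster autoscaler is running against a node group with a minimum of 0, the idle node can then be removed. A minimal sketch, with hypothetical function names and the Deployment name and namespace taken from the Ollama manifests in Step 2:

```shell
# Release the GPU by stopping the Ollama pod. Pulled models stay on the PVC.
llm_sleep() {
  kubectl scale deployment/ollama -n ai --replicas=0
}

# Bring the pod back and block until it is ready to serve again.
llm_wake() {
  kubectl scale deployment/ollama -n ai --replicas=1 &&
    kubectl rollout status deployment/ollama -n ai
}
```

Because the models live on the persistent volume, llm_wake does not need to download them again, although a fresh GPU node may take a few minutes to join the cluster.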
DeepSeek R1 model sizes and requirements:
| Model | VRAM | Storage |
|---|---|---|
| R1 1.5B | 2GB | 1GB |
| R1 7B | 8GB | 4.7GB |
| R1 8B | 10GB | 5GB |
| R1 14B | 14GB | 9GB |
| R1 32B | 32GB | 20GB |
| R1 70B | 80GB | 43GB |
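The table can be turned into a quick lookup when sizing a node. The sketch below encodes the VRAM column as thresholds and prints the largest Ollama tag that fits. pick_model is a hypothetical helper, and its thresholds are only as accurate as the table itself.

```shell
# Print the largest deepseek-r1 tag whose VRAM requirement (per the table
# above) fits into the given number of whole gigabytes.
pick_model() {
  vram_gb="$1"
  if   [ "$vram_gb" -ge 80 ]; then echo "deepseek-r1:70b"
  elif [ "$vram_gb" -ge 32 ]; then echo "deepseek-r1:32b"
  elif [ "$vram_gb" -ge 14 ]; then echo "deepseek-r1:14b"
  elif [ "$vram_gb" -ge 10 ]; then echo "deepseek-r1:8b"
  elif [ "$vram_gb" -ge 8 ];  then echo "deepseek-r1:7b"
  elif [ "$vram_gb" -ge 2 ];  then echo "deepseek-r1:1.5b"
  else
    echo "no model fits in ${vram_gb}GB" >&2
    return 1
  fi
}
```

For example, pick_model 24 prints deepseek-r1:14b, so by these numbers a 24GB A10G tops out at the 14B model.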