
Run DeepSeek R1 on Kubernetes — Self-Hosted LLM Guide (2026)

Deploy DeepSeek R1 on your own Kubernetes cluster using Ollama or vLLM. Includes GPU node setup, Helm deployment, persistent model storage, and an OpenAI-compatible API.

DevOpsBoys · Apr 29, 2026 · 4 min read

DeepSeek R1 is one of the best open-source reasoning models available. Here's how to self-host it on Kubernetes — full GPU setup, persistent storage, and an OpenAI-compatible API endpoint.


Prerequisites

  • Kubernetes cluster with GPU nodes (NVIDIA A10G, A100, or H100 for 70B; A10G works for 7B/8B)
  • NVIDIA GPU Operator installed
  • At least 8GB of VRAM for DeepSeek R1 7B (16GB gives comfortable headroom) and 80GB for 70B (see the sizing table at the end)

Step 1: Install NVIDIA GPU Operator

bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
 
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true
 
# Verify GPU nodes are ready
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe node <gpu-node> | grep nvidia
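
To confirm the driver and device plugin work end to end, run a throwaway pod that requests one GPU and runs nvidia-smi (the CUDA image tag below is just an example; any recent base tag works):

bash
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"
EOF
 
# Once the pod completes, the logs should list the GPU and driver version
kubectl logs pod/gpu-smoke-test
kubectl delete pod gpu-smoke-test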

Step 2: Deploy DeepSeek R1 with Ollama

Ollama is the simplest way to run DeepSeek R1 on Kubernetes.

PersistentVolumeClaim for model storage:

yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: ai
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi  # DeepSeek R1 7B needs ~4.7GB, increase for larger models
  storageClassName: gp3  # Use your storage class

Ollama Deployment:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        env:
        - name: OLLAMA_MODELS
          value: /models
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: models
          mountPath: /models
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: ollama-models
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434

Pull DeepSeek R1:

bash
# Create the namespace used throughout this guide
kubectl create namespace ai
kubectl apply -f ollama.yaml
 
# Wait for pod to be ready
kubectl rollout status deployment/ollama -n ai
 
# Pull the model (run inside the pod)
kubectl exec -it deployment/ollama -n ai -- ollama pull deepseek-r1:7b
 
# For the 70B model (needs 80GB+ VRAM):
# kubectl exec -it deployment/ollama -n ai -- ollama pull deepseek-r1:70b

Test it:

bash
kubectl exec -it deployment/ollama -n ai -- \
  ollama run deepseek-r1:7b "Explain Kubernetes resource limits in simple terms"

Step 3: OpenAI-Compatible API

Ollama exposes an OpenAI-compatible API at /v1. Add an Ingress or use port-forward:

bash
kubectl port-forward svc/ollama -n ai 11434:11434
 
# Test OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1:7b",
    "messages": [
      {"role": "user", "content": "Write a Kubernetes readiness probe for a Python Flask app"}
    ]
  }'

Point any OpenAI SDK to http://ollama.ai.svc.cluster.local:11434/v1 for cluster-internal access.
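
Most clients that honor the standard OpenAI environment variables (the official Python SDK does, for example) need nothing more than a base URL override; Ollama ignores the API key, but the SDKs insist on one being set:

bash
export OPENAI_BASE_URL=http://ollama.ai.svc.cluster.local:11434/v1
export OPENAI_API_KEY=ollama   # any non-empty value works; Ollama never checks it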


Step 4: Deploy with vLLM (Better for Production)

vLLM gives better throughput and supports continuous batching:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-vllm
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-vllm
  template:
    metadata:
      labels:
        app: deepseek-vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - --model
          - deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
          - --tensor-parallel-size
          - "1"
          - --gpu-memory-utilization
          - "0.9"
          - --max-model-len
          - "8192"
        ports:
        - containerPort: 8000
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-token
              key: token
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "32Gi"
          requests:
            nvidia.com/gpu: "1"
            memory: "16Gi"
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: hf-model-cache
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
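
The Deployment references an hf-model-cache PVC, which you create the same way as ollama-models earlier. To reach vLLM from other workloads, put a Service in front of it (the names below simply mirror the Deployment):

yaml
apiVersion: v1
kind: Service
metadata:
  name: deepseek-vllm
  namespace: ai
spec:
  selector:
    app: deepseek-vllm
  ports:
  - port: 8000
    targetPort: 8000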

Create HuggingFace token secret:

bash
kubectl create secret generic huggingface-token \
  --from-literal=token=hf_your_token_here \
  -n ai
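
Once the pod has downloaded the weights and is ready, vLLM serves the same OpenAI-compatible API on port 8000; the model name in requests must match the --model argument above:

bash
kubectl port-forward deployment/deepseek-vllm -n ai 8000:8000
 
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    "messages": [
      {"role": "user", "content": "Summarize what a Kubernetes PodDisruptionBudget does"}
    ]
  }'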

Step 5: Add Ingress with Auth

Don't expose your LLM API publicly without authentication:

yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: deepseek-ingress
  namespace: ai
  annotations:
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: ollama-basic-auth
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
  - host: llm.your-domain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ollama
            port:
              number: 11434

Create the basic-auth secret referenced by the Ingress:

bash
# Create basic auth
htpasswd -c auth admin
kubectl create secret generic ollama-basic-auth \
  --from-file=auth -n ai
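
Assuming llm.your-domain.com points at your ingress controller, a quick check that the auth is actually enforced (add TLS before exposing this beyond a lab):

bash
# Unauthenticated requests should be rejected with 401
curl -i http://llm.your-domain.com/api/tags
 
# Authenticated requests reach Ollama
curl -u admin:<password> http://llm.your-domain.com/api/tags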

GPU Cost Optimization

Self-hosting LLMs can get expensive. Reduce costs:

bash
# Use Spot instances for non-critical inference (EKS)
# Add a node group with GPU Spot instances
eksctl create nodegroup \
  --cluster my-cluster \
  --name gpu-spot \
  --instance-types p3.2xlarge,g4dn.xlarge \
  --spot \
  --nodes-min 0 \
  --nodes-max 3
 
# Scale to zero when not in use with KEDA
# Scale up based on HTTP request queue
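
A sketch of the scale-to-zero idea with KEDA, assuming a Prometheus instance is scraping the NGINX ingress controller; the server address, query, and threshold are illustrative and need tuning for your setup:

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ollama-scaler
  namespace: ai
spec:
  scaleTargetRef:
    name: ollama
  minReplicaCount: 0      # scale to zero when idle
  maxReplicaCount: 2
  cooldownPeriod: 600     # keep the pod (and its loaded model) around for 10 minutes
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-operated.monitoring.svc:9090
      query: sum(rate(nginx_ingress_controller_requests{ingress="deepseek-ingress"}[2m]))
      threshold: "1"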

Cheaper alternatives for testing:

  • deepseek-r1:1.5b — runs on CPU, decent for testing (quick local test below)
  • RunPod or Lambda Labs for on-demand GPU inference
  • Groq API for blazing fast DeepSeek R1 inference (not self-hosted but very fast)
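
For the CPU route, you do not need a cluster at all; plain Ollama on a laptop handles the 1.5B model:

bash
ollama pull deepseek-r1:1.5b
ollama run deepseek-r1:1.5b "Write a one-line kubectl command that lists pods sorted by restart count"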

DeepSeek R1 model sizes and requirements:

| Model   | VRAM | Storage |
|---------|------|---------|
| R1 1.5B | 2GB  | 1GB     |
| R1 7B   | 8GB  | 4.7GB   |
| R1 8B   | 10GB | 5GB     |
| R1 14B  | 14GB | 9GB     |
| R1 32B  | 32GB | 20GB    |
| R1 70B  | 80GB | 43GB    |