Run DeepSeek R1 on Kubernetes — Self-Hosted LLM Guide (2026)
Deploy DeepSeek R1 on your own Kubernetes cluster using Ollama or vLLM. Includes GPU node setup, Helm deployment, persistent model storage, and an OpenAI-compatible API.
DeepSeek R1 is one of the best open-source reasoning models available. Here's how to self-host it on Kubernetes — full GPU setup, persistent storage, and an OpenAI-compatible API endpoint.
Prerequisites
- Kubernetes cluster with NVIDIA GPU nodes (an A10G handles the 7B/8B models; A100 or H100 class GPUs are needed for 70B)
- NVIDIA GPU Operator (installed in Step 1 if you don't have it yet)
- Enough VRAM for your target model: roughly 8GB for DeepSeek R1 7B, 80GB for 70B (see the sizing table at the end)
Step 1: Install NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true
# Verify GPU nodes are ready (GPU Feature Discovery labels them nvidia.com/gpu.present=true)
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe node <gpu-node> | grep nvidia
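Before deploying anything, it's worth confirming the driver and container toolkit work end to end. A quick throwaway pod that requests one GPU and prints nvidia-smi output (the CUDA image tag is only an example; use any CUDA base image your cluster can pull):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.4.1-base-ubuntu22.04 # example tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
Apply it and check the logs:
kubectl apply -f gpu-smoke-test.yaml
kubectl logs gpu-smoke-test # should print the GPU table once the pod completes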
Step 2: Deploy DeepSeek R1 with Ollama
Ollama is the simplest way to run DeepSeek R1 on Kubernetes.
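All of the manifests in this guide target an ai namespace, which isn't created anywhere above; create it before applying anything (assuming you don't already have a namespace you'd rather use):
kubectl create namespace ai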
PersistentVolumeClaim for model storage:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: ai
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi # DeepSeek R1 7B needs ~4.7GB; increase for larger models
  storageClassName: gp3 # Use your storage class
Ollama Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        env:
        - name: OLLAMA_MODELS
          value: /models
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: models
          mountPath: /models
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: ollama-models
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
Pull DeepSeek R1:
kubectl apply -f ollama.yaml
# Wait for the pod to be ready
kubectl rollout status deployment/ollama -n ai
# Pull the model (runs inside the pod)
kubectl exec -it deployment/ollama -n ai -- ollama pull deepseek-r1:7b
# For the 70B model (needs 80GB+ VRAM):
# kubectl exec -it deployment/ollama -n ai -- ollama pull deepseek-r1:70b
Test it:
kubectl exec -it deployment/ollama -n ai -- \
  ollama run deepseek-r1:7b "Explain Kubernetes resource limits in simple terms"
Step 3: OpenAI-Compatible API
Ollama exposes an OpenAI-compatible API at /v1. Add an Ingress or use port-forward:
kubectl port-forward svc/ollama -n ai 11434:11434
# Test OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1:7b",
    "messages": [
      {"role": "user", "content": "Write a Kubernetes readiness probe for a Python Flask app"}
    ]
  }'
Point any OpenAI SDK to http://ollama.ai.svc.cluster.local:11434/v1 for cluster-internal access.
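For example, with the official openai Python package (v1+) from another pod in the cluster; the model name matches what was pulled above, and the api_key value is arbitrary because Ollama ignores it:
from openai import OpenAI

# Point the client at the in-cluster Ollama Service instead of api.openai.com
client = OpenAI(
    base_url="http://ollama.ai.svc.cluster.local:11434/v1",
    api_key="ollama",  # required by the SDK, ignored by Ollama
)

response = client.chat.completions.create(
    model="deepseek-r1:7b",
    messages=[{"role": "user", "content": "Summarize what a Kubernetes PVC does."}],
)
print(response.choices[0].message.content)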
Step 4: Deploy with vLLM (Better for Production)
vLLM gives better throughput and supports continuous batching:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-vllm
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-vllm
  template:
    metadata:
      labels:
        app: deepseek-vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model
        - deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
        - --tensor-parallel-size
        - "1"
        - --gpu-memory-utilization
        - "0.9"
        - --max-model-len
        - "8192"
        ports:
        - containerPort: 8000
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-token
              key: token
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "32Gi"
          requests:
            nvidia.com/gpu: "1"
            memory: "16Gi"
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: hf-model-cache
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
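The Deployment above mounts a hf-model-cache PVC that hasn't been defined yet, and nothing exposes port 8000 inside the cluster. A minimal sketch of both, assuming the same gp3 storage class used earlier (the 7B distill weights are roughly 15GB, so leave headroom):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hf-model-cache
  namespace: ai
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi # Hugging Face cache; ~15GB for the 7B distill plus headroom
  storageClassName: gp3
---
apiVersion: v1
kind: Service
metadata:
  name: deepseek-vllm
  namespace: ai
spec:
  selector:
    app: deepseek-vllm
  ports:
  - port: 8000
    targetPort: 8000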
Create HuggingFace token secret:
kubectl create secret generic huggingface-token \
  --from-literal=token=hf_your_token_here \
  -n ai
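Once the secret exists and the pod is running (the first start downloads the model weights from Hugging Face, which takes a while), you can smoke-test the endpoint through a port-forward. The Service name matches the sketch above, and vLLM serves the model under the exact name passed to --model:
kubectl port-forward svc/deepseek-vllm -n ai 8000:8000
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'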
Step 5: Add Ingress with Auth
Don't expose your LLM API publicly without authentication:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: deepseek-ingress
  namespace: ai
  annotations:
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: ollama-basic-auth
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
  - host: llm.your-domain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ollama
            port:
              number: 11434
# Create basic auth
htpasswd -c auth admin
kubectl create secret generic ollama-basic-auth \
  --from-file=auth -n ai
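With DNS for llm.your-domain.com pointed at your ingress controller, you can verify the API now requires credentials (this assumes a recent Ollama version, which lists models at /v1/models; curl prompts for the password you set with htpasswd):
# Without -u this should return 401
curl -u admin http://llm.your-domain.com/v1/models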
GPU Cost Optimization
Self-hosting LLMs can get expensive. A few ways to keep costs down:
# Use Spot instances for non-critical inference (EKS)
# Add a node group with GPU Spot instances
eksctl create nodegroup \
  --cluster my-cluster \
  --name gpu-spot \
  --instance-types p3.2xlarge,g4dn.xlarge \
  --spot \
  --nodes-min 0 \
  --nodes-max 3
# Scale to zero when not in use with KEDA, scaling up on request volume
# (see the ScaledObject sketch after the list below)
Cheaper alternatives for testing:
- deepseek-r1:1.5b runs on CPU and is decent for testing
- RunPod or Lambda Labs for on-demand GPU inference
- Groq API for hosted DeepSeek R1 inference (not self-hosted, but very fast)
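For the scale-to-zero idea mentioned above, here is a minimal KEDA sketch that scales the Ollama Deployment on ingress request rate. It assumes KEDA and Prometheus are installed and that NGINX Ingress Controller metrics are being scraped; the query and its labels are examples, so point it at whatever request metric you actually collect. Keep in mind that after scaling to zero, the first request waits for a pod to start and the model to load:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ollama-scaler
  namespace: ai
spec:
  scaleTargetRef:
    name: ollama # the Deployment from Step 2
  minReplicaCount: 0 # scale to zero when idle
  maxReplicaCount: 2
  cooldownPeriod: 600 # wait 10 minutes of no traffic before scaling down
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      # requests per second hitting the LLM ingress (example metric and labels)
      query: sum(rate(nginx_ingress_controller_requests{exported_namespace="ai"}[2m]))
      threshold: "1"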
DeepSeek R1 model sizes and requirements:
| Model | VRAM | Storage |
|---|---|---|
| R1 1.5B | 2GB | 1GB |
| R1 7B | 8GB | 4.7GB |
| R1 8B | 10GB | 5GB |
| R1 14B | 14GB | 9GB |
| R1 32B | 32GB | 20GB |
| R1 70B | 80GB | 43GB |