
How to Set Up NVIDIA GPU Operator on Kubernetes for AI Workloads (2026)

Running AI/ML workloads on Kubernetes requires GPUs. The NVIDIA GPU Operator automates everything — driver installation, container toolkit, device plugin, monitoring. Here's the complete setup guide.

DevOpsBoys | May 21, 2026 | 5 min read

Running LLMs, training jobs, or inference servers on Kubernetes requires GPUs. But getting GPU access inside containers is not trivial — you need drivers, the container toolkit, device plugins, and monitoring all configured correctly.

The NVIDIA GPU Operator automates all of this. Instead of configuring each component manually on every node, you install one Helm chart and the operator handles the rest.


What the GPU Operator Installs

The GPU Operator manages these components as DaemonSets:

Component               Purpose
NVIDIA Driver           GPU driver on each node
Container Toolkit       Allows containers to access GPUs
Device Plugin           Exposes nvidia.com/gpu resource to Kubernetes
DCGM Exporter           GPU metrics for Prometheus
MIG Manager             Multi-Instance GPU partitioning
Node Feature Discovery  Labels nodes with GPU capabilities

Prerequisites

  • Kubernetes cluster with GPU nodes (NVIDIA A100, H100, T4, RTX series)
  • Helm 3.x installed
  • Nodes running Ubuntu 20.04/22.04 or RHEL 8/9
  • No pre-installed GPU drivers on nodes (the GPU Operator installs them; if drivers are already present, use the driver.enabled=false variant in Step 2)

Check GPU nodes:

bash
kubectl get nodes -o wide
lspci | grep -i nvidia  # On the node itself

Step 1: Install Node Feature Discovery

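NFD detects hardware features and labels nodes. The GPU Operator chart can also deploy NFD itself (nfd.enabled=true by default); if you install it separately as shown here, pass --set nfd.enabled=false to the operator chart in Step 2 so that only one NFD instance runs: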

bash
helm repo add nfd https://kubernetes-sigs.github.io/node-feature-discovery/charts
helm repo update
 
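# --set-json (Helm 3.10+) passes the list intact; with plain --set the shell strips the quotes
helm install nfd nfd/node-feature-discovery \
  --namespace node-feature-discovery \
  --create-namespace \
  --set-json 'worker.config.sources.pci.deviceClassWhitelist=["02","03","0200","0207"]'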

Step 2: Install the GPU Operator

bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
 
# nfd.enabled=false: NFD is already installed in Step 1, so skip the chart's bundled copy
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set nfd.enabled=false \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=false  # Enable if using MIG GPUs (A100/H100)
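
The same settings can also live in a values file, so later helm upgrade runs do not silently drop flags. A minimal sketch, assuming the file name gpu-operator-values.yaml (an arbitrary choice) and that NFD was installed separately in Step 1:

```bash
# Write the Step 2 settings to a values file (the file name is an assumption).
cat > gpu-operator-values.yaml <<'EOF'
driver:
  enabled: true
toolkit:
  enabled: true
devicePlugin:
  enabled: true
dcgmExporter:
  enabled: true
migManager:
  enabled: false
# NFD already installed in Step 1
nfd:
  enabled: false
EOF

# Install or upgrade from the same file (uncomment to run against a cluster):
# helm upgrade --install gpu-operator nvidia/gpu-operator \
#   --namespace gpu-operator --create-namespace \
#   -f gpu-operator-values.yaml
```

Keeping the file in version control gives you one place to change a setting such as migManager.enabled before re-running the upgrade.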

For pre-installed drivers (nodes already have NVIDIA drivers):

bash
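# driver.enabled=false: don't reinstall drivers that are already on the nodes
# nfd.enabled=false: NFD is already installed in Step 1
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set nfd.enabled=false \
  --set driver.enabled=false \
  --set toolkit.enabled=true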

Step 3: Verify Installation

bash
# Watch pods come up (takes 3-5 minutes for driver installation)
kubectl get pods -n gpu-operator -w
 
# Check node GPU labels
kubectl describe node <gpu-node> | grep nvidia
 
# Should see labels like:
# nvidia.com/gpu.present=true
# nvidia.com/gpu.product=Tesla-T4
# nvidia.com/gpu.memory=15360   (value is in MiB)
# nvidia.com/gpu.count=1
 
# Check GPU resource is available
kubectl get node <gpu-node> -o json | jq '.status.capacity | to_entries | .[] | select(.key | startswith("nvidia"))'

Expected output:

json
{
  "key": "nvidia.com/gpu",
  "value": "1"
}
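
On a cluster with several GPU nodes, a cluster-wide total is often more useful than checking nodes one by one. A small sketch, assuming kubectl is pointed at the cluster; the function name total_gpus is made up for this example:

```bash
# Sum allocatable nvidia.com/gpu across all nodes.
# Nodes without GPUs contribute an empty line, which awk counts as 0.
total_gpus() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' \
    | awk '{ sum += $1 } END { print sum + 0 }'
}
```

Run total_gpus after the operator pods are up. A result of 0 means the device plugin has not registered any GPUs yet.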

Step 4: Run a Test GPU Workload

yaml
# gpu-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-test
      image: nvidia/cuda:12.3.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
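          nvidia.com/gpu: 1  # Request 1 GPU

bash
kubectl apply -f gpu-test.yaml
# Wait for the pod to finish before reading its logs
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-test --timeout=300s
kubectl logs gpu-test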

Expected output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08    Driver Version: 545.23.08    CUDA Version: 12.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |

Step 5: Deploy a Real AI Workload

Ollama (LLM Inference)

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 16Gi
            requests:
              memory: 8Gi
              cpu: 2
          volumeMounts:
            - name: ollama-data
              mountPath: /root/.ollama
      volumes:
        - name: ollama-data
          persistentVolumeClaim:
            claimName: ollama-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434

vLLM (High Performance Inference)

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm  # Must match spec.selector.matchLabels
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --model
            - mistralai/Mistral-7B-Instruct-v0.3
            - --gpu-memory-utilization
            - "0.90"
            - --max-model-len
            - "4096"
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 24Gi
          ports:
            - containerPort: 8000
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
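
Both manifests reference objects this guide does not create: the ai namespace, the ollama-pvc claim, and the hf-token Secret. A minimal sketch of creating them, assuming a 50Gi claim on the cluster's default StorageClass and a Hugging Face token you supply. The function name create_ai_prereqs and both of those values are illustrative, not part of the original setup:

```bash
# Create the objects the Ollama and vLLM manifests depend on.
# Argument 1: Hugging Face access token (supplied by you).
create_ai_prereqs() {
  local hf_token="$1"

  # Namespace used by both Deployments and the Service
  kubectl create namespace ai

  # Secret read by the vLLM container as HUGGING_FACE_HUB_TOKEN
  kubectl create secret generic hf-token \
    --namespace ai \
    --from-literal=token="$hf_token"

  # Claim mounted at /root/.ollama; 50Gi is an assumed size for model files
  kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: ai
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi
EOF
}
```

Call it once before applying the Deployments, for example create_ai_prereqs "$HF_TOKEN" with the token exported in your shell.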

Step 6: GPU Monitoring with Prometheus

The DCGM Exporter (installed by GPU Operator) exposes GPU metrics. If the Prometheus Operator is running in the cluster, add a ServiceMonitor so Prometheus scrapes them:

yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter  # Label on the Service the GPU Operator creates
  endpoints:
    - port: gpu-metrics
      interval: 15s

Key GPU metrics available:

promql
# GPU utilization (%)
DCGM_FI_DEV_GPU_UTIL
 
# GPU framebuffer memory used (MiB)
DCGM_FI_DEV_FB_USED
 
# GPU temperature (°C)
DCGM_FI_DEV_GPU_TEMP
 
# Power draw (W)
DCGM_FI_DEV_POWER_USAGE
 
# Memory bandwidth utilization (%)
DCGM_FI_DEV_MEM_COPY_UTIL

Import NVIDIA DCGM Exporter Dashboard in Grafana (Dashboard ID: 12239).


Multi-GPU and MIG (A100/H100)

For NVIDIA A100 or H100 GPUs, you can partition one GPU into multiple smaller GPU instances using MIG (Multi-Instance GPU):

bash
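# Enable MIG Manager and expose each MIG profile as its own resource.
# --reuse-values keeps the settings from the original install.
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --reuse-values \
  --set migManager.enabled=true \
  --set mig.strategy=mixed
 
# Label the node with the MIG layout to apply (1g.10gb slices on an 80GB GPU)
kubectl label node <gpu-node> nvidia.com/mig.config=all-1g.10gb --overwrite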

MIG partitions a single A100 80GB into up to 7 instances of 1g.10gb each — run 7 separate inference workloads on one GPU.
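
Repartitioning is not instant: MIG Manager reports progress in the node label nvidia.com/mig.config.state. A small polling sketch, assuming that label moves to success or failed once the change is done; the function name wait_for_mig and the MIG_POLL_SECONDS variable are invented for this example:

```bash
# Poll a node until MIG Manager reports the requested layout as applied.
# Argument 1: node name. Argument 2: number of attempts (default 30).
wait_for_mig() {
  local node="$1" tries="${2:-30}" state
  for _ in $(seq "$tries"); do
    state="$(kubectl get node "$node" \
      -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}')"
    case "$state" in
      success) echo "MIG config applied on $node"; return 0 ;;
      failed)  echo "MIG config failed on $node" >&2; return 1 ;;
    esac
    # Any other state (for example pending) means the change is still in progress
    sleep "${MIG_POLL_SECONDS:-10}"
  done
  echo "timed out waiting for MIG config on $node (last state: ${state:-unknown})" >&2
  return 1
}
```

Run wait_for_mig <gpu-node> right after labeling the node, and schedule MIG workloads only once it returns successfully.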

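Request a MIG slice (the nvidia.com/mig-1g.10gb resource name is exposed under the mixed MIG strategy; under the default single strategy, slices are advertised as nvidia.com/gpu):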

yaml
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1  # Request 1 MIG slice

Time-Slicing (Sharing GPUs)

For smaller workloads (dev, testing), enable GPU time-slicing to share one physical GPU among multiple pods:

yaml
# time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
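          replicas: 4  # 4 pods share 1 GPU

bash
kubectl apply -f time-slicing-config.yaml
 
# config.default=any applies the "any" key of the ConfigMap to every GPU node;
# --reuse-values keeps the settings from the original install
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --reuse-values \
  --set devicePlugin.config.name=time-slicing-config \
  --set devicePlugin.config.default=any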

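Each pod still requests nvidia.com/gpu: 1, but up to four such pods are scheduled onto the same physical GPU. Time-slicing provides no memory or fault isolation between them, so one pod can exhaust GPU memory for the others.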
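
With time-slicing on, a node advertises its physical GPU count multiplied by the replicas value, which is worth keeping in mind when reading node capacity. A tiny arithmetic sketch; the function name advertised_gpus is made up for illustration:

```bash
# Advertised nvidia.com/gpu count = physical GPUs x time-slicing replicas.
advertised_gpus() {
  local physical="$1" replicas="$2"
  echo $(( physical * replicas ))
}

advertised_gpus 1 4   # single-GPU node with replicas: 4 -> prints 4
advertised_gpus 8 4   # 8-GPU node with replicas: 4     -> prints 32
```

The higher number only affects scheduling. The pods still share the same physical compute and memory.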


Common Issues

Driver installation stuck:

bash
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset
# Often a kernel version mismatch — check node kernel vs supported driver versions

GPU not visible in pod:

bash
# Check toolkit is running
kubectl get pod -n gpu-operator | grep toolkit
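# Check the node's container runtime (the field is "Container Runtime Version")
kubectl describe node <gpu-node> | grep -i "container runtime"
# Shows e.g. containerd://1.7.x; it does not list the NVIDIA runtime itself
# Check the RuntimeClass the operator creates for the NVIDIA runtime
kubectl get runtimeclass nvidia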

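OOMKilled or shared-memory errors in GPU workload:

yaml
# /dev/shm defaults to 64Mi in containers, which is too small for PyTorch
# DataLoader workers and NCCL. Mount a memory-backed volume there. Its usage
# counts toward the container's memory limit, so raise that limit as well
# if the pod is being OOMKilled.
# Pod level (spec.volumes):
volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 4Gi
# Container level (spec.containers[].volumeMounts):
volumeMounts:
  - name: dshm
    mountPath: /dev/shm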

The GPU Operator makes running GPU workloads on Kubernetes significantly simpler — what used to require manual driver management on every node is now automated. Combine it with Karpenter for automatic GPU node provisioning and you have a fully automated AI compute platform.
