How to Set Up NVIDIA GPU Operator on Kubernetes for AI Workloads (2026)
Running AI/ML workloads on Kubernetes requires GPUs. The NVIDIA GPU Operator automates everything — driver installation, container toolkit, device plugin, monitoring. Here's the complete setup guide.
Running LLMs, training jobs, or inference servers on Kubernetes requires GPUs. But getting GPU access inside containers is not trivial — you need drivers, the container toolkit, device plugins, and monitoring all configured correctly.
The NVIDIA GPU Operator automates all of this. Instead of configuring each component manually on every node, you install one Helm chart and the operator handles the rest.
What the GPU Operator Installs
The GPU Operator manages these components as DaemonSets:
| Component | Purpose |
|---|---|
| NVIDIA Driver | GPU driver on each node |
| Container Toolkit | Allows containers to access GPUs |
| Device Plugin | Exposes nvidia.com/gpu resource to Kubernetes |
| DCGM Exporter | GPU metrics for Prometheus |
| MIG Manager | Multi-Instance GPU partitioning |
| Node Feature Discovery | Labels nodes with GPU capabilities |
Prerequisites
- Kubernetes cluster with GPU nodes (NVIDIA A100, H100, T4, RTX series)
- Helm 3.x installed
- Nodes running Ubuntu 20.04/22.04 or RHEL 8/9
- No pre-installed GPU drivers on nodes if the GPU Operator is to manage them (for nodes that already have drivers, use the driver.enabled=false variant in Step 2)
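To choose between the two install variants in Step 2, it can help to check whether a node already has the NVIDIA kernel module loaded. The following is a minimal sketch, not part of the GPU Operator: `driver_flag` is a hypothetical helper name, and it assumes that a loaded `nvidia` module in `lsmod` output means a driver is already installed.

```shell
# Hypothetical helper: reads `lsmod` output on stdin and prints the Helm
# value to use for the driver component.
driver_flag() (
  awk '
    $1 == "nvidia" { found = 1 }   # module name is the first column
    END { print (found ? "driver.enabled=false" : "driver.enabled=true") }
  '
)

# Usage on a GPU node (not run here):
#   lsmod | driver_flag
```

The check is deliberately narrow: it looks only at the loaded kernel module, so a driver that is installed but not loaded would still be reported as absent.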
Check GPU nodes:
kubectl get nodes -o wide
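lspci | grep -i nvidia   # Run on the node itself
Step 1: Install Node Feature Discovery
NFD detects hardware features and labels nodes. The GPU Operator chart deploys NFD by default, so install it separately only if you manage NFD yourself, and in that case pass --set nfd.enabled=false in Step 2: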
helm repo add nfd https://kubernetes-sigs.github.io/node-feature-discovery/charts
helm repo update
# --set-json needs Helm 3.10+; the single quotes keep the shell from mangling the JSON list
helm install nfd nfd/node-feature-discovery \
  --namespace node-feature-discovery \
  --create-namespace \
  --set-json 'worker.config.sources.pci.deviceClassWhitelist=["02","03","0200","0207"]'
Step 2: Install the GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=false   # Enable if using MIG GPUs (A100)
For pre-installed drivers (nodes already have NVIDIA drivers):
# driver.enabled=false: don't reinstall drivers that are already on the node
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=true
Step 3: Verify Installation
# Watch pods come up (takes 3-5 minutes for driver installation)
kubectl get pods -n gpu-operator -w
# Check node GPU labels
kubectl describe node <gpu-node> | grep nvidia
# Should see labels like:
# nvidia.com/gpu.present=true
# nvidia.com/gpu.product=Tesla-T4
# nvidia.com/gpu.memory=15360Mi
# nvidia.com/gpu.count=1
# Check GPU resource is available
kubectl get node <gpu-node> -o json | jq '.status.capacity | to_entries | .[] | select(.key | startswith("nvidia"))'
Expected output:
{"key": "nvidia.com/gpu", "value": "1"}
Step 4: Run a Test GPU Workload
# gpu-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-test
      image: nvidia/cuda:12.3.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # Request 1 GPU
kubectl apply -f gpu-test.yaml
kubectl logs gpu-test
Expected output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
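To check the test result from a script rather than by eye, one option is to pull the driver version out of the `nvidia-smi` banner. This is a sketch under assumptions: `driver_version` is a hypothetical helper name, and it relies on the banner containing a `Driver Version:` field as in the output above.

```shell
# Hypothetical helper: extracts the driver version from nvidia-smi's banner
# read on stdin; prints nothing if no banner line is found.
driver_version() (
  sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
)

# Usage against the test pod (not run here; the jsonpath form of
# `kubectl wait` needs kubectl 1.23 or newer):
#   kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-test --timeout=180s
#   kubectl logs gpu-test | driver_version
```

An empty result usually means the pod never got a GPU, which points back to the device plugin and toolkit checks in Step 3.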
Step 5: Deploy a Real AI Workload
Ollama (LLM Inference)
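The manifest that follows references a namespace `ai` and a PersistentVolumeClaim `ollama-pvc` that are not defined elsewhere in this guide. A minimal sketch for creating them is shown here; the 50Gi size and the reliance on the cluster's default StorageClass are assumptions, and `render_pvc` is a hypothetical helper name.

```shell
# Hypothetical helper: prints a PVC manifest for the given name, namespace
# and size. No storageClassName is set, so the default StorageClass is used.
render_pvc() (
  name=$1
  namespace=$2
  size=$3
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${name}
  namespace: ${namespace}
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: ${size}
EOF
)

# Usage (not run here; 50Gi is an assumed size):
#   kubectl create namespace ai
#   render_pvc ollama-pvc ai 50Gi | kubectl apply -f -
```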
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 16Gi
            requests:
              memory: 8Gi
              cpu: 2
          volumeMounts:
            - name: ollama-data
              mountPath: /root/.ollama
      volumes:
        - name: ollama-data
          persistentVolumeClaim:
            claimName: ollama-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
vLLM (High Performance Inference)
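The Deployment that follows reads a Hugging Face token from a Secret named `hf-token` (key `token`) in the `ai` namespace. One way to create it is sketched here; `create_hf_secret` and the `KUBECTL` override variable are hypothetical names introduced for illustration, and the token is assumed to be in the `HF_TOKEN` environment variable.

```shell
# Hypothetical helper: creates the hf-token Secret that the Deployment reads.
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the command.
create_hf_secret() (
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set" >&2
    exit 1
  fi
  "${KUBECTL:-kubectl}" create secret generic hf-token \
    --namespace ai \
    --from-literal=token="$HF_TOKEN"
)

# Usage (not run here):
#   HF_TOKEN=<your-token> create_hf_secret
```

Running it once with `KUBECTL=echo` prints the command instead of executing it, which is a cheap way to confirm the namespace and key name before touching the cluster.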
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm   # Must match spec.selector.matchLabels
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --model
            - mistralai/Mistral-7B-Instruct-v0.3
            - --gpu-memory-utilization
            - "0.90"
            - --max-model-len
            - "4096"
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 24Gi
          ports:
            - containerPort: 8000
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
Step 6: GPU Monitoring with Prometheus
The DCGM Exporter (installed by GPU Operator) exposes GPU metrics. Add a ServiceMonitor:
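apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      # Label on the exporter Service deployed by the GPU Operator;
      # confirm with: kubectl get svc -n gpu-operator --show-labels
      app: nvidia-dcgm-exporter
  endpoints:
    - port: gpu-metrics
      interval: 15s
Key GPU metrics available: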
# GPU utilization
DCGM_FI_DEV_GPU_UTIL
# GPU memory used
DCGM_FI_DEV_FB_USED
# GPU temperature
DCGM_FI_DEV_GPU_TEMP
# Power usage
DCGM_FI_DEV_POWER_USAGE
# GPU memory bandwidth (memory copy utilization)
DCGM_FI_DEV_MEM_COPY_UTIL
Import the NVIDIA DCGM Exporter Dashboard in Grafana (Dashboard ID: 12239).
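Before wiring up Prometheus, the exporter's output can be spot-checked directly. The sketch below averages `DCGM_FI_DEV_GPU_UTIL` across all GPUs in a scrape; `avg_gpu_util` is a hypothetical helper name, and the Service name `nvidia-dcgm-exporter` and port 9400 in the usage comment are assumptions to verify with `kubectl get svc -n gpu-operator`.

```shell
# Hypothetical helper: averages DCGM_FI_DEV_GPU_UTIL over all samples in
# Prometheus text-format input read on stdin.
avg_gpu_util() (
  awk '
    /^DCGM_FI_DEV_GPU_UTIL[{ ]/ { sum += $NF; n++ }   # sample value is the last field
    END {
      if (n > 0) { printf "%.1f\n", sum / n } else { print "no samples" }
    }
  '
)

# Usage (not run here; Service name and port are assumptions):
#   kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400 &
#   curl -s localhost:9400/metrics | avg_gpu_util
```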
Multi-GPU and MIG (A100/H100)
For NVIDIA A100 or H100 GPUs, you can partition one GPU into multiple smaller GPU instances using MIG (Multi-Instance GPU):
# Enable MIG Manager in GPU Operator; the "mixed" strategy exposes
# per-profile resources such as nvidia.com/mig-1g.10gb
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set migManager.enabled=true \
  --set mig.strategy=mixed \
  --set driver.enabled=true
# Label the node with the MIG layout to apply (use all-1g.5gb on an A100 40GB)
kubectl label node <gpu-node> nvidia.com/mig.config=all-1g.10gb --overwrite
MIG partitions a single A100 80GB into up to 7 instances of 1g.10gb each, so you can run 7 separate inference workloads on one GPU.
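The "up to 7 instances" figure can be reproduced with rough arithmetic from the profile name. The sketch below is an estimate only: `mig_max_instances` is a hypothetical helper, and it assumes 7 compute slices per GPU with memory as the only other limit. The authoritative list of supported layouts is in NVIDIA's MIG documentation.

```shell
# Hypothetical helper: upper bound on instances of one MIG profile per GPU.
# Assumes 7 compute slices per GPU and that memory is the only other limit.
mig_max_instances() (
  profile=$1       # e.g. 1g.10gb
  gpu_mem_gb=$2    # e.g. 80
  compute=${profile%%g.*}   # "1g.10gb" -> "1"
  mem=${profile#*.}         # "1g.10gb" -> "10gb"
  mem=${mem%gb}             # "10gb"    -> "10"
  by_compute=$((7 / compute))
  by_memory=$((gpu_mem_gb / mem))
  if [ "$by_compute" -lt "$by_memory" ]; then
    echo "$by_compute"
  else
    echo "$by_memory"
  fi
)

# Usage:
#   mig_max_instances 1g.10gb 80
```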
Request a MIG slice:
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1   # Request 1 MIG slice
Time-Slicing (Sharing GPUs)
For smaller workloads (dev, testing), enable GPU time-slicing to share one physical GPU among multiple pods:
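# time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # 4 pods share 1 GPU
kubectl apply -f time-slicing-config.yaml
# config.default selects the "any" key of the ConfigMap for all nodes;
# --reuse-values keeps the settings from earlier installs and upgrades
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --reuse-values \
  --set devicePlugin.config.name=time-slicing-config \
  --set devicePlugin.config.default=any
Each pod still requests nvidia.com/gpu: 1, but up to 4 pods share each physical GPU via time-slicing. Time-slicing provides no memory or fault isolation between those pods.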
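Once time-slicing is active, each node should advertise its physical GPU count multiplied by `replicas`. The sketch below computes the expected figure to compare against the node's allocatable resources; `advertised_gpus` is a hypothetical helper name.

```shell
# Hypothetical helper: expected nvidia.com/gpu count a node advertises
# once time-slicing is active (physical GPUs x replicas).
advertised_gpus() (
  physical=$1
  replicas=$2
  echo $((physical * replicas))
)

# Usage (not run here): compare the expected value with what the node reports.
#   advertised_gpus 2 4
#   kubectl get node <gpu-node> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```

If the node still reports the physical count, the device plugin has most likely not picked up the ConfigMap yet.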
Common Issues
Driver installation stuck:
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset
# Often a kernel version mismatch: check the node kernel against the supported driver versions
GPU not visible in pod:
# Check toolkit is running
kubectl get pod -n gpu-operator | grep toolkit
# Check container runtime is configured
kubectl describe node <gpu-node> | grep -i "container runtime"
# Should show containerd; the operator also registers an "nvidia" RuntimeClass
kubectl get runtimeclass nvidia
OOMKilled in GPU workload:
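# Mount a larger /dev/shm (used by PyTorch DataLoader workers and NCCL).
# A memory-backed emptyDir counts toward the container's memory limit,
# so raise that limit as well if the pod is still OOMKilled.
# Under the pod spec:
volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 4Gi
# Under the container:
volumeMounts:
  - name: dshm
    mountPath: /dev/shm
The GPU Operator makes running GPU workloads on Kubernetes significantly simpler: what used to require manual driver management on every node is now automated. Combine it with Karpenter for automatic GPU node provisioning and you have a fully automated AI compute platform.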