How to Set Up NVIDIA GPU Operator on Kubernetes for AI Workloads (2026)

Running AI/ML workloads on Kubernetes requires GPUs. The NVIDIA GPU Operator automates everything — driver installation, container toolkit, device plugin, monitoring. Here's the complete setup guide.

Getting GPU access inside containers is not trivial, though. Whether you run LLMs, training jobs, or inference servers, you need drivers, the container toolkit, device plugins, and monitoring all configured correctly on every GPU node.

The NVIDIA GPU Operator automates all of this. Instead of configuring each component manually on every node, you install one Helm chart and the operator handles the rest.

What the GPU Operator Installs

The GPU Operator manages these components as DaemonSets:

Component	Purpose
NVIDIA Driver	GPU driver on each node
Container Toolkit	Allows containers to access GPUs
Device Plugin	Exposes `nvidia.com/gpu` resource to Kubernetes
DCGM Exporter	GPU metrics for Prometheus
MIG Manager	Multi-Instance GPU partitioning
Node Feature Discovery	Labels nodes with GPU capabilities

Prerequisites

Kubernetes cluster with GPU nodes (NVIDIA A100, H100, T4, RTX series)
Helm 3.x installed
Nodes running Ubuntu 20.04/22.04 or RHEL 8/9
No pre-installed GPU drivers on nodes (the GPU Operator installs them; if drivers are already present, use the driver.enabled=false variant in Step 2)

Check GPU nodes:

bash

kubectl get nodes -o wide
lspci | grep -i nvidia  # On the node itself

Step 1: Install Node Feature Discovery

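NFD detects hardware features and labels nodes. The GPU Operator chart can also deploy NFD for you; if you install it separately as shown here, pass --set nfd.enabled=false in Step 2 so it is not deployed twice: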

bash

helm repo add nfd https://kubernetes-sigs.github.io/node-feature-discovery/charts
helm repo update
 
# --set-json needs Helm 3.10+; the single quotes keep the shell from mangling the JSON list
helm install nfd nfd/node-feature-discovery \
  --namespace node-feature-discovery \
  --create-namespace \
  --set-json 'worker.config.sources.pci.deviceClassWhitelist=["02","03","0200","0207"]'

Step 2: Install the GPU Operator

bash

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
 
# nfd.enabled=false because NFD was already installed separately in Step 1
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set nfd.enabled=false \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=false  # Enable if using MIG GPUs (A100/H100)

For pre-installed drivers (nodes already have NVIDIA drivers):

bash

# driver.enabled=false: don't reinstall drivers that are already on the node
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=true

Step 3: Verify Installation

bash

# Watch pods come up (takes 3-5 minutes for driver installation)
kubectl get pods -n gpu-operator -w
 
# Check node GPU labels
kubectl describe node <gpu-node> | grep nvidia
 
# Should see labels like:
# nvidia.com/gpu.present=true
# nvidia.com/gpu.product=Tesla-T4
# nvidia.com/gpu.memory=15360   (value is in MiB, without a unit suffix)
# nvidia.com/gpu.count=1
 
# Check GPU resource is available
kubectl get node <gpu-node> -o json | jq -c '.status.capacity | to_entries | .[] | select(.key | startswith("nvidia"))'

Expected output:

json

{"key": "nvidia.com/gpu", "value": "1"}
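On a cluster with several node pools it can help to see every node's GPU count at once. The sketch below is one way to do that; `gpu_nodes` is a hypothetical helper name, and it assumes the two-column `NAME GPUS` output produced by the kubectl command shown in the comment.

```shell
# Hypothetical helper: keep only nodes that advertise at least one GPU.
# Input: "NAME GPUS" rows (header first), as printed by:
#   kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
# Nodes without GPUs show "<none>" in the second column and are dropped.
gpu_nodes() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1 "\t" $2 }'
}

# Usage (against a live cluster):
#   kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu' | gpu_nodes
```

If a GPU node is missing from the result, the device plugin pod on that node is usually the first place to look.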

Step 4: Run a Test GPU Workload

yaml

# gpu-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-test
      image: nvidia/cuda:12.3.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1  # Request 1 GPU

bash

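kubectl apply -f gpu-test.yaml
# Wait for the pod to finish before reading its logs
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-test --timeout=300s
kubectl logs gpu-test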

Expected output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08    Driver Version: 545.23.08    CUDA Version: 12.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |

Step 5: Deploy a Real AI Workload

Ollama (LLM Inference)

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 16Gi
            requests:
              memory: 8Gi
              cpu: 2
          volumeMounts:
            - name: ollama-data
              mountPath: /root/.ollama
      volumes:
        - name: ollama-data
          persistentVolumeClaim:
            claimName: ollama-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
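The Deployment above mounts a claim named ollama-pvc in the ai namespace, and neither is created by the manifest itself. A minimal sketch for creating them follows. The function name `ollama_pvc_manifest`, the 50Gi default size, the ReadWriteOnce access mode, and the reliance on the cluster's default StorageClass are all assumptions to adjust for your environment.

```shell
# Hypothetical helper: print a PVC manifest for the claim the Deployment mounts.
# The size argument is optional; 50Gi is an assumed default, not a recommendation.
ollama_pvc_manifest() {
  size="${1:-50Gi}"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: ai
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: ${size}
EOF
}

# Usage (against a live cluster):
#   kubectl create namespace ai
#   ollama_pvc_manifest 100Gi | kubectl apply -f -
```

Create the namespace and claim before applying the Deployment, otherwise the pod stays Pending while it waits for the volume.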

vLLM (High Performance Inference)

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm  # Must match spec.selector.matchLabels
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          command:
            - python
            - -m
            - vllm.entrypoints.openai.api_server
            - --model
            - mistralai/Mistral-7B-Instruct-v0.3
            - --gpu-memory-utilization
            - "0.90"
            - --max-model-len
            - "4096"
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 24Gi
          ports:
            - containerPort: 8000
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
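The Deployment reads its Hugging Face token from a Secret named hf-token (key: token) in the ai namespace, which has to exist before the pod starts. The sketch below is one way to create it. The `create_hf_secret` function name, the `KUBECTL` override (handy for dry runs), and the use of an `HF_TOKEN` shell variable are assumptions of this example, not part of vLLM or the GPU Operator.

```shell
# Allow swapping kubectl for another command (for example "echo") to preview the call.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: create the Secret referenced by the vLLM Deployment.
# Expects the token in the HF_TOKEN shell variable and aborts if it is unset.
create_hf_secret() {
  : "${HF_TOKEN:?set HF_TOKEN to your Hugging Face access token}"
  "$KUBECTL" create secret generic hf-token \
    --namespace ai \
    --from-literal=token="$HF_TOKEN"
}

# Usage (against a live cluster):
#   HF_TOKEN=<your-token> create_hf_secret
# Once the pod is Ready, a quick smoke test of the OpenAI-compatible API:
#   kubectl -n ai port-forward deploy/vllm-server 8000:8000 &
#   curl -s http://localhost:8000/v1/models
```

Passing the token through a variable keeps it out of the manifest, although it can still end up in shell history depending on how you set it.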

Step 6: GPU Monitoring with Prometheus

The DCGM Exporter (installed by the GPU Operator) exposes GPU metrics. If you run the Prometheus Operator, add a ServiceMonitor so Prometheus scrapes it:

yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter  # Verify with: kubectl get svc -n gpu-operator --show-labels
  endpoints:
    - port: gpu-metrics
      interval: 15s

Key GPU metrics available:

promql

# GPU utilization
DCGM_FI_DEV_GPU_UTIL
 
# GPU memory used
DCGM_FI_DEV_FB_USED
 
# GPU temperature
DCGM_FI_DEV_GPU_TEMP
 
# Power usage
DCGM_FI_DEV_POWER_USAGE
 
# Memory copy utilization (%), a proxy for memory bandwidth pressure
DCGM_FI_DEV_MEM_COPY_UTIL

Import the NVIDIA DCGM Exporter Dashboard into Grafana (dashboard ID 12239).

Multi-GPU and MIG (A100/H100)

For NVIDIA A100 or H100 GPUs, you can partition one GPU into multiple smaller GPU instances using MIG (Multi-Instance GPU):

bash

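# Enable MIG Manager in GPU Operator
# --reuse-values keeps the settings from the original install;
# mig.strategy=mixed exposes each slice under its own resource name
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --reuse-values \
  --set migManager.enabled=true \
  --set mig.strategy=mixed
 
# Select the MIG layout for the node (1g.10gb is the A100 80GB profile)
kubectl label node <gpu-node> nvidia.com/mig.config=all-1g.10gb --overwrite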

MIG partitions a single A100 80GB into up to 7 instances of 1g.10gb each — run 7 separate inference workloads on one GPU.

Request a MIG slice:

yaml

resources:
  limits:
    nvidia.com/mig-1g.10gb: 1  # Request 1 MIG slice (resource name used with mig.strategy=mixed)

For smaller workloads (dev, testing), enable GPU time-slicing to share one physical GPU among multiple pods:

yaml

# time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # 4 pods share 1 GPU

bash

kubectl apply -f time-slicing-config.yaml
 
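# --reuse-values keeps the settings from the original install;
# config.default selects the "any" key of the ConfigMap for all GPU nodes
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --reuse-values \
  --set devicePlugin.config.name=time-slicing-config \
  --set devicePlugin.config.default=any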

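Each pod still requests nvidia.com/gpu: 1, but up to four pods share one physical GPU via time-slicing. Time-slicing provides no memory or fault isolation between those pods, so prefer MIG when workloads must not interfere with each other.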
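With time-slicing enabled, a node advertises more nvidia.com/gpu resources than it has physical GPUs: the physical count multiplied by the configured replicas. The sketch below shows that arithmetic; `expected_gpu_slots` is a hypothetical helper name, and the comparison command in the comment assumes the device plugin has already picked up the new config.

```shell
# Hypothetical helper: how many nvidia.com/gpu resources a node should
# advertise once time-slicing is active.
#   $1 = physical GPUs on the node
#   $2 = replicas value from the time-slicing ConfigMap
expected_gpu_slots() {
  physical="$1"
  replicas="$2"
  echo $(( physical * replicas ))
}

# Usage: compare the result with what the node actually reports.
#   expected_gpu_slots 1 4
#   kubectl get node <gpu-node> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```

If the node still reports the physical count, the device plugin has most likely not reloaded the ConfigMap yet.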

Common Issues

Driver installation stuck:

bash

kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset
# Often a kernel version mismatch — check node kernel vs supported driver versions
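To compare kernels across nodes without logging in to each one, the node status already carries that information. A small sketch follows; the `node_kernels` function name and the `KUBECTL` override are assumptions made for illustration.

```shell
# Allow swapping kubectl for another command (for example "echo") to preview the call.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: list kernel version and OS image for every node, so a
# node whose kernel differs from the rest of the pool stands out.
node_kernels() {
  "$KUBECTL" get nodes -o custom-columns='NODE:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion,OS:.status.nodeInfo.osImage'
}

# Usage (against a live cluster):
#   node_kernels
```

A node that was upgraded to a newer kernel than its peers is a common reason for the driver container to fail while the rest of the pool works.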

GPU not visible in pod:

bash

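# Check toolkit is running
kubectl get pod -n gpu-operator | grep toolkit
# Check the container runtime reported by the node (the field name is capitalized, hence -i)
kubectl describe node <gpu-node> | grep -i "container runtime"
# Should show containerd (or CRI-O); the NVIDIA runtime itself is registered as a RuntimeClass
kubectl get runtimeclass nvidia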

OOMKilled in GPU workload:

yaml

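# Increase shared memory (/dev/shm) for multi-process workloads such as
# PyTorch DataLoader workers or NCCL. A memory-backed emptyDir counts against
# the container's memory limit, so raise that limit as well if needed.
# Under the pod spec:
volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 4Gi
# Under the container:
volumeMounts:
  - name: dshm
    mountPath: /dev/shm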

The GPU Operator makes running GPU workloads on Kubernetes significantly simpler — what used to require manual driver management on every node is now automated. Combine it with Karpenter for automatic GPU node provisioning and you have a fully automated AI compute platform.


Affiliate note: AWS EC2 P3/P4/P5 instances provide NVIDIA V100, A100, and H100 GPUs for Kubernetes workloads. Lambda Labs offers GPU cloud at competitive rates for ML teams.
