How to Deploy NVIDIA NIM on Kubernetes for Fast LLM Inference
NVIDIA NIM containers give you production-grade LLM inference, typically 2–4x the throughput of vanilla vLLM on NVIDIA GPUs. Here's how to deploy NIM on Kubernetes with GPU nodes.
NVIDIA NIM (NVIDIA Inference Microservices) is the fastest way to run production LLM inference on Kubernetes. It's a pre-optimized container that packages the model, TensorRT-LLM inference engine, and an OpenAI-compatible API in a single deployable unit.
Think of it as vLLM but with NVIDIA's hardware optimizations baked in — typically 2–4x faster on NVIDIA GPUs.
What is NVIDIA NIM?
NIM is a collection of containerized microservices for AI inference. Each NIM container includes:
- The model weights (or downloads them at startup)
- TensorRT-LLM optimized inference engine
- OpenAI-compatible REST API (`/v1/chat/completions`, `/v1/completions`)
- Health endpoints for Kubernetes probes
Available NIMs include: Llama 3.1, Mistral, Mixtral, Phi-3, CodeLlama, and more.
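Because every NIM exposes the same OpenAI-compatible surface, any OpenAI client works against it unchanged. As a rough sketch using only the Python standard library (the localhost URL and model name are assumptions matching the Llama 3.1 8B deployment later in this guide, reached via `kubectl port-forward`):

```python
import json
import urllib.request

def chat_request(model: str, prompt: str, max_tokens: int = 200) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def send(base_url: str, payload: dict) -> dict:
    """POST the payload to a NIM /v1/chat/completions endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (with the Step 5 port-forward running):
#   reply = send("http://localhost:8000",
#                chat_request("meta/llama-3.1-8b-instruct", "Hello!"))
#   print(reply["choices"][0]["message"]["content"])
```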
Prerequisites
- Kubernetes cluster with NVIDIA GPU nodes
- NVIDIA GPU Operator installed
- NGC API key (free at ngc.nvidia.com)
- Minimum: 1x A100 40GB or 2x A10G for Llama 3.1 8B
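A quick way to sanity-check the GPU sizing above: fp16 weights take roughly 2 bytes per parameter, plus working headroom for the KV cache and activations. A back-of-the-envelope helper (the 20% overhead factor is my assumption, not an NVIDIA figure):

```python
def estimate_vram_gb(params_billion: float,
                     bytes_per_param: int = 2,
                     overhead: float = 1.2) -> float:
    """Rough VRAM needed to serve a model: weights (fp16 = 2 bytes/param)
    times a fudge factor for KV cache and activations."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes-per-GB
    return round(weights_gb * overhead, 1)

# Llama 3.1 8B: ~16 GB of weights, ~19 GB with headroom -> fits 1x A100 40GB
print(estimate_vram_gb(8))    # 19.2
# Llama 3.1 70B: ~140 GB of weights -> needs multiple GPUs
print(estimate_vram_gb(70))   # 168.0
```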
Step 1 — Install NVIDIA GPU Operator
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true
```

Verify the GPU is visible to Kubernetes:

```bash
kubectl get nodes -o json | jq '.items[].status.allocatable | select(."nvidia.com/gpu")'
# Output should include: "nvidia.com/gpu": "1"
```

Step 2 — Create NGC API Key Secret
Get your API key from ngc.nvidia.com → Account → Generate API Key.
The secrets below live in a dedicated `nim` namespace, so create it first, then the image pull secret:

```bash
kubectl create namespace nim

kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<YOUR_NGC_API_KEY> \
  --namespace=nim
```

Also create a secret for the NIM API key:

```bash
kubectl create secret generic nim-secrets \
  --from-literal=NGC_API_KEY=<YOUR_NGC_API_KEY> \
  --namespace=nim
```

Step 3 — Create Persistent Volume for Model Cache
NIM downloads model weights on first start. Cache them so restarts are fast:
```yaml
# nim-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nim-model-cache
  namespace: nim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3  # Use fast SSD storage
  resources:
    requests:
      storage: 100Gi  # Llama 3.1 8B needs ~16GB, 70B needs ~140GB
```

```bash
kubectl apply -f nim-pvc.yaml
```

Step 4 — Deploy NIM (Llama 3.1 8B)
```yaml
# nim-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-3-1-8b-nim
  namespace: nim
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-3-1-8b-nim
  template:
    metadata:
      labels:
        app: llama-3-1-8b-nim
    spec:
      imagePullSecrets:
        - name: ngc-secret
      containers:
        - name: nim
          image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: nim-secrets
                  key: NGC_API_KEY
            - name: NIM_CACHE_PATH
              value: /opt/nim/.cache
          volumeMounts:
            - name: model-cache
              mountPath: /opt/nim/.cache
          resources:
            limits:
              nvidia.com/gpu: 1  # Request 1 GPU
              memory: 32Gi
              cpu: "8"
            requests:
              nvidia.com/gpu: 1
              memory: 16Gi
              cpu: "4"
          readinessProbe:
            httpGet:
              path: /v1/health/ready
              port: 8000
            initialDelaySeconds: 120  # Model loading takes time
            periodSeconds: 10
            failureThreshold: 30
          livenessProbe:
            httpGet:
              path: /v1/health/live
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: nim-model-cache
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        nvidia.com/gpu.present: "true"
---
apiVersion: v1
kind: Service
metadata:
  name: llama-3-1-8b-nim
  namespace: nim
spec:
  selector:
    app: llama-3-1-8b-nim
  ports:
    - port: 8000
      targetPort: 8000
      name: http
```

```bash
kubectl apply -f nim-deployment.yaml

# Watch pod startup (model download takes 5-15 minutes first time)
kubectl logs -f deployment/llama-3-1-8b-nim -n nim
```

Step 5 — Test the API
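Model loading takes minutes, so requests sent too early will fail. One option is to poll the same readiness endpoint the Kubernetes probe uses before firing traffic. A minimal sketch (the URL assumes the port-forward below is active; `fetch` is injectable purely so the logic is testable without a live server):

```python
import time
import urllib.error
import urllib.request

def wait_ready(url: str, timeout_s: float = 900, poll_s: float = 5,
               fetch=urllib.request.urlopen) -> bool:
    """Poll a NIM readiness endpoint until it answers HTTP 200 or we time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with fetch(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # connection refused / 503: model still loading
        time.sleep(poll_s)
    return False

# Example: wait_ready("http://localhost:8000/v1/health/ready")
```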
```bash
# Port-forward for testing
kubectl port-forward svc/llama-3-1-8b-nim 8000:8000 -n nim
```

```bash
# Test chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
      {"role": "user", "content": "Explain Kubernetes in 3 sentences"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'
```

Step 6 — Horizontal Scaling with KEDA
Scale NIM pods based on request load, using the average number of in-flight requests scraped from Prometheus. This assumes KEDA is installed in the cluster and Prometheus is scraping the NIM `/metrics` endpoint:
```yaml
# keda-scaler.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nim-scaler
  namespace: nim
spec:
  scaleTargetRef:
    name: llama-3-1-8b-nim
  minReplicaCount: 1
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: nim_active_requests
        query: avg(nim_active_requests{service="llama-3-1-8b-nim"})
        threshold: "10"  # Scale up when avg requests > 10
```

NIM vs vLLM vs Ollama
| | NIM | vLLM | Ollama |
|---|---|---|---|
| Performance | Fastest on NVIDIA | Fast | Moderate |
| Setup complexity | Medium | Medium | Easiest |
| GPU required | Yes (NVIDIA) | Yes | No (CPU ok) |
| OpenAI API compat | Yes | Yes | Yes |
| Production ready | Yes | Yes | Dev/testing |
| Cost | Free to try; production use needs an NVIDIA AI Enterprise license | Free (open source) | Free (open source) |
| Best for | Production, high throughput | Production, flexibility | Local dev |
Cost on AWS
For production NIM deployment on EKS:
| Instance | GPU | Cost/hr | Best for |
|---|---|---|---|
| g5.xlarge | 1x A10G 24GB | $1.01 | Llama 3.1 8B |
| g5.12xlarge | 4x A10G 96GB | $5.67 | Llama 3.1 70B |
| p4d.24xlarge | 8x A100 40GB | $32.77 | Large deployments |
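On-demand GPU instances bill per hour whether or not they're busy, so the monthly figure is the one to budget against. A small helper to turn the table's hourly rates into monthly cost (730 hours/month is the usual cloud convention; the rates come from the table above and change over time):

```python
HOURS_PER_MONTH = 730  # common cloud convention: 24 * 365 / 12

def monthly_cost(hourly_usd: float, replicas: int = 1) -> float:
    """Monthly on-demand cost for always-on GPU nodes."""
    return round(hourly_usd * HOURS_PER_MONTH * replicas, 2)

print(monthly_cost(1.01))              # g5.xlarge:     737.3
print(monthly_cost(5.67))              # g5.12xlarge:   4139.1
print(monthly_cost(1.01, replicas=4))  # 4x scaled-out 8B: 2949.2
```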
Use DigitalOcean GPU Droplets for cheaper GPU compute during development — H100 nodes available at lower cost than AWS.
Key Monitoring Metrics
```bash
# NIM exposes metrics at /metrics
curl http://localhost:8000/metrics | grep nim_

# Key metrics to watch:
# nim_active_requests — concurrent requests being processed
# nim_request_duration_seconds — inference latency histogram
# nim_tokens_generated_total — throughput counter
# nim_gpu_memory_used_bytes — GPU memory pressure
```

NIM is currently the highest-performance option for running NVIDIA-optimized LLMs on Kubernetes. If you're building an AI platform in 2026 and have NVIDIA GPUs, it's the first thing to evaluate before vLLM.
For more on MLOps and AI infrastructure on Kubernetes, KodeKloud has hands-on GPU labs.