How to Deploy NVIDIA NIM on Kubernetes for Fast LLM Inference
NVIDIA NIM containers give you production-grade LLM inference, typically 2–4x the throughput of vanilla vLLM on NVIDIA GPUs. Here's how to deploy NIM on Kubernetes with GPU nodes.
NVIDIA NIM (NVIDIA Inference Microservices) is the fastest way to run production LLM inference on Kubernetes. It's a pre-optimized container that packages the model, TensorRT-LLM inference engine, and an OpenAI-compatible API in a single deployable unit.
Think of it as vLLM but with NVIDIA's hardware optimizations baked in — typically 2–4x faster on NVIDIA GPUs.
What is NVIDIA NIM?
NIM is a collection of containerized microservices for AI inference. Each NIM container includes:
- The model weights (or downloads them at startup)
- TensorRT-LLM optimized inference engine
- OpenAI-compatible REST API (`/v1/chat/completions`, `/v1/completions`)
- Health endpoints for Kubernetes probes
Available NIMs include: Llama 3.1, Mistral, Mixtral, Phi-3, CodeLlama, and more.
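Because every NIM exposes the same OpenAI-compatible surface, any OpenAI client works against it unchanged. As a rough sketch using only the Python standard library (the localhost URL and model name are assumptions matching the Llama 3.1 8B deployment later in this guide, reached via `kubectl port-forward`):

```python
import json
import urllib.request

def chat_request(model: str, prompt: str, max_tokens: int = 200) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def send(base_url: str, payload: dict) -> dict:
    """POST the payload to a NIM /v1/chat/completions endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (with the Step 5 port-forward running):
#   reply = send("http://localhost:8000",
#                chat_request("meta/llama-3.1-8b-instruct", "Hello!"))
#   print(reply["choices"][0]["message"]["content"])
```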
Prerequisites
- Kubernetes cluster with NVIDIA GPU nodes
- NVIDIA GPU Operator installed
- NGC API key (free at ngc.nvidia.com)
- Minimum: 1x A100 40GB or 2x A10G for Llama 3.1 8B
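A quick way to sanity-check the GPU sizing above: fp16 weights take roughly 2 bytes per parameter, plus working headroom for the KV cache and activations. A back-of-the-envelope helper (the 20% overhead factor is my assumption, not an NVIDIA figure):

```python
def estimate_vram_gb(params_billion: float,
                     bytes_per_param: int = 2,
                     overhead: float = 1.2) -> float:
    """Rough VRAM needed to serve a model: weights (fp16 = 2 bytes/param)
    times a fudge factor for KV cache and activations."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes-per-GB
    return round(weights_gb * overhead, 1)

# Llama 3.1 8B: ~16 GB of weights, ~19 GB with headroom -> fits 1x A100 40GB
print(estimate_vram_gb(8))    # 19.2
# Llama 3.1 70B: ~140 GB of weights -> needs multiple GPUs
print(estimate_vram_gb(70))   # 168.0
```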
Step 1 — Install NVIDIA GPU Operator
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true
```

Verify the GPU is visible to Kubernetes:

```bash
kubectl get nodes -o json | jq '.items[].status.allocatable | select(."nvidia.com/gpu")'
# Output should include: "nvidia.com/gpu": "1"
```

Step 2 — Create NGC API Key Secret
Get your API key from ngc.nvidia.com → Account → Generate API Key.
The secrets below live in a dedicated `nim` namespace, so create it first, then the image pull secret:

```bash
kubectl create namespace nim

kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<YOUR_NGC_API_KEY> \
  --namespace=nim
```

Also create a secret for the NIM API key:

```bash
kubectl create secret generic nim-secrets \
  --from-literal=NGC_API_KEY=<YOUR_NGC_API_KEY> \
  --namespace=nim
```

Step 3 — Create Persistent Volume for Model Cache
NIM downloads model weights on first start. Cache them so restarts are fast:
```yaml
# nim-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nim-model-cache
  namespace: nim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3  # Use fast SSD storage
  resources:
    requests:
      storage: 100Gi  # Llama 3.1 8B needs ~16GB, 70B needs ~140GB
```

```bash
kubectl apply -f nim-pvc.yaml
```

Step 4 — Deploy NIM (Llama 3.1 8B)
```yaml
# nim-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-3-1-8b-nim
  namespace: nim
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-3-1-8b-nim
  template:
    metadata:
      labels:
        app: llama-3-1-8b-nim
    spec:
      imagePullSecrets:
        - name: ngc-secret
      containers:
        - name: nim
          image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: nim-secrets
                  key: NGC_API_KEY
            - name: NIM_CACHE_PATH
              value: /opt/nim/.cache
          volumeMounts:
            - name: model-cache
              mountPath: /opt/nim/.cache
          resources:
            limits:
              nvidia.com/gpu: 1  # Request 1 GPU
              memory: 32Gi
              cpu: "8"
            requests:
              nvidia.com/gpu: 1
              memory: 16Gi
              cpu: "4"
          readinessProbe:
            httpGet:
              path: /v1/health/ready
              port: 8000
            initialDelaySeconds: 120  # Model loading takes time
            periodSeconds: 10
            failureThreshold: 30
          livenessProbe:
            httpGet:
              path: /v1/health/live
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: nim-model-cache
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        nvidia.com/gpu.present: "true"
---
apiVersion: v1
kind: Service
metadata:
  name: llama-3-1-8b-nim
  namespace: nim
spec:
  selector:
    app: llama-3-1-8b-nim
  ports:
    - port: 8000
      targetPort: 8000
      name: http
```

```bash
kubectl apply -f nim-deployment.yaml

# Watch pod startup (model download takes 5-15 minutes first time)
kubectl logs -f deployment/llama-3-1-8b-nim -n nim
```

Step 5 — Test the API
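Model loading takes minutes, so requests sent too early will fail. One option is to poll the same readiness endpoint the Kubernetes probe uses before firing traffic. A minimal sketch (the URL assumes the port-forward below is active; `fetch` is injectable purely so the logic is testable without a live server):

```python
import time
import urllib.error
import urllib.request

def wait_ready(url: str, timeout_s: float = 900, poll_s: float = 5,
               fetch=urllib.request.urlopen) -> bool:
    """Poll a NIM readiness endpoint until it answers HTTP 200 or we time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with fetch(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # connection refused / 503: model still loading
        time.sleep(poll_s)
    return False

# Example: wait_ready("http://localhost:8000/v1/health/ready")
```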
```bash
# Port-forward for testing
kubectl port-forward svc/llama-3-1-8b-nim 8000:8000 -n nim
```

```bash
# Test chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
      {"role": "user", "content": "Explain Kubernetes in 3 sentences"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'
```

Step 6 — Horizontal Scaling with KEDA
Scale NIM pods based on request load, using the average number of in-flight requests scraped from Prometheus. This assumes KEDA is installed in the cluster and Prometheus is scraping the NIM `/metrics` endpoint:
```yaml
# keda-scaler.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nim-scaler
  namespace: nim
spec:
  scaleTargetRef:
    name: llama-3-1-8b-nim
  minReplicaCount: 1
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: nim_active_requests
        query: avg(nim_active_requests{service="llama-3-1-8b-nim"})
        threshold: "10"  # Scale up when avg requests > 10
```

NIM vs vLLM vs Ollama
| | NIM | vLLM | Ollama |
|---|---|---|---|
| Performance | Fastest on NVIDIA | Fast | Moderate |
| Setup complexity | Medium | Medium | Easiest |
| GPU required | Yes (NVIDIA) | Yes | No (CPU ok) |
| OpenAI API compat | Yes | Yes | Yes |
| Production ready | Yes | Yes | Dev/testing |
| Cost | Free to try; production use needs an NVIDIA AI Enterprise license | Free (open source) | Free (open source) |
| Best for | Production, high throughput | Production, flexibility | Local dev |
Cost on AWS
For production NIM deployment on EKS:
| Instance | GPU | Cost/hr | Best for |
|---|---|---|---|
| g5.xlarge | 1x A10G 24GB | $1.01 | Llama 3.1 8B |
| g5.12xlarge | 4x A10G 96GB | $5.67 | Llama 3.1 70B |
| p4d.24xlarge | 8x A100 40GB | $32.77 | Large deployments |
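On-demand GPU instances bill per hour whether or not they're busy, so the monthly figure is the one to budget against. A small helper to turn the table's hourly rates into monthly cost (730 hours/month is the usual cloud convention; the rates come from the table above and change over time):

```python
HOURS_PER_MONTH = 730  # common cloud convention: 24 * 365 / 12

def monthly_cost(hourly_usd: float, replicas: int = 1) -> float:
    """Monthly on-demand cost for always-on GPU nodes."""
    return round(hourly_usd * HOURS_PER_MONTH * replicas, 2)

print(monthly_cost(1.01))              # g5.xlarge:     737.3
print(monthly_cost(5.67))              # g5.12xlarge:   4139.1
print(monthly_cost(1.01, replicas=4))  # 4x scaled-out 8B: 2949.2
```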
Use DigitalOcean GPU Droplets for cheaper GPU compute during development — H100 nodes available at lower cost than AWS.
Key Monitoring Metrics
```bash
# NIM exposes metrics at /metrics
curl http://localhost:8000/metrics | grep nim_

# Key metrics to watch:
# nim_active_requests — concurrent requests being processed
# nim_request_duration_seconds — inference latency histogram
# nim_tokens_generated_total — throughput counter
# nim_gpu_memory_used_bytes — GPU memory pressure
```

NIM is currently the highest-performance option for running NVIDIA-optimized LLMs on Kubernetes. If you're building an AI platform in 2026 and have NVIDIA GPUs, it's the first thing to evaluate before vLLM.
For more on MLOps and AI infrastructure on Kubernetes, KodeKloud has hands-on GPU labs.