
How to Deploy NVIDIA NIM on Kubernetes for Fast LLM Inference

NVIDIA NIM containers give you production-grade LLM inference with 3x better throughput than vanilla vLLM. Here's how to deploy NIM on Kubernetes with GPU nodes.

DevOpsBoys · May 11, 2026 · 4 min read

NVIDIA NIM (NVIDIA Inference Microservices) is the fastest way to run production LLM inference on Kubernetes. It's a pre-optimized container that packages the model, TensorRT-LLM inference engine, and an OpenAI-compatible API in a single deployable unit.

Think of it as vLLM but with NVIDIA's hardware optimizations baked in — typically 2–4x faster on NVIDIA GPUs.


What is NVIDIA NIM?

NIM is a collection of containerized microservices for AI inference. Each NIM container includes:

  • The model weights (or downloads them at startup)
  • TensorRT-LLM optimized inference engine
  • OpenAI-compatible REST API (/v1/chat/completions, /v1/completions)
  • Health endpoints for Kubernetes probes

Available NIMs include: Llama 3.1, Mistral, Mixtral, Phi-3, CodeLlama, and more.


Prerequisites

  • Kubernetes cluster with NVIDIA GPU nodes
  • NVIDIA GPU Operator installed
  • NGC API key (free at ngc.nvidia.com)
  • Minimum: 1x A100 40GB or 1x A10G 24GB for Llama 3.1 8B
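The VRAM minimums above come from simple arithmetic: FP16 weights take about 2 bytes per parameter, plus headroom for the KV cache and activations. A rough sizing sketch (the 30% overhead factor here is our own ballpark assumption, not an NVIDIA figure):

```python
def estimate_gpu_memory_gb(params_billions: float,
                           bytes_per_param: int = 2,
                           overhead: float = 0.3) -> float:
    """Rough VRAM estimate: FP16 weights (2 bytes/param)
    plus a fudge factor for KV cache and activations."""
    weights_gb = params_billions * bytes_per_param
    return weights_gb * (1 + overhead)

if __name__ == "__main__":
    for size in (8, 70):
        print(f"Llama 3.1 {size}B: ~{estimate_gpu_memory_gb(size):.0f} GB VRAM")
```

For 8B that works out to ~16 GB of weights and ~21 GB with headroom, which is why a single 24 GB A10G is tight but workable, and a 40 GB A100 is comfortable.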

Step 1 — Install NVIDIA GPU Operator

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true
```

Verify GPU is visible to Kubernetes:

```bash
kubectl get nodes -o json | jq '.items[].status.allocatable | select(."nvidia.com/gpu")'
# Output should include: "nvidia.com/gpu": "1"
```

Step 2 — Create NGC API Key Secret

Get your API key from ngc.nvidia.com → Account → Generate API Key.

```bash
# Create the namespace first
kubectl create namespace nim

kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<YOUR_NGC_API_KEY> \
  --namespace=nim
```

Also create a secret for the NIM API key:

```bash
kubectl create secret generic nim-secrets \
  --from-literal=NGC_API_KEY=<YOUR_NGC_API_KEY> \
  --namespace=nim
```

Step 3 — Create Persistent Volume for Model Cache

NIM downloads model weights on first start. Cache them so restarts are fast:

```yaml
# nim-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nim-model-cache
  namespace: nim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3   # Use fast SSD storage
  resources:
    requests:
      storage: 100Gi      # Llama 3.1 8B needs ~16GB, 70B needs ~140GB
```

```bash
kubectl apply -f nim-pvc.yaml
```

Step 4 — Deploy NIM (Llama 3.1 8B)

```yaml
# nim-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-3-1-8b-nim
  namespace: nim
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-3-1-8b-nim
  template:
    metadata:
      labels:
        app: llama-3-1-8b-nim
    spec:
      imagePullSecrets:
        - name: ngc-secret
      containers:
        - name: nim
          image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: nim-secrets
                  key: NGC_API_KEY
            - name: NIM_CACHE_PATH
              value: /opt/nim/.cache
          volumeMounts:
            - name: model-cache
              mountPath: /opt/nim/.cache
          resources:
            limits:
              nvidia.com/gpu: 1       # Request 1 GPU
              memory: 32Gi
              cpu: "8"
            requests:
              nvidia.com/gpu: 1
              memory: 16Gi
              cpu: "4"
          readinessProbe:
            httpGet:
              path: /v1/health/ready
              port: 8000
            initialDelaySeconds: 120   # Model loading takes time
            periodSeconds: 10
            failureThreshold: 30
          livenessProbe:
            httpGet:
              path: /v1/health/live
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: nim-model-cache
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        nvidia.com/gpu.present: "true"
---
apiVersion: v1
kind: Service
metadata:
  name: llama-3-1-8b-nim
  namespace: nim
spec:
  selector:
    app: llama-3-1-8b-nim
  ports:
    - port: 8000
      targetPort: 8000
      name: http
```

```bash
kubectl apply -f nim-deployment.yaml

# Watch pod startup (model download takes 5-15 minutes first time)
kubectl logs -f deployment/llama-3-1-8b-nim -n nim
```

Step 5 — Test the API

```bash
# Port-forward for testing
kubectl port-forward svc/llama-3-1-8b-nim 8000:8000 -n nim

# Test chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
      {"role": "user", "content": "Explain Kubernetes in 3 sentences"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'
```
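Because the API is OpenAI-compatible, any OpenAI client library works against it. Here is a dependency-free sketch using only the Python standard library; the URL assumes the port-forward above, and the helper names (build_chat_request, chat) are our own:

```python
import json
import urllib.request

NIM_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(prompt: str,
                       model: str = "meta/llama-3.1-8b-instruct",
                       max_tokens: int = 200,
                       temperature: float = 0.7) -> dict:
    """Assemble an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(prompt: str) -> str:
    """POST the payload to the NIM service and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        NIM_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# With the port-forward running:
#   print(chat("Explain Kubernetes in 3 sentences"))
```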

Step 6 — Horizontal Scaling with KEDA

Scale NIM pods based on GPU queue depth:

```yaml
# keda-scaler.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nim-scaler
  namespace: nim
spec:
  scaleTargetRef:
    name: llama-3-1-8b-nim
  minReplicaCount: 1
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: nim_active_requests
        query: avg(nim_active_requests{service="llama-3-1-8b-nim"})
        threshold: "10"   # Scale up when avg requests > 10
```
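KEDA hands this metric to the Horizontal Pod Autoscaler, which scales roughly proportionally: desiredReplicas = ceil(currentReplicas × currentMetric / threshold), clamped to the min/max counts. A simplified sketch of that arithmetic, ignoring HPA tolerance and stabilization windows:

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     threshold: float, min_r: int = 1, max_r: int = 4) -> int:
    """HPA-style scaling decision (simplified): scale proportionally
    to how far the observed metric is from the target, then clamp."""
    desired = math.ceil(current_replicas * current_metric / threshold)
    return max(min_r, min(max_r, desired))

# e.g. 1 replica averaging 25 active requests with threshold 10 -> 3 replicas
```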

NIM vs vLLM vs Ollama

| | NIM | vLLM | Ollama |
| --- | --- | --- | --- |
| Performance | Fastest on NVIDIA | Fast | Moderate |
| Setup complexity | Medium | Medium | Easiest |
| GPU required | Yes (NVIDIA) | Yes | No (CPU ok) |
| OpenAI API compat | Yes | Yes | Yes |
| Production ready | Yes | Yes | Dev/testing |
| Cost | Free (model weights) | Free | Free |
| Best for | Production, high throughput | Production, flexibility | Local dev |

Cost on AWS

For production NIM deployment on EKS:

| Instance | GPU | Cost/hr | Best for |
| --- | --- | --- | --- |
| g5.xlarge | 1x A10G 24GB | $1.01 | Llama 3.1 8B |
| g5.12xlarge | 4x A10G 96GB | $5.67 | Llama 3.1 70B |
| p4d.24xlarge | 8x A100 40GB | $32.77 | Large deployments |

Use DigitalOcean GPU Droplets for cheaper GPU compute during development — H100 nodes available at lower cost than AWS.


Key Monitoring Metrics

```bash
# NIM exposes metrics at /metrics
curl http://localhost:8000/metrics | grep nim_

# Key metrics to watch:
# nim_active_requests — concurrent requests being processed
# nim_request_duration_seconds — inference latency histogram
# nim_tokens_generated_total — throughput counter
# nim_gpu_memory_used_bytes — GPU memory pressure
```
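Prometheus normally scrapes these for you, but for quick checks you can parse the exposition format directly. A minimal sketch (metric names as listed above; parse_prom_metrics is our own helper):

```python
def parse_prom_metrics(text: str, prefix: str = "nim_") -> dict:
    """Parse Prometheus text exposition format, keeping only
    metrics whose name starts with the given prefix."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_part, _, value = line.rpartition(" ")
        if name_part.startswith(prefix):
            metrics[name_part] = float(value)
    return metrics

sample = """\
# HELP nim_active_requests Concurrent requests
nim_active_requests 3
nim_tokens_generated_total 15872
go_goroutines 42
"""
# parse_prom_metrics(sample) keeps only the two nim_* series
```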

NIM is currently the highest-performance option for running NVIDIA-optimized LLMs on Kubernetes. If you're building an AI platform in 2026 on NVIDIA GPUs, evaluate NIM first before reaching for vLLM.

For more on MLOps and AI infrastructure on Kubernetes, KodeKloud has hands-on GPU labs.
