Deploy NVIDIA Triton Inference Server on Kubernetes (2026)
Step-by-step guide to running NVIDIA Triton Inference Server on Kubernetes with GPU nodes — model repository setup, deployment, autoscaling, and monitoring.
When you need to serve ML models in production at scale — with dynamic batching, multi-model serving, GPU sharing, and low-latency gRPC/HTTP serving — NVIDIA Triton Inference Server is the go-to solution.
Triton supports TensorFlow, PyTorch, ONNX, TensorRT, and more. It handles batching automatically, exposes gRPC and HTTP APIs, and integrates with Prometheus for metrics.
This guide walks through deploying Triton on Kubernetes with GPU nodes, setting up a model repository, and autoscaling with KEDA.
Why Triton Over a Simple FastAPI Server?
| Feature | FastAPI + HuggingFace | Triton |
|---|---|---|
| Dynamic batching | Manual | Built-in |
| Multi-model serving | One model per service | Hundreds per instance |
| Model frameworks | Python only | TF, PyTorch, ONNX, TensorRT |
| GPU utilization | Single model | MIG / model sharing |
| gRPC support | Extra work | Native |
| Concurrent requests | Limited | Highly optimized |
| Prometheus metrics | Manual | Built-in |
For high-throughput production, Triton is significantly more efficient.
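The dynamic-batching row is where most of that efficiency comes from: every GPU forward pass carries fixed launch and scheduling overhead, so packing many requests into one pass amortizes it. A back-of-the-envelope model with illustrative (made-up, not measured) costs:

```python
# Toy cost model: each forward pass pays a fixed overhead,
# plus a marginal per-item cost. Numbers are illustrative only.
overhead_ms = 5.0   # kernel launch + scheduling per forward pass
per_item_ms = 0.5   # marginal cost per image in the batch

def throughput(batch_size: int) -> float:
    """Requests/second when every forward pass carries `batch_size` items."""
    batch_latency_ms = overhead_ms + per_item_ms * batch_size
    return batch_size / batch_latency_ms * 1000

print(f"batch=1:  {throughput(1):.0f} req/s")   # overhead paid per request
print(f"batch=32: {throughput(32):.0f} req/s")  # overhead amortized over 32
```

With these toy numbers, batching 32 requests per pass is roughly 8x the throughput of one-at-a-time serving; real gains depend on the model and hardware.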
Architecture
Client → HTTP/gRPC → Triton Inference Server (K8s pod on GPU node)
                            ↓
                Model Repository (S3 or PVC)
                            ↓
                Backend Executors:
                - TensorRT (fastest)
                - PyTorch (TorchScript)
                - ONNX Runtime
                - TensorFlow SavedModel
                - Python Backend (for custom logic)
Step 1 — Set Up EKS with GPU Nodes
# Create cluster
eksctl create cluster \
  --name triton-cluster \
  --region ap-south-1 \
  --nodegroup-name cpu-workers \
  --node-type t3.medium \
  --nodes 2
# Add GPU node group (g4dn.2xlarge: 1x T4 GPU with 16GB VRAM, 8 vCPUs, 32GB RAM)
eksctl create nodegroup \
  --cluster triton-cluster \
  --name gpu-workers \
  --node-type g4dn.2xlarge \
  --nodes 1 \
  --nodes-min 0 \
  --nodes-max 4 \
  --node-labels "role=inference"
# Install NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/deployments/static/nvidia-device-plugin.yml
# Verify GPU is visible
kubectl get nodes -l role=inference -o json | \
  jq '.items[].status.allocatable["nvidia.com/gpu"]'

Step 2 — Prepare the Model Repository
Triton uses a model repository — a directory structure in S3 (or a PVC) with a specific format.
Model Repository Structure
model-repository/
├── resnet50/
│   ├── config.pbtxt          # Model configuration
│   └── 1/                    # Version 1
│       └── model.onnx
├── bert-classifier/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
└── ensemble-pipeline/        # Multi-model pipeline
    ├── config.pbtxt
    └── 1/
        └── (empty)
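If you stage the repository locally before syncing it to S3, a quick sanity check of the layout catches the two most common mistakes: a missing config.pbtxt or a missing numeric version directory. A minimal sketch (`check_repository` is an illustrative helper, not part of Triton):

```python
from pathlib import Path

def check_repository(root: str) -> list[str]:
    """Flag models missing a config.pbtxt or a numeric version directory."""
    problems = []
    for model_dir in Path(root).iterdir():
        if not model_dir.is_dir():
            continue
        if not (model_dir / "config.pbtxt").is_file():
            problems.append(f"{model_dir.name}: missing config.pbtxt")
        versions = [d for d in model_dir.iterdir() if d.is_dir() and d.name.isdigit()]
        if not versions:
            problems.append(f"{model_dir.name}: no version directory (e.g. 1/)")
    return problems

if __name__ == "__main__" and Path("model-repository").is_dir():
    for problem in check_repository("model-repository"):
        print(problem)
```

Run it against your staging directory before the `aws s3 cp` step; Triton itself will refuse to load (or silently skip, depending on flags) malformed model directories.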
Export a PyTorch Model to ONNX
import torch
import torchvision.models as models

# Load pretrained ResNet50 (`weights=` replaces the deprecated `pretrained=True`)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()

# Dummy input for tracing
dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"},
    },
    opset_version=17,
)
print("Exported to model.onnx")

Model Config (config.pbtxt)
name: "resnet50"
backend: "onnxruntime"
max_batch_size: 32

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 100  # Wait up to 100μs to form a batch
}

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

Upload to S3
# Create S3 bucket for model repository
aws s3 mb s3://my-triton-models --region ap-south-1
# Upload model
mkdir -p resnet50/1
cp model.onnx resnet50/1/
cat > resnet50/config.pbtxt << 'EOF'
name: "resnet50"
backend: "onnxruntime"
max_batch_size: 32
input [ { name: "input" data_type: TYPE_FP32 dims: [ 3, 224, 224 ] } ]
output [ { name: "output" data_type: TYPE_FP32 dims: [ 1000 ] } ]
dynamic_batching { preferred_batch_size: [ 8, 16 ] }
instance_group [ { count: 1 kind: KIND_GPU } ]
EOF
aws s3 cp resnet50/ s3://my-triton-models/resnet50/ --recursive

Step 3 — Create IAM Role for S3 Access
Triton pods need to read from your S3 model repository:
# Create IAM policy
cat > triton-s3-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::my-triton-models",
      "arn:aws:s3:::my-triton-models/*"
    ]
  }]
}
EOF
aws iam create-policy \
  --policy-name TritonS3ReadPolicy \
  --policy-document file://triton-s3-policy.json

# Create IRSA (IAM Roles for Service Accounts)
eksctl create iamserviceaccount \
  --cluster triton-cluster \
  --namespace ml-serving \
  --name triton-sa \
  --attach-policy-arn arn:aws:iam::<ACCOUNT_ID>:policy/TritonS3ReadPolicy \
  --approve

Step 4 — Deploy Triton on Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
  namespace: ml-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8002"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: triton-sa
      nodeSelector:
        role: inference
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.02-py3
          args:
            - tritonserver
            - --model-repository=s3://my-triton-models
            - --model-control-mode=poll   # Auto-reload models on S3 changes
            - --repository-poll-secs=30
            - --log-verbose=1
            - --metrics-port=8002
          ports:
            - name: http
              containerPort: 8000
            - name: grpc
              containerPort: 8001
            - name: metrics
              containerPort: 8002
          resources:
            requests:
              cpu: "4"
              memory: "8Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "16Gi"
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
            failureThreshold: 12
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 8000
            initialDelaySeconds: 90
            periodSeconds: 15
---
apiVersion: v1
kind: Service
metadata:
  name: triton-server
  namespace: ml-serving
spec:
  selector:
    app: triton-server
  ports:
    - name: http
      port: 8000
      targetPort: 8000
    - name: grpc
      port: 8001
      targetPort: 8001
    - name: metrics
      port: 8002
      targetPort: 8002
  type: ClusterIP

kubectl create namespace ml-serving
kubectl apply -f triton-deployment.yaml
# Watch startup
kubectl logs -f deployment/triton-server -n ml-serving
# ...
# I0421 12:00:00.000000 1 server.cc:677] Started GRPCInferenceService at 0.0.0.0:8001
# I0421 12:00:00.000000 1 server.cc:698] Started HTTPService at 0.0.0.0:8000

Step 5 — Test Inference
# Port-forward to test
kubectl port-forward svc/triton-server 8000:8000 -n ml-serving
# Check loaded models (Triton's HTTP API lists models via a POST to the repository index)
curl -X POST http://localhost:8000/v2/repository/index
# [{"name":"resnet50","version":"1","state":"READY"}]
# Run inference (HTTP)
python3 << 'EOF'
import tritonclient.http as httpclient
import numpy as np
client = httpclient.InferenceServerClient("localhost:8000")
# Prepare input (random image for testing)
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
inputs = [httpclient.InferInput("input", image.shape, "FP32")]
inputs[0].set_data_from_numpy(image)
outputs = [httpclient.InferRequestedOutput("output")]
response = client.infer("resnet50", inputs, outputs=outputs)
result = response.as_numpy("output")
print(f"Output shape: {result.shape}") # (1, 1000)
print(f"Top class: {np.argmax(result)}")
EOF

Step 6 — Autoscale with KEDA
Scale Triton pods based on GPU queue depth (using Prometheus metrics):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: triton-scaler
  namespace: ml-serving
spec:
  scaleTargetRef:
    name: triton-server
  minReplicaCount: 1
  maxReplicaCount: 4
  cooldownPeriod: 120
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        metricName: nv_inference_queue_duration_us
        # The query must return a plain value; KEDA compares it against `threshold`
        query: |
          avg(rate(nv_inference_queue_duration_us[1m]))
        threshold: "1000"   # Scale up above ~1000us of queue time accrued per second

Monitoring with Prometheus
Triton exposes rich metrics on port 8002:
- nv_inference_request_success - Successful inference requests
- nv_inference_request_failure - Failed requests
- nv_inference_count - Total inferences
- nv_inference_exec_count - Execution count (batched executions)
- nv_inference_request_duration_us - End-to-end latency (microseconds)
- nv_inference_queue_duration_us - Time spent in queue
- nv_inference_compute_input_duration_us - Input processing time
- nv_inference_compute_infer_duration_us - GPU compute time
- nv_inference_compute_output_duration_us - Output processing time
- nv_gpu_utilization - GPU utilization %
- nv_gpu_memory_used_bytes - GPU memory used
Add a Grafana dashboard to visualize throughput, latency, and GPU utilization.
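To make those dashboards (and the KEDA scaler) actually move, you need concurrent traffic. A minimal stdlib load generator against the KServe v2 HTTP infer endpoint, assuming the resnet50 model and the port-forward from Step 5 (`build_infer_request` and `send_one` are illustrative helpers):

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def build_infer_request(batch: int = 1) -> bytes:
    """KServe v2 JSON inference payload for the resnet50 model (zeros as pixels)."""
    payload = {
        "inputs": [{
            "name": "input",
            "shape": [batch, 3, 224, 224],
            "datatype": "FP32",
            "data": [0.0] * (batch * 3 * 224 * 224),
        }]
    }
    return json.dumps(payload).encode()

def send_one(_):
    req = urllib.request.Request(
        "http://localhost:8000/v2/models/resnet50/infer",
        data=build_infer_request(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    try:
        # 200 requests, 16 in flight at a time -- enough to move the queue metrics
        with ThreadPoolExecutor(max_workers=16) as pool:
            statuses = list(pool.map(send_one, range(200)))
        print("ok:", statuses.count(200), "/", len(statuses))
    except OSError as err:
        print("server not reachable:", err)
```

With 16 requests in flight, you should see `nv_inference_queue_duration_us` climb and dynamic batching group requests; push `max_workers` higher to trigger the KEDA scale-out from Step 6.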
Summary
| Step | What it does |
|---|---|
| GPU node group | g4dn.2xlarge with T4 GPU |
| Model repository | ONNX model + config.pbtxt in S3 |
| IRSA | Pod IAM permissions for S3 |
| Triton Deployment | Loads models from S3, serves HTTP + gRPC |
| KEDA ScaledObject | Autoscale based on queue latency |
Triton is significantly more efficient than a plain FastAPI model server for production use — automatic batching alone can improve throughput 10–20x for batch-friendly models.