Deploy NVIDIA Triton Inference Server on Kubernetes (2026)
Step-by-step guide to running NVIDIA Triton Inference Server on Kubernetes with GPU nodes — model repository setup, deployment, autoscaling, and monitoring.
When you need to serve ML models in production at scale — with dynamic batching, multi-model serving, GPU sharing, and low-latency gRPC/HTTP serving — NVIDIA Triton Inference Server is the go-to solution.
Triton supports TensorFlow, PyTorch, ONNX, TensorRT, and more. It handles batching automatically, exposes gRPC and HTTP APIs, and integrates with Prometheus for metrics.
This guide walks through deploying Triton on Kubernetes with GPU nodes, setting up a model repository, and autoscaling with KEDA.
Why Triton Over a Simple FastAPI Server?
| Feature | FastAPI + HuggingFace | Triton |
|---|---|---|
| Dynamic batching | Manual | Built-in |
| Multi-model serving | One model per service | Hundreds per instance |
| Model frameworks | Python only | TF, PyTorch, ONNX, TensorRT |
| GPU utilization | Single model | MIG / model sharing |
| gRPC support | Extra work | Native |
| Concurrent requests | Limited | Highly optimized |
| Prometheus metrics | Manual | Built-in |
For high-throughput production, Triton is significantly more efficient.
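The dynamic-batching row is where most of that efficiency comes from: every GPU forward pass carries fixed launch and scheduling overhead, so packing many requests into one pass amortizes it. A back-of-the-envelope model with illustrative (made-up, not measured) costs:

```python
# Toy cost model: each forward pass pays a fixed overhead,
# plus a marginal per-item cost. Numbers are illustrative only.
overhead_ms = 5.0   # kernel launch + scheduling per forward pass
per_item_ms = 0.5   # marginal cost per image in the batch

def throughput(batch_size: int) -> float:
    """Requests/second when every forward pass carries `batch_size` items."""
    batch_latency_ms = overhead_ms + per_item_ms * batch_size
    return batch_size / batch_latency_ms * 1000

print(f"batch=1:  {throughput(1):.0f} req/s")   # overhead paid per request
print(f"batch=32: {throughput(32):.0f} req/s")  # overhead amortized over 32
```

With these toy numbers, batching 32 requests per pass is roughly 8x the throughput of one-at-a-time serving; real gains depend on the model and hardware.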
Architecture
Client → HTTP/gRPC → Triton Inference Server (K8s pod on GPU node)
                            ↓
                Model Repository (S3 or PVC)
                            ↓
                Backend Executors:
                - TensorRT (fastest)
                - PyTorch (TorchScript)
                - ONNX Runtime
                - TensorFlow SavedModel
                - Python Backend (for custom logic)
Step 1 — Set Up EKS with GPU Nodes
# Create cluster
eksctl create cluster \
  --name triton-cluster \
  --region ap-south-1 \
  --nodegroup-name cpu-workers \
  --node-type t3.medium \
  --nodes 2
# Add GPU node group (g4dn.2xlarge: 1x T4 GPU with 16GB VRAM, 8 vCPUs, 32GB RAM)
eksctl create nodegroup \
  --cluster triton-cluster \
  --name gpu-workers \
  --node-type g4dn.2xlarge \
  --nodes 1 \
  --nodes-min 0 \
  --nodes-max 4 \
  --node-labels "role=inference"
# Install NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/deployments/static/nvidia-device-plugin.yml
# Verify GPU is visible
kubectl get nodes -l role=inference -o json | \
  jq '.items[].status.allocatable["nvidia.com/gpu"]'

Step 2 — Prepare the Model Repository
Triton uses a model repository — a directory structure in S3 (or a PVC) with a specific format.
Model Repository Structure
model-repository/
├── resnet50/
│   ├── config.pbtxt          # Model configuration
│   └── 1/                    # Version 1
│       └── model.onnx
├── bert-classifier/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
└── ensemble-pipeline/        # Multi-model pipeline
    ├── config.pbtxt
    └── 1/
        └── (empty)
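If you stage the repository locally before syncing it to S3, a quick sanity check of the layout catches the two most common mistakes: a missing config.pbtxt or a missing numeric version directory. A minimal sketch (`check_repository` is an illustrative helper, not part of Triton):

```python
from pathlib import Path

def check_repository(root: str) -> list[str]:
    """Flag models missing a config.pbtxt or a numeric version directory."""
    problems = []
    for model_dir in Path(root).iterdir():
        if not model_dir.is_dir():
            continue
        if not (model_dir / "config.pbtxt").is_file():
            problems.append(f"{model_dir.name}: missing config.pbtxt")
        versions = [d for d in model_dir.iterdir() if d.is_dir() and d.name.isdigit()]
        if not versions:
            problems.append(f"{model_dir.name}: no version directory (e.g. 1/)")
    return problems

if __name__ == "__main__" and Path("model-repository").is_dir():
    for problem in check_repository("model-repository"):
        print(problem)
```

Run it against your staging directory before the `aws s3 cp` step; Triton itself will refuse to load (or silently skip, depending on flags) malformed model directories.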
Export a PyTorch Model to ONNX
import torch
import torchvision.models as models

# Load pretrained ResNet50 (`weights=` replaces the deprecated `pretrained=True`)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()

# Dummy input for tracing
dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"},
    },
    opset_version=17,
)
print("Exported to model.onnx")

Model Config (config.pbtxt)
name: "resnet50"
backend: "onnxruntime"
max_batch_size: 32

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 100  # Wait up to 100μs to form a batch
}

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

Upload to S3
# Create S3 bucket for model repository
aws s3 mb s3://my-triton-models --region ap-south-1
# Upload model
mkdir -p resnet50/1
cp model.onnx resnet50/1/
cat > resnet50/config.pbtxt << 'EOF'
name: "resnet50"
backend: "onnxruntime"
max_batch_size: 32
input [ { name: "input" data_type: TYPE_FP32 dims: [ 3, 224, 224 ] } ]
output [ { name: "output" data_type: TYPE_FP32 dims: [ 1000 ] } ]
dynamic_batching { preferred_batch_size: [ 8, 16 ] }
instance_group [ { count: 1 kind: KIND_GPU } ]
EOF
aws s3 cp resnet50/ s3://my-triton-models/resnet50/ --recursive

Step 3 — Create IAM Role for S3 Access
Triton pods need to read from your S3 model repository:
# Create IAM policy
cat > triton-s3-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::my-triton-models",
      "arn:aws:s3:::my-triton-models/*"
    ]
  }]
}
EOF
aws iam create-policy \
  --policy-name TritonS3ReadPolicy \
  --policy-document file://triton-s3-policy.json

# Create IRSA (IAM Roles for Service Accounts)
eksctl create iamserviceaccount \
  --cluster triton-cluster \
  --namespace ml-serving \
  --name triton-sa \
  --attach-policy-arn arn:aws:iam::<ACCOUNT_ID>:policy/TritonS3ReadPolicy \
  --approve

Step 4 — Deploy Triton on Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
  namespace: ml-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8002"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: triton-sa
      nodeSelector:
        role: inference
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.02-py3
          args:
            - tritonserver
            - --model-repository=s3://my-triton-models
            - --model-control-mode=poll   # Auto-reload models on S3 changes
            - --repository-poll-secs=30
            - --log-verbose=1
            - --metrics-port=8002
          ports:
            - name: http
              containerPort: 8000
            - name: grpc
              containerPort: 8001
            - name: metrics
              containerPort: 8002
          resources:
            requests:
              cpu: "4"
              memory: "8Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "16Gi"
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
            failureThreshold: 12
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 8000
            initialDelaySeconds: 90
            periodSeconds: 15
---
apiVersion: v1
kind: Service
metadata:
  name: triton-server
  namespace: ml-serving
spec:
  selector:
    app: triton-server
  ports:
    - name: http
      port: 8000
      targetPort: 8000
    - name: grpc
      port: 8001
      targetPort: 8001
    - name: metrics
      port: 8002
      targetPort: 8002
  type: ClusterIP

kubectl create namespace ml-serving
kubectl apply -f triton-deployment.yaml
# Watch startup
kubectl logs -f deployment/triton-server -n ml-serving
# ...
# I0421 12:00:00.000000 1 server.cc:677] Started GRPCInferenceService at 0.0.0.0:8001
# I0421 12:00:00.000000 1 server.cc:698] Started HTTPService at 0.0.0.0:8000

Step 5 — Test Inference
# Port-forward to test
kubectl port-forward svc/triton-server 8000:8000 -n ml-serving
# Check loaded models (Triton's HTTP API lists models via a POST to the repository index)
curl -X POST http://localhost:8000/v2/repository/index
# [{"name":"resnet50","version":"1","state":"READY"}]
# Run inference (HTTP)
python3 << 'EOF'
import tritonclient.http as httpclient
import numpy as np
client = httpclient.InferenceServerClient("localhost:8000")
# Prepare input (random image for testing)
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
inputs = [httpclient.InferInput("input", image.shape, "FP32")]
inputs[0].set_data_from_numpy(image)
outputs = [httpclient.InferRequestedOutput("output")]
response = client.infer("resnet50", inputs, outputs=outputs)
result = response.as_numpy("output")
print(f"Output shape: {result.shape}") # (1, 1000)
print(f"Top class: {np.argmax(result)}")
EOF

Step 6 — Autoscale with KEDA
Scale Triton pods based on GPU queue depth (using Prometheus metrics):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: triton-scaler
  namespace: ml-serving
spec:
  scaleTargetRef:
    name: triton-server
  minReplicaCount: 1
  maxReplicaCount: 4
  cooldownPeriod: 120
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        metricName: nv_inference_queue_duration_us
        # The query must return a plain value; KEDA compares it against `threshold`
        query: |
          avg(rate(nv_inference_queue_duration_us[1m]))
        threshold: "1000"   # Scale up above ~1000us of queue time accrued per second

Monitoring with Prometheus
Triton exposes rich metrics on port 8002:
- nv_inference_request_success - Successful inference requests
- nv_inference_request_failure - Failed requests
- nv_inference_count - Total inferences
- nv_inference_exec_count - Execution count (batched executions)
- nv_inference_request_duration_us - End-to-end latency (microseconds)
- nv_inference_queue_duration_us - Time spent in queue
- nv_inference_compute_input_duration_us - Input processing time
- nv_inference_compute_infer_duration_us - GPU compute time
- nv_inference_compute_output_duration_us - Output processing time
- nv_gpu_utilization - GPU utilization %
- nv_gpu_memory_used_bytes - GPU memory used
Add a Grafana dashboard to visualize throughput, latency, and GPU utilization.
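To make those dashboards (and the KEDA scaler) actually move, you need concurrent traffic. A minimal stdlib load generator against the KServe v2 HTTP infer endpoint, assuming the resnet50 model and the port-forward from Step 5 (`build_infer_request` and `send_one` are illustrative helpers):

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def build_infer_request(batch: int = 1) -> bytes:
    """KServe v2 JSON inference payload for the resnet50 model (zeros as pixels)."""
    payload = {
        "inputs": [{
            "name": "input",
            "shape": [batch, 3, 224, 224],
            "datatype": "FP32",
            "data": [0.0] * (batch * 3 * 224 * 224),
        }]
    }
    return json.dumps(payload).encode()

def send_one(_):
    req = urllib.request.Request(
        "http://localhost:8000/v2/models/resnet50/infer",
        data=build_infer_request(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    try:
        # 200 requests, 16 in flight at a time -- enough to move the queue metrics
        with ThreadPoolExecutor(max_workers=16) as pool:
            statuses = list(pool.map(send_one, range(200)))
        print("ok:", statuses.count(200), "/", len(statuses))
    except OSError as err:
        print("server not reachable:", err)
```

With 16 requests in flight, you should see `nv_inference_queue_duration_us` climb and dynamic batching group requests; push `max_workers` higher to trigger the KEDA scale-out from Step 6.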
Summary
| Step | What it does |
|---|---|
| GPU node group | g4dn.2xlarge with T4 GPU |
| Model repository | ONNX model + config.pbtxt in S3 |
| IRSA | Pod IAM permissions for S3 |
| Triton Deployment | Loads models from S3, serves HTTP + gRPC |
| KEDA ScaledObject | Autoscale based on queue latency |
Triton is significantly more efficient than a plain FastAPI model server for production use — automatic batching alone can improve throughput 10–20x for batch-friendly models.