
Deploy NVIDIA Triton Inference Server on Kubernetes (2026)

Step-by-step guide to running NVIDIA Triton Inference Server on Kubernetes with GPU nodes — model repository setup, deployment, autoscaling, and monitoring.

DevOpsBoys · Apr 21, 2026 · 5 min read

When you need to serve ML models in production at scale — with batching, multi-model serving, GPU sharing, and low latency — NVIDIA Triton Inference Server is the go-to solution.

Triton supports TensorFlow, PyTorch, ONNX, TensorRT, and more. It handles batching automatically, exposes gRPC and HTTP APIs, and integrates with Prometheus for metrics.

This guide walks through deploying Triton on Kubernetes with GPU nodes, setting up a model repository, and autoscaling with KEDA.


Why Triton Over a Simple FastAPI Server?

| Feature | FastAPI + HuggingFace | Triton |
|---|---|---|
| Dynamic batching | Manual | Built-in |
| Multi-model serving | One model per service | Hundreds per instance |
| Model frameworks | Python only | TF, PyTorch, ONNX, TensorRT |
| GPU utilization | Single model | MIG / model sharing |
| gRPC support | Extra work | Native |
| Concurrent requests | Limited | Highly optimized |
| Prometheus metrics | Manual | Built-in |

For high-throughput production, Triton is significantly more efficient.


Architecture

Client → HTTP/gRPC → Triton Inference Server (K8s pod on GPU node)
                            ↓
                    Model Repository (S3 or PVC)
                            ↓
                  Backend Executors:
                  - TensorRT (fastest)
                  - PyTorch (torchscript)
                  - ONNX Runtime
                  - TensorFlow SavedModel
                  - Python Backend (for custom logic)

Step 1 — Set Up EKS with GPU Nodes

bash
# Create cluster
eksctl create cluster \
  --name triton-cluster \
  --region ap-south-1 \
  --nodegroup-name cpu-workers \
  --node-type t3.medium \
  --nodes 2
 
# Add GPU node group (g4dn.2xlarge: 1 NVIDIA T4 GPU with 16GB VRAM, 8 vCPUs, 32GB RAM)
eksctl create nodegroup \
  --cluster triton-cluster \
  --name gpu-workers \
  --node-type g4dn.2xlarge \
  --nodes 1 \
  --nodes-min 0 \
  --nodes-max 4 \
  --node-labels "role=inference"
 
# Install NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
 
# Verify GPU is visible
kubectl get nodes -l role=inference -o json | \
  jq '.items[].status.allocatable["nvidia.com/gpu"]'

Step 2 — Prepare the Model Repository

Triton uses a model repository — a directory structure in S3 (or a PVC) with a specific format.

Model Repository Structure

model-repository/
├── resnet50/
│   ├── config.pbtxt          # Model configuration
│   └── 1/                    # Version 1
│       └── model.onnx
├── bert-classifier/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
└── ensemble-pipeline/        # Multi-model pipeline
    ├── config.pbtxt
    └── 1/
        └── (empty)
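
Triton refuses to load models whose directory layout is wrong, so it helps to sanity-check the repository locally before uploading. Below is a minimal stdlib sketch; `validate_repo` is our own helper, not a Triton API, and it is simplified (with `--strict-model-config=false`, Triton can infer a config for some backends, so a missing `config.pbtxt` is not always fatal):

```python
from pathlib import Path

def validate_repo(root: str) -> list[str]:
    """Return a list of layout problems; an empty list means the
    repository matches the structure Triton expects."""
    root_path = Path(root)
    if not root_path.is_dir():
        return [f"{root}: not a directory"]
    errors = []
    for model_dir in sorted(root_path.iterdir()):
        if not model_dir.is_dir():
            continue  # stray files at the top level are ignored
        if not (model_dir / "config.pbtxt").is_file():
            errors.append(f"{model_dir.name}: missing config.pbtxt")
        # Each model needs at least one numeric version subdirectory
        versions = [d for d in model_dir.iterdir()
                    if d.is_dir() and d.name.isdigit()]
        if not versions:
            errors.append(f"{model_dir.name}: no numeric version directory")
    return errors
```

Run it against your local `model-repository/` directory before the `aws s3 cp` step; an empty result means the layout is sound.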

Export a PyTorch Model to ONNX

python
import torch
import torchvision.models as models
 
# Load pretrained ResNet50 (the pretrained= flag is deprecated in newer torchvision)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()
 
# Dummy input for tracing
dummy_input = torch.randn(1, 3, 224, 224)
 
# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"}
    },
    opset_version=17
)
 
print("Exported to model.onnx")

Model Config (config.pbtxt)

protobuf
name: "resnet50"
backend: "onnxruntime"
max_batch_size: 32
 
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
 
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
 
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 100   # Wait up to 100μs to form a batch
}
 
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
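
How should you pick `max_queue_delay_microseconds`? A rough back-of-envelope (our own toy estimate, not anything Triton computes): with roughly uniform arrivals, a request that waits the full delay is joined by about rate × delay other requests, so a tiny delay only forms batches under heavy traffic.

```python
def expected_batch_size(arrival_rate_per_s: float,
                        max_queue_delay_us: float) -> float:
    """Toy estimate: 1 (the waiting request) plus the number of
    requests expected to arrive during the queue-delay window."""
    return 1.0 + arrival_rate_per_s * (max_queue_delay_us / 1_000_000)
```

At 1,000 req/s, a 100μs delay yields batches of only ~1.1 requests, while a 5,000μs delay reaches ~6 — so for moderate traffic you may need a larger delay (at a latency cost) before the preferred batch sizes of 8–32 are ever hit.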

Upload to S3

bash
# Create S3 bucket for model repository
aws s3 mb s3://my-triton-models --region ap-south-1
 
# Upload model
mkdir -p resnet50/1
cp model.onnx resnet50/1/
cat > resnet50/config.pbtxt << 'EOF'
name: "resnet50"
backend: "onnxruntime"
max_batch_size: 32
input [ { name: "input" data_type: TYPE_FP32 dims: [ 3, 224, 224 ] } ]
output [ { name: "output" data_type: TYPE_FP32 dims: [ 1000 ] } ]
dynamic_batching { preferred_batch_size: [ 8, 16 ] }
instance_group [ { count: 1 kind: KIND_GPU } ]
EOF
 
aws s3 cp resnet50/ s3://my-triton-models/resnet50/ --recursive

Step 3 — Create IAM Role for S3 Access

Triton pods need to read from your S3 model repository:

bash
# Create IAM policy
cat > triton-s3-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::my-triton-models",
      "arn:aws:s3:::my-triton-models/*"
    ]
  }]
}
EOF
 
aws iam create-policy \
  --policy-name TritonS3ReadPolicy \
  --policy-document file://triton-s3-policy.json
 
# Create IRSA (IAM Roles for Service Accounts)
# Note: the ml-serving namespace must already exist (kubectl create namespace ml-serving)
eksctl create iamserviceaccount \
  --cluster triton-cluster \
  --namespace ml-serving \
  --name triton-sa \
  --attach-policy-arn arn:aws:iam::<ACCOUNT_ID>:policy/TritonS3ReadPolicy \
  --approve

Step 4 — Deploy Triton on Kubernetes

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
  namespace: ml-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8002"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: triton-sa
      nodeSelector:
        role: inference
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.02-py3
        args:
        - tritonserver
        - --model-repository=s3://my-triton-models
        - --model-control-mode=poll           # Auto-reload models on S3 changes
        - --repository-poll-secs=30
        - --log-verbose=1
        - --metrics-port=8002
        ports:
        - name: http
          containerPort: 8000
        - name: grpc
          containerPort: 8001
        - name: metrics
          containerPort: 8002
        resources:
          requests:
            cpu: "4"
            memory: "8Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "8"
            memory: "16Gi"
            nvidia.com/gpu: "1"
        readinessProbe:
          httpGet:
            path: /v2/health/ready
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
          failureThreshold: 12
        livenessProbe:
          httpGet:
            path: /v2/health/live
            port: 8000
          initialDelaySeconds: 90
          periodSeconds: 15
---
apiVersion: v1
kind: Service
metadata:
  name: triton-server
  namespace: ml-serving
spec:
  selector:
    app: triton-server
  ports:
  - name: http
    port: 8000
    targetPort: 8000
  - name: grpc
    port: 8001
    targetPort: 8001
  - name: metrics
    port: 8002
    targetPort: 8002
  type: ClusterIP
bash
kubectl create namespace ml-serving
kubectl apply -f triton-deployment.yaml
 
# Watch startup
kubectl logs -f deployment/triton-server -n ml-serving
# ...
# I0421 12:00:00.000000 1 server.cc:677] Started GRPCInferenceService at 0.0.0.0:8001
# I0421 12:00:00.000000 1 server.cc:698] Started HTTPService at 0.0.0.0:8000
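
In scripts and CI it is handier to block until the server reports ready than to eyeball logs. A small stdlib sketch that polls Triton's `/v2/health/ready` endpoint (the helper name `wait_until_ready` is ours):

```python
import time
import urllib.request
import urllib.error

def wait_until_ready(base_url: str, timeout_s: float = 300.0) -> bool:
    """Poll Triton's readiness endpoint until it returns 200 or we time out.
    Readiness can take a while: the server must pull models from S3 first."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v2/health/ready",
                                        timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet, or still loading models
        time.sleep(2)
    return False
```

Run it against the port-forwarded service, e.g. `wait_until_ready("http://localhost:8000")`.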

Step 5 — Test Inference

bash
# Port-forward for local testing (run in a separate terminal)
kubectl port-forward svc/triton-server 8000:8000 -n ml-serving
 
# List loaded models (repository index endpoint)
curl -s -X POST http://localhost:8000/v2/repository/index
# [{"name":"resnet50","version":"1","state":"READY"}]
 
# Run inference (HTTP)
python3 << 'EOF'
import tritonclient.http as httpclient
import numpy as np
 
client = httpclient.InferenceServerClient("localhost:8000")
 
# Prepare input (random image for testing)
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
inputs = [httpclient.InferInput("input", image.shape, "FP32")]
inputs[0].set_data_from_numpy(image)
 
outputs = [httpclient.InferRequestedOutput("output")]
response = client.infer("resnet50", inputs, outputs=outputs)
 
result = response.as_numpy("output")
print(f"Output shape: {result.shape}")   # (1, 1000)
print(f"Top class: {np.argmax(result)}")
EOF
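
The model emits raw logits, so `np.argmax` gives the top class but not a probability. To turn logits into probabilities and a top-5 list, a pure-Python softmax sketch (our own helpers, not part of `tritonclient`):

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs: list[float], k: int = 5) -> list[tuple[int, float]]:
    """Return (class_index, probability) pairs for the k largest scores."""
    return sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:k]
```

Feed it `result[0].tolist()` from the Triton response to print the model's five most likely ImageNet classes with their probabilities.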

Step 6 — Autoscale with KEDA

Scale Triton pods based on GPU queue depth (using Prometheus metrics):

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: triton-scaler
  namespace: ml-serving
spec:
  scaleTargetRef:
    name: triton-server
  minReplicaCount: 1
  maxReplicaCount: 4
  cooldownPeriod: 120
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      metricName: nv_inference_queue_duration_us
      query: |
        avg(rate(nv_inference_queue_duration_us[1m]) / rate(nv_inference_exec_count[1m]))
      threshold: "1000"   # Scale up when avg queue time > 1000μs (1ms)

Monitoring with Prometheus

Triton exposes rich metrics on port 8002:

nv_inference_request_success              - Successful inference requests
nv_inference_request_failure              - Failed requests
nv_inference_count                        - Total inferences
nv_inference_exec_count                   - Execution count (batched)
nv_inference_request_duration_us          - End-to-end latency (microseconds)
nv_inference_queue_duration_us            - Time spent in queue
nv_inference_compute_input_duration_us    - Input processing time
nv_inference_compute_infer_duration_us    - GPU compute time
nv_inference_compute_output_duration_us   - Output processing time
nv_gpu_utilization                        - GPU utilization %
nv_gpu_memory_used_bytes                  - GPU memory used

Add a Grafana dashboard to visualize throughput, latency, and GPU utilization.
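
To spot-check these without a full Prometheus stack, you can fetch port 8002 directly and parse the text exposition format. A stdlib sketch (the parser is deliberately simplified: it drops labels and keeps the last sample seen per metric name):

```python
import urllib.request

def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition into {metric_name: value},
    ignoring labels and HELP/TYPE comment lines."""
    out: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # strip the {label="..."} block
        try:
            out[name] = float(value)
        except ValueError:
            continue
    return out

def scrape(url: str = "http://localhost:8002/metrics") -> dict[str, float]:
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_metrics(resp.read().decode())
```

With a port-forward on 8002, `scrape()["nv_gpu_utilization"]` gives a quick read on whether the GPU is actually busy.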


Summary

| Step | What it does |
|---|---|
| GPU node group | g4dn.2xlarge with T4 GPU |
| Model repository | ONNX model + config.pbtxt in S3 |
| IRSA | Pod IAM permissions for S3 |
| Triton Deployment | Loads models from S3, serves HTTP + gRPC |
| KEDA ScaledObject | Autoscale based on queue latency |

Triton is significantly more efficient than a plain FastAPI model server for production use — automatic batching alone can improve throughput 10–20x for batch-friendly models.

Deploy Triton on GPU nodes with DigitalOcean GPU Droplets or EKS GPU node groups — new DigitalOcean accounts get $200 free credit, enough for several hours of GPU inference experimentation.
