Set Up Ray Serve on Kubernetes for ML Model Inference (2026)
Ray Serve handles batching, autoscaling, model composition, and GPU sharing for ML models served at scale on Kubernetes. Complete setup guide.
Ray Serve is a framework for serving ML models in production. Unlike FastAPI + Uvicorn, Ray Serve handles batching, auto-scaling, model composition, and GPU sharing out of the box. Here's how to deploy it on Kubernetes.
Why Ray Serve Over Plain FastAPI?
| Feature | FastAPI | Ray Serve |
|---|---|---|
| Request batching | Manual | ✅ Automatic |
| GPU sharing between models | ❌ | ✅ |
| Multi-model pipelines | Manual | ✅ Native |
| Auto-scaling replicas | With extra config | ✅ Built-in |
| Zero-downtime model updates | Complex | ✅ Native |
| Fractional GPU allocation | ❌ | ✅ |
For simple single-model serving, FastAPI works fine. For production with multiple models, batching requirements, and cost-sensitive GPU usage — Ray Serve is worth it.
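To make the comparison concrete, here is a minimal sketch of multi-model composition in Ray Serve: an ingress deployment fans out to two other deployments through handles, and each deployment declares its own resources and replicas. The class names and route are hypothetical stand-ins, and the await-on-handle pattern assumes the DeploymentHandle API in recent Ray 2.x releases.

# pipeline_sketch.py (illustrative only, not part of the setup below)
from ray import serve
from ray.serve.handle import DeploymentHandle
from starlette.requests import Request


@serve.deployment(ray_actor_options={"num_cpus": 1})
class Preprocessor:
    def clean(self, text: str) -> str:
        return text.strip().lower()


@serve.deployment(ray_actor_options={"num_cpus": 1})
class SentimentModel:
    def predict(self, text: str) -> dict:
        # Stand-in for a real model call
        return {"label": "POSITIVE" if "great" in text else "NEGATIVE"}


@serve.deployment
class Pipeline:
    def __init__(self, preprocessor: DeploymentHandle, model: DeploymentHandle):
        self.preprocessor = preprocessor
        self.model = model

    async def __call__(self, request: Request) -> dict:
        data = await request.json()
        cleaned = await self.preprocessor.clean.remote(data["text"])
        return await self.model.predict.remote(cleaned)


# Each deployment scales and is scheduled independently; the whole graph
# is served behind a single HTTP route.
app = Pipeline.bind(Preprocessor.bind(), SentimentModel.bind())
# serve.run(app, route_prefix="/sentiment")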
Step 1: Install KubeRay Operator
KubeRay manages Ray clusters on Kubernetes:
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator \
--namespace ray-system \
--create-namespace \
  --version 1.1.0

Step 2: Deploy a RayCluster
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ml-inference-cluster
  namespace: ray-system
spec:
  rayVersion: "2.10.0"
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
      num-cpus: "0"  # head node handles scheduling only
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray-ml:2.10.0-gpu
            ports:
              - containerPort: 6379  # GCS server
              - containerPort: 8265  # Dashboard
              - containerPort: 8000  # Ray Serve
            resources:
              requests:
                cpu: "2"
                memory: "4Gi"
              limits:
                cpu: "4"
                memory: "8Gi"
  workerGroupSpecs:
    - groupName: gpu-workers
      replicas: 2
      minReplicas: 1
      maxReplicas: 5
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray-ml:2.10.0-gpu
              resources:
                requests:
                  cpu: "4"
                  memory: "16Gi"
                  nvidia.com/gpu: "1"
                limits:
                  cpu: "8"
                  memory: "32Gi"
                  nvidia.com/gpu: "1"
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule

kubectl apply -f ray-cluster.yaml
# Wait for cluster to be ready
kubectl get raycluster -n ray-system
kubectl get pods -n ray-system

Step 3: Write a Ray Serve Application
# serve_app.py
import ray
from ray import serve
from transformers import pipeline
import torch
from typing import List

# Request batching is configured on predict_batch below via @serve.batch,
# not on the deployment decorator.
@serve.deployment(
    num_replicas=2,
    ray_actor_options={
        "num_gpus": 0.5,  # fractional GPU, so two replicas fit on one GPU
        "num_cpus": 2,
    },
)
class TextClassifier:
    def __init__(self):
        self.model = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=0 if torch.cuda.is_available() else -1,
        )

    # Batch up to 32 requests, waiting at most 50 ms to form a batch
    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.05)
    async def predict_batch(self, texts: List[str]) -> List[dict]:
        # Ray Serve collects the single texts from concurrent callers into one list
        return self.model(texts, truncation=True, max_length=512)

    async def __call__(self, request):
        payload = await request.json()
        # Pass one text; @serve.batch returns this caller's single result
        return await self.predict_batch(payload["text"])
@serve.deployment(
    num_replicas=1,
    ray_actor_options={"num_gpus": 1, "num_cpus": 4},
)
class EmbeddingModel:
    def __init__(self):
        from sentence_transformers import SentenceTransformer

        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    async def __call__(self, request):
        data = await request.json()
        embeddings = self.model.encode(data["texts"])
        return {"embeddings": embeddings.tolist()}
# Bind the applications at module level; the RayService in Step 4 imports these names
text_classifier = TextClassifier.bind()
embedding_model = EmbeddingModel.bind()

if __name__ == "__main__":  # manual deployment; each app needs a distinct name
    ray.init(address="auto")
    serve.run(text_classifier, name="text-classifier", route_prefix="/classify")
    serve.run(embedding_model, name="embedding-model", route_prefix="/embed")

Step 4: Deploy the Application as a RayService
Instead of running the script manually, use KubeRay's RayService CRD, which deploys the Serve applications declaratively. Note that serve_app.py must be importable inside the Ray containers: either bake it into the image or ship it with a runtime_env working directory in the Serve config.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ml-inference
  namespace: ray-system
spec:
  serviceUnhealthySecondThreshold: 300
  deploymentUnhealthySecondThreshold: 300
  serveConfigV2: |
    applications:
      - name: text-classifier
        import_path: serve_app:text_classifier
        route_prefix: /classify
        deployments:
          - name: TextClassifier
            num_replicas: 2
            ray_actor_options:
              num_gpus: 0.5
              num_cpus: 2
      - name: embedding-model
        import_path: serve_app:embedding_model
        route_prefix: /embed
        deployments:
          - name: EmbeddingModel
            num_replicas: 1
            ray_actor_options:
              num_gpus: 1
              num_cpus: 4
  rayClusterConfig:
    rayVersion: "2.10.0"
    headGroupSpec:
      # ... same as RayCluster above
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 2
        # ... same as RayCluster above

Step 5: Expose the Service
apiVersion: v1
kind: Service
metadata:
  name: ray-serve-svc
  namespace: ray-system
spec:
  selector:
    ray.io/node-type: head
  ports:
    - name: serve
      port: 8000
      targetPort: 8000
    - name: dashboard
      port: 8265
      targetPort: 8265
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ray-serve-ingress
  namespace: ray-system
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
    - host: inference.your-domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ray-serve-svc
                port:
                  number: 8000

Test Your Models
# Port-forward for testing
kubectl port-forward svc/ray-serve-svc -n ray-system 8000:8000
# Test text classification
curl http://localhost:8000/classify \
-H "Content-Type: application/json" \
-d '{"text": "Kubernetes is the best orchestration platform"}'
# Test embeddings
curl http://localhost:8000/embed \
-H "Content-Type: application/json" \
-d '{"texts": ["Hello world", "Kubernetes is great"]}'
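The same checks can be scripted from Python. A minimal sketch, assuming the port-forward above is active on localhost:8000 (test_client.py is a hypothetical helper name):

# test_client.py
import requests

resp = requests.post(
    "http://localhost:8000/classify",
    json={"text": "Kubernetes is the best orchestration platform"},
    timeout=30,
)
print(resp.json())  # e.g. {"label": "POSITIVE", "score": ...}

resp = requests.post(
    "http://localhost:8000/embed",
    json={"texts": ["Hello world", "Kubernetes is great"]},
    timeout=30,
)
print(len(resp.json()["embeddings"]))  # 2 vectors, one per input text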
# Access Ray Dashboard
kubectl port-forward svc/ray-serve-svc -n ray-system 8265:8265
# Open http://localhost:8265

Autoscaling Ray Serve Deployments
@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_num_ongoing_requests_per_replica": 5,
        "upscale_delay_s": 30,
        "downscale_delay_s": 300,
    },
    ray_actor_options={"num_gpus": 0.5},
)
class AutoScaledModel:
    # Ray Serve scales replicas based on ongoing requests per replica
    ...

Ray Serve adds replicas when the average number of ongoing requests per replica exceeds 5 (after a 30-second upscale delay) and removes them after 5 minutes of low traffic, staying within the 1 to 10 replica bounds.
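To watch the autoscaler react, keep a burst of concurrent requests in flight and follow the replica count in the Ray Dashboard. A rough sketch, assuming the /classify route uses the autoscaling_config above and the port-forward from the testing step is active (load_test.py is a hypothetical name):

# load_test.py
import concurrent.futures

import requests


def call(i: int) -> int:
    resp = requests.post(
        "http://localhost:8000/classify",
        json={"text": f"sample request {i}"},
        timeout=60,
    )
    return resp.status_code


# Keep roughly 50 requests in flight; with a target of 5 ongoing requests per
# replica, the autoscaler should push the deployment toward max_replicas.
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    codes = list(pool.map(call, range(2000)))

print({code: codes.count(code) for code in set(codes)})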
What you get:
- Multiple ML models serving on the same GPU cluster
- Automatic request batching (2–10x throughput improvement)
- Auto-scaling based on request load
- Zero-downtime model updates via RayService rolling updates
- Ray Dashboard for monitoring latency, throughput, and replica health