
Set Up Ray Serve on Kubernetes for ML Model Inference (2026)

Ray Serve handles batching, autoscaling, model composition, and GPU sharing for ML models at scale on Kubernetes. A complete setup guide using KubeRay.

DevOpsBoys · May 1, 2026 · 4 min read

Ray Serve is a framework for serving ML models in production. Unlike FastAPI + Uvicorn, Ray Serve handles batching, auto-scaling, model composition, and GPU sharing out of the box. Here's how to deploy it on Kubernetes.


Why Ray Serve Over Plain FastAPI?

| Feature | FastAPI | Ray Serve |
| --- | --- | --- |
| Request batching | Manual | ✅ Automatic |
| GPU sharing between models | ❌ Not built-in | ✅ Via fractional GPUs |
| Multi-model pipelines | Manual | ✅ Native |
| Auto-scaling replicas | With extra config | ✅ Built-in |
| Zero-downtime model updates | Complex | ✅ Native |
| Fractional GPU allocation | ❌ Not supported | ✅ Built-in (e.g. num_gpus: 0.5) |

For simple single-model serving, FastAPI works fine. For production with multiple models, batching requirements, and cost-sensitive GPU usage — Ray Serve is worth it.


Step 1: Install KubeRay Operator

KubeRay manages Ray clusters on Kubernetes:

bash
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
 
helm install kuberay-operator kuberay/kuberay-operator \
  --namespace ray-system \
  --create-namespace \
  --version 1.1.0
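
Before moving on, make sure the operator pod is up. A quick check, assuming the Deployment takes its name from the Helm release above:

bash
kubectl -n ray-system rollout status deployment/kuberay-operator
kubectl -n ray-system get pods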

Step 2: Deploy a RayCluster

yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ml-inference-cluster
  namespace: ray-system
spec:
  rayVersion: "2.10.0"
  
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
      num-cpus: "0"         # head node handles scheduling only
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray-ml:2.10.0-gpu
          ports:
          - containerPort: 6379   # GCS server
          - containerPort: 8265   # Dashboard
          - containerPort: 8000   # Ray Serve
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
 
  workerGroupSpecs:
  - groupName: gpu-workers
    replicas: 2
    minReplicas: 1
    maxReplicas: 5
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray-ml:2.10.0-gpu
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: "1"
        tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
Save the manifest as ray-cluster.yaml and apply it:

bash
kubectl apply -f ray-cluster.yaml
 
# Wait for cluster to be ready
kubectl get raycluster -n ray-system
kubectl get pods -n ray-system

Step 3: Write a Ray Serve Application

python
# serve_app.py
import ray
from ray import serve
from transformers import pipeline
import torch
from typing import List
 
@serve.deployment(
    num_replicas=2,
    ray_actor_options={
        "num_gpus": 0.5,      # fractional GPU: two replicas fit on one GPU
        "num_cpus": 2,
    },
)
class TextClassifier:
    def __init__(self):
        self.model = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=0 if torch.cuda.is_available() else -1,
        )
 
    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.05)
    async def predict_batch(self, texts: List[str]) -> List[dict]:
        results = self.model(texts, truncation=True, max_length=512)
        return results
 
    async def __call__(self, request):
        payload = await request.json()
        # serve.batch takes a single item per call; concurrent calls are
        # batched together and each caller gets back its own result
        return await self.predict_batch(payload["text"])
 
 
@serve.deployment(
    num_replicas=1,
    ray_actor_options={"num_gpus": 1, "num_cpus": 4},
)
class EmbeddingModel:
    def __init__(self):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
 
    async def __call__(self, request):
        data = await request.json()
        embeddings = self.model.encode(data["texts"])
        return {"embeddings": embeddings.tolist()}
 
 
# Bind the deployments; the RayService in Step 4 imports these objects directly
text_classifier = TextClassifier.bind()
embedding_model = EmbeddingModel.bind()


# Guard the driver code so importing this module (as RayService does)
# doesn't re-initialize Ray or redeploy the apps
if __name__ == "__main__":
    ray.init(address="auto")
    # Each application needs a distinct name, otherwise the second
    # serve.run call would replace the first
    serve.run(text_classifier, name="text-classifier", route_prefix="/classify")
    serve.run(embedding_model, name="embedding-model", route_prefix="/embed")
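
To smoke-test the script before wrapping it in a RayService, one option is to copy it to the head pod (labeled ray.io/node-type=head, the same label the Service in Step 5 selects on) and run it there:

bash
HEAD_POD=$(kubectl get pods -n ray-system -l ray.io/node-type=head \
  -o jsonpath='{.items[0].metadata.name}')
kubectl cp serve_app.py ray-system/$HEAD_POD:/tmp/serve_app.py
kubectl exec -it $HEAD_POD -n ray-system -- python /tmp/serve_app.py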

Step 4: Deploy the Application as a RayService

Instead of running the script manually, use KubeRay's RayService CRD. Note that the serve_app module referenced by import_path has to be importable on the cluster, either baked into the Ray image or shipped via a runtime_env working_dir in serveConfigV2:

yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ml-inference
  namespace: ray-system
spec:
  serviceUnhealthySecondThreshold: 300
  deploymentUnhealthySecondThreshold: 300
  
  serveConfigV2: |
    applications:
    - name: text-classifier
      import_path: serve_app:text_classifier
      route_prefix: /classify
      deployments:
      - name: TextClassifier
        num_replicas: 2
        ray_actor_options:
          num_gpus: 0.5
          num_cpus: 2
    
    - name: embedding-model
      import_path: serve_app:embedding_model
      route_prefix: /embed
      deployments:
      - name: EmbeddingModel
        num_replicas: 1
        ray_actor_options:
          num_gpus: 1
          num_cpus: 4
 
  rayClusterConfig:
    rayVersion: "2.10.0"
    headGroupSpec:
      # ... same as RayCluster above
    workerGroupSpecs:
    - groupName: gpu-workers
      replicas: 2
      # ... same as RayCluster above

Step 5: Expose the Service

yaml
apiVersion: v1
kind: Service
metadata:
  name: ray-serve-svc
  namespace: ray-system
spec:
  selector:
    ray.io/node-type: head
  ports:
  - name: serve
    port: 8000
    targetPort: 8000
  - name: dashboard
    port: 8265
    targetPort: 8265
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ray-serve-ingress
  namespace: ray-system
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
  - host: inference.your-domain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ray-serve-svc
            port:
              number: 8000

Test Your Models

bash
# Port-forward for testing
kubectl port-forward svc/ray-serve-svc -n ray-system 8000:8000
 
# Test text classification
curl http://localhost:8000/classify \
  -H "Content-Type: application/json" \
  -d '{"text": "Kubernetes is the best orchestration platform"}'
 
# Test embeddings
curl http://localhost:8000/embed \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Hello world", "Kubernetes is great"]}'
 
# Access Ray Dashboard
kubectl port-forward svc/ray-serve-svc -n ray-system 8265:8265
# Open http://localhost:8265

Autoscaling Ray Serve Deployments

python
@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_num_ongoing_requests_per_replica": 5,
        "upscale_delay_s": 30,
        "downscale_delay_s": 300,
    },
    ray_actor_options={"num_gpus": 0.5},
)
class AutoScaledModel:
    # Ray Serve adds or removes replicas based on the average number of
    # ongoing requests per replica, compared against the target above
    ...

With this configuration, Ray Serve adds replicas once the average number of ongoing requests per replica exceeds 5 (after the 30-second upscale delay) and removes them after 5 minutes of sustained low traffic.
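
To see the autoscaler react, generate sustained load through the port-forward from the test step; a plain curl loop is enough, and replica counts are visible in the Ray Dashboard:

bash
# Fire 500 concurrent classification requests
for i in $(seq 1 500); do
  curl -s -o /dev/null http://localhost:8000/classify \
    -H "Content-Type: application/json" \
    -d '{"text": "load test"}' &
done
wait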


What you get:

  • Multiple ML models serving on the same GPU cluster
  • Automatic request batching (2–10x throughput improvement)
  • Auto-scaling based on request load
  • Zero-downtime model updates via RayService rolling updates
  • Ray Dashboard for monitoring latency, throughput, and replica health