Set Up Ray Serve on Kubernetes for ML Model Inference (2026)
Ray Serve handles batching, autoscaling, model composition, and GPU sharing for ML models served at scale on Kubernetes. Complete setup guide.
Ray Serve is a framework for serving ML models in production. Unlike FastAPI + Uvicorn, Ray Serve handles batching, auto-scaling, model composition, and GPU sharing out of the box. Here's how to deploy it on Kubernetes.
Why Ray Serve Over Plain FastAPI?
| Feature | FastAPI | Ray Serve |
|---|---|---|
| Request batching | Manual | ✅ Automatic |
| GPU sharing between models | ❌ | ✅ |
| Multi-model pipelines | Manual | ✅ Native |
| Auto-scaling replicas | With extra config | ✅ Built-in |
| Zero-downtime model updates | Complex | ✅ Native |
| Fractional GPU allocation | ❌ | ✅ |
For simple single-model serving, FastAPI works fine. For production with multiple models, batching requirements, and cost-sensitive GPU usage — Ray Serve is worth it.
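To make the comparison concrete, here is a minimal sketch of multi-model composition in Ray Serve: an ingress deployment fans out to two other deployments through handles, and each deployment declares its own resources and replicas. The class names and route are hypothetical stand-ins, and the await-on-handle pattern assumes the DeploymentHandle API in recent Ray 2.x releases.

# pipeline_sketch.py (illustrative only, not part of the setup below)
from ray import serve
from ray.serve.handle import DeploymentHandle
from starlette.requests import Request


@serve.deployment(ray_actor_options={"num_cpus": 1})
class Preprocessor:
    def clean(self, text: str) -> str:
        return text.strip().lower()


@serve.deployment(ray_actor_options={"num_cpus": 1})
class SentimentModel:
    def predict(self, text: str) -> dict:
        # Stand-in for a real model call
        return {"label": "POSITIVE" if "great" in text else "NEGATIVE"}


@serve.deployment
class Pipeline:
    def __init__(self, preprocessor: DeploymentHandle, model: DeploymentHandle):
        self.preprocessor = preprocessor
        self.model = model

    async def __call__(self, request: Request) -> dict:
        data = await request.json()
        cleaned = await self.preprocessor.clean.remote(data["text"])
        return await self.model.predict.remote(cleaned)


# Each deployment scales and is scheduled independently; the whole graph
# is served behind a single HTTP route.
app = Pipeline.bind(Preprocessor.bind(), SentimentModel.bind())
# serve.run(app, route_prefix="/sentiment")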
Step 1: Install KubeRay Operator
KubeRay manages Ray clusters on Kubernetes:
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator \
--namespace ray-system \
--create-namespace \
  --version 1.1.0

Step 2: Deploy a RayCluster
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ml-inference-cluster
  namespace: ray-system
spec:
  rayVersion: "2.10.0"
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
      num-cpus: "0"  # head node handles scheduling only
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray-ml:2.10.0-gpu
            ports:
              - containerPort: 6379  # GCS server
              - containerPort: 8265  # Dashboard
              - containerPort: 8000  # Ray Serve
            resources:
              requests:
                cpu: "2"
                memory: "4Gi"
              limits:
                cpu: "4"
                memory: "8Gi"
  workerGroupSpecs:
    - groupName: gpu-workers
      replicas: 2
      minReplicas: 1
      maxReplicas: 5
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray-ml:2.10.0-gpu
              resources:
                requests:
                  cpu: "4"
                  memory: "16Gi"
                  nvidia.com/gpu: "1"
                limits:
                  cpu: "8"
                  memory: "32Gi"
                  nvidia.com/gpu: "1"
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule

kubectl apply -f ray-cluster.yaml
# Wait for cluster to be ready
kubectl get raycluster -n ray-system
kubectl get pods -n ray-system

Step 3: Write a Ray Serve Application
# serve_app.py
import ray
from ray import serve
from transformers import pipeline
import torch
from typing import List

# Request batching is configured on predict_batch below via @serve.batch,
# not on the deployment decorator.
@serve.deployment(
    num_replicas=2,
    ray_actor_options={
        "num_gpus": 0.5,  # fractional GPU, so two replicas fit on one GPU
        "num_cpus": 2,
    },
)
class TextClassifier:
    def __init__(self):
        self.model = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=0 if torch.cuda.is_available() else -1,
        )

    # Batch up to 32 requests, waiting at most 50 ms to form a batch
    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.05)
    async def predict_batch(self, texts: List[str]) -> List[dict]:
        # Ray Serve collects the single texts from concurrent callers into one list
        return self.model(texts, truncation=True, max_length=512)

    async def __call__(self, request):
        payload = await request.json()
        # Pass one text; @serve.batch returns this caller's single result
        return await self.predict_batch(payload["text"])
@serve.deployment(
    num_replicas=1,
    ray_actor_options={"num_gpus": 1, "num_cpus": 4},
)
class EmbeddingModel:
    def __init__(self):
        from sentence_transformers import SentenceTransformer

        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    async def __call__(self, request):
        data = await request.json()
        embeddings = self.model.encode(data["texts"])
        return {"embeddings": embeddings.tolist()}
# Bind the applications at module level; the RayService in Step 4 imports these names
text_classifier = TextClassifier.bind()
embedding_model = EmbeddingModel.bind()

if __name__ == "__main__":  # manual deployment; each app needs a distinct name
    ray.init(address="auto")
    serve.run(text_classifier, name="text-classifier", route_prefix="/classify")
    serve.run(embedding_model, name="embedding-model", route_prefix="/embed")

Step 4: Deploy the Application as a RayService
Instead of running the script manually, use KubeRay's RayService CRD, which deploys the Serve applications declaratively. Note that serve_app.py must be importable inside the Ray containers: either bake it into the image or ship it with a runtime_env working directory in the Serve config.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ml-inference
  namespace: ray-system
spec:
  serviceUnhealthySecondThreshold: 300
  deploymentUnhealthySecondThreshold: 300
  serveConfigV2: |
    applications:
      - name: text-classifier
        import_path: serve_app:text_classifier
        route_prefix: /classify
        deployments:
          - name: TextClassifier
            num_replicas: 2
            ray_actor_options:
              num_gpus: 0.5
              num_cpus: 2
      - name: embedding-model
        import_path: serve_app:embedding_model
        route_prefix: /embed
        deployments:
          - name: EmbeddingModel
            num_replicas: 1
            ray_actor_options:
              num_gpus: 1
              num_cpus: 4
  rayClusterConfig:
    rayVersion: "2.10.0"
    headGroupSpec:
      # ... same as RayCluster above
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 2
        # ... same as RayCluster above

Step 5: Expose the Service
apiVersion: v1
kind: Service
metadata:
  name: ray-serve-svc
  namespace: ray-system
spec:
  selector:
    ray.io/node-type: head
  ports:
    - name: serve
      port: 8000
      targetPort: 8000
    - name: dashboard
      port: 8265
      targetPort: 8265
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ray-serve-ingress
  namespace: ray-system
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
    - host: inference.your-domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ray-serve-svc
                port:
                  number: 8000

Test Your Models
# Port-forward for testing
kubectl port-forward svc/ray-serve-svc -n ray-system 8000:8000
# Test text classification
curl http://localhost:8000/classify \
-H "Content-Type: application/json" \
-d '{"text": "Kubernetes is the best orchestration platform"}'
# Test embeddings
curl http://localhost:8000/embed \
-H "Content-Type: application/json" \
-d '{"texts": ["Hello world", "Kubernetes is great"]}'
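The same checks can be scripted from Python. A minimal sketch, assuming the port-forward above is active on localhost:8000 (test_client.py is a hypothetical helper name):

# test_client.py
import requests

resp = requests.post(
    "http://localhost:8000/classify",
    json={"text": "Kubernetes is the best orchestration platform"},
    timeout=30,
)
print(resp.json())  # e.g. {"label": "POSITIVE", "score": ...}

resp = requests.post(
    "http://localhost:8000/embed",
    json={"texts": ["Hello world", "Kubernetes is great"]},
    timeout=30,
)
print(len(resp.json()["embeddings"]))  # 2 vectors, one per input text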
# Access Ray Dashboard
kubectl port-forward svc/ray-serve-svc -n ray-system 8265:8265
# Open http://localhost:8265

Autoscaling Ray Serve Deployments
@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_num_ongoing_requests_per_replica": 5,
        "upscale_delay_s": 30,
        "downscale_delay_s": 300,
    },
    ray_actor_options={"num_gpus": 0.5},
)
class AutoScaledModel:
    # Ray Serve scales replicas based on ongoing requests per replica
    ...

Ray Serve adds replicas when the average number of ongoing requests per replica exceeds 5 (after a 30-second upscale delay) and removes them after 5 minutes of low traffic, staying within the 1 to 10 replica bounds.
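To watch the autoscaler react, keep a burst of concurrent requests in flight and follow the replica count in the Ray Dashboard. A rough sketch, assuming the /classify route uses the autoscaling_config above and the port-forward from the testing step is active (load_test.py is a hypothetical name):

# load_test.py
import concurrent.futures

import requests


def call(i: int) -> int:
    resp = requests.post(
        "http://localhost:8000/classify",
        json={"text": f"sample request {i}"},
        timeout=60,
    )
    return resp.status_code


# Keep roughly 50 requests in flight; with a target of 5 ongoing requests per
# replica, the autoscaler should push the deployment toward max_replicas.
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    codes = list(pool.map(call, range(2000)))

print({code: codes.count(code) for code in set(codes)})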
What you get:
- Multiple ML models serving on the same GPU cluster
- Automatic request batching (2–10x throughput improvement)
- Auto-scaling based on request load
- Zero-downtime model updates via RayService rolling updates
- Ray Dashboard for monitoring latency, throughput, and replica health