Run vLLM on Kubernetes for Fast LLM Inference (2026)
vLLM is one of the fastest open-source LLM inference engines, serving models up to 24x faster than naive HuggingFace generate() thanks to PagedAttention, a KV-cache memory manager that eliminates fragmentation waste. Here's how to deploy it on Kubernetes with GPU nodes, expose an OpenAI-compatible API, and scale it.
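To see why KV-cache management matters, here is a back-of-envelope calculation. The model dimensions are assumptions for a Llama-2-7B-class model with full multi-head attention (32 layers, hidden size 4096, fp16); PagedAttention allocates this cache in small fixed-size blocks on demand instead of reserving the full context length per request.

```python
# Rough KV-cache size for a 7B-class model with full multi-head attention.
# Dimensions are assumptions, not exact Mistral 7B values (Mistral uses GQA,
# which shrinks the KV cache considerably).
def kv_cache_bytes_per_token(n_layers=32, hidden=4096, dtype_bytes=2):
    # K and V each store one hidden-size vector per layer per token
    return 2 * n_layers * hidden * dtype_bytes

per_token = kv_cache_bytes_per_token()   # 524288 bytes = 512 KiB per token
per_seq_gib = per_token * 4096 / 2**30   # ~2 GiB for one full 4096-token context
```

Reserving 2 GiB up front for every request is what naive serving does; paging that cache is where the throughput gain comes from.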
Why vLLM?
| Method | Throughput (tokens/s) | Memory efficiency |
|---|---|---|
| HuggingFace generate() | ~200 | Poor (KV cache fragmented) |
| Text Generation Inference (TGI) | ~800 | Good |
| vLLM | ~2400 | Excellent (PagedAttention) |
vLLM also provides an OpenAI-compatible API out of the box — drop-in replacement, no code changes.
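Because the API is OpenAI-compatible, any OpenAI client works by changing only the base URL. A minimal sketch using just the standard library; the URL assumes the port-forward from Step 5, and the helper name `chat_request` is mine:

```python
import json
import urllib.request

# Assumes `kubectl port-forward svc/vllm-service 8000:8000 -n vllm` is running
BASE_URL = "http://localhost:8000/v1"

def chat_request(prompt: str,
                 model: str = "mistralai/Mistral-7B-Instruct-v0.2",
                 max_tokens: int = 256) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the vLLM server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# To actually send (needs the running service):
# with urllib.request.urlopen(chat_request("Hello")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The official openai SDK works the same way: point `base_url` at the service and pass any non-empty `api_key`.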
Architecture
Internet/Apps
↓
Kubernetes Service (LoadBalancer)
↓
vLLM Pod (GPU node)
- vllm serve
- OpenAI-compatible API on :8000
- PagedAttention KV cache
↓
Model (from HuggingFace Hub or PVC)
Step 1: GPU Node Setup on EKS
# Create EKS cluster with GPU node group
eksctl create cluster --name llm-cluster --region us-east-1
# Add GPU node group (g4dn.xlarge = 1x T4 GPU, 16 GiB VRAM)
eksctl create nodegroup \
--cluster llm-cluster \
--name gpu-nodes \
--node-type g4dn.xlarge \
--nodes 1 \
--nodes-min 1 \
--nodes-max 3 \
--node-ami-family AmazonLinux2
# Install NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
# Verify GPU is visible
kubectl get nodes   # find the GPU node's name
kubectl describe node <gpu-node> | grep nvidia.com/gpu
Step 2: Persistent Volume for Model Storage
Models are large (7B model ≈ 14 GB). Mount a PVC to cache downloads:
# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-cache
namespace: vllm
spec:
accessModes:
- ReadWriteOnce
storageClassName: gp3
resources:
requests:
storage: 50Gi
kubectl create namespace vllm
kubectl apply -f pvc.yaml
Step 3: vLLM Deployment
For Mistral 7B: fp16 weights alone need ~14 GB of VRAM, a tight fit on a 16 GB T4 (g4dn); a g5.xlarge (24 GB A10G) leaves comfortable headroom for the KV cache.
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-mistral
namespace: vllm
spec:
replicas: 1
selector:
matchLabels:
app: vllm-mistral
template:
metadata:
labels:
app: vllm-mistral
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest  # pin a specific tag in production for reproducible deploys
command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
args:
- "--model"
- "mistralai/Mistral-7B-Instruct-v0.2"
- "--host"
- "0.0.0.0"
- "--port"
- "8000"
- "--max-model-len"
- "4096"
- "--tensor-parallel-size"
- "1"
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
- name: HF_HOME  # TRANSFORMERS_CACHE is deprecated in recent transformers releases
value: /model-cache
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: "1"
memory: "24Gi"
cpu: "4"
requests:
nvidia.com/gpu: "1"
memory: "16Gi"
cpu: "2"
volumeMounts:
- name: model-cache
mountPath: /model-cache
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-cache
# Create HuggingFace token secret
kubectl create secret generic hf-token \
--from-literal=token=hf_your_token_here \
-n vllm
kubectl apply -f vllm-deployment.yaml
Step 4: Service
# service.yaml
apiVersion: v1
kind: Service
metadata:
name: vllm-service
namespace: vllm
spec:
selector:
app: vllm-mistral
ports:
- port: 8000
targetPort: 8000
type: ClusterIP # Use LoadBalancer for external access
Step 5: Test the OpenAI-Compatible API
# Port-forward for testing
kubectl port-forward svc/vllm-service 8000:8000 -n vllm
# List available models
curl http://localhost:8000/v1/models
# Chat completion (same format as OpenAI)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.2",
"messages": [
{"role": "user", "content": "Write a Kubernetes HPA YAML for a web app"}
],
"max_tokens": 512,
"temperature": 0.7
}'
Step 6: Use Smaller Models (CPU-only for testing)
For development/testing without a GPU. Note: the default vllm/vllm-openai image is built for CUDA; --device cpu requires a CPU build of vLLM.
args:
- "--model"
- "microsoft/phi-2" # 2.7B, runs on CPU (slowly)
- "--device"
- "cpu"
- "--max-model-len"
- "2048"
# Remove GPU resource limits
Step 7: Horizontal Pod Autoscaler
Kubernetes' built-in HPA can't see GPU metrics or queue length, so start with CPU utilization as a rough proxy:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-hpa
namespace: vllm
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm-mistral
minReplicas: 1
maxReplicas: 4
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
For production, use KEDA with a custom metric (pending requests queue).
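A sketch of the signal such a KEDA scaler would act on: vLLM publishes queue depth on its Prometheus /metrics endpoint. The gauge name vllm:num_requests_waiting is taken from current vLLM releases; verify it against your version.

```python
# Parse a gauge out of Prometheus text-format output, as a KEDA Prometheus
# scaler would effectively do when scaling on vllm:num_requests_waiting.
def parse_gauge(metrics_text: str, name: str):
    """Return the value of a named gauge, handling optional {label} sets."""
    for line in metrics_text.splitlines():
        if line.startswith(name + " ") or line.startswith(name + "{"):
            return float(line.rsplit(" ", 1)[1])
    return None

# Sample output in the shape vLLM's /metrics endpoint produces
sample = (
    "vllm:num_requests_running 3.0\n"
    'vllm:num_requests_waiting{model_name="mistral"} 12.0\n'
)
pending = parse_gauge(sample, "vllm:num_requests_waiting")
```

Scaling on pending requests adds replicas when the queue grows, which tracks real load far better than CPU utilization on a GPU-bound workload.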
Step 8: Multi-GPU Tensor Parallelism
For large models (Llama 70B) that don't fit on one GPU:
args:
- "--model"
- "meta-llama/Meta-Llama-3-70B-Instruct"
- "--tensor-parallel-size"
- "4" # Split across 4 GPUs
resources:
limits:
nvidia.com/gpu: "4"
Use p4d.24xlarge (8x A100 40 GB) on EKS. A 70B model needs ~140 GB for fp16 weights alone, so 4x V100 16 GB (p3.8xlarge) cannot hold it; on a p4d, setting --tensor-parallel-size to 8 spreads the weights across all GPUs and leaves room for KV cache.
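A rough sizing rule for picking the tensor-parallel degree (weights only; KV cache and activations need extra headroom on top):

```python
def weight_mem_per_gpu_gb(params_billion, tp_size, bytes_per_param=2):
    """fp16/bf16 model weights split evenly across tensor-parallel ranks.
    Weights only: KV cache and activations need additional VRAM headroom."""
    return params_billion * bytes_per_param / tp_size

weight_mem_per_gpu_gb(7, 1)    # 14.0 GB — matches the ~14 GB figure for a 7B model
weight_mem_per_gpu_gb(70, 4)   # 35.0 GB — tight on a 40 GB A100 once KV cache is added
weight_mem_per_gpu_gb(70, 8)   # 17.5 GB — comfortable across all 8 GPUs of a p4d
```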
Cost Estimates (AWS, 2026)
| Instance | GPU | VRAM | Models | Spot cost/hr |
|---|---|---|---|---|
| g4dn.xlarge | 1x T4 | 16GB | 7B models | ~$0.16 |
| g4dn.2xlarge | 1x T4 | 16GB | 13B (quantized) | ~$0.30 |
| g5.xlarge | 1x A10G | 24GB | 13B, small 70B | ~$0.41 |
| p3.2xlarge | 1x V100 | 16GB | 7B-13B | ~$0.45 |
| p4d.24xlarge | 8x A100 | 320GB | 70B+ | ~$10 |
Cost saving tip: Use Spot instances. With a readiness probe and more than one replica, Kubernetes reschedules vLLM pods automatically when a node is reclaimed; in-flight requests on that node are lost, so clients should retry.
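Per-token cost follows from instance price divided by sustained throughput. The helper below is a sketch; real cost depends heavily on batch load and utilization, since an idle GPU still bills:

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """Serving cost per 1M tokens at a given sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd * 1_000_000 / tokens_per_hour

# g4dn.xlarge spot (~$0.16/hr) at the ~2400 tok/s benchmark figure above
estimate = round(cost_per_million_tokens(0.16, 2400), 4)  # ~$0.0185 per 1M tokens
```

Hitting the lowest per-token numbers requires keeping the GPU saturated with batched requests; at low utilization, the fixed hourly cost dominates.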
Compared to OpenAI API
| | OpenAI API (gpt-4o) | vLLM (Mistral 7B on g4dn) |
|---|---|---|
| Cost | $5/1M input tokens | ~$0.003/1M tokens |
| Latency | 200-800ms | 100-400ms |
| Privacy | Data sent to OpenAI | Fully private |
| Control | None | Full |
| Quality | Better (larger model) | Good for most tasks |
At sustained high utilization, self-hosted vLLM can be orders of magnitude cheaper than the OpenAI API.
Resources
- AWS GPU Instances — g4dn spot instances for affordable GPU compute
- DigitalOcean GPU Droplets — simpler alternative for smaller scale
- vLLM Documentation — official deployment guide
vLLM on Kubernetes gives you OpenAI-grade inference at a fraction of the cost, with full data privacy and control.