Run vLLM on Kubernetes for Fast LLM Inference (2026)
vLLM is one of the fastest open-source LLM inference engines, serving models up to 24x faster than naive HuggingFace generate() thanks to PagedAttention, a KV-cache memory manager that eliminates fragmentation waste. Here's how to deploy it on Kubernetes with GPU nodes, expose an OpenAI-compatible API, and scale it.
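To see why KV-cache management matters, here is a back-of-envelope calculation. The model dimensions are assumptions for a Llama-2-7B-class model with full multi-head attention (32 layers, hidden size 4096, fp16); PagedAttention allocates this cache in small fixed-size blocks on demand instead of reserving the full context length per request.

```python
# Rough KV-cache size for a 7B-class model with full multi-head attention.
# Dimensions are assumptions, not exact Mistral 7B values (Mistral uses GQA,
# which shrinks the KV cache considerably).
def kv_cache_bytes_per_token(n_layers=32, hidden=4096, dtype_bytes=2):
    # K and V each store one hidden-size vector per layer per token
    return 2 * n_layers * hidden * dtype_bytes

per_token = kv_cache_bytes_per_token()   # 524288 bytes = 512 KiB per token
per_seq_gib = per_token * 4096 / 2**30   # ~2 GiB for one full 4096-token context
```

Reserving 2 GiB up front for every request is what naive serving does; paging that cache is where the throughput gain comes from.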
Why vLLM?
| Method | Throughput (tokens/s) | Memory efficiency |
|---|---|---|
| HuggingFace generate() | ~200 | Poor (KV cache fragmented) |
| Text Generation Inference (TGI) | ~800 | Good |
| vLLM | ~2400 | Excellent (PagedAttention) |
vLLM also provides an OpenAI-compatible API out of the box — drop-in replacement, no code changes.
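Because the API is OpenAI-compatible, any OpenAI client works by changing only the base URL. A minimal sketch using just the standard library; the URL assumes the port-forward from Step 5, and the helper name `chat_request` is mine:

```python
import json
import urllib.request

# Assumes `kubectl port-forward svc/vllm-service 8000:8000 -n vllm` is running
BASE_URL = "http://localhost:8000/v1"

def chat_request(prompt: str,
                 model: str = "mistralai/Mistral-7B-Instruct-v0.2",
                 max_tokens: int = 256) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the vLLM server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# To actually send (needs the running service):
# with urllib.request.urlopen(chat_request("Hello")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The official openai SDK works the same way: point `base_url` at the service and pass any non-empty `api_key`.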
Architecture
Internet/Apps
↓
Kubernetes Service (LoadBalancer)
↓
vLLM Pod (GPU node)
- vllm serve
- OpenAI-compatible API on :8000
- PagedAttention KV cache
↓
Model (from HuggingFace Hub or PVC)
Step 1: GPU Node Setup on EKS
# Create EKS cluster with GPU node group
eksctl create cluster --name llm-cluster --region us-east-1
# Add GPU node group (g4dn.xlarge = 1x T4 GPU, 16 GiB VRAM)
eksctl create nodegroup \
--cluster llm-cluster \
--name gpu-nodes \
--node-type g4dn.xlarge \
--nodes 1 \
--nodes-min 1 \
--nodes-max 3 \
--node-ami-family AmazonLinux2
# Install NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
# Verify GPU is visible
kubectl get nodes   # find the GPU node's name
kubectl describe node <gpu-node> | grep nvidia.com/gpu
Step 2: Persistent Volume for Model Storage
Models are large (7B model ≈ 14 GB). Mount a PVC to cache downloads:
# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-cache
namespace: vllm
spec:
accessModes:
- ReadWriteOnce
storageClassName: gp3
resources:
requests:
storage: 50Gi
kubectl create namespace vllm
kubectl apply -f pvc.yaml
Step 3: vLLM Deployment
For Mistral 7B: fp16 weights alone need ~14 GB of VRAM, a tight fit on a 16 GB T4 (g4dn); a g5.xlarge (24 GB A10G) leaves comfortable headroom for the KV cache.
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-mistral
namespace: vllm
spec:
replicas: 1
selector:
matchLabels:
app: vllm-mistral
template:
metadata:
labels:
app: vllm-mistral
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest  # pin a specific tag in production for reproducible deploys
command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
args:
- "--model"
- "mistralai/Mistral-7B-Instruct-v0.2"
- "--host"
- "0.0.0.0"
- "--port"
- "8000"
- "--max-model-len"
- "4096"
- "--tensor-parallel-size"
- "1"
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
- name: HF_HOME  # TRANSFORMERS_CACHE is deprecated in recent transformers releases
value: /model-cache
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: "1"
memory: "24Gi"
cpu: "4"
requests:
nvidia.com/gpu: "1"
memory: "16Gi"
cpu: "2"
volumeMounts:
- name: model-cache
mountPath: /model-cache
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-cache
# Create HuggingFace token secret
kubectl create secret generic hf-token \
--from-literal=token=hf_your_token_here \
-n vllm
kubectl apply -f vllm-deployment.yaml
Step 4: Service
# service.yaml
apiVersion: v1
kind: Service
metadata:
name: vllm-service
namespace: vllm
spec:
selector:
app: vllm-mistral
ports:
- port: 8000
targetPort: 8000
type: ClusterIP # Use LoadBalancer for external access
Step 5: Test the OpenAI-Compatible API
# Port-forward for testing
kubectl port-forward svc/vllm-service 8000:8000 -n vllm
# List available models
curl http://localhost:8000/v1/models
# Chat completion (same format as OpenAI)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.2",
"messages": [
{"role": "user", "content": "Write a Kubernetes HPA YAML for a web app"}
],
"max_tokens": 512,
"temperature": 0.7
}'
Step 6: Use Smaller Models (CPU-only for testing)
For development/testing without a GPU. Note: the default vllm/vllm-openai image is built for CUDA; --device cpu requires a CPU build of vLLM.
args:
- "--model"
- "microsoft/phi-2" # 2.7B, runs on CPU (slowly)
- "--device"
- "cpu"
- "--max-model-len"
- "2048"
# Remove GPU resource limits
Step 7: Horizontal Pod Autoscaler
Kubernetes' built-in HPA can't see GPU metrics or queue length, so start with CPU utilization as a rough proxy:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-hpa
namespace: vllm
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm-mistral
minReplicas: 1
maxReplicas: 4
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
For production, use KEDA with a custom metric (pending requests queue).
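A sketch of the signal such a KEDA scaler would act on: vLLM publishes queue depth on its Prometheus /metrics endpoint. The gauge name vllm:num_requests_waiting is taken from current vLLM releases; verify it against your version.

```python
# Parse a gauge out of Prometheus text-format output, as a KEDA Prometheus
# scaler would effectively do when scaling on vllm:num_requests_waiting.
def parse_gauge(metrics_text: str, name: str):
    """Return the value of a named gauge, handling optional {label} sets."""
    for line in metrics_text.splitlines():
        if line.startswith(name + " ") or line.startswith(name + "{"):
            return float(line.rsplit(" ", 1)[1])
    return None

# Sample output in the shape vLLM's /metrics endpoint produces
sample = (
    "vllm:num_requests_running 3.0\n"
    'vllm:num_requests_waiting{model_name="mistral"} 12.0\n'
)
pending = parse_gauge(sample, "vllm:num_requests_waiting")
```

Scaling on pending requests adds replicas when the queue grows, which tracks real load far better than CPU utilization on a GPU-bound workload.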
Step 8: Multi-GPU Tensor Parallelism
For large models (Llama 70B) that don't fit on one GPU:
args:
- "--model"
- "meta-llama/Meta-Llama-3-70B-Instruct"
- "--tensor-parallel-size"
- "4" # Split across 4 GPUs
resources:
limits:
nvidia.com/gpu: "4"
Use p4d.24xlarge (8x A100 40 GB) on EKS. A 70B model needs ~140 GB for fp16 weights alone, so 4x V100 16 GB (p3.8xlarge) cannot hold it; on a p4d, setting --tensor-parallel-size to 8 spreads the weights across all GPUs and leaves room for KV cache.
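A rough sizing rule for picking the tensor-parallel degree (weights only; KV cache and activations need extra headroom on top):

```python
def weight_mem_per_gpu_gb(params_billion, tp_size, bytes_per_param=2):
    """fp16/bf16 model weights split evenly across tensor-parallel ranks.
    Weights only: KV cache and activations need additional VRAM headroom."""
    return params_billion * bytes_per_param / tp_size

weight_mem_per_gpu_gb(7, 1)    # 14.0 GB — matches the ~14 GB figure for a 7B model
weight_mem_per_gpu_gb(70, 4)   # 35.0 GB — tight on a 40 GB A100 once KV cache is added
weight_mem_per_gpu_gb(70, 8)   # 17.5 GB — comfortable across all 8 GPUs of a p4d
```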
Cost Estimates (AWS, 2026)
| Instance | GPU | VRAM | Models | Spot cost/hr |
|---|---|---|---|---|
| g4dn.xlarge | 1x T4 | 16GB | 7B models | ~$0.16 |
| g4dn.2xlarge | 1x T4 | 16GB | 13B (quantized) | ~$0.30 |
| g5.xlarge | 1x A10G | 24GB | 13B, small 70B | ~$0.41 |
| p3.2xlarge | 1x V100 | 16GB | 7B-13B | ~$0.45 |
| p4d.24xlarge | 8x A100 | 320GB | 70B+ | ~$10 |
Cost saving tip: Use Spot instances. With a readiness probe and more than one replica, Kubernetes reschedules vLLM pods automatically when a node is reclaimed; in-flight requests on that node are lost, so clients should retry.
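Per-token cost follows from instance price divided by sustained throughput. The helper below is a sketch; real cost depends heavily on batch load and utilization, since an idle GPU still bills:

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """Serving cost per 1M tokens at a given sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd * 1_000_000 / tokens_per_hour

# g4dn.xlarge spot (~$0.16/hr) at the ~2400 tok/s benchmark figure above
estimate = round(cost_per_million_tokens(0.16, 2400), 4)  # ~$0.0185 per 1M tokens
```

Hitting the lowest per-token numbers requires keeping the GPU saturated with batched requests; at low utilization, the fixed hourly cost dominates.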
Compared to OpenAI API
| | OpenAI API (gpt-4o) | vLLM (Mistral 7B on g4dn) |
|---|---|---|
| Cost | $5/1M input tokens | ~$0.003/1M tokens |
| Latency | 200-800ms | 100-400ms |
| Privacy | Data sent to OpenAI | Fully private |
| Control | None | Full |
| Quality | Better (larger model) | Good for most tasks |
At sustained high utilization, self-hosted vLLM can be orders of magnitude cheaper than the OpenAI API.
Resources
- AWS GPU Instances — g4dn spot instances for affordable GPU compute
- DigitalOcean GPU Droplets — simpler alternative for smaller scale
- vLLM Documentation — official deployment guide
vLLM on Kubernetes gives you OpenAI-grade inference at a fraction of the cost, with full data privacy and control.