
Run vLLM on Kubernetes for Fast LLM Inference (2026)

vLLM is one of the fastest open-source LLM inference engines. Here's how to deploy it on Kubernetes with GPU nodes, expose an OpenAI-compatible API, and scale it.

DevOpsBoys · Apr 11, 2026 · 4 min read

vLLM serves LLMs with up to 24x the throughput of naive HuggingFace Transformers inference by using PagedAttention, a KV-cache memory manager that all but eliminates memory waste. Here's how to run it on Kubernetes with GPU nodes and expose an OpenAI-compatible API.
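The PagedAttention idea can be sketched with a toy allocator (illustrative only, not vLLM's internals): instead of reserving one contiguous max-length KV buffer per sequence, the cache is split into fixed-size blocks that are handed out only as tokens are generated.

```python
# Toy sketch of paged KV-cache allocation. Illustrative only -- not vLLM code.
# Naive serving reserves max_len slots per sequence up front; paged allocation
# hands out fixed-size blocks on demand, so short sequences waste almost nothing.

BLOCK_SIZE = 16  # tokens per KV-cache block (hypothetical block size)

def naive_slots(num_seqs: int, max_len: int) -> int:
    """Contiguous preallocation: every sequence reserves max_len slots."""
    return num_seqs * max_len

def paged_slots(seq_lens: list[int]) -> int:
    """Paged allocation: each sequence holds ceil(len / BLOCK_SIZE) blocks."""
    blocks = sum((length + BLOCK_SIZE - 1) // BLOCK_SIZE for length in seq_lens)
    return blocks * BLOCK_SIZE

seq_lens = [37, 120, 501, 64]                    # actual generated lengths
naive = naive_slots(len(seq_lens), max_len=2048)
paged = paged_slots(seq_lens)
print(naive, paged)  # 8192 vs 752 -> most of the naive reservation is waste
```

The ratio between those two numbers is where the headline throughput gains come from: freed cache space turns directly into larger batches.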

Why vLLM?

| Method | Throughput (tokens/s) | Memory efficiency |
|---|---|---|
| HuggingFace `generate()` | ~200 | Poor (KV cache fragmented) |
| Text Generation Inference (TGI) | ~800 | Good |
| vLLM | ~2400 | Excellent (PagedAttention) |

vLLM also provides an OpenAI-compatible API out of the box — drop-in replacement, no code changes.


Architecture

Internet/Apps
     ↓
Kubernetes Service (LoadBalancer)
     ↓
vLLM Pod (GPU node)
  - vllm serve
  - OpenAI-compatible API on :8000
  - PagedAttention KV cache
     ↓
Model (from HuggingFace Hub or PVC)

Step 1: GPU Node Setup on EKS

bash
# Create EKS cluster with GPU node group
eksctl create cluster --name llm-cluster --region us-east-1
 
# Add GPU node group (g4dn.xlarge = 1x T4 GPU, 16 GiB VRAM)
eksctl create nodegroup \
  --cluster llm-cluster \
  --name gpu-nodes \
  --node-type g4dn.xlarge \
  --nodes 1 \
  --nodes-min 1 \
  --nodes-max 3 \
  --node-ami-family AmazonLinux2
 
# Install NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
 
# Verify GPU is visible
kubectl get nodes -l node.kubernetes.io/instance-type=g4dn.xlarge
kubectl describe node <gpu-node> | grep nvidia.com/gpu

Step 2: Persistent Volume for Model Storage

Models are large (7B model ≈ 14 GB). Mount a PVC to cache downloads:

yaml
# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: vllm
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 50Gi
bash
kubectl create namespace vllm
kubectl apply -f pvc.yaml
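The "7B ≈ 14 GB" rule of thumb above is just parameter count times bytes per parameter; a quick back-of-envelope check:

```python
# Back-of-envelope weight memory: params * bytes_per_param (fp16 = 2 bytes).
# KV cache and activations come on top, so always leave headroom on the GPU.

def weight_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate model weight size in GB for a given precision."""
    return params_billions * 1e9 * bytes_per_param / 1e9

print(weight_gb(7))    # 14.0  -> Mistral 7B in fp16
print(weight_gb(70))   # 140.0 -> Llama 70B in fp16
print(weight_gb(7, bytes_per_param=1))  # 7.0 -> same model 8-bit quantized
```

This is also why the 50Gi PVC is comfortable for one 7B model plus tokenizer files, but would need resizing for anything much larger.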

Step 3: vLLM Deployment

For Mistral 7B (~14 GB of VRAM in fp16), which fits the 16 GB T4 on g4dn instances or the 24 GB A10G on g5.xlarge:

yaml
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-mistral
  namespace: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-mistral
  template:
    metadata:
      labels:
        app: vllm-mistral
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
        args:
          - "--model"
          - "mistralai/Mistral-7B-Instruct-v0.2"
          - "--host"
          - "0.0.0.0"
          - "--port"
          - "8000"
          - "--max-model-len"
          - "4096"
          - "--tensor-parallel-size"
          - "1"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        - name: HF_HOME
          value: /model-cache
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "24Gi"
            cpu: "4"
          requests:
            nvidia.com/gpu: "1"
            memory: "16Gi"
            cpu: "2"
        volumeMounts:
        - name: model-cache
          mountPath: /model-cache
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache
bash
# Create HuggingFace token secret
kubectl create secret generic hf-token \
  --from-literal=token=hf_your_token_here \
  -n vllm
 
kubectl apply -f vllm-deployment.yaml

Step 4: Service

yaml
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: vllm
spec:
  selector:
    app: vllm-mistral
  ports:
  - port: 8000
    targetPort: 8000
  type: ClusterIP   # Use LoadBalancer for external access

Step 5: Test the OpenAI-Compatible API

bash
# Port-forward for testing
kubectl port-forward svc/vllm-service 8000:8000 -n vllm
 
# List available models
curl http://localhost:8000/v1/models
 
# Chat completion (same format as OpenAI)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [
      {"role": "user", "content": "Write a Kubernetes HPA YAML for a web app"}
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'
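The same call works from any OpenAI-compatible client; a standard-library sketch that builds the identical payload (the URL assumes the port-forward above, and `send` needs the cluster actually running):

```python
import json
import urllib.request

# Endpoint assumes `kubectl port-forward svc/vllm-service 8000:8000 -n vllm`
VLLM_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str, max_tokens: int = 512) -> dict:
    """Build a chat-completions payload in the exact OpenAI API shape."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

def send(payload: dict) -> dict:
    """POST the payload to the vLLM server (requires a live deployment)."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("mistralai/Mistral-7B-Instruct-v0.2",
                             "Write a Kubernetes HPA YAML for a web app")
print(payload["messages"][0]["role"])  # user
```

Because the request shape is identical, pointing an existing OpenAI SDK at this base URL is usually the only change needed.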

Step 6: Use Smaller Models (CPU-only for testing)

For development/testing without a GPU. Note that the standard vllm/vllm-openai image is built for CUDA; CPU inference requires vLLM's CPU build:

yaml
args:
  - "--model"
  - "microsoft/phi-2"       # 2.7B, runs on CPU (slowly)
  - "--device"
  - "cpu"
  - "--max-model-len"
  - "2048"
# Remove GPU resource limits

Step 7: Horizontal Pod Autoscaler

Scale based on GPU memory or queue length:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: vllm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-mistral
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

For production, use KEDA with a custom metric (pending requests queue).
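The HPA's core scaling rule is simple arithmetic: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A sketch of that rule with the bounds from the manifest above:

```python
import math

def desired_replicas(current: int, current_util: float, target_util: float,
                     min_r: int = 1, max_r: int = 4) -> int:
    """Core HPA rule: scale proportionally to metric/target, then clamp.
    (Real HPA adds tolerance bands and stabilization windows on top.)"""
    desired = math.ceil(current * current_util / target_util)
    return max(min_r, min(max_r, desired))

print(desired_replicas(1, 90, 70))   # 2 -> CPU at 90% vs a 70% target
print(desired_replicas(2, 140, 70))  # 4
print(desired_replicas(4, 10, 70))   # 1 -> scales back down to the floor
```

This also shows why CPU is a weak proxy here: a GPU-bound vLLM pod can be saturated while CPU sits low, which is exactly the gap a KEDA queue-length metric closes.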


Step 8: Multi-GPU Tensor Parallelism

For large models (Llama 70B) that don't fit on one GPU:

yaml
args:
  - "--model"
  - "meta-llama/Meta-Llama-3-70B-Instruct"
  - "--tensor-parallel-size"
  - "4"   # Split across 4 GPUs
resources:
  limits:
    nvidia.com/gpu: "4"

Use p4d.24xlarge (8x A100, 40 GB each) on EKS. A 70B model in fp16 needs ~140 GB for weights alone, so 4x V100 16 GB (p3.8xlarge, 64 GB total) won't fit without aggressive quantization.
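Tensor parallelism shards each weight matrix across GPUs, so per-GPU weight memory is roughly the total weight size divided by `--tensor-parallel-size` (KV cache and overhead come on top):

```python
def per_gpu_weight_gb(params_billions: float, tp_size: int,
                      bytes_per_param: int = 2) -> float:
    """Approximate per-GPU share of model weights under tensor parallelism."""
    return params_billions * 1e9 * bytes_per_param / 1e9 / tp_size

print(per_gpu_weight_gb(70, 4))  # 35.0 -> tight on 40GB A100s, no fit on 16GB V100s
print(per_gpu_weight_gb(70, 8))  # 17.5 -> comfortable headroom for KV cache
```

This is why `--tensor-parallel-size 8` is usually the safer setting for 70B-class models on p4d.24xlarge: the extra headroom goes straight into KV cache and batch size.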


Cost Estimates (AWS, 2026)

| Instance | GPU | VRAM | Models | Spot cost/hr |
|---|---|---|---|---|
| g4dn.xlarge | 1x T4 | 16GB | 7B models | ~$0.16 |
| g4dn.2xlarge | 1x T4 | 16GB | 13B (quantized) | ~$0.30 |
| g5.xlarge | 1x A10G | 24GB | 13B, small 70B | ~$0.41 |
| p3.2xlarge | 1x V100 | 16GB | 7B-13B | ~$0.45 |
| p4d.24xlarge | 8x A100 | 320GB | 70B+ | ~$10 |

Cost saving tip: Use Spot instances. With multiple replicas and readiness probes, traffic shifts away from interrupted pods; just expect a cold start while the replacement pod reloads the model.
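Dollars per million tokens is just hourly instance cost divided by sustained throughput. The throughput figure below is a hypothetical aggregate under continuous batching; plug in your own measured numbers:

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """$ per 1M generated tokens at a sustained aggregate throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / tokens_per_hour * 1e6

# g4dn.xlarge spot at ~$0.16/hr, assuming ~2000 tok/s aggregate (hypothetical)
print(round(cost_per_million_tokens(0.16, 2000), 4))  # 0.0222
```

Real per-token cost is dominated by utilization: an idle GPU still bills by the hour, so batching many concurrent requests onto one replica is what makes these numbers small.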


Compared to OpenAI API

| | OpenAI API (gpt-4o) | vLLM (Mistral 7B on g4dn) |
|---|---|---|
| Cost | $5/1M input tokens | ~$0.003/1M tokens |
| Latency | 200-800ms | 100-400ms |
| Privacy | Data sent to OpenAI | Fully private |
| Control | None | Full |
| Quality | Better (larger model) | Good for most tasks |

At scale, self-hosted vLLM can be orders of magnitude cheaper than the OpenAI API, depending on utilization.



vLLM on Kubernetes gives you OpenAI-grade inference at a fraction of the cost, with full data privacy and control.
