Deploy Qwen2.5-Coder on Kubernetes for Private Code AI (2026)
Run Alibaba's Qwen2.5-Coder LLM on your own Kubernetes cluster with GPU nodes. Complete guide — from GPU node setup to serving with vLLM and integrating with VS Code via Continue.dev.
Qwen2.5-Coder is one of the best open-source code generation models in 2026 — comparable to GPT-4o for coding tasks, running entirely on your own infrastructure. No code leaves your cluster, no API costs, no rate limits.
This guide deploys Qwen2.5-Coder-32B-Instruct on Kubernetes with vLLM and connects it to VS Code via Continue.dev.
Why Qwen2.5-Coder
- 32B parameter model — strong code completion, explanation, and generation
- 128K context window (via YaRN rope scaling; the shipped config defaults to 32K), large enough to hold substantial parts of a codebase
- Apache 2.0 license — commercial use allowed
- vLLM compatible — high-throughput GPU serving
- OpenAI API compatible — drop-in replacement, works with any OpenAI SDK client
The Qwen team's published benchmarks show Qwen2.5-Coder-32B-Instruct scoring on par with or slightly above GPT-4o on the HumanEval and MBPP coding benchmarks.
Prerequisites
- Kubernetes cluster with GPU nodes (NVIDIA A100/H100/L40S or consumer RTX 4090)
- NVIDIA GPU Operator installed
- StorageClass with at least 100GB available
- kubectl and helm configured
Minimum GPU requirements:
- Qwen2.5-Coder-7B: 1x A10G (24GB VRAM)
- Qwen2.5-Coder-14B: 1x A100 (40GB) or 2x 3090
- Qwen2.5-Coder-32B: 2x A100 (80GB) or 4x A10G
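The GPU sizes above follow from a rough rule of thumb: weight memory is parameter count times bytes per parameter, and the KV cache then grows with context length. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes bf16 weights at 2 bytes per parameter and approximate parameter counts from the model cards (7.6B, 14.7B, 32.5B). For the KV cache it assumes architecture values for the 32B model (64 layers, 8 KV heads, head dimension 128), which should be checked against the model's config.json. The function names are illustrative.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB (10^9 bytes); bf16/fp16 use 2 bytes per parameter."""
    return params_billion * bytes_per_param


def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: keys and values (factor 2) for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int) -> float:
    """KV cache size in GB for a given number of cached tokens."""
    return tokens * kv_cache_bytes_per_token(layers, kv_heads, head_dim) / 1e9


if __name__ == "__main__":
    # Parameter counts are approximate values from the model cards (assumption).
    for name, billions in [("7B", 7.6), ("14B", 14.7), ("32B", 32.5)]:
        print(f"Qwen2.5-Coder-{name}: ~{weight_memory_gb(billions):.0f} GB of bf16 weights")
    # Architecture values are assumed for the 32B model; verify in config.json.
    print(f"KV cache, one 64K-token sequence (32B): ~{kv_cache_gb(65536, 64, 8, 128):.1f} GB")
```

Under these assumptions the 32B model needs about 65 GB for weights alone, so it has to be split across several GPUs, and whatever VRAM remains after loading the weights limits how much context can be cached.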
Step 1: Create GPU Node Pool
AWS EKS (Karpenter)
# karpenter-gpu-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-qwen
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["p4d", "p3", "g5"] # A100 (p4d), V100 (p3), or A10G (g5); V100 has no bfloat16 support
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodes
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
  limits:
    nvidia.com/gpu: 8
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-nodes
spec:
  amiFamily: AL2
  amiSelectorTerms:
    - alias: al2@latest # assumption: Karpenter v1 requires amiSelectorTerms; pin a version in production
  instanceProfile: KarpenterNodeInstanceProfile
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
Step 2: Create Namespace and Storage
kubectl create namespace llm
# model-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen-model-storage
  namespace: llm
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 120Gi # ~65GB for 32B model weights + buffer
  storageClassName: gp3 # AWS EBS gp3
kubectl apply -f model-pvc.yaml
Step 3: Download Model Weights (Init Job)
# model-downloader.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: download-qwen-model
  namespace: llm
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: downloader
          image: python:3.11-slim
          command:
            - /bin/sh
            - -c
            - |
              pip install huggingface_hub -q
              python3 -c "
              from huggingface_hub import snapshot_download
              snapshot_download(
                  repo_id='Qwen/Qwen2.5-Coder-32B-Instruct',
                  local_dir='/models/qwen2.5-coder-32b',
                  ignore_patterns=['*.msgpack', '*.h5', 'original/*']
              )
              print('Download complete')
              "
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-credentials
                  key: token
          volumeMounts:
            - name: model-storage
              mountPath: /models
          resources:
            requests:
              memory: "8Gi"
              cpu: "2"
            limits:
              memory: "16Gi"
              cpu: "4"
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: qwen-model-storage
# Create HuggingFace token secret (needed for gated models)
kubectl create secret generic hf-credentials \
  --from-literal=token=hf_your_token_here \
  -n llm
kubectl apply -f model-downloader.yaml
# Watch download progress (takes 20-40 minutes for 32B)
kubectl logs -f job/download-qwen-model -n llm
Step 4: Deploy vLLM Serving
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-vllm
  namespace: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen-vllm
  template:
    metadata:
      labels:
        app: qwen-vllm
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest # pin a specific tag in production
          args:
            - "--model=/models/qwen2.5-coder-32b"
            - "--served-model-name=qwen2.5-coder"
            - "--tensor-parallel-size=2" # Use 2 GPUs for 32B model; on 4x A10G set this and the GPU counts below to 4
            - "--max-model-len=65536" # 64K context; above 32768 needs YaRN rope_scaling in config.json, lower it if the KV cache does not fit
            - "--max-num-seqs=64" # Max concurrent requests
            - "--gpu-memory-utilization=0.92"
            - "--dtype=bfloat16" # requires Ampere or newer GPUs (not V100)
            - "--port=8000"
          env:
            - name: NCCL_DEBUG
              value: "WARN"
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: "2"
              memory: "80Gi"
              cpu: "16"
            requests:
              nvidia.com/gpu: "2"
              memory: "60Gi"
              cpu: "8"
          volumeMounts:
            - name: model-storage
              mountPath: /models
              readOnly: true
            - name: shm
              mountPath: /dev/shm
          # Holds off the liveness probe until the model has loaded;
          # without it the pod can be restarted partway through a 5-10 minute load.
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
            failureThreshold: 90 # up to 15 minutes
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 15
            failureThreshold: 20
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: qwen-model-storage
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
---
apiVersion: v1
kind: Service
metadata:
  name: qwen-vllm
  namespace: llm
spec:
  selector:
    app: qwen-vllm
  ports:
    - port: 8000
      targetPort: 8000
      name: http
  type: ClusterIP
kubectl apply -f vllm-deployment.yaml
# Watch startup (model loading takes 5-10 minutes)
kubectl logs -f deployment/qwen-vllm -n llm
# Wait for ready
kubectl wait --for=condition=ready pod -l app=qwen-vllm -n llm --timeout=600s
Step 5: Test the API
# Port forward for testing
kubectl port-forward svc/qwen-vllm 8000:8000 -n llm &
# Test completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder",
    "messages": [
      {
        "role": "user",
        "content": "Write a Python function that checks if a binary tree is balanced. Include type hints and docstring."
      }
    ],
    "temperature": 0.1,
    "max_tokens": 1024
  }'
Step 6: Expose via Ingress
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: qwen-ingress
  namespace: llm
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - llm.internal.mycompany.com
      secretName: llm-tls
  rules:
    - host: llm.internal.mycompany.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: qwen-vllm
                port:
                  number: 8000
Step 7: Connect VS Code via Continue.dev
Install the Continue extension in VS Code, then configure it to use the Ingress host. The Ingress terminates TLS, so the URLs use https:
// ~/.continue/config.json
{
  "models": [
    {
      "title": "Qwen2.5-Coder (Private)",
      "provider": "openai",
      "model": "qwen2.5-coder",
      "apiBase": "https://llm.internal.mycompany.com/v1",
      "apiKey": "not-needed"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder Autocomplete",
    "provider": "openai",
    "model": "qwen2.5-coder",
    "apiBase": "https://llm.internal.mycompany.com/v1",
    "apiKey": "not-needed"
  },
  "contextProviders": [
    { "name": "codebase" },
    { "name": "files" }
  ]
}
You now have a private, self-hosted code AI in VS Code, with no external API calls.
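Because vLLM serves an OpenAI-compatible API, editors are not the only possible clients. Below is a minimal sketch of calling the same endpoint from a script using only the Python standard library. It assumes the port-forward from Step 5 is still running (replace `BASE_URL` with the Ingress host otherwise). The helper names `build_chat_request` and `ask` are illustrative and not part of any SDK.

```python
import json
import urllib.request

# Assumption: the Step 5 port-forward is active on localhost:8000.
BASE_URL = "http://localhost:8000/v1"
MODEL = "qwen2.5-coder"  # must match --served-model-name in the Deployment


def build_chat_request(prompt: str, base_url: str = BASE_URL, model: str = MODEL,
                       max_tokens: int = 512) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.1,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt and return the text of the first choice."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("Explain what a Kubernetes readiness probe does.")` returns the model's reply as a string. The 300-second timeout mirrors the proxy timeouts set on the Ingress.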
Monitoring vLLM
# vLLM exposes Prometheus metrics
kubectl port-forward svc/qwen-vllm 8000:8000 -n llm &
curl http://localhost:8000/metrics
# Key metrics to watch:
# vllm:gpu_cache_usage_perc - GPU KV cache utilization
# vllm:num_requests_running - Active requests
# vllm:time_per_output_token_seconds - Token generation speed
# vllm:e2e_request_latency_seconds - End-to-end latency
Add to Prometheus scrape config:
- job_name: vllm
  static_configs:
    - targets: ['qwen-vllm.llm.svc.cluster.local:8000']
Cost Estimate (AWS)
| Instance | GPUs | Total VRAM | Model | Cost/hr |
|---|---|---|---|---|
| g5.2xlarge | 1x A10G 24GB | 24GB | 7B INT4 | ~$1.21 |
| g5.12xlarge | 4x A10G 24GB | 96GB | 32B FP16 | ~$5.67 |
| p3.8xlarge | 4x V100 16GB | 64GB | 14B FP16 | ~$12.24 |
| p4d.24xlarge | 8x A100 40GB | 320GB | 32B FP16 | ~$32.77 |
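For internal developer tooling, a g5.12xlarge at ~$5.67/hr that scales to zero outside working hours costs ~$45 for an 8-hour developer day. Shared across 20 engineers, that is about $2.27 per engineer per day, which is often cheaper than per-seat or per-token pricing for hosted APIs. Scale-to-zero is not configured by the manifests in this guide and needs separate setup.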
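The per-engineer figure is simple arithmetic, sketched below so you can substitute your own numbers. The hourly rate comes from the table above. The 8-hour day and 20-engineer team are the article's assumptions, and the function name is illustrative.

```python
def daily_cost_per_engineer(hourly_rate: float, hours_per_day: float, engineers: int) -> float:
    """Instance cost for one working day, split evenly across the team."""
    return hourly_rate * hours_per_day / engineers


if __name__ == "__main__":
    rate = 5.67  # g5.12xlarge, USD per hour (approximate, from the table)
    total = rate * 8
    share = daily_cost_per_engineer(rate, 8, 20)
    print(f"Daily total: ${total:.2f}, per engineer: ${share:.2f}")
```

Without intermediate rounding the per-engineer figure comes out near $2.27. The estimate only holds if the GPU node is actually released outside working hours.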
Affiliate note: AWS EC2 g5 instances are the most cost-effective for running 7B–32B models. Use EC2 Spot Instances for non-critical workloads to cut costs by 60–70%.