
Deploy Qwen2.5-Coder on Kubernetes for Private Code AI (2026)

Run Alibaba's Qwen2.5-Coder LLM on your own Kubernetes cluster with GPU nodes. Complete guide — from GPU node setup to serving with vLLM and integrating with VS Code via Continue.dev.

DevOpsBoys · May 25, 2026 · 5 min read

Qwen2.5-Coder is one of the best open-source code generation models in 2026 — comparable to GPT-4o for coding tasks, running entirely on your own infrastructure. No code leaves your cluster, no API costs, no rate limits.

This guide deploys Qwen2.5-Coder-32B-Instruct on Kubernetes with vLLM and connects it to VS Code via Continue.dev.


Why Qwen2.5-Coder

  • 32B parameter model — strong code completion, explanation, and generation
  • Up to 128K context window: 32K out of the box, longer with YaRN rope scaling enabled, enough to fit large parts of a codebase
  • Apache 2.0 license — commercial use allowed
  • vLLM compatible — high-throughput GPU serving
  • OpenAI API compatible — drop-in replacement, works with any OpenAI SDK client

Benchmarks published by the Qwen team show Qwen2.5-Coder-32B-Instruct scoring above GPT-4o on the HumanEval and MBPP coding benchmarks.


Prerequisites

  • Kubernetes cluster with GPU nodes (NVIDIA A100/H100/L40S or consumer RTX 4090)
  • NVIDIA GPU Operator installed
  • StorageClass with at least 100GB available
  • kubectl and helm configured

Minimum GPU requirements:

  • Qwen2.5-Coder-7B: 1x A10G (24GB VRAM)
  • Qwen2.5-Coder-14B: 1x A100 (40GB) or 2x 3090
  • Qwen2.5-Coder-32B: 2x A100 (80GB) or 4x A10G
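
A rough way to sanity-check these pairings is to compare the weight size against usable VRAM. The sketch below is only a back-of-the-envelope estimate: it uses nominal parameter counts (7B, 32B), 2 bytes per parameter for FP16/BF16, and a hypothetical 90% usable-VRAM factor. It ignores KV cache and activations, so "fits" here is a necessary condition, not a guarantee.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bytes_per_param


def fits(params_billions: float, bytes_per_param: float,
         gpus: int, vram_gb_per_gpu: float, utilization: float = 0.9) -> bool:
    """True if the weights fit in the usable VRAM summed over all GPUs.

    KV cache and activations are not counted, so True only means
    the configuration is worth trying.
    """
    usable_gb = gpus * vram_gb_per_gpu * utilization
    return weight_memory_gb(params_billions, bytes_per_param) <= usable_gb


if __name__ == "__main__":
    # (label, nominal params in billions, GPU count, VRAM per GPU in GB)
    candidates = [
        ("7B on 1x A10G", 7, 1, 24),
        ("32B on 2x A10G", 32, 2, 24),
        ("32B on 4x A10G", 32, 4, 24),
        ("32B on 2x A100 80GB", 32, 2, 80),
    ]
    for label, size_b, gpus, vram in candidates:
        weights = weight_memory_gb(size_b, 2)  # 2 bytes/param = FP16 or BF16
        print(f"{label}: {weights:.0f} GB weights, fits={fits(size_b, 2, gpus, vram)}")
```

By this estimate the 32B model needs about 64 GB for weights alone, which is why two 24GB A10Gs are not enough and four are.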

Step 1: Create GPU Node Pool

AWS EKS (Karpenter)

yaml
# karpenter-gpu-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-qwen
spec:
  template:
    spec:
      requirements:
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values: ["p4d", "p3", "g5"]   # p4d = A100, p3 = V100 (no bfloat16 support), g5 = A10G
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodes
      taints:
      - key: nvidia.com/gpu
        effect: NoSchedule
  limits:
    nvidia.com/gpu: 8
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-nodes
spec:
  amiFamily: AL2
  amiSelectorTerms:              # required by the karpenter.k8s.aws/v1 API
  - alias: al2@latest
  instanceProfile: KarpenterNodeInstanceProfile
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster

Step 2: Create Namespace and Storage

bash
kubectl create namespace llm
yaml
# model-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen-model-storage
  namespace: llm
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 120Gi   # ~65GB for 32B model weights + buffer
  storageClassName: gp3  # AWS EBS gp3
bash
kubectl apply -f model-pvc.yaml

Step 3: Download Model Weights (Init Job)

yaml
# model-downloader.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: download-qwen-model
  namespace: llm
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: downloader
        image: python:3.11-slim
        command:
        - /bin/sh
        - -c
        - |
          pip install huggingface_hub -q
          python3 -c "
          from huggingface_hub import snapshot_download
          snapshot_download(
              repo_id='Qwen/Qwen2.5-Coder-32B-Instruct',
              local_dir='/models/qwen2.5-coder-32b',
              ignore_patterns=['*.msgpack', '*.h5', 'original/*']
          )
          print('Download complete')
          "
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-credentials
              key: token
        volumeMounts:
        - name: model-storage
          mountPath: /models
        resources:
          requests:
            memory: "8Gi"
            cpu: "2"
          limits:
            memory: "16Gi"
            cpu: "4"
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: qwen-model-storage
bash
# Create HuggingFace token secret (Qwen2.5-Coder is not gated, but the Job references this secret, so it must exist)
kubectl create secret generic hf-credentials \
  --from-literal=token=hf_your_token_here \
  -n llm
 
kubectl apply -f model-downloader.yaml
 
# Watch download progress (takes 20-40 minutes for 32B)
kubectl logs -f job/download-qwen-model -n llm
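
Before pointing vLLM at the volume, it can be worth checking that the download actually finished. The following is a minimal sketch, not part of the original workflow: it assumes the weights land as `*.safetensors` shards next to a `config.json` in the target directory, and the 60 GB threshold is a hypothetical value derived from the ~65GB figure in the PVC comment.

```python
from pathlib import Path


def weight_files(model_dir: str) -> list[Path]:
    """Return the .safetensors shards directly under model_dir, sorted by name."""
    return sorted(Path(model_dir).glob("*.safetensors"))


def total_size_gb(files: list[Path]) -> float:
    """Sum the on-disk size of the given files in GB (1 GB = 1e9 bytes)."""
    return sum(f.stat().st_size for f in files) / 1e9


def looks_complete(model_dir: str, min_gb: float) -> bool:
    """Heuristic check: config.json exists and the shards total at least min_gb."""
    has_config = (Path(model_dir) / "config.json").is_file()
    return has_config and total_size_gb(weight_files(model_dir)) >= min_gb


if __name__ == "__main__":
    model_dir = "/models/qwen2.5-coder-32b"  # local_dir used by the downloader Job
    shards = weight_files(model_dir)
    print(f"{len(shards)} shards, {total_size_gb(shards):.1f} GB")
    print("complete:", looks_complete(model_dir, min_gb=60))  # threshold is an assumption
```

This would need to run in a pod that mounts the same PVC, for example as a final step of the downloader container.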

Step 4: Deploy vLLM Serving

yaml
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-vllm
  namespace: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen-vllm
  template:
    metadata:
      labels:
        app: qwen-vllm
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
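        args:
        - "--model=/models/qwen2.5-coder-32b"
        - "--served-model-name=qwen2.5-coder"
        - "--tensor-parallel-size=2"      # Must match the nvidia.com/gpu count below; use 4 on 24GB A10G nodes
        - "--max-model-len=32768"          # 32K context; the shipped config.json caps at 32,768 tokens unless YaRN rope scaling is enabled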
        - "--max-num-seqs=64"              # Max concurrent requests
        - "--gpu-memory-utilization=0.92"
        - "--dtype=bfloat16"
        - "--port=8000"
        env:
        - name: NCCL_DEBUG
          value: "WARN"
        ports:
        - containerPort: 8000
          name: http
        resources:
          limits:
            nvidia.com/gpu: "2"
            memory: "80Gi"
            cpu: "16"
          requests:
            nvidia.com/gpu: "2"
            memory: "60Gi"
            cpu: "8"
        volumeMounts:
        - name: model-storage
          mountPath: /models
          readOnly: true
        - name: shm
          mountPath: /dev/shm
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 15
          failureThreshold: 20
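        startupProbe:                      # Allow up to 15 min for model loading before liveness checks begin
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 15
          failureThreshold: 60
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 30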
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: qwen-model-storage
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 16Gi
---
apiVersion: v1
kind: Service
metadata:
  name: qwen-vllm
  namespace: llm
spec:
  selector:
    app: qwen-vllm
  ports:
  - port: 8000
    targetPort: 8000
    name: http
  type: ClusterIP
bash
kubectl apply -f vllm-deployment.yaml
 
# Watch startup (model loading takes 5-10 minutes)
kubectl logs -f deployment/qwen-vllm -n llm
 
# Wait for ready
kubectl wait --for=condition=ready pod -l app=qwen-vllm -n llm --timeout=600s
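
The `--max-model-len` and `--gpu-memory-utilization` flags interact through the KV cache, so it helps to estimate how much memory a full-length sequence needs. The sketch below uses the standard per-token formula (one key and one value vector per layer). The architecture numbers (64 layers, 8 KV heads, head dimension 128) are assumptions for Qwen2.5-Coder-32B; check them against `config.json` in your model directory.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache for one token: a key and a value vector in every layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_cache_gib(tokens: int, per_token_bytes: int) -> float:
    """KV cache size in GiB for the given number of cached tokens."""
    return tokens * per_token_bytes / 2**30


# Assumed values for Qwen2.5-Coder-32B; verify against the model's config.json.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

if __name__ == "__main__":
    per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)  # bf16 = 2 bytes
    print(f"{per_token / 1024:.0f} KiB per token")
    for context in (32768, 65536):
        print(f"{context} tokens: {kv_cache_gib(context, per_token):.0f} GiB per full-length sequence")
```

Under these assumptions a single full 64K-token sequence takes about 16 GiB of cache across the tensor-parallel group, on top of the weights, so long contexts and high `--max-num-seqs` compete for the same memory.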

Step 5: Test the API

bash
# Port forward for testing
kubectl port-forward svc/qwen-vllm 8000:8000 -n llm &
 
# Test completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder",
    "messages": [
      {
        "role": "user",
        "content": "Write a Python function that checks if a binary tree is balanced. Include type hints and docstring."
      }
    ],
    "temperature": 0.1,
    "max_tokens": 1024
  }'
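
The same request can be sent from Python. This is a minimal sketch using only the standard library so it runs without installing an SDK. It assumes the port-forward above is still active on `localhost:8000`; the helper names are illustrative.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       temperature: float = 0.1, max_tokens: int = 1024) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_json: dict) -> str:
    """Pull the assistant text out of a chat completions response."""
    return response_json["choices"][0]["message"]["content"]


def ask(base_url: str, model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt and return the model's reply (performs a network call)."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))


if __name__ == "__main__":
    print(ask("http://localhost:8000", "qwen2.5-coder",
              "Write a Python function that checks if a binary tree is balanced."))
```

Because the endpoint follows the OpenAI wire format, an OpenAI SDK client pointed at the same base URL should work as well.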

Step 6: Expose via Ingress

yaml
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: qwen-ingress
  namespace: llm
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - llm.internal.mycompany.com
    secretName: llm-tls
  rules:
  - host: llm.internal.mycompany.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: qwen-vllm
            port:
              number: 8000

Step 7: Connect VS Code via Continue.dev

Install the Continue extension in VS Code, then configure it:

json
// ~/.continue/config.json
{
  "models": [
    {
      "title": "Qwen2.5-Coder (Private)",
      "provider": "openai",
      "model": "qwen2.5-coder",
      "apiBase": "https://llm.internal.mycompany.com/v1",
      "apiKey": "not-needed"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder Autocomplete",
    "provider": "openai",
    "model": "qwen2.5-coder",
    "apiBase": "https://llm.internal.mycompany.com/v1",
    "apiKey": "not-needed"
  },
  "contextProviders": [
    { "name": "codebase" },
    { "name": "files" }
  ]
}

Now you have private, self-hosted code AI in VS Code — no external API calls.
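
If Continue reports that the model cannot be found, a common cause is a mismatch between the `model` field in the config and the name passed to `--served-model-name`. The sketch below is one way to cross-check the two. It assumes you have fetched the JSON from the server's `/v1/models` endpoint (OpenAI list format) and loaded the Continue config as a dict; the function names are hypothetical.

```python
def served_model_ids(models_response: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", [])]


def unserved_models(continue_config: dict, served: list[str]) -> list[str]:
    """Return titles of Continue entries whose 'model' is not served by the backend."""
    entries = list(continue_config.get("models", []))
    autocomplete = continue_config.get("tabAutocompleteModel")
    if autocomplete:
        entries.append(autocomplete)
    return [e.get("title", e["model"]) for e in entries if e["model"] not in served]


if __name__ == "__main__":
    # Illustrative inputs, not captured output from a real server.
    models_response = {"object": "list", "data": [{"id": "qwen2.5-coder", "object": "model"}]}
    continue_config = {
        "models": [{"title": "Qwen2.5-Coder (Private)", "model": "qwen2.5-coder"}],
        "tabAutocompleteModel": {"title": "Qwen2.5-Coder Autocomplete", "model": "qwen2.5-coder"},
    }
    missing = unserved_models(continue_config, served_model_ids(models_response))
    print("all models served" if not missing else f"not served: {missing}")
```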


Monitoring vLLM

bash
# vLLM exposes Prometheus metrics
kubectl port-forward svc/qwen-vllm 8000:8000 -n llm &
curl http://localhost:8000/metrics
 
# Key metrics to watch:
# vllm:gpu_cache_usage_perc - GPU KV cache utilization
# vllm:num_requests_running - Active requests
# vllm:time_per_output_token_seconds - Token generation speed
# vllm:e2e_request_latency_seconds - End-to-end latency
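
For a quick look without a full Prometheus setup, the `/metrics` text can be parsed directly. This is a small sketch of a parser for the Prometheus text exposition format. It assumes samples carry no trailing timestamp, and the sample text in the usage section is illustrative, not captured server output.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Map each series (metric name plus labels) to its sample value.

    Skips blank lines and '#' comment lines (HELP/TYPE). Assumes each
    sample line is '<series> <value>' with no trailing timestamp.
    """
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split from the right: label values may themselves contain spaces.
        series, value = line.rsplit(maxsplit=1)
        samples[series] = float(value)
    return samples


def metric_values(samples: dict[str, float], name: str) -> list[float]:
    """Return the values of every series whose metric name equals `name`."""
    return [v for series, v in samples.items() if series.split("{", 1)[0] == name]


if __name__ == "__main__":
    sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="qwen2.5-coder"} 3.0
vllm:gpu_cache_usage_perc{model_name="qwen2.5-coder"} 0.42
"""
    parsed = parse_metrics(sample)
    print("running:", metric_values(parsed, "vllm:num_requests_running"))
    print("kv cache usage:", metric_values(parsed, "vllm:gpu_cache_usage_perc"))
```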

Add to Prometheus scrape config:

yaml
- job_name: vllm
  static_configs:
  - targets: ['qwen-vllm.llm.svc.cluster.local:8000']

Cost Estimate (AWS)

Instance | GPUs | Total VRAM | Model | Cost/hr
g5.2xlarge | 1x A10G 24GB | 24GB | 7B INT4 | ~$1.21
g5.12xlarge | 4x A10G 24GB | 96GB | 32B FP16 | ~$5.67
p3.8xlarge | 4x V100 16GB | 64GB | 14B FP16 | ~$12.24
p4d.24xlarge | 8x A100 40GB | 320GB | 32B FP16 | ~$32.77

For internal developer tooling, a g5.12xlarge at ~$5.67/hr with auto-scale-to-zero costs about $45 for an 8-hour developer day. Shared across 20 engineers, that is roughly $2.25 per engineer per day, which is often cheaper than a paid API at this level of usage.
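
The arithmetic behind that estimate is simple enough to adapt to your own team size and hours. A minimal sketch, assuming the instance runs only during working hours and using the approximate on-demand price from the table:

```python
def daily_cost(hourly_usd: float, hours_per_day: float) -> float:
    """Instance cost for one day, assuming it only runs for hours_per_day."""
    return hourly_usd * hours_per_day


def cost_per_engineer(hourly_usd: float, hours_per_day: float, engineers: int) -> float:
    """Daily cost split evenly across the engineers sharing the endpoint."""
    return daily_cost(hourly_usd, hours_per_day) / engineers


if __name__ == "__main__":
    hourly = 5.67  # approximate g5.12xlarge on-demand price from the table
    print(f"per day: ${daily_cost(hourly, 8):.2f}")
    print(f"per engineer per day: ${cost_per_engineer(hourly, 8, 20):.2f}")
```

Without rounding, the figures come to $45.36 per day and about $2.27 per engineer per day.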



Affiliate note: AWS EC2 g5 instances are the most cost-effective for running 7B–32B models. Use EC2 Spot Instances for non-critical workloads to cut costs by 60–70%.
