
Deploy Microsoft Phi-3 Mini on Kubernetes for Local AI Inference

Microsoft reports that Phi-3 Mini rivals GPT-3.5 on common benchmarks at a fraction of the compute cost. Here's how to deploy it on Kubernetes using Ollama, vLLM, or llama.cpp, on GPU or CPU-only nodes.

DevOpsBoys | May 13, 2026 | 4 min read

Microsoft's Phi-3 Mini (3.8B parameters) punches well above its weight class. On standard reasoning and coding benchmarks, it outperforms models twice its size. It runs on a single GPU — or even CPU-only nodes for lighter workloads.

Here's how to deploy it on Kubernetes.


Why Phi-3 Mini?

  • 3.8B parameters: fits in about 4GB of VRAM (GPU) or 8GB of RAM (CPU) when quantized to 4 bits; fp16 weights alone need roughly 8GB
  • Strong at reasoning and code: tuned for instruction following, not just text completion
  • MIT license: commercial use allowed
  • Low deployment cost: runs on T4, A10G, or even CPU nodes

Compared to running Llama-3-8B or Mistral-7B, Phi-3 Mini costs about half the compute for similar quality on focused tasks.
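
The memory figures in the list above follow from parameter count times bytes per weight. A rough sketch of that arithmetic, counting weights only; the `weights_gb` helper is a made-up name for illustration, and 0.5 bytes per weight for 4-bit quantization is an approximation (real quantized files carry some extra metadata):

```bash
#!/usr/bin/env bash
# Weight memory in GB = parameters (billions) x bytes per parameter.
# Weights only: KV cache and runtime overhead come on top of this.
weights_gb() {
  awk -v params="$1" -v bytes="$2" 'BEGIN { printf "%.1f\n", params * bytes }'
}

echo "fp16 : $(weights_gb 3.8 2) GB"     # 2 bytes per weight   -> 7.6 GB
echo "4-bit: $(weights_gb 3.8 0.5) GB"   # ~0.5 bytes per weight -> 1.9 GB
```

This is why a 4-bit build leaves headroom on a small GPU, while an fp16 deployment wants a card in the 16GB class once the KV cache is included.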


Option 1 — Deploy with Ollama (Simplest)

Ollama downloads a pre-quantized build of the model and serves it over HTTP with no extra setup.

Kubernetes Deployment

yaml
# phi3-ollama.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-phi3
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-phi3
  template:
    metadata:
      labels:
        app: ollama-phi3
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          env:
            - name: OLLAMA_MODELS
              value: /models
          resources:
            requests:
              memory: "6Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
              # For GPU nodes, add:
              # nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-storage
              mountPath: /models
          lifecycle:
            postStart:
              exec:
                command: ["/bin/sh", "-c", "sleep 5 && ollama pull phi3:mini"]
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: ollama-models-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
  namespace: ai-inference
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-phi3
  namespace: ai-inference
spec:
  selector:
    app: ollama-phi3
  ports:
    - port: 11434
      targetPort: 11434
  type: ClusterIP
bash
kubectl create namespace ai-inference
kubectl apply -f phi3-ollama.yaml
 
# Wait for model to download (2-3 minutes)
kubectl logs -f deployment/ollama-phi3 -n ai-inference
 
# Test from inside cluster
kubectl run test --rm -i --image=curlimages/curl --restart=Never -n ai-inference -- \
  curl -s http://ollama-phi3:11434/api/generate \
  -d '{"model":"phi3:mini","prompt":"What is Kubernetes in one sentence?","stream":false}'
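
To see how fast the node actually generates, the non-streaming `/api/generate` response includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds). A small sketch that turns those two numbers into tokens per second; `tokens_per_second` is a hypothetical helper and the sample values are illustrative, not measurements:

```bash
#!/usr/bin/env bash
# tokens_per_second EVAL_COUNT EVAL_DURATION_NS
# eval_count is the number of generated tokens; eval_duration is in nanoseconds.
tokens_per_second() {
  awk -v count="$1" -v ns="$2" 'BEGIN {
    if (ns <= 0) { print "n/a"; exit }
    printf "%.1f\n", count / (ns / 1e9)
  }'
}

# Illustrative values only: 120 tokens generated in 15 seconds
tokens_per_second 120 15000000000   # prints 8.0
```

If `jq` is available, both inputs can be read from a saved response with `jq -r '.eval_count, .eval_duration' response.json`, where `response.json` is whatever file you wrote the curl output to.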

Option 2 — Deploy with vLLM (Production-Grade)

vLLM gives you an OpenAI-compatible API, request batching, and better throughput for multi-user scenarios.

yaml
# phi3-vllm.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-phi3
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-phi3
  template:
    metadata:
      labels:
        app: vllm-phi3
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "microsoft/Phi-3-mini-4k-instruct"
            - "--dtype"
            - "float16"
            - "--max-model-len"
            - "4096"
            - "--port"
            - "8000"
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: "8Gi"
              cpu: "4"
            limits:
              memory: "12Gi"
              cpu: "8"
              nvidia.com/gpu: "1"    # Requires GPU node
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: vllm-model-cache
---
# PVC for the Hugging Face cache referenced by the Deployment above.
# The 20Gi size is an assumption; the fp16 weights alone are roughly 8GB.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-cache
  namespace: ai-inference
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-phi3
  namespace: ai-inference
spec:
  selector:
    app: vllm-phi3
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
bash
# Create HuggingFace token secret
kubectl create secret generic hf-token \
  --from-literal=token=hf_your_token_here \
  -n ai-inference
 
kubectl apply -f phi3-vllm.yaml
 
# Test OpenAI-compatible API
kubectl exec -it <pod-name> -n ai-inference -- \
  curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/Phi-3-mini-4k-instruct",
    "messages": [{"role": "user", "content": "Write a Kubernetes readiness probe example"}],
    "max_tokens": 500
  }'
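
The pod cannot answer requests until vLLM has downloaded and loaded the weights, which can take several minutes on first start. A minimal polling sketch, assuming the server exposes a `/health` endpoint and that you have forwarded the Service to your machine; `wait_until_ready` and its defaults are illustrative, not part of vLLM or kubectl:

```bash
#!/usr/bin/env bash
# wait_until_ready URL [MAX_ATTEMPTS] [DELAY_SECONDS]
# Polls URL until curl gets a successful response; returns 1 if it never does.
wait_until_ready() {
  local url="$1" max_attempts="${2:-60}" delay="${3:-5}" attempt
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    if curl -sf --max-time 5 -o /dev/null "$url"; then
      echo "ready after ${attempt} attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "not ready after ${max_attempts} attempt(s)" >&2
  return 1
}
```

For example, run `kubectl port-forward svc/vllm-phi3 8000:8000 -n ai-inference` in one terminal, then `wait_until_ready http://localhost:8000/health` in another before sending the first chat request.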

CPU-Only Deployment (No GPU)

For non-latency-sensitive workloads, Phi-3 Mini runs acceptably on CPU with quantization:

yaml
# Using llama.cpp server for CPU inference
apiVersion: apps/v1
kind: Deployment
metadata:
  name: phi3-cpu
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: phi3-cpu
  template:
    metadata:
      labels:
        app: phi3-cpu
    spec:
      containers:
        - name: llama-server
          image: ghcr.io/ggml-org/llama.cpp:server
          args:
            - "-m"
            - "/models/phi-3-mini-q4.gguf"
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8080"
            - "-n"
            - "512"
            - "--threads"
            - "4"
          resources:
            requests:
              memory: "4Gi"
              cpu: "4"
            limits:
              memory: "6Gi"
              cpu: "8"
          volumeMounts:
            - name: models
              mountPath: /models
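      initContainers:
        - name: download-model
          image: curlimages/curl:latest
          # curlimages/curl runs as a non-root user by default; running as root lets it
          # write to a freshly provisioned volume (adjust to your security policy)
          securityContext:
            runAsUser: 0
          command:
            - sh
            - -c
            - |
              # Skip the download when the model is already on the volume;
              # write to a temporary name so a partial file is never mistaken for the model
              [ -f /models/phi-3-mini-q4.gguf ] || {
                curl -fL "https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf" \
                  -o /models/phi-3-mini-q4.gguf.part &&
                mv /models/phi-3-mini-q4.gguf.part /models/phi-3-mini-q4.gguf
              }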
          volumeMounts:
            - name: models
              mountPath: /models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: phi3-model-pvc
---
# PVC referenced by the Deployment above.
# The 5Gi size is an assumption; the q4 GGUF file is a few GB.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: phi3-model-pvc
  namespace: ai-inference
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 5Gi

Performance expectations (CPU):

  • 4 vCPU node: ~5–8 tokens/second
  • 8 vCPU node: ~10–15 tokens/second
  • Suitable for batch jobs and dev environments, not real-time chat
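
To judge whether those rates are enough for a batch job, divide the total output tokens by the generation rate. A back-of-the-envelope sketch; the job size is hypothetical, `batch_hours` is a made-up helper, and the estimate ignores prompt processing time and assumes one sequential stream:

```bash
#!/usr/bin/env bash
# batch_hours TOTAL_TOKENS TOKENS_PER_SECOND -> wall-clock hours for one sequential stream
batch_hours() {
  awk -v tokens="$1" -v tps="$2" 'BEGIN { printf "%.1f\n", tokens / tps / 3600 }'
}

# Hypothetical job: 1,000 responses of 300 tokens each = 300,000 output tokens
echo "4 vCPU at 5 tok/s: $(batch_hours 300000 5) h"   # 16.7 h
echo "4 vCPU at 8 tok/s: $(batch_hours 300000 8) h"   # 10.4 h
```

A job of that size fits in an overnight window on a 4 vCPU node, which is the kind of workload the CPU path suits.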

Add an Ingress for External Access

yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: phi3-ingress
  namespace: ai-inference
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx    # assumes the NGINX Ingress Controller, as the annotations above imply
  rules:
    - host: ai.internal.mycompany.com
      http:
        paths:
          - path: /v1
            pathType: Prefix
            backend:
              service:
                name: vllm-phi3
                port:
                  number: 8000

Cost Comparison (AWS, ap-south-1)

| Node type | VRAM/RAM | Phi-3 performance | Cost/hour |
|---|---|---|---|
| g4dn.xlarge | 16GB GPU | Fast (~50 tok/s) | ~$0.52 |
| c5.2xlarge (CPU) | 16GB RAM | Slow (~10 tok/s) | ~$0.34 |
| m5.xlarge (CPU) | 16GB RAM | Slow (~8 tok/s) | ~$0.19 |

For internal tooling and automation, where latency is less critical and request volume is low, CPU deployment on an m5.xlarge has the lowest hourly cost.
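
Hourly price and per-token price can rank the nodes differently. A rough sketch using the approximate figures from the table, assuming a single request stream that keeps the node busy the whole hour; `cost_per_million_tokens` is a hypothetical helper:

```bash
#!/usr/bin/env bash
# cost_per_million_tokens PRICE_PER_HOUR TOKENS_PER_SECOND -> USD per 1M generated tokens
cost_per_million_tokens() {
  awk -v price="$1" -v tps="$2" 'BEGIN { printf "%.2f\n", price / (tps * 3600) * 1000000 }'
}

echo "g4dn.xlarge: \$$(cost_per_million_tokens 0.52 50)"   # $2.89
echo "c5.2xlarge : \$$(cost_per_million_tokens 0.34 10)"   # $9.44
echo "m5.xlarge  : \$$(cost_per_million_tokens 0.19 8)"    # $6.60
```

Under sustained load the GPU node comes out cheaper per token, while the m5.xlarge stays cheapest per hour, which is the number that matters when the service sits idle most of the day.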


Phi-3 Mini is one of the best models for DevOps automation tasks (generating configs, explaining errors, writing scripts) because it's small enough to run cheaply on your own infrastructure while being accurate enough for technical content.

