Deploy Microsoft Phi-3 Mini on Kubernetes for Local AI Inference

Phi-3 Mini comes close to GPT-3.5 quality on many benchmarks at a fraction of the compute cost. Here's how to deploy it on Kubernetes using Ollama or vLLM with GPU or CPU-only nodes.

Microsoft's Phi-3 Mini (3.8B parameters) punches well above its weight class. On the reasoning and coding benchmarks Microsoft reports, it matches or outperforms models twice its size. It runs on a single GPU, or even on CPU-only nodes for lighter workloads.

Here's how to deploy it on Kubernetes.

Why Phi-3 Mini?

3.8B parameters: a 4-bit quantized build fits in about 4GB VRAM (GPU) or 8GB RAM (CPU); the fp16 weights alone need roughly 7.6GB
Strong at reasoning and code: designed for instruction following, not just text completion
MIT license: commercial use allowed
Low deployment cost: runs on T4, A10G, or even CPU nodes

Compared to running Llama-3-8B or Mistral-7B, Phi-3 Mini has about half the parameters (3.8B versus 7-8B), so it needs roughly half the compute for similar quality on focused tasks.
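
As a rough sanity check on the memory figures above, weight storage scales with parameter count times bits per weight. The sketch below is a back-of-the-envelope estimate only: the helper name `weights_gb` is hypothetical, and it ignores KV cache, activations, and runtime overhead, so real usage will be higher.

```shell
#!/bin/sh
# Rough weight-memory estimate: parameters (billions) x bits per weight / 8.
# Billions of parameters times bytes per parameter gives gigabytes directly.
# Assumption: this counts weights only, not KV cache or runtime overhead.
weights_gb() {
  # $1 = parameter count in billions, $2 = bits per weight
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

echo "fp16 weights:  $(weights_gb 3.8 16) GB"
echo "4-bit weights: $(weights_gb 3.8 4) GB"
```

The gap between the two results is why a quantized build fits comfortably on small GPUs while the fp16 model needs a card with more headroom.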

Option 1 — Deploy with Ollama (Simplest)

Ollama handles model download and serving automatically, and pulls a pre-quantized build of the model so you don't have to quantize it yourself.

Kubernetes Deployment

yaml

# phi3-ollama.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-phi3
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-phi3
  template:
    metadata:
      labels:
        app: ollama-phi3
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          env:
            - name: OLLAMA_MODELS
              value: /models
          resources:
            requests:
              memory: "6Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
              # For GPU nodes, add:
              # nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-storage
              mountPath: /models
          lifecycle:
            postStart:
              exec:
                command: ["/bin/sh", "-c", "sleep 5 && ollama pull phi3:mini"]
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: ollama-models-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
  namespace: ai-inference
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-phi3
  namespace: ai-inference
spec:
  selector:
    app: ollama-phi3
  ports:
    - port: 11434
      targetPort: 11434
  type: ClusterIP

bash

kubectl create namespace ai-inference
kubectl apply -f phi3-ollama.yaml
 
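# Wait for the model download (2-3 minutes); the pod turns Ready once the postStart pull finishes
kubectl rollout status deployment/ollama-phi3 -n ai-inference --timeout=10m
kubectl logs deployment/ollama-phi3 -n ai-inference --tail=20

# Test from inside cluster (--rm -i prints the response and cleans up the pod)
kubectl run test --rm -i --image=curlimages/curl --restart=Never -n ai-inference -- \
  curl -s http://ollama-phi3:11434/api/generate \
  -d '{"model":"phi3:mini","prompt":"What is Kubernetes in one sentence?","stream":false}'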
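
The postStart hook in the manifest above relies on a fixed `sleep 5`, which assumes the Ollama server is listening within five seconds. A more tolerant approach is to poll until the server answers and only then pull the model. The sketch below shows one way to do that; the `retry` helper is hypothetical, and the attempt count and delay are assumptions to tune for your nodes.

```shell
#!/bin/sh
# Retry a command until it succeeds or the attempts run out.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
retry() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0      # stop as soon as the command succeeds
    i=$((i + 1))
    sleep "$delay"
  done
  return 1                # every attempt failed
}

# Harmless demonstration with a command that always succeeds.
retry 3 0 true && echo "ready"

# In the hook, the same idea would look like this (assumed values):
#   retry 30 2 ollama list >/dev/null && ollama pull phi3:mini
```

`ollama list` only succeeds once the server is reachable, so it works as a cheap readiness check before the pull.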

Option 2 — Deploy with vLLM (Production-Grade)

vLLM gives you an OpenAI-compatible API, request batching, and better throughput for multi-user scenarios.

yaml

# phi3-vllm.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-phi3
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-phi3
  template:
    metadata:
      labels:
        app: vllm-phi3
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "microsoft/Phi-3-mini-4k-instruct"
            - "--dtype"
            - "float16"
            - "--max-model-len"
            - "4096"
            - "--port"
            - "8000"
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: "8Gi"
              cpu: "4"
            limits:
              memory: "12Gi"
              cpu: "8"
              nvidia.com/gpu: "1"    # Requires GPU node
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: vllm-model-cache
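---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-cache
  namespace: ai-inference
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 20Gi    # assumed size; the fp16 weights are roughly 8GB
---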
apiVersion: v1
kind: Service
metadata:
  name: vllm-phi3
  namespace: ai-inference
spec:
  selector:
    app: vllm-phi3
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP

bash

# Create the Hugging Face token secret (the manifest references it, so it must exist before you apply)
kubectl create secret generic hf-token \
  --from-literal=token=hf_your_token_here \
  -n ai-inference
 
kubectl apply -f phi3-vllm.yaml
 
# Test the OpenAI-compatible API
kubectl exec deploy/vllm-phi3 -n ai-inference -- \
  curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/Phi-3-mini-4k-instruct",
    "messages": [{"role": "user", "content": "Write a Kubernetes readiness probe example"}],
    "max_tokens": 500
  }'
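
vLLM needs time to download and load the model before it can answer requests, and the Deployment above has no readiness probe, so the Service may route traffic to a pod that isn't serving yet. One option is to add a probe against vLLM's `/health` endpoint. The sketch below prints a strategic merge patch for that; the function name `render_probe_patch` is hypothetical and the timing values are assumptions, not tuned numbers.

```shell
#!/bin/sh
# Print a strategic merge patch that adds a readiness probe to the
# "vllm" container. Containers are matched by name, so only that
# container is changed.
# Assumption: the timings below are placeholders; slow model downloads
# may need a higher failureThreshold.
render_probe_patch() {
  cat <<'EOF'
spec:
  template:
    spec:
      containers:
        - name: vllm
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 30
EOF
}

render_probe_patch
```

To apply it, redirect the output to a file such as `vllm-probe-patch.yaml` and run `kubectl patch deployment vllm-phi3 -n ai-inference --patch-file vllm-probe-patch.yaml`.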

CPU-Only Deployment (No GPU)

For non-latency-sensitive workloads, Phi-3 Mini runs acceptably on CPU with quantization:

yaml

# Using llama.cpp server for CPU inference
apiVersion: apps/v1
kind: Deployment
metadata:
  name: phi3-cpu
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: phi3-cpu
  template:
    metadata:
      labels:
        app: phi3-cpu
    spec:
      containers:
        - name: llama-server
          image: ghcr.io/ggml-org/llama.cpp:server
          args:
            - "-m"
            - "/models/phi-3-mini-q4.gguf"
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8080"
            - "-n"
            - "512"
            - "--threads"
            - "4"
          resources:
            requests:
              memory: "4Gi"
              cpu: "4"
            limits:
              memory: "6Gi"
              cpu: "8"
          volumeMounts:
            - name: models
              mountPath: /models
      initContainers:
        - name: download-model
          image: curlimages/curl:latest
          command:
            - sh
            - -c
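            - |
              # Skip the download if the model is already on the volume;
              # write to a temporary name first so a partial file is never served
              if [ ! -f /models/phi-3-mini-q4.gguf ]; then
                curl -fL "https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf" \
                  -o /models/phi-3-mini-q4.gguf.part && \
                mv /models/phi-3-mini-q4.gguf.part /models/phi-3-mini-q4.gguf
              fi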
          volumeMounts:
            - name: models
              mountPath: /models
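      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: phi3-model-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: phi3-model-pvc
  namespace: ai-inference
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 5Gi    # assumed size; the q4 GGUF file is under 3GB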

Performance expectations (CPU):

4 vCPU node: ~5–8 tokens/second
8 vCPU node: ~10–15 tokens/second
Suitable for batch jobs and dev environments, not real-time chat
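
To put those rates in perspective, generation time is roughly the number of output tokens divided by tokens per second. The sketch below applies that to the 512-token cap set by `-n 512` in the manifest above. The helper name `gen_seconds` is hypothetical, and the estimate ignores prompt processing time.

```shell
#!/bin/sh
# Estimate seconds needed to generate N tokens at a given tokens/second rate.
# Assumption: steady decode speed, no prompt-processing time included.
gen_seconds() {
  # $1 = tokens to generate, $2 = tokens per second
  awk -v n="$1" -v r="$2" 'BEGIN { printf "%.0f\n", n / r }'
}

echo "4 vCPU, 512 tokens: $(gen_seconds 512 8) to $(gen_seconds 512 5) seconds"
echo "8 vCPU, 512 tokens: $(gen_seconds 512 15) to $(gen_seconds 512 10) seconds"
```

A full-length response on a 4 vCPU node takes on the order of one to two minutes, which is fine for batch work but too slow for interactive chat.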

Add an Ingress for External Access

yaml

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: phi3-ingress
  namespace: ai-inference
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx    # assumes the ingress-nginx controller, as the annotations above imply
  rules:
    - host: ai.internal.mycompany.com
      http:
        paths:
          - path: /v1
            pathType: Prefix
            backend:
              service:
                name: vllm-phi3
                port:
                  number: 8000

Cost Comparison (AWS, ap-south-1)

Node Type	VRAM/RAM	Phi-3 Performance	Cost/hour
g4dn.xlarge	16GB GPU	Fast (~50 tok/s)	~$0.52
c5.2xlarge (CPU)	16GB RAM	Slow (~10 tok/s)	~$0.34
m5.xlarge (CPU)	16GB RAM	Slow (~8 tok/s)	~$0.19

For internal tooling and automation, where latency is less critical, CPU deployment on an m5.xlarge has the lowest hourly cost.
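
Hourly price is only half the picture, because the nodes differ a lot in throughput. The sketch below converts the g4dn.xlarge, c5.2xlarge, and m5.xlarge rows from the table above into an approximate cost per million generated tokens. The helper name `cost_per_mtok` is hypothetical, and the calculation assumes the node is fully busy at the listed rate.

```shell
#!/bin/sh
# Cost per million tokens = hourly price / (tokens per second * 3600) * 1,000,000.
# Assumption: the node generates tokens continuously at the quoted rate.
cost_per_mtok() {
  # $1 = price in USD per hour, $2 = sustained tokens per second
  awk -v c="$1" -v t="$2" 'BEGIN { printf "%.2f\n", c / (t * 3600) * 1000000 }'
}

echo "g4dn.xlarge: \$$(cost_per_mtok 0.52 50) per 1M tokens"
echo "c5.2xlarge:  \$$(cost_per_mtok 0.34 10) per 1M tokens"
echo "m5.xlarge:   \$$(cost_per_mtok 0.19 8) per 1M tokens"
```

With these inputs the GPU node is the cheapest per token under sustained load, so the CPU option mainly pays off when the service sits idle most of the time.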

Phi-3 Mini is one of the best models for DevOps automation tasks (generating configs, explaining errors, writing scripts) because it's small enough to run cheaply on your own infrastructure while being accurate enough for technical content.

For Kubernetes and MLOps hands-on labs, KodeKloud covers GPU workloads, KEDA-based autoscaling, and full ML deployment pipelines.
