Deploy Microsoft Phi-3 Mini on Kubernetes for Local AI Inference
Phi-3 Mini delivers quality comparable to GPT-3.5 on several benchmarks at a fraction of the compute cost. Here's how to deploy it on Kubernetes using Ollama or vLLM with GPU or CPU-only nodes.
Microsoft's Phi-3 Mini (3.8B parameters) punches well above its weight class. On standard reasoning and coding benchmarks, it outperforms models twice its size. It runs on a single GPU — or even CPU-only nodes for lighter workloads.
Here's how to deploy it on Kubernetes.
Why Phi-3 Mini?
- 3.8B parameters: with 4-bit quantization the weights fit in 4GB VRAM (GPU) or 8GB RAM (CPU); unquantized FP16 weights need about 7.6GB
- Strong at reasoning and code: designed for instruction following, not just text completion
- MIT license: commercial use allowed
- Low deployment cost: runs on T4, A10G, or even CPU nodes
Phi-3 Mini has roughly half the parameters of Llama-3-8B or Mistral-7B, so it needs about half the compute for similar quality on focused tasks.
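The memory figures in the list above follow from a simple rule: parameter count times bytes per weight. Here is a rough sketch of that arithmetic. It counts the weights only and ignores the KV cache, activations, and runtime overhead, so treat the results as lower bounds. The function name is made up for illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights in decimal gigabytes."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    # Phi-3 Mini has 3.8B parameters.
    for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{label}: {weight_memory_gb(3.8, bits):.1f} GB")
```

This prints 7.6 GB for FP16, 3.8 GB for 8-bit, and 1.9 GB for 4-bit, which suggests the 4GB VRAM figure assumes a quantized build, while an FP16 deployment needs a card with more headroom, such as a 16GB T4.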
Option 1 — Deploy with Ollama (Simplest)
Ollama handles model download and serving automatically, and the phi3:mini tag it pulls is already quantized.
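Because the model is downloaded when the pod first starts, a client may want to confirm it is present before sending prompts. Here is a minimal sketch using only the Python standard library and Ollama's `/api/tags` and `/api/generate` endpoints. The base URL assumes the in-cluster Service name and port used in the manifest below, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: the in-cluster Service name and port from this article.
OLLAMA_URL = "http://ollama-phi3:11434"


def model_is_pulled(tags: dict, model: str) -> bool:
    """Return True if `model` is listed in an /api/tags response body."""
    return any(m.get("name") == model for m in tags.get("models", []))


def build_generate_request(model: str, prompt: str) -> bytes:
    """Encode a non-streaming /api/generate request body."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()


def generate(prompt: str, model: str = "phi3:mini") -> str:
    """Check that the model is present, then request a completion."""
    with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags", timeout=10) as resp:
        if not model_is_pulled(json.load(resp), model):
            raise RuntimeError(f"{model} has not finished downloading")
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=build_generate_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["response"]
```

Calling `generate("What is Kubernetes in one sentence?")` from a pod in the same namespace should return the model's answer once the download has finished.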
Kubernetes Deployment
# phi3-ollama.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-phi3
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-phi3
  template:
    metadata:
      labels:
        app: ollama-phi3
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          env:
            - name: OLLAMA_MODELS
              value: /models
          resources:
            requests:
              memory: "6Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
              # For GPU nodes, add:
              # nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-storage
              mountPath: /models
          lifecycle:
            postStart:
              exec:
                command: ["/bin/sh", "-c", "sleep 5 && ollama pull phi3:mini"]
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: ollama-models-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
  namespace: ai-inference
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-phi3
  namespace: ai-inference
spec:
  selector:
    app: ollama-phi3
  ports:
    - port: 11434
      targetPort: 11434
  type: ClusterIP

kubectl create namespace ai-inference
kubectl apply -f phi3-ollama.yaml
# Wait for model to download (2-3 minutes)
kubectl logs -f deployment/ollama-phi3 -n ai-inference
# Test from inside cluster (--rm -i prints the response and removes the pod afterwards)
kubectl run test --rm -i --image=curlimages/curl --restart=Never -n ai-inference -- \
  curl -s http://ollama-phi3:11434/api/generate \
  -d '{"model":"phi3:mini","prompt":"What is Kubernetes in one sentence?","stream":false}'

Option 2 — Deploy with vLLM (Production-Grade)
vLLM gives you an OpenAI-compatible API, request batching, and better throughput for multi-user scenarios.
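In practice, "OpenAI-compatible" means a client sends the same JSON it would send to the OpenAI chat-completions endpoint. Here is a minimal sketch using only the Python standard library. The Service name, port, and model ID are taken from the manifest that follows, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumptions: Service DNS name, port, and model ID as used in this article.
VLLM_URL = "http://vllm-phi3:8000/v1/chat/completions"
MODEL = "microsoft/Phi-3-mini-4k-instruct"


def build_chat_payload(prompt: str, max_tokens: int = 500) -> dict:
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of a chat-completions response."""
    return response["choices"][0]["message"]["content"]


def chat(prompt: str) -> str:
    """POST the prompt to the vLLM server and return the reply text."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return extract_reply(json.load(resp))
```

Because the request and response shapes match the OpenAI format, an existing OpenAI client library pointed at this base URL should also work. The standard-library version is used here only to keep the sketch dependency-free.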
# phi3-vllm.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-phi3
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-phi3
  template:
    metadata:
      labels:
        app: vllm-phi3
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "microsoft/Phi-3-mini-4k-instruct"
            - "--dtype"
            - "float16"
            - "--max-model-len"
            - "4096"
            - "--port"
            - "8000"
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: "8Gi"
              cpu: "4"
            limits:
              memory: "12Gi"
              cpu: "8"
              nvidia.com/gpu: "1" # Requires GPU node
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: vllm-model-cache
---
# The Deployment mounts this claim. The 20Gi size is an assumption; adjust as needed.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-cache
  namespace: ai-inference
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-phi3
  namespace: ai-inference
spec:
  selector:
    app: vllm-phi3
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP

# Create Hugging Face token secret
kubectl create secret generic hf-token \
  --from-literal=token=hf_your_token_here \
  -n ai-inference
kubectl apply -f phi3-vllm.yaml
# Test OpenAI-compatible API (deploy/vllm-phi3 selects a pod from the Deployment)
kubectl exec -it deploy/vllm-phi3 -n ai-inference -- \
  curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/Phi-3-mini-4k-instruct",
    "messages": [{"role": "user", "content": "Write a Kubernetes readiness probe example"}],
    "max_tokens": 500
  }'

CPU-Only Deployment (No GPU)
For non-latency-sensitive workloads, Phi-3 Mini runs acceptably on CPU with quantization:
# Using llama.cpp server for CPU inference
apiVersion: apps/v1
kind: Deployment
metadata:
  name: phi3-cpu
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: phi3-cpu
  template:
    metadata:
      labels:
        app: phi3-cpu
    spec:
      containers:
        - name: llama-server
          image: ghcr.io/ggml-org/llama.cpp:server
          args:
            - "-m"
            - "/models/phi-3-mini-q4.gguf"
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8080"
            - "-n"
            - "512"
            - "--threads"
            - "4"
          resources:
            requests:
              memory: "4Gi"
              cpu: "4"
            limits:
              memory: "6Gi"
              cpu: "8"
          volumeMounts:
            - name: models
              mountPath: /models
      initContainers:
        - name: download-model
          image: curlimages/curl:latest
          command:
            - sh
            - -c
            - |
              # -f makes the init container fail on HTTP errors instead of saving an error page
              curl -fL "https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf" \
                -o /models/phi-3-mini-q4.gguf
          volumeMounts:
            - name: models
              mountPath: /models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: phi3-model-pvc
---
# The Deployment mounts this claim. The 5Gi size is an assumption; adjust as needed.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: phi3-model-pvc
  namespace: ai-inference
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 5Gi

Performance expectations (CPU):
- 4 vCPU node: ~5–8 tokens/second
- 8 vCPU node: ~10–15 tokens/second
- Suitable for batch jobs, dev environments, not real-time chat
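To see why these rates rule out real-time chat, it helps to turn tokens per second into wall-clock time for a full response. Here is a rough sketch, assuming a steady decode rate and ignoring prompt processing time. The function name is made up for illustration.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to produce `num_tokens` at a steady decode rate."""
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    # 512 matches the -n limit passed to llama-server in the manifest above.
    rates = [
        ("4 vCPU, low end", 5),
        ("4 vCPU, high end", 8),
        ("8 vCPU, low end", 10),
        ("8 vCPU, high end", 15),
    ]
    for label, rate in rates:
        print(f"{label}: {generation_seconds(512, rate):.0f} s")
```

At these rates a full 512-token response takes roughly 34 to 102 seconds, which is fine for a batch job but far too slow for an interactive session.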
Add an Ingress for External Access
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: phi3-ingress
  namespace: ai-inference
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  rules:
    - host: ai.internal.mycompany.com
      http:
        paths:
          - path: /v1
            pathType: Prefix
            backend:
              service:
                name: vllm-phi3
                port:
                  number: 8000

Cost Comparison (AWS, ap-south-1)
| Node Type | VRAM/RAM | Phi-3 Performance | Cost/hour |
|---|---|---|---|
| g4dn.xlarge | 16GB GPU | Fast (~50 tok/s) | ~$0.52 |
| c5.2xlarge (CPU) | 16GB RAM | Slow (~10 tok/s) | ~$0.34 |
| m5.xlarge (CPU) | 16GB RAM | Slow (~8 tok/s) | ~$0.19 |
For internal tooling and automation, where latency is less critical, CPU deployment on an m5.xlarge has the lowest hourly cost.
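Hourly price is only half of the picture. Dividing it by throughput gives a cost per token. The sketch below applies that to the approximate figures in the table above, assuming each node is kept fully busy at the listed single-stream rate. The function name is made up for illustration.

```python
def cost_per_million_tokens(dollars_per_hour: float, tokens_per_second: float) -> float:
    """Dollars to generate one million tokens at full, sustained utilization."""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # (instance, $/hour, tokens/second) copied from the table above
    rows = [
        ("g4dn.xlarge", 0.52, 50),
        ("c5.2xlarge", 0.34, 10),
        ("m5.xlarge", 0.19, 8),
    ]
    for name, price, rate in rows:
        print(f"{name}: ${cost_per_million_tokens(price, rate):.2f} per 1M tokens")
```

With these inputs the g4dn.xlarge works out to about $2.89 per million tokens, the m5.xlarge to about $6.60, and the c5.2xlarge to about $9.44. A busy GPU node is therefore cheaper per token, while the m5.xlarge stays cheapest per hour, which is what matters when the node sits idle most of the day.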
Phi-3 Mini is one of the best models for DevOps automation tasks (generating configs, explaining errors, writing scripts) because it's small enough to run cheaply on your own infrastructure while being accurate enough for technical content.