How to Run Ollama on Kubernetes — Self-Host LLMs in Your Cluster (2026)
Ollama makes running LLMs locally easy. Running it on Kubernetes makes it scalable, persistent, and accessible to your whole team or application stack. Here's the complete setup — CPU and GPU, with persistent model storage and a production-ready deployment.
Everyone's running LLMs — but paying OpenAI API costs at scale gets expensive fast. Ollama on Kubernetes gives you a private, self-hosted LLM inference server running Llama 3, Mistral, Phi-3, or any GGUF model — inside your own cluster.
Here's the full setup from zero to a working deployment.
What You're Building
```
kubectl → Ollama Deployment (GPU or CPU pod)
├── Persistent Volume (model storage, ~5-50 GB per model)
├── Service (internal API on port 11434)
└── Optional: Ingress (expose to team or app)
```
Your apps call http://ollama.ollama.svc.cluster.local:11434/api/generate — the same API as running Ollama locally, but running inside Kubernetes, accessible from any pod.
Prerequisites
- Kubernetes cluster (EKS, GKE, AKS, or local kind/minikube)
- kubectl configured
- At least 4 CPU cores and 8 GB RAM for a small CPU-based model
- For GPU: NVIDIA GPU node + NVIDIA device plugin installed (covered below)
Part 1: CPU-Based Setup (No GPU Required)
Start here if you don't have GPU nodes. You can run smaller models (Phi-3 mini, Llama 3 8B at Q4 quantization) on CPU — slower, but it works for development and low-traffic use cases.
Namespace
```bash
kubectl create namespace ollama
```

Persistent Volume for Models
Models are large (4–40 GB). Mount a PVC so Ollama doesn't re-download models on every pod restart.
```yaml
# ollama-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3  # use your cluster's StorageClass
  resources:
    requests:
      storage: 50Gi  # enough for 2-3 models
```

```bash
kubectl apply -f ollama-pvc.yaml
```

Deployment (CPU)
```yaml
# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            requests:
              cpu: "2"
              memory: "8Gi"
            limits:
              cpu: "4"
              memory: "16Gi"
          volumeMounts:
            - name: models
              mountPath: /root/.ollama  # Ollama stores models here
          env:
            - name: OLLAMA_HOST
              value: "0.0.0.0"  # listen on all interfaces
          readinessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models
```

```bash
kubectl apply -f ollama-deployment.yaml
```

Service
```yaml
# ollama-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
  type: ClusterIP
```

```bash
kubectl apply -f ollama-service.yaml
```

Pull a Model
```bash
# Exec into the Ollama pod and pull a model
kubectl exec -n ollama deploy/ollama -- ollama pull phi3:mini

# Or pull llama3 (4.7 GB)
kubectl exec -n ollama deploy/ollama -- ollama pull llama3:8b
```

Wait for the pull to complete — models are stored in the PVC.
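Exec-ing in works for one-off pulls. For a repeatable setup, one option (a sketch, not part of the original manifests — the Job name and model tag are placeholders) is a Kubernetes Job that mounts the same PVC. Note that with a ReadWriteOnce volume the Job must schedule onto the same node as the Ollama pod, or run while the Deployment is scaled to zero.

```yaml
# ollama-pull-job.yaml (illustrative)
apiVersion: batch/v1
kind: Job
metadata:
  name: ollama-pull-llama3
  namespace: ollama
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: pull
          image: ollama/ollama:latest
          # Start the server in the background, wait briefly, then pull.
          # Models land in the shared PVC, so the main pod can use them.
          command: ["/bin/sh", "-c"]
          args:
            - "ollama serve & sleep 5 && ollama pull llama3:8b"
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models
```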
Test It
```bash
# Port-forward to test locally
kubectl port-forward -n ollama svc/ollama 11434:11434

# In another terminal
curl http://localhost:11434/api/generate -d '{
  "model": "phi3:mini",
  "prompt": "Explain Kubernetes in 3 sentences.",
  "stream": false
}'
```

Part 2: GPU Setup on EKS
GPU inference is 10–100x faster than CPU for LLMs. This section sets up Ollama on a GPU node group in EKS.
Step 1: Add a GPU Node Group in Terraform
```hcl
# In your EKS Terraform config
resource "aws_eks_node_group" "gpu" {
  cluster_name    = module.eks.cluster_name
  node_group_name = "gpu"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = module.vpc.private_subnets

  instance_types = ["g4dn.xlarge"]  # 1x NVIDIA T4, 16 GB VRAM, ~$0.52/hr

  scaling_config {
    desired_size = 1
    min_size     = 0
    max_size     = 2
  }

  labels = {
    "nvidia.com/gpu" = "true"
  }

  taint {
    key    = "nvidia.com/gpu"
    value  = "true"
    effect = "NO_SCHEDULE"  # only GPU-requesting pods land here
  }
}
```

Step 2: Install NVIDIA Device Plugin
```bash
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
```

Verify GPU is visible:
```bash
kubectl get nodes -l nvidia.com/gpu=true
kubectl describe node <gpu-node> | grep nvidia.com/gpu
#   Allocatable: nvidia.com/gpu: 1
```

Step 3: GPU Deployment
```yaml
# ollama-gpu-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      nodeSelector:
        nvidia.com/gpu: "true"
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            requests:
              cpu: "2"
              memory: "8Gi"
              nvidia.com/gpu: "1"  # request 1 GPU
            limits:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
          env:
            - name: OLLAMA_HOST
              value: "0.0.0.0"
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models
```

With a T4 GPU, Llama 3 8B runs at ~30 tokens/second — fast enough for real application use.
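You can verify the tokens/second figure on your own hardware: each non-streaming response from Ollama's /api/generate includes `eval_count` (tokens generated) and `eval_duration` (in nanoseconds). A minimal sketch, using made-up sample numbers in the T4 ballpark:

```python
def tokens_per_second(resp: dict) -> float:
    """Compute generation speed from an Ollama /api/generate response."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Example: 300 tokens generated in 10 seconds
sample = {"eval_count": 300, "eval_duration": 10_000_000_000}
print(tokens_per_second(sample))  # 30.0
```

Run a few representative prompts and average the result; short prompts understate real throughput.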
Part 3: Calling Ollama from Your Application
Any pod in the cluster can call Ollama via the Service DNS name.
Python (with requests)
```python
import requests

response = requests.post(
    "http://ollama.ollama.svc.cluster.local:11434/api/generate",
    json={
        "model": "llama3:8b",
        "prompt": "Write a Kubernetes health check for a Node.js app",
        "stream": False,
    },
)
print(response.json()["response"])
```

Using OpenAI-compatible endpoint
Ollama exposes an OpenAI-compatible API at /v1/. This means you can use the OpenAI Python SDK without changing your code:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://ollama.ollama.svc.cluster.local:11434/v1",
    api_key="ollama",  # required but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3:8b",
    messages=[
        {"role": "user", "content": "Explain Terraform state locking in 2 sentences."}
    ],
)
print(response.choices[0].message.content)
```

This drop-in compatibility means you can switch between OpenAI and self-hosted Ollama by just changing the base_url.
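In practice that switch is usually driven by configuration rather than code. A small sketch of one way to do it; the `LLM_BASE_URL` and `LLM_API_KEY` variable names are illustrative, not an Ollama or OpenAI convention:

```python
def llm_config(env: dict) -> tuple:
    """Resolve (base_url, api_key) for the OpenAI SDK from env-style settings.

    If LLM_BASE_URL is set, target that endpoint (e.g. in-cluster Ollama);
    otherwise fall back to OpenAI's public API.
    """
    base_url = env.get("LLM_BASE_URL")
    if base_url:
        # Self-hosted Ollama: a key is required by the SDK but ignored by Ollama
        return base_url, env.get("LLM_API_KEY", "ollama")
    return "https://api.openai.com/v1", env["OPENAI_API_KEY"]

# Point at in-cluster Ollama:
print(llm_config({"LLM_BASE_URL": "http://ollama.ollama.svc.cluster.local:11434/v1"}))
```

Pass the resulting pair to `OpenAI(base_url=..., api_key=...)` and the rest of the application code stays identical across backends.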
Part 4: Optional — Open WebUI (ChatGPT-like Interface)
Deploy Open WebUI to give your team a browser-based chat interface connected to your Ollama backend.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
        - name: open-webui
          image: ghcr.io/open-webui/open-webui:main
          ports:
            - containerPort: 8080
          env:
            - name: OLLAMA_BASE_URL
              value: "http://ollama:11434"
          volumeMounts:
            - name: data
              mountPath: /app/backend/data
      volumes:
        - name: data
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: open-webui
  namespace: ollama
spec:
  selector:
    app: open-webui
  ports:
    - port: 8080
      targetPort: 8080
  type: ClusterIP
```

Port-forward to access: `kubectl port-forward -n ollama svc/open-webui 8080:8080`
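The architecture diagram above lists an optional Ingress for exposing the UI to your team. A minimal sketch; the hostname and ingress class are placeholders, and anything exposed beyond the cluster should sit behind authentication (Open WebUI has its own login, but network-level auth or a VPN is safer):

```yaml
# open-webui-ingress.yaml (illustrative -- adjust host and ingressClassName)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: open-webui
  namespace: ollama
spec:
  ingressClassName: nginx
  rules:
    - host: chat.example.internal
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: open-webui
                port:
                  number: 8080
```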
Resource Requirements by Model
| Model | Size (Q4) | Min RAM | Min VRAM | Tokens/sec (CPU) | Tokens/sec (T4 GPU) |
|---|---|---|---|---|---|
| Phi-3 Mini 3.8B | 2.3 GB | 4 GB | 4 GB | 8-12 | 60-80 |
| Llama 3 8B | 4.7 GB | 8 GB | 8 GB | 3-5 | 30-40 |
| Mistral 7B | 4.1 GB | 8 GB | 8 GB | 4-6 | 35-45 |
| Llama 3 70B | 40 GB | 64 GB | 40 GB (A100) | 0.5-1 | needs A100 |
| CodeLlama 13B | 7.4 GB | 16 GB | 16 GB | 2-3 | 20-25 |
For most use cases: Llama 3 8B or Phi-3 Mini on a g4dn.xlarge ($0.52/hr) gives excellent quality at low cost.
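The VRAM column of the table can be turned into a quick sizing check. A sketch; the numbers are the Q4 minimums from the table above, and the model tags are the common Ollama names:

```python
# Minimum VRAM (GB) per model at Q4 quantization, from the table above
MIN_VRAM_GB = {
    "phi3:mini": 4,
    "llama3:8b": 8,
    "mistral:7b": 8,
    "codellama:13b": 16,
    "llama3:70b": 40,
}

def models_that_fit(vram_gb: int) -> list:
    """Return model tags whose minimum VRAM fits the given budget."""
    return sorted(m for m, need in MIN_VRAM_GB.items() if need <= vram_gb)

print(models_that_fit(16))  # a g4dn.xlarge's T4 has 16 GB
```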
Cost Comparison
| Option | Monthly cost (1M tokens/day) | Latency |
|---|---|---|
| OpenAI GPT-4o | ~$600–1,500 | 1-3s |
| OpenAI GPT-4o-mini | ~$30–60 | 0.5-1s |
| Self-hosted Llama 3 8B (g4dn.xlarge) | ~$15–20 | 0.3-1s |
| Self-hosted Phi-3 Mini (CPU only) | ~$5–10 (smaller instances) | 3-10s |
For internal tooling, dev assistants, and batch jobs — self-hosted Ollama pays for itself quickly.
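The break-even math behind the table is easy to redo with your own numbers. A sketch; the $2.50/M token price is a made-up illustration, not a quote, and the low end of the self-hosted range assumes aggressive scale-to-zero:

```python
def api_cost_per_month(tokens_per_day: int, usd_per_million_tokens: float) -> float:
    """Monthly API bill for a given daily token volume (30-day month)."""
    return tokens_per_day * 30 / 1_000_000 * usd_per_million_tokens

def self_hosted_cost_per_month(instance_usd_per_hour: float, hours_per_day: float = 24) -> float:
    """Monthly instance cost; scale-to-zero reduces hours_per_day."""
    return instance_usd_per_hour * hours_per_day * 30

# 1M tokens/day at a hypothetical $2.50/M, vs a g4dn.xlarge ($0.52/hr):
print(api_cost_per_month(1_000_000, 2.50))                 # 75.0
print(self_hosted_cost_per_month(0.52))                    # ~374 running 24/7
print(self_hosted_cost_per_month(0.52, hours_per_day=1))   # ~15.6 with scale-to-zero
```

Running the GPU around the clock costs far more than the table's low end; the self-hosted figures only win if the node scales down when idle or the volume is high enough to saturate it.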
Troubleshooting
Pod stuck in Pending:
```bash
kubectl describe pod -n ollama -l app=ollama
# If GPU: check "insufficient nvidia.com/gpu" → GPU node not ready or device plugin not installed
```

Model pull times out:
Models can be large. The first pull takes time. Use a larger initialDelaySeconds in readinessProbe (60-120s).
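An alternative to a long initialDelaySeconds is a startupProbe, which holds off the readiness check until the server first responds. A sketch to merge into the ollama container spec; tune the thresholds to your startup times:

```yaml
# Gives the server up to 30 x 10s = 5 minutes to come up
# before probe failures start counting against the pod
startupProbe:
  httpGet:
    path: /api/tags
    port: 11434
  periodSeconds: 10
  failureThreshold: 30
```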
Out of memory: Increase memory limits. Llama 3 8B needs at least 8 GB. Set limits generously.
Slow inference on CPU: Expected — CPU inference is slow. Upgrade to a GPU node or use a smaller model (Phi-3 Mini).
Related: What is MLOps — Complete Guide | Kubernetes Resource Calculator