Run OpenWebUI + Ollama on Kubernetes — Self-Hosted ChatGPT (2026)
Deploy your own ChatGPT-like interface on Kubernetes using Ollama for local LLMs and OpenWebUI for the frontend. Full setup with GPU support and persistent storage.
OpenWebUI is an open-source ChatGPT-like interface. Ollama serves LLMs locally. Together on Kubernetes, you get a private, self-hosted AI assistant with no data leaving your cluster. Here's the full setup.
What You'll Build
Users → OpenWebUI (web app) → Ollama (LLM server) → GPU nodes
                                    ↓
                 Persistent storage (chat history, models)
What this gives you:
- ChatGPT-like interface for your team
- Models run locally — no API costs, no data privacy concerns
- Supports Llama 3, Mistral, Phi-3, Gemma, and 100+ models
- OpenAI-compatible API (works with existing tools)
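Because Ollama speaks the OpenAI wire format, existing SDKs and scripts can point at it unchanged. A minimal sketch of what that looks like, assuming you have port-forwarded the Ollama service built later in this guide and pulled llama3.2:3b:

```shell
# Ollama exposes an OpenAI-compatible endpoint at /v1/chat/completions.
# Assumes: kubectl port-forward svc/ollama 11434:11434 -n ai
BODY='{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hello"}]}'

# Sanity-check the request body is valid JSON before sending
echo "$BODY" | python3 -m json.tool > /dev/null && echo "payload ok"

# Send the request (uncomment once the port-forward is running):
# curl -s http://localhost:11434/v1/chat/completions \
#   -H 'Content-Type: application/json' -d "$BODY"
```

Any tool that accepts a custom OpenAI base URL can be pointed at `http://localhost:11434/v1` the same way.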
Prerequisites
- Kubernetes cluster (minikube works for CPU, EKS/GKE for GPU)
- kubectl + helm installed
- For GPU: NVIDIA drivers + device plugin installed
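A quick way to confirm the tooling is present before starting. This is a sketch; the GPU check is commented out because it needs cluster access:

```shell
# Check that the CLI prerequisites are on PATH
for tool in kubectl helm; do
  if command -v "$tool" > /dev/null; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done

# For GPU clusters, also confirm the NVIDIA device plugin is running:
# kubectl get pods -n kube-system | grep nvidia-device-plugin
```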
Step 1: Create Namespace and Storage
kubectl create namespace ai
# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: ai
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 50Gi # models are large — Llama 3 8B = 4.7GB
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: openwebui-data
  namespace: ai
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 5Gi
kubectl apply -f pvc.yaml
Step 2: Deploy Ollama
CPU version (no GPU)
# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          requests:
            cpu: "2"
            memory: "8Gi"
          limits:
            cpu: "4"
            memory: "16Gi"
        volumeMounts:
        - name: models
          mountPath: /root/.ollama
        env:
        - name: OLLAMA_KEEP_ALIVE
          value: "24h" # keep models loaded in memory
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: ollama-models
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
GPU version (NVIDIA)
# For GPU nodes — add to container spec:
resources:
  requests:
    cpu: "2"
    memory: "8Gi"
    nvidia.com/gpu: 1 # ← request 1 GPU
  limits:
    nvidia.com/gpu: 1
# Verify NVIDIA device plugin is installed
kubectl get pods -n kube-system | grep nvidia
Step 3: Pull a Model
kubectl apply -f ollama-deployment.yaml
# Wait for pod to be running
kubectl get pods -n ai -w
# Pull a model (do this once — stored on PVC)
kubectl exec -n ai deploy/ollama -- ollama pull llama3.2:3b
# Available models: llama3.2, mistral, phi3, gemma2, codellama
# Check model sizes before pulling:
# llama3.2:1b → 1.3 GB (fast, less capable)
# llama3.2:3b → 2.0 GB (good balance)
# llama3.1:8b → 4.7 GB (good quality, needs 8GB RAM)
# llama3.1:70b → 40 GB (excellent, needs GPU)
Step 4: Deploy OpenWebUI
# openwebui-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openwebui
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: openwebui
  template:
    metadata:
      labels:
        app: openwebui
    spec:
      containers:
      - name: openwebui
        image: ghcr.io/open-webui/open-webui:main
        ports:
        - containerPort: 8080
        env:
        - name: OLLAMA_BASE_URL
          value: "http://ollama:11434" # ← points to Ollama service
        - name: WEBUI_SECRET_KEY
          value: "change-this-to-a-random-secret"
        - name: DEFAULT_USER_ROLE
          value: "user"
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "1Gi"
        volumeMounts:
        - name: data
          mountPath: /app/backend/data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: openwebui-data
---
apiVersion: v1
kind: Service
metadata:
  name: openwebui
  namespace: ai
spec:
  selector:
    app: openwebui
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP
Step 5: Expose OpenWebUI
Option A: Port Forward (local/dev)
kubectl port-forward svc/openwebui 8080:80 -n ai
# Open http://localhost:8080
Option B: Ingress (production)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: openwebui-ingress
  namespace: ai
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - chat.yourdomain.com
    secretName: openwebui-tls
  rules:
  - host: chat.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: openwebui
            port:
              number: 80
kubectl apply -f openwebui-deployment.yaml
kubectl apply -f ingress.yaml
Step 6: First Login
- Open OpenWebUI at http://localhost:8080
- First user becomes admin — register immediately
- Settings → Models → verify Ollama models are listed
- Start chatting!
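Before inviting the team, replace the placeholder WEBUI_SECRET_KEY from Step 4: OpenWebUI uses it to sign session tokens. A sketch of generating a random value and moving it into a Kubernetes Secret (the Secret name `openwebui-secret` is an illustrative choice, not required by OpenWebUI):

```shell
# Generate a random value for WEBUI_SECRET_KEY (32 bytes = 64 hex characters)
SECRET=$(openssl rand -hex 32)
echo "generated ${#SECRET}-character secret"

# Store it in a Kubernetes Secret instead of plaintext YAML
# (requires cluster access, so commented out here):
# kubectl create secret generic openwebui-secret -n ai \
#   --from-literal=WEBUI_SECRET_KEY="$SECRET"
```

In the Deployment, swap the hard-coded `value:` for a `valueFrom.secretKeyRef` pointing at that Secret, so the key never lands in version control.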
Add More Models
# Pull more models
kubectl exec -n ai deploy/ollama -- ollama pull mistral:7b
kubectl exec -n ai deploy/ollama -- ollama pull codellama:7b
kubectl exec -n ai deploy/ollama -- ollama pull phi3:mini
# List downloaded models
kubectl exec -n ai deploy/ollama -- ollama list
Helm Install (Alternative — One Command)
If you prefer Helm:
helm repo add open-webui https://helm.openwebui.com/
helm repo update
helm install open-webui open-webui/open-webui \
  --namespace ai \
  --create-namespace \
  --set ollama.enabled=true \
  --set persistence.size=50Gi \
  --set ingress.enabled=true \
  --set ingress.host=chat.yourdomain.com
Resource Requirements
| Model | RAM needed | GPU needed | Speed (CPU) |
|---|---|---|---|
| llama3.2:1b | 4 GB | No | Fast |
| llama3.2:3b | 6 GB | No | Moderate |
| llama3.1:8b | 10 GB | Recommended | Slow on CPU |
| llama3.1:70b | 48 GB | Required | N/A |
| mistral:7b | 8 GB | Recommended | Slow on CPU |
For a team of 5-10 using it occasionally: llama3.2:3b on a 4-core/8GB node works well.
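The file sizes quoted earlier follow a rough pattern you can apply to models not in the table: at Ollama's default 4-bit quantization, the download is on the order of 0.6 GB per billion parameters, and the model needs roughly that much RAM plus a few GB of overhead. This is a rule of thumb, not an official formula:

```shell
# Estimate Q4-quantized file size from parameter count (rule of thumb: ~0.6 GB/B)
for params in 1 3 8 70; do
  awk -v p="$params" 'BEGIN { printf "%sB params -> ~%.1f GB file\n", p, p * 0.6 }'
done
```

The estimates (0.6, 1.8, 4.8, 42.0 GB) land close to the real download sizes listed in Step 3, so it is a serviceable first check before pulling a model onto a small node.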
Use Cases
- Team AI assistant — internal chatbot with no data leaving
- Code review — load Codellama for code questions
- Document Q&A — OpenWebUI supports RAG with uploaded PDFs
- Development playground — test prompts before using Claude/GPT API in production
Resources
- OpenWebUI Docs — full documentation
- Ollama Models — browse available models
- KodeKloud MLOps Path — Kubernetes + AI/ML labs