
Run OpenWebUI + Ollama on Kubernetes — Self-Hosted ChatGPT (2026)

Deploy your own ChatGPT-like interface on Kubernetes using Ollama for local LLMs and OpenWebUI for the frontend. Full setup with GPU support and persistent storage.

DevOpsBoys · Apr 15, 2026 · 4 min read

OpenWebUI is an open-source ChatGPT-like interface. Ollama serves LLMs locally. Together on Kubernetes, you get a private, self-hosted AI assistant with no data leaving your cluster. Here's the full setup.

What You'll Build

```
Users → OpenWebUI (web app) → Ollama (LLM server) → GPU nodes
                ↓
        Persistent storage (chat history, models)
```

What this gives you:

  • ChatGPT-like interface for your team
  • Models run locally — no API costs, no data privacy concerns
  • Supports Llama 3, Mistral, Phi-3, Gemma, and 100+ models
  • OpenAI-compatible API (works with existing tools)
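
Because Ollama speaks the OpenAI chat-completions protocol, existing clients can point at it unchanged once the stack is running. A minimal sketch, reaching the `ollama` Service (deployed in Step 2) through a port-forward:

```shell
# Forward the in-cluster Ollama service locally (run in a separate terminal)
kubectl port-forward -n ai svc/ollama 11434:11434

# Chat via the OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```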

Prerequisites

  • Kubernetes cluster (minikube works for CPU, EKS/GKE for GPU)
  • kubectl + helm installed
  • For GPU: NVIDIA drivers + device plugin installed
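
A quick preflight check before starting (the last command only matters for GPU nodes, and prints nothing if no GPUs are advertised):

```shell
# Tooling and cluster reachability
kubectl version --client && helm version --short
kubectl get nodes

# For GPU: confirm the NVIDIA device plugin advertises GPUs as allocatable
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
```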

Step 1: Create Namespace and Storage

```bash
kubectl create namespace ai
```

```yaml
# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: ai
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 50Gi    # models are large — Llama 3 8B = 4.7GB
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: openwebui-data
  namespace: ai
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 5Gi
```

```bash
kubectl apply -f pvc.yaml
```

Step 2: Deploy Ollama

CPU version (no GPU)

```yaml
# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          requests:
            cpu: "2"
            memory: "8Gi"
          limits:
            cpu: "4"
            memory: "16Gi"
        volumeMounts:
        - name: models
          mountPath: /root/.ollama
        env:
        - name: OLLAMA_KEEP_ALIVE
          value: "24h"    # keep models loaded in memory
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: ollama-models
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
```
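
The deployment above ships without health checks. Ollama answers plain HTTP on its API port, so a probe sketch like the following can be added to the container spec (the delay and period values are my own starting points; tune them for your model sizes):

```yaml
livenessProbe:
  httpGet:
    path: /
    port: 11434
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /
    port: 11434
  initialDelaySeconds: 5
  periodSeconds: 10
```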

GPU version (NVIDIA)

```yaml
# For GPU nodes — add to container spec:
resources:
  requests:
    cpu: "2"
    memory: "8Gi"
    nvidia.com/gpu: 1   # ← request 1 GPU
  limits:
    nvidia.com/gpu: 1
```

```bash
# Verify NVIDIA device plugin is installed
kubectl get pods -n kube-system | grep nvidia
```

Step 3: Pull a Model

```bash
kubectl apply -f ollama-deployment.yaml

# Wait for pod to be running
kubectl get pods -n ai -w

# Pull a model (do this once — stored on PVC)
kubectl exec -n ai deploy/ollama -- ollama pull llama3.2:3b

# Available models: llama3.2, mistral, phi3, gemma2, codellama
# Check model sizes before pulling:
# llama3.2:1b  → 1.3 GB (fast, less capable)
# llama3.2:3b  → 2.0 GB (good balance)
# llama3.1:8b  → 4.7 GB (good quality, needs 8GB RAM)
# llama3.1:70b → 40 GB  (excellent, needs GPU)
```
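
Once the pull finishes, it's worth a quick smoke test against Ollama's native API before wiring up the UI. A sketch via port-forward:

```shell
# Forward the Ollama service locally (run in a separate terminal)
kubectl port-forward -n ai svc/ollama 11434:11434

# Ask for a single non-streamed completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Say hello in one word.",
  "stream": false
}'
```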

Step 4: Deploy OpenWebUI

```yaml
# openwebui-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openwebui
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: openwebui
  template:
    metadata:
      labels:
        app: openwebui
    spec:
      containers:
      - name: openwebui
        image: ghcr.io/open-webui/open-webui:main
        ports:
        - containerPort: 8080
        env:
        - name: OLLAMA_BASE_URL
          value: "http://ollama:11434"   # ← points to Ollama service
        - name: WEBUI_SECRET_KEY
          value: "change-this-to-a-random-secret"
        - name: DEFAULT_USER_ROLE
          value: "user"
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "1Gi"
        volumeMounts:
        - name: data
          mountPath: /app/backend/data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: openwebui-data
---
apiVersion: v1
kind: Service
metadata:
  name: openwebui
  namespace: ai
spec:
  selector:
    app: openwebui
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP
```
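
Note that `WEBUI_SECRET_KEY` is hardcoded above for brevity. In practice, generate a random value locally and keep it out of the manifest:

```shell
# Generate a random 64-character hex secret for WEBUI_SECRET_KEY
SECRET="$(openssl rand -hex 32)"
printf '%s' "$SECRET" | wc -c    # 64
```

Store it with `kubectl create secret generic openwebui-secret -n ai --from-literal=WEBUI_SECRET_KEY="$SECRET"` and reference it from the deployment via `valueFrom.secretKeyRef` instead of a literal `value:` (the Secret and key names here are my own choices).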

Step 5: Expose OpenWebUI

Option A: Port Forward (local/dev)

```bash
kubectl port-forward svc/openwebui 8080:80 -n ai
# Open http://localhost:8080
```

Option B: Ingress (production)

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: openwebui-ingress
  namespace: ai
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - chat.yourdomain.com
    secretName: openwebui-tls
  rules:
  - host: chat.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: openwebui
            port:
              number: 80
```

```bash
kubectl apply -f openwebui-deployment.yaml
kubectl apply -f ingress.yaml
```

Step 6: First Login

  1. Open OpenWebUI at http://localhost:8080
  2. First user becomes admin — register immediately
  3. Settings → Models → verify Ollama models are listed
  4. Start chatting!
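
Since the first registered user becomes admin, you may want to close open self-registration once your team has signed up. OpenWebUI reads an `ENABLE_SIGNUP` environment variable for this (treat the variable name as an assumption and check the current OpenWebUI docs):

```yaml
# Add to the openwebui container's env:
- name: ENABLE_SIGNUP
  value: "false"
```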

Add More Models

```bash
# Pull more models
kubectl exec -n ai deploy/ollama -- ollama pull mistral:7b
kubectl exec -n ai deploy/ollama -- ollama pull codellama:7b
kubectl exec -n ai deploy/ollama -- ollama pull phi3:mini

# List downloaded models
kubectl exec -n ai deploy/ollama -- ollama list
```
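
Models accumulate on the 50Gi PVC, so prune ones you no longer use. A sketch (the `df` check assumes the image ships coreutils):

```shell
# Remove a model you no longer need
kubectl exec -n ai deploy/ollama -- ollama rm codellama:7b

# Check remaining space on the model volume
kubectl exec -n ai deploy/ollama -- df -h /root/.ollama
```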

Helm Install (Alternative — One Command)

If you prefer Helm:

```bash
helm repo add open-webui https://helm.openwebui.com/
helm repo update

helm install open-webui open-webui/open-webui \
  --namespace ai \
  --create-namespace \
  --set ollama.enabled=true \
  --set persistence.size=50Gi \
  --set ingress.enabled=true \
  --set ingress.host=chat.yourdomain.com
```
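
The same flags can live in a values file, which is easier to keep in git (key names assumed to match the chart's defaults; verify with `helm show values open-webui/open-webui`):

```yaml
# values.yaml
ollama:
  enabled: true
persistence:
  size: 50Gi
ingress:
  enabled: true
  host: chat.yourdomain.com
```

Apply it with `helm upgrade --install open-webui open-webui/open-webui -n ai -f values.yaml`.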

Resource Requirements

| Model | RAM needed | GPU needed | Speed (CPU) |
|---|---|---|---|
| llama3.2:1b | 4 GB | No | Fast |
| llama3.2:3b | 6 GB | No | Moderate |
| llama3.1:8b | 10 GB | Recommended | Slow on CPU |
| llama3.1:70b | 48 GB | Required | N/A |
| mistral:7b | 8 GB | Recommended | Slow on CPU |

For a team of 5-10 using it occasionally: llama3.2:3b on a 4-core/8GB node works well.


Use Cases

  • Team AI assistant — internal chatbot with no data leaving
  • Code review — load Codellama for code questions
  • Document Q&A — OpenWebUI supports RAG with uploaded PDFs
  • Development playground — test prompts before using Claude/GPT API in production
