
Run OpenWebUI + Ollama on Kubernetes — Self-Hosted ChatGPT (2026)

Deploy your own ChatGPT-like interface on Kubernetes using Ollama for local LLMs and OpenWebUI for the frontend. Full setup with GPU support and persistent storage.

DevOpsBoys · Apr 15, 2026 · 4 min read

OpenWebUI is an open-source ChatGPT-like interface. Ollama serves LLMs locally. Together on Kubernetes, you get a private, self-hosted AI assistant with no data leaving your cluster. Here's the full setup.

What You'll Build

```
Users → OpenWebUI (web app) → Ollama (LLM server) → GPU nodes
                ↓
        Persistent storage (chat history, models)
```

What this gives you:

  • ChatGPT-like interface for your team
  • Models run locally — no API costs, no data privacy concerns
  • Supports Llama 3, Mistral, Phi-3, Gemma, and 100+ models
  • OpenAI-compatible API (works with existing tools)
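
Because Ollama speaks the OpenAI chat-completions protocol, existing clients can point at it unchanged once the stack is running. A minimal sketch, reaching the `ollama` Service (deployed in Step 2) through a port-forward:

```shell
# Forward the in-cluster Ollama service locally (run in a separate terminal)
kubectl port-forward -n ai svc/ollama 11434:11434

# Chat via the OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```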

Prerequisites

  • Kubernetes cluster (minikube works for CPU, EKS/GKE for GPU)
  • kubectl + helm installed
  • For GPU: NVIDIA drivers + device plugin installed
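
A quick preflight check before starting (the last command only matters for GPU nodes, and prints nothing if no GPUs are advertised):

```shell
# Tooling and cluster reachability
kubectl version --client && helm version --short
kubectl get nodes

# For GPU: confirm the NVIDIA device plugin advertises GPUs as allocatable
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
```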

Step 1: Create Namespace and Storage

```bash
kubectl create namespace ai
```

```yaml
# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: ai
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 50Gi    # models are large — Llama 3 8B = 4.7GB
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: openwebui-data
  namespace: ai
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 5Gi
```

```bash
kubectl apply -f pvc.yaml
```

Step 2: Deploy Ollama

CPU version (no GPU)

```yaml
# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          requests:
            cpu: "2"
            memory: "8Gi"
          limits:
            cpu: "4"
            memory: "16Gi"
        volumeMounts:
        - name: models
          mountPath: /root/.ollama
        env:
        - name: OLLAMA_KEEP_ALIVE
          value: "24h"    # keep models loaded in memory
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: ollama-models
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
```
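
The deployment above ships without health checks. Ollama answers plain HTTP on its API port, so a probe sketch like the following can be added to the container spec (the delay and period values are my own starting points; tune them for your model sizes):

```yaml
livenessProbe:
  httpGet:
    path: /
    port: 11434
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /
    port: 11434
  initialDelaySeconds: 5
  periodSeconds: 10
```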

GPU version (NVIDIA)

```yaml
# For GPU nodes — add to container spec:
resources:
  requests:
    cpu: "2"
    memory: "8Gi"
    nvidia.com/gpu: 1   # ← request 1 GPU
  limits:
    nvidia.com/gpu: 1
```

```bash
# Verify NVIDIA device plugin is installed
kubectl get pods -n kube-system | grep nvidia
```

Step 3: Pull a Model

```bash
kubectl apply -f ollama-deployment.yaml

# Wait for pod to be running
kubectl get pods -n ai -w

# Pull a model (do this once — stored on PVC)
kubectl exec -n ai deploy/ollama -- ollama pull llama3.2:3b

# Available models: llama3.2, mistral, phi3, gemma2, codellama
# Check model sizes before pulling:
# llama3.2:1b  → 1.3 GB (fast, less capable)
# llama3.2:3b  → 2.0 GB (good balance)
# llama3.1:8b  → 4.7 GB (good quality, needs 8GB RAM)
# llama3.1:70b → 40 GB  (excellent, needs GPU)
```
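
Once the pull finishes, it's worth a quick smoke test against Ollama's native API before wiring up the UI. A sketch via port-forward:

```shell
# Forward the Ollama service locally (run in a separate terminal)
kubectl port-forward -n ai svc/ollama 11434:11434

# Ask for a single non-streamed completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Say hello in one word.",
  "stream": false
}'
```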

Step 4: Deploy OpenWebUI

```yaml
# openwebui-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openwebui
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: openwebui
  template:
    metadata:
      labels:
        app: openwebui
    spec:
      containers:
      - name: openwebui
        image: ghcr.io/open-webui/open-webui:main
        ports:
        - containerPort: 8080
        env:
        - name: OLLAMA_BASE_URL
          value: "http://ollama:11434"   # ← points to Ollama service
        - name: WEBUI_SECRET_KEY
          value: "change-this-to-a-random-secret"
        - name: DEFAULT_USER_ROLE
          value: "user"
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "1Gi"
        volumeMounts:
        - name: data
          mountPath: /app/backend/data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: openwebui-data
---
apiVersion: v1
kind: Service
metadata:
  name: openwebui
  namespace: ai
spec:
  selector:
    app: openwebui
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP
```
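
Note that `WEBUI_SECRET_KEY` is hardcoded above for brevity. In practice, generate a random value locally and keep it out of the manifest:

```shell
# Generate a random 64-character hex secret for WEBUI_SECRET_KEY
SECRET="$(openssl rand -hex 32)"
printf '%s' "$SECRET" | wc -c    # 64
```

Store it with `kubectl create secret generic openwebui-secret -n ai --from-literal=WEBUI_SECRET_KEY="$SECRET"` and reference it from the deployment via `valueFrom.secretKeyRef` instead of a literal `value:` (the Secret and key names here are my own choices).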

Step 5: Expose OpenWebUI

Option A: Port Forward (local/dev)

```bash
kubectl port-forward svc/openwebui 8080:80 -n ai
# Open http://localhost:8080
```

Option B: Ingress (production)

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: openwebui-ingress
  namespace: ai
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - chat.yourdomain.com
    secretName: openwebui-tls
  rules:
  - host: chat.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: openwebui
            port:
              number: 80
```

```bash
kubectl apply -f openwebui-deployment.yaml
kubectl apply -f ingress.yaml
```

Step 6: First Login

  1. Open OpenWebUI at http://localhost:8080
  2. First user becomes admin — register immediately
  3. Settings → Models → verify Ollama models are listed
  4. Start chatting!
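
Since the first registered user becomes admin, you may want to close open self-registration once your team has signed up. OpenWebUI reads an `ENABLE_SIGNUP` environment variable for this (treat the variable name as an assumption and check the current OpenWebUI docs):

```yaml
# Add to the openwebui container's env:
- name: ENABLE_SIGNUP
  value: "false"
```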

Add More Models

```bash
# Pull more models
kubectl exec -n ai deploy/ollama -- ollama pull mistral:7b
kubectl exec -n ai deploy/ollama -- ollama pull codellama:7b
kubectl exec -n ai deploy/ollama -- ollama pull phi3:mini

# List downloaded models
kubectl exec -n ai deploy/ollama -- ollama list
```
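
Models accumulate on the 50Gi PVC, so prune ones you no longer use. A sketch (the `df` check assumes the image ships coreutils):

```shell
# Remove a model you no longer need
kubectl exec -n ai deploy/ollama -- ollama rm codellama:7b

# Check remaining space on the model volume
kubectl exec -n ai deploy/ollama -- df -h /root/.ollama
```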

Helm Install (Alternative — One Command)

If you prefer Helm:

```bash
helm repo add open-webui https://helm.openwebui.com/
helm repo update

helm install open-webui open-webui/open-webui \
  --namespace ai \
  --create-namespace \
  --set ollama.enabled=true \
  --set persistence.size=50Gi \
  --set ingress.enabled=true \
  --set ingress.host=chat.yourdomain.com
```
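
The same flags can live in a values file, which is easier to keep in git (key names assumed to match the chart's defaults; verify with `helm show values open-webui/open-webui`):

```yaml
# values.yaml
ollama:
  enabled: true
persistence:
  size: 50Gi
ingress:
  enabled: true
  host: chat.yourdomain.com
```

Apply it with `helm upgrade --install open-webui open-webui/open-webui -n ai -f values.yaml`.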

Resource Requirements

| Model | RAM needed | GPU needed | Speed (CPU) |
|---|---|---|---|
| llama3.2:1b | 4 GB | No | Fast |
| llama3.2:3b | 6 GB | No | Moderate |
| llama3.1:8b | 10 GB | Recommended | Slow on CPU |
| llama3.1:70b | 48 GB | Required | N/A |
| mistral:7b | 8 GB | Recommended | Slow on CPU |

For a team of 5-10 using it occasionally: llama3.2:3b on a 4-core/8GB node works well.


Use Cases

  • Team AI assistant — internal chatbot with no data leaving
  • Code review — load Codellama for code questions
  • Document Q&A — OpenWebUI supports RAG with uploaded PDFs
  • Development playground — test prompts before using Claude/GPT API in production
