How to Run Ollama on Kubernetes — Self-Host LLMs in Your Cluster (2026)
Ollama makes running LLMs locally easy. Running it on Kubernetes makes it scalable, persistent, and accessible to your whole team or application stack. Here's the complete setup — CPU and GPU, with persistent model storage and a production-ready deployment.
Everyone's running LLMs — but paying OpenAI API costs at scale gets expensive fast. Ollama on Kubernetes gives you a private, self-hosted LLM inference server running Llama 3, Mistral, Phi-3, or any GGUF model — inside your own cluster.
Here's the full setup from zero to a working deployment.
What You're Building
```
kubectl → Ollama Deployment (GPU or CPU pod)
├── Persistent Volume (model storage, ~5-50 GB per model)
├── Service (internal API on port 11434)
└── Optional: Ingress (expose to team or app)
```
Your apps call http://ollama.ollama.svc.cluster.local:11434/api/generate — the same API as running Ollama locally, but running inside Kubernetes, accessible from any pod.
Prerequisites
- Kubernetes cluster (EKS, GKE, AKS, or local kind/minikube)
- kubectl configured
- At least 4 CPU cores and 8 GB RAM for a small CPU-based model
- For GPU: NVIDIA GPU node + NVIDIA device plugin installed (covered below)
Part 1: CPU-Based Setup (No GPU Required)
Start here if you don't have GPU nodes. You can run smaller models (Phi-3 mini, Llama 3 8B at Q4 quantization) on CPU — slower, but it works for development and low-traffic use cases.
Namespace
```bash
kubectl create namespace ollama
```

Persistent Volume for Models
Models are large (4–40 GB). Mount a PVC so Ollama doesn't re-download models on every pod restart.
```yaml
# ollama-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3  # use your cluster's StorageClass
  resources:
    requests:
      storage: 50Gi  # enough for 2-3 models
```

```bash
kubectl apply -f ollama-pvc.yaml
```

Deployment (CPU)
```yaml
# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            requests:
              cpu: "2"
              memory: "8Gi"
            limits:
              cpu: "4"
              memory: "16Gi"
          volumeMounts:
            - name: models
              mountPath: /root/.ollama  # Ollama stores models here
          env:
            - name: OLLAMA_HOST
              value: "0.0.0.0"  # listen on all interfaces
          readinessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models
```

```bash
kubectl apply -f ollama-deployment.yaml
```

Service
```yaml
# ollama-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
  type: ClusterIP
```

```bash
kubectl apply -f ollama-service.yaml
```

Pull a Model
```bash
# Exec into the Ollama pod and pull a model
kubectl exec -n ollama deploy/ollama -- ollama pull phi3:mini

# Or pull llama3 (4.7 GB)
kubectl exec -n ollama deploy/ollama -- ollama pull llama3:8b
```

Wait for the pull to complete — models are stored in the PVC.
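Exec-ing in works for one-off pulls. For a repeatable setup, one option (a sketch, not part of the original manifests — the Job name and model tag are placeholders) is a Kubernetes Job that mounts the same PVC. Note that with a ReadWriteOnce volume the Job must schedule onto the same node as the Ollama pod, or run while the Deployment is scaled to zero.

```yaml
# ollama-pull-job.yaml (illustrative)
apiVersion: batch/v1
kind: Job
metadata:
  name: ollama-pull-llama3
  namespace: ollama
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: pull
          image: ollama/ollama:latest
          # Start the server in the background, wait briefly, then pull.
          # Models land in the shared PVC, so the main pod can use them.
          command: ["/bin/sh", "-c"]
          args:
            - "ollama serve & sleep 5 && ollama pull llama3:8b"
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models
```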
Test It
```bash
# Port-forward to test locally
kubectl port-forward -n ollama svc/ollama 11434:11434

# In another terminal
curl http://localhost:11434/api/generate -d '{
  "model": "phi3:mini",
  "prompt": "Explain Kubernetes in 3 sentences.",
  "stream": false
}'
```

Part 2: GPU Setup on EKS
GPU inference is 10–100x faster than CPU for LLMs. This section sets up Ollama on a GPU node group in EKS.
Step 1: Add a GPU Node Group in Terraform
```hcl
# In your EKS Terraform config
resource "aws_eks_node_group" "gpu" {
  cluster_name    = module.eks.cluster_name
  node_group_name = "gpu"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = module.vpc.private_subnets

  instance_types = ["g4dn.xlarge"]  # 1x NVIDIA T4, 16 GB VRAM, ~$0.52/hr

  scaling_config {
    desired_size = 1
    min_size     = 0
    max_size     = 2
  }

  labels = {
    "nvidia.com/gpu" = "true"
  }

  taint {
    key    = "nvidia.com/gpu"
    value  = "true"
    effect = "NO_SCHEDULE"  # only GPU-requesting pods land here
  }
}
```

Step 2: Install NVIDIA Device Plugin
```bash
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
```

Verify GPU is visible:
```bash
kubectl get nodes -l nvidia.com/gpu=true
kubectl describe node <gpu-node> | grep nvidia.com/gpu
#   Allocatable: nvidia.com/gpu: 1
```

Step 3: GPU Deployment
```yaml
# ollama-gpu-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      nodeSelector:
        nvidia.com/gpu: "true"
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            requests:
              cpu: "2"
              memory: "8Gi"
              nvidia.com/gpu: "1"  # request 1 GPU
            limits:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
          env:
            - name: OLLAMA_HOST
              value: "0.0.0.0"
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models
```

With a T4 GPU, Llama 3 8B runs at ~30 tokens/second — fast enough for real application use.
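You can verify the tokens/second figure on your own hardware: each non-streaming response from Ollama's /api/generate includes `eval_count` (tokens generated) and `eval_duration` (in nanoseconds). A minimal sketch, using made-up sample numbers in the T4 ballpark:

```python
def tokens_per_second(resp: dict) -> float:
    """Compute generation speed from an Ollama /api/generate response."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Example: 300 tokens generated in 10 seconds
sample = {"eval_count": 300, "eval_duration": 10_000_000_000}
print(tokens_per_second(sample))  # 30.0
```

Run a few representative prompts and average the result; short prompts understate real throughput.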
Part 3: Calling Ollama from Your Application
Any pod in the cluster can call Ollama via the Service DNS name.
Python (with requests)
```python
import requests

response = requests.post(
    "http://ollama.ollama.svc.cluster.local:11434/api/generate",
    json={
        "model": "llama3:8b",
        "prompt": "Write a Kubernetes health check for a Node.js app",
        "stream": False,
    },
)
print(response.json()["response"])
```

Using OpenAI-compatible endpoint
Ollama exposes an OpenAI-compatible API at /v1/. This means you can use the OpenAI Python SDK without changing your code:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://ollama.ollama.svc.cluster.local:11434/v1",
    api_key="ollama",  # required but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3:8b",
    messages=[
        {"role": "user", "content": "Explain Terraform state locking in 2 sentences."}
    ],
)
print(response.choices[0].message.content)
```

This drop-in compatibility means you can switch between OpenAI and self-hosted Ollama by just changing the base_url.
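In practice that switch is usually driven by configuration rather than code. A small sketch of one way to do it; the `LLM_BASE_URL` and `LLM_API_KEY` variable names are illustrative, not an Ollama or OpenAI convention:

```python
def llm_config(env: dict) -> tuple:
    """Resolve (base_url, api_key) for the OpenAI SDK from env-style settings.

    If LLM_BASE_URL is set, target that endpoint (e.g. in-cluster Ollama);
    otherwise fall back to OpenAI's public API.
    """
    base_url = env.get("LLM_BASE_URL")
    if base_url:
        # Self-hosted Ollama: a key is required by the SDK but ignored by Ollama
        return base_url, env.get("LLM_API_KEY", "ollama")
    return "https://api.openai.com/v1", env["OPENAI_API_KEY"]

# Point at in-cluster Ollama:
print(llm_config({"LLM_BASE_URL": "http://ollama.ollama.svc.cluster.local:11434/v1"}))
```

Pass the resulting pair to `OpenAI(base_url=..., api_key=...)` and the rest of the application code stays identical across backends.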
Part 4: Optional — Open WebUI (ChatGPT-like Interface)
Deploy Open WebUI to give your team a browser-based chat interface connected to your Ollama backend.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
        - name: open-webui
          image: ghcr.io/open-webui/open-webui:main
          ports:
            - containerPort: 8080
          env:
            - name: OLLAMA_BASE_URL
              value: "http://ollama:11434"
          volumeMounts:
            - name: data
              mountPath: /app/backend/data
      volumes:
        - name: data
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: open-webui
  namespace: ollama
spec:
  selector:
    app: open-webui
  ports:
    - port: 8080
      targetPort: 8080
  type: ClusterIP
```

Port-forward to access: `kubectl port-forward -n ollama svc/open-webui 8080:8080`
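The architecture diagram above lists an optional Ingress for exposing the UI to your team. A minimal sketch; the hostname and ingress class are placeholders, and anything exposed beyond the cluster should sit behind authentication (Open WebUI has its own login, but network-level auth or a VPN is safer):

```yaml
# open-webui-ingress.yaml (illustrative -- adjust host and ingressClassName)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: open-webui
  namespace: ollama
spec:
  ingressClassName: nginx
  rules:
    - host: chat.example.internal
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: open-webui
                port:
                  number: 8080
```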
Resource Requirements by Model
| Model | Size (Q4) | Min RAM | Min VRAM | Tokens/sec (CPU) | Tokens/sec (T4 GPU) |
|---|---|---|---|---|---|
| Phi-3 Mini 3.8B | 2.3 GB | 4 GB | 4 GB | 8-12 | 60-80 |
| Llama 3 8B | 4.7 GB | 8 GB | 8 GB | 3-5 | 30-40 |
| Mistral 7B | 4.1 GB | 8 GB | 8 GB | 4-6 | 35-45 |
| Llama 3 70B | 40 GB | 64 GB | 40 GB (A100) | 0.5-1 | needs A100 |
| CodeLlama 13B | 7.4 GB | 16 GB | 16 GB | 2-3 | 20-25 |
For most use cases: Llama 3 8B or Phi-3 Mini on a g4dn.xlarge ($0.52/hr) gives excellent quality at low cost.
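The VRAM column of the table can be turned into a quick sizing check. A sketch; the numbers are the Q4 minimums from the table above, and the model tags are the common Ollama names:

```python
# Minimum VRAM (GB) per model at Q4 quantization, from the table above
MIN_VRAM_GB = {
    "phi3:mini": 4,
    "llama3:8b": 8,
    "mistral:7b": 8,
    "codellama:13b": 16,
    "llama3:70b": 40,
}

def models_that_fit(vram_gb: int) -> list:
    """Return model tags whose minimum VRAM fits the given budget."""
    return sorted(m for m, need in MIN_VRAM_GB.items() if need <= vram_gb)

print(models_that_fit(16))  # a g4dn.xlarge's T4 has 16 GB
```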
Cost Comparison
| Option | Monthly cost (1M tokens/day) | Latency |
|---|---|---|
| OpenAI GPT-4o | ~$600–1,500 | 1-3s |
| OpenAI GPT-4o-mini | ~$30–60 | 0.5-1s |
| Self-hosted Llama 3 8B (g4dn.xlarge) | ~$15–20 | 0.3-1s |
| Self-hosted Phi-3 Mini (CPU only) | ~$5–10 (smaller instances) | 3-10s |
For internal tooling, dev assistants, and batch jobs — self-hosted Ollama pays for itself quickly.
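The break-even math behind the table is easy to redo with your own numbers. A sketch; the $2.50/M token price is a made-up illustration, not a quote, and the low end of the self-hosted range assumes aggressive scale-to-zero:

```python
def api_cost_per_month(tokens_per_day: int, usd_per_million_tokens: float) -> float:
    """Monthly API bill for a given daily token volume (30-day month)."""
    return tokens_per_day * 30 / 1_000_000 * usd_per_million_tokens

def self_hosted_cost_per_month(instance_usd_per_hour: float, hours_per_day: float = 24) -> float:
    """Monthly instance cost; scale-to-zero reduces hours_per_day."""
    return instance_usd_per_hour * hours_per_day * 30

# 1M tokens/day at a hypothetical $2.50/M, vs a g4dn.xlarge ($0.52/hr):
print(api_cost_per_month(1_000_000, 2.50))                 # 75.0
print(self_hosted_cost_per_month(0.52))                    # ~374 running 24/7
print(self_hosted_cost_per_month(0.52, hours_per_day=1))   # ~15.6 with scale-to-zero
```

Running the GPU around the clock costs far more than the table's low end; the self-hosted figures only win if the node scales down when idle or the volume is high enough to saturate it.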
Troubleshooting
Pod stuck in Pending:
```bash
kubectl describe pod -n ollama -l app=ollama
# If GPU: check "insufficient nvidia.com/gpu" → GPU node not ready or device plugin not installed
```

Model pull times out:
Models can be large. The first pull takes time. Use a larger initialDelaySeconds in readinessProbe (60-120s).
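An alternative to a long initialDelaySeconds is a startupProbe, which holds off the readiness check until the server first responds. A sketch to merge into the ollama container spec; tune the thresholds to your startup times:

```yaml
# Gives the server up to 30 x 10s = 5 minutes to come up
# before probe failures start counting against the pod
startupProbe:
  httpGet:
    path: /api/tags
    port: 11434
  periodSeconds: 10
  failureThreshold: 30
```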
Out of memory: Increase memory limits. Llama 3 8B needs at least 8 GB. Set limits generously.
Slow inference on CPU: Expected — CPU inference is slow. Upgrade to a GPU node or use a smaller model (Phi-3 Mini).
Related: What is MLOps — Complete Guide | Kubernetes Resource Calculator