
How to Run Ollama on Kubernetes — Self-Host LLMs in Your Cluster (2026)

Ollama makes running LLMs locally easy. Running it on Kubernetes makes it scalable, persistent, and accessible to your whole team or application stack. Here's the complete setup — CPU and GPU, with persistent model storage and a production-ready deployment.

DevOpsBoys | Apr 10, 2026 | 6 min read

Everyone's running LLMs — but paying OpenAI API costs at scale gets expensive fast. Ollama on Kubernetes gives you a private, self-hosted LLM inference server running Llama 3, Mistral, Phi-3, or any GGUF model — inside your own cluster.

Here's the full setup from zero to a working deployment.


What You're Building

kubectl → Ollama Deployment (GPU or CPU pod)
             ├── Persistent Volume (model storage, ~5-50 GB per model)
             ├── Service (internal API on port 11434)
             └── Optional: Ingress (expose to team or app)

Your apps call http://ollama.ollama.svc.cluster.local:11434/api/generate — the same API as local Ollama, but running inside Kubernetes and reachable from any pod. (The Service is named ollama in the ollama namespace, per the manifests below.)


Prerequisites

  • Kubernetes cluster (EKS, GKE, AKS, or local kind/minikube)
  • kubectl configured
  • At least 4 CPU cores and 8 GB RAM for a small CPU-based model
  • For GPU: NVIDIA GPU node + NVIDIA device plugin installed (covered below)

Part 1: CPU-Based Setup (No GPU Required)

Start here if you don't have GPU nodes. You can run smaller models (Phi-3 mini, Llama 3 8B at Q4 quantization) on CPU — slower, but it works for development and low-traffic use cases.
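As a rough rule of thumb (an approximation, not an official Ollama formula), a Q4-quantized model occupies about half a byte per parameter, plus a gigabyte or two of working overhead for the KV cache and runtime buffers:

```python
def approx_q4_size_gb(params_billions: float, overhead_gb: float = 1.5) -> float:
    """Rough memory estimate for a Q4-quantized model.

    Q4 stores ~4 bits (0.5 bytes) per parameter; overhead_gb covers the
    KV cache and runtime buffers at modest context lengths.
    """
    return params_billions * 0.5 + overhead_gb

# Llama 3 8B: 8 * 0.5 + 1.5 = 5.5 GB -> fits within the 8 GB RAM prerequisite
print(f"{approx_q4_size_gb(8):.1f} GB")
```

That estimate is why the 8 GB RAM prerequisite above is roughly the floor for Llama 3 8B on CPU.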

Namespace

bash
kubectl create namespace ollama

Persistent Volume for Models

Models are large (4–40 GB). Mount a PVC so Ollama doesn't re-download models on every pod restart.

yaml
# ollama-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3   # use your cluster's StorageClass
  resources:
    requests:
      storage: 50Gi        # enough for 2-3 models
bash
kubectl apply -f ollama-pvc.yaml

Deployment (CPU)

yaml
# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            requests:
              cpu: "2"
              memory: "8Gi"
            limits:
              cpu: "4"
              memory: "16Gi"
          volumeMounts:
            - name: models
              mountPath: /root/.ollama    # Ollama stores models here
          env:
            - name: OLLAMA_HOST
              value: "0.0.0.0"            # listen on all interfaces
          readinessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models
bash
kubectl apply -f ollama-deployment.yaml

Service

yaml
# ollama-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
  type: ClusterIP
bash
kubectl apply -f ollama-service.yaml

Pull a Model

bash
# Exec into the Ollama pod and pull a model
kubectl exec -n ollama deploy/ollama -- ollama pull phi3:mini
 
# Or pull llama3 (4.7 GB)
kubectl exec -n ollama deploy/ollama -- ollama pull llama3:8b

Wait for the pull to complete — models are stored in the PVC.
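You can also confirm the pull landed by querying the /api/tags endpoint instead of exec-ing into the pod. A small sketch, assuming the tags response shape documented by the Ollama API (each entry carries a name field); run it after port-forwarding the Service as shown in Test It below:

```python
import json
import urllib.request

def model_names(tags_response: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags_response.get("models", [])]

def installed_models(base_url: str = "http://localhost:11434") -> list[str]:
    """List the models Ollama has on disk (i.e. in the PVC)."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return model_names(json.load(resp))
```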

Test It

bash
# Port-forward to test locally
kubectl port-forward -n ollama svc/ollama 11434:11434
 
# In another terminal
curl http://localhost:11434/api/generate -d '{
  "model": "phi3:mini",
  "prompt": "Explain Kubernetes in 3 sentences.",
  "stream": false
}'
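The example above sets "stream": false to get one JSON object back. By default the API streams newline-delimited JSON chunks instead; a small sketch for stitching those back together (field names per the Ollama generate API, where each chunk carries a response fragment and the last one sets done):

```python
import json

def join_stream(ndjson_lines) -> str:
    """Concatenate the 'response' fragments of a streamed /api/generate reply."""
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):  # final chunk signals completion
            break
    return "".join(parts)
```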

Part 2: GPU Setup on EKS

GPU inference is 10–100x faster than CPU for LLMs. This section sets up Ollama on a GPU node group in EKS.

Step 1: Add a GPU Node Group in Terraform

hcl
# In your EKS Terraform config
resource "aws_eks_node_group" "gpu" {
  cluster_name    = module.eks.cluster_name
  node_group_name = "gpu"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = module.vpc.private_subnets
 
  instance_types = ["g4dn.xlarge"]   # 1x NVIDIA T4, 16 GB VRAM, ~$0.52/hr
 
  scaling_config {
    desired_size = 1
    min_size     = 0
    max_size     = 2
  }
 
  labels = {
    "nvidia.com/gpu" = "true"
  }
 
  taint {
    key    = "nvidia.com/gpu"
    value  = "true"
    effect = "NO_SCHEDULE"     # only GPU-requesting pods land here
  }
}

Step 2: Install NVIDIA Device Plugin

bash
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml

Verify GPU is visible:

bash
kubectl get nodes -l nvidia.com/gpu=true
kubectl describe node <gpu-node> | grep nvidia.com/gpu
# Allocatable: nvidia.com/gpu: 1

Step 3: GPU Deployment

yaml
# ollama-gpu-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      nodeSelector:
        nvidia.com/gpu: "true"
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            requests:
              cpu: "2"
              memory: "8Gi"
              nvidia.com/gpu: "1"     # request 1 GPU
            limits:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
          env:
            - name: OLLAMA_HOST
              value: "0.0.0.0"
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models

With a T4 GPU, Llama 3 8B runs at ~30 tokens/second — fast enough for real application use.
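That throughput makes response latency easy to estimate (simple arithmetic, not a benchmark):

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to generate a response of the given length at a given rate."""
    return tokens / tokens_per_second

# A 300-token answer at ~30 tok/s on a T4 takes ~10 seconds
print(generation_seconds(300, 30))  # 10.0
```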


Part 3: Calling Ollama from Your Application

Any pod in the cluster can call Ollama via the Service DNS name.

Python (with requests)

python
import requests
 
response = requests.post(
    "http://ollama.ollama.svc.cluster.local:11434/api/generate",
    json={
        "model": "llama3:8b",
        "prompt": "Write a Kubernetes health check for a Node.js app",
        "stream": False
    }
)
print(response.json()["response"])

Using OpenAI-compatible endpoint

Ollama exposes an OpenAI-compatible API at /v1/. This means you can use the OpenAI Python SDK without changing your code:

python
from openai import OpenAI
 
client = OpenAI(
    base_url="http://ollama.ollama.svc.cluster.local:11434/v1",
    api_key="ollama"   # required but ignored by Ollama
)
 
response = client.chat.completions.create(
    model="llama3:8b",
    messages=[
        {"role": "user", "content": "Explain Terraform state locking in 2 sentences."}
    ]
)
print(response.choices[0].message.content)

This drop-in compatibility means you can switch between OpenAI and self-hosted Ollama by just changing the base_url.
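One way to wire that up is an environment-variable switch. LLM_BASE_URL and LLM_API_KEY below are names chosen for this sketch, not anything Ollama or the OpenAI SDK define:

```python
import os

def client_config() -> dict:
    """Pick OpenAI vs self-hosted Ollama from the environment.

    With no env vars set, this defaults to the in-cluster Ollama Service;
    point LLM_BASE_URL at https://api.openai.com/v1 (and set a real key)
    to switch back to OpenAI without touching application code.
    """
    return {
        "base_url": os.environ.get(
            "LLM_BASE_URL",
            "http://ollama.ollama.svc.cluster.local:11434/v1",
        ),
        "api_key": os.environ.get("LLM_API_KEY", "ollama"),
    }

# client = OpenAI(**client_config())  # identical call for both backends
```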


Part 4: Optional — Open WebUI (ChatGPT-like Interface)

Deploy Open WebUI to give your team a browser-based chat interface connected to your Ollama backend.

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
        - name: open-webui
          image: ghcr.io/open-webui/open-webui:main
          ports:
            - containerPort: 8080
          env:
            - name: OLLAMA_BASE_URL
              value: "http://ollama:11434"
          volumeMounts:
            - name: data
              mountPath: /app/backend/data
      volumes:
        - name: data
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: open-webui
  namespace: ollama
spec:
  selector:
    app: open-webui
  ports:
    - port: 8080
      targetPort: 8080
  type: ClusterIP

Port-forward to access: kubectl port-forward -n ollama svc/open-webui 8080:8080


Resource Requirements by Model

| Model | Size (Q4) | Min RAM | Min VRAM | Tokens/sec (CPU) | Tokens/sec (T4 GPU) |
|---|---|---|---|---|---|
| Phi-3 Mini 3.8B | 2.3 GB | 4 GB | 4 GB | 8–12 | 60–80 |
| Llama 3 8B | 4.7 GB | 8 GB | 8 GB | 3–5 | 30–40 |
| Mistral 7B | 4.1 GB | 8 GB | 8 GB | 4–6 | 35–45 |
| Llama 3 70B | 40 GB | 64 GB | 40 GB (A100) | 0.5–1 | needs A100 |
| CodeLlama 13B | 7.4 GB | 16 GB | 16 GB | 2–3 | 20–25 |

For most use cases: Llama 3 8B or Phi-3 Mini on a g4dn.xlarge ($0.52/hr) gives excellent quality at low cost.


Cost Comparison

| Option | Monthly cost (1M tokens/day) | Latency |
|---|---|---|
| OpenAI GPT-4o | ~$600–1,500 | 1–3 s |
| OpenAI GPT-4o-mini | ~$30–60 | 0.5–1 s |
| Self-hosted Llama 3 8B (g4dn.xlarge) | ~$380 (24/7 on-demand at $0.52/hr) | 0.3–1 s |
| Self-hosted Phi-3 Mini (CPU only) | ~$5–10 (smaller instances) | 3–10 s |

For internal tooling, dev assistants, and batch jobs — self-hosted Ollama pays for itself quickly compared with GPT-4o-class API pricing.
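Using the article's own $0.52/hr figure for g4dn.xlarge (on-demand; actual prices vary by region), the break-even check is one multiplication:

```python
def monthly_gpu_cost(hourly_usd: float, hours_per_month: float = 730) -> float:
    """On-demand cost of one GPU node running the whole month."""
    return hourly_usd * hours_per_month

# g4dn.xlarge at ~$0.52/hr -> roughly $380/month running 24/7
print(round(monthly_gpu_cost(0.52), 2))
```

Anything you currently spend above that on GPT-4o-class API calls is the payback; scaling the node group to zero when idle (min_size = 0 in the Terraform above) cuts it further.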


Troubleshooting

Pod stuck in Pending:

bash
kubectl describe pod -n ollama -l app=ollama
# If GPU: check "insufficient nvidia.com/gpu" → GPU node not ready or device plugin not installed

Model pull times out: Models can be large. The first pull takes time. Use a larger initialDelaySeconds in readinessProbe (60-120s).

Out of memory: Increase memory limits. Llama 3 8B needs at least 8 GB. Set limits generously.

Slow inference on CPU: Expected — CPU inference is slow. Upgrade to a GPU node or use a smaller model (Phi-3 Mini).


