Deploy LocalAI on Kubernetes — Self-Hosted LLM API Without GPU (2026)
Run LocalAI on Kubernetes to get an OpenAI-compatible API endpoint using CPU-only nodes. Deploy Llama 3, Mistral, or Phi-3 locally with no API costs, no data leaving your cluster, and full OpenAI SDK compatibility.
You want to run LLMs in your Kubernetes cluster but don't have GPU nodes — or can't send data to OpenAI for compliance reasons. LocalAI gives you an OpenAI-compatible REST API running on CPU-only nodes, supporting Llama 3, Mistral, Phi-3, and 50+ other models.
Your existing code that uses the OpenAI SDK works without any changes — just swap the base URL.
What LocalAI Is
LocalAI is a drop-in replacement for the OpenAI API that runs locally. It supports:
- Chat completions (/v1/chat/completions)
- Text completions (/v1/completions)
- Embeddings (/v1/embeddings)
- Image generation (/v1/images/generations)
- Text-to-speech and speech-to-text
No GPU required: LocalAI runs on standard CPU nodes (slower, but it works), and GPU nodes only improve speed.
Hardware Requirements
| Model | RAM needed | Speed (CPU) | Speed (GPU T4) |
|---|---|---|---|
| Phi-3 Mini (3.8B) | 4GB | ~5 tok/sec | ~40 tok/sec |
| Mistral 7B (q4) | 8GB | ~2 tok/sec | ~25 tok/sec |
| Llama 3 8B (q4) | 8GB | ~2 tok/sec | ~25 tok/sec |
| Llama 3 70B (q4) | 48GB | Too slow | Needs A100 |
For CPU-only: start with Phi-3 Mini or Mistral 7B quantized (q4_K_M). Fast enough for internal tooling.
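If a model you want isn't listed, a rough way to size RAM (an approximation, not an exact formula) is about half a byte per parameter at 4-bit quantization, plus headroom for the KV cache and runtime:

# Back-of-the-envelope RAM estimate for a q4-quantized GGUF model (approximation only)
def estimate_ram_gb(params_billions: float, bits_per_weight: int = 4, overhead_gb: float = 1.5) -> float:
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb

print(estimate_ram_gb(7))  # ~5.0 GB for a 7B model at q4; the table above requests more to be safe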
Step 1: Create Namespace and Storage
kubectl create namespace localai

# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: localai-models
namespace: localai
spec:
accessModes: [ReadWriteOnce]
storageClassName: gp3
resources:
requests:
      storage: 30Gi # Models are 4-8GB each

kubectl apply -f pvc.yaml

Step 2: Create Model Configuration
LocalAI uses YAML config files to define models. Create a ConfigMap:
# models-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: localai-models-config
namespace: localai
data:
phi-3-mini.yaml: |
name: phi-3-mini
backend: llama
parameters:
      model: Phi-3-mini-4k-instruct-q4.gguf
context_size: 4096
threads: 4
f16: true
template:
chat_message: |
<|user|>
{{.Input}}<|end|>
<|assistant|>
mistral-7b.yaml: |
name: mistral-7b
backend: llama
parameters:
      model: mistral-7b-instruct-v0.3.Q4_K_M.gguf
context_size: 8192
threads: 4
f16: true
template:
chat_message: |
[INST] {{.Input}} [/INST]
text-embedding.yaml: |
name: text-embedding-ada-002
backend: bert-embeddings
parameters:
model: all-MiniLM-L6-v2.bin
    embeddings: true

kubectl apply -f models-config.yaml

Step 3: Download Models (Init Job)
The Job below fetches the two chat models from Hugging Face. Note that it does not download the all-MiniLM-L6-v2.bin file referenced by the embedding config; add that to the models volume separately if you need the embeddings endpoint.
# download-models-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: download-models
namespace: localai
spec:
template:
spec:
containers:
- name: downloader
image: python:3.12-slim
command:
- /bin/bash
- -c
- |
pip install huggingface_hub -q
python3 -c "
from huggingface_hub import hf_hub_download
import shutil, os
models = [
('microsoft/Phi-3-mini-4k-instruct-gguf', 'Phi-3-mini-4k-instruct-q4.gguf'),
('TheBloke/Mistral-7B-Instruct-v0.3-GGUF', 'mistral-7b-instruct-v0.3.Q4_K_M.gguf'),
]
for repo, filename in models:
print(f'Downloading {filename}...')
path = hf_hub_download(repo_id=repo, filename=filename, cache_dir='/tmp')
dest = f'/models/{filename}'
if not os.path.exists(dest):
shutil.copy(path, dest)
print(f'Done: {dest}')
"
volumeMounts:
- name: models
mountPath: /models
resources:
requests:
memory: "2Gi"
cpu: "500m"
volumes:
- name: models
persistentVolumeClaim:
claimName: localai-models
      restartPolicy: Never

kubectl apply -f download-models-job.yaml
kubectl logs -n localai job/download-models -f
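If you script the rollout, you can also poll the Job status instead of tailing logs. A minimal sketch using the official kubernetes Python client (assumes the package is installed and a kubeconfig is available):

# check_download_job.py - optional programmatic check of the download Job
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
batch = client.BatchV1Api()

job = batch.read_namespaced_job(name="download-models", namespace="localai")
if job.status.succeeded:
    print("model download complete")
elif job.status.failed:
    print("download failed - check: kubectl logs -n localai job/download-models")
else:
    print("still running...")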
Step 4: Deploy LocalAI

# localai-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: localai
namespace: localai
spec:
replicas: 1
selector:
matchLabels:
app: localai
template:
metadata:
labels:
app: localai
spec:
containers:
- name: localai
image: localai/localai:latest-aio-cpu # CPU-only image
ports:
- containerPort: 8080
env:
- name: MODELS_PATH
value: /models
- name: CONTEXT_SIZE
value: "4096"
- name: THREADS
value: "4"
- name: DEBUG
value: "false"
volumeMounts:
- name: models
mountPath: /models
- name: config
mountPath: /models/config
resources:
requests:
memory: "6Gi"
cpu: "2000m"
limits:
memory: "10Gi"
cpu: "4000m"
readinessProbe:
httpGet:
path: /readyz
port: 8080
initialDelaySeconds: 30
periodSeconds: 15
timeoutSeconds: 5
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 60
periodSeconds: 30
volumes:
- name: models
persistentVolumeClaim:
claimName: localai-models
- name: config
configMap:
name: localai-models-config
---
apiVersion: v1
kind: Service
metadata:
name: localai
namespace: localai
spec:
selector:
app: localai
ports:
- port: 80
    targetPort: 8080

kubectl apply -f localai-deployment.yaml
# Watch pod startup (may take 2-3 minutes for model loading)
kubectl logs -n localai deploy/localai -f
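Before putting an Ingress in front of it, confirm the configured models are visible through the API. A quick check against the /v1/models endpoint; this assumes kubectl port-forward -n localai svc/localai 8080:80 is running, so adjust the URL to however you reach the Service:

# list_models.py - sanity check that LocalAI has registered the configured models
import requests

resp = requests.get("http://localhost:8080/v1/models", timeout=30)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])  # expect phi-3-mini, mistral-7b, text-embedding-ada-002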
Step 5: Expose with Ingress

# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: localai-ingress
namespace: localai
annotations:
nginx.ingress.kubernetes.io/proxy-body-size: "100m"
nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
ingressClassName: nginx
rules:
- host: llm.internal.yourdomain.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: localai
port:
              number: 80

Step 6: Use It — OpenAI SDK Compatible
from openai import OpenAI
# Just change the base_url — everything else stays the same
client = OpenAI(
api_key="not-needed", # LocalAI doesn't require auth by default
base_url="http://llm.internal.yourdomain.com/v1"
)
# Chat completion — identical to OpenAI
response = client.chat.completions.create(
model="phi-3-mini",
messages=[
{"role": "system", "content": "You are a DevOps expert."},
{"role": "user", "content": "Explain what a Kubernetes PodDisruptionBudget does in 2 sentences."}
],
max_tokens=200,
)
print(response.choices[0].message.content)

# Embeddings — also OpenAI-compatible
embedding = client.embeddings.create(
model="text-embedding-ada-002", # Your local embedding model
input="kubernetes pod scheduling"
)
print(embedding.data[0].embedding[:5]) # First 5 dimensions

# Test directly with curl
curl http://llm.internal.yourdomain.com/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi-3-mini",
"messages": [{"role": "user", "content": "What is a Kubernetes namespace?"}]
}'
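Streaming works through the same interface. A short sketch using the OpenAI SDK's stream flag with the client from above (LocalAI streams chat completions for llama.cpp-backed models; verify behaviour against your LocalAI version):

# Stream tokens as they are generated instead of waiting for the full response
stream = client.chat.completions.create(
    model="phi-3-mini",
    messages=[{"role": "user", "content": "Give me three kubectl debugging tips."}],
    max_tokens=200,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()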
Scaling for Multiple Concurrent Users
LocalAI handles one request at a time per replica by default, so concurrent users need additional replicas. One caveat: the models PVC above is ReadWriteOnce, so extra replicas will only start if they schedule onto the same node; for genuine horizontal scaling, use a ReadWriteMany storage class or bake the models into the image.
# Scale up replicas
spec:
replicas: 3 # 3 instances = 3 concurrent requests
# Or use HPA based on CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: localai-hpa
namespace: localai
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: localai
minReplicas: 1
maxReplicas: 5
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
        averageUtilization: 70
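Because each replica serves one generation at a time, requests queue behind each other under load, so set explicit timeouts and retries on the client side. A sketch using the OpenAI SDK's built-in options (values are illustrative, tune them to your workload):

from openai import OpenAI

# CPU generations are slow, so allow a generous timeout and a couple of retries
client = OpenAI(
    api_key="not-needed",
    base_url="http://llm.internal.yourdomain.com/v1",
    timeout=120.0,   # seconds to wait for the full response
    max_retries=2,   # retries on transient connection errors
)

response = client.chat.completions.create(
    model="phi-3-mini",
    messages=[{"role": "user", "content": "Summarize today's failed deployments."}],
    max_tokens=150,
)
print(response.choices[0].message.content)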
Practical Use Cases in DevOps

1. Internal documentation chatbot:
# Answer questions from your runbooks
response = client.chat.completions.create(
model="mistral-7b",
messages=[
{"role": "system", "content": f"Answer based on this runbook:\n{runbook_text}"},
{"role": "user", "content": "How do I rotate the database password?"}
]
)

2. Log analysis:
def analyze_error_log(log_excerpt: str) -> str:
response = client.chat.completions.create(
model="phi-3-mini",
messages=[{
"role": "user",
"content": f"Analyze this error log and suggest the most likely cause:\n{log_excerpt}"
}]
)
    return response.choices[0].message.content

3. Terraform plan explainer:
def explain_terraform_plan(plan_output: str) -> str:
response = client.chat.completions.create(
model="mistral-7b",
messages=[{
"role": "user",
"content": f"Explain what this terraform plan will do in plain English:\n{plan_output}"
}]
)
    return response.choices[0].message.content

Cost Comparison
| Option | Approximate cost |
|---|---|
| OpenAI GPT-4o | ~$10–15 per 1M tokens |
| Claude 3.5 Sonnet | ~$9 per 1M tokens |
| LocalAI on t3.xlarge (CPU) | ~$0.15/hr instance cost (compute only) |
| LocalAI on g4dn.xlarge (GPU) | ~$0.53/hr instance cost |
For high-volume internal tooling where output quality doesn't need to match GPT-4o, LocalAI on CPU nodes is dramatically cheaper: you pay a fixed instance cost instead of per-token billing.
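The effective per-token cost depends on how busy the instance stays; here is a small helper (illustrative only) for plugging in your own instance price and measured throughput:

# Effective cost of generating 1M tokens at a given sustained throughput
def cost_per_million_tokens(instance_usd_per_hour: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    hours_for_million = 1_000_000 / tokens_per_hour
    return instance_usd_per_hour * hours_for_million

print(round(cost_per_million_tokens(0.53, 25), 2))  # T4 figures from the table above

With the table's T4 figures this works out to roughly $6 per 1M tokens when the node is kept busy.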
LocalAI gives you a privacy-first, cost-effective LLM deployment that works with all existing OpenAI-compatible code. For production deployments, check the LocalAI documentation for advanced model configuration and GPU acceleration setup.