
Deploy LocalAI on Kubernetes — Self-Hosted LLM API Without GPU (2026)

Run LocalAI on Kubernetes to get an OpenAI-compatible API endpoint using CPU-only nodes. Deploy Llama 3, Mistral, or Phi-3 locally with no API costs, no data leaving your cluster, and full OpenAI SDK compatibility.

DevOpsBoys | Apr 26, 2026 | 5 min read

You want to run LLMs in your Kubernetes cluster but don't have GPU nodes — or can't send data to OpenAI for compliance reasons. LocalAI gives you an OpenAI-compatible REST API running on CPU-only nodes, supporting Llama 3, Mistral, Phi-3, and 50+ other models.

Existing code that uses the OpenAI SDK keeps working: point the base URL at LocalAI and use the model names you configured.


What LocalAI Is

LocalAI is a drop-in replacement for the OpenAI API that runs locally. It supports:

  • Chat completions (/v1/chat/completions)
  • Text completions (/v1/completions)
  • Embeddings (/v1/embeddings)
  • Image generation (/v1/images/generations)
  • Text-to-speech, speech-to-text

No GPU required — runs on standard CPU nodes (slower, but works). GPU nodes significantly improve speed but aren't mandatory.
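
Because every endpoint above is plain HTTP with a JSON body, you can talk to LocalAI without any SDK. Here is a minimal sketch using only the Python standard library. It builds the request without sending it; the helper name `build_chat_request`, the host, and the model name are placeholders chosen for illustration.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder address and model name; send with urllib.request.urlopen(req) once a server is up.
req = build_chat_request("http://localai.example.internal", "phi-3-mini", "ping")
print(req.get_method(), req.full_url)
```

The same pattern applies to the other endpoints: only the path and the JSON payload change.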


Hardware Requirements

Model | RAM needed | Speed (CPU) | Speed (GPU T4)
Phi-3 Mini (3.8B) | 4GB | ~5 tok/sec | ~40 tok/sec
Mistral 7B (q4) | 8GB | ~2 tok/sec | ~25 tok/sec
Llama 3 8B (q4) | 8GB | ~2 tok/sec | ~25 tok/sec
Llama 3 70B (q4) | 48GB | Too slow | Needs A100

For CPU-only nodes, start with Phi-3 Mini or a quantized (q4_K_M) Mistral 7B. At the speeds above they are fast enough for internal tooling.
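
A rough way to sanity-check the RAM column: the weights of a quantized model take about (parameters × bits per weight ÷ 8) bytes. The sketch below assumes about 4.5 bits per weight for a 4-bit quantization, which is an approximation and not a figure from LocalAI; the real file size is the number to trust.

```python
def estimate_weights_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# bits_per_weight=4.5 is an assumed average for a 4-bit quantization.
for name, params in [("Phi-3 Mini", 3.8), ("Llama 3 8B", 8.0), ("Llama 3 70B", 70.0)]:
    print(f"{name}: ~{estimate_weights_gib(params, 4.5):.1f} GiB of weights")
```

The KV cache, the runtime, and the OS need memory on top of the weights, which is why the RAM figures in the table are higher than the weight size alone.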


Step 1: Create Namespace and Storage

bash
kubectl create namespace localai
yaml
# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: localai-models
  namespace: localai
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: gp3  # example name; use a StorageClass that exists in your cluster (kubectl get storageclass)
  resources:
    requests:
      storage: 30Gi  # Models are 4-8GB each
bash
kubectl apply -f pvc.yaml

Step 2: Create Model Configuration

LocalAI uses YAML config files to define models. Create a ConfigMap:

yaml
# models-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: localai-models-config
  namespace: localai
data:
  phi-3-mini.yaml: |
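    name: phi-3-mini
    backend: llama
    # context_size, threads and f16 are top-level keys in a LocalAI model config
    context_size: 4096
    threads: 4
    f16: true
    parameters:
      # Must match the file name written to /models by the download Job
      model: Phi-3-mini-4k-instruct-q4.gguf
    template:
      # "chat" receives the assembled conversation as {{.Input}}
      chat: |
        <|user|>
        {{.Input}}<|end|>
        <|assistant|>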
 
  mistral-7b.yaml: |
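    name: mistral-7b
    backend: llama
    # context_size, threads and f16 are top-level keys in a LocalAI model config
    context_size: 8192
    threads: 4
    f16: true
    parameters:
      # Must match the file name written to /models by the download Job
      model: mistral-7b-instruct-v0.3.Q4_K_M.gguf
    template:
      # "chat" receives the assembled conversation as {{.Input}}
      chat: |
        [INST] {{.Input}} [/INST]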
 
  text-embedding.yaml: |
    name: text-embedding-ada-002
    backend: bert-embeddings
    parameters:
      # Not fetched by the download Job in Step 3; copy this file into /models yourself
      model: all-MiniLM-L6-v2.bin
    embeddings: true
bash
kubectl apply -f models-config.yaml
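
A config only works if its `model:` value matches a file name in the models directory exactly, including case. A small check like the following can catch a mismatch before LocalAI starts. It is a sketch: it uses a regular expression rather than a YAML parser to stay dependency-free, and the function name is made up for this example.

```python
import re
from pathlib import Path

# Matches lines such as "  model: some-file.gguf" and captures the file name.
MODEL_LINE = re.compile(r"^[ \t]*model:[ \t]*(\S+)", re.MULTILINE)


def missing_model_files(config_dir: Path, models_dir: Path) -> list[str]:
    """Return model file names referenced by *.yaml configs but absent from models_dir."""
    missing = []
    for config in sorted(config_dir.glob("*.yaml")):
        for filename in MODEL_LINE.findall(config.read_text()):
            if not (models_dir / filename).is_file():
                missing.append(filename)
    return missing
```

Run it from any pod that mounts the volume, for example `missing_model_files(Path("/models/config"), Path("/models"))` with the mount paths used by the Deployment in Step 4. An empty list means every referenced file is present.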

Step 3: Download Models (Init Job)

yaml
# download-models-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: download-models
  namespace: localai
spec:
  template:
    spec:
      containers:
      - name: downloader
        image: python:3.12-slim
        command:
        - /bin/bash
        - -c
        - |
          pip install huggingface_hub -q
          python3 -c "
          from huggingface_hub import hf_hub_download
          import shutil, os
 
          # Repo and file names are case-sensitive; confirm both exist on Hugging Face first
          models = [
            ('microsoft/Phi-3-mini-4k-instruct-gguf', 'Phi-3-mini-4k-instruct-q4.gguf'),
            ('TheBloke/Mistral-7B-Instruct-v0.3-GGUF', 'mistral-7b-instruct-v0.3.Q4_K_M.gguf'),
          ]
          
          for repo, filename in models:
              print(f'Downloading {filename}...')
              path = hf_hub_download(repo_id=repo, filename=filename, cache_dir='/tmp')
              dest = f'/models/{filename}'
              if not os.path.exists(dest):
                  shutil.copy(path, dest)
              print(f'Done: {dest}')
          "
        volumeMounts:
        - name: models
          mountPath: /models
        resources:
          requests:
            memory: "2Gi"
            cpu: "500m"
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: localai-models
      restartPolicy: Never
bash
kubectl apply -f download-models-job.yaml
kubectl logs -n localai job/download-models -f
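
Once the Job finishes, it is worth confirming that what landed on the volume is really a model file and not, say, an HTML error page. GGUF files begin with the four ASCII bytes `GGUF`, so a cheap check is possible. The helper names below are illustrative, and this only inspects the header: it will not detect a download that was cut off partway.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def looks_like_gguf(path: Path) -> bool:
    """True if the file can be read and starts with the GGUF magic bytes."""
    try:
        with path.open("rb") as f:
            return f.read(4) == GGUF_MAGIC
    except OSError:
        return False


def check_models(models_dir: Path) -> dict[str, bool]:
    """Map each *.gguf file name in models_dir to whether its header looks valid."""
    return {p.name: looks_like_gguf(p) for p in sorted(models_dir.glob("*.gguf"))}
```

Calling `check_models(Path("/models"))` from a pod that mounts the PVC gives a quick pass/fail per file.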

Step 4: Deploy LocalAI

yaml
# localai-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: localai
  namespace: localai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: localai
  template:
    metadata:
      labels:
        app: localai
    spec:
      containers:
      - name: localai
        image: localai/localai:latest-aio-cpu   # CPU-only image
        ports:
        - containerPort: 8080
        env:
        - name: MODELS_PATH
          value: /models
        - name: CONTEXT_SIZE
          value: "4096"
        - name: THREADS
          value: "4"
        - name: DEBUG
          value: "false"
        volumeMounts:
        - name: models
          mountPath: /models
        - name: config
          mountPath: /models/config
        resources:
          requests:
            memory: "6Gi"
            cpu: "2000m"
          limits:
            memory: "10Gi"
            cpu: "4000m"
        readinessProbe:
          httpGet:
            path: /readyz
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 15
          timeoutSeconds: 5
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 30
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: localai-models
      - name: config
        configMap:
          name: localai-models-config
---
apiVersion: v1
kind: Service
metadata:
  name: localai
  namespace: localai
spec:
  selector:
    app: localai
  ports:
  - port: 80
    targetPort: 8080
bash
kubectl apply -f localai-deployment.yaml
 
# Watch pod startup (may take 2-3 minutes for model loading)
kubectl logs -n localai deploy/localai -f
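
If you would rather script the wait than watch logs, you can poll the same `/readyz` path the readiness probe uses. The following is a sketch with standard-library calls only; the function names, retry count, and delay are arbitrary choices, and the URL assumes you have a port-forward running (`kubectl port-forward -n localai svc/localai 8080:80`).

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def wait_until_ready(url: str, attempts: int = 20, delay: float = 15.0,
                     probe=http_status, sleep=time.sleep) -> bool:
    """Poll url until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

With the port-forward in place, `wait_until_ready("http://localhost:8080/readyz")` returns `True` as soon as the pod reports ready.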

Step 5: Expose with Ingress

yaml
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: localai-ingress
  namespace: localai
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
  - host: llm.internal.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: localai
            port:
              number: 80
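
The `proxy-read-timeout` of 300 seconds matters more on CPU than it would on a GPU. A non-streaming completion sends nothing back until generation finishes, so a slow model can run into that timeout. Here is a back-of-the-envelope sketch using the tok/sec estimates from the hardware table. The function name and the 20% safety margin are assumptions for illustration, and prompt processing time is ignored.

```python
def max_tokens_within(timeout_seconds: float, tokens_per_second: float,
                      safety_margin: float = 0.8) -> int:
    """Largest max_tokens value likely to finish before the proxy timeout."""
    return int(timeout_seconds * tokens_per_second * safety_margin)


print(max_tokens_within(300, 2))  # 7B-class model on CPU
print(max_tokens_within(300, 5))  # Phi-3 Mini on CPU
```

If your prompts need longer answers than this budget allows, raise the timeout or stream the response.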

Step 6: Use It — OpenAI SDK Compatible

python
from openai import OpenAI
 
# Just change the base_url — everything else stays the same
client = OpenAI(
    api_key="not-needed",   # LocalAI doesn't require auth by default
    base_url="http://llm.internal.yourdomain.com/v1"
)
 
# Chat completion — identical to OpenAI
response = client.chat.completions.create(
    model="phi-3-mini",
    messages=[
        {"role": "system", "content": "You are a DevOps expert."},
        {"role": "user", "content": "Explain what a Kubernetes PodDisruptionBudget does in 2 sentences."}
    ],
    max_tokens=200,
)
print(response.choices[0].message.content)
python
# Embeddings — also OpenAI-compatible
embedding = client.embeddings.create(
    model="text-embedding-ada-002",  # Your local embedding model
    input="kubernetes pod scheduling"
)
print(embedding.data[0].embedding[:5])  # First 5 dimensions
bash
# Test directly with curl
curl http://llm.internal.yourdomain.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-3-mini",
    "messages": [{"role": "user", "content": "What is a Kubernetes namespace?"}]
  }'
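
At a few tokens per second, waiting for the full response feels slow, so streaming is worth considering. With the SDK you pass `stream=True`; on the wire, an OpenAI-style stream is a series of server-sent event lines, each starting with `data:`, ending with `data: [DONE]`. The sketch below parses that format and assumes LocalAI follows it; the function name is illustrative.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Feed it the decoded lines of an HTTP response whose request body included `"stream": true`, and print each fragment as it arrives.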

Scaling for Multiple Concurrent Users

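LocalAI handles one request at a time per replica by default, so concurrent users need more replicas. Keep in mind that the localai-models PVC from Step 1 is ReadWriteOnce: it can only be mounted from one node at a time, so replicas scheduled onto other nodes will not start unless you move the models to a ReadWriteMany volume.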

yaml
# Option 1: set a fixed replica count in the Deployment
spec:
  replicas: 3   # 3 instances = 3 concurrent requests
---
# Option 2: autoscale on CPU with an HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: localai-hpa
  namespace: localai
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: localai
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
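
To pick a starting replica count, Little's law gives a quick estimate: the average number of requests in flight equals the arrival rate times the time each request takes. Since each replica serves one request at a time, that number is also the minimum replica count. The function name and the sample traffic figures below are assumptions for illustration.

```python
import math


def replicas_needed(requests_per_minute: float, avg_seconds_per_request: float) -> int:
    """Minimum replicas to keep up with average load (one request per replica)."""
    in_flight = (requests_per_minute / 60.0) * avg_seconds_per_request
    return max(1, math.ceil(in_flight))


# Example: 2 requests per minute, each taking about 100 seconds to generate.
print(replicas_needed(2, 100))
```

This covers the average only. Bursty traffic needs headroom on top, which is where `maxReplicas` in the HPA comes in.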

Practical Use Cases in DevOps

1. Internal documentation chatbot:

python
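# Answer questions from your runbooks (reuses the client from Step 6)
# Hypothetical path; load whichever runbook the question is about
with open("runbooks/database.md") as f:
    runbook_text = f.read()

response = client.chat.completions.create(
    model="mistral-7b",
    messages=[
        {"role": "system", "content": f"Answer based on this runbook:\n{runbook_text}"},
        {"role": "user", "content": "How do I rotate the database password?"}
    ]
)
print(response.choices[0].message.content)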
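
A long runbook can overflow the model's context window (8192 tokens for the mistral-7b config in Step 2). One simple guard is to truncate `runbook_text` before building the prompt. The sketch below uses the common rule of thumb of roughly 4 characters per token for English text, which is an approximation and not a real tokenizer count; the function name and the reserved budget are illustrative.

```python
def fit_to_context(text: str, context_tokens: int, reserved_tokens: int,
                   chars_per_token: float = 4.0) -> str:
    """Truncate text so that it plus the reserved budget should fit the context window."""
    budget_chars = int(max(0, context_tokens - reserved_tokens) * chars_per_token)
    if len(text) <= budget_chars:
        return text
    return text[:budget_chars]


# Reserve room for the question and the answer; 1000 is an arbitrary example value.
sample = fit_to_context("step 1: check the vault\n" * 5000, context_tokens=8192, reserved_tokens=1000)
print(len(sample))
```

For runbooks that are much larger than the window, splitting them into sections and selecting the relevant one with embeddings is the more robust approach.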

2. Log analysis:

python
def analyze_error_log(log_excerpt: str) -> str:
    response = client.chat.completions.create(
        model="phi-3-mini",
        messages=[{
            "role": "user",
            "content": f"Analyze this error log and suggest the most likely cause:\n{log_excerpt}"
        }]
    )
    return response.choices[0].message.content
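
On CPU, every extra prompt token costs time, so it helps to trim the log before passing it to `analyze_error_log`. A minimal sketch that keeps only lines that look like errors plus a little surrounding context follows. The keyword list, the function name, and the default limits are assumptions; tune them to your log format.

```python
import re

ERROR_PATTERN = re.compile(r"error|exception|fatal|panic|traceback", re.IGNORECASE)


def error_excerpt(log_text: str, context_lines: int = 2, max_lines: int = 60) -> str:
    """Keep lines matching ERROR_PATTERN plus a few lines of context around each."""
    lines = log_text.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            start = max(0, i - context_lines)
            end = min(len(lines), i + context_lines + 1)
            keep.update(range(start, end))
    selected = [lines[i] for i in sorted(keep)]
    return "\n".join(selected[-max_lines:])  # prefer the most recent matches
```

Usage is then `analyze_error_log(error_excerpt(raw_log))`, where `raw_log` is whatever you pulled from `kubectl logs`.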

3. Terraform plan explainer:

python
def explain_terraform_plan(plan_output: str) -> str:
    response = client.chat.completions.create(
        model="mistral-7b",
        messages=[{
            "role": "user",
            "content": f"Explain what this terraform plan will do in plain English:\n{plan_output}"
        }]
    )
    return response.choices[0].message.content
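
Raw plan output can be long. If you export the plan as JSON (`terraform plan -out=tfplan` followed by `terraform show -json tfplan`), you can compute a compact summary first and send that alongside, or instead of, the full text. The sketch below counts planned actions from the `resource_changes` list of that JSON; the function name is illustrative.

```python
import json
from collections import Counter


def summarize_plan(plan_json: str) -> dict[str, int]:
    """Count planned actions per type from `terraform show -json` output."""
    plan = json.loads(plan_json)
    counts = Counter()
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if actions != ["no-op"]:
            # A replacement shows up as two actions, e.g. "delete+create".
            counts["+".join(actions)] += 1
    return dict(counts)
```

A summary such as `{"create": 3, "delete+create": 1}` gives the model a reliable headline to explain, so it does not have to count resources itself.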

Cost Comparison

Option | Cost for 1M tokens
OpenAI GPT-4o | ~$10–15
Claude 3.5 Sonnet | ~$9
LocalAI on t3.xlarge (CPU) | ~$0.15 (compute only)
LocalAI on g4dn.xlarge (GPU) | ~$0.53/hr amortized

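For high-volume internal tooling where output quality doesn't need to match GPT-4o, LocalAI on CPU nodes can be dramatically cheaper. The self-hosted figures depend on utilization: you pay for the node by the hour, so the cost per token is the hourly price divided by the tokens actually generated in that hour.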
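
To compare a self-hosted node with a per-token API price, convert the hourly cost into a cost per million tokens. The sketch below does that arithmetic, assuming the node generates tokens continuously; the function name is illustrative, and the inputs in the example are the g4dn.xlarge price and T4 speed quoted in this article.

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """Compute cost of generating 1M tokens on a node that is busy the whole time."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


# Inputs from this article: g4dn.xlarge at ~$0.53/hr, ~25 tok/sec on a T4.
print(round(cost_per_million_tokens(0.53, 25), 2))
```

Plug in your own instance price and measured throughput. An idle node costs the same per hour, so low utilization raises the effective cost per token.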


LocalAI gives you a privacy-first, cost-effective LLM deployment that works with all existing OpenAI-compatible code. For production deployments, check the LocalAI documentation for advanced model configuration and GPU acceleration setup.
