Deploy LocalAI on Kubernetes — Self-Hosted LLM API Without GPU (2026)

Run LocalAI on Kubernetes to get an OpenAI-compatible API endpoint using CPU-only nodes. Deploy Llama 3, Mistral, or Phi-3 locally with no API costs, no data leaving your cluster, and full OpenAI SDK compatibility.

DevOpsBoys · Apr 26, 2026 · 5 min read

You want to run LLMs in your Kubernetes cluster but don't have GPU nodes — or can't send data to OpenAI for compliance reasons. LocalAI gives you an OpenAI-compatible REST API running on CPU-only nodes, supporting Llama 3, Mistral, Phi-3, and 50+ other models.

Your existing code that uses the OpenAI SDK works without any changes — just swap the base URL.
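If your code already reads the standard OpenAI environment variables, the swap can happen without touching the source at all. The OpenAI Python SDK (v1+) picks up OPENAI_BASE_URL and OPENAI_API_KEY from the environment; the hostname below is a placeholder for the Ingress we set up in Step 5:

bash
# Point any OpenAI-SDK-based tool at your cluster instead of api.openai.com
export OPENAI_BASE_URL="http://llm.internal.yourdomain.com/v1"
export OPENAI_API_KEY="not-needed"   # LocalAI ignores the key unless you configure auth
python3 your_existing_script.py      # hypothetical script, runs unchanged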


What LocalAI Is

LocalAI is a drop-in replacement for the OpenAI API that runs locally. It supports:

  • Chat completions (/v1/chat/completions)
  • Text completions (/v1/completions)
  • Embeddings (/v1/embeddings)
  • Image generation (/v1/images/generations)
  • Text-to-speech, speech-to-text

No GPU required — runs on standard CPU nodes (slower, but works). GPU nodes significantly improve speed but aren't mandatory.
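Once it's deployed (Steps 4-5 below), the stock /v1/models endpoint is a quick way to confirm what's wired up, exactly as with OpenAI:

bash
# List the models LocalAI has loaded (hostname comes from the Ingress in Step 5)
curl http://llm.internal.yourdomain.com/v1/models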


Hardware Requirements

| Model | RAM needed | Speed (CPU) | Speed (GPU, T4) |
|---|---|---|---|
| Phi-3 Mini (3.8B) | 4GB | ~5 tok/sec | ~40 tok/sec |
| Mistral 7B (q4) | 8GB | ~2 tok/sec | ~25 tok/sec |
| Llama 3 8B (q4) | 8GB | ~2 tok/sec | ~25 tok/sec |
| Llama 3 70B (q4) | 48GB | Too slow | Needs A100 |

For CPU-only clusters, start with Phi-3 Mini or a q4_K_M-quantized Mistral 7B; both are fast enough for internal tooling.
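The RAM column is roughly quantized weight size plus inference overhead. As a rule of thumb (an approximation, not a LocalAI-specific formula): parameters × bits-per-weight ÷ 8, plus a 1-2GB allowance for the KV cache and runtime:

bash
# Mistral 7B at q4 is ~4.5 effective bits per weight:
python3 -c "print(f'{7e9 * 4.5 / 8 / 2**30:.1f} GiB of weights')"   # ~3.7 GiB, hence the 8GB row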


Step 1: Create Namespace and Storage

bash
kubectl create namespace localai

yaml
# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: localai-models
  namespace: localai
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: gp3  # adjust to your cluster's StorageClass
  resources:
    requests:
      storage: 30Gi  # Models are 4-8GB each

bash
kubectl apply -f pvc.yaml
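Confirm the claim before moving on. With a WaitForFirstConsumer StorageClass it may sit in Pending until the first pod mounts it, which is fine:

bash
kubectl get pvc -n localai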

Step 2: Create Model Configuration

LocalAI uses YAML config files to define models. Create a ConfigMap:

yaml
# models-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: localai-models-config
  namespace: localai
data:
  phi-3-mini.yaml: |
    name: phi-3-mini
    backend: llama
    parameters:
      model: Phi-3-mini-4k-instruct-q4.gguf  # must match the filename the download job writes
      context_size: 4096
      threads: 4
      f16: true
    template:
      chat_message: |
        <|user|>
        {{.Input}}<|end|>
        <|assistant|>
 
  mistral-7b.yaml: |
    name: mistral-7b
    backend: llama
    parameters:
      model: Mistral-7B-Instruct-v0.3.Q4_K_M.gguf  # must match the filename the download job writes
      context_size: 8192
      threads: 4
      f16: true
    template:
      chat_message: |
        [INST] {{.Input}} [/INST]
 
  text-embedding.yaml: |
    name: text-embedding-ada-002
    backend: bert-embeddings
    parameters:
      model: all-MiniLM-L6-v2.bin  # not fetched by the download job below; place it in /models separately
    embeddings: true

bash
kubectl apply -f models-config.yaml

Step 3: Download Models (Init Job)

yaml
# download-models-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: download-models
  namespace: localai
spec:
  template:
    spec:
      containers:
      - name: downloader
        image: python:3.12-slim
        command:
        - /bin/bash
        - -c
        - |
          pip install huggingface_hub -q
          python3 -c "
          from huggingface_hub import hf_hub_download
          import shutil, os
 
          models = [
            ('microsoft/Phi-3-mini-4k-instruct-gguf', 'Phi-3-mini-4k-instruct-q4.gguf'),
            ('MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF', 'Mistral-7B-Instruct-v0.3.Q4_K_M.gguf'),  # TheBloke has no v0.3 GGUF; verify repo/filename on Hugging Face
          ]
          
          for repo, filename in models:
              dest = f'/models/{filename}'
              if os.path.exists(dest):
                  # Skip files already on the PVC so job retries don't re-download GBs
                  print(f'Already present: {dest}')
                  continue
              print(f'Downloading {filename}...')
              path = hf_hub_download(repo_id=repo, filename=filename, cache_dir='/tmp')
              shutil.copy(path, dest)
              print(f'Done: {dest}')
          "
        volumeMounts:
        - name: models
          mountPath: /models
        resources:
          requests:
            memory: "2Gi"
            cpu: "500m"
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: localai-models
      restartPolicy: Never

bash
kubectl apply -f download-models-job.yaml
kubectl logs -n localai job/download-models -f
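The models are several gigabytes each, so this takes a while. If you'd rather block until it's done than tail logs:

bash
# Returns once the job succeeds (or errors out at the timeout)
kubectl wait --for=condition=complete -n localai job/download-models --timeout=60m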

Step 4: Deploy LocalAI

yaml
# localai-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: localai
  namespace: localai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: localai
  template:
    metadata:
      labels:
        app: localai
    spec:
      containers:
      - name: localai
        image: localai/localai:latest-aio-cpu   # CPU-only image
        ports:
        - containerPort: 8080
        env:
        - name: MODELS_PATH
          value: /models
        - name: CONTEXT_SIZE
          value: "4096"
        - name: THREADS
          value: "4"
        - name: DEBUG
          value: "false"
        volumeMounts:
        - name: models
          mountPath: /models
        - name: config
          mountPath: /models/config
        resources:
          requests:
            memory: "6Gi"
            cpu: "2000m"
          limits:
            memory: "10Gi"
            cpu: "4000m"
        readinessProbe:
          httpGet:
            path: /readyz
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 15
          timeoutSeconds: 5
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 30
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: localai-models
      - name: config
        configMap:
          name: localai-models-config
---
apiVersion: v1
kind: Service
metadata:
  name: localai
  namespace: localai
spec:
  selector:
    app: localai
  ports:
  - port: 80
    targetPort: 8080

bash
kubectl apply -f localai-deployment.yaml
 
# Watch pod startup (may take 2-3 minutes for model loading)
kubectl logs -n localai deploy/localai -f
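Before putting an Ingress in front of it, you can smoke-test the Service straight from your workstation:

bash
# In one terminal: forward the Service locally
kubectl port-forward -n localai svc/localai 8080:80

# In another terminal: confirm the configured models are loaded
curl http://localhost:8080/v1/models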

Step 5: Expose with Ingress

yaml
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: localai-ingress
  namespace: localai
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
  - host: llm.internal.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: localai
            port:
              number: 80
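As deployed, the API is unauthenticated, so keep this hostname on internal DNS only. If you want a gate at the edge anyway, one option is NGINX Ingress basic auth; the secret name below is illustrative:

yaml
# Extra metadata.annotations for the Ingress above
nginx.ingress.kubernetes.io/auth-type: basic
nginx.ingress.kubernetes.io/auth-secret: localai-basic-auth
nginx.ingress.kubernetes.io/auth-realm: "Authentication Required"

The referenced secret is a standard htpasswd file:

bash
# Create the htpasswd secret the annotations point at
htpasswd -c auth devops
kubectl create secret generic localai-basic-auth -n localai --from-file=auth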

Step 6: Use It — OpenAI SDK Compatible

python
from openai import OpenAI
 
# Just change the base_url — everything else stays the same
client = OpenAI(
    api_key="not-needed",   # LocalAI doesn't require auth by default
    base_url="http://llm.internal.yourdomain.com/v1"
)
 
# Chat completion — identical to OpenAI
response = client.chat.completions.create(
    model="phi-3-mini",
    messages=[
        {"role": "system", "content": "You are a DevOps expert."},
        {"role": "user", "content": "Explain what a Kubernetes PodDisruptionBudget does in 2 sentences."}
    ],
    max_tokens=200,
)
print(response.choices[0].message.content)

python
# Embeddings — also OpenAI-compatible
embedding = client.embeddings.create(
    model="text-embedding-ada-002",  # Your local embedding model
    input="kubernetes pod scheduling"
)
print(embedding.data[0].embedding[:5])  # First 5 dimensions

bash
# Test directly with curl
curl http://llm.internal.yourdomain.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-3-mini",
    "messages": [{"role": "user", "content": "What is a Kubernetes namespace?"}]
  }'
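Streaming works through the same SDK surface as well (LocalAI's chat endpoint supports it, though token throughput on CPU is modest):

python
# Stream tokens as they're generated instead of waiting for the full reply
stream = client.chat.completions.create(
    model="phi-3-mini",
    messages=[{"role": "user", "content": "Give me three kubectl debugging tips."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)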

Scaling for Multiple Concurrent Users

LocalAI processes one request at a time per replica by default, so concurrency comes from running more replicas. One caveat: the models PVC above is ReadWriteOnce, so multiple replicas can only share it if they land on the same node; for true horizontal scaling use ReadWriteMany storage (EFS, for example) or give each replica its own copy of the models. With that sorted:

yaml
# Scale up replicas
spec:
  replicas: 3   # 3 instances = 3 concurrent requests
 
# Or use HPA based on CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: localai-hpa
  namespace: localai
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: localai
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Practical Use Cases in DevOps

1. Internal documentation chatbot:

python
# Answer questions from your runbooks (runbook_text holds the runbook contents, loaded elsewhere)
response = client.chat.completions.create(
    model="mistral-7b",
    messages=[
        {"role": "system", "content": f"Answer based on this runbook:\n{runbook_text}"},
        {"role": "user", "content": "How do I rotate the database password?"}
    ]
)

2. Log analysis:

python
def analyze_error_log(log_excerpt: str) -> str:
    response = client.chat.completions.create(
        model="phi-3-mini",
        messages=[{
            "role": "user",
            "content": f"Analyze this error log and suggest the most likely cause:\n{log_excerpt}"
        }]
    )
    return response.choices[0].message.content

3. Terraform plan explainer:

python
def explain_terraform_plan(plan_output: str) -> str:
    response = client.chat.completions.create(
        model="mistral-7b",
        messages=[{
            "role": "user",
            "content": f"Explain what this terraform plan will do in plain English:\n{plan_output}"
        }]
    )
    return response.choices[0].message.content

Cost Comparison

| Option | Cost for 1M tokens |
|---|---|
| OpenAI GPT-4o | ~$10–15 |
| Claude 3.5 Sonnet | ~$9 |
| LocalAI on t3.xlarge (CPU) | ~$0.15 (compute only) |
| LocalAI on g4dn.xlarge (GPU) | ~$0.53/hr instance cost, amortized |

For high-volume internal tooling where output quality doesn't need to match GPT-4o, LocalAI on CPU nodes is dramatically cheaper.


LocalAI gives you a privacy-first, cost-effective LLM deployment that works with all existing OpenAI-compatible code. For production deployments, check the LocalAI documentation for advanced model configuration and GPU acceleration setup.
