
Deploy Stable Diffusion on Kubernetes with GPU Nodes (2026 Guide)

Run Stable Diffusion (SDXL + AUTOMATIC1111) on Kubernetes with GPU node pools, autoscaling, and an API endpoint. Step-by-step guide with EKS GPU nodes, persistent model storage, and Ingress setup.

DevOpsBoys · Apr 24, 2026 · 6 min read

Running Stable Diffusion locally is fine for experiments. Running it in production — with GPU autoscaling, a REST API, model persistence, and proper resource management — requires Kubernetes. This guide deploys SDXL on EKS with GPU nodes.


Architecture

Client Request → Ingress → SD API Service → Stable Diffusion Pod
                                              │
                                              ├── GPU Node (g4dn.xlarge)
                                              ├── Persistent Volume (models ~10GB)
                                              └── NVIDIA CUDA 12.x

Stack:

  • EKS with GPU node group (g4dn.xlarge — NVIDIA T4, 16GB VRAM)
  • AUTOMATIC1111 (Stable Diffusion WebUI with API)
  • Karpenter or Cluster Autoscaler for GPU node scaling
  • EFS or EBS for persistent model storage
  • NVIDIA Device Plugin for GPU scheduling

Prerequisites

  • EKS cluster (or any Kubernetes with GPU nodes)
  • kubectl configured
  • Helm 3+
  • AWS CLI (if using EKS)

Step 1: Create GPU Node Group on EKS

bash
# Add a GPU node group to existing EKS cluster
eksctl create nodegroup \
  --cluster my-cluster \
  --region ap-south-1 \
  --name gpu-nodes \
  --node-type g4dn.xlarge \
  --nodes-min 0 \
  --nodes-max 3 \
  --nodes 0 \
  --asg-access \
  --managed \
  --node-labels "nvidia.com/gpu=true,workload=ml" \
  --node-taints "nvidia.com/gpu=true:NoSchedule"

Key settings:

  • --nodes 0 — starts with zero GPU nodes (cost savings)
  • --nodes-max 3 — scales up to 3 when needed
  • --node-taints — only pods that explicitly tolerate this taint run on GPU nodes
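The taint/toleration matching described above can be sketched as a simplified scheduling predicate. This is an illustrative model only, not the real kube-scheduler logic (which also handles `PreferNoSchedule`, `NoExecute`, and `tolerationSeconds`):

```python
# Simplified model of how a NoSchedule taint filters pods
# (illustrative; not the actual kube-scheduler implementation).

def tolerates(taint: dict, tolerations: list) -> bool:
    """A pod tolerates a taint if some toleration matches its key and effect."""
    for t in tolerations:
        if t.get("key") == taint["key"] and t.get("effect", taint["effect"]) == taint["effect"]:
            if t.get("operator", "Equal") == "Exists" or t.get("value") == taint.get("value"):
                return True
    return False

gpu_taint = {"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}

# The SD pod in Step 5 carries this toleration; an ordinary web pod does not.
sd_pod_tolerations = [{"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}]
web_pod_tolerations = []

print(tolerates(gpu_taint, sd_pod_tolerations))   # True  -> may land on GPU nodes
print(tolerates(gpu_taint, web_pod_tolerations))  # False -> kept off GPU nodes
```

This is why the taint alone keeps cheap workloads off expensive GPU nodes: everything without an explicit toleration is filtered out at scheduling time.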

Step 2: Install NVIDIA Device Plugin

The NVIDIA Device Plugin lets Kubernetes see and schedule GPU resources.

bash
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
 
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --set tolerations[0].key=nvidia.com/gpu \
  --set tolerations[0].operator=Exists \
  --set tolerations[0].effect=NoSchedule

Verify GPU is visible to Kubernetes:

bash
# Scale up one GPU node first
kubectl get nodes -l nvidia.com/gpu=true
 
# Check GPU capacity
kubectl describe node <gpu-node-name> | grep -A 5 "Capacity:"
# Should show: nvidia.com/gpu: 1

Step 3: Create Persistent Storage for Models

SDXL models are ~6.5GB. Downloading them on every pod start is too slow. Use a PersistentVolume.

yaml
# pvc-models.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: sd-models-pvc
  namespace: ml
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 50Gi  # Models + outputs
bash
kubectl create namespace ml
kubectl apply -f pvc-models.yaml

Step 4: Download Models (Init Job)

Run a one-time Job to download models into the PVC:

yaml
# model-download-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sd-model-download
  namespace: ml
spec:
  template:
    spec:
      containers:
      - name: downloader
        image: python:3.11-slim
        command:
        - /bin/bash
        - -c
        - |
          pip install huggingface_hub
          python3 -c "
          from huggingface_hub import snapshot_download
          snapshot_download(
            repo_id='stabilityai/stable-diffusion-xl-base-1.0',
            local_dir='/models/sdxl-base',
            ignore_patterns=['*.fp32.safetensors']
          )
          "
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: token
        volumeMounts:
        - name: models
          mountPath: /models
        resources:
          requests:
            memory: "2Gi"
            cpu: "500m"
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: sd-models-pvc
      restartPolicy: Never
bash
# Create HuggingFace token secret (get token from huggingface.co/settings/tokens)
kubectl create secret generic hf-secret \
  --namespace ml \
  --from-literal=token=hf_yourtoken
 
kubectl apply -f model-download-job.yaml
 
# Watch download progress
kubectl logs -n ml job/sd-model-download -f

Step 5: Deploy Stable Diffusion WebUI

We'll use AUTOMATIC1111 (A1111), which exposes a REST API.

yaml
# sd-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stable-diffusion
  namespace: ml
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stable-diffusion
  template:
    metadata:
      labels:
        app: stable-diffusion
    spec:
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      nodeSelector:
        nvidia.com/gpu: "true"
      containers:
      - name: stable-diffusion
        image: universonic/stable-diffusion-webui:latest
        args:
        - --api
        - --nowebui
        - --listen
        - --port
        - "7860"
        - --ckpt-dir
        - /models
        - --skip-torch-cuda-test
        ports:
        - containerPort: 7860
          name: api
        env:
        - name: COMMANDLINE_ARGS
          value: "--api --nowebui --listen"
        resources:
          limits:
            nvidia.com/gpu: "1"    # Request 1 GPU
            memory: "12Gi"
            cpu: "4000m"
          requests:
            nvidia.com/gpu: "1"
            memory: "8Gi"
            cpu: "2000m"
        volumeMounts:
        - name: models
          mountPath: /models
        - name: outputs
          mountPath: /outputs
        readinessProbe:
          httpGet:
            path: /sdapi/v1/progress
            port: 7860
          initialDelaySeconds: 60
          periodSeconds: 15
          timeoutSeconds: 10
          failureThreshold: 10
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: sd-models-pvc
      - name: outputs
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: stable-diffusion
  namespace: ml
spec:
  selector:
    app: stable-diffusion
  ports:
  - port: 80
    targetPort: 7860
    name: api
bash
kubectl apply -f sd-deployment.yaml
 
# Watch pod startup (GPU node will scale up first — takes 3-5 min)
kubectl get pods -n ml -w
kubectl logs -n ml deploy/stable-diffusion -f

Step 6: Expose via Ingress

yaml
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sd-ingress
  namespace: ml
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: sd-basic-auth
    nginx.ingress.kubernetes.io/auth-realm: "SD API"
spec:
  ingressClassName: nginx
  rules:
  - host: sd.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: stable-diffusion
            port:
              number: 80
bash
# Create basic auth for the API
htpasswd -c auth admin
kubectl create secret generic sd-basic-auth \
  --from-file=auth \
  --namespace ml
 
kubectl apply -f ingress.yaml

Step 7: Generate Images via API

Once running, use the AUTOMATIC1111 REST API:

python
import requests
import base64
from PIL import Image
import io
 
SD_API = "https://sd.yourdomain.com"
AUTH = ("admin", "yourpassword")
 
def generate_image(prompt: str, negative_prompt: str = "", steps: int = 20) -> Image.Image:
    payload = {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "steps": steps,
        "width": 1024,
        "height": 1024,
        "cfg_scale": 7,
        "sampler_name": "DPM++ 2M Karras",
    }
    
    response = requests.post(
        f"{SD_API}/sdapi/v1/txt2img",
        json=payload,
        auth=AUTH,
        timeout=120,
    )
    response.raise_for_status()
    
    image_data = response.json()["images"][0]
    return Image.open(io.BytesIO(base64.b64decode(image_data)))
 
 
# Generate an image
img = generate_image(
    prompt="A Kubernetes cluster visualized as a city, cyberpunk style, high detail",
    negative_prompt="blurry, low quality",
    steps=25,
)
img.save("output.png")
print("Image saved to output.png")
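Before sending a request, it's worth validating the payload client-side: latent diffusion downsamples images by 8x, so width and height must be multiples of 8, and excessive step counts just burn GPU time. A hypothetical helper (the function name and the clamp range are my own, not part of the A1111 API) might look like:

```python
# Hypothetical payload builder (not part of the A1111 API): validates
# dimensions and clamps steps before the request ever reaches the GPU.

def build_txt2img_payload(prompt: str, width: int = 1024, height: int = 1024,
                          steps: int = 20) -> dict:
    # SD's latent space is 1/8 the pixel resolution, so dims must divide by 8.
    if width % 8 or height % 8:
        raise ValueError("width and height must be multiples of 8")
    # Clamp steps to a sane range; diminishing returns well before 150.
    steps = max(1, min(steps, 150))
    return {
        "prompt": prompt,
        "steps": steps,
        "width": width,
        "height": height,
        "cfg_scale": 7,
        "sampler_name": "DPM++ 2M Karras",
    }

payload = build_txt2img_payload("a test prompt", steps=500)
print(payload["steps"])  # 150 (clamped)
```

Failing fast on a bad payload in the client is much cheaper than tying up the single GPU replica with a doomed request.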

Step 8: Auto-Scale GPU Nodes to Zero

The biggest cost optimization: scale GPU nodes to zero when idle.

With Karpenter:

yaml
# karpenter-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["g4dn.xlarge", "g4dn.2xlarge"]
      taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
  limits:
    nvidia.com/gpu: "10"
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m   # Scale down idle GPU nodes after 5 minutes

With this config, GPU nodes scale to zero when the SD pod is deleted and spin up in ~4 minutes when a pod needs a GPU.


Cost Estimate (AWS ap-south-1)

  • g4dn.xlarge (1 T4 GPU): ~$0.526/hour
  • Running 8 hours/day: ~$4.20/day, ~$126/month
  • Scale to zero nights/weekends: ~$30–40/month
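The monthly figures above come straight from the hourly rate; a quick sketch reproduces them (rates are the on-demand and typical Spot prices quoted in this article — verify current AWS pricing before budgeting):

```python
# Back-of-envelope cost model for the estimates above.
HOURLY_ON_DEMAND = 0.526   # g4dn.xlarge, ap-south-1 on-demand
HOURLY_SPOT = 0.16         # typical Spot price; fluctuates

def monthly_cost(hourly_rate: float, hours_per_day: float, days: int = 30) -> float:
    """Monthly cost in USD for one node running hours_per_day each day."""
    return round(hourly_rate * hours_per_day * days, 2)

print(monthly_cost(HOURLY_ON_DEMAND, 8))    # 126.24 -> the ~$126/month figure
print(monthly_cost(HOURLY_ON_DEMAND, 24))   # 378.72 -> 24/7 on-demand
print(monthly_cost(HOURLY_SPOT, 24))        # 115.2  -> 24/7 on Spot
```

Note that 24/7 Spot costs less than 8 hours/day on-demand — which is why the Spot node group below is worth the interruption risk for non-critical workloads.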

For 24/7 production: use Spot instances for non-critical workloads:

bash
eksctl create nodegroup \
  --cluster my-cluster \
  --name gpu-spot \
  --node-type g4dn.xlarge \
  --spot \
  --nodes-min 0 \
  --nodes-max 5

Spot g4dn.xlarge: ~$0.16/hour (roughly 70% cheaper). The trade-off: Spot instances can be reclaimed by AWS with only a two-minute warning.


Common Issues

Pod stuck in Pending:

bash
kubectl describe pod -n ml <pod-name>
# Look for: "0/3 nodes available: 3 Insufficient nvidia.com/gpu"
# → GPU nodes not yet scaled up, or NVIDIA plugin not installed

CUDA out of memory:

  • Reduce batch size or image resolution
  • Use --medvram flag for A1111 on 8GB VRAM GPUs

Model not loading:

bash
kubectl exec -n ml deploy/stable-diffusion -- ls /models/
# Verify models downloaded correctly

Stable Diffusion on Kubernetes gives you a cost-efficient image generation service that scales to zero when idle and absorbs burst traffic automatically. For production ML workloads, also check out the setup-kubeflow-ml-pipeline guide for full ML pipeline orchestration.

For learning GPU-accelerated Kubernetes, Kubernetes for ML Engineers on Udemy covers GPU scheduling and MLOps patterns end-to-end.
