Deploy Stable Diffusion on Kubernetes with GPU Nodes (2026 Guide)

Run Stable Diffusion (SDXL + AUTOMATIC1111) on Kubernetes with GPU node pools, autoscaling, and an API endpoint. Step-by-step guide with EKS GPU nodes, persistent model storage, and Ingress setup.

Running Stable Diffusion locally is fine for experiments. Running it in production, with GPU autoscaling, a REST API, model persistence, and proper resource management, calls for an orchestrator such as Kubernetes. This guide deploys SDXL on EKS with GPU nodes.

Architecture

Client Request → Ingress → SD API Service → Stable Diffusion Pod
                                              │
                                              ├── GPU Node (g4dn.xlarge)
                                              ├── Persistent Volume (models ~10GB)
                                              └── NVIDIA CUDA 12.x

Stack:

EKS with GPU node group (g4dn.xlarge — NVIDIA T4, 16GB VRAM)
AUTOMATIC1111 (Stable Diffusion WebUI with API)
Karpenter or Cluster Autoscaler for GPU node scaling
EFS or EBS for persistent model storage
NVIDIA Device Plugin for GPU scheduling

Prerequisites

EKS cluster (or any Kubernetes with GPU nodes)
kubectl configured
Helm 3+
AWS CLI (if using EKS)

Step 1: Create GPU Node Group on EKS

bash

# Add a GPU node group to existing EKS cluster
eksctl create nodegroup \
  --cluster my-cluster \
  --region ap-south-1 \
  --name gpu-nodes \
  --node-type g4dn.xlarge \
  --nodes-min 0 \
  --nodes-max 3 \
  --nodes 0 \
  --asg-access \
  --managed \
  --node-labels "nvidia.com/gpu=true,workload=ml" \
  --node-taints "nvidia.com/gpu=true:NoSchedule"

Key settings:

--nodes 0 — starts with zero GPU nodes (cost savings)
--nodes-max 3 — scales up to 3 when needed
--node-taints — only pods that explicitly tolerate this taint run on GPU nodes
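
To make the taint setting concrete, here is a minimal Python sketch of how a NoSchedule taint and a pod toleration interact. It is a simplified model for illustration, not the scheduler's actual code; the `Taint`, `Toleration`, `tolerates`, and `can_schedule` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # an empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches the taint (simplified rules)."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        # An empty key combined with Exists tolerates every taint.
        return toleration.key == "" or toleration.key == taint.key
    return toleration.key == taint.key and toleration.value == taint.value


def can_schedule(taints, tolerations) -> bool:
    """A pod fits only if every NoSchedule taint is matched by some toleration."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
        if taint.effect == "NoSchedule"
    )


# The taint from the eksctl command and the toleration used by the SD pod.
gpu_taint = Taint("nvidia.com/gpu", "true", "NoSchedule")
sd_pod = [Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")]

print(can_schedule([gpu_taint], sd_pod))  # pod that carries the toleration
print(can_schedule([gpu_taint], []))      # ordinary pod without it
```

The first call reports that the pod fits and the second that it does not, which is why general workloads stay off the expensive GPU nodes.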

Step 2: Install NVIDIA Device Plugin

The NVIDIA Device Plugin lets Kubernetes see and schedule GPU resources.

bash

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
 
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --set tolerations[0].key=nvidia.com/gpu \
  --set tolerations[0].operator=Exists \
  --set tolerations[0].effect=NoSchedule

Verify GPU is visible to Kubernetes:

bash

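# Scale the node group up to one GPU node first (it starts at zero)
eksctl scale nodegroup --cluster my-cluster --region ap-south-1 --name gpu-nodes --nodes 1
kubectl get nodes -l nvidia.com/gpu=true
 
# Check GPU capacity (nvidia.com/gpu is listed after memory, so print enough lines)
kubectl describe node <gpu-node-name> | grep -A 8 "Capacity:"
# Should show: nvidia.com/gpu: 1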
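
The same check can be scripted. The sketch below parses the JSON printed by `kubectl get nodes -o json` and reports how many GPUs each node advertises as allocatable. The `gpu_allocatable` helper and the sample node names are made up for illustration.

```python
import json


def gpu_allocatable(node_list_json: str) -> dict[str, int]:
    """Map node name -> allocatable nvidia.com/gpu count (0 if not advertised)."""
    nodes = json.loads(node_list_json)["items"]
    return {
        node["metadata"]["name"]: int(
            node["status"].get("allocatable", {}).get("nvidia.com/gpu", "0")
        )
        for node in nodes
    }


# Trimmed, made-up sample of `kubectl get nodes -o json` output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "gpu-node-a"},
         "status": {"allocatable": {"cpu": "4", "nvidia.com/gpu": "1"}}},
        {"metadata": {"name": "cpu-node-b"},
         "status": {"allocatable": {"cpu": "2"}}},
    ]
})

print(gpu_allocatable(sample))
```

Against a real cluster, pass in the output of `subprocess.run(["kubectl", "get", "nodes", "-o", "json"], capture_output=True, text=True, check=True).stdout` instead of the sample. A GPU node that reports 0 usually means the device plugin pod is not running on it yet.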

Step 3: Create Persistent Storage for Models

The SDXL base checkpoint is about 6.5GB, so downloading it on every pod start is too slow. Store it on a PersistentVolume instead.

yaml

# pvc-models.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: sd-models-pvc
  namespace: ml
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: gp3  # Assumes a gp3 StorageClass (EBS CSI driver) exists in the cluster
  resources:
    requests:
      storage: 50Gi  # Models + outputs

bash

kubectl create namespace ml
kubectl apply -f pvc-models.yaml

Step 4: Download Models (Init Job)

Run a one-time Job to download models into the PVC:

yaml

# model-download-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sd-model-download
  namespace: ml
spec:
  template:
    spec:
      containers:
      - name: downloader
        image: python:3.11-slim
        command:
        - /bin/bash
        - -c
        - |
          pip install huggingface_hub
          python3 -c "
          from huggingface_hub import hf_hub_download
          # A1111 loads single-file checkpoints from --ckpt-dir, so fetch only that file
          hf_hub_download(
            repo_id='stabilityai/stable-diffusion-xl-base-1.0',
            filename='sd_xl_base_1.0.safetensors',
            local_dir='/models'
          )
          "
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: token
        volumeMounts:
        - name: models
          mountPath: /models
        resources:
          requests:
            memory: "2Gi"
            cpu: "500m"
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: sd-models-pvc
      restartPolicy: Never

bash

# Create HuggingFace token secret (get token from huggingface.co/settings/tokens)
kubectl create secret generic hf-secret \
  --namespace ml \
  --from-literal=token=hf_yourtoken
 
kubectl apply -f model-download-job.yaml
 
# Watch download progress
kubectl logs -n ml job/sd-model-download -f

Step 5: Deploy Stable Diffusion WebUI

We'll use AUTOMATIC1111 (A1111), which exposes a REST API when started with --api.

yaml

# sd-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stable-diffusion
  namespace: ml
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stable-diffusion
  template:
    metadata:
      labels:
        app: stable-diffusion
    spec:
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      nodeSelector:
        nvidia.com/gpu: "true"
      containers:
      - name: stable-diffusion
        image: universonic/stable-diffusion-webui:latest
        args:
        - --api
        - --nowebui
        - --listen
        - --port
        - "7860"
        - --ckpt-dir
        - /models
        - --skip-torch-cuda-test
        ports:
        - containerPort: 7860
          name: api
        env:
        - name: COMMANDLINE_ARGS
          value: "--api --nowebui --listen"
        resources:
          limits:
            nvidia.com/gpu: "1"    # Request 1 GPU
            memory: "12Gi"
            cpu: "4000m"
          requests:
            nvidia.com/gpu: "1"
            memory: "8Gi"
            cpu: "2000m"
        volumeMounts:
        - name: models
          mountPath: /models
        - name: outputs
          mountPath: /outputs
        readinessProbe:
          httpGet:
            path: /sdapi/v1/progress
            port: 7860
          initialDelaySeconds: 60
          periodSeconds: 15
          timeoutSeconds: 10
          failureThreshold: 10
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: sd-models-pvc
      - name: outputs
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: stable-diffusion
  namespace: ml
spec:
  selector:
    app: stable-diffusion
  ports:
  - port: 80
    targetPort: 7860
    name: api

bash

kubectl apply -f sd-deployment.yaml
 
# Watch pod startup (GPU node will scale up first — takes 3-5 min)
kubectl get pods -n ml -w
kubectl logs -n ml deploy/stable-diffusion -f
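
Because the first start includes node provisioning and model loading, a script that talks to the API should wait for it to answer. Here is a minimal polling sketch using the same /sdapi/v1/progress endpoint as the readiness probe. The `http_ok` and `wait_until_ready` helpers, the attempt count, and the delay are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url, attempts=40, delay=15.0, probe=http_ok, sleep=time.sleep) -> bool:
    """Poll url until probe succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return True
        if attempt < attempts:
            sleep(delay)  # no need to sleep after the final attempt
    return False
```

One way to try it before the Ingress exists is to run `kubectl port-forward -n ml svc/stable-diffusion 8080:80` in another terminal and call `wait_until_ready("http://localhost:8080/sdapi/v1/progress")`. The `probe` and `sleep` parameters exist only so the loop can be exercised without a live cluster.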

Step 6: Expose via Ingress

The manifest below assumes the ingress-nginx controller is already installed in the cluster. Its annotations raise the body-size and timeout limits, since generation requests are slow and responses carry base64-encoded images, and put basic auth in front of the API.

yaml

# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sd-ingress
  namespace: ml
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: sd-basic-auth
    nginx.ingress.kubernetes.io/auth-realm: "SD API"
spec:
  ingressClassName: nginx
  rules:
  - host: sd.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: stable-diffusion
            port:
              number: 80

bash

# Create basic auth for the API
htpasswd -c auth admin
kubectl create secret generic sd-basic-auth \
  --from-file=auth \
  --namespace ml
 
kubectl apply -f ingress.yaml

Step 7: Generate Images via API

Once running, use the AUTOMATIC1111 REST API:

python

import requests
import base64
from PIL import Image
import io
 
SD_API = "https://sd.yourdomain.com"
AUTH = ("admin", "yourpassword")
 
def generate_image(prompt: str, negative_prompt: str = "", steps: int = 20) -> Image.Image:
    payload = {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "steps": steps,
        "width": 1024,
        "height": 1024,
        "cfg_scale": 7,
        "sampler_name": "DPM++ 2M Karras",
    }
    
    response = requests.post(
        f"{SD_API}/sdapi/v1/txt2img",
        json=payload,
        auth=AUTH,
        timeout=120,
    )
    response.raise_for_status()
    
    image_data = response.json()["images"][0]
    return Image.open(io.BytesIO(base64.b64decode(image_data)))
 
 
# Generate an image
img = generate_image(
    prompt="A Kubernetes cluster visualized as a city, cyberpunk style, high detail",
    negative_prompt="blurry, low quality",
    steps=25,
)
img.save("output.png")
print("Image saved to output.png")
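
The response's "images" field is a list, so a request that asks for several images returns several entries. Below is a small sketch that writes every returned image to disk. It assumes each entry is a base64-encoded PNG, as in the example above; the `save_images` helper, the output directory, and the file naming are hypothetical.

```python
import base64
from pathlib import Path


def save_images(images_b64, out_dir="outputs", prefix="sd"):
    """Decode each base64 image string and write it as <prefix>-NN.png."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, encoded in enumerate(images_b64):
        # Drop an optional "data:image/png;base64," prefix if one is present.
        raw = base64.b64decode(encoded.split(",", 1)[-1])
        path = out / f"{prefix}-{index:02d}.png"
        path.write_bytes(raw)
        paths.append(path)
    return paths
```

With the client above, adding `"batch_size": 2` to the payload and calling `save_images(response.json()["images"])` would save both results. Larger batches need more VRAM, so keep the batch small on a T4.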

Step 8: Auto-Scale GPU Nodes to Zero

The biggest cost optimization: scale GPU nodes to zero when idle.

With Karpenter:

yaml

# karpenter-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
  template:
    metadata:
      labels:
        nvidia.com/gpu: "true"   # Matches the Deployment's nodeSelector
    spec:
      nodeClassRef:              # Assumes an EC2NodeClass named "default" already exists
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["g4dn.xlarge", "g4dn.2xlarge"]
      taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
  limits:
    nvidia.com/gpu: "10"
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m   # Scale down idle GPU nodes after 5 minutes

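With this config, Karpenter removes a GPU node about 5 minutes after its last pod is gone (for example, after the Deployment is scaled to zero replicas) and provisions a new one in roughly 4 minutes when a pending pod requests a GPU. A Deployment left at one replica keeps its node alive, so something still has to scale the workload down when it is idle.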
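
Karpenter only reclaims a node once it is empty, so the SD pod itself has to go away first. One possible approach is a small scheduled script that scales the Deployment to zero after a quiet period. The sketch below assumes you track the time of the last request yourself; the 15-minute threshold and the helper names are illustrative, not part of the setup above.

```python
import subprocess
import time


def is_idle(last_request_ts: float, now: float, idle_seconds: float = 900.0) -> bool:
    """True when no request has arrived for at least idle_seconds."""
    return (now - last_request_ts) >= idle_seconds


def scale_command(replicas: int, deployment: str = "stable-diffusion", namespace: str = "ml") -> list[str]:
    """Build the kubectl command that sets the Deployment's replica count."""
    return ["kubectl", "scale", f"deployment/{deployment}", "-n", namespace, f"--replicas={replicas}"]


def scale_down_if_idle(last_request_ts, now=None, idle_seconds=900.0, run=subprocess.run) -> bool:
    """Scale the Deployment to zero if it has been idle; return True if it did."""
    now = time.time() if now is None else now
    if not is_idle(last_request_ts, now, idle_seconds):
        return False
    run(scale_command(0), check=True)
    return True
```

Scaling back up is the reverse, `scale_command(1)`, after which the pending pod triggers a new GPU node. The first request after a scale-up has to wait for node provisioning plus model loading.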

Cost Estimate (AWS ap-south-1)

Config	Cost
g4dn.xlarge (1 T4 GPU)	~$0.526/hour
Running 8 hours/day	~$4.20/day, ~$126/month
Scale to zero nights/weekends	~$30–40/month (about 57–76 GPU-hours at the on-demand rate)

For workloads that can tolerate interruption, Spot instances cut the hourly rate further:

bash

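# Same labels and taints as Step 1, so the Deployment can schedule onto these nodes
eksctl create nodegroup \
  --cluster my-cluster \
  --region ap-south-1 \
  --name gpu-spot \
  --node-type g4dn.xlarge \
  --spot \
  --nodes-min 0 \
  --nodes-max 5 \
  --node-labels "nvidia.com/gpu=true,workload=ml" \
  --node-taints "nvidia.com/gpu=true:NoSchedule"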

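Spot g4dn.xlarge: ~$0.16/hour (about 70% cheaper than on-demand). The risk is interruption: AWS can reclaim a Spot node with two minutes' notice, which aborts any in-flight generation.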
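
The figures above are simple arithmetic and can be reproduced for your own usage pattern. This sketch uses the hourly rates quoted in this section and assumes a 30-day month; the helper names are hypothetical, and current AWS pricing should be checked before budgeting.

```python
ON_DEMAND_HOURLY = 0.526  # USD/hour, g4dn.xlarge figure quoted in the table
SPOT_HOURLY = 0.16        # USD/hour, approximate Spot figure quoted above


def monthly_cost(hourly_rate: float, hours_per_day: float, days: int = 30) -> float:
    """Cost of running one node for hours_per_day over a month of `days` days."""
    return hourly_rate * hours_per_day * days


def savings_fraction(baseline: float, discounted: float) -> float:
    """Fraction saved by paying `discounted` instead of `baseline`."""
    return 1 - discounted / baseline


def hours_within_budget(budget: float, hourly_rate: float) -> float:
    """How many GPU-hours a monthly budget buys at the given rate."""
    return budget / hourly_rate


print(f"8 h/day on-demand:  ${monthly_cost(ON_DEMAND_HOURLY, 8):.2f}/month")
print(f"24/7 on-demand:     ${monthly_cost(ON_DEMAND_HOURLY, 24):.2f}/month")
print(f"24/7 Spot:          ${monthly_cost(SPOT_HOURLY, 24):.2f}/month")
print(f"Spot saving:        {savings_fraction(ON_DEMAND_HOURLY, SPOT_HOURLY):.0%}")
print(f"GPU-hours for $35:  {hours_within_budget(35, ON_DEMAND_HOURLY):.0f}")
```

Storage and load balancer charges are not included, so treat the result as a lower bound on the monthly bill.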

Common Issues

Pod stuck in Pending:

bash

kubectl describe pod -n ml <pod-name>
# Look for: "0/3 nodes available: 3 Insufficient nvidia.com/gpu"
# → GPU nodes not yet scaled up, or NVIDIA plugin not installed

CUDA out of memory:

Reduce batch size or image resolution
Add the --medvram flag to the A1111 args on GPUs with around 8GB of VRAM (the T4 used here has 16GB)
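
On the client side, the first suggestion can be automated by retrying a failed request with a smaller payload. The sketch below is one possible shrinking policy, not something A1111 provides: it halves `batch_size` first, then scales width and height to 75% rounded down to a multiple of 64. The `shrink_payload` name and the 512-pixel floor are assumptions.

```python
from typing import Optional


def shrink_payload(payload: dict, min_side: int = 512) -> Optional[dict]:
    """Return a cheaper copy of a txt2img payload, or None if it cannot shrink further."""
    smaller = dict(payload)

    # Step 1: reduce the batch before touching image quality.
    batch = smaller.get("batch_size", 1)
    if batch > 1:
        smaller["batch_size"] = max(1, batch // 2)
        return smaller

    # Step 2: scale each side to 75%, rounded down to a multiple of 64.
    width, height = smaller["width"], smaller["height"]
    new_width = max(min_side, (width * 3 // 4) // 64 * 64)
    new_height = max(min_side, (height * 3 // 4) // 64 * 64)
    if (new_width, new_height) == (width, height):
        return None  # already at the floor
    smaller["width"], smaller["height"] = new_width, new_height
    return smaller


payload = {"prompt": "test", "width": 1024, "height": 1024, "batch_size": 4}
while payload is not None:
    print(payload["batch_size"], payload["width"], payload["height"])
    payload = shrink_payload(payload)
```

In a real client the loop would call the API at each step and stop at the first success. SDXL output quality tends to drop well below 1024x1024, so the floor is worth tuning.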

Model not loading:

bash

kubectl exec -n ml deploy/stable-diffusion -- ls /models/
# Verify models downloaded correctly
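
A directory listing does not show whether a checkpoint is complete. As a rough aid, the sketch below walks a directory and reports the size of every checkpoint-like file, so a truncated download stands out. The `find_checkpoints` helper is hypothetical, and running it inside the pod assumes the image ships a Python interpreter.

```python
from pathlib import Path

CHECKPOINT_SUFFIXES = {".safetensors", ".ckpt"}


def find_checkpoints(root: str) -> list[tuple[str, float]]:
    """Return (relative path, size in GiB) for each checkpoint-like file under root."""
    base = Path(root)
    found = []
    for path in sorted(base.rglob("*")):
        if path.is_file() and path.suffix in CHECKPOINT_SUFFIXES:
            size_gib = path.stat().st_size / 1024**3
            found.append((str(path.relative_to(base)), round(size_gib, 2)))
    return found
```

Calling `find_checkpoints("/models")` wherever the volume is mounted should list the SDXL base checkpoint at several GiB. An empty list, or a file of only a few MiB, points back to the download Job's logs.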

Stable Diffusion on Kubernetes gives you a scalable, cost-efficient image generation service that scales to zero when idle and handles burst capacity automatically. For production ML workloads, also check out the setup-kubeflow-ml-pipeline guide for full ML pipeline orchestration.

For learning GPU-accelerated Kubernetes, Kubernetes for ML Engineers on Udemy covers GPU scheduling and MLOps patterns end-to-end.
