Deploy Stable Diffusion on Kubernetes with GPU Nodes (2026 Guide)
Run Stable Diffusion (SDXL + AUTOMATIC1111) on Kubernetes with GPU node pools, autoscaling, and an API endpoint. Step-by-step guide with EKS GPU nodes, persistent model storage, and Ingress setup.
Running Stable Diffusion locally is fine for experiments. Running it in production — with GPU autoscaling, a REST API, model persistence, and proper resource management — requires Kubernetes. This guide deploys SDXL on EKS with GPU nodes.
Architecture
Client Request → Ingress → SD API Service → Stable Diffusion Pod
                                                    │
                                                    ├── GPU Node (g4dn.xlarge)
                                                    ├── Persistent Volume (models, ~10GB)
                                                    └── NVIDIA CUDA 12.x
Stack:
- EKS with GPU node group (g4dn.xlarge — NVIDIA T4, 16GB VRAM)
- AUTOMATIC1111 (Stable Diffusion WebUI with API)
- Karpenter or Cluster Autoscaler for GPU node scaling
- EFS or EBS for persistent model storage
- NVIDIA Device Plugin for GPU scheduling
Prerequisites
- EKS cluster (or any Kubernetes with GPU nodes)
- kubectl configured
- Helm 3+
- AWS CLI (if using EKS)
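Before starting, it can help to confirm the required CLIs are actually on your PATH. A minimal preflight sketch in Python (the tool list mirrors the prerequisites above; adjust it to your setup):

```python
import shutil

REQUIRED_TOOLS = ("kubectl", "helm", "eksctl", "aws")

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the required CLIs that are not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    gone = missing_tools()
    if gone:
        raise SystemExit(f"Install these before continuing: {', '.join(gone)}")
    print("All prerequisite CLIs found.")
```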
Step 1: Create GPU Node Group on EKS
# Add a GPU node group to existing EKS cluster
eksctl create nodegroup \
--cluster my-cluster \
--region ap-south-1 \
--name gpu-nodes \
--node-type g4dn.xlarge \
--nodes-min 0 \
--nodes-max 3 \
--nodes 0 \
--asg-access \
--managed \
--node-labels "nvidia.com/gpu=true,workload=ml" \
--node-taints "nvidia.com/gpu=true:NoSchedule"

Key settings:
- --nodes 0 — starts with zero GPU nodes (cost savings)
- --nodes-max 3 — scales up to 3 nodes when needed
- --node-taints — only pods that explicitly tolerate this taint run on GPU nodes
Step 2: Install NVIDIA Device Plugin
The NVIDIA Device Plugin lets Kubernetes see and schedule GPU resources.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
--namespace kube-system \
--set tolerations[0].key=nvidia.com/gpu \
--set tolerations[0].operator=Exists \
--set tolerations[0].effect=NoSchedule

Verify GPU is visible to Kubernetes:
# Scale up one GPU node first
kubectl get nodes -l nvidia.com/gpu=true
# Check GPU capacity
kubectl describe node <gpu-node-name> | grep -A 5 "Capacity:"
# Should show: nvidia.com/gpu: 1

Step 3: Create Persistent Storage for Models
SDXL models are ~6.5GB. Downloading them on every pod start is too slow. Use a PersistentVolume.
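To put a number on "too slow", here is a rough back-of-the-envelope for pull time at different sustained link speeds (the sizes and speeds are illustrative):

```python
def download_minutes(size_gb: float, link_mbps: float) -> float:
    """Approximate minutes to download size_gb at link_mbps (megabits/second)."""
    megabits = size_gb * 8 * 1000
    return megabits / link_mbps / 60

# ~6.5GB SDXL base model at various link speeds:
for mbps in (100, 500, 1000):
    print(f"{mbps} Mbit/s -> {download_minutes(6.5, mbps):.1f} min")
```

Even on a fast link, a cold download adds minutes (plus model-load time) to every pod start; a PersistentVolume pays that cost once.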
# pvc-models.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: sd-models-pvc
namespace: ml
spec:
accessModes:
- ReadWriteOnce
storageClassName: gp3
resources:
requests:
storage: 50Gi # Models + outputs

kubectl create namespace ml
kubectl apply -f pvc-models.yaml

Step 4: Download Models (Init Job)
Run a one-time Job to download models into the PVC:
# model-download-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: sd-model-download
namespace: ml
spec:
template:
spec:
containers:
- name: downloader
image: python:3.11-slim
command:
- /bin/bash
- -c
- |
pip install huggingface_hub
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(
repo_id='stabilityai/stable-diffusion-xl-base-1.0',
local_dir='/models/sdxl-base',
ignore_patterns=['*.fp32.safetensors']
)
"
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-secret
key: token
volumeMounts:
- name: models
mountPath: /models
resources:
requests:
memory: "2Gi"
cpu: "500m"
volumes:
- name: models
persistentVolumeClaim:
claimName: sd-models-pvc
restartPolicy: Never

# Create HuggingFace token secret (get token from huggingface.co/settings/tokens)
kubectl create secret generic hf-secret \
--namespace ml \
--from-literal=token=hf_yourtoken
kubectl apply -f model-download-job.yaml
# Watch download progress
kubectl logs -n ml job/sd-model-download -f

Step 5: Deploy Stable Diffusion WebUI
We'll use AUTOMATIC1111 (A1111), which exposes a REST API (the --nowebui flag below runs it in API-only mode).
# sd-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: stable-diffusion
namespace: ml
spec:
replicas: 1
selector:
matchLabels:
app: stable-diffusion
template:
metadata:
labels:
app: stable-diffusion
spec:
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
nodeSelector:
nvidia.com/gpu: "true"
containers:
- name: stable-diffusion
image: universonic/stable-diffusion-webui:latest
args:
- --api
- --nowebui
- --listen
- --port
- "7860"
- --ckpt-dir
- /models
- --skip-torch-cuda-test
ports:
- containerPort: 7860
name: api
env:
- name: COMMANDLINE_ARGS
value: "--api --nowebui --listen"
resources:
limits:
nvidia.com/gpu: "1" # Request 1 GPU
memory: "12Gi"
cpu: "4000m"
requests:
nvidia.com/gpu: "1"
memory: "8Gi"
cpu: "2000m"
volumeMounts:
- name: models
mountPath: /models
- name: outputs
mountPath: /outputs
readinessProbe:
httpGet:
path: /sdapi/v1/progress
port: 7860
initialDelaySeconds: 60
periodSeconds: 15
timeoutSeconds: 10
failureThreshold: 10
volumes:
- name: models
persistentVolumeClaim:
claimName: sd-models-pvc
- name: outputs
emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
name: stable-diffusion
namespace: ml
spec:
selector:
app: stable-diffusion
ports:
- port: 80
targetPort: 7860
name: api

kubectl apply -f sd-deployment.yaml
# Watch pod startup (GPU node will scale up first — takes 3-5 min)
kubectl get pods -n ml -w
kubectl logs -n ml deploy/stable-diffusion -f

Step 6: Expose via Ingress
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: sd-ingress
namespace: ml
annotations:
nginx.ingress.kubernetes.io/proxy-body-size: "100m"
nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
nginx.ingress.kubernetes.io/auth-type: basic
nginx.ingress.kubernetes.io/auth-secret: sd-basic-auth
nginx.ingress.kubernetes.io/auth-realm: "SD API"
spec:
ingressClassName: nginx
rules:
- host: sd.yourdomain.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: stable-diffusion
port:
number: 80

# Create basic auth for the API
htpasswd -c auth admin
kubectl create secret generic sd-basic-auth \
--from-file=auth \
--namespace ml
kubectl apply -f ingress.yaml

Step 7: Generate Images via API
Once running, use the AUTOMATIC1111 REST API:
import requests
import base64
from PIL import Image
import io
SD_API = "https://sd.yourdomain.com"
AUTH = ("admin", "yourpassword")
def generate_image(prompt: str, negative_prompt: str = "", steps: int = 20) -> Image.Image:
payload = {
"prompt": prompt,
"negative_prompt": negative_prompt,
"steps": steps,
"width": 1024,
"height": 1024,
"cfg_scale": 7,
"sampler_name": "DPM++ 2M Karras",
}
response = requests.post(
f"{SD_API}/sdapi/v1/txt2img",
json=payload,
auth=AUTH,
timeout=120,
)
response.raise_for_status()
image_data = response.json()["images"][0]
return Image.open(io.BytesIO(base64.b64decode(image_data)))
# Generate an image
img = generate_image(
prompt="A Kubernetes cluster visualized as a city, cyberpunk style, high detail",
negative_prompt="blurry, low quality",
steps=25,
)
img.save("output.png")
print("Image saved to output.png")

Step 8: Auto-Scale GPU Nodes to Zero
The biggest cost optimization: scale GPU nodes to zero when idle.
With Karpenter:
# karpenter-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: gpu-nodepool
spec:
template:
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand"]
- key: node.kubernetes.io/instance-type
operator: In
values: ["g4dn.xlarge", "g4dn.2xlarge"]
taints:
- key: nvidia.com/gpu
value: "true"
effect: NoSchedule
limits:
nvidia.com/gpu: "10"
disruption:
consolidationPolicy: WhenEmpty
consolidateAfter: 5m # Scale down idle GPU nodes after 5 minutes

With this config, GPU nodes scale to zero when the SD pod is deleted and spin up in ~4 minutes when a pod needs a GPU.
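One operational consequence of scale-to-zero: the first request after an idle period can fail while the node boots and the model loads. A client-side retry sketch (the helper name and backoff parameters are our own, not part of A1111; tune attempts and base_delay to cover your scale-up window):

```python
import time

def with_retry(fn, attempts=6, base_delay=2.0,
               retryable=(ConnectionError, TimeoutError)):
    """Call fn(), retrying with exponential backoff on retryable errors.

    Intended for cold starts, where requests fail until the GPU node
    is up and the model is loaded.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retryable:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Example with the generate_image() function from Step 7:
# img = with_retry(lambda: generate_image("a red fox, watercolor"))
```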
Cost Estimate (AWS ap-south-1)
| Config | Cost |
|---|---|
| g4dn.xlarge (1 T4 GPU) | ~$0.526/hour |
| Running 8 hours/day | ~$4.20/day, ~$126/month |
| Scale to zero nights/weekends | ~$30–40/month |
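The figures above come from simple arithmetic; a quick sketch for your own schedule (the hourly price is approximate and varies by region and over time):

```python
ON_DEMAND_HOURLY = 0.526  # g4dn.xlarge, ap-south-1 (approximate)

def monthly_cost(hourly: float, hours_per_day: float, days: int = 30) -> float:
    """Estimated monthly cost for one node running hours_per_day each day."""
    return hourly * hours_per_day * days

print(f"8h/day: ${monthly_cost(ON_DEMAND_HOURLY, 8):.0f}/month")
print(f"24/7:   ${monthly_cost(ON_DEMAND_HOURLY, 24):.0f}/month")
```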
For 24/7 production: use Spot instances for non-critical workloads:
eksctl create nodegroup \
--cluster my-cluster \
--name gpu-spot \
--node-type g4dn.xlarge \
--spot \
--nodes-min 0 \
--nodes-max 5

Spot g4dn.xlarge: ~$0.16/hour (~70% cheaper). Risk: interruption.
Common Issues
Pod stuck in Pending:
kubectl describe pod -n ml <pod-name>
# Look for: "0/3 nodes available: 3 Insufficient nvidia.com/gpu"
# → GPU nodes not yet scaled up, or NVIDIA plugin not installed

CUDA out of memory:
- Reduce batch size or image resolution
- Use the --medvram flag for A1111 on 8GB VRAM GPUs
Model not loading:
kubectl exec -n ml deploy/stable-diffusion -- ls /models/
# Verify models downloaded correctly

Stable Diffusion on Kubernetes gives you a cost-efficient image generation service that scales to zero when idle and absorbs burst traffic automatically. For full ML pipeline orchestration in production, see the setup-kubeflow-ml-pipeline guide.
For learning GPU-accelerated Kubernetes, Kubernetes for ML Engineers on Udemy covers GPU scheduling and MLOps patterns end-to-end.