Deploy Stable Diffusion on Kubernetes with GPU Nodes (2026 Guide)
Run Stable Diffusion (SDXL + AUTOMATIC1111) on Kubernetes with GPU node pools, autoscaling, and an API endpoint. Step-by-step guide with EKS GPU nodes, persistent model storage, and Ingress setup.
Running Stable Diffusion locally is fine for experiments. Running it in production, with GPU autoscaling, a REST API, model persistence, and proper resource management, is where Kubernetes earns its place. This guide deploys SDXL on EKS with GPU nodes.
Architecture
Client Request → Ingress → SD API Service → Stable Diffusion Pod
                                            │
                                            ├── GPU Node (g4dn.xlarge)
                                            ├── Persistent Volume (models ~10GB)
                                            └── NVIDIA CUDA 12.x
Stack:
- EKS with GPU node group (g4dn.xlarge — NVIDIA T4, 16GB VRAM)
- AUTOMATIC1111 (Stable Diffusion WebUI with API)
- Karpenter or Cluster Autoscaler for GPU node scaling
- EFS or EBS for persistent model storage
- NVIDIA Device Plugin for GPU scheduling
Prerequisites
- EKS cluster (or any Kubernetes with GPU nodes)
- kubectl configured
- Helm 3+
- AWS CLI and eksctl (if using EKS)
Step 1: Create GPU Node Group on EKS
# Add a GPU node group to an existing EKS cluster
eksctl create nodegroup \
  --cluster my-cluster \
  --region ap-south-1 \
  --name gpu-nodes \
  --node-type g4dn.xlarge \
  --nodes-min 0 \
  --nodes-max 3 \
  --nodes 0 \
  --asg-access \
  --managed \
  --node-labels "nvidia.com/gpu=true,workload=ml" \
  --node-taints "nvidia.com/gpu=true:NoSchedule"

Key settings:
- --nodes 0: starts with zero GPU nodes (cost savings)
- --nodes-max 3: scales up to 3 when needed
- --node-taints: only pods that explicitly tolerate this taint run on GPU nodes
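The taint on the node group and the toleration on the pod work as a pair. The sketch below is a simplified Python model of that matching rule, meant only to illustrate why an ordinary pod stays off the GPU nodes. The class and function names are made up for this example, and only NoSchedule taints are considered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with operator Exists matches every taint key
    operator: str = "Equal"  # Kubernetes defaults to Equal
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches the taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        return toleration.key in ("", taint.key)
    # Operator Equal: key and value must both match
    return toleration.key == taint.key and toleration.value == taint.value


def can_schedule(tolerations: list[Toleration], taints: list[Taint]) -> bool:
    """Simplified check: every NoSchedule taint must be tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
        if taint.effect == "NoSchedule"
    )


gpu_taint = Taint(key="nvidia.com/gpu", value="true", effect="NoSchedule")
sd_pod = [Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")]
web_pod: list[Toleration] = []

print(can_schedule(sd_pod, [gpu_taint]))   # True
print(can_schedule(web_pod, [gpu_taint]))  # False
```

A toleration only permits scheduling onto the tainted node; it does not force it. Sending the pod to a GPU node still needs a node selector or affinity on the GPU label.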
Step 2: Install NVIDIA Device Plugin
The NVIDIA Device Plugin lets Kubernetes see and schedule GPU resources.
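Once the plugin is running on a GPU node, the node reports nvidia.com/gpu among its allocatable resources. As a rough sketch, the Python below reads that field from the JSON that kubectl get nodes -o json prints. The sample document is hypothetical and trimmed to the fields the function uses.

```python
import json


def gpu_allocatable(nodes_json: str) -> dict[str, int]:
    """Map node name -> allocatable nvidia.com/gpu count from `kubectl get nodes -o json`."""
    nodes = json.loads(nodes_json)["items"]
    result = {}
    for node in nodes:
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Nodes without the device plugin simply lack the key
        result[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result


# Hypothetical, trimmed-down output for two nodes
sample = json.dumps({
    "items": [
        {"metadata": {"name": "gpu-node-1"},
         "status": {"allocatable": {"cpu": "4", "nvidia.com/gpu": "1"}}},
        {"metadata": {"name": "cpu-node-1"},
         "status": {"allocatable": {"cpu": "2"}}},
    ]
})
print(gpu_allocatable(sample))  # {'gpu-node-1': 1, 'cpu-node-1': 0}
```

A count of 0 on a GPU instance usually means the plugin pod has not started there yet.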
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
# Quote the --set values so the shell does not treat [0] as a glob pattern
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --set "tolerations[0].key=nvidia.com/gpu" \
  --set "tolerations[0].operator=Exists" \
  --set "tolerations[0].effect=NoSchedule"

Verify the GPU is visible to Kubernetes:
# Scale up one GPU node first (the node group starts at zero), then list it
kubectl get nodes -l nvidia.com/gpu=true
# Check GPU capacity (-A 10 so the nvidia.com/gpu line is not cut off)
kubectl describe node <gpu-node-name> | grep -A 10 "Capacity:"
# Should show: nvidia.com/gpu: 1

Step 3: Create Persistent Storage for Models
SDXL models are ~6.5GB. Downloading them on every pod start is too slow. Use a PersistentVolume.
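To put "too slow" into numbers, here is a back-of-the-envelope sketch of the transfer time for the ~6.5GB quoted above. The throughput values are assumptions chosen for illustration, not measurements from EKS or Hugging Face.

```python
def download_seconds(size_gb: float, throughput_mbit_s: float) -> float:
    """Seconds to transfer size_gb gigabytes at throughput_mbit_s megabits per second."""
    size_megabits = size_gb * 1000 * 8  # decimal GB -> megabits
    return size_megabits / throughput_mbit_s


MODEL_GB = 6.5  # figure quoted above

for mbit_s in (100, 500, 1000):  # assumed throughputs, not measurements
    print(f"{mbit_s} Mbit/s: {download_seconds(MODEL_GB, mbit_s):.0f} s")
```

Even at the fastest assumed rate, that delay would be added to every pod start on top of the GPU node's own boot time, which is what the persistent volume avoids.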
# pvc-models.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: sd-models-pvc
  namespace: ml
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3 # A StorageClass named gp3 must already exist in the cluster
  resources:
    requests:
      storage: 50Gi # Models + outputs

kubectl create namespace ml
kubectl apply -f pvc-models.yaml

Step 4: Download Models (Init Job)
Run a one-time Job to download models into the PVC:
# model-download-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sd-model-download
  namespace: ml
spec:
  template:
    spec:
      containers:
        - name: downloader
          image: python:3.11-slim
          command:
            - /bin/bash
            - -c
            - |
              pip install huggingface_hub
              python3 -c "
              from huggingface_hub import snapshot_download
              snapshot_download(
                  repo_id='stabilityai/stable-diffusion-xl-base-1.0',
                  local_dir='/models/sdxl-base',
                  ignore_patterns=['*.fp32.safetensors']
              )
              "
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
          volumeMounts:
            - name: models
              mountPath: /models
          resources:
            requests:
              memory: "2Gi"
              cpu: "500m"
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: sd-models-pvc
      restartPolicy: Never

# Create HuggingFace token secret (get token from huggingface.co/settings/tokens)
kubectl create secret generic hf-secret \
  --namespace ml \
  --from-literal=token=hf_yourtoken
kubectl apply -f model-download-job.yaml
# Watch download progress
kubectl logs -n ml job/sd-model-download -f

Step 5: Deploy Stable Diffusion WebUI
We'll use AUTOMATIC1111 (a1111) which exposes a REST API.
# sd-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stable-diffusion
  namespace: ml
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stable-diffusion
  template:
    metadata:
      labels:
        app: stable-diffusion
    spec:
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      nodeSelector:
        nvidia.com/gpu: "true"
      containers:
        - name: stable-diffusion
          image: universonic/stable-diffusion-webui:latest
          args:
            - --api
            - --nowebui
            - --listen
            - --port
            - "7860"
            - --ckpt-dir
            - /models
            - --skip-torch-cuda-test
          ports:
            - containerPort: 7860
              name: api
          env:
            - name: COMMANDLINE_ARGS
              value: "--api --nowebui --listen"
          resources:
            limits:
              nvidia.com/gpu: "1" # Request 1 GPU
              memory: "12Gi"
              cpu: "4000m"
            requests:
              nvidia.com/gpu: "1"
              memory: "8Gi"
              cpu: "2000m"
          volumeMounts:
            - name: models
              mountPath: /models
            - name: outputs
              mountPath: /outputs
          readinessProbe:
            httpGet:
              path: /sdapi/v1/progress
              port: 7860
            initialDelaySeconds: 60
            periodSeconds: 15
            timeoutSeconds: 10
            failureThreshold: 10
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: sd-models-pvc
        - name: outputs
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: stable-diffusion
  namespace: ml
spec:
  selector:
    app: stable-diffusion
  ports:
    - port: 80
      targetPort: 7860
      name: api

kubectl apply -f sd-deployment.yaml
# Watch pod startup (GPU node will scale up first; takes 3-5 min)
kubectl get pods -n ml -w
kubectl logs -n ml deploy/stable-diffusion -f

Step 6: Expose via Ingress
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sd-ingress
  namespace: ml
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: sd-basic-auth
    nginx.ingress.kubernetes.io/auth-realm: "SD API"
spec:
  ingressClassName: nginx
  rules:
    - host: sd.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: stable-diffusion
                port:
                  number: 80

# Create basic auth for the API
htpasswd -c auth admin
kubectl create secret generic sd-basic-auth \
  --from-file=auth \
  --namespace ml
kubectl apply -f ingress.yaml

Step 7: Generate Images via API
Once running, use the AUTOMATIC1111 REST API:
import base64
import io

import requests
from PIL import Image

SD_API = "https://sd.yourdomain.com"
AUTH = ("admin", "yourpassword")


def generate_image(prompt: str, negative_prompt: str = "", steps: int = 20) -> Image.Image:
    payload = {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "steps": steps,
        "width": 1024,
        "height": 1024,
        "cfg_scale": 7,
        "sampler_name": "DPM++ 2M Karras",
    }
    response = requests.post(
        f"{SD_API}/sdapi/v1/txt2img",
        json=payload,
        auth=AUTH,
        timeout=120,
    )
    response.raise_for_status()
    image_data = response.json()["images"][0]
    return Image.open(io.BytesIO(base64.b64decode(image_data)))


# Generate an image
img = generate_image(
    prompt="A Kubernetes cluster visualized as a city, cyberpunk style, high detail",
    negative_prompt="blurry, low quality",
    steps=25,
)
img.save("output.png")
print("Image saved to output.png")

Step 8: Auto-Scale GPU Nodes to Zero
The biggest cost optimization: scale GPU nodes to zero when idle.
With Karpenter:
# karpenter-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
  template:
    metadata:
      labels:
        nvidia.com/gpu: "true" # Matches the Deployment's nodeSelector
    spec:
      nodeClassRef: # Required in karpenter.sh/v1; "default" is an assumed EC2NodeClass name
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g4dn.xlarge", "g4dn.2xlarge"]
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
  limits:
    nvidia.com/gpu: "10"
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m # Scale down idle GPU nodes after 5 minutes

With this config, GPU nodes scale to zero once the SD pod is deleted (for example, by scaling the Deployment to zero replicas) and spin up in ~4 minutes when a pod needs a GPU.
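Scaling to zero means the first request after an idle period has to wait for a cold start. One way a client could cope is to poll a readiness check before sending work. The sketch below shows only the polling loop. In a real client, is_ready might issue a GET against the /sdapi/v1/progress path used by the readiness probe, but that wiring is an assumption and is left out so the example runs without a cluster. The function name and defaults are hypothetical.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_ready() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        # Stop if another full interval would overshoot the deadline
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)


# Demo with a fake check that succeeds on the 4th poll and a no-op sleep
attempts = {"count": 0}


def fake_check() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 4


print(wait_until_ready(fake_check, sleep=lambda _s: None))  # True
print(attempts["count"])  # 4
```

The default timeout of 10 minutes is deliberately longer than the ~4 minute node start, to leave room for the image pull and model load.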
Cost Estimate (AWS ap-south-1)
| Config | Cost |
|---|---|
| g4dn.xlarge (1 T4 GPU) | ~$0.526/hour |
| Running 8 hours/day | ~$4.20/day, ~$126/month |
| Scale to zero nights/weekends | ~$30–40/month |
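The arithmetic behind the table can be reproduced in a few lines. This sketch uses only the hourly rate quoted above. The 30-day month is an assumption, and the last line works backwards from the $30–40 row to the GPU hours it implies.

```python
HOURLY_USD = 0.526  # on-demand g4dn.xlarge rate from the table above


def monthly_cost(hours_per_day: float, days_per_month: int = 30) -> float:
    """Monthly cost in USD for a node running hours_per_day on days_per_month days."""
    return HOURLY_USD * hours_per_day * days_per_month


def hours_for_budget(budget_usd: float) -> float:
    """GPU hours per month that a given budget buys at the on-demand rate."""
    return budget_usd / HOURLY_USD


print(f"8 h/day, 30 days:  ${monthly_cost(8):.2f}")   # $126.24
print(f"24 h/day, 30 days: ${monthly_cost(24):.2f}")  # $378.72
print(f"$30-40/month buys {hours_for_budget(30):.0f}-{hours_for_budget(40):.0f} GPU hours")  # 57-76
```

Read this way, the $30–40 figure corresponds to roughly 2 to 2.5 GPU hours per day, so it only holds when the node really is scaled to zero for most of the day.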
For workloads that run 24/7 but can tolerate interruption, Spot instances cut the cost further:
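# Same labels as the on-demand group so the Deployment's nodeSelector matches
eksctl create nodegroup \
  --cluster my-cluster \
  --name gpu-spot \
  --node-type g4dn.xlarge \
  --spot \
  --nodes-min 0 \
  --nodes-max 5 \
  --node-labels "nvidia.com/gpu=true,workload=ml"

Spot g4dn.xlarge: ~$0.16/hour (about 70% cheaper). Risk: interruption.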
Common Issues
Pod stuck in Pending:
kubectl describe pod -n ml <pod-name>
# Look for: "0/3 nodes available: 3 Insufficient nvidia.com/gpu"
# → GPU nodes not yet scaled up, or NVIDIA plugin not installed

CUDA out of memory:
- Reduce batch size or image resolution
- Use the --medvram flag for A1111 on 8GB VRAM GPUs
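For the resolution option, a small helper can pick a reduced size. The sketch below scales both sides and snaps them down to a multiple of 8, which matches the 8x downsampling of Stable Diffusion's latent space. It rests on the rough assumption that activation memory tracks pixel count. The function names are hypothetical.

```python
def scaled_resolution(width: int, height: int, factor: float, multiple: int = 8) -> tuple[int, int]:
    """Scale both sides by factor, rounding down to a multiple of `multiple`."""
    def snap(side: int) -> int:
        return max(multiple, int(side * factor) // multiple * multiple)
    return snap(width), snap(height)


def pixel_ratio(old: tuple[int, int], new: tuple[int, int]) -> float:
    """Fraction of the original pixel count that remains after resizing."""
    return (new[0] * new[1]) / (old[0] * old[1])


original = (1024, 1024)
smaller = scaled_resolution(*original, factor=0.75)
print(smaller)                         # (768, 768)
print(pixel_ratio(original, smaller))  # 0.5625
```

Dropping from 1024x1024 to 768x768 leaves about 56% of the pixels, so a modest cut in side length is often enough to get past an out-of-memory error.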
Model not loading:
kubectl exec -n ml deploy/stable-diffusion -- ls -lh /models/sdxl-base/
# Verify the download Job from Step 4 populated this directory

Stable Diffusion on Kubernetes gives you a scalable, cost-efficient image generation service that scales to zero when idle and handles burst capacity automatically. For production ML workloads, also check out the setup-kubeflow-ml-pipeline guide for full ML pipeline orchestration.
For learning GPU-accelerated Kubernetes, Kubernetes for ML Engineers on Udemy covers GPU scheduling and MLOps patterns end-to-end.