Deploy Gemma 3 on Kubernetes with GPU — Complete Guide (2026)
Google's Gemma 3 is an open-weight model family, and its 12B instruction-tuned model runs on a single A10G GPU with strong performance on DevOps automation tasks. This guide shows how to deploy it on Kubernetes with vLLM, expose it as an OpenAI-compatible API, and use it in your DevOps workflows.
Why Gemma 3 for DevOps
- Open-weight: deploy in your own cluster, no per-token API costs, data stays in your VPC
- 12B model fits on 1x A10G or 1x L4 GPU: cheap to run (roughly $0.28–0.35/hour on spot, $0.80–1.00/hour on-demand; see the cost table below)
- Strong instruction following: good at structured outputs (JSON, YAML)
- OpenAI-compatible API via vLLM: existing OpenAI clients work once you point their base URL at the service
Prerequisites
- Kubernetes cluster with GPU nodes (for example, EKS with g5.xlarge; the 16GB T4 in g4dn.xlarge only suits the smaller Gemma 3 models)
- NVIDIA GPU Operator installed
- Hugging Face account (Gemma requires accepting license)
- kubectl access to the cluster
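One way to confirm the GPU prerequisite is to look for nodes that advertise the `nvidia.com/gpu` resource, which the GPU Operator's device plugin registers. The sketch below is only an illustration: it assumes you feed it the parsed output of `kubectl get nodes -o json`, and the helper name `gpu_nodes` and the sample node names are hypothetical.

```python
import json


def gpu_nodes(node_list: dict) -> dict[str, int]:
    """Map node name -> allocatable NVIDIA GPUs, skipping nodes that have none."""
    result = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports extended resources as strings, e.g. "1".
        count = int(allocatable.get("nvidia.com/gpu", "0"))
        if count > 0:
            result[node["metadata"]["name"]] = count
    return result


# Trimmed, made-up sample in the shape of `kubectl get nodes -o json`.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "cpu-node-1"},
     "status": {"allocatable": {"cpu": "4", "memory": "16Gi"}}},
    {"metadata": {"name": "gpu-node-1"},
     "status": {"allocatable": {"cpu": "4", "nvidia.com/gpu": "1"}}}
  ]
}
""")

print(gpu_nodes(sample))
```

If the result is empty on a real cluster, the GPU Operator is probably not running on the GPU nodes yet, and pods that request `nvidia.com/gpu` will stay Pending.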
Step 1: Accept Gemma License and Get HF Token
- Go to huggingface.co/google/gemma-3-12b-it
- Accept the license agreement
- Generate a token: huggingface.co → Settings → Access Tokens → New token (read)
Step 2: Create HuggingFace Secret
# Create the namespace first if it does not exist yet
kubectl create namespace gpu-workloads
kubectl create secret generic huggingface-secret \
  --from-literal=token=hf_your_token_here \
  -n gpu-workloads
Step 3: Deploy Gemma 3 with vLLM
# gemma3-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma3-vllm
  namespace: gpu-workloads
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma3-vllm
  template:
    metadata:
      labels:
        app: gemma3-vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "google/gemma-3-12b-it"
            - "--dtype"
            - "bfloat16"
            - "--max-model-len"
            - "8192"
            - "--gpu-memory-utilization"
            - "0.90"
            - "--port"
            - "8000"
          env:
            # Token from the secret created in Step 2
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-secret
                  key: token
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "24Gi"
              cpu: "4"
            requests:
              nvidia.com/gpu: "1"
              # Note: a g5.xlarge has 16 GiB of RAM in total, so this request
              # may need lowering there (or a larger instance size)
              memory: "16Gi"
              cpu: "2"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 30
            timeoutSeconds: 10
      volumes:
        - name: model-cache
          emptyDir:
            sizeLimit: 30Gi
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: gemma3-vllm-svc
  namespace: gpu-workloads
spec:
  selector:
    app: gemma3-vllm
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
kubectl apply -f gemma3-deployment.yaml
# Watch deployment
kubectl get pods -n gpu-workloads -w
# Check logs (model download takes 5–10 minutes on first start)
kubectl logs -n gpu-workloads -l app=gemma3-vllm -f
Step 4: Test the API
# Port-forward to test locally
kubectl port-forward -n gpu-workloads svc/gemma3-vllm-svc 8000:8000 &
# Test with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-12b-it",
    "messages": [
      {"role": "user", "content": "Write a Kubernetes readiness probe for an HTTP service on port 8080"}
    ],
    "max_tokens": 500
  }'
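The same request can be sent from Python with only the standard library, which is handy for a smoke test on a machine without the `openai` package. This is a minimal sketch that assumes the port-forward above is still running on `localhost:8000`; the helper names `build_chat_request`, `extract_reply`, and `chat` are made up for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumes the port-forward is active


def build_chat_request(prompt: str,
                       model: str = "google/gemma-3-12b-it",
                       max_tokens: int = 500) -> dict:
    """Build the same JSON body the curl example sends."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of a chat-completions response."""
    return response["choices"][0]["message"]["content"]


def chat(prompt: str, timeout: float = 120.0) -> str:
    """POST the prompt to the chat-completions endpoint and return the reply."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

Calling `print(chat("Write a Kubernetes readiness probe for an HTTP service on port 8080"))` should print the model's answer once the pod reports ready.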
# List available models
curl http://localhost:8000/v1/models
Step 5: Persistent Model Cache (Avoid Re-downloading)
An emptyDir cache is deleted together with the pod, so every replaced or rescheduled pod re-downloads about 24GB of weights, which is slow. Use a PVC instead:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
  namespace: gpu-workloads
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 40Gi
Replace the emptyDir volume in the deployment with:
volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: model-cache-pvc
Step 6: Use Gemma 3 in DevOps Workflows
Since vLLM exposes an OpenAI-compatible API, you can use it anywhere OpenAI is used:
# Kubernetes YAML generator
from openai import OpenAI

# vLLM does not check the API key by default, but the client requires a value
client = OpenAI(
    base_url="http://gemma3-vllm-svc.gpu-workloads:8000/v1",
    api_key="not-needed"
)


def generate_k8s_manifest(description: str) -> str:
    response = client.chat.completions.create(
        model="google/gemma-3-12b-it",
        messages=[
            {
                "role": "system",
                "content": "You are a Kubernetes expert. Generate valid YAML manifests. Output only YAML, no explanations."
            },
            {
                "role": "user",
                "content": description
            }
        ],
        max_tokens=1000,
        temperature=0.1  # low temp for structured output
    )
    return response.choices[0].message.content


# Example
yaml_output = generate_k8s_manifest(
    "Deployment for a Node.js app on port 3000, 3 replicas, "
    "resource limits 500m CPU / 512Mi memory, readiness probe on /health"
)
print(yaml_output)
Choosing the Right Gemma 3 Model Size
| Model | GPU Required | VRAM | Use Case |
|---|---|---|---|
| gemma-3-1b-it | CPU or small GPU | 4GB | Very simple tasks, fast responses |
| gemma-3-4b-it | 1x T4 or L4 | 8GB | Light tasks, cost-optimized |
| gemma-3-12b-it | 1x A10G or L4 | 24GB | Good quality, production use |
| gemma-3-27b-it | 2x A10G | 48GB | Best quality, complex reasoning |
For DevOps automation (YAML generation, log analysis, runbook execution), the 12B model is the sweet spot.
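As a rough cross-check of the VRAM column, the weights alone take about the parameter count times the bytes per parameter. The sketch below uses the nominal parameter counts from the model names, which is an approximation, and the function name `weight_memory_gb` is hypothetical.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gb(params_billion: float, dtype: str = "bfloat16") -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    # (params_billion * 1e9 params) * bytes / 1e9 bytes-per-GB
    return params_billion * BYTES_PER_PARAM[dtype]


for name, params in [
    ("gemma-3-1b-it", 1),
    ("gemma-3-4b-it", 4),
    ("gemma-3-12b-it", 12),
    ("gemma-3-27b-it", 27),
]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB of weights in bfloat16")
```

The KV cache and activations need headroom on top of the weights, so treat these figures as a floor. By this estimate the 12B model in bfloat16 is tight on a 24GB card, and a shorter `--max-model-len` or quantized weights may be needed in practice.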
Cost on AWS (us-east-1)
| Instance | GPU | On-Demand | Spot (est.) |
|---|---|---|---|
| g4dn.xlarge | 1x T4 16GB | $0.526/hr | ~$0.20/hr |
| g5.xlarge | 1x A10G 24GB | $1.006/hr | ~$0.35/hr |
| g6.xlarge | 1x L4 24GB | $0.805/hr | ~$0.28/hr |
For 8 hours/day use: g5.xlarge spot = ~$2.80/day. Much cheaper than API costs at scale.
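The daily figure is simply the hourly rate times the hours of use. A small sketch for trying other instance types or schedules follows; the spot rates are the estimates from the table above, and the function name `usage_cost` is made up for illustration.

```python
# Estimated spot prices in USD per hour, copied from the table above.
SPOT_PER_HOUR = {
    "g4dn.xlarge": 0.20,
    "g5.xlarge": 0.35,
    "g6.xlarge": 0.28,
}


def usage_cost(hourly_rate: float, hours_per_day: float, days: int = 1) -> float:
    """Cost in USD of running one instance for the given schedule."""
    return round(hourly_rate * hours_per_day * days, 2)


for instance, rate in SPOT_PER_HOUR.items():
    daily = usage_cost(rate, hours_per_day=8)
    monthly = usage_cost(rate, hours_per_day=8, days=30)
    print(f"{instance}: ${daily:.2f}/day, ${monthly:.2f} per 30 days")
```

Spot prices move with demand, so plug in the current rate for your region before relying on the result.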
Auto-scaling the Deployment
Use KEDA to scale on the number of pending inference requests, exposed here as a Prometheus metric (for example, from a message queue in front of vLLM):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gemma3-scaledobject
  namespace: gpu-workloads
spec:
  scaleTargetRef:
    name: gemma3-vllm
  minReplicaCount: 0  # scale to zero when idle
  maxReplicaCount: 3
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: vllm_pending_requests
        threshold: "5"
        # Replace with the pending-request metric your setup exports
        query: vllm:pending_requests:count
Scale to zero when not in use: significant cost savings for dev/staging environments. For the deployment to wake up again, the metric has to come from something that keeps running at zero replicas, such as the queue in front of vLLM.
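To get a feel for what the threshold of 5 means, here is a simplified model of the replica count the autoscaler aims for: roughly the metric value divided by the threshold, rounded up and clamped to the configured bounds. It is a sketch of the idea only. The real behaviour also involves tolerances, stabilization windows, and KEDA's cooldown period, and the function name `desired_replicas` is hypothetical.

```python
import math


def desired_replicas(pending: float,
                     threshold: float = 5,
                     min_replicas: int = 0,
                     max_replicas: int = 3) -> int:
    """Approximate target replica count for a given number of pending requests."""
    if pending <= 0:
        # Nothing waiting: fall back to the minimum (zero in the manifest above).
        return min_replicas
    wanted = math.ceil(pending / threshold)
    # Clamp to the bounds set by minReplicaCount / maxReplicaCount.
    return max(min_replicas, min(max_replicas, wanted))


for pending in (0, 4, 12, 100):
    print(f"{pending} pending -> {desired_replicas(pending)} replica(s)")
```

With the values from the manifest, any non-zero backlog brings up one replica, and the deployment reaches its maximum of three replicas once more than ten requests are waiting.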
Gemma 3 on Kubernetes gives you a production-grade LLM API running entirely in your own cluster — no data leaves your VPC, no per-token API costs, and full control over the model behavior.