Deploy Gemma 3 on Kubernetes with GPU — Complete Guide (2026)
Google's Gemma 3 is open-weight and runs well on a single GPU. Here's how to deploy it on Kubernetes using vLLM, expose it as an OpenAI-compatible API, and use it in your DevOps workflows.
Gemma 3 is Google's open-weight model family. The 12B model runs on a single 24GB GPU (A10G or L4) and delivers strong performance for DevOps automation tasks. Here's the full deployment guide.
Why Gemma 3 for DevOps
- Open-weight — deploy in your own cluster, no API costs, data stays in your VPC
- 12B model fits on 1x A10G or 1x L4 GPU — cheap to run ($0.50–1/hour on-demand, less on spot; see the cost table below)
- Strong instruction following — good at structured outputs (JSON, YAML)
- OpenAI-compatible API via vLLM — drop-in replacement, no code changes
Prerequisites
- Kubernetes cluster with GPU nodes (e.g., EKS with g5.xlarge or g4dn.xlarge)
- NVIDIA GPU Operator installed (a quick verification check is shown after this list)
- Hugging Face account (Gemma requires accepting the license)
- kubectl access
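Before creating anything, it's worth confirming the GPU Operator is healthy and that your nodes actually advertise GPUs to the scheduler. A quick sanity check, assuming the operator runs in its default gpu-operator namespace:
# All GPU Operator pods should be Running
kubectl get pods -n gpu-operator
# Each GPU node should report nvidia.com/gpu under Capacity and Allocatable
kubectl describe nodes | grep "nvidia.com/gpu"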
Step 1: Accept Gemma License and Get HF Token
- Go to huggingface.co/google/gemma-3-12b-it
- Accept the license agreement
- Generate a token: huggingface.co → Settings → Access Tokens → New token (read)
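Before wiring the token into the cluster, you can verify it actually has access to the gated repo by calling the Hugging Face Hub API directly (hf_your_token_here is a placeholder for your token):
curl -s -H "Authorization: Bearer hf_your_token_here" \
https://huggingface.co/api/models/google/gemma-3-12b-it
# A JSON description of the model means access is granted;
# a 401/403 "gated" response means the license hasn't been accepted for this account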
Step 2: Create HuggingFace Secret
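Everything below targets a gpu-workloads namespace. Create it first if it doesn't already exist:
kubectl create namespace gpu-workloads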
kubectl create secret generic huggingface-secret \
--from-literal=token=hf_your_token_here \
-n gpu-workloads
Step 3: Deploy Gemma 3 with vLLM
# gemma3-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: gemma3-vllm
namespace: gpu-workloads
spec:
replicas: 1
selector:
matchLabels:
app: gemma3-vllm
template:
metadata:
labels:
app: gemma3-vllm
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "google/gemma-3-12b-it"
- "--dtype"
- "bfloat16"
- "--max-model-len"
- "8192"
- "--gpu-memory-utilization"
- "0.90"
- "--port"
- "8000"
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: huggingface-secret
key: token
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: "1"
memory: "24Gi"
cpu: "4"
requests:
nvidia.com/gpu: "1"
memory: "16Gi"
cpu: "2"
volumeMounts:
- name: model-cache
mountPath: /root/.cache/huggingface
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120
periodSeconds: 30
timeoutSeconds: 10
volumes:
- name: model-cache
emptyDir:
sizeLimit: 30Gi
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
name: gemma3-vllm-svc
namespace: gpu-workloads
spec:
selector:
app: gemma3-vllm
ports:
- port: 8000
targetPort: 8000
type: ClusterIP
kubectl apply -f gemma3-deployment.yaml
# Watch deployment
kubectl get pods -n gpu-workloads -w
# Check logs (model download takes 5–10 minutes on first start)
kubectl logs -n gpu-workloads -l app=gemma3-vllm -f
Step 4: Test the API
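Before calling the API, wait for the pod to pass its readiness probe; the image pull, model download, and weight loading can take 10+ minutes on the first start:
kubectl wait --for=condition=Ready pod \
-l app=gemma3-vllm -n gpu-workloads --timeout=20m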
# Port-forward to test locally
kubectl port-forward -n gpu-workloads svc/gemma3-vllm-svc 8000:8000 &
# Test with curl
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-3-12b-it",
"messages": [
{"role": "user", "content": "Write a Kubernetes readiness probe for an HTTP service on port 8080"}
],
"max_tokens": 500
}'
# List available models
curl http://localhost:8000/v1/models
Step 5: Persistent Model Cache (Avoid Re-downloading)
Re-downloading 24GB on every pod restart is slow. Use a PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-cache-pvc
namespace: gpu-workloads
spec:
accessModes:
- ReadWriteOnce
storageClassName: gp3
resources:
requests:
storage: 40Gi
Replace the emptyDir volume in the deployment with:
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-cache-pvc
Step 6: Use Gemma 3 in DevOps Workflows
Since vLLM exposes an OpenAI-compatible API, you can point any OpenAI SDK or tool at it with only a base URL change:
# Kubernetes YAML generator
from openai import OpenAI
client = OpenAI(
base_url="http://gemma3-vllm-svc.gpu-workloads:8000/v1",
api_key="not-needed"
)
def generate_k8s_manifest(description: str) -> str:
response = client.chat.completions.create(
model="google/gemma-3-12b-it",
messages=[
{
"role": "system",
"content": "You are a Kubernetes expert. Generate valid YAML manifests. Output only YAML, no explanations."
},
{
"role": "user",
"content": description
}
],
max_tokens=1000,
temperature=0.1 # low temp for structured output
)
return response.choices[0].message.content
# Example
yaml_output = generate_k8s_manifest(
"Deployment for a Node.js app on port 3000, 3 replicas, "
"resource limits 500m CPU / 512Mi memory, readiness probe on /health"
)
print(yaml_output)
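Model output isn't guaranteed to be valid YAML, so validate it before applying anything. A minimal sketch using kubectl's client-side dry run (validate_manifest is a hypothetical helper, not part of the deployment above):
import subprocess

def validate_manifest(manifest: str) -> bool:
    """Client-side dry run: rejects invalid YAML/fields without touching the cluster."""
    result = subprocess.run(
        ["kubectl", "apply", "--dry-run=client", "-f", "-"],
        input=manifest.encode(),
        capture_output=True,
    )
    return result.returncode == 0

# Example: fail fast if the generated manifest doesn't parse
if not validate_manifest(yaml_output):
    raise ValueError("model produced an invalid manifest")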
Choosing the Right Gemma 3 Model Size
| Model | GPU Required | VRAM | Use Case |
|---|---|---|---|
| gemma-3-1b-it | CPU or small GPU | 4GB | Very simple tasks, fast responses |
| gemma-3-4b-it | 1x T4 or L4 | 8GB | Light tasks, cost-optimized |
| gemma-3-12b-it | 1x A10G or 1x L4 | 24GB | Good quality, production use |
| gemma-3-27b-it | 2x A10G | 48GB | Best quality, complex reasoning |
For DevOps automation (YAML generation, log analysis, runbook execution), the 12B model is the sweet spot.
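If you do step up to the 27B model, vLLM can shard it across GPUs with tensor parallelism. A sketch of the changes to the Step 3 Deployment, assuming a node that can attach two GPUs to one pod:
# In the vLLM container args:
- "--model"
- "google/gemma-3-27b-it"
- "--tensor-parallel-size"
- "2"
# And in both resources.requests and resources.limits:
nvidia.com/gpu: "2"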
Cost on AWS (us-east-1)
| Instance | GPU | On-Demand | Spot (est.) |
|---|---|---|---|
| g4dn.xlarge | 1x T4 16GB | $0.526/hr | ~$0.20/hr |
| g5.xlarge | 1x A10G 24GB | $1.006/hr | ~$0.35/hr |
| g6.xlarge | 1x L4 24GB | $0.805/hr | ~$0.28/hr |
For 8 hours/day use: g5.xlarge spot = ~$2.80/day. Much cheaper than API costs at scale.
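One way to provision those spot GPU nodes on EKS is a managed spot node group via eksctl. A sketch with placeholder cluster name, sizes, and instance types to adapt:
eksctl create nodegroup \
--cluster my-cluster \
--name gpu-spot \
--spot \
--instance-types g5.xlarge,g6.xlarge \
--nodes 1 --nodes-min 0 --nodes-max 2 \
--node-volume-size 100 # room on the node disk for the vLLM image and model cache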
Auto-scaling the Deployment
Use KEDA to scale the deployment based on the number of pending inference requests, read from vLLM's Prometheus metrics:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: gemma3-scaledobject
namespace: gpu-workloads
spec:
scaleTargetRef:
name: gemma3-vllm
minReplicaCount: 0 # scale to zero when idle
maxReplicaCount: 3
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: vllm_num_requests_waiting
threshold: "5"
query: sum(vllm:num_requests_waiting)
Scale to zero when not in use — significant cost savings for dev/staging environments.
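The Prometheus trigger above only works if Prometheus is actually scraping vLLM, which exposes metrics at /metrics on port 8000. If your Prometheus honors the common scrape annotations, annotating the Deployment's pod template is enough (a sketch; with the Prometheus Operator you would create a ServiceMonitor instead):
# Added to the Deployment's pod template metadata:
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8000"
  prometheus.io/path: "/metrics"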
Gemma 3 on Kubernetes gives you a production-grade LLM API running entirely in your own cluster — no data leaves your VPC, no per-token API costs, and full control over the model behavior.