
Deploy Gemma 3 on Kubernetes with GPU — Complete Guide (2026)

Google's Gemma 3 is open-weight and runs well on a single GPU. Here's how to deploy it on Kubernetes using vLLM, expose it as an OpenAI-compatible API, and use it in your DevOps workflows.

DevOpsBoys · May 6, 2026 · 4 min read

Gemma 3 is Google's flagship open-weight model family. The 12B model runs on a single A10G GPU and delivers strong performance for DevOps automation tasks. Here's the full deployment guide.


Why Gemma 3 for DevOps

  • Open-weight — deploy in your own cluster, no API costs, data stays in your VPC
  • 12B model fits on 1x A10G or 1x L4 GPU — cheap to run ($0.50–1/hour on spot)
  • Strong instruction following — good at structured outputs (JSON, YAML)
  • OpenAI-compatible API via vLLM — drop-in replacement, no code changes

Prerequisites

  • Kubernetes cluster with GPU nodes (e.g., EKS with g5.xlarge for the 12B model, or g4dn.xlarge for the smaller variants)
  • NVIDIA GPU Operator installed (see the quick check after this list)
  • Hugging Face account (Gemma requires accepting license)
  • kubectl access
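
Before going further, it's worth confirming that your nodes actually advertise GPUs and that the GPU Operator is healthy. A quick check (the gpu-operator namespace is the usual install location; adjust if yours differs):

bash
# Nodes that expose an allocatable NVIDIA GPU
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'

# GPU Operator components should all be Running
kubectl get pods -n gpu-operator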

Step 1: Accept Gemma License and Get HF Token

  1. Go to huggingface.co/google/gemma-3-12b-it
  2. Accept the license agreement
  3. Generate a token: huggingface.co → Settings → Access Tokens → New token (read). You can verify the token has access as shown below.
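
If you want to confirm the token can actually reach the gated repo before wiring it into the cluster, a quick local check with the Hugging Face CLI (a sketch; requires Python and pip on your workstation):

bash
# Log in with the new token, then fetch a single small file from the gated repo
pip install -U "huggingface_hub[cli]"
huggingface-cli login --token hf_your_token_here
huggingface-cli download google/gemma-3-12b-it config.json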

Step 2: Create HuggingFace Secret

bash
# Create the namespace first if it doesn't exist
kubectl create namespace gpu-workloads

kubectl create secret generic huggingface-secret \
  --from-literal=token=hf_your_token_here \
  -n gpu-workloads

Step 3: Deploy Gemma 3 with vLLM

yaml
# gemma3-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma3-vllm
  namespace: gpu-workloads
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma3-vllm
  template:
    metadata:
      labels:
        app: gemma3-vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "google/gemma-3-12b-it"
        - "--dtype"
        - "bfloat16"
        - "--max-model-len"
        - "8192"
        - "--gpu-memory-utilization"
        - "0.90"
        - "--port"
        - "8000"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-secret
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "24Gi"
            cpu: "4"
          requests:
            nvidia.com/gpu: "1"
            memory: "16Gi"
            cpu: "2"
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 30
          timeoutSeconds: 10
      volumes:
      - name: model-cache
        emptyDir:
          sizeLimit: 30Gi
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: gemma3-vllm-svc
  namespace: gpu-workloads
spec:
  selector:
    app: gemma3-vllm
  ports:
  - port: 8000
    targetPort: 8000
  type: ClusterIP

Apply the manifest and watch the pod come up:

bash
kubectl apply -f gemma3-deployment.yaml
 
# Watch deployment
kubectl get pods -n gpu-workloads -w
 
# Check logs (model download takes 5–10 minutes on first start)
kubectl logs -n gpu-workloads -l app=gemma3-vllm -f
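
Once the pod reports Running and Ready, you can confirm the GPU was actually handed to the container (nvidia-smi is available inside GPU containers scheduled via the NVIDIA runtime):

bash
# Show the GPU from inside the vLLM container
kubectl exec -n gpu-workloads deploy/gemma3-vllm -- nvidia-smi

# Readiness status of the pod
kubectl get pods -n gpu-workloads -l app=gemma3-vllm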

Step 4: Test the API

bash
# Port-forward to test locally
kubectl port-forward -n gpu-workloads svc/gemma3-vllm-svc 8000:8000 &
 
# Test with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-12b-it",
    "messages": [
      {"role": "user", "content": "Write a Kubernetes readiness probe for an HTTP service on port 8080"}
    ],
    "max_tokens": 500
  }'
 
# List available models
curl http://localhost:8000/v1/models
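
For scripting, you usually only want the generated text. A small variation of the same request that extracts just the reply (assumes jq is installed locally):

bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-12b-it",
    "messages": [{"role": "user", "content": "Write a liveness probe for an HTTP service on port 8080"}],
    "max_tokens": 300
  }' | jq -r '.choices[0].message.content'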

Step 5: Persistent Model Cache (Avoid Re-downloading)

Re-downloading 24GB on every pod restart is slow. Use a PVC:

yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
  namespace: gpu-workloads
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 40Gi

Replace the emptyDir volume in the deployment with:

yaml
volumes:
- name: model-cache
  persistentVolumeClaim:
    claimName: model-cache-pvc
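
After redeploying, you can verify the claim is bound and that the cache is actually populated, so restarts skip the download (a sketch; du is available in the vLLM image):

bash
kubectl get pvc model-cache-pvc -n gpu-workloads

# Size of the cached model files inside the running pod
kubectl exec -n gpu-workloads deploy/gemma3-vllm -- du -sh /root/.cache/huggingface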

Step 6: Use Gemma 3 in DevOps Workflows

Since vLLM exposes an OpenAI-compatible API, you can use it anywhere OpenAI is used:

python
# Kubernetes YAML generator
from openai import OpenAI
 
client = OpenAI(
    base_url="http://gemma3-vllm-svc.gpu-workloads:8000/v1",
    api_key="not-needed"
)
 
def generate_k8s_manifest(description: str) -> str:
    response = client.chat.completions.create(
        model="google/gemma-3-12b-it",
        messages=[
            {
                "role": "system",
                "content": "You are a Kubernetes expert. Generate valid YAML manifests. Output only YAML, no explanations."
            },
            {
                "role": "user",
                "content": description
            }
        ],
        max_tokens=1000,
        temperature=0.1  # low temp for structured output
    )
    return response.choices[0].message.content
 
# Example
yaml_output = generate_k8s_manifest(
    "Deployment for a Node.js app on port 3000, 3 replicas, "
    "resource limits 500m CPU / 512Mi memory, readiness probe on /health"
)
print(yaml_output)

Choosing the Right Gemma 3 Model Size

Model | GPU Required | VRAM | Use Case
gemma-3-1b-it | CPU or small GPU | 4GB | Very simple tasks, fast responses
gemma-3-4b-it | 1x T4 or L4 | 8GB | Light tasks, cost-optimized
gemma-3-12b-it | 1x A10G or L4 | 24GB | Good quality, production use
gemma-3-27b-it | 2x A10G | 48GB | Best quality, complex reasoning

For DevOps automation (YAML generation, log analysis, runbook execution), the 12B model is the sweet spot.
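
Switching sizes is just a change to the --model argument (plus GPU resources for the larger models). For example, a hypothetical patch to drop a dev environment down to the 4B model:

bash
# args[1] is the value that follows "--model" in the deployment above
kubectl -n gpu-workloads patch deployment gemma3-vllm --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/args/1", "value": "google/gemma-3-4b-it"}]'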


Cost on AWS (us-east-1)

Instance | GPU | On-Demand | Spot (est.)
g4dn.xlarge | 1x T4 16GB | $0.526/hr | ~$0.20/hr
g5.xlarge | 1x A10G 24GB | $1.006/hr | ~$0.35/hr
g6.xlarge | 1x L4 24GB | $0.805/hr | ~$0.28/hr

For 8 hours/day use: g5.xlarge spot = ~$2.80/day. Much cheaper than API costs at scale.
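
If you run on EKS with eksctl-managed nodegroups, requesting spot GPU capacity is a single flag. A minimal sketch (the cluster name and node counts are placeholders):

bash
eksctl create nodegroup \
  --cluster my-cluster \
  --name gemma3-gpu-spot \
  --node-type g5.xlarge \
  --nodes 1 --nodes-min 0 --nodes-max 2 \
  --spot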


Auto-scaling the Deployment

Use KEDA to scale based on pending request load. The example below triggers on a Prometheus metric; if you route inference jobs through a message queue instead, KEDA can scale on queue depth directly:

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gemma3-scaledobject
  namespace: gpu-workloads
spec:
  scaleTargetRef:
    name: gemma3-vllm
  minReplicaCount: 0  # scale to zero when idle
  maxReplicaCount: 3
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: vllm_pending_requests
      threshold: "5"
      query: vllm:pending_requests:count

Scale to zero when not in use — significant cost savings for dev/staging environments. Two caveats: the Prometheus query above assumes you already scrape vLLM's metrics and have a matching recording rule, so adjust it to the metric names your vLLM version exposes; and because the PVC from Step 5 is ReadWriteOnce, running more than one replica means either keeping the emptyDir cache or moving to ReadWriteMany storage such as EFS.
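
To see which queue-related metrics your vLLM version actually exports (names differ across releases), inspect the /metrics endpoint directly, reusing the port-forward from Step 4 if it is still running:

bash
kubectl port-forward -n gpu-workloads svc/gemma3-vllm-svc 8000:8000 &
curl -s http://localhost:8000/metrics | grep -i "requests"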


Gemma 3 on Kubernetes gives you a production-grade LLM API running entirely in your own cluster — no data leaves your VPC, no per-token API costs, and full control over the model behavior.
