Prometheus Pushgateway vs Pull Model: When to Use Each
Prometheus pulls metrics by default. The Pushgateway lets short-lived jobs push metrics. Here's exactly when each model fits and when you should NOT use Pushgateway.
Prometheus is a pull-based system by design: it scrapes metrics from each target on a fixed interval, commonly every 15-30 seconds. Short-lived jobs (batch jobs, cron jobs, one-off scripts) can start and exit between two scrapes, so Prometheus never gets a chance to see them. That's where Pushgateway comes in.
The Pull Model (Default Prometheus)
Prometheus scrapes every target on a schedule:
Prometheus → scrapes → /metrics endpoint on each target
This works great for:
- Long-running services (web apps, APIs, databases)
- Kubernetes pods with a stable /metrics endpoint
- Any process that lives longer than your scrape interval
The key insight: Prometheus pulls from the target. The target must be alive when Prometheus comes to scrape.
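As a rough illustration of what a pull target looks like, here is a minimal sketch of a long-running process exposing /metrics with only the Python standard library. The metric name `app_requests_total`, the `REQUESTS_TOTAL` dict, and port 8000 are assumptions made up for this example; a real service would more likely use `prometheus_client`, which provides `start_http_server` for the same purpose.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process state; a real service would update this as it works.
REQUESTS_TOTAL = {"GET": 0, "POST": 0}


def render_metrics(requests_total):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP app_requests_total Requests handled, by HTTP method.",
        "# TYPE app_requests_total counter",
    ]
    for method, count in sorted(requests_total.items()):
        lines.append(f'app_requests_total{{method="{method}"}} {count}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus scrapes this path on every interval.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS_TOTAL).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on /metrics."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` keeps the process alive and scrapeable, which is exactly what a short-lived job cannot offer.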
The Pushgateway
Pushgateway is an intermediary that stores metrics pushed to it:
Short-lived job → pushes → Pushgateway → Prometheus scrapes → Grafana
A batch job pushes its metrics before it exits, then Pushgateway holds them until Prometheus comes to scrape.
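Under the hood, a push is an HTTP request whose URL path names the job and any extra grouping labels, and whose body is plain text-format samples. The sketch below builds both pieces without sending anything. The helper names `push_url` and `render_body` are hypothetical, and the code assumes label values that contain no `/` and are not empty, since Pushgateway treats those cases specially.

```python
from urllib.parse import quote


def push_url(base_url, job, grouping_key=None):
    """Build the Pushgateway URL for a job and optional grouping labels."""
    parts = [base_url.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for name, value in (grouping_key or {}).items():
        # Each grouping label becomes a /<name>/<value> pair in the path.
        parts += [name, quote(str(value), safe="")]
    return "/".join(parts)


def render_body(samples):
    """Render {metric_name: value} as untyped text-format samples."""
    return "".join(f"{name} {value}\n" for name, value in samples.items())


url = push_url("http://pushgateway:9091", "backup_job", {"instance": "server1"})
body = render_body({"backup_duration_seconds": 342.5, "backup_success": 1})
```

Everything pushed to the same URL belongs to one group, so a later push to that URL updates the group's metrics instead of adding a second copy.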
# Push metrics from a shell script
cat <<EOF | curl --data-binary @- http://pushgateway:9091/metrics/job/backup_job/instance/server1
# TYPE backup_duration_seconds gauge
backup_duration_seconds 342.5
# TYPE backup_files_processed counter
backup_files_processed 15234
# TYPE backup_success gauge
backup_success 1
EOF

# Push from Python
import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# Use a dedicated registry so only this job's metrics are pushed
registry = CollectorRegistry()
backup_duration = Gauge("backup_duration_seconds", "Duration of backup job", registry=registry)
backup_success = Gauge("backup_success", "1 if backup succeeded", registry=registry)

# Run your job and time it
start = time.time()
run_backup()  # your job
duration = time.time() - start
backup_duration.set(duration)
backup_success.set(1)

# Push metrics; this replaces all metrics in the group {job="backup_job"}
push_to_gateway("pushgateway:9091", job="backup_job", registry=registry)

Deploying Pushgateway on Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-pushgateway
  namespace: monitoring
spec:
  replicas: 1
  selector: # required for apps/v1; must match the pod template labels
    matchLabels:
      app: prometheus-pushgateway
  template:
    metadata:
      labels:
        app: prometheus-pushgateway
    spec:
      containers:
        - name: pushgateway
          image: prom/pushgateway:v1.8.0
          ports:
            - containerPort: 9091
          args:
            - "--persistence.file=/data/metrics" # persist across restarts
          volumeMounts:
            - name: storage
              mountPath: /data
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: pushgateway-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-pushgateway
  namespace: monitoring
  labels:
    app: pushgateway
spec:
  ports:
    - port: 9091
      targetPort: 9091
  selector:
    app: prometheus-pushgateway # matches the pod labels in the Deployment

Scrape it from Prometheus:
# prometheus.yml
scrape_configs:
  - job_name: pushgateway
    honor_labels: true # important! use the labels from the pushed metrics
    static_configs:
      - targets: ["prometheus-pushgateway.monitoring:9091"]

When to Use Pushgateway ✅
Kubernetes CronJobs:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: data-export
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: exporter
              image: data-exporter:latest
              env:
                - name: PUSHGATEWAY_URL
                  value: "http://prometheus-pushgateway.monitoring:9091"
          restartPolicy: OnFailure

CI/CD pipeline metrics:
# Track deployment duration and success in GitHub Actions
START_TIME=$(date +%s)
deploy_to_kubernetes
END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))
cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/deployment/environment/production/app/myapp"
deployment_duration_seconds $DURATION
deployment_success 1
deployment_timestamp $(date +%s)
EOF

Batch data processing jobs:
# ETL job that runs every hour
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
rows_processed = Gauge("etl_rows_processed", "Rows processed in ETL run", registry=registry)
etl_duration = Gauge("etl_duration_seconds", "Duration of ETL run", registry=registry)

# Process data... (processed_count, elapsed, and run_id come from your job)
rows_processed.set(processed_count)
etl_duration.set(elapsed)

# Each distinct run_id creates its own group, which stays in Pushgateway until deleted
push_to_gateway("pushgateway:9091", job="hourly_etl", grouping_key={"run_id": run_id}, registry=registry)

When NOT to Use Pushgateway ❌
Do NOT use it for long-running services. If your service is always running and can expose a /metrics endpoint, use Prometheus' native pull model. Pushgateway is not a replacement for a real /metrics endpoint.
Do NOT use it to bypass firewall rules. Some engineers use Pushgateway because "Prometheus can't reach our service." This masks a networking/service discovery problem that should be fixed properly.
Do NOT use it for high-cardinality metrics. Pushgateway stores all pushed metrics in memory. If you're pushing metrics with many unique label combinations (one per user, one per request), you'll OOM Pushgateway quickly.
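Do NOT run multiple instances of Pushgateway. It has no replication, so instances behind a load balancer would each hold a different subset of the pushes and Prometheus would see inconsistent data. Run a single instance, and set --persistence.file so a restart doesn't lose the pushed metrics.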
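To make the high-cardinality warning concrete, the arithmetic is just a product: one metric can produce as many series as there are combinations of its label values. The sketch below is a back-of-the-envelope estimate, not a measurement. The `environment` and `app` labels echo the deployment example, while `staging`, `billing`, and `user_id` are hypothetical values picked for illustration.

```python
from math import prod


def series_count(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Bounded labels: a handful of environments and apps.
bounded = {"environment": ["staging", "production"], "app": ["myapp", "billing"]}

# Unbounded label: one value per user (hypothetical 100,000 users).
unbounded = {"environment": ["staging", "production"], "user_id": range(100_000)}

bounded_series = series_count(bounded)
unbounded_series = series_count(unbounded)
```

Here `bounded_series` is 4 and `unbounded_series` is 200,000 for a single metric name. Pushgateway holds every one of those series in memory until it is deleted.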
The Staleness Problem
Pushgateway metrics persist until you explicitly delete them. If a cron job runs at 2am and fails, the last successful metrics from 24 hours ago will still be there — making everything look fine.
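Explicit deletion is a plain HTTP DELETE against the same URL the group was pushed to. Here is a minimal sketch using only the standard library. The helper names are hypothetical, the gateway address is the one used in the examples above, and label values are assumed not to contain `/`. `prometheus_client` also offers `delete_from_gateway` if you already depend on it.

```python
from urllib.request import Request, urlopen


def build_delete_request(base_url, job, grouping_key=None):
    """Build (but do not send) a DELETE request for one Pushgateway group."""
    parts = [base_url.rstrip("/"), "metrics", "job", job]
    for name, value in (grouping_key or {}).items():
        parts += [name, str(value)]
    return Request("/".join(parts), method="DELETE")


def delete_group(base_url, job, grouping_key=None, timeout=10):
    """Send the DELETE and return the HTTP status code."""
    request = build_delete_request(base_url, job, grouping_key)
    with urlopen(request, timeout=timeout) as response:
        return response.status


request = build_delete_request(
    "http://pushgateway:9091", "backup_job", {"instance": "server1"}
)
```

Deleting is mostly useful when a job is retired or when a job pushes under per-run grouping keys. It does not by itself tell you that a run failed, which is what the next step addresses.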
Fix: Always push a "job success" gauge and alert on it:
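from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()

# Push 1 if succeeded, 0 if failed
job_success = Gauge("backup_success", "1 if last backup succeeded", registry=registry)
try:
    run_backup()
    job_success.set(1)
except Exception:
    job_success.set(0)
    raise
finally:
    # Runs on both paths, so a failure is pushed before the exception propagates
    push_to_gateway("pushgateway:9091", job="backup", registry=registry)

Alert rule: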
- alert: BackupJobFailed
  expr: backup_success{job="backup"} == 0
  for: 5m
  annotations:
    summary: "Last backup job run failed"

Also alert if the metric hasn't been updated:
- alert: BackupJobNotRunning
  expr: time() - push_time_seconds{job="backup"} > 90000 # 25 hours
  annotations:
    summary: "Backup job hasn't pushed metrics in 25 hours"

Summary
| | Pull Model | Pushgateway |
|---|---|---|
| Use for | Long-running services | Short-lived batch jobs |
| Metric freshness | Every scrape interval | Last pushed value (stale risk) |
| Metric deletion | Automatic: series go stale once the target (e.g. a pod) is gone | Explicit delete only (manual or at job end) |
| Scaling | Prometheus scales | Single instance, no HA |
| Setup complexity | Low | Medium |
The golden rule: use pull by default, push only when the job is shorter than your scrape interval.
Resources: Prometheus Pushgateway | When to use Pushgateway (official guidance)