
Prometheus Pushgateway vs Pull Model: When to Use Each

Prometheus pulls metrics by default. The Pushgateway lets short-lived jobs push metrics. Here's exactly when each model fits and when you should NOT use Pushgateway.

DevOpsBoys · 4 min read

Prometheus is a pull-based system by design: it scrapes metrics from each target on a fixed interval, commonly every 15-30 seconds. Short-lived jobs (batch jobs, cron jobs, one-off scripts) can exit before Prometheus ever scrapes them. That's where the Pushgateway comes in.

The Pull Model (Default Prometheus)

Prometheus scrapes every target on a schedule:

Prometheus → scrapes → /metrics endpoint on each target

This works great for:

  • Long-running services (web apps, APIs, databases)
  • Kubernetes pods with a stable /metrics endpoint
  • Any process that lives longer than your scrape interval

The key insight: Prometheus pulls from the target. The target must be alive when Prometheus comes to scrape.
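For contrast, this is roughly what the pull side looks like from the target's point of view. The target is a process that keeps running and answers `GET /metrics` whenever Prometheus asks. This is a minimal standard-library sketch. The metric name `app_requests_total`, the `METRICS` dict and port 8000 are illustrative assumptions. A real service would normally use an official client library instead of hand-rendering the text format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process state; a real service would update this as it works.
METRICS = {"app_requests_total": ("counter", 0)}


def render_metrics(metrics):
    """Render {name: (type, value)} in the Prometheus text exposition format."""
    lines = []
    for name, (mtype, value) in sorted(metrics.items()):
        lines.append(f"# TYPE {name} {mtype}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"  # the format requires a trailing newline


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode()
        self.send_response(200)
        # Content type of the text exposition format, version 0.0.4
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on http://<host>:<port>/metrics."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` blocks and answers scrapes until the process exits. That is exactly why this model breaks down for jobs that finish in a few seconds.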

The Pushgateway

Pushgateway is an intermediary that stores metrics pushed to it:

Short-lived job → pushes → Pushgateway → Prometheus scrapes → Grafana

A batch job pushes its metrics before it exits, then Pushgateway holds them until Prometheus comes to scrape.

bash
# Push metrics from a shell script
cat <<EOF | curl --data-binary @- http://pushgateway:9091/metrics/job/backup_job/instance/server1
# TYPE backup_duration_seconds gauge
backup_duration_seconds 342.5
# TYPE backup_files_processed counter
backup_files_processed 15234
# TYPE backup_success gauge
backup_success 1
EOF

python
# Push from Python
import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway


def run_backup():
    """Placeholder for your actual backup logic."""


registry = CollectorRegistry()

backup_duration = Gauge("backup_duration_seconds", "Duration of backup job", registry=registry)
backup_success = Gauge("backup_success", "1 if backup succeeded", registry=registry)

# Run your job and time it (monotonic clock is safe for measuring durations)
start = time.monotonic()
run_backup()
duration = time.monotonic() - start

backup_duration.set(duration)
backup_success.set(1)

# Push metrics (push_to_gateway uses PUT, replacing this job's previous group)
push_to_gateway("pushgateway:9091", job="backup_job", registry=registry)
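Both examples push to a URL of the form `/metrics/job/<job>{/<label>/<value>}`. Every label pair in that path becomes part of the grouping key that identifies one metric group on the Pushgateway.

The sketch below builds that path and sends a body using only the standard library. It assumes Pushgateway's documented `@base64` convention for values that contain `/`. `grouping_path` and `push_text` are hypothetical helper names, not part of any client library. Other special characters are not escaped here.

On the HTTP verb: Pushgateway treats `PUT` as "replace the whole group" and `POST` as "replace only metrics with the same names". That is why the `curl --data-binary` push (a `POST`) and `push_to_gateway` (a `PUT`) can behave differently.

```python
import base64
import urllib.request


def _segment(name, value):
    """Encode one name/value pair of the grouping path."""
    if "/" in value:
        # Values containing "/" are base64url-encoded and flagged with @base64
        encoded = base64.urlsafe_b64encode(value.encode()).decode()
        return f"{name}@base64/{encoded}"
    return f"{name}/{value}"  # assumes the value is otherwise URL-safe


def grouping_path(job, grouping_key=None):
    """Build /metrics/job/<job>{/<label>/<value>} for one metric group."""
    parts = [_segment("job", job)]
    for name, value in (grouping_key or {}).items():
        parts.append(_segment(name, value))
    return "/metrics/" + "/".join(parts)


def push_text(gateway, job, body, grouping_key=None, method="PUT"):
    """Send exposition-format text; PUT replaces the group, POST merges by name."""
    req = urllib.request.Request(
        gateway.rstrip("/") + grouping_path(job, grouping_key),
        data=body.encode(),
        method=method,
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

For example, `push_text("http://pushgateway:9091", "backup_job", "backup_success 1\n", {"instance": "server1"})` targets the same group as the `curl` command, but with `PUT` semantics.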

Deploying Pushgateway on Kubernetes

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-pushgateway
  namespace: monitoring
spec:
  replicas: 1  # Pushgateway does not cluster; run exactly one
  selector:
    matchLabels:
      app: prometheus-pushgateway
  template:
    metadata:
      labels:
        app: prometheus-pushgateway  # must match the selector and the Service
    spec:
      containers:
        - name: pushgateway
          image: prom/pushgateway:v1.8.0
          ports:
            - containerPort: 9091
          args:
            - "--persistence.file=/data/metrics"  # persist across restarts
          volumeMounts:
            - name: storage
              mountPath: /data
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: pushgateway-pvc  # assumes this PVC already exists
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-pushgateway
  namespace: monitoring
  labels:
    app: prometheus-pushgateway
spec:
  ports:
    - port: 9091
      targetPort: 9091
  selector:
    app: prometheus-pushgateway

Scrape it from Prometheus:

yaml
# prometheus.yml
scrape_configs:
  - job_name: pushgateway
    honor_labels: true  # important! keep the job/instance labels from the pushed metrics
    static_configs:
      - targets: ["prometheus-pushgateway.monitoring:9091"]
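Here is why `honor_labels` matters. The Pushgateway itself is the scrape target. Without this setting, Prometheus keeps its own target labels: `job="pushgateway"` from the scrape config, and the Pushgateway's address as `instance`. The conflicting pushed labels get renamed to `exported_job` and `exported_instance`.

The function below is a simplified model of that conflict rule. It ignores relabeling and repeated `exported_` prefixes. `merge_labels` is an illustrative name, not Prometheus code.

```python
def merge_labels(scraped, target, honor_labels):
    """Simplified model of how Prometheus resolves label conflicts at scrape time.

    scraped: labels on the sample as exposed by the target (here: Pushgateway)
    target:  labels Prometheus attaches to the target (job, instance, ...)
    """
    result = dict(scraped)
    for name, value in target.items():
        if name in scraped:
            if honor_labels:
                continue  # keep the pushed label, ignore the target's
            # Keep both: the scraped value moves to exported_<name>
            result["exported_" + name] = scraped[name]
        result[name] = value
    return result


pushed = {"job": "backup_job", "instance": "server1"}
target = {"job": "pushgateway", "instance": "prometheus-pushgateway.monitoring:9091"}
```

With `honor_labels: true`, alerts can match `job="backup_job"` directly. Without it, the same series carries `job="pushgateway"` and `exported_job="backup_job"`.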

When to Use Pushgateway ✅

Kubernetes CronJobs:

yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: data-export
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: exporter
              image: data-exporter:latest
              env:
                - name: PUSHGATEWAY_URL
                  value: "http://prometheus-pushgateway.monitoring:9091"
          restartPolicy: OnFailure
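The CronJob only injects `PUSHGATEWAY_URL`; the container still has to push before it exits. Below is a minimal entrypoint sketch under these assumptions:

- the image runs Python;
- the real work lives in a hypothetical `export_data()`;
- the job name is `data_export`;
- the metric names are illustrative.

```python
import os
import time
import urllib.request


def export_data():
    """Hypothetical placeholder for the real export work."""


def build_push_request(base_url, job, success, duration):
    """Build a PUT that replaces this job's metric group on the Pushgateway."""
    body = (
        "# TYPE data_export_success gauge\n"
        f"data_export_success {1 if success else 0}\n"
        "# TYPE data_export_duration_seconds gauge\n"
        f"data_export_duration_seconds {duration:.3f}\n"
    )
    url = f"{base_url.rstrip('/')}/metrics/job/{job}"
    return urllib.request.Request(url, data=body.encode(), method="PUT")


def main():
    base_url = os.environ["PUSHGATEWAY_URL"]  # set by the CronJob; KeyError if missing
    start = time.monotonic()
    success = False
    try:
        export_data()
        success = True
    finally:
        # Push on success and on failure; an exception still propagates afterwards,
        # so the pod exits non-zero and restartPolicy: OnFailure can retry it.
        req = build_push_request(base_url, "data_export", success, time.monotonic() - start)
        urllib.request.urlopen(req, timeout=10).close()
```

The container's command would call `main()`. Pushing from a `finally` block means a failed run overwrites the previous success instead of leaving it in place.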

CI/CD pipeline metrics:

bash
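# Track deployment duration and success in a CI job (e.g. GitHub Actions)
START_TIME=$(date +%s)
if deploy_to_kubernetes; then  # your deploy step (placeholder)
  SUCCESS=1
else
  SUCCESS=0
fi
END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))
 
cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/deployment/environment/production/app/myapp"
# TYPE deployment_duration_seconds gauge
deployment_duration_seconds $DURATION
# TYPE deployment_success gauge
deployment_success $SUCCESS
# TYPE deployment_timestamp gauge
deployment_timestamp $END_TIME
EOF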

Batch data processing jobs:

python
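# ETL job that runs every hour
import time
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def run_etl():
    """Placeholder for your ETL logic; returns the number of rows processed."""
    return 0

registry = CollectorRegistry()
rows_processed = Gauge("etl_rows_processed", "Rows processed in ETL run", registry=registry)
etl_duration = Gauge("etl_duration_seconds", "Duration of ETL run", registry=registry)

start = time.monotonic()
rows_processed.set(run_etl())
etl_duration.set(time.monotonic() - start)

# Avoid a per-run grouping key such as run_id: every run would leave a new
# group in Pushgateway forever. Reusing the same key replaces the last run.
push_to_gateway("pushgateway:9091", job="hourly_etl", registry=registry)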

When NOT to Use Pushgateway ❌

Do NOT use it for long-running services. If your service is always running and can expose a /metrics endpoint, use Prometheus' native pull model. Pushgateway is not a replacement for a real /metrics endpoint.

Do NOT use it to bypass firewall rules. Some engineers use Pushgateway because "Prometheus can't reach our service." This masks a networking/service discovery problem that should be fixed properly.

Do NOT use it for high-cardinality metrics. Pushgateway keeps every pushed series in memory until it is explicitly deleted. If you push metrics with many unique label combinations (one per user, one per request), memory grows without bound and Pushgateway will eventually be OOM-killed.
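The same growth happens with grouping keys. Pushgateway keeps one metric group per distinct grouping key and never expires it.

The toy model below shows the effect over a day of hourly runs. A dict stands in for Pushgateway's store; this is not its real implementation. A per-run label such as `run_id` adds a group on every run, while a stable key is simply overwritten.

```python
class ToyPushgateway:
    """Toy model: groups keyed by their grouping labels, kept until deleted."""

    def __init__(self):
        self.groups = {}

    def put(self, job, metrics, **grouping_key):
        # PUT semantics: replace the whole group identified by job + grouping key
        key = (("job", job),) + tuple(sorted(grouping_key.items()))
        self.groups[key] = dict(metrics)


stable = ToyPushgateway()
per_run = ToyPushgateway()
for run in range(24):  # one day of hourly runs
    stable.put("hourly_etl", {"etl_rows_processed": 1000 + run})
    per_run.put("hourly_etl", {"etl_rows_processed": 1000 + run}, run_id=f"run-{run}")

print(len(stable.groups))   # 1: each run replaced the previous one
print(len(per_run.groups))  # 24: every run left a group behind
```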

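Do NOT run multiple Pushgateway instances behind a load balancer. Pushgateway has no clustering or replication, so each push lands on only one replica and every scrape sees a different partial view. A single instance is also a single point of failure: if it restarts, pushed metrics are lost unless you enable --persistence.file.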

The Staleness Problem

Pushgateway metrics persist until you explicitly delete them. If a nightly cron job at 2am fails before it pushes, the metrics from the last successful run 24 hours earlier are still there, and everything looks fine.
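Explicit deletion is an HTTP `DELETE` on the same grouping URL the job pushed to; Pushgateway then drops every metric in that group. The Python client also ships `delete_from_gateway` for this.

A standard-library sketch follows. `build_delete_request` is a hypothetical helper name, and it assumes label values without `/`. Keep in mind that deletion only helps if something actually runs it: a job that crashes before pushing won't delete anything either.

```python
import urllib.request


def build_delete_request(gateway, job, grouping_key=None):
    """DELETE /metrics/job/<job>{/<label>/<value>} removes that whole metric group."""
    path = f"/metrics/job/{job}"
    for name, value in (grouping_key or {}).items():
        path += f"/{name}/{value}"  # assumes URL-safe values without "/"
    return urllib.request.Request(gateway.rstrip("/") + path, method="DELETE")


# Sending it requires a reachable Pushgateway:
# urllib.request.urlopen(build_delete_request("http://pushgateway:9091", "backup")).close()
```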

Fix: Always push a "job success" gauge and alert on it:

python
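# Push 1 if succeeded, 0 if failed
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
job_success = Gauge("backup_success", "1 if last backup succeeded", registry=registry)
 
try:
    run_backup()  # your job
    job_success.set(1)
except Exception:
    job_success.set(0)
    raise  # still fail the job (non-zero exit) after recording the failure
finally:
    # Runs on success and on failure, so the gauge always reflects the latest run
    push_to_gateway("pushgateway:9091", job="backup", registry=registry)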

Alert rule:

yaml
- alert: BackupJobFailed
  expr: backup_success{job="backup"} == 0
  for: 5m
  annotations:
    summary: "Backup job reported a failure"

Also alert if the job hasn't pushed at all. Pushgateway records a push_time_seconds gauge for every group automatically:

yaml
- alert: BackupJobNotRunning
  expr: time() - push_time_seconds{job="backup"} > 90000  # 25 hours
  annotations:
    summary: "Backup job hasn't pushed metrics in 25 hours"
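The two rules cover different failure modes. The first fires when a run reported failure. The second fires when no run has pushed recently, for example because the pod was OOM-killed or the CronJob never started.

Below is a plain-Python model of both conditions, evaluated against a hypothetical push timestamp. It mirrors the PromQL expressions but ignores the `for:` duration and Prometheus' evaluation interval.

```python
DAY = 24 * 60 * 60
STALE_AFTER = 90_000  # seconds: 25 hours, one daily run plus an hour of slack


def backup_failed(backup_success):
    # Mirrors: backup_success{job="backup"} == 0
    return backup_success == 0


def backup_not_running(now, push_time_seconds):
    # Mirrors: time() - push_time_seconds{job="backup"} > 90000
    return now - push_time_seconds > STALE_AFTER


last_push = 1_700_000_000  # hypothetical Unix time of the last successful push
# The next night's run was killed before it could push, so the gauge still reads 1
print(backup_failed(1))                                       # False: stale value looks healthy
print(backup_not_running(last_push + DAY, last_push))         # False: only 24h old
print(backup_not_running(last_push + DAY + 7200, last_push))  # True: 26h without a push
```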

Summary

| | Pull model | Pushgateway |
|---|---|---|
| Use for | Long-running services | Short-lived batch jobs |
| Metric freshness | Updated every scrape interval | Last pushed value (stale risk) |
| Metric deletion | Series go stale when the target disappears | Manual, or an explicit delete at job end |
| Scaling | Scales with your Prometheus setup | Single instance, no HA |
| Setup complexity | Low | Medium |

The golden rule: use pull by default, and push only for batch jobs that may exit before Prometheus gets a chance to scrape them.

Resources: Prometheus Pushgateway | When to use Pushgateway (official guidance)
