Prometheus Cheatsheet

PromQL queries, metric types, alerting rules, recording rules, and common Prometheus patterns for Kubernetes monitoring.

PromQL Basics

Show all targets and their up/down status (1=up, 0=down)
up

Filter by label: show only node-exporter targets
up{job='node-exporter'}

Instant vector: latest sample of each series of the counter
http_requests_total

Range vector: all samples from the last 5 minutes
http_requests_total[5m]

Per-second rate of increase averaged over 5 minutes (for counters)
rate(http_requests_total[5m])

Instant rate from the last two samples in the window: more responsive, less smooth
irate(http_requests_total[5m])

Total increase over 1 hour
increase(http_requests_total[1h])

Total request rate summed across all series
sum(rate(http_requests_total[5m]))

Request rate summed per status_code label value
sum by (status_code) (rate(http_requests_total[5m]))

Average grouped by instance
avg by (instance) (cpu_usage_percent)

Top 5 series by request rate
topk(5, rate(http_requests_total[5m]))

3 instances with the least free memory
bottomk(3, node_memory_MemFree_bytes)
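To make the difference between rate(), irate() and increase() concrete, here is a minimal Python sketch of the arithmetic on raw counter samples. It is a simplification, not Prometheus's implementation: the real functions also extrapolate to the edges of the range window, so their results differ slightly. The function names and sample values are made up for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_increase(samples: List[Sample]) -> float:
    """Total counter increase across the samples, treating any drop as a reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A counter only goes up. A lower value means the process restarted,
        # so the whole current value counts as new increase.
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples in the range")
    elapsed = samples[-1][0] - samples[0][0]
    return simple_increase(samples) / elapsed


def simple_irate(samples: List[Sample]) -> float:
    """Per-second increase using only the last two samples."""
    return simple_rate(samples[-2:])


# Hypothetical samples scraped every 60 s, with a counter reset before t=120.
samples = [(0, 100), (60, 160), (120, 20), (180, 80)]
print(simple_increase(samples))  # 60 + 20 + 60 = 140.0
print(simple_rate(samples))      # 140 / 180, about 0.78 per second
print(simple_irate(samples))     # 60 / 60 = 1.0 per second
```

Because the reset is absorbed rather than counted as a negative jump, rate() and increase() are only meaningful on counters, not on gauges.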

CPU & Memory Queries

CPU usage % per node
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)

Available memory % per node
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

Memory usage % per node
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

Working set memory per container, excluding pod-level totals and pause containers (K8s)
container_memory_working_set_bytes{container!='', container!='POD'}

Memory per pod in a namespace
sum by (pod) (container_memory_working_set_bytes{namespace='default', container!=''})

CPU usage rate per container (in cores)
rate(container_cpu_usage_seconds_total{container!=''}[5m])

Memory limits set in K8s
kube_pod_container_resource_limits{resource='memory'}

Memory usage vs limit ratio per container (both sides aggregated to the same labels so the series match)
sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=''}) / sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource='memory'})
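A ratio between a cAdvisor metric and a kube-state-metrics metric only returns data when both sides end up with identical label sets, which is why such queries usually aggregate both sides or add an on(...) clause. The Python sketch below models one-to-one vector matching in a simplified way. The helper names and all label values are hypothetical, an empty `on` here stands for "no on clause", and many-to-one matching is not modeled.

```python
from typing import Dict, FrozenSet, Tuple

Labels = FrozenSet[Tuple[str, str]]
Vector = Dict[Labels, float]  # one value per label set, metric name omitted


def labels(**kv: str) -> Labels:
    return frozenset(kv.items())


def divide(lhs: Vector, rhs: Vector, on: Tuple[str, ...] = ()) -> Vector:
    """One-to-one division of two instant vectors.

    Without `on`, series match only if all labels are equal.
    With `on`, only the listed label names are compared.
    """
    def key(ls: Labels) -> Labels:
        return frozenset((k, v) for k, v in ls if k in on) if on else ls

    rhs_by_key = {key(ls): value for ls, value in rhs.items()}
    result: Vector = {}
    for ls, value in lhs.items():
        k = key(ls)
        if k in rhs_by_key:  # series without a partner are dropped
            result[k] = value / rhs_by_key[k]
    return result


# Hypothetical series: the two exporters attach different extra labels.
usage = {labels(namespace="default", pod="web-0", container="app", id="/kubepods/abc"): 300.0}
limits = {labels(namespace="default", pod="web-0", container="app", resource="memory"): 400.0}

print(divide(usage, limits))  # {} because the label sets differ
print(divide(usage, limits, on=("namespace", "pod", "container")))  # one series with value 0.75
```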

Kubernetes Queries

All pods in Pending state (the metric is 1 for the pod's current phase, 0 otherwise)
kube_pod_status_phase{phase='Pending'} == 1

All pods in Failed state
kube_pod_status_phase{phase='Failed'} == 1

Containers with more than 5 restarts
kube_pod_container_status_restarts_total > 5

Deployments with unavailable replicas
kube_deployment_status_replicas_unavailable > 0

All Ready nodes
kube_node_status_condition{condition='Ready', status='true'} == 1

Nodes with disk pressure
kube_node_status_condition{condition='DiskPressure', status='true'} == 1

PVCs stuck in Pending (not yet bound)
kube_persistentvolumeclaim_status_phase{phase='Pending'} == 1

HPA current vs max replicas ratio
kube_horizontalpodautoscaler_status_current_replicas / kube_horizontalpodautoscaler_spec_max_replicas

Number of pods per node
sum by (node) (kube_pod_info)

Number of running pods reported by each kubelet
kubelet_running_pods

HTTP & Latency Queries

HTTP 5xx error ratio (fraction of all requests)
sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))

95th percentile latency across all series
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

99th percentile latency across all series
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

Request rate by endpoint path
sum by (path) (rate(http_requests_total[5m]))

NGINX Ingress 4xx rate
rate(nginx_ingress_controller_requests{status=~'4..'}[5m])

Average upstream latency per ingress
avg by (ingress) (nginx_ingress_controller_ingress_upstream_latency_seconds)
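The percentile queries rely on histogram_quantile() estimating a value from cumulative bucket counts, which is why the `le` label has to survive any aggregation. As a rough illustration, here is a simplified Python sketch of the linear interpolation involved. It is not the Prometheus source: the function name and bucket values are made up, and it assumes the lowest bucket starts at 0.

```python
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound `le`, cumulative count or rate)
INF = float("inf")


def simple_histogram_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile by linear interpolation inside one bucket."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != INF:
        raise ValueError("buckets must include le=+Inf")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")  # no observations in the window

    rank = q * total  # how many observations lie at or below the quantile
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == INF:
                # The quantile is beyond the last finite bucket, so the best
                # available answer is that bucket's upper bound.
                return lower_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative bucket counts for request durations in seconds.
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (INF, 100)]
print(simple_histogram_quantile(0.95, buckets))   # between 0.5 and 1.0
print(simple_histogram_quantile(0.999, buckets))  # capped at 1.0
```

The estimate can never be more precise than the bucket boundaries, so bucket layout matters as much as the query.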

Alerting Rules

Alert when 5xx error rate exceeds 5% for 5 minutes
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: 'Error rate above 5%'

Alert when a container restarts more than 3 times in 15 minutes
- alert: PodCrashLooping
  expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
  for: 1m
  labels:
    severity: warning

Alert when node memory usage exceeds 90% for 10 minutes
- alert: NodeMemoryHigh
  expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.90
  for: 10m

Alert when free disk space drops below 15%
- alert: DiskSpaceLow
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
  for: 5m
  labels:
    severity: warning

Alert when any scrape target is down for 1 minute
- alert: TargetDown
  expr: up == 0
  for: 1m
  labels:
    severity: critical

Alert when p95 latency exceeds 500ms for 5 minutes
- alert: HighLatency
  expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
  for: 5m
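The `for:` field means the expression must stay true on every evaluation for that long before the alert moves from pending to firing; a single false evaluation resets the timer. The Python sketch below simulates this for one series. The function name and the evaluation timeline are invented for illustration, and a real rule tracks this state separately for each label set the expression returns.

```python
from typing import List, Optional, Tuple


def alert_states(evaluations: List[Tuple[float, bool]], for_seconds: float) -> List[str]:
    """Return 'inactive', 'pending' or 'firing' for each (timestamp, expr_is_true) pair."""
    states: List[str] = []
    active_since: Optional[float] = None
    for ts, is_true in evaluations:
        if not is_true:
            # Any false evaluation clears the alert and restarts the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held_long_enough = ts - active_since >= for_seconds
        states.append("firing" if held_long_enough else "pending")
    return states


# Hypothetical rule evaluated every 60 s with `for: 2m`.
timeline = [(0, True), (60, True), (120, False), (180, True), (240, True), (300, True)]
print(alert_states(timeline, for_seconds=120))
# ['pending', 'pending', 'inactive', 'pending', 'pending', 'firing']
```

A longer `for:` trades detection speed for fewer alerts on short spikes.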

Recording Rules & Management

Pre-compute the request rate per job for expensive queries
- record: job:http_requests_total:rate5m
  expr: sum by (job) (rate(http_requests_total[5m]))

Validate Prometheus config file
promtool check config /etc/prometheus/prometheus.yml

Validate alerting and recording rule files
promtool check rules /etc/prometheus/rules/*.yml

Reload Prometheus config without restart (requires the --web.enable-lifecycle flag)
curl -X POST http://localhost:9090/-/reload

Query Prometheus API directly (URL quoted so the shell does not interpret the ?)
curl 'http://localhost:9090/api/v1/query?query=up'

List all scrape targets via API
curl http://localhost:9090/api/v1/targets

Access Prometheus UI from local machine
kubectl port-forward svc/prometheus-server 9090:9090 -n monitoring

List active alerts in Alertmanager
amtool alert query --alertmanager.url=http://localhost:9093

Silence an alert for 2 hours
amtool silence add --alertmanager.url=http://localhost:9093 alertname=TargetDown --duration=2h
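Queries more complex than `up` contain braces, quotes and brackets that must be URL-encoded before they are sent to the query API. As a sketch of one way to do that from a script, the Python below builds the URL for the /api/v1/query endpoint and reads the JSON response. The helper names are hypothetical, and the base URL assumes the port-forward to localhost:9090 is active.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL with the PromQL expression URL-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def instant_query(base_url: str, promql: str) -> list:
    """Run an instant query and return the list of result series."""
    with urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body.get('error')}")
    return body["data"]["result"]


print(build_query_url("http://localhost:9090", "up{job='node-exporter'}"))
# http://localhost:9090/api/v1/query?query=up%7Bjob%3D%27node-exporter%27%7D
```

With a reachable server, `instant_query("http://localhost:9090", "up")` returns one entry per target, each with a `metric` label map and a `value` pair of timestamp and sample value.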