Prometheus Cheatsheet
PromQL queries, metric types, alerting rules, recording rules, and common Prometheus patterns for Kubernetes monitoring.
PromQL Basics
up  # All targets and their up/down status (1 = up, 0 = down)
up{job='node-exporter'}  # Filter by label: only node-exporter targets
http_requests_total  # Instant vector: current value of the counter
http_requests_total[5m]  # Range vector: samples from the last 5 minutes
rate(http_requests_total[5m])  # Per-second average rate of increase over 5 minutes (counters only)
irate(http_requests_total[5m])  # Instant rate from the last two samples in the window: more responsive, less smooth
increase(http_requests_total[1h])  # Total increase over 1 hour
sum(rate(http_requests_total[5m]))  # Sum of rates across all label combinations
sum by (status_code) (rate(http_requests_total[5m]))  # Sum grouped by the status_code label
avg by (instance) (cpu_usage_percent)  # Average grouped by instance
topk(5, rate(http_requests_total[5m]))  # 5 series with the highest request rates
bottomk(3, node_memory_MemFree_bytes)  # 3 instances with the least free memory
CPU & Memory Queries
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)  # CPU usage % per node
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100  # Available memory % per node
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100  # Memory usage % per node
container_memory_working_set_bytes{container!='', container!='POD'}  # Container working set memory (K8s)
sum by (pod) (container_memory_working_set_bytes{namespace='default', container!=''})  # Memory per pod in the default namespace
rate(container_cpu_usage_seconds_total{container!=''}[5m])  # CPU usage per container, in cores
kube_pod_container_resource_limits{resource='memory'}  # Memory limits set in K8s, in bytes
sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=''}) / sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource='memory'})  # Memory usage vs limit ratio; both sides are aggregated to shared labels so the series match
Kubernetes Queries
kube_pod_status_phase{phase='Pending'} == 1  # Pods in Pending state (the metric is 0 or 1 per phase)
kube_pod_status_phase{phase='Failed'} == 1  # Pods in Failed state
kube_pod_container_status_restarts_total > 5  # Containers with more than 5 restarts
kube_deployment_status_replicas_unavailable > 0  # Deployments with unavailable replicas
kube_node_status_condition{condition='Ready', status='true'} == 1  # Ready nodes
kube_node_status_condition{condition='DiskPressure', status='true'} == 1  # Nodes under disk pressure
kube_persistentvolumeclaim_status_phase{phase='Pending'} == 1  # Unbound (Pending) PVCs
kube_horizontalpodautoscaler_status_current_replicas / kube_horizontalpodautoscaler_spec_max_replicas  # HPA current vs max replicas ratio
sum by (node) (kube_pod_info)  # Number of pods per node
kubelet_running_pods  # Running pods per kubelet
HTTP & Latency Queries
sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))  # HTTP 5xx error ratio (0 to 1)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))  # 95th percentile latency across all series
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))  # 99th percentile latency per series (no aggregation)
sum by (path) (rate(http_requests_total[5m]))  # Request rate by endpoint path
rate(nginx_ingress_controller_requests{status=~'4..'}[5m])  # NGINX Ingress 4xx rate
avg by (ingress) (nginx_ingress_controller_ingress_upstream_latency_seconds)  # Average upstream latency per ingress
Alerting Rules
# Alert when the 5xx error rate exceeds 5% for 5 minutes
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: 'Error rate above 5%'
# Alert on more than 3 container restarts in 15 minutes
- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 3
  for: 1m
  labels:
    severity: warning
# Alert when node memory usage exceeds 90%
- alert: NodeMemoryHigh
  expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.90
  for: 10m
# Alert when available disk space falls below 15%
- alert: DiskSpaceLow
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
  for: 5m
  labels:
    severity: warning
# Alert when any scrape target is down
- alert: TargetDown
  expr: up == 0
  for: 1m
  labels:
    severity: critical
# Alert when p95 latency exceeds 500ms
- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m
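Prometheus loads rules from files in which every rule belongs to a named group, so the entries above need a `groups:` wrapper before `promtool check rules` will accept them. A minimal sketch, assuming a hypothetical file path and group name; the `summary` text and the `description` annotation are illustrative and use Prometheus's `$labels` template variable:

```yaml
# /etc/prometheus/rules/alerts.yml (hypothetical path)
groups:
  - name: example-alerts        # hypothetical group name
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: 'Scrape target down'
          # Illustrative annotation; templates are expanded per firing alert.
          description: '{{ $labels.job }}/{{ $labels.instance }} has been unreachable for 1 minute.'
```

The other alerts can be listed under the same `rules:` key, or split into several groups if they should be evaluated independently.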
Recording Rules & Management
# Pre-compute the per-job request rate for expensive queries
- record: job:http_requests_total:rate5m
  expr: sum by (job) (rate(http_requests_total[5m]))
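A recorded series can then be referenced by name like any other metric. The sketch below assumes a hypothetical group name, a 30-second evaluation interval, a hypothetical `HighRequestRate` alert, and an illustrative threshold of 100 requests per second; none of these come from the rule above:

```yaml
groups:
  - name: example-http-recording   # hypothetical group name
    interval: 30s                  # assumed evaluation interval for this group
    rules:
      # Evaluated first: rules in a group run sequentially, in order.
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Reuses the recorded series instead of recomputing the rate.
      - alert: HighRequestRate     # hypothetical alert
        expr: job:http_requests_total:rate5m > 100   # illustrative threshold
        for: 5m
        labels:
          severity: warning
```

Keeping the recording rule ahead of the alert in the same group means the alert sees the value recorded in the same evaluation cycle.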
promtool check config /etc/prometheus/prometheus.yml  # Validate the Prometheus config file
promtool check rules /etc/prometheus/rules/*.yml  # Validate alerting and recording rule syntax
curl -X POST http://localhost:9090/-/reload  # Reload config without a restart (requires --web.enable-lifecycle)
curl 'http://localhost:9090/api/v1/query?query=up'  # Query the Prometheus HTTP API directly (URL quoted so the shell ignores the ?)
curl http://localhost:9090/api/v1/targets  # List all scrape targets via the API
kubectl port-forward svc/prometheus-server 9090:9090 -n monitoring  # Access the Prometheus UI from the local machine
amtool alert query --alertmanager.url=http://localhost:9093  # List active alerts in Alertmanager
amtool silence add --alertmanager.url=http://localhost:9093 alertname=TargetDown --duration=2h  # Silence an alert for 2 hours
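For context, the config file that `promtool check config` validates is also where rule files and Alertmanager are wired in. A minimal fragment, assuming the paths and ports used in the commands above; the interval values and the self-scrape job are illustrative:

```yaml
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s       # illustrative value
  evaluation_interval: 15s   # default interval at which rule groups are evaluated
rule_files:
  - /etc/prometheus/rules/*.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
scrape_configs:
  - job_name: prometheus     # illustrative job: Prometheus scraping itself
    static_configs:
      - targets: ['localhost:9090']
```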