Prometheus Cheatsheet

PromQL queries, metric types, alerting rules, recording rules, and common Prometheus patterns for Kubernetes monitoring.

PromQL Basics

up

Show all targets and their up/down status (1=up, 0=down)

up{job='node-exporter'}

Filter by label — show only node-exporter targets

http_requests_total

Instant vector: the most recent sample of each matching series

http_requests_total[5m]

Range vector — values over last 5 minutes

rate(http_requests_total[5m])

Per-second rate of increase over 5 minutes (for counters)

irate(http_requests_total[5m])

Instant rate from the last two samples in the window; more responsive, less smooth

increase(http_requests_total[1h])

Total increase over 1 hour

sum(rate(http_requests_total[5m]))

Sum rates across all label combinations

sum by (status_code) (rate(http_requests_total[5m]))

Sum grouped by status_code label

avg by (instance) (cpu_usage_percent)

Average grouped by instance

topk(5, rate(http_requests_total[5m]))

Top 5 highest request rates

bottomk(3, node_memory_MemFree_bytes)

3 instances with least free memory

CPU & Memory Queries

100 - (avg by (instance) (irate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)

CPU usage % per node

node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

Available memory % per node

(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

Memory usage %

container_memory_working_set_bytes{container!='', container!='POD'}

Container working set memory (K8s)

sum by (pod) (container_memory_working_set_bytes{namespace='default', container!=''})

Memory per pod in namespace

rate(container_cpu_usage_seconds_total{container!=''}[5m])

CPU usage rate per container

kube_pod_container_resource_limits{resource='memory'}

Memory limits set in K8s

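sum by (namespace, pod, container) (container_memory_working_set_bytes{container!='', container!='POD'}) / sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource='memory'})

Memory usage vs limit ratio per container (both sides aggregated to shared labels so the series match)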

Kubernetes Queries

kube_pod_status_phase{phase='Pending'} == 1

All pods in Pending state

kube_pod_status_phase{phase='Failed'} == 1

All pods in Failed state

kube_pod_container_status_restarts_total > 5

Containers with more than 5 restarts

kube_deployment_status_replicas_unavailable > 0

Deployments with unavailable replicas

kube_node_status_condition{condition='Ready', status='true'} == 1

All Ready nodes

kube_node_status_condition{condition='DiskPressure', status='true'} == 1

Nodes with disk pressure

kube_persistentvolumeclaim_status_phase{phase='Pending'} == 1

Unbound PVCs

kube_horizontalpodautoscaler_status_current_replicas / kube_horizontalpodautoscaler_spec_max_replicas

HPA current vs max replicas ratio

sum(kube_pod_info) by (node)

Number of pods per node

kubelet_running_pods

Total running pods per kubelet

HTTP & Latency Queries

sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))

HTTP 5xx error rate ratio

histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

95th percentile latency

histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

99th percentile latency, computed separately for each label combination (no aggregation across series)

sum by (path) (rate(http_requests_total[5m]))

Request rate by endpoint path

rate(nginx_ingress_controller_requests{status=~'4..'}[5m])

NGINX Ingress 4xx rate

avg by (ingress) (nginx_ingress_controller_ingress_upstream_latency_seconds)

Average upstream latency per ingress

Alerting Rules

- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: 'Error rate above 5%'

Alert when 5xx error rate exceeds 5% for 5 minutes

- alert: PodCrashLooping
  expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
  for: 1m
  labels:
    severity: warning

Alert when a container restarts more than 3 times in 15 minutes

- alert: NodeMemoryHigh
  expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.90
  for: 10m

Alert when node memory exceeds 90%

- alert: DiskSpaceLow
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
  for: 5m
  labels:
    severity: warning

Alert when disk space below 15%

- alert: TargetDown
  expr: up == 0
  for: 1m
  labels:
    severity: critical

Alert when any scrape target is down

- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m

Alert when p95 latency exceeds 500ms
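
The rules above are list items. Prometheus loads them only from a rule file, where they sit under a named group, and that file has to be listed under rule_files in prometheus.yml. A minimal sketch of such a file, reusing the TargetDown rule; the file name, the group name, and the annotation wording are illustrative assumptions rather than part of the original cheatsheet:

```yaml
# Hypothetical file: /etc/prometheus/rules/availability.yml
groups:
  - name: availability            # assumed group name
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 1m                   # condition must hold this long before firing
        labels:
          severity: critical
        annotations:
          # $labels holds the labels of the series that triggered the alert
          summary: 'Target {{ $labels.instance }} (job {{ $labels.job }}) is down'
```

Templating the annotation with the series labels tells the person receiving the alert which target is affected without opening the query.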

Recording Rules & Management

- record: job:http_requests_total:rate5m
  expr: sum by (job) (rate(http_requests_total[5m]))

Pre-compute rate for expensive queries
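
Recording rules live in the same kind of rule file as alerts, and an alert can read the recorded series instead of recomputing the expression. A sketch under stated assumptions: the group name, the 30s interval, and the second rule name (job:http_requests_errors:rate5m) are hypothetical, and the status label is assumed to match the one used in the 5xx queries above.

```yaml
groups:
  - name: http_recording          # assumed group name
    interval: 30s                 # optional; falls back to the global evaluation_interval
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_requests_errors:rate5m   # hypothetical rule name
        expr: sum by (job) (rate(http_requests_total{status=~'5..'}[5m]))
      # Rules in a group run in order, so this alert sees the values recorded just above
      - alert: HighErrorRate
        expr: job:http_requests_errors:rate5m / job:http_requests_total:rate5m > 0.05
        for: 5m
        labels:
          severity: critical
```

Because both recorded series keep only the job label, the division matches one-to-one and the alert fires per job.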

promtool check config /etc/prometheus/prometheus.yml

Validate Prometheus config file

promtool check rules /etc/prometheus/rules/*.yml

Validate rule files (alerting and recording rules)

curl -X POST http://localhost:9090/-/reload

Reload Prometheus config without restart (requires the --web.enable-lifecycle flag)

curl 'http://localhost:9090/api/v1/query?query=up'

Query Prometheus API directly

curl http://localhost:9090/api/v1/targets

List all scrape targets via API

kubectl port-forward svc/prometheus-server 9090:9090 -n monitoring

Access Prometheus UI from local machine (adjust the service name and port to match your install)

amtool alert query --alertmanager.url=http://localhost:9093

List active alerts in Alertmanager

amtool silence add --alertmanager.url=http://localhost:9093 alertname=TargetDown --duration=2h --comment='planned maintenance'

Silence an alert for 2 hours (amtool requires a comment by default)