Build a Complete Kubernetes Monitoring Stack from Scratch (2026)
Step-by-step project walkthrough: set up Prometheus, Grafana, Loki, and AlertManager on Kubernetes using Helm. Real configs, real dashboards, production-ready.
Logging, metrics, and alerts on Kubernetes — this project walkthrough sets up the full observability stack from scratch. By the end you'll have:
- Prometheus collecting cluster and app metrics
- Grafana with pre-built dashboards
- Loki for log aggregation
- AlertManager sending alerts to Slack
This is exactly what a production cluster needs. Let's build it.
What You Need
- A running Kubernetes cluster (Minikube, k3s, or a cloud cluster)
- kubectl configured
- helm v3 installed
- A Slack webhook URL (for alerts — optional but recommended)
Step 1: Set Up Namespaces and Helm Repos
```bash
# Create a dedicated namespace
kubectl create namespace monitoring

# Add the Prometheus community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

# Add the Grafana repo (for Loki)
helm repo add grafana https://grafana.github.io/helm-charts

helm repo update
```

Step 2: Install kube-prometheus-stack
The kube-prometheus-stack chart installs Prometheus, Grafana, AlertManager, the Prometheus Operator, node-exporter, and kube-state-metrics in one shot.
Create a values file:
```yaml
# monitoring-values.yaml
grafana:
  enabled: true
  adminPassword: "devopsboys-admin"  # Change this!
  persistence:
    enabled: true
    size: 5Gi
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: "default"
          orgId: 1
          folder: ""
          type: file
          options:
            path: /var/lib/grafana/dashboards/default

prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi

alertmanager:
  enabled: true
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 2Gi

nodeExporter:
  enabled: true

kubeStateMetrics:
  enabled: true
```

Install it:
```bash
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values monitoring-values.yaml \
  --wait
```

Check that everything came up:
```bash
kubectl get pods -n monitoring
# NAME                                        READY   STATUS    RESTARTS
# monitoring-grafana-xxx                      3/3     Running   0
# monitoring-kube-prometheus-prometheus-xxx   2/2     Running   0
# monitoring-kube-alertmanager-xxx            2/2     Running   0
# monitoring-prometheus-node-exporter-xxx     1/1     Running   0
# monitoring-kube-state-metrics-xxx           1/1     Running   0
```

Step 3: Access Grafana
```bash
# Port-forward to access Grafana locally
kubectl port-forward svc/monitoring-grafana 3000:80 -n monitoring
```

Open http://localhost:3000 and log in with admin / devopsboys-admin.
You'll already have dashboards pre-installed:
- Kubernetes / Compute Resources / Cluster — CPU/memory overview
- Kubernetes / Networking — network traffic
- Node Exporter Full — detailed node metrics
Step 4: Configure AlertManager with Slack
Create a Slack incoming webhook at api.slack.com/apps, then create the AlertManager config:
```yaml
# alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-monitoring-kube-prometheus-alertmanager
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
    route:
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'slack-notifications'
      routes:
        - match:
            severity: critical
          receiver: 'slack-critical'
    receivers:
      - name: 'slack-notifications'
        slack_configs:
          - channel: '#devops-alerts'
            title: '{{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
      - name: 'slack-critical'
        slack_configs:
          - channel: '#devops-critical'
            title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
```

Apply it:

```bash
kubectl apply -f alertmanager-config.yaml
```

Step 5: Install Loki for Log Aggregation
Loki collects logs from all pods and makes them queryable in Grafana.
```yaml
# loki-values.yaml
loki:
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
  auth_enabled: false

promtail:
  enabled: true
  config:
    clients:
      - url: http://loki-gateway/loki/api/v1/push
```

```bash
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --values loki-values.yaml \
  --set grafana.enabled=false \
  --wait
```

Add Loki as a data source in Grafana:
- Go to Configuration → Data Sources → Add data source
- Select Loki
- URL: http://loki:3100
- Save & Test
Now, in the Explore tab, you can query logs:

```
{namespace="default"} |= "error"
{app="my-app"} | json | level="error"
```
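Conceptually, a line filter like `|= "error"` is a substring match over every line in the selected streams, and `| json` parses each line so filters can run on extracted fields. A toy Python sketch of those semantics (not Loki's actual implementation; the log data and helper names here are made up for illustration):

```python
import json

# Toy log store: (stream labels, log line) pairs, roughly how Loki indexes logs.
logs = [
    ({"namespace": "default", "app": "my-app"}, '{"level": "error", "msg": "db timeout"}'),
    ({"namespace": "default", "app": "my-app"}, '{"level": "info", "msg": "request ok"}'),
    ({"namespace": "kube-system", "app": "dns"}, '{"level": "error", "msg": "lookup failed"}'),
]

def select(stream_labels):
    """Stream selector, e.g. {namespace="default"}: match on indexed labels."""
    return [line for labels, line in logs
            if all(labels.get(k) == v for k, v in stream_labels.items())]

# {namespace="default"} |= "error"  ->  substring filter over selected lines
matches = [line for line in select({"namespace": "default"}) if "error" in line]

# {app="my-app"} | json | level="error"  ->  parse each line, filter on a field
json_matches = [line for line in select({"app": "my-app"})
                if json.loads(line).get("level") == "error"]

print(len(matches))       # 1
print(len(json_matches))  # 1
```

The key design point this illustrates: only stream labels are indexed, while line filters and parsers scan log content at query time, which is why Loki stays cheap to ingest but benefits from tight label selectors.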
Step 6: Add Custom Application Metrics
For your own app to expose metrics to Prometheus, add a /metrics endpoint. If you're using Python:
```python
from prometheus_client import start_http_server, Counter, Histogram

# Include a "status" label so error-rate queries filtering on 5xx responses work
REQUEST_COUNT = Counter('app_requests_total', 'Total requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('app_request_latency_seconds', 'Request latency')

start_http_server(8080)  # Exposes /metrics on port 8080
```

Then tell Prometheus to scrape it with a ServiceMonitor:
```yaml
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  namespace: monitoring
  labels:
    release: monitoring  # Must match kube-prometheus-stack's label selector
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
  namespaceSelector:
    matchNames:
      - default
```

```bash
kubectl apply -f servicemonitor.yaml
```

Verify Prometheus picks it up:

```bash
kubectl port-forward svc/monitoring-kube-prometheus-prometheus 9090:9090 -n monitoring
# Open http://localhost:9090/targets — your app should appear
```

Step 7: Create a Custom Grafana Dashboard
In Grafana, go to Dashboards → New Dashboard → Add Panel.
Panel 1: Request rate

```
sum(rate(app_requests_total[5m])) by (endpoint)
```

Panel 2: Error rate

```
sum(rate(app_requests_total{status=~"5.."}[5m])) / sum(rate(app_requests_total[5m]))
```

Panel 3: p99 latency

```
histogram_quantile(0.99, sum(rate(app_request_latency_seconds_bucket[5m])) by (le))
```
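`histogram_quantile` estimates the quantile by finding the first cumulative bucket that crosses the target rank, then interpolating linearly inside it. A simplified stand-alone sketch of that math (not Prometheus's actual implementation; real inputs come from `rate()` and this skips edge cases, and the bucket counts below are made up):

```python
def histogram_quantile(q, buckets):
    """Approximate quantile q from cumulative Prometheus-style buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by bound,
    ending with (float('inf'), total_count), the +Inf bucket.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            # Linear interpolation: assume observations spread evenly in the bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# le="0.1": 50 requests, le="0.5": 90, le="1": 99, le="+Inf": 100
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
print(histogram_quantile(0.99, buckets))  # ~1.0, so p99 latency is about 1s
```

This is also why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile lands in.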
Save the dashboard and export the JSON — commit it to your Git repo. On the next cluster, import it directly.
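All three panels lean on `rate()`, which is roughly the per-second increase of a counter over the window; the HighErrorRate alert in Step 8 reuses the same ratio with a 5% threshold. A toy illustration of that arithmetic (ignoring the counter resets and extrapolation Prometheus actually handles; the sample numbers are invented):

```python
def rate(samples):
    """Per-second increase between the first and last (timestamp, value) samples."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# app_requests_total sampled over a 5-minute window (timestamps in seconds)
total_rate = rate([(0, 1000), (300, 1600)])   # 2.0 requests/s
error_rate = rate([(0, 40), (300, 70)])       # 0.1 errors/s

error_ratio = error_rate / total_rate         # Panel 2's expression
print(error_ratio)                            # 0.05
print(error_ratio > 0.05)                     # False, so HighErrorRate stays quiet
```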
Step 8: Set Up Useful Alerts
Add a PrometheusRule for common alerts:
```yaml
# alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: monitoring
  labels:
    release: monitoring
spec:
  groups:
    - name: app.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(app_requests_total{status=~"5.."}[5m]))
              / sum(rate(app_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value | humanizePercentage }} over 5 minutes"
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
```

```bash
kubectl apply -f alerts.yaml
```

Full Stack Summary
```
┌─────────────────────────────────────────────┐
│           Your Kubernetes Cluster           │
│                                             │
│  App Pods ──metrics──► Prometheus           │
│  App Pods ──logs────► Loki (via Promtail)   │
│  Nodes ──────────────► Node Exporter        │
│  K8s API ────────────► kube-state-metrics   │
│                                             │
│  Prometheus ──────► AlertManager ──► Slack  │
│  Prometheus + Loki ──► Grafana Dashboards   │
└─────────────────────────────────────────────┘
```
Resources to Go Deeper
- KodeKloud Kubernetes Monitoring Course — Hands-on labs for Prometheus, Grafana, and alerting in Kubernetes
- Grafana Cloud Free Tier — Host dashboards without running your own Grafana
- DigitalOcean Kubernetes $200 Credit — Run a real cluster to practice this project
This stack takes about an hour to set up the first time. Once it's running, you'll have full visibility into every pod, node, and application in your cluster. Commit your values files to Git, and you can spin up the same observability stack on any new cluster in minutes.