What is etcd in Kubernetes? Explained Simply
etcd is Kubernetes' brain — it stores the entire cluster state. Here's what it is, how it works, and why backing it up is the most important thing you can do for your cluster.
etcd is a distributed key-value store that acts as Kubernetes' database. Every piece of information about your cluster — every pod, deployment, service, secret, config — lives in etcd.
If etcd is lost and you have no backup, your cluster is effectively gone. Your workloads keep running on the nodes, but the state that tells Kubernetes what should exist is lost.
What etcd Stores
# Everything in your cluster is in etcd:
kubectl get pods           → etcd query
kubectl get deployments    → etcd query
kubectl create deployment  → etcd write
kubectl delete pod         → etcd write
When you run kubectl get pods, the API server queries etcd and returns the data. When you create something, the API server writes it to etcd. That makes etcd the source of truth.
Example of what etcd stores internally:
/registry/pods/default/nginx-pod-abc123 → full pod spec + status
/registry/deployments/production/my-app → deployment spec
/registry/secrets/default/db-password → secret data (stored unencrypted unless encryption at rest is enabled)
/registry/services/specs/default/my-service → service spec
Everything is a key-value pair. The value is the serialized Kubernetes object: protobuf for built-in resources, JSON for custom resources.
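To make the key layout concrete, here is a small sketch. `registry_key` is a hypothetical helper (not part of etcdctl or kubectl) that builds the key for a namespaced built-in object following the `/registry/<resource>/<namespace>/<name>` pattern above, with Services as the one exception it handles. On a real control plane node you would list the actual keys with `etcdctl get /registry/ --prefix --keys-only` plus the certificate flags shown later.

```sh
#!/bin/sh
# Hypothetical helper: build the etcd key for a namespaced built-in object,
# following the /registry/<resource>/<namespace>/<name> layout.
# Services are an exception: their specs live under /registry/services/specs/.
registry_key() {
  resource=$1 namespace=$2 name=$3
  if [ "$resource" = "services" ]; then
    resource="services/specs"
  fi
  printf '/registry/%s/%s/%s\n' "$resource" "$namespace" "$name"
}

registry_key pods default nginx-pod-abc123   # prints /registry/pods/default/nginx-pod-abc123
registry_key services default my-service     # prints /registry/services/specs/default/my-service
```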
etcd in the Kubernetes Architecture
kubectl → API Server ←→ etcd (read/write)
              ↑
  Scheduler (watches pods, writes pod bindings via API Server)
  Controller Manager (reads + writes via API Server)
  kubelet on nodes (reads pod specs, reports status via API Server)
Only the API Server talks directly to etcd. Everything else goes through the API Server.
This is important: the API Server is stateless. It keeps caches for speed, but nothing it holds is authoritative or survives a restart. All state lives in etcd.
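One way to see this on a kubeadm cluster is the API server's own static pod manifest: it is the only control plane component configured with etcd endpoints and etcd client certificates. A minimal sketch, assuming kubeadm's default manifest path; `etcd_flags` is a hypothetical helper name.

```sh
#!/bin/sh
# Print the etcd-related flags from a kube-apiserver static pod manifest.
# The default path assumes a kubeadm-built control plane node.
etcd_flags() {
  manifest=${1:-/etc/kubernetes/manifests/kube-apiserver.yaml}
  # Each flag sits on its own "- --flag=value" line in the container command.
  grep -o -- '--etcd-[a-z-]*=[^[:space:]]*' "$manifest"
}
```

On a kubeadm node, `etcd_flags` typically prints `--etcd-servers=https://127.0.0.1:2379` plus the `--etcd-cafile`, `--etcd-certfile` and `--etcd-keyfile` paths. Running it against the scheduler or controller manager manifests prints nothing, because they only talk to the API server.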
How etcd Works (Simply)
etcd uses the Raft consensus algorithm to maintain consistency across multiple instances.
In a production cluster, you run 3 or 5 etcd instances (always an odd number):
etcd-1 ←→ etcd-2 ←→ etcd-3
Leader elected → all writes go to leader
Leader replicates to followers → majority must confirm before write is committed
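Why odd numbers? Raft needs a majority (quorum) to elect a leader and to commit every write. Adding one member to an odd-sized cluster raises the quorum without raising the number of failures it can survive:
- 3 nodes: needs 2 to agree, survives 1 failure
- 4 nodes: needs 3 to agree, survives 1 failure (no better than 3)
- 5 nodes: needs 3 to agree, survives 2 failures
- 2 nodes: needs 2 to agree, survives 0 failures (useless for HA)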
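The quorum arithmetic behind that list is small enough to write out. A minimal sketch: quorum is a strict majority, floor(n/2) + 1, and the failures a cluster tolerates are the members left over once a quorum is set aside. The function names are illustrative.

```sh
#!/bin/sh
# Quorum for an n-member Raft cluster is a strict majority: floor(n/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Failures tolerated = members that can be lost while a quorum remains.
tolerated_failures() {
  echo $(( $1 - $(quorum "$1") ))
}

for n in 1 2 3 4 5 6 7; do
  echo "members=$n quorum=$(quorum "$n") tolerates=$(tolerated_failures "$n")"
done
```

The output shows 4 members tolerating the same single failure as 3, and 6 the same two failures as 5. The extra member adds cost and one more machine that can fail, with no gain in fault tolerance.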
In a self-managed cluster (kubeadm), etcd runs by default as a static pod on each control plane node. In managed Kubernetes (EKS, GKE, AKS), the cloud provider runs etcd for you and you never see it.
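On a kubeadm cluster the etcd pod ships its own etcdctl, so a small shell function can wrap `kubectl exec` and the certificate flags. A sketch: `etcdctl_k8s` and the `ETCD_POD` default are hypothetical names (kubeadm names the pod `etcd-<node-name>`), and the certificate paths are the kubeadm defaults used in the commands in this article.

```sh
#!/bin/sh
# Pod name is an assumption: kubeadm names the static pod etcd-<node-name>.
ETCD_POD=${ETCD_POD:-etcd-master-node}

# Run etcdctl inside the etcd pod with the kubeadm certificate paths,
# so individual commands don't have to repeat the TLS flags.
etcdctl_k8s() {
  kubectl exec -n kube-system "$ETCD_POD" -- etcdctl \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    "$@"
}

# Usage (requires cluster access):
#   etcdctl_k8s endpoint health
#   etcdctl_k8s member list --write-out=table
#   etcdctl_k8s endpoint status --write-out=table   # shows leader, DB size, Raft term
```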
Checking etcd Health
# On a self-managed cluster, exec into the etcd pod
# (kubeadm names it etcd-<node-name>; etcd-master-node is an example)
kubectl exec -n kube-system etcd-master-node -- \
  etcdctl endpoint health \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Check cluster member list
kubectl exec -n kube-system etcd-master-node -- \
  etcdctl member list --write-out=table \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
Backup etcd — The Most Important Kubernetes Task
If your control plane dies and you have no etcd backup, you lose all cluster state. Your workloads keep running on nodes but you can't manage the cluster anymore.
Create a backup (snapshot):
SNAPSHOT=/backup/etcd-snapshot-$(date +%Y%m%d).db
ETCDCTL_API=3 etcdctl snapshot save "$SNAPSHOT" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Verify the backup (on etcd 3.5+, etcdutl snapshot status is the preferred tool)
ETCDCTL_API=3 etcdctl snapshot status "$SNAPSHOT" --write-out=table
Restore from backup:
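On a kubeadm node, the restore step that trips people up is pointing the etcd static pod at the restored data directory. A minimal sketch of that edit as a hypothetical helper, `point_etcd_at`, assuming kubeadm's default etcd manifest, where the data volume is a hostPath of `/var/lib/etcd`:

```sh
#!/bin/sh
# Hypothetical helper for a kubeadm node: point the etcd static pod's
# data hostPath at a restored directory. Only the hostPath "path:" line
# changes; the container's mountPath and --data-dir stay /var/lib/etcd.
# new_dir must not contain '#', which is used as the sed delimiter.
point_etcd_at() {
  manifest=$1 new_dir=$2
  tmp=$(mktemp) || return 1
  # Match lines whose key is exactly "path:" (not "mountPath:").
  sed "s#^\([[:space:]]*\)path: /var/lib/etcd\$#\1path: $new_dir#" "$manifest" > "$tmp" || return 1
  # Write back in place: a temp file inside the manifests directory
  # could be picked up by the kubelet as an extra static pod.
  cat "$tmp" > "$manifest"
  rm -f "$tmp"
}

# Usage on the control plane node (kubeadm default paths):
#   point_etcd_at /etc/kubernetes/manifests/etcd.yaml /var/lib/etcd-restored
```

Inside the container etcd still reads `/var/lib/etcd`; only the host directory behind it changes, which is why the helper leaves `--data-dir` and `mountPath` alone.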
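# 1. Stop the API server. On kubeadm, move its static pod manifest to any
#    directory outside /etc/kubernetes/manifests.
mv /etc/kubernetes/manifests/kube-apiserver.yaml /root/
# 2. Restore into a NEW data directory (single-member example; etcdutl snapshot restore on etcd 3.5+)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir /var/lib/etcd-restored
# 3. In /etc/kubernetes/manifests/etcd.yaml, change the etcd-data hostPath from
#    /var/lib/etcd to /var/lib/etcd-restored; the kubelet then restarts etcd.
# 4. Once etcd is healthy, move the API server manifest back.
mv /root/kube-apiserver.yaml /etc/kubernetes/manifests/
Automate daily backups: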
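The simplest automation is a host-level script run from cron or a systemd timer on a control plane node. A minimal sketch, assuming the kubeadm certificate paths used above, a local `/backup` directory, and a hypothetical retention of the newest 7 snapshots; `backup_etcd` and `prune_snapshots` are illustrative names, not standard tools.

```sh
#!/bin/sh
# Assumed locations and retention; adjust for your environment.
BACKUP_DIR=${BACKUP_DIR:-/backup}
KEEP=${KEEP:-7}   # hypothetical retention: keep the 7 newest snapshots

# Take a timestamped snapshot, verify it, then prune older ones.
backup_etcd() {
  snap="$BACKUP_DIR/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db"
  ETCDCTL_API=3 etcdctl snapshot save "$snap" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key || return 1
  ETCDCTL_API=3 etcdctl snapshot status "$snap" --write-out=table || return 1
  prune_snapshots
}

# Delete all but the newest $KEEP snapshots.
prune_snapshots() {
  total=0
  for f in "$BACKUP_DIR"/etcd-snapshot-*.db; do
    if [ -e "$f" ]; then total=$((total + 1)); fi
  done
  excess=$((total - KEEP))
  # Glob results are sorted, and the timestamped names sort oldest-first.
  for f in "$BACKUP_DIR"/etcd-snapshot-*.db; do
    [ "$excess" -gt 0 ] || break
    rm -f -- "$f"
    excess=$((excess - 1))
  done
}
```

To run it nightly, save it as a script that ends with a call to `backup_etcd` and schedule it with a crontab line such as `0 2 * * *`. The CronJob below does the same from inside the cluster.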
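# CronJob to back up etcd to S3 daily (runs on a control plane node)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true  # reach etcd on 127.0.0.1:2379
          restartPolicy: OnFailure  # Job pods cannot use the default Always
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          containers:
          - name: etcd-backup
            # Assumption: the image must provide both etcdctl and the AWS CLI;
            # bitnami/etcd alone does not include aws. Pin a tag instead of latest.
            image: bitnami/etcd:latest
            securityContext:
              runAsUser: 0  # the etcd private key on the host is readable by root only
            command:
            - /bin/sh
            - -c
            - |
              SNAPSHOT=/backup/etcd-$(date +%Y%m%d).db
              etcdctl snapshot save "$SNAPSHOT"
              aws s3 cp "$SNAPSHOT" s3://my-backup-bucket/etcd/
            env:
            - name: ETCDCTL_API
              value: "3"
            - name: ETCDCTL_ENDPOINTS
              value: https://127.0.0.1:2379
            - name: ETCDCTL_CACERT
              value: /etc/kubernetes/pki/etcd/ca.crt
            - name: ETCDCTL_CERT
              value: /etc/kubernetes/pki/etcd/server.crt
            - name: ETCDCTL_KEY
              value: /etc/kubernetes/pki/etcd/server.key
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            hostPath:
              path: /backup
              type: DirectoryOrCreate
Common etcd Issues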
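High memory or disk usage: etcd keeps every revision of every key until the history is compacted. The kube-apiserver already requests compaction periodically (every 5 minutes by default), but the freed space only returns to the filesystem after a defrag:
# Add the same --endpoints/--cacert/--cert/--key flags as in the backup command
rev=$(etcdctl endpoint status --write-out=json | jq '.[0].Status.header.revision')
etcdctl compact "$rev"
etcdctl defrag   # blocks this member while it runs; defrag one member at a time
Slow writes: Usually disk I/O. etcd needs fast disks, so SSDs are required for production. Never put etcd on a disk shared with other workloads.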
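Lost quorum: with 2 of 3 members down, etcd can't reach quorum, so it can't elect a leader or commit writes, and the Kubernetes API can't create, update, or delete anything. This is not split brain: Raft's majority rule is exactly what prevents two halves of a cluster from accepting conflicting writes. Fix: bring the members back or restore from backup.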
The CKA exam almost always has an etcd backup question. If you're preparing, memorize the etcdctl snapshot save and etcdctl snapshot restore commands.