
What is etcd in Kubernetes? Explained Simply

etcd is Kubernetes' brain — it stores the entire cluster state. Here's what it is, how it works, and why backing it up is the most important thing you can do for your cluster.

DevOpsBoys · Jun 10, 2026 · 4 min read

etcd is a distributed key-value store that acts as Kubernetes' database. Every piece of information about your cluster — every pod, deployment, service, secret, config — lives in etcd.

If etcd's data is lost and you have no backup, your cluster is gone. The workloads keep running on the nodes, but the state that tells Kubernetes what should exist is lost.


What etcd Stores

bash
# Every kubectl command ends up as an etcd read or write:
kubectl get pods                                  # read from etcd
kubectl get deployments                           # read from etcd
kubectl create deployment my-app --image=nginx    # write to etcd
kubectl delete pod nginx-pod-abc123               # write to etcd

When you run kubectl get pods, the API server queries etcd and returns the data. When you create something, the API server writes to etcd. That's the source of truth.

Example of what etcd stores internally:

/registry/pods/default/nginx-pod-abc123        → full pod spec + status
/registry/deployments/production/my-app        → deployment spec
/registry/secrets/default/db-password          → secret data (not encrypted unless encryption at rest is enabled)
/registry/services/specs/default/my-service    → service spec

Everything is a key-value pair. The value is the serialized Kubernetes object: protobuf for built-in resources by default, JSON for custom resources.
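
To look at this layout on a real cluster, etcdctl can list keys by prefix. The sketch below assumes the /registry/<resource>/<namespace>/<name> pattern shown in the pod, deployment, and secret examples above. `registry_key` is a hypothetical helper, not part of etcdctl, and the commented etcdctl calls still need the endpoint and TLS flags for your cluster.

```shell
# Hypothetical helper: build the etcd key for a namespaced object that
# follows the /registry/<resource>/<namespace>/<name> pattern.
registry_key() {
  printf '/registry/%s/%s/%s\n' "$1" "$2" "$3"
}

# Example usage (run where etcdctl can reach etcd, with TLS flags added):
#   etcdctl get /registry/pods --prefix --keys-only
#   etcdctl get "$(registry_key pods default nginx-pod-abc123)" --print-value-only
```

Expect the values to be mostly unreadable in a terminal, since they are serialized objects rather than plain text.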


etcd in the Kubernetes Architecture

kubectl → API Server → etcd (read/write)
                 ↓
         Scheduler (watches pods, writes bindings via API Server)
         Controller Manager (reads + writes via API Server)
         kubelet on nodes (watches pods, reports status via API Server)

Only the API Server talks directly to etcd. Everything else goes through the API Server.

This is important: the API Server keeps no persistent state of its own. It may cache data in memory, but all cluster state lives in etcd, which is why you can run several API Server replicas side by side.


How etcd Works (Simply)

etcd uses the Raft consensus algorithm to maintain consistency across multiple instances.

In a production cluster, you run 3 or 5 etcd instances (always an odd number):

etcd-1 ←→ etcd-2 ←→ etcd-3

Leader elected → all writes go to leader
Leader replicates to followers → majority must confirm before write is committed

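Why odd numbers? Raft needs a majority (quorum) both to elect a leader and to commit a write. An even-sized cluster tolerates no more failures than the odd size just below it.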

  • 3 nodes: needs 2 to agree — can survive 1 failure
  • 5 nodes: needs 3 to agree — can survive 2 failures
  • 2 nodes: needs 2 to agree — can survive 0 failures (useless for HA)
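
The quorum arithmetic in the list above can be checked with a few lines of shell. This is a minimal sketch; the function names `quorum` and `tolerated_failures` are illustrative and not part of etcd or etcdctl.

```shell
# Quorum is a strict majority of the members: floor(n / 2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Failures the cluster can survive: members minus quorum.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the numbers for cluster sizes 1 through 5.
for n in 1 2 3 4 5; do
  echo "$n members: quorum $(quorum "$n"), tolerates $(tolerated_failures "$n") failure(s)"
done
```

Running it shows that 4 members tolerate 1 failure, the same as 3, which is why even sizes add cost without adding resilience.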

In a self-managed cluster (kubeadm), etcd runs as a static pod on the control plane nodes. In managed Kubernetes (EKS, GKE, AKS), the cloud provider manages etcd, and you never see it.


Checking etcd Health

bash
# On a self-managed cluster, exec into the etcd pod (kubeadm names it etcd-<node-name>)
kubectl exec -n kube-system etcd-master-node -- \
  etcdctl endpoint health \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
 
# Check cluster member list
kubectl exec -n kube-system etcd-master-node -- \
  etcdctl member list \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

Backup etcd — The Most Important Kubernetes Task

If your control plane dies and you have no etcd backup, you lose all cluster state. Your workloads keep running on nodes but you can't manage the cluster anymore.

Create a backup (snapshot):

bash
# Build the path once so save and verify use the same file
SNAPSHOT=/backup/etcd-snapshot-$(date +%Y%m%d).db

ETCDCTL_API=3 etcdctl snapshot save "$SNAPSHOT" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
 
# Verify backup
ETCDCTL_API=3 etcdctl snapshot status "$SNAPSHOT" --write-out=table
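
Before a script uploads or rotates a snapshot, it helps to fail fast when the file is missing or empty. The sketch below is only a cheap sanity check and does not replace etcdctl snapshot status. `snapshot_looks_valid` is a hypothetical helper name, and the path in the usage comment is a placeholder.

```shell
# Hypothetical helper: succeed only if the snapshot file exists and is
# non-empty. This does not inspect the contents of the snapshot.
snapshot_looks_valid() {
  [ -s "$1" ]
}

# Example usage in a backup script:
#   snapshot_looks_valid /backup/etcd-snapshot.db || {
#     echo "snapshot missing or empty" >&2
#     exit 1
#   }
```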

Restore from backup:

bash
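# Stop the API server first
# (on kubeadm: move kube-apiserver.yaml out of /etc/kubernetes/manifests)
# Newer etcd releases move this command to: etcdutl snapshot restore
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir /var/lib/etcd-restored
 
# Point etcd at the restored data directory
# (on kubeadm: change the etcd-data hostPath in /etc/kubernetes/manifests/etcd.yaml)
# Then restart etcd and the API server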

Automate daily backups:

yaml
# CronJob to back up etcd to S3 daily
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 2 * * *"   # 2 AM daily
  jobTemplate:
    spec:
      template:
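        spec:
          hostNetwork: true
          restartPolicy: OnFailure   # required for Job pods
          # Run on a control plane node, where etcd and its certs live
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          containers:
          - name: etcd-backup
            # The image must provide both etcdctl and the aws CLI and run as a
            # user that can read the cert files; pin a tag instead of latest
            image: bitnami/etcd:latest
            command:
            - /bin/sh
            - -c
            - |
              set -e
              SNAPSHOT=/backup/etcd-$(date +%Y%m%d).db
              etcdctl snapshot save "$SNAPSHOT"
              aws s3 cp "$SNAPSHOT" s3://my-backup-bucket/etcd/
            env:
            - name: ETCDCTL_API
              value: "3"
            - name: ETCDCTL_ENDPOINTS
              value: https://127.0.0.1:2379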
            - name: ETCDCTL_CACERT
              value: /etc/kubernetes/pki/etcd/ca.crt
            - name: ETCDCTL_CERT
              value: /etc/kubernetes/pki/etcd/server.crt
            - name: ETCDCTL_KEY
              value: /etc/kubernetes/pki/etcd/server.key
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd

Common etcd Issues

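High memory usage or database size: etcd keeps a history of every key's revisions until they are compacted. The API server normally requests compaction periodically, but you can also compact manually and then defragment to release the freed space: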

bash
etcdctl compact $(etcdctl endpoint status --write-out=json | jq '.[0].Status.header.revision')
etcdctl defrag
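
If jq is not installed on the control plane node, the revision can be pulled out with grep instead. This is a sketch under the assumption that the JSON from etcdctl endpoint status contains a compact "revision":<number> field; `extract_revision` is a hypothetical helper name, and it relies on grep -o, which GNU and BSD grep both support.

```shell
# Hypothetical helper: read the JSON from
# `etcdctl endpoint status --write-out=json` on stdin and print the
# first "revision" value found.
extract_revision() {
  grep -o '"revision":[0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$'
}

# Example usage (add the endpoint and TLS flags for your cluster):
#   rev=$(etcdctl endpoint status --write-out=json | extract_revision)
#   etcdctl compact "$rev" && etcdctl defrag
```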

Slow writes: Usually disk I/O. etcd syncs every write to its write-ahead log on disk, so it needs a fast disk; use SSDs in production. Never share etcd's disk with other workloads.

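Quorum loss: If 2 of 3 members are down, etcd can't reach quorum. It stops accepting writes, so the API server can no longer change cluster state. Fix: bring the failed members back, or restore from backup. (This is not split brain: Raft's majority rule is what prevents two leaders from accepting writes at once.)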


The CKA exam almost always has an etcd backup question. If you're preparing, memorize the etcdctl snapshot save and etcdctl snapshot restore commands.

Practice etcd backup and restore with hands-on labs at KodeKloud.
