What is Chaos Engineering — Explained Simply

Chaos engineering sounds like deliberately breaking things. It is — but in a controlled way that makes your systems stronger. Here's what it is, how it works, and how to start.

DevOpsBoys · Jun 1, 2026 · 4 min read

Netflix famously runs a tool called Chaos Monkey that randomly kills production servers. This sounds insane. It's actually brilliant.

Here's why, and what chaos engineering actually means.


The Core Idea

Traditional approach: Hope your system is resilient. Find out it's not when it fails for real users.

Chaos engineering: Deliberately break things in controlled experiments to find weaknesses before they cause real incidents.

You're not randomly breaking things. You're running scientific experiments to validate assumptions about your system's behavior.


The Chaos Engineering Hypothesis

Every chaos experiment starts with a hypothesis:

"If we kill one of three API server pods, the service should continue handling requests with no user-visible impact because we have load balancing and auto-restart."

Then you test it. If the hypothesis is correct — great, your assumption was valid. If not — you found a weakness to fix before it becomes a 3 AM incident.
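
One way to make that hypothesis testable is to count failed requests while the experiment runs. The sketch below is a minimal probe, not a load test: `probe_endpoint` is a hypothetical helper name, the 2-second timeout is an arbitrary assumption, and the URL in the usage note is a placeholder for your own health endpoint.

```shell
# probe_endpoint URL COUNT
# Sends COUNT requests one after another and prints how many failed
# (HTTP status >= 400, timeout, or no response at all).
probe_endpoint() {
  url="$1"; count="$2"; failed=0; i=0
  while [ "$i" -lt "$count" ]; do
    # -f: exit non-zero on HTTP errors; --max-time 2: give up after 2 seconds
    curl -fsS -o /dev/null --max-time 2 "$url" || failed=$((failed + 1))
    i=$((i + 1))
  done
  echo "$failed"
}
```

Run something like `probe_endpoint http://api.example.internal/healthz 50` while one pod is down. If it prints `0`, the hypothesis held for those 50 requests.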


Real Examples

Netflix's Chaos Monkey

Randomly terminates EC2 instances in production during business hours. This forces engineers to build services that survive instance failures. Netflix found that engineers only build resilient services when failure is expected, not just possible.

LinkedIn's Murphy

Introduces network latency between services. Finds which services have cascading failure issues when dependencies are slow.

Shopify's Toxiproxy

Simulates network conditions: latency, packet loss, connection resets. Tests how apps behave under degraded networking.


Chaos Engineering on Kubernetes

Kubernetes is the ideal platform for chaos engineering — you can kill pods, drain nodes, and inject network faults programmatically.
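
For instance, draining a node needs nothing beyond `kubectl`. This is a rough sketch: the helper names are illustrative, the 120-second timeout is an arbitrary assumption, and you should only try it on a node you can afford to lose.

```shell
# drain_node NODE
# Cordons NODE and evicts its pods so they get rescheduled elsewhere.
drain_node() {
  kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data --timeout=120s
}

# restore_node NODE
# Marks NODE schedulable again once the experiment is over.
restore_node() {
  kubectl uncordon "$1"
}
```

Call `drain_node <node-name>`, watch how the workloads recover, then call `restore_node <node-name>`.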

Tool 1: Chaos Monkey for Kubernetes (kube-monkey)

yaml
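# kube-monkey kills pods of opted-in workloads on a randomized weekday schedule
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    kube-monkey/enabled: "enabled"        # opt-in to chaos
    kube-monkey/identifier: "my-app"      # required: unique ID kube-monkey uses to find this app's pods
    kube-monkey/mtbf: "3"                 # mean time between failures: roughly one attack every 3 weekdays
    kube-monkey/kill-mode: "fixed"
    kube-monkey/kill-value: "1"           # kill 1 pod per attack
# The identifier label must also appear on the pod template
# (spec.template.metadata.labels) so kube-monkey can match the pods.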

Tool 2: Chaos Mesh (More Powerful)

bash
# Install Chaos Mesh
helm repo add chaos-mesh https://charts.chaos-mesh.org
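helm install chaos-mesh chaos-mesh/chaos-mesh -n chaos-testing --create-namespace

yaml
# Make one random api-server pod in the production namespace unavailable for 60 seconds.
# Chaos Mesh 2.x runs recurring experiments through a Schedule resource;
# the inline spec.scheduler field only exists in 1.x.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-failure-experiment
spec:
  schedule: "0 10 * * 1-5"  # Weekdays at 10 AM
  type: PodChaos
  concurrencyPolicy: Forbid  # never overlap two runs
  historyLimit: 5
  podChaos:
    action: pod-failure      # use pod-kill to delete the pod instead
    mode: one
    selector:
      namespaces: [production]
      labelSelectors:
        "app": "api-server"
    duration: "60s"   # Stop after 60 seconds

yaml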
# Inject network delay to simulate slow dependencies
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-experiment
spec:
  action: delay
  mode: all
  selector:
    namespaces: [production]
    labelSelectors:
      "app": "api-server"
  delay:
    latency: "500ms"
    correlation: "25"
    jitter: "100ms"
  duration: "5m"

The Five Principles of Chaos Engineering

  1. Build a hypothesis around steady state behavior — define what "normal" looks like first (requests/second, error rate, latency p99)

  2. Vary real-world events — use realistic failure modes: pod deaths, network partitions, disk failures, high CPU

  3. Run experiments in production — staging doesn't have real traffic patterns. Production is where you find real weaknesses

  4. Automate experiments to run continuously — one-off experiments aren't chaos engineering. Scheduled experiments are

  5. Minimize blast radius — start small, isolate experiments, have a stop button
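
With Chaos Mesh, the "stop button" from principle 5 can be a pair of `kubectl` commands. The sketch below assumes Chaos Mesh's documented pause annotation (`experiment.chaos-mesh.org/pause`); the function names are illustrative, and the resource kind, name, and namespace are whatever you used when you applied the experiment.

```shell
# pause_experiment KIND NAME NAMESPACE
# Pauses a running Chaos Mesh experiment by setting the pause annotation.
pause_experiment() {
  kubectl annotate "$1" "$2" -n "$3" experiment.chaos-mesh.org/pause=true --overwrite
}

# resume_experiment KIND NAME NAMESPACE
# Removes the annotation (the trailing "-") so the experiment continues.
resume_experiment() {
  kubectl annotate "$1" "$2" -n "$3" experiment.chaos-mesh.org/pause-
}

# abort_experiment KIND NAME NAMESPACE
# Deletes the experiment object, which ends the experiment for good.
abort_experiment() {
  kubectl delete "$1" "$2" -n "$3"
}
```

For example, `pause_experiment networkchaos network-delay-experiment default` would pause the network delay experiment, assuming it was applied to the `default` namespace. Rehearse the stop button before the first real experiment, not during it.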


How to Start (Without Scaring Your Team)

Week 1: Game Day (Planned Chaos)

Don't start with automated chaos. Start with a "game day" — a planned failure exercise:

  1. Tell everyone: "On Thursday at 2 PM, we're going to kill one pod"
  2. Make sure monitoring/alerting is in place first
  3. Kill the pod during business hours
  4. Watch what happens — does alerting fire? Does traffic redirect?
  5. Fix any problems found

A game day builds confidence and shows the team chaos is controlled.
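
Steps 3 and 4 can be as small as the sketch below. `kill_one_pod` is a hypothetical helper, and the namespace and label in the usage note are assumptions borrowed from the earlier examples; swap in your own.

```shell
# kill_one_pod NAMESPACE LABEL_SELECTOR
# Deletes the first pod that matches the selector.
kill_one_pod() {
  ns="$1"; selector="$2"
  # -o name prints entries such as "pod/api-server-abc123"
  pod=$(kubectl get pods -n "$ns" -l "$selector" -o name | head -n 1)
  if [ -z "$pod" ]; then
    echo "no pod matches '$selector' in namespace '$ns'" >&2
    return 1
  fi
  # --wait=false returns immediately so you can start watching right away
  kubectl delete -n "$ns" "$pod" --wait=false
}
```

Run `kill_one_pod production app=api-server`, then `kubectl get pods -n production -l app=api-server --watch` to see the replacement pod come up while you keep an eye on dashboards and alerts.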

Week 2-4: Automate Low-Risk Experiments

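Start with services that are both resilient and non-critical:

yaml
# Start with: make one non-critical pod unavailable during business hours.
# Chaos Mesh 2.x schedules recurring runs with a Schedule resource
# (the inline spec.scheduler field only exists in 1.x).
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: safe-start
spec:
  schedule: "0 14 * * 2"  # Every Tuesday at 2 PM
  type: PodChaos
  concurrencyPolicy: Forbid
  historyLimit: 5
  podChaos:
    action: pod-failure
    mode: one
    selector:
      labelSelectors:
        "tier": "non-critical"    # Not databases or critical APIs
    duration: "30s"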

Measuring Success

Define your Steady State:

Normal:
- Error rate < 0.1%
- P99 latency < 200ms
- No alerts firing

Expected after pod kill:
- Error rate stays < 0.5%  (brief spike, then recovery)
- P99 latency stays < 400ms  (slight increase, then recovery)
- Alert fires but resolves within 2 minutes

If actual behavior matches expected — hypothesis confirmed. If not — you found something to fix.
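
That comparison is easy to script. The sketch below is one possible check, assuming you already have the observed error rate and P99 latency from your monitoring system; `check_steady_state` is a hypothetical helper, and the sample numbers in the usage note are made up for illustration.

```shell
# check_steady_state ERROR_RATE_PCT P99_MS MAX_ERROR_PCT MAX_P99_MS
# Prints "confirmed" if both observed values are below their limits,
# otherwise "refuted". awk does the floating-point comparison.
check_steady_state() {
  awk -v err="$1" -v p99="$2" -v max_err="$3" -v max_p99="$4" 'BEGIN {
    if (err + 0 < max_err + 0 && p99 + 0 < max_p99 + 0) print "confirmed"
    else print "refuted"
  }'
}
```

With the limits above, `check_steady_state 0.3 350 0.5 400` prints `confirmed`, while `check_steady_state 0.7 350 0.5 400` prints `refuted` because the error rate is over 0.5%.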


What Chaos Engineering is NOT

  • Not random destruction — every experiment has a hypothesis and expected outcome
  • Not always production — start in staging, graduate to production
  • Not for startups with no monitoring — set up alerts and observability FIRST
  • Not for untested code — build basic resilience before adding chaos

Chaos engineering shifts you from "we hope our system is resilient" to "we know exactly how our system behaves under failure." That confidence is worth the controlled pain of experiments.
