Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
Let’s see this in action. Imagine we have a simple web service that depends on a database.
```yaml
# Kubernetes Deployment for our web service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-service
  template:
    metadata:
      labels:
        app: web-service
    spec:
      containers:
        - name: app
          image: my-web-app:1.0
          ports:
            - containerPort: 8080
          env:
            - name: DB_HOST
              value: "database-service" # Service name for the database
            - name: DB_PORT
              value: "5432"
```
```yaml
# Kubernetes Service for our database
apiVersion: v1
kind: Service
metadata:
  name: database-service
spec:
  selector:
    app: database
  ports:
    - protocol: TCP
      port: 5432
      targetPort: 5432
```
Our web service pods will try to connect to database-service:5432. If the database is unavailable, the web service will start failing requests.
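On the application side, the service reads those `DB_HOST` and `DB_PORT` environment variables to find the database. A minimal sketch of that lookup (Python for illustration; the function name is hypothetical), with an explicit connection timeout so an unavailable database fails fast instead of hanging:

```python
import os
import socket

def check_database_reachable(timeout_s=2.0):
    """Fail fast if the database is unreachable instead of hanging."""
    host = os.environ.get("DB_HOST", "localhost")
    port = int(os.environ.get("DB_PORT", "5432"))
    try:
        # create_connection honors the timeout for both DNS lookup and TCP connect
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

Without an explicit timeout, a dead database can leave connection attempts hanging for the OS default, which is often minutes.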
Now, let’s introduce a controlled failure. We can use a tool like Chaos Mesh to simulate network latency between the web service and the database.
```yaml
# Chaos Mesh NetworkChaos experiment to inject network latency
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: web-db-delay
spec:
  action: delay
  mode: one
  selector:
    labelSelectors:
      app: web-service
  delay:
    latency: "200ms"
  duration: "5m"
```
Applied with `kubectl apply -f web-db-delay.yaml`, this manifest targets pods with the label app=web-service and adds 200 milliseconds of delay to their outgoing traffic for 5 minutes. The `mode: one` setting ensures the experiment runs on only one of the web service pods.
What happens? The web service pods that are part of the experiment will experience increased latency when trying to reach the database. If the web service isn’t designed to handle this, it might start timing out requests, returning errors, or even crashing.
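Whether that degradation is graceful is a design choice. One common defense, sketched below with a hypothetical `call_with_deadline` helper, is to bound every call to the slow dependency with a deadline and serve a cached or default value when it is exceeded:

```python
import concurrent.futures
import time

def call_with_deadline(fn, timeout_s, fallback):
    # Run fn in a worker thread; if it misses the deadline, serve a fallback
    # instead of letting the slow dependency stall the whole request.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            return fallback

def slow_db_query():
    time.sleep(0.5)          # stands in for the injected latency
    return "fresh-row"

print(call_with_deadline(slow_db_query, 0.1, "cached-row"))  # → cached-row
```

Note that the worker thread still runs to completion in this sketch; a production service would also cap concurrency so slow calls cannot pile up.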
The mental model for chaos engineering involves a few key principles:
- Hypothesize about steady state: Before an experiment, define what "normal" behavior looks like for your system. This involves understanding key metrics like error rates, latency, and throughput. For our web service, a steady state might be an error rate below 0.1% and P99 latency below 100ms.
- Limit the blast radius: Start small. Instead of impacting all your web service pods, you might target one pod, then a subset, or even a specific network path. This limits the potential damage if your hypothesis is wrong. The `mode: one` setting in the Chaos Mesh example is controlling the blast radius.
- Run experiments in production (carefully): The ultimate goal is to test in the environment where failures will actually occur. This requires careful planning, monitoring, and the ability to quickly roll back experiments.
- Automate experiments: Over time, chaos experiments can be integrated into CI/CD pipelines or run on a schedule, continuously verifying system resilience.
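The steady-state hypothesis can be made concrete as code. A minimal sketch, using the thresholds from the first principle and assuming the metric inputs come from your monitoring system:

```python
def steady_state_ok(latencies_ms, error_count, request_count,
                    max_error_rate=0.001, p99_slo_ms=100.0):
    """Return True if the system is within its hypothesized steady state."""
    error_rate = error_count / request_count
    ordered = sorted(latencies_ms)
    # Nearest-rank P99: the latency below which ~99% of samples fall
    p99 = ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]
    return error_rate <= max_error_rate and p99 <= p99_slo_ms

# 1000 requests: 999 fast, one slow outlier, no errors
samples = [20.0] * 999 + [250.0]
print(steady_state_ok(samples, error_count=0, request_count=1000))  # → True
```

Evaluating this check before, during, and after a fault is what turns a chaos experiment into a testable hypothesis rather than a hunch.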
The levers you control in chaos engineering are primarily the type of failure you inject and the blast radius. You can inject:
- Resource exhaustion: CPU, memory, disk I/O.
- Network issues: Latency, packet loss, connection resets, DNS failures.
- Application failures: Crashing processes, injecting errors into responses.
- Infrastructure failures: Shutting down nodes, terminating pods, blocking access to external services.
The blast radius is controlled by how many instances of a component are affected, or how many users or requests are impacted.
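In Chaos Mesh, this lever is the `mode` field of the experiment spec. Besides `one`, it accepts values such as `all`, `fixed`, `fixed-percent`, and `random-max-percent`, so the same fault can be widened gradually. For example, to delay half of the matching pods (a fragment of the earlier spec):

```yaml
# Widening the blast radius: delay 50% of the matching pods
spec:
  action: delay
  mode: fixed-percent
  value: "50"   # percentage of matching pods to affect
  selector:
    labelSelectors:
      app: web-service
  delay:
    latency: "200ms"
```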
A common pitfall is testing only the failure itself. Injecting faults is crucial, but you also need to verify that your system recovers gracefully. For example, after latency is injected, the web service may start returning errors; once the latency is removed, does it automatically resume normal operation without manual intervention? This recovery path is as important as the failure injection itself.
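Verifying the recovery path can itself be automated. A sketch of the idea, assuming some `check_fn` that returns True when metrics are back within their steady-state thresholds (a hypothetical stand-in for a query against your monitoring system):

```python
import time

def recovers_within(check_fn, grace_period_s, poll_interval_s=0.5):
    # After the fault is removed, poll the steady-state check until it
    # passes or the grace period expires -- no manual intervention allowed.
    deadline = time.monotonic() + grace_period_s
    while time.monotonic() < deadline:
        if check_fn():
            return True
        time.sleep(poll_interval_s)
    return check_fn()
```

Running this check immediately after each experiment ends turns "does it recover on its own?" into a pass/fail signal you can wire into the same automation that schedules the experiments.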
The next step in chaos engineering is to explore more complex failure scenarios, like injecting correlated failures across multiple services.