Chaos engineering isn’t about breaking things randomly; it’s about proactively finding weaknesses before your users do by deliberately injecting controlled failures.
Imagine you’re running a distributed e-commerce platform. Here’s a snippet of what a typical user transaction might look like internally:
```json
{
  "transaction_id": "txn_abc123",
  "user_id": "user_xyz789",
  "timestamp": "2023-10-27T10:30:00Z",
  "steps": [
    {
      "service": "api-gateway",
      "action": "receive_request",
      "latency_ms": 5,
      "status": "success"
    },
    {
      "service": "user-service",
      "action": "get_user_profile",
      "latency_ms": 25,
      "status": "success",
      "dependencies": ["auth-service"]
    },
    {
      "service": "product-catalog",
      "action": "search_products",
      "latency_ms": 50,
      "status": "success",
      "query": "running shoes"
    },
    {
      "service": "inventory-service",
      "action": "check_stock",
      "latency_ms": 30,
      "status": "success",
      "product_ids": ["prod_123", "prod_456"]
    },
    {
      "service": "payment-service",
      "action": "process_payment",
      "latency_ms": 100,
      "status": "success",
      "amount": 75.99
    },
    {
      "service": "order-service",
      "action": "create_order",
      "latency_ms": 40,
      "status": "success",
      "order_details": {
        "user_id": "user_xyz789",
        "items": [{"product_id": "prod_123", "quantity": 1}],
        "total": 75.99
      }
    },
    {
      "service": "notification-service",
      "action": "send_confirmation_email",
      "latency_ms": 60,
      "status": "success",
      "recipient": "user_xyz789@example.com"
    }
  ]
}
```
This JSON represents a single user’s journey through various microservices to complete a purchase. Each step shows the service, the action, how long it took, and whether it succeeded. This is the "steady state" you aim to preserve.
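In practice, "preserving steady state" means having an automated check against traces like this one. A minimal sketch (the threshold and helper are illustrative, and the trace is abbreviated to two steps):

```python
import json

# An abbreviated version of the transaction trace above.
trace = json.loads("""
{
  "transaction_id": "txn_abc123",
  "steps": [
    {"service": "api-gateway", "action": "receive_request", "latency_ms": 5, "status": "success"},
    {"service": "payment-service", "action": "process_payment", "latency_ms": 100, "status": "success"}
  ]
}
""")

def steady_state_holds(trace, max_total_latency_ms=500):
    """Steady state: every step succeeded and end-to-end latency stays bounded."""
    steps = trace["steps"]
    all_ok = all(step["status"] == "success" for step in steps)
    total_latency = sum(step["latency_ms"] for step in steps)
    return all_ok and total_latency <= max_total_latency_ms
```

During an experiment you would run this check continuously: the moment it starts failing, the injected fault has pushed the system out of its steady state.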
The core problem chaos engineering addresses is the inherent complexity and emergent behavior in distributed systems. You can test individual components in isolation (unit tests, integration tests), but you can’t fully predict how they’ll behave when interconnected under real-world conditions, especially when things go wrong. Chaos engineering introduces these "wrong" conditions in a controlled, experimental way to build confidence in your system’s resilience.
The internal workings involve a "chaos agent" (like Netflix’s Chaos Monkey or Gremlin) that runs within your infrastructure. This agent is configured to perform specific experiments. An experiment might involve:
- Resource Exhaustion: Temporarily increasing CPU or memory utilization on a specific set of application servers.
- Network Latency/Packet Loss: Injecting delays or dropping network packets between services.
- Service Unavailability: Terminating instances of a specific service.
- Dependency Failure: Simulating a downstream service returning errors or timing out.
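To make the second bullet concrete, here is a hypothetical latency-injection wrapper in application code; real agents like Gremlin inject faults at the network or host layer, so treat this decorator, its parameters, and the stand-in `check_stock` call as illustrative only:

```python
import functools
import random
import time

def inject_latency(probability=0.1, delay_ms=200, rng=random):
    """Decorator that delays a configurable fraction of calls,
    simulating network latency on a service dependency."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                time.sleep(delay_ms / 1000.0)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(probability=0.05, delay_ms=150)
def check_stock(product_id):
    # Stand-in for a real call to inventory-service.
    return {"product_id": product_id, "in_stock": True}
```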
Consider an experiment where we want to test the order-service’s resilience to a flaky inventory-service. We might configure an agent to randomly return a 503 Service Unavailable error for 5% of inventory-service requests for 10 minutes. The experiment’s hypothesis would be: "If the inventory-service becomes unavailable for some requests, the order-service will gracefully degrade, perhaps by informing the user that stock cannot be confirmed at this moment, rather than crashing the entire transaction."
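The two sides of this experiment can be sketched as follows; the function names and the fallback response are assumptions about how such an order-service might be written, not a prescribed design:

```python
import random

class ServiceUnavailable(Exception):
    """Simulates an HTTP 503 from a downstream dependency."""

def flaky_inventory_check(product_id, failure_rate=0.05, rng=random):
    # The chaos agent's behavior: fail a configured fraction of requests.
    if rng.random() < failure_rate:
        raise ServiceUnavailable("inventory-service returned 503")
    return {"product_id": product_id, "in_stock": True}

def create_order(product_id):
    """order-service logic under test: degrade gracefully
    instead of failing the whole transaction."""
    try:
        stock = flaky_inventory_check(product_id)
        return {"status": "confirmed", "stock": stock}
    except ServiceUnavailable:
        # The hypothesized fallback: tell the user stock is unconfirmed.
        return {"status": "pending",
                "note": "stock could not be confirmed; we will follow up"}
```

If `create_order` ever raises instead of returning a `pending` order during the 10-minute window, the hypothesis is falsified and you have found a weakness.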
The levers you control are the scope (which hosts/services are affected), the impact (what kind of failure, e.g., CPU, network, process kill), the duration (how long the failure persists), and the frequency/probability (how often it occurs). You define a hypothesis before running an experiment, and then you observe the system’s actual behavior. If the behavior matches your hypothesis, you’ve gained confidence. If it doesn’t, you’ve found a weakness to fix.
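Those four levers plus the hypothesis are what an experiment definition captures. A minimal sketch, assuming a hand-rolled data structure rather than any particular tool's schema:

```python
from dataclasses import dataclass

@dataclass
class ChaosExperiment:
    """The levers of a chaos experiment, plus the hypothesis under test."""
    scope: list          # which hosts/services are affected
    impact: str          # e.g. "http-503", "cpu-stress", "network-latency"
    duration_s: int      # how long the failure persists
    probability: float   # fraction of requests/instances affected
    hypothesis: str      # the behavior you expect to observe

experiment = ChaosExperiment(
    scope=["inventory-service"],
    impact="http-503",
    duration_s=600,
    probability=0.05,
    hypothesis="order-service degrades gracefully; no failed transactions",
)

def evaluate(experiment, observed_behavior):
    # Match: confidence gained. Mismatch: a weakness to fix.
    if observed_behavior == experiment.hypothesis:
        return "confidence gained"
    return "weakness found"
```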
The one thing most people don’t realize is that the most valuable chaos experiments are often the ones that don’t break anything visibly to the end-user but do reveal subtle performance degradations or incorrect fallback logic. For instance, a common experiment is to slightly increase the CPU load on a database replica; a well-architected system should handle this without impacting read performance, but a poorly tuned one might start returning slower queries, which you’d only catch with fine-grained monitoring during the experiment.
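Catching that kind of subtle degradation comes down to comparing tail latencies between a baseline window and the experiment window. A sketch using the standard library (the tolerance and sample data are made up for illustration):

```python
import statistics

def p95(latencies_ms):
    # 95th-percentile latency via the stdlib; needs a reasonable sample size.
    return statistics.quantiles(latencies_ms, n=100)[94]

def degraded(baseline_ms, experiment_ms, tolerance=1.2):
    """Flag the experiment window if p95 latency grew more than `tolerance`x."""
    return p95(experiment_ms) > tolerance * p95(baseline_ms)

# Simulated query latencies from a database replica.
baseline = [10, 11, 10, 12, 11, 10, 13, 11, 10, 12] * 10       # healthy
under_stress = [10, 11, 12, 14, 30, 11, 12, 35, 13, 12] * 10   # tail growth
```

Median latency barely moves between these two windows; only the percentile comparison exposes the regression, which is exactly why fine-grained monitoring during the experiment matters.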
After mastering controlled failure injection, the next logical step is to automate these experiments as part of your CI/CD pipeline.