Monitoring tells you when something is broken. Chaos engineering tells you if your monitoring will catch it before it matters.

Let’s see what that looks like in practice. Imagine we’re running a simple e-commerce checkout service.

{
  "service_name": "checkout-service",
  "version": "1.5.2",
  "dependencies": {
    "payment-gateway": "2.1.0",
    "inventory-service": "3.0.1"
  },
  "config": {
    "timeout_ms": 500,
    "retry_count": 3,
    "circuit_breaker_threshold": 0.5
  }
}

Our checkout-service talks to payment-gateway and inventory-service. It has a 500ms timeout, retries 3 times, and a circuit breaker that trips if 50% of requests fail.
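
A minimal sketch of how checkout-service might enforce these settings, assuming a simple count-based breaker (CircuitBreaker, call_with_retries, and the min_requests guard are illustrative, not the service's actual code):

```python
class CircuitBreaker:
    """Trips open once the observed failure rate crosses the threshold."""

    def __init__(self, threshold=0.5, min_requests=10):
        self.threshold = threshold        # 0.5 -> open at 50% failures
        self.min_requests = min_requests  # don't trip on a tiny sample
        self.failures = 0
        self.total = 0

    def record(self, success):
        self.total += 1
        if not success:
            self.failures += 1

    @property
    def is_open(self):
        return (self.total >= self.min_requests
                and self.failures / self.total >= self.threshold)


def call_with_retries(call, timeout_ms=500, retry_count=3, breaker=None):
    """Try the dependency up to 1 + retry_count times, honoring the breaker."""
    if breaker and breaker.is_open:
        raise RuntimeError("circuit open: failing fast")
    last_error = None
    for _attempt in range(1 + retry_count):
        try:
            result = call(timeout_ms)
            if breaker:
                breaker.record(success=True)
            return result
        except TimeoutError as exc:
            last_error = exc
            if breaker:
                breaker.record(success=False)
    raise last_error
```

Note the interaction this makes visible: retries multiply load on a struggling dependency, and the breaker only has data to trip on after those retries have already failed.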

Now, what do we monitor? We’d likely set up alerts for:

  • Request Latency: Average and p99 latency for checkout-service.
  • Error Rate: Percentage of 5xx errors from checkout-service.
  • Dependency Health: Latency and error rates for calls to payment-gateway and inventory-service.
  • Circuit Breaker State: Alert if the checkout-service circuit breaker to either dependency is open.

This is great. If payment-gateway suddenly starts returning 500s, our error rate alert for checkout-service should fire. If it gets slow, our latency alert will fire. If the circuit breaker trips, we’ll get an alert for that.
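
Those alert conditions can be sketched as a toy evaluation pass over one monitoring window (the 500ms p99 threshold and the 1% error budget are hypothetical values for this walkthrough; real alerts would live in your alerting system, not application code):

```python
import math


def p99(samples):
    """99th percentile by the nearest-rank method on a sorted copy."""
    ordered = sorted(samples)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]


def evaluate_alerts(latencies_ms, errors, total):
    """Return the alert names that should fire for one evaluation window."""
    fired = []
    if p99(latencies_ms) > 500:          # hypothetical p99 latency threshold
        fired.append("checkout_p99_latency_high")
    if total and errors / total > 0.01:  # hypothetical 1% error budget
        fired.append("checkout_error_rate_high")
    return fired
```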

But what if payment-gateway doesn’t return errors, and doesn’t get slow, but just starts… dropping packets? Or returning malformed responses that our service doesn’t recognize as errors, but still can’t use? Or what if our own checkout-service has a bug where it only fails under a specific load pattern that never triggers our existing alerts?

This is where chaos comes in. We want to inject failures and observe our monitoring system’s response.

Let’s say we decide to test the payment-gateway dependency. We can use a tool like LitmusChaos or Gremlin to simulate a failure.

Chaos Experiment: Network Latency Injection

We want to see what happens when payment-gateway becomes intermittently slow, specifically targeting the network path between checkout-service and payment-gateway.

  • Target: Pods running payment-gateway
  • Action: Inject network latency
  • Parameters:
    • delay: 1000ms (1 second)
    • duration: 300 seconds (5 minutes)
    • correlation: service=checkout-service (only affect traffic from checkout)
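
Rather than invoking a real chaos tool here, a toy simulation of what this injection does from checkout-service's point of view, assuming the 500ms timeout from the config (run_window and injected_fraction are illustrative: at any given moment only some in-flight requests overlap the injected delay):

```python
import random


def run_window(num_calls=1000, injected_fraction=0.1,
               base_ms=80, delay_ms=1000, timeout_ms=500, seed=7):
    """Simulate one monitoring window under partial latency injection.

    Returns (latencies of completed calls in ms, timeout count).
    """
    rng = random.Random(seed)
    latencies, timeouts = [], 0
    for _ in range(num_calls):
        injected = rng.random() < injected_fraction
        latency = base_ms + (delay_ms if injected else 0)
        if latency > timeout_ms:
            timeouts += 1          # surfaces as a 5xx from checkout-service
        else:
            latencies.append(latency)
    return latencies, timeouts
```

With a 1000ms injected delay and a 500ms timeout, every affected call becomes a timeout, so the completed-call latencies look deceptively healthy while the error rate climbs; that is exactly the pattern the walkthrough below observes.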

Before we run this, our monitoring dashboard shows everything green. Average latency for checkout-service is 80ms, p99 is 150ms. Error rate is 0%. Circuit breakers are all closed.

We run the chaos experiment.

Immediately, we observe our monitoring system.

  • Latency: The average latency for checkout-service starts creeping up. We see it hit 200ms, then 350ms, then 600ms. The p99 latency spikes dramatically, exceeding our configured alert threshold of 500ms within seconds.
  • Error Rate: Initially, the error rate doesn’t change because the payment-gateway is still responding, just slowly. However, as requests pile up and start timing out on the checkout-service side (hitting its 500ms timeout), we see the 5xx error rate for checkout-service begin to climb. It goes from 0% to 2%, then 5%, then 10%.
  • Dependency Health: Our metrics for calls to payment-gateway show increased latency and, crucially, a growing number of timeouts from checkout-service attempting to reach payment-gateway.
  • Circuit Breaker: As timeouts pile up, the failure rate for calls to payment-gateway climbs toward the configured threshold (circuit_breaker_threshold: 0.5, i.e. 50% failures over the breaker's sliding window). Once it crosses, the circuit breaker in checkout-service to payment-gateway opens and we get an alert, but only after the error rate has already been high for a short period.

This experiment reveals several things:

  1. Our latency alerts are working, but their thresholds may be too high if we want to catch issues before users experience significant delays.
  2. Our error rate alerts are also working, but they only fire after our service starts timing out due to the slow dependency.
  3. We don’t have a direct alert for "dependency latency exceeding X ms" that isn’t tied to our own service’s error rate.

The fix isn’t necessarily to change the chaos experiment, but to refine our monitoring.

  • Refine Latency Alerts: Lower the p99 latency alert threshold for checkout-service from 500ms to, say, 300ms. This will alert us to the beginning of the slowdown, not just when it’s causing outright failures.
  • Add Dependency Latency Alerts: Create a new alert that specifically monitors the p99 latency of calls from checkout-service to payment-gateway. Set the threshold below the 500ms client timeout, say 400ms; a threshold above the timeout could never fire, because the measured call latency is capped once checkout-service gives up. This will catch the dependency slowness before it causes our own service to time out.
  • Add "Stuck Retries" Alert: Monitor the rate of requests that are currently in their retry loop. A sudden increase indicates a dependency is slow or intermittently unavailable.

Let’s run another experiment: Network Packet Loss on payment-gateway.

  • Target: Pods running payment-gateway
  • Action: Inject packet loss
  • Parameters:
    • loss: 20%
    • duration: 300 seconds
    • correlation: service=checkout-service

This time, the payment-gateway might still respond quickly when it does respond. But 20% of the time, the packet simply disappears.

What happens?

  • Latency: The average and p99 latency for checkout-service will spike, but perhaps not as predictably as with pure delay. Some requests will be fast, others will time out.
  • Error Rate: The error rate for checkout-service will climb rapidly as requests fail to reach payment-gateway or their responses never arrive. This will likely exceed 50% very quickly, tripping the circuit breaker.
  • Circuit Breaker: The circuit breaker to payment-gateway will open, and we’ll get an alert. This alert is valuable, but it fires after the service is already degraded.

This experiment highlights a gap: we’re not explicitly monitoring the health of the network path itself, beyond what our application-level timeouts and error rates imply. We might consider adding network-level probes that ping payment-gateway from the checkout-service pods and alert on packet loss independently of application requests.
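
One way to sketch such a probe, assuming plain TCP reachability is a good-enough proxy for path health (probe_loss and the sidecar deployment idea are illustrative; a TCP connect is used instead of ICMP ping because raw sockets need elevated privileges):

```python
import socket


def probe_loss(host, port, attempts=20, timeout_s=0.5):
    """Rough network-path probe: attempt TCP connects to the dependency
    and report the fraction that fail.

    A real probe would run from the checkout-service pods (for example as
    a sidecar) and export this fraction as a metric to alert on,
    independently of application traffic.
    """
    failures = 0
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout_s):
                pass
        except OSError:
            failures += 1
    return failures / attempts
```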

The most surprising outcome of chaos engineering isn’t that you find bugs in your monitoring; it’s that you find bugs in your assumptions about your system’s resilience. You assume your timeouts are sufficient, your retries are helpful, and your circuit breakers are perfectly tuned, but chaos forces you to see how these mechanisms interact under stress that you’d never see in normal operation. It’s the difference between reading the fire safety manual and actually testing the sprinklers during a controlled burn.

After fixing these monitoring gaps, the next natural step is to test more complex failure scenarios, like cascading failures or resource exhaustion in one service impacting another.
