Fault injection is one of the most powerful tools in the SRE arsenal for proactively identifying weaknesses before they impact users, and modern chaos-engineering tooling makes it straightforward to integrate into existing Kubernetes workflows.

Let’s say you’re running a simple web service:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-web-app
  template:
    metadata:
      labels:
        app: my-web-app
    spec:
      containers:
      - name: web
        image: nginx:latest
        ports:
        - containerPort: 80

This deploys three replicas of an Nginx web server. Now, imagine you want to test how your application handles a dependency failure. Your application might call out to a user-service.
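Before injecting any faults, it helps to know what behavior you expect from that dependency call. A minimal sketch of a client with an explicit timeout and a fallback (the `user-service` URL and the `fetch_user` helper are hypothetical, assuming a plain HTTP endpoint):

```python
import urllib.request
import urllib.error

def fetch_user(user_id, base_url="http://user-service", timeout=0.5, fallback=None):
    """Fetch a user record, returning `fallback` if the dependency
    is slow (timeout) or unreachable (connection error)."""
    try:
        with urllib.request.urlopen(f"{base_url}/users/{user_id}",
                                    timeout=timeout) as resp:
            return resp.read().decode()
    except (urllib.error.URLError, TimeoutError):
        return fallback
```

With a 200ms injected delay and a 500ms timeout this call still succeeds; shrink the timeout or lengthen the delay and the fallback path takes over, which is exactly the behavior the experiments below let you observe.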

Here’s how you can inject a fault using a tool like Chaos Mesh, which you drive by applying custom resources to your Kubernetes cluster. First, install Chaos Mesh if you haven’t already; the quickest path is the official install script (Helm is recommended for production installs):

curl -sSL https://mirrors.chaos-mesh.org/latest/install.sh | bash

Confirm the chaos-mesh control-plane pods are Running before continuing.

Now, let’s simulate a network latency issue for the user-service. We’ll create a NetworkChaos resource:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: user-service-latency
  namespace: default
spec:
  action: delay
  mode: all # affect every matching pod
  selector:
    labelSelectors:
      app: user-service # assuming your user-service pods carry this label
  delay:
    latency: "200ms" # introduce 200ms of latency
    jitter: "0ms" # no random variation around the 200ms
    correlation: "0" # each packet's delay is independent of the previous one
  duration: "5m" # run the experiment for 5 minutes

When applied, this resource tells Chaos Mesh to add a 200ms delay to network traffic for pods matching app: user-service. Your my-web-app will now see slower responses from the user-service. You’d observe this in your application’s latency metrics, in error rates if timeouts fire, and potentially as user-facing slowdowns.

The mental model here is straightforward: you define the what (a network delay), the where (pods labeled app: user-service), the how (200ms of latency), and the when (for 5 minutes). Chaos Mesh then orchestrates the actual injection by manipulating traffic-control (tc) rules, typically netem qdiscs, within the affected pods’ network namespaces.
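Because those four questions map directly onto manifest fields, experiments are easy to generate programmatically. A sketch of a hypothetical helper (not part of any Chaos Mesh SDK) that assembles a NetworkChaos manifest from the what/where/how/when:

```python
def latency_experiment(name, app_label, latency, duration, namespace="default"):
    """Build a Chaos Mesh NetworkChaos manifest from the four questions:
    what (a delay), where (a label selector), how (latency), when (duration)."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",                                   # what
            "mode": "all",
            "selector": {"labelSelectors": {"app": app_label}},  # where
            "delay": {"latency": latency},                       # how
            "duration": duration,                                # when
        },
    }

manifest = latency_experiment("user-service-latency", "user-service", "200ms", "5m")
```

Serialize the dict with a YAML library (or `json.dumps`, since YAML is a superset of JSON) and apply it with kubectl, and you can parameterize experiment sweeps instead of hand-editing manifests.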

You can go much further. What if the user-service itself is unavailable?

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: user-service-unavailable
  namespace: default
spec:
  action: pod-failure # pod-kill terminates once; pod-failure honors a duration
  mode: one
  selector:
    labelSelectors:
      app: user-service
  duration: "1m"

This PodChaos resource randomly selects one pod with the label app: user-service and makes it unavailable for one minute (unlike pod-kill, which terminates the pod once and lets Kubernetes restart it). During that minute of unavailability, your my-web-app would likely see connection errors. This tests your application’s retry mechanisms and circuit breakers.
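That experiment is a direct test of whatever retry logic sits in front of the dependency call. A minimal retry-with-backoff sketch (illustrative, not a full circuit breaker; the `sleep` parameter is injectable so tests need not actually wait):

```python
import time

def call_with_retries(fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Invoke `fn`, retrying on ConnectionError with exponential backoff.
    Re-raises the last error once all attempts are exhausted."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

If the outage outlasts every attempt, the final error propagates — which is exactly the condition a circuit breaker would convert into a fast, cheap failure instead of a pile-up of blocked requests.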

The power comes from combining these. You might inject network latency and then kill a pod, or inject CPU pressure into a different service to see how it impacts your primary application’s performance under duress. You can also inject faults directly into your application’s pods, for example, to simulate memory pressure or disk I/O issues.

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: my-app-cpu-stress
  namespace: default
spec:
  mode: all
  selector:
    labelSelectors:
      app: my-web-app
  stressors:
    cpu:
      workers: 1
      load: 80 # drive roughly 80% CPU load per worker
  duration: "2m"

This StressChaos resource spins up a CPU stressor inside every my-web-app pod, driving roughly 80% CPU load for two minutes. This is invaluable for understanding how your application behaves when its own underlying compute resources are constrained, testing its internal scheduling and resource management.
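A common finding from CPU-stress experiments is that it is better to shed load up front than to queue requests a saturated pod can never serve in time. A hypothetical sketch of that pattern (the threshold and the CPU probe are illustrative; in practice the probe would read cgroup or runtime metrics):

```python
def handle_request(process, cpu_utilization, shed_above=0.9):
    """Reject work immediately when CPU utilization exceeds the threshold,
    so the pod fails fast instead of timing out under saturation."""
    if cpu_utilization() > shed_above:
        return (503, "overloaded, retry later")
    return (200, process())
```

Under the two-minute stress window, a shedding service returns prompt 503s that upstream retries can route around, rather than slow responses that trip every caller’s timeout at once.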

The most surprising thing is how granular you can get. You can target specific ports, specific requests, or even specific users based on headers. This allows you to test very targeted failure scenarios that are difficult to reproduce manually.
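For instance, Chaos Mesh’s HTTPChaos type can fail requests for a single route and method while leaving all other traffic untouched (the path and port here are illustrative; check the HTTPChaos reference for your Chaos Mesh version):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: abort-user-lookups
  namespace: default
spec:
  mode: all
  selector:
    labelSelectors:
      app: user-service
  target: Request # act on inbound requests
  port: 80
  method: GET
  path: /users/* # only this route is affected
  abort: true # drop the connection for matching requests
  duration: "1m"
```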

The next logical step is to automate these experiments, integrating them into your CI/CD pipeline or running them on a schedule against your staging or even production environments (with extreme caution and appropriate safeguards).
