Fault injection is one of the most powerful tools in the SRE arsenal for proactively identifying weaknesses before they impact users, and it integrates surprisingly easily into existing workflows.
Let’s say you’re running a simple web service:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-web-app
  template:
    metadata:
      labels:
        app: my-web-app
    spec:
      containers:
        - name: web
          image: nginx:latest
          ports:
            - containerPort: 80
```
This deploys three replicas of an Nginx web server. Now, imagine you want to test how your application handles a dependency failure. Your application might call out to a user-service.
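For concreteness, assume user-service is itself a Deployment whose pods carry the label app: user-service, which the chaos experiments below select on. A minimal sketch (the image name and port here are placeholders, not part of any real setup):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: user-service
  template:
    metadata:
      labels:
        app: user-service   # the label the chaos selectors target
    spec:
      containers:
        - name: user-service
          image: example/user-service:1.0   # placeholder image
          ports:
            - containerPort: 8080
```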
Here’s how you can inject a fault using a tool like Chaos Mesh, which is driven by custom resources (CRDs) that you apply to your Kubernetes cluster. First, you’d install Chaos Mesh if you haven’t already, for example with the quick-start script from the official docs:

```bash
curl -sSL https://mirrors.chaos-mesh.org/latest/install.sh | bash
```
Now, let’s simulate a network latency issue for the user-service. We’ll create a NetworkChaos resource:
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: user-service-latency
  namespace: default
spec:
  action: delay
  mode: all                # apply to every matching pod
  selector:
    labelSelectors:
      app: user-service    # assuming your user-service pods carry this label
  delay:
    latency: "200ms"       # introduce 200ms of latency
    correlation: "100"     # correlation with the previous packet's delay, as a percentage
    jitter: "0ms"          # no random variation around the 200ms
  duration: "5m"           # run the experiment for 5 minutes
```
When applied, this resource tells Chaos Mesh to intercept network traffic destined for pods matching app: user-service and add a 200ms delay. Your my-web-app will now experience slower responses from the user-service. You’d observe this in your application’s latency metrics, error rates (if timeouts are involved), and potentially user-facing slowdowns.
The mental model here is straightforward: you define the what (a network delay), the where (pods labeled app: user-service), the how (200ms latency), and the when (for 5 minutes). Chaos Mesh then performs the actual injection by manipulating traffic-control and firewall rules (tc/netem and iptables) inside the affected pods’ network namespaces.
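Annotating the resource makes that mapping explicit; the four questions line up directly with four spec fields:

```yaml
spec:
  action: delay          # the what: which kind of fault to inject
  selector:
    labelSelectors:
      app: user-service  # the where: which pods are affected
  delay:
    latency: "200ms"     # the how: the fault's parameters
  duration: "5m"         # the when: how long the experiment runs
```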
You can go much further. What if the user-service itself is unavailable?
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: user-service-unavailable
  namespace: default
spec:
  action: pod-kill   # a one-shot kill; duration does not apply here
  mode: one          # pick a single matching pod at random
  selector:
    labelSelectors:
      app: user-service
```
This PodChaos resource randomly selects a pod with the label app: user-service and terminates it. Kubernetes will then reschedule a replacement, and until the new pod is ready, your my-web-app would likely see connection errors. This tests your application’s retry mechanisms and circuit breakers.
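If you want a guaranteed window of unavailability rather than a single kill-and-restart, Chaos Mesh’s pod-failure action keeps the selected pod broken for the whole duration. A sketch (the resource name is illustrative):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: user-service-down
  namespace: default
spec:
  action: pod-failure   # pod stays unavailable for the full duration
  mode: one
  selector:
    labelSelectors:
      app: user-service
  duration: "1m"
```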
The power comes from combining these. You might inject network latency and then kill a pod, or inject CPU pressure into a different service to see how it impacts your primary application’s performance under duress. You can also inject faults directly into your application’s pods, for example, to simulate memory pressure or disk I/O issues.
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: my-app-cpu-hog
  namespace: default
spec:
  mode: all                 # stress every pod in the deployment
  selector:
    labelSelectors:
      app: my-web-app
  stressors:
    cpu:
      workers: 1            # number of stress workers per pod
      load: 80              # target roughly 80% CPU load
  duration: "2m"
```
This StressChaos resource (CPU and memory pressure are handled by StressChaos, not PodChaos) drives roughly 80% CPU load in each pod of my-web-app for two minutes. This is invaluable for understanding how your application behaves when its own underlying compute resources are constrained, testing its internal scheduling and resource management.
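Combinations like the one mentioned above (latency followed by a pod kill) can be expressed declaratively with Chaos Mesh’s Workflow resource, which chains experiments serially or in parallel. A sketch, with illustrative names and deadlines:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: latency-then-kill
  namespace: default
spec:
  entry: entry
  templates:
    - name: entry
      templateType: Serial     # run children one after another
      deadline: 10m
      children:
        - inject-latency
        - kill-pod
    - name: inject-latency
      templateType: NetworkChaos
      deadline: 5m
      networkChaos:
        action: delay
        mode: all
        selector:
          labelSelectors:
            app: user-service
        delay:
          latency: "200ms"
    - name: kill-pod
      templateType: PodChaos
      deadline: 1m
      podChaos:
        action: pod-kill
        mode: one
        selector:
          labelSelectors:
            app: user-service
```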
The most surprising thing is how granular you can get. You can target specific ports, specific requests, or even specific users based on headers. This allows you to test very targeted failure scenarios that are difficult to reproduce manually.
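For example, Chaos Mesh’s HTTPChaos type can match on port, path, method, and request headers before applying a fault. A sketch, where the path pattern and the X-Canary header are made-up assumptions about how you might flag test traffic:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: abort-canary-requests
  namespace: default
spec:
  mode: all
  selector:
    labelSelectors:
      app: user-service
  target: Request        # act on inbound requests
  port: 80
  method: GET
  path: /api/*           # hypothetical path pattern
  request_headers:
    X-Canary: "true"     # only requests carrying this header are affected
  abort: true            # drop the connection for matching requests
  duration: "5m"
```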
The next logical step is to automate these experiments, integrating them into your CI/CD pipeline or running them on a schedule against your staging or even production environments (with extreme caution and appropriate safeguards).
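Chaos Mesh supports scheduled runs directly through its Schedule resource, which triggers an experiment on a cron expression. A sketch reusing the latency experiment from earlier:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: nightly-latency-test
  namespace: default
spec:
  schedule: "0 2 * * *"        # every night at 02:00
  type: NetworkChaos
  historyLimit: 5
  concurrencyPolicy: Forbid    # never let runs overlap
  networkChaos:
    action: delay
    mode: all
    selector:
      labelSelectors:
        app: user-service
    delay:
      latency: "200ms"
    duration: "5m"
```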