Writing and maintaining an SRE incident response runbook is less about documenting every possible failure and more about creating a living, breathing guide that empowers your team to react effectively to the unknown.
Let’s see this in action. Imagine a common scenario: a critical service is experiencing high latency.
```
// Simulated Monitoring Alert
{
  "timestamp": "2023-10-27T10:30:00Z",
  "service": "user-profile-api",
  "metric": "p99_latency_ms",
  "value": 2500,
  "threshold": 500,
  "severity": "CRITICAL"
}
```
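If your team routes alerts programmatically, the value-versus-threshold comparison behind a payload like this can be sketched in a few lines. This is a minimal illustration; the severity cut-offs and the `classify_alert` name are assumptions, not part of any standard alerting API:

```python
def classify_alert(payload: dict) -> str:
    """Map a monitoring payload to a severity bucket.

    Assumes the alert JSON shape shown above: a numeric `value`
    measured against a numeric `threshold`. The ratio cut-offs
    are illustrative.
    """
    ratio = payload["value"] / payload["threshold"]
    if ratio >= 4:  # e.g. 2500ms against a 500ms threshold
        return "CRITICAL"
    if ratio >= 2:
        return "WARNING"
    return "OK"

alert = {
    "service": "user-profile-api",
    "metric": "p99_latency_ms",
    "value": 2500,
    "threshold": 500,
}
print(classify_alert(alert))  # CRITICAL: 2500 is 5x the 500ms threshold
```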
A human SRE sees this. What’s the first thing they do? Not panic, but consult the runbook for `user-profile-api`.
**The Mental Model: From Checklist to Compass**
The core problem an incident response runbook solves is cognitive overload during a high-stress event. When systems fail, our brains tend to narrow focus, miss obvious-but-uncommon causes, and forget established procedures. A good runbook acts as a distributed, calm brain for the team.
It’s not just a static list of "if X, then Y." It’s a dynamic map.
- **The "What’s Broken?" Section:** This is the entry point. It should quickly confirm the observed symptoms and point to the likely affected component.
  - **Symptom:** High p99 latency on `user-profile-api`.
  - **Likely Component:** `user-profile-api` itself, or its direct dependencies (e.g., `user-data-db`, `auth-service`).
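One way to make this entry point queryable rather than purely prose is a simple symptom-to-component map. This is a sketch; the structure and the `likely_components` helper are illustrative, with contents mirroring the example above:

```python
# Map (service, metric) pairs from alerts to the components worth
# inspecting first. Contents mirror the runbook entry above.
LIKELY_COMPONENTS = {
    ("user-profile-api", "p99_latency_ms"): [
        "user-profile-api",  # the service itself
        "user-data-db",      # direct dependency
        "auth-service",      # direct dependency
    ],
}

def likely_components(service: str, metric: str) -> list[str]:
    # Fall back to the alerting service itself if the pair is unmapped.
    return LIKELY_COMPONENTS.get((service, metric), [service])

print(likely_components("user-profile-api", "p99_latency_ms"))
```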
- **The "Diagnosis Toolkit":** This is the heart of the runbook. It provides concrete commands and checks to quickly narrow down the root cause. Each item should be actionable and include expected outputs or thresholds.
  - **Check 1: Service Health & Load**
    - **Command:** `kubectl get pods -l app=user-profile-api -n production -o wide`
    - **What to look for:** Are all pods running? Are any in `CrashLoopBackOff` or `Error` states? Is CPU/memory utilization across pods maxed out?
    - **Why it works:** This immediately tells you whether the problem is a widespread deployment issue or whether individual instances are struggling. If pods are restarting, it’s a strong indicator of application-level instability.
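Check 1 is easy to make scriptable by parsing the tabular `kubectl get pods` output and counting pods in bad states. A minimal sketch, with sample output hardcoded for illustration (in practice you would pipe in the real command’s output):

```python
def unhealthy_pods(kubectl_output: str) -> list[str]:
    """Return names of pods not in the Running state.

    Expects tabular `kubectl get pods` output, where the first
    column is the pod name and the third is its status.
    """
    bad = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip header row
        fields = line.split()
        name, status = fields[0], fields[2]
        if status != "Running":
            bad.append(f"{name} ({status})")
    return bad

# Illustrative sample; real output comes from the command above.
sample = """\
NAME                    READY   STATUS             RESTARTS   AGE
user-profile-api-abc1   1/1     Running            0          2d
user-profile-api-abc2   0/1     CrashLoopBackOff   12         2d
"""
print(unhealthy_pods(sample))  # ['user-profile-api-abc2 (CrashLoopBackOff)']
```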
  - **Check 2: Resource Saturation (Kubernetes)**
    - **Command:** `kubectl top pods -l app=user-profile-api -n production` and `kubectl top nodes`
    - **What to look for:** Are any `user-profile-api` pods consistently hitting their CPU or memory limits? Is the node hosting these pods saturated?
    - **Why it works:** Even if pods are `Running`, they may be throttled by Kubernetes once they hit their resource limits, causing them to become unresponsive. Node saturation impacts all pods on that node.
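The "consistently hitting limits" judgment in Check 2 can be encoded as a simple threshold rule. A sketch, assuming usage numbers from `kubectl top pods` and limits from the deployment spec (the 90% threshold and all values are illustrative):

```python
def near_limit(usage_m: int, limit_m: int, threshold: float = 0.9) -> bool:
    """True if a pod's CPU usage (millicores) is within `threshold`
    of its configured limit, i.e. likely to be throttled soon."""
    return usage_m >= limit_m * threshold

# Illustrative data: pod name -> (current usage, configured limit),
# both in millicores.
pods = {
    "user-profile-api-abc1": (980, 1000),
    "user-profile-api-abc2": (400, 1000),
}
throttled = [name for name, (use, lim) in pods.items() if near_limit(use, lim)]
print(throttled)  # ['user-profile-api-abc1']
```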
  - **Check 3: Dependency Latency**
    - **Command:** `curl -s -o /dev/null -w "%{time_total}\n" http://user-data-db.internal:5432/health` (or your service’s equivalent health-check endpoint)
    - **What to look for:** High latency (e.g., > 500ms) on calls to direct dependencies such as `user-data-db`.
    - **Why it works:** `user-profile-api` itself might be healthy but waiting on a slow dependency, which makes it appear slow. This isolates the problem to the downstream service.
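The same probe can live in a small script, which makes it easy to loop over several dependencies. A sketch, assuming a plain HTTP health endpoint; `probe_latency`, `is_dependency_slow`, and the 500ms threshold mirror the rule of thumb above but are otherwise illustrative:

```python
import time
import urllib.request

def probe_latency(url: str, timeout: float = 5.0) -> float:
    """Time a single GET against a health endpoint, in milliseconds."""
    start = time.monotonic()
    urllib.request.urlopen(url, timeout=timeout).read()
    return (time.monotonic() - start) * 1000

def is_dependency_slow(latency_ms: float, threshold_ms: float = 500) -> bool:
    """Mirror the runbook's '> 500ms is suspicious' rule of thumb."""
    return latency_ms > threshold_ms

# e.g. is_dependency_slow(probe_latency("http://user-data-db.internal/health"))
print(is_dependency_slow(2500))  # True
```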
  - **Check 4: Application Logs**
    - **Command:** `kubectl logs -l app=user-profile-api -n production --tail=100` (filtering for errors or high-latency markers)
    - **What to look for:** Error messages such as `database connection pool exhausted`, `context deadline exceeded`, `GC overhead limit exceeded`, or specific slow-query logs.
    - **Why it works:** Application logs often contain the most granular details about why an operation is slow, such as inefficient database queries or internal application logic errors.
  - **Check 5: Network Connectivity (Pod-to-Pod)**
    - **Command:** `kubectl exec <some-user-profile-pod> -n production -- ping user-data-db.internal`
    - **What to look for:** Packet loss or high latency on pings between pods.
    - **Why it works:** Network issues within the cluster (e.g., CNI misconfiguration, overloaded network interfaces) can manifest as slow application performance.
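The ping summary from Check 5 can be parsed mechanically rather than eyeballed. A minimal sketch; the summary-line format matches typical GNU/Linux `ping` output, and the function name is made up:

```python
import re

def packet_loss_pct(ping_output: str) -> float:
    """Extract the packet-loss percentage from a ping summary line
    such as '5 packets transmitted, 4 received, 20% packet loss'."""
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    if match is None:
        raise ValueError("no packet-loss summary found")
    return float(match.group(1))

summary = "5 packets transmitted, 4 received, 20% packet loss, time 4005ms"
print(packet_loss_pct(summary))  # 20.0
```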
- **The "Remediation Playbook":** Once a cause is identified, this section provides the exact steps to fix it.
  - **Cause: Resource Exhaustion (CPU/Memory on Pods)**
    - **Diagnosis:** Check 2 shows pods consistently hitting limits; `kubectl describe pod <pod-name>` shows `OOMKilled` or CPU throttling.
    - **Fix:** Increase resource limits in the deployment YAML:

      ```yaml
      resources:
        limits:
          cpu: "2000m"    # Increased from 1000m
          memory: "2Gi"   # Increased from 1Gi
        requests:
          cpu: "1000m"
          memory: "1Gi"
      ```

    - **Why it works:** It provides the application with more processing power and memory, allowing it to handle the current load without being throttled or killed by the container orchestrator.
  - **Cause: Dependency Latency (e.g., `user-data-db` slow)**
    - **Diagnosis:** Check 3 shows high latency to `user-data-db`. Application logs (Check 4) might show slow-query warnings.
    - **Fix:** Escalate to the `user-data-db` team with specific query/performance data. If it’s a known issue with a workaround, document it here. For example: "If `user-data-db` p95 latency > 300ms, temporarily disable `user-profile-api`’s secondary cache by setting `FEATURE_FLAG_USER_CACHE_SECONDARY=false` in its deployment."
    - **Why it works:** The workaround bypasses the slow dependency, restoring `user-profile-api` performance while the root cause in the dependency is addressed separately.
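A documented workaround like this is exactly the kind of decision worth encoding, so the on-call engineer doesn’t have to re-derive it under stress. A sketch; the 300ms threshold comes from the example above, while the function name is illustrative (the `kubectl set env` command in the comment is one plausible way to flip such a flag):

```python
def should_disable_secondary_cache(db_p95_ms: float,
                                   threshold_ms: float = 300) -> bool:
    """Encode the runbook's workaround: if user-data-db p95 latency
    exceeds 300ms, disable user-profile-api's secondary cache."""
    return db_p95_ms > threshold_ms

if should_disable_secondary_cache(420):
    # In practice this would be applied to the deployment, e.g.:
    # kubectl set env deployment/user-profile-api \
    #     FEATURE_FLAG_USER_CACHE_SECONDARY=false -n production
    print("disable FEATURE_FLAG_USER_CACHE_SECONDARY")
```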
  - **Cause: Application Bug (e.g., Memory Leak)**
    - **Diagnosis:** Check 4 shows memory usage increasing over time in logs, or `kubectl top pods` shows steady memory growth on specific pods.
    - **Fix:** Trigger a rolling restart of the deployment: `kubectl rollout restart deployment user-profile-api -n production`
    - **Why it works:** A rolling restart replaces old, potentially leaking pods with new ones, temporarily resolving the issue until the leak causes problems again. This buys time for a code fix.
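Deciding whether "steady memory growth" is actually a leak can itself be encoded as a heuristic. A sketch, assuming periodic memory samples in MiB taken from `kubectl top pods`; the 5% per-sample growth threshold is an arbitrary illustration:

```python
def looks_like_leak(samples_mib: list[float], min_growth: float = 0.05) -> bool:
    """Heuristic: memory that only ever grows, by at least `min_growth`
    (5%) per sample, suggests a leak rather than normal fluctuation.
    Samples would come from periodic `kubectl top pods` readings."""
    if len(samples_mib) < 3:
        return False  # too few samples to distinguish growth from noise
    return all(
        later >= earlier * (1 + min_growth)
        for earlier, later in zip(samples_mib, samples_mib[1:])
    )

print(looks_like_leak([512, 610, 745, 930]))  # True: monotonic growth
print(looks_like_leak([512, 480, 530, 500]))  # False: normal fluctuation
```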
The one thing most people don’t know about runbooks is that their primary value isn’t in the commands but in the decisions they guide. A runbook shouldn’t just say "run `kubectl logs`"; it should say "run `kubectl logs` and look for X, Y, Z, because that indicates problem A, which can be fixed by B." This explicit linking of observation to action, and action to outcome, is what builds confidence and speed.
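One way to enforce that observation-to-action-to-outcome linking is to give every runbook entry a structure where each link is a mandatory field. A sketch; the field names and contents are illustrative:

```python
from dataclasses import dataclass

@dataclass
class RunbookCheck:
    """One diagnostic step, explicitly linking observation to action
    to outcome. A check missing any field is an incomplete entry."""
    command: str    # what to run
    look_for: str   # the observation
    indicates: str  # what that observation means
    fix: str        # the action it points to

check = RunbookCheck(
    command="kubectl logs -l app=user-profile-api -n production --tail=100",
    look_for="database connection pool exhausted",
    indicates="DB connection starvation",
    fix="increase pool size or restart leaking pods",
)
print(f"If `{check.command}` shows '{check.look_for}', "
      f"suspect {check.indicates}; fix: {check.fix}.")
```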
The next step after fixing high latency on `user-profile-api` is often dealing with the cascading effects of the original problem, or the side effects of your fix. You might find that `notification-service` is now experiencing 5xx errors because it couldn’t reach `user-profile-api` during the outage.