Writing and maintaining an SRE incident response runbook is less about documenting every possible failure and more about creating a living, breathing guide that empowers your team to react effectively to the unknown.

Let’s see this in action. Imagine a common scenario: a critical service is experiencing high latency.

// Simulated Monitoring Alert
{
  "timestamp": "2023-10-27T10:30:00Z",
  "service": "user-profile-api",
  "metric": "p99_latency_ms",
  "value": 2500,
  "threshold": 500,
  "severity": "CRITICAL"
}

A human SRE sees this. What’s the first thing they do? Not panic, but consult the runbook for user-profile-api.
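Before a human even sees it, the alert itself is machine-readable. Here is a minimal Python sketch of how an alerting pipeline might route such a payload to the right runbook page — the routing scheme and the triage function are hypothetical, not part of any real monitoring stack:

```python
import json

# The alert payload from the monitoring example above.
alert_json = """
{
  "timestamp": "2023-10-27T10:30:00Z",
  "service": "user-profile-api",
  "metric": "p99_latency_ms",
  "value": 2500,
  "threshold": 500,
  "severity": "CRITICAL"
}
"""

def triage(alert: dict) -> str:
    """Return the runbook page to consult for this alert (hypothetical routing)."""
    breach_ratio = alert["value"] / alert["threshold"]
    # A CRITICAL alert at 2x (or more) of the threshold is past "noisy blip" territory.
    if alert["severity"] == "CRITICAL" and breach_ratio >= 2:
        return f"runbooks/{alert['service']}/{alert['metric']}"
    return "runbooks/triage/low-priority"

alert = json.loads(alert_json)
print(triage(alert))  # runbooks/user-profile-api/p99_latency_ms
```

The point is not the heuristic (2x is an arbitrary illustrative cutoff) but that the alert's `service` and `metric` fields are enough to land the responder on the right page.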

The Mental Model: From Checklist to Compass

The core problem an incident response runbook solves is cognitive overload during a high-stress event. When systems fail, our brains tend to narrow focus, miss obvious-but-uncommon causes, and forget established procedures. A good runbook acts as a distributed, calm brain for the team.

It’s not just a static list of "if X, then Y." It’s a dynamic map.

  1. The "What’s Broken?" Section: This is the entry point. It should quickly confirm the observed symptoms and point to the likely affected component.

    • Symptom: High p99 latency on user-profile-api.
    • Likely Component: user-profile-api itself, or its direct dependencies (e.g., user-data-db, auth-service).
  2. The "Diagnosis Toolkit": This is the heart of the runbook. It provides concrete commands and checks to quickly narrow down the root cause. Each item should be actionable and include expected outputs or thresholds.

    • Check 1: Service Health & Load:

      • Command: kubectl get pods -l app=user-profile-api -n production -o wide
      • What to look for: Are all pods running? Are any in CrashLoopBackOff or Error states? Is the CPU/memory utilization across pods maxed out?
      • Why it works: This immediately tells you if the problem is a widespread deployment issue or if individual instances are struggling. If pods are restarting, it’s a strong indicator of application-level instability.
    • Check 2: Resource Saturation (Kubernetes):

      • Command: kubectl top pods -l app=user-profile-api -n production and kubectl top nodes
      • What to look for: Are any user-profile-api pods consistently hitting their CPU or memory limits? Is the node hosting these pods saturated?
      • Why it works: Even if pods are Running, hitting the CPU limit gets them throttled and exceeding the memory limit gets them OOM-killed; either makes the service slow or unresponsive. (Requests affect scheduling, not throttling.) Node saturation impacts all pods on that node.
    • Check 3: Dependency Latency:

      • Command: curl -s -o /dev/null -w "%{time_total}\n" http://auth-service.internal/health (or your service’s equivalent HTTP health check endpoint). For the database itself, which speaks the Postgres wire protocol on port 5432 rather than HTTP, use pg_isready -h user-data-db.internal -p 5432 or check the application’s own query-latency metrics.
      • What to look for: High latency (e.g., > 500ms) on calls to direct dependencies like user-data-db or auth-service.
      • Why it works: The user-profile-api might be healthy, but it’s waiting on a slow dependency, making it appear slow. This isolates the problem to the downstream service.
    • Check 4: Application Logs:

      • Command: kubectl logs -l app=user-profile-api -n production --tail=100 (and filter for errors or high latency markers)
      • What to look for: Error messages like database connection pool exhausted, context deadline exceeded, GC overhead limit exceeded, or specific slow query logs.
      • Why it works: Application logs often contain the most granular details about why an operation is slow, such as inefficient database queries or internal application logic errors.
    • Check 5: Network Connectivity (Pod-to-Pod):

      • Command: kubectl exec <some-user-profile-pod> -n production -- nc -zv -w 2 user-data-db.internal 5432 (assuming the container image ships nc)
      • What to look for: Connection timeouts or refusals, or noticeably slow connection setup, from the pod to its dependency.
      • Why it works: Network issues within the cluster (e.g., CNI misconfiguration, overloaded network interfaces) can manifest as slow application performance. A TCP check is more reliable here than ping: Kubernetes Service virtual IPs typically don’t answer ICMP, and many minimal container images don’t include a ping binary.
  3. The "Remediation Playbook": Once a cause is identified, this section provides the exact steps to fix it.

    • Cause: Resource Exhaustion (CPU/Memory on Pods)

      • Diagnosis: Check 2 shows pods consistently hitting limits. kubectl describe pod <pod-name> shows OOMKilled or CPU throttling.
      • Fix: Increase resource limits in the deployment YAML.
        resources:
          limits:
            cpu: "2000m" # Increased from 1000m
            memory: "2Gi" # Increased from 1Gi
          requests:
            cpu: "1000m"
            memory: "1Gi"
        
      • Why it works: Provides the application with more processing power and memory, allowing it to handle the current load without being throttled or killed by the container orchestrator.
    • Cause: Dependency Latency (e.g., user-data-db slow)

      • Diagnosis: Check 3 shows high latency to user-data-db. Application logs (Check 4) might show slow query warnings.
      • Fix: Escalate to the user-data-db team with specific query/performance data. If it’s a known issue with a workaround, document it here. For example, "If user-data-db p95 latency > 300ms, temporarily disable user-profile-api’s secondary cache by setting FEATURE_FLAG_USER_CACHE_SECONDARY=false in its deployment."
      • Why it works: This bypasses the slow dependency, restoring user-profile-api performance while the root cause in the dependency is addressed separately.
    • Cause: Application Bug (e.g., Memory Leak)

      • Diagnosis: Check 4 shows increasing memory usage over time in logs, or kubectl top pods shows steady memory growth on specific pods.
      • Fix: Trigger a rolling restart of the deployment. kubectl rollout restart deployment user-profile-api -n production
      • Why it works: A rolling restart replaces old, potentially leaking pods with new ones, temporarily resolving the issue until the leak causes problems again. This buys time for a code fix.
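Several of the checks and fixes above boil down to the same mechanical step: read kubectl top pods output and compare usage against limits. That comparison can be sketched in Python — the find_saturated_pods helper and the 90% threshold are illustrative choices, not standard tooling, and the parser assumes the usual millicore/Mi output format:

```python
def find_saturated_pods(top_output: str, cpu_limit_m: int, mem_limit_mi: int,
                        threshold: float = 0.9) -> list[str]:
    """Flag pods whose usage exceeds `threshold` of their configured limits.

    Parses the tabular output of `kubectl top pods` (header line skipped).
    Assumes CPU is reported in millicores ("995m") and memory in Mi ("1980Mi").
    """
    saturated = []
    for line in top_output.strip().splitlines()[1:]:
        name, cpu, mem = line.split()
        cpu_m = int(cpu.rstrip("m"))
        mem_mi = int(mem.rstrip("Mi"))
        if cpu_m >= threshold * cpu_limit_m or mem_mi >= threshold * mem_limit_mi:
            saturated.append(name)
    return saturated

# Hypothetical `kubectl top pods` output for the deployment above.
sample = """\
NAME                              CPU(cores)   MEMORY(bytes)
user-profile-api-7d9f8b-abcde     995m         1980Mi
user-profile-api-7d9f8b-fghij     240m         600Mi
"""
print(find_saturated_pods(sample, cpu_limit_m=1000, mem_limit_mi=2048))
# ['user-profile-api-7d9f8b-abcde']
```

Encoding the threshold in a script like this is exactly the kind of "expected outputs or thresholds" the Diagnosis Toolkit calls for: the responder no longer has to decide under stress what "close to the limit" means.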

The one thing most people don’t know about runbooks is that their primary value isn’t in the commands but in the decisions they guide. A runbook shouldn’t just say "run kubectl logs"; it should say "run kubectl logs and look for X, Y, Z because that indicates problem A, which can be fixed by B." This explicit linking of observation to action, and action to outcome, is what builds confidence and speed.
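One way to make that observation-to-action linkage concrete is to store each runbook step as structured data rather than free text. A hypothetical sketch, using Check 2 from above as the example (the RunbookStep shape is an illustration, not an established format):

```python
from dataclasses import dataclass

@dataclass
class RunbookStep:
    """One diagnostic step: a command, the observation that matters, and the decision it drives."""
    command: str
    look_for: str    # the observation that matters
    indicates: str   # what that observation means
    action: str      # what to do about it

# Check 2 from the Diagnosis Toolkit, encoded as observation -> meaning -> action.
resource_check = RunbookStep(
    command="kubectl top pods -l app=user-profile-api -n production",
    look_for="pods at or near their CPU/memory limits",
    indicates="resource exhaustion: pods are throttled or OOM-killed",
    action="raise limits in the deployment YAML, then watch p99 latency recover",
)

print(f"Run `{resource_check.command}`; if you see {resource_check.look_for}, "
      f"that indicates {resource_check.indicates} -> {resource_check.action}.")
```

A schema like this makes the gaps obvious: a step with an empty `indicates` or `action` field is a command dangling without a decision, which is exactly the failure mode described above.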

The next step after fixing high latency on user-profile-api is often dealing with the cascading effects of the original problem, or the side effects of your fix. You might find that notification-service is now experiencing 5xx errors because it couldn’t reach user-profile-api during the outage.
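Assessing those cascading effects is itself a graph problem: given a map of service dependencies, which services transitively depend on the one that failed? A sketch with a hypothetical dependency graph — the services beyond those mentioned above (email-gateway, web-frontend) are invented for illustration:

```python
from collections import deque

# Hypothetical dependency graph: each service maps to what it depends on.
DEPENDS_ON = {
    "user-profile-api": ["user-data-db", "auth-service"],
    "notification-service": ["user-profile-api", "email-gateway"],
    "web-frontend": ["user-profile-api", "notification-service"],
}

def blast_radius(failed: str) -> set[str]:
    """Return every service that transitively depends on `failed`."""
    # Invert the edges so we can walk from the failed service to its dependents.
    dependents: dict[str, list[str]] = {}
    for svc, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(svc)
    affected, queue = set(), deque([failed])
    while queue:
        for svc in dependents.get(queue.popleft(), []):
            if svc not in affected:
                affected.add(svc)
                queue.append(svc)
    return affected

print(sorted(blast_radius("user-profile-api")))
# ['notification-service', 'web-frontend']
```

Running this during the incident tells you which other teams to warn and which dashboards to watch once the primary fix lands.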
