A post-mortem isn’t about assigning blame; it’s about finding the single, often mundane, systemic flaw that, when nudged, cascades into an outage.

Let’s look at a real-world incident and see how a structured post-mortem process could have surfaced the root cause and prevented recurrence.

Imagine a scenario where a popular e-commerce site experiences intermittent "503 Service Unavailable" errors for about 30 minutes during peak traffic. Users can’t access their carts, and orders are failing. The immediate response is to scale up web servers, which temporarily helps, but the problem returns.

Incident Overview:

  • What happened: Users experienced 503 errors, indicating the web servers couldn’t reach the upstream services they depend on.
  • Impact: Significant revenue loss, customer frustration, and brand damage.
  • Duration: Approximately 30 minutes of widespread impact.

The Post-Mortem Process: Driving Action

A well-structured post-mortem doesn’t just document what happened; it explains why it happened and how to prevent it from happening again.

1. Timeline Construction (The Narrative)

The first step is to build a precise, chronological timeline of events from multiple perspectives: monitoring alerts, user reports, engineering actions, and system logs. This isn’t just a list of timestamps; it’s a narrative of the incident unfolding.

  • 14:00 UTC: First spike in webserver_request_error_rate alert.
  • 14:05 UTC: User reports of "cart not loading" on social media.
  • 14:10 UTC: On-call engineer (Alice) receives high_latency_checkout_service alert.
  • 14:15 UTC: Alice scales checkout-service pods by 50%. Temporary relief, error rate dips.
  • 14:20 UTC: webserver_request_error_rate spikes again, higher than before.
  • 14:25 UTC: Bob, another engineer, notices database_connection_pool_exhaustion alert for the inventory-db.
  • 14:30 UTC: Bob identifies that Alice’s scale-up added checkout-service pods, each holding its own inventory-db connections, exhausting the connection pool and preventing other services (like cart) from acquiring connections.
  • 14:35 UTC: Bob implements a connection leak fix in checkout-service and restarts pods.
  • 14:40 UTC: All error rates return to normal.
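
The leak Bob fixed at 14:35 was, in essence, a connection acquired on a code path that never released it on error. A minimal Python sketch of that pattern and its fix, using a toy in-memory pool (the `Pool` class and SQLite stand-in are illustrative, not the real checkout-service code):

```python
import sqlite3
from contextlib import contextmanager

class Pool:
    """A tiny connection 'pool' that tracks how many connections are out."""
    def __init__(self, max_size=5):
        self.max_size = max_size
        self.in_use = 0

    def acquire(self):
        if self.in_use >= self.max_size:
            raise RuntimeError("connection pool exhausted")
        self.in_use += 1
        return sqlite3.connect(":memory:")

    def release(self, conn):
        conn.close()
        self.in_use -= 1

@contextmanager
def connection(pool):
    """Release the connection even when the query raises -- the leak fix."""
    conn = pool.acquire()
    try:
        yield conn
    finally:
        pool.release(conn)

pool = Pool(max_size=2)
# Buggy pattern: an exception between acquire() and release() leaks a slot.
# Fixed pattern: the context manager guarantees release on every path.
for _ in range(10):
    try:
        with connection(pool) as conn:
            conn.execute("SELECT * FROM missing_table")  # raises
    except sqlite3.OperationalError:
        pass
assert pool.in_use == 0  # no leaked connections despite 10 failed queries
```

The `try`/`finally` (or the driver’s own `with`-style API) is what guarantees the slot returns to the pool even when the query fails, which is exactly the behavior Action Item 3 below asks to verify.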

2. Root Cause Analysis (The "Why")

This is where we move beyond the symptoms to the underlying systemic issues. A common pitfall is stopping at the immediate cause (e.g., "checkout service was slow"). We need to dig deeper.

Common Causes & Their Diagnosis/Fix:

  • Cause 1: Resource Exhaustion in a Downstream Service.

    • Diagnosis: Observe metrics for upstream services (checkout-service, inventory-service, user-profile-service). Look for increased latency, error rates, or saturation of resources like CPU, memory, or connection pools.
      • kubectl top pods -n default (for CPU/memory)
      • kubectl exec <pod-name> -n default -- ss -s (for TCP connection summary)
      • Prometheus queries like sum(rate(http_requests_total{job="checkout-service",status=~"5xx"}[5m])) by (instance)
    • Fix: Identify the bottleneck service. If it’s a connection pool (like inventory-db in our example), scale the database, increase its connection pool size, or optimize queries. If it’s CPU/memory, scale the affected service.
      • For connection pool exhaustion: raise the database’s limit with kubectl exec <postgres-pod> -- psql -U postgres -c "ALTER SYSTEM SET max_connections = 500;" followed by a database restart (max_connections only takes effect after a restart). Note that adding checkout-service replicas adds connections, so also cap each client’s pool size to keep the total below the database limit.
    • Why it works: Increasing resources or optimizing resource usage allows the downstream service to handle the load, preventing it from becoming a bottleneck for its callers.
  • Cause 2: Cascading Failures Due to Inadequate Circuit Breaking.

    • Diagnosis: When one service slows down, its callers wait. If those callers also have downstream dependencies, the problem propagates. Look for a chain of services exhibiting increased latency or errors, starting from a single point.
      • Distributed tracing tools (Jaeger, Zipkin) are invaluable here. Look for long spans in a chain.
      • Check service mesh metrics (if applicable) for circuit breaker state.
    • Fix: Implement or tune circuit breakers in the calling services. Configure them to fail fast when a downstream dependency is unhealthy, preventing the calling service from being overwhelmed by waiting.
      • Example (Istio): in the DestinationRule for checkout-service, add:
        trafficPolicy:
          connectionPool:
            tcp:
              maxConnections: 100
            http:
              http1MaxPendingRequests: 10
              maxRequestsPerConnection: 1
          outlierDetection:
            consecutive5xxErrors: 5
            interval: 10s
            baseEjectionTime: 30s
            maxEjectionPercent: 50
        
    • Why it works: Circuit breakers stop requests from being sent to a failing service, allowing the calling service to remain responsive and preventing the failure from cascading further.
  • Cause 3: Unforeseen Load Patterns from a New Feature/Deployment.

    • Diagnosis: Correlate the incident start time with recent deployments or feature flag rollouts. Analyze traffic patterns for specific user segments or API endpoints.
      • git log --since="24 hours ago" and kubectl get pods --sort-by='.metadata.creationTimestamp' to find recent changes.
      • Examine access logs for unusual spikes on specific URIs.
    • Fix: Roll back the feature/deployment. If the pattern is desired but the system can’t handle it, optimize the code, add caching, or scale infrastructure before the next rollout.
      • kubectl rollout undo deployment/<deployment-name>
    • Why it works: Reverting the change removes the unexpected load. Future optimization ensures the system can handle the load if the feature is re-enabled.
  • Cause 4: Misconfigured Rate Limiting or Quotas.

    • Diagnosis: Check rate limiting configurations for APIs and services. Look for alerts related to exceeding defined limits, even if the underlying service is healthy.
      • kubectl get envoyfilter -n <namespace> -o yaml (Istio rate limiting is configured through EnvoyFilter resources).
      • Check logs of your API Gateway or ingress controller.
    • Fix: Adjust rate limits to accommodate legitimate traffic spikes or identify and fix the source of excessive requests if it’s an abuse pattern.
      • Example: Increase the limits in the Istio EnvoyFilter that configures Envoy’s local or global rate limiter.
    • Why it works: Correctly configured rate limits protect services from being overwhelmed by too many requests, ensuring fair usage and stability.
  • Cause 5: Network Partition or Latency Between Services.

    • Diagnosis: Use network diagnostic tools to check connectivity and latency between affected services. This might involve ping, traceroute, or specialized network observability tools.
      • kubectl exec <source-pod> -n <namespace> -- ping <destination-ip>
      • kubectl exec <source-pod> -n <namespace> -- traceroute <destination-ip>
    • Fix: Investigate network infrastructure issues (e.g., load balancers, firewalls, cloud provider network). If it’s a configuration issue within a service mesh, correct it.
      • Example: If a Kubernetes Service selector is incorrect, kubectl edit svc <service-name> and fix the selector.
    • Why it works: Restoring reliable network paths between services ensures they can communicate effectively.
  • Cause 6: A "Chatty" Dependency Triggered by a Common Event.

    • Diagnosis: Identify a single event or data change that triggers multiple services to make a large number of calls to a common, often external or less-resilient, dependency.
      • Analyze logs and traces for a common upstream trigger.
      • Check the dependency’s own monitoring for load spikes correlating with the incident.
    • Fix: Implement caching for the dependency’s responses, introduce backoff/retry mechanisms with jitter, or work with the dependency owner to improve its capacity or reduce its chattiness.
      • Example: If a notification service is being hammered, add a cache layer to the NotificationService client in checkout-service with a TTL of 60 seconds.
    • Why it works: Reducing the direct load on the problematic dependency, either through caching or smarter client behavior, prevents it from becoming a bottleneck.
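
The caching and jittered-backoff fixes for a chatty dependency can be sketched briefly; the `TTLCache`, `backoff_with_jitter`, and `fetch_notification_prefs` names below are hypothetical illustrations, not a specific library:

```python
import time
import random

class TTLCache:
    """Cache dependency responses for ttl seconds to cut call volume."""
    def __init__(self, ttl=60.0):
        self.ttl = ttl
        self._store = {}  # key -> (expires_at, value)

    def get_or_fetch(self, key, fetch):
        now = time.monotonic()
        hit = self._store.get(key)
        if hit and hit[0] > now:
            return hit[1]           # fresh cached value: skip the call
        value = fetch(key)
        self._store[key] = (now + self.ttl, value)
        return value

def backoff_with_jitter(attempt, base=0.1, cap=5.0):
    """'Full jitter': a random delay up to the capped exponential, so
    retrying clients don't hammer the dependency in synchronized waves."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

calls = []
def fetch_notification_prefs(user_id):   # stand-in for the chatty dependency
    calls.append(user_id)
    return {"user": user_id, "email": True}

cache = TTLCache(ttl=60.0)
for _ in range(1000):                    # a burst of identical requests...
    cache.get_or_fetch("user-42", fetch_notification_prefs)
print(len(calls))  # -> 1: the dependency saw one call, not a thousand
```

The same 60-second TTL mentioned for the NotificationService client applies here: the dependency’s load becomes one call per key per TTL window, regardless of burst size.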

3. Action Items (The "How to Prevent")

This is the most critical part. Each identified root cause must have a clear, assigned action item with a due date.

  • Action Item 1: Implement a connection pool monitoring and alerting system for inventory-db with a threshold of 80% usage. (Owner: Bob, Due: Next Sprint)
  • Action Item 2: Add circuit breakers to web-service calls to checkout-service using Istio’s outlierDetection. (Owner: Alice, Due: End of Week)
  • Action Item 3: Review checkout-service connection management to ensure connections to inventory-db are released promptly, even on errors. (Owner: Alice, Due: End of Week)
  • Action Item 4: Document the "chatty dependency" pattern and add specific checks for this in our deployment readiness reviews. (Owner: SRE Lead, Due: Next Month)
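
Action Item 2 relies on Istio’s outlierDetection, but the same fail-fast idea can also live in application code. A minimal client-side circuit breaker sketch (the thresholds and the `flaky_checkout` stub are illustrative, not the real service’s tuning):

```python
import time

class CircuitBreaker:
    """Fail fast after `threshold` consecutive errors; probe after cooldown."""
    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if time.monotonic() - self.opened_at >= self.cooldown:
            return "half-open"   # let one probe request through
        return "open"

    def call(self, fn):
        if self.state == "open":
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0            # any success closes the circuit
        self.opened_at = None
        return result

breaker = CircuitBreaker(threshold=5, cooldown=30.0)

def flaky_checkout():
    raise ConnectionError("503 from checkout-service")

for _ in range(5):
    try:
        breaker.call(flaky_checkout)
    except ConnectionError:
        pass
print(breaker.state)  # prints "open": further calls fail fast, not wait
```

This mirrors the DestinationRule above: five consecutive 5xx-style errors open the circuit, and callers stop queuing behind a dead dependency for the cooldown period.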

4. Lessons Learned (The "What We Learned")

This section captures broader insights.

  • Our monitoring for downstream connection pools was insufficient.
  • The impact of scaling one service on its dependencies needs more explicit consideration.
  • Distributed tracing is essential for quickly diagnosing cascading failures.

The Next Hurdle:

After fixing the immediate issue and implementing these action items, the next problem you’ll likely encounter is subtle performance degradation in a less-monitored but equally critical microservice, one that was previously masked by the more obvious failure.

Want structured learning?

Take the full SRE course →