A blameless post-mortem is useless if it doesn’t lead to concrete actions that prevent the same incident from happening again.

Let’s look at a hypothetical incident: "Service X unavailable for 30 minutes."

Imagine Service X is a critical piece of your infrastructure, say, a user authentication service. When it goes down, everything that relies on it grinds to a halt – logins, API requests, even internal dashboards. The real problem isn’t just that Service X is unavailable; it’s the cascade of failures and the lack of immediate visibility into why it failed that truly breaks the system’s resilience.

Here are the common culprits when Service X suddenly becomes unavailable:

  1. Resource Exhaustion (CPU/Memory): Service X, or a critical dependency it talks to, is simply out of CPU or memory. This isn’t about a "bug" in the code, but rather a mismatch between the workload and the provisioned resources.

    • Diagnosis: kubectl top pods -n <namespace> and kubectl top nodes (nodes are not namespaced, so no -n flag) will show you if pods or nodes are pegged at 90-100% utilization. Check the metrics for Service X and its immediate downstream dependencies in your monitoring tool (e.g., Prometheus/Grafana) for CPU and memory usage spikes.
    • Fix: Scale up the resources for Service X. This could mean increasing the resources.requests and resources.limits for CPU/memory in its Kubernetes deployment YAML. For example, change resources.requests.cpu: "500m" to resources.requests.cpu: "1000m" and resources.limits.cpu: "1000m" to resources.limits.cpu: "2000m". Alternatively, if the issue is with a node, you might need to add more nodes to your cluster or evict less critical pods.
    • Why it works: By increasing the allocated resources, you give the process more headroom to handle its workload, so the kernel no longer throttles it (when it hits its CPU limit) or OOM-kills it (when it hits its memory limit) under resource contention.
  2. Network Connectivity Issues: Service X can’t reach its dependencies, or clients can’t reach Service X. This could be a misconfiguration in the network policies, a downed load balancer, or DNS resolution failures.

    • Diagnosis: Use kubectl exec <pod-name> -n <namespace> -- curl <dependency-service-url> from within the Service X pod to test connectivity to its dependencies. From an external client, curl <service-x-external-ip> will test reachability (ping only helps if ICMP is permitted, and many load balancers block it). Check kube-proxy logs and CNI plugin logs for network errors.
    • Fix: If a network policy is blocking traffic, modify the policy to allow necessary connections. For example, if Service X needs to talk to Service Y on port 8080, ensure the network policy for Service Y’s namespace permits ingress from Service X’s namespace on that port. If DNS is the issue, check your CoreDNS/kube-dns configuration and logs.
    • Why it works: Correcting network policies or DNS allows the packets to flow to their intended destinations, re-establishing communication.
  3. Database/Datastore Saturation: Service X relies on a database (e.g., PostgreSQL, Redis) that is overloaded, experiencing high latency, or has run out of connections.

    • Diagnosis: Check the database’s connection count, CPU, memory, and I/O utilization. Look for slow query logs or high numbers of active connections in the database itself (e.g., the pg_stat_activity view in PostgreSQL). In Prometheus, query the exporter metrics instead, such as pg_stat_activity_count from postgres_exporter or redis_connected_clients from redis_exporter.
    • Fix: Increase the database’s resources (CPU, RAM, IOPS), optimize queries, or increase the max_connections parameter in the database configuration (e.g., max_connections = 200 in postgresql.conf). If connection pooling is used by Service X, ensure its pool size is adequate but not excessive.
    • Why it works: A healthy database can respond to Service X’s requests promptly, preventing timeouts and service degradation.
  4. Application-Level Errors (Infinite Loops, Deadlocks): A bug in Service X’s code is causing it to enter an infinite loop or a deadlock, consuming CPU without making progress, or blocking other threads indefinitely.

    • Diagnosis: Analyze thread dumps or heap dumps from the running Service X pod during the incident. Look for threads stuck in the same method, or threads waiting indefinitely on locks. Tools like jstack (for Java) or pprof (for Go) are invaluable. Check application logs for repeated error messages or unusual patterns.
    • Fix: Identify the specific code path causing the loop or deadlock and refactor it. This might involve adding timeouts to blocking operations, ensuring proper lock acquisition/release, or fixing faulty loop conditions.
    • Why it works: By resolving the faulty logic, you allow the application threads to execute correctly and terminate, freeing up resources and allowing the service to process requests.
  5. Configuration Drift / Bad Deployment: A recent deployment or configuration change introduced an error. This could be incorrect environment variables, missing secrets, or a faulty application configuration file.

    • Diagnosis: Compare the current deployment’s configuration (e.g., Kubernetes ConfigMaps, Secrets, environment variables) with the previous known-good version. Check deployment history and associated commit logs. kubectl rollout history deployment/<deployment-name> -n <namespace> shows past revisions, and kubectl describe deployment <deployment-name> -n <namespace> shows the current rollout status and recent events.
    • Fix: Revert to the previous known-good deployment or configuration. For example, if a new environment variable FEATURE_FLAG_X was set to an invalid value, revert the ConfigMap or Secret to its prior state and trigger a new deployment.
    • Why it works: Restoring a working configuration removes the erroneous setting that was preventing Service X from functioning correctly.
  6. External Dependency Failure: Service X relies on an external service (e.g., a third-party API, a SaaS provider) that has become unavailable or is returning errors.

    • Diagnosis: Check the status pages of your external dependencies. From within Service X, try to manually call the external API to see if it responds, and note the status codes or errors it returns.
    • Fix: If the external service is down, you may need to wait for the provider to fix it. Implement circuit breakers and fallback mechanisms in Service X if possible, or temporarily disable features that rely on the failing service. If it’s a rate-limiting issue, implement exponential backoff and retry logic.
    • Why it works: By understanding and mitigating the impact of external failures, you prevent them from bringing down your own services.
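The resource bump described in step 1 might look like the following deployment fragment. This is a minimal sketch; the names (service-x) and the specific request/limit values are illustrative assumptions, not taken from a real cluster:

```yaml
# Hypothetical deployment fragment for Service X.
# Requests/limits raised per step 1; tune to your actual workload.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-x
spec:
  template:
    spec:
      containers:
        - name: service-x
          resources:
            requests:
              cpu: "1000m"     # raised from 500m
              memory: "1Gi"
            limits:
              cpu: "2000m"     # raised from 1000m
              memory: "2Gi"
```

Keeping requests and limits in a 1:2 ratio like this gives the scheduler an honest baseline while still allowing bursts.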
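For step 2, a NetworkPolicy permitting Service X to reach Service Y on port 8080 could be sketched as below. All namespace names and labels here are assumptions for illustration:

```yaml
# Hypothetical policy in Service Y's namespace: allow ingress on TCP 8080
# from pods in Service X's namespace, deny nothing else beyond what
# existing policies already enforce.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-service-x
  namespace: service-y-ns
spec:
  podSelector:
    matchLabels:
      app: service-y
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: service-x-ns
      ports:
        - protocol: TCP
          port: 8080
```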
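Step 3 recommends a pool that is "adequate but not excessive." A minimal sketch of that idea in pure Python (using a stand-in factory rather than a real database driver) is a bounded pool that blocks briefly, then fails fast, when saturated:

```python
import queue


class BoundedPool:
    """Minimal connection-pool sketch: hands out at most max_size
    connections and raises queue.Empty when none frees up in time,
    so the caller surfaces saturation instead of hanging forever."""

    def __init__(self, factory, max_size=10):
        self._conns = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._conns.put(factory())

    def acquire(self, timeout=5.0):
        # Blocks up to `timeout` seconds; queue.Empty == pool exhausted.
        return self._conns.get(timeout=timeout)

    def release(self, conn):
        self._conns.put(conn)


# Usage with a stand-in "connection" factory:
pool = BoundedPool(factory=object, max_size=2)
c1 = pool.acquire()
c2 = pool.acquire()
pool.release(c1)
c3 = pool.acquire()  # succeeds because c1 was returned
```

Capping the pool below the database's max_connections keeps one misbehaving service from starving every other client of the database.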
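Step 4's advice to "add timeouts to blocking operations" can be sketched with Python's threading primitives. The function below is a hypothetical example of the classic two-lock shape; acquiring each lock with a timeout turns a potential deadlock into a retryable failure:

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()


def transfer_with_timeout():
    """Acquire both locks with timeouts; on failure, release what we
    hold and report failure instead of waiting forever (the classic
    deadlock shape step 4 describes)."""
    if not lock_a.acquire(timeout=1.0):
        return False
    try:
        if not lock_b.acquire(timeout=1.0):
            return False  # give up rather than deadlock
        try:
            return True  # ... critical section would go here ...
        finally:
            lock_b.release()
    finally:
        lock_a.release()
```

A caller that gets False can back off and retry, which a thread dump would show as forward progress rather than threads parked on the same monitor.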
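The comparison in step 5 (current configuration versus the last known-good version) can be automated with a small diff. This sketch operates on plain dicts standing in for ConfigMap data; the keys and values are hypothetical:

```python
def config_drift(known_good: dict, current: dict) -> dict:
    """Return every key whose value changed, was added, or was removed
    relative to the last known-good configuration."""
    drift = {}
    for key in known_good.keys() | current.keys():
        old, new = known_good.get(key), current.get(key)
        if old != new:
            drift[key] = (old, new)
    return drift


# Hypothetical ConfigMap contents before and after a bad deploy:
good = {"DB_URL": "postgres://db:5432/x", "FEATURE_FLAG_X": "off"}
bad = {"DB_URL": "postgres://db:5432/x", "FEATURE_FLAG_X": "banana"}
print(config_drift(good, bad))  # {'FEATURE_FLAG_X': ('off', 'banana')}
```

The output pinpoints exactly which setting to revert before triggering a new rollout.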
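Finally, the exponential backoff and retry logic from step 6 can be sketched as a small wrapper. This is one common shape (capped exponential delay with jitter), not the only correct implementation:

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry fn() with capped exponential backoff plus jitter; re-raise
    the last error once attempts are exhausted so the caller can fall
    back or trip a circuit breaker."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter
```

The jitter matters: without it, every instance of Service X retries in lockstep and hammers the recovering dependency at the same instant.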

After applying these fixes, the next error you’ll likely encounter is a 503 Service Unavailable from a different service that was depending on Service X: even once Service X is technically back, its dependents can keep failing while it is still warming up, or because a new issue (for example, a fresh network misconfiguration) has made it unreachable again.

Want structured learning?

Take the full SRE course →