On-call is often framed as a necessary evil, even a badge of honor. But the real win is not just surviving it; it's optimizing it to the point where it's no longer a source of dread.

Imagine a typical alert firing at 3 AM: monitoring-service reports 5xx errors from api-gateway. This isn't just a blip. The api-gateway service is a critical entry point for all external traffic, and it is failing to respond correctly to user requests. The interesting part is that the api-gateway itself thinks it's healthy: its downstream dependencies are timing out, and it's dutifully returning errors on their behalf.

Here’s how to dig into that 3 AM api-gateway 5xx alert:

  1. The Dependency is Unresponsive (Most Common): The api-gateway is trying to talk to a backend service (e.g., user-service, product-service), and that service isn’t answering. The api-gateway has a configured timeout (say, 500ms), and when the user-service doesn’t respond within that time, the api-gateway gives up and returns a 502 Bad Gateway or 504 Gateway Timeout.

    • Diagnosis: Check the user-service itself. Look at its CPU, memory, and network. kubectl top pod -n production user-service-abcde. Is it pegged? Check its logs for errors: kubectl logs -n production user-service-abcde -c main. Look for Java OutOfMemoryError or database connection pool exhaustion.
    • Fix: If user-service is overloaded, scale it up. kubectl scale deployment -n production user-service --replicas=5. This gives it more capacity to handle requests. If it’s a code issue (e.g., inefficient query), that needs a code fix and redeploy.
    • Why it works: More replicas mean more instances of user-service are available to process requests, reducing the load on any single instance and allowing it to respond within the api-gateway’s timeout.
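A quick way to reason about how many replicas "scale it up" should mean: a minimal shell sketch of the usual utilization math (the replicas_needed helper and the 70% utilization target are illustrative, not part of kubectl or any autoscaler):

```shell
# desired = ceil(current_replicas * current_utilization / target_utilization)
# Helper and target value are illustrative assumptions, not a kubectl feature.
replicas_needed() {
  local current_replicas=$1 current_util=$2 target_util=$3
  # Integer ceiling division: (a + b - 1) / b
  echo $(( (current_replicas * current_util + target_util - 1) / target_util ))
}

# Example: 3 replicas pegged at 85% CPU, targeting 70% utilization:
replicas_needed 3 85 70   # prints 4

# Then apply the result, e.g.:
# kubectl scale deployment -n production user-service --replicas=4
```

This is the same arithmetic a HorizontalPodAutoscaler performs continuously; doing it by hand at 3 AM at least keeps the scale-up proportional rather than a guess.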
  2. Network Partition Between Services: The api-gateway and user-service can’t communicate. This could be a Kubernetes network policy, a firewall, or a cloud provider’s internal networking issue.

    • Diagnosis: From the api-gateway pod, try to curl the user-service’s internal Kubernetes service IP. kubectl exec -n production api-gateway-fghij -c main -- curl -v http://user-service.production.svc.cluster.local:8080/health. If this times out or fails, it’s a network issue. Check NetworkPolicy resources in Kubernetes: kubectl get networkpolicy -n production.
    • Fix: Adjust or add a NetworkPolicy so traffic from api-gateway’s pods is allowed to reach user-service: a policy in the production namespace whose podSelector matches app: user-service, with an Ingress rule admitting pods labeled app: api-gateway.
    • Why it works: The NetworkPolicy explicitly permits the necessary communication path, resolving the connectivity block.
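Expanded into a full manifest, the policy described in the fix might look like this (the policy name is illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gateway-to-user   # name is an assumption
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: user-service         # the pods being protected
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway  # the pods allowed in
```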
  3. api-gateway Timeout Too Low: The api-gateway is configured to wait for backend services for a very short period, and these services are just naturally a bit slow, even when healthy.

    • Diagnosis: Check the api-gateway configuration. This might be a ConfigMap or a Helm value. kubectl get configmap api-gateway-config -n production -o yaml. Look for a connect_timeout or read_timeout setting, e.g., read_timeout: 200ms.
    • Fix: Increase the timeout in the api-gateway configuration. Change read_timeout: 200ms to read_timeout: 750ms. Then, redeploy the api-gateway. kubectl rollout restart deployment -n production api-gateway.
    • Why it works: A longer timeout allows the api-gateway to wait patiently for slower but functional backend services, preventing it from prematurely declaring them unavailable.
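How much longer should the timeout be? A common rule of thumb (an assumption here, not a gateway feature) is roughly 1.5x the backend's p99 latency, which is where a value like 750ms comes from:

```shell
# Rule of thumb (illustrative): read timeout ~ backend p99 latency * 1.5
timeout_ms() {
  local p99_ms=$1
  echo $(( p99_ms * 3 / 2 ))
}

timeout_ms 500   # prints 750 -> read_timeout: 750ms

# Then update the config and roll the gateway, e.g.:
# kubectl patch configmap api-gateway-config -n production \
#   --type merge -p '{"data":{"read_timeout":"750ms"}}'
# kubectl rollout restart deployment -n production api-gateway
```

Deriving the number from measured latency, rather than picking one that "feels right", keeps the timeout from silently becoming either a false-alarm generator or an unbounded wait.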
  4. Resource Exhaustion on api-gateway Itself: The api-gateway pods are running out of CPU or memory, making them slow to process requests and leading to their own internal timeouts.

    • Diagnosis: Check api-gateway resource usage. kubectl top pod -n production api-gateway-fghij. Are CPU or memory nearing their limits? Check the api-gateway logs for signs of slowness or errors: kubectl logs -n production api-gateway-fghij -c main.
    • Fix: Increase the resource requests/limits for the api-gateway pods in its deployment manifest. Change resources: {requests: {cpu: "200m", memory: "512Mi"}, limits: {cpu: "500m", memory: "1Gi"}} to resources: {requests: {cpu: "400m", memory: "768Mi"}, limits: {cpu: "1", memory: "2Gi"}}. (CPU limits are expressed in cores or millicores, e.g. "1" or "1000m", never in bytes like "1Gi".)
    • Why it works: More allocated resources mean the api-gateway has the horsepower to process incoming requests and their outgoing calls to dependencies without becoming a bottleneck itself.
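The same resources stanza, unflattened as it would sit in the api-gateway Deployment (container name assumed from the kubectl commands above):

```yaml
spec:
  template:
    spec:
      containers:
        - name: main            # container name is an assumption
          resources:
            requests:
              cpu: "400m"
              memory: "768Mi"
            limits:
              cpu: "1"          # cores, not bytes: "1" or "1000m", never "1Gi"
              memory: "2Gi"
```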
  5. Health Check Misconfiguration: The api-gateway’s health checks are too aggressive, or the backend services’ health checks are failing incorrectly. The api-gateway might be marking a healthy backend as unhealthy and stopping traffic to it.

    • Diagnosis: Check the livenessProbe and readinessProbe for both api-gateway and user-service. kubectl describe pod -n production user-service-abcde. Are they failing? Are the thresholds (e.g., initialDelaySeconds: 30, periodSeconds: 10) reasonable? Is the path being checked (/health) actually reflecting service health?
    • Fix: Adjust probe parameters. For example, increase initialDelaySeconds if the service takes time to start. If the /health endpoint is problematic, fix the service’s health check endpoint or change the probe path to a more reliable one. For user-service, change readinessProbe to httpGet: {path: /ready, port: 8080} if /ready is more robust.
    • Why it works: Correctly configured health checks ensure that traffic is only sent to truly healthy instances and that the api-gateway doesn’t prematurely stop communicating with a momentarily slow but ultimately healthy dependency.
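A sketch of the adjusted probe as it would appear in the user-service pod spec (the /ready path comes from the fix above; the timing values are illustrative):

```yaml
readinessProbe:
  httpGet:
    path: /ready              # assumed to be more reliable than /health
    port: 8080
  initialDelaySeconds: 30     # give the service time to start before probing
  periodSeconds: 10
  failureThreshold: 3         # tolerate brief slowness before marking unready
```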
  6. DNS Resolution Issues: The api-gateway can’t resolve the hostname of the user-service.

    • Diagnosis: Exec into the api-gateway pod and try to resolve the service name. kubectl exec -n production api-gateway-fghij -c main -- nslookup user-service.production.svc.cluster.local. If this fails or returns incorrect IPs, the issue is with DNS. Check CoreDNS logs in your cluster.
    • Fix: Restart the CoreDNS pods. kubectl rollout restart deployment -n kube-system coredns. Ensure your cluster’s DNS configuration is correct.
    • Why it works: Restarts or fixes to the cluster’s DNS service ensure that pod hostnames can be correctly translated into IP addresses, enabling inter-service communication.
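In-cluster service names follow a fixed pattern, so before blaming CoreDNS it's worth confirming you're resolving the right FQDN. A minimal sketch (the fqdn helper is illustrative, not a kubectl feature):

```shell
# In-cluster service DNS names follow <service>.<namespace>.svc.cluster.local
fqdn() { printf '%s.%s.svc.cluster.local\n' "$1" "$2"; }

fqdn user-service production   # prints user-service.production.svc.cluster.local

# Then resolve it from inside the gateway pod, e.g.:
# kubectl exec -n production api-gateway-fghij -c main -- \
#   nslookup "$(fqdn user-service production)"
```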

After fixing all these, your next alert will likely be about the user-service experiencing 5xx errors due to a slow database query that wasn’t previously obvious.

Want structured learning?

Take the full DevOps & Platform Engineering course →