On-call is often framed as a necessary evil, even a badge of honor. But the real win isn't just surviving it; it's optimizing it to the point where it's no longer a source of dread.
Imagine a typical alert firing at 3 AM. The system is `api-gateway`, and it's reporting 5xx errors to `monitoring-service`. This isn't just a blip; it's the `api-gateway` service, a critical entry point for all external traffic, failing to respond correctly to requests from users. The interesting part is that the `api-gateway` thinks it's healthy, but its downstream dependencies are timing out, and it's dutifully returning errors for them.

Here's how to dig into that 3 AM `api-gateway` 5xx alert:
- **The Dependency Is Unresponsive (Most Common):** The `api-gateway` is trying to talk to a backend service (e.g., `user-service`, `product-service`), and that service isn't answering. The `api-gateway` has a configured timeout (say, `500ms`), and when the `user-service` doesn't respond within that time, the `api-gateway` gives up and returns a `502 Bad Gateway` or `504 Gateway Timeout`.
  - **Diagnosis:** Check the `user-service` itself. Look at its CPU, memory, and network: `kubectl top pod -n production user-service-abcde`. Is it pegged? Check its logs for errors: `kubectl logs -n production user-service-abcde -c main`. Look for a Java `OutOfMemoryError` or database connection pool exhaustion.
  - **Fix:** If `user-service` is overloaded, scale it up: `kubectl scale deployment -n production user-service --replicas=5`. This gives it more capacity to handle requests. If it's a code issue (e.g., an inefficient query), that needs a code fix and redeploy.
  - **Why it works:** More replicas mean more instances of `user-service` are available to process requests, reducing the load on any single instance and allowing it to respond within the `api-gateway`'s timeout.
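The imperative `kubectl scale` above can also be recorded declaratively in the Deployment manifest, so the new replica count isn't reverted by the next `kubectl apply` of the old spec. A minimal sketch, assuming the names and namespace from the example (the previous replica count is an assumption):

```yaml
# Fragment of the user-service Deployment manifest (hypothetical file).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
  namespace: production
spec:
  replicas: 5   # raised from the previous value (assumed lower) to absorb the load
```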
- **Network Partition Between Services:** The `api-gateway` and `user-service` can't communicate. This could be a Kubernetes network policy, a firewall, or a cloud provider's internal networking issue.
  - **Diagnosis:** From the `api-gateway` pod, try to `curl` the `user-service`'s internal Kubernetes service: `kubectl exec -n production api-gateway-fghij -c main -- curl -v http://user-service.production.svc.cluster.local:8080/health`. If this times out or fails, it's a network issue. Check `NetworkPolicy` resources in Kubernetes: `kubectl get networkpolicy -n production`.
  - **Fix:** Adjust the `NetworkPolicy` to allow traffic from the `api-gateway`'s namespace/labels to the `user-service`'s namespace/labels. Example: `apiVersion: networking.k8s.io/v1; kind: NetworkPolicy; metadata: {name: allow-gateway-to-user, namespace: production}; spec: {podSelector: {matchLabels: {app: user-service}}, policyTypes: [Ingress], ingress: [{from: [{podSelector: {matchLabels: {app: api-gateway}}}]}]}`.
  - **Why it works:** The `NetworkPolicy` explicitly permits the necessary communication path, resolving the connectivity block.
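Expanded into a full manifest, the inline `NetworkPolicy` from the fix above looks like this (the `app: user-service` and `app: api-gateway` labels are taken from the example and are assumptions about your labeling scheme):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gateway-to-user
  namespace: production
spec:
  # Applies to user-service pods: they may receive ingress traffic
  # only from pods matching the rule below (within the same namespace).
  podSelector:
    matchLabels:
      app: user-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway
```

Note that a bare `podSelector` in the `from` clause only matches pods in the policy's own namespace; if `api-gateway` lived elsewhere, you would add a `namespaceSelector` as well.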
- **`api-gateway` Timeout Too Low:** The `api-gateway` is configured to wait for backend services for a very short period, and those services are just naturally a bit slow, even when healthy.
  - **Diagnosis:** Check the `api-gateway` configuration. This might be a ConfigMap or a Helm value: `kubectl get configmap api-gateway-config -n production -o yaml`. Look for a `connect_timeout` or `read_timeout` setting, e.g., `read_timeout: 200ms`.
  - **Fix:** Increase the timeout in the `api-gateway` configuration. Change `read_timeout: 200ms` to `read_timeout: 750ms`, then roll the pods so they pick it up: `kubectl rollout restart deployment -n production api-gateway`.
  - **Why it works:** A longer timeout lets the `api-gateway` wait for slower but functional backend services instead of prematurely declaring them unavailable.
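A minimal sketch of what that ConfigMap change might look like, assuming the gateway reads a flat key/value config file (the ConfigMap name and timeout keys come from the example; the `gateway.conf` file name is hypothetical):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: api-gateway-config
  namespace: production
data:
  gateway.conf: |
    # read_timeout raised from 200ms so slow-but-healthy backends can answer.
    read_timeout: 750ms
```

Remember that most gateways only read config at startup, hence the `kubectl rollout restart` after applying the change.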
- **Resource Exhaustion on `api-gateway` Itself:** The `api-gateway` pods are running out of CPU or memory, making them slow to process requests and leading to their own internal timeouts.
  - **Diagnosis:** Check `api-gateway` resource usage: `kubectl top pod -n production api-gateway-fghij`. Are CPU or memory nearing their limits? Check the `api-gateway` logs for signs of slowness or errors: `kubectl logs -n production api-gateway-fghij -c main`.
  - **Fix:** Increase the resource requests/limits for the `api-gateway` pods in its deployment manifest. Change `resources: {requests: {cpu: "200m", memory: "512Mi"}, limits: {cpu: "500m", memory: "1Gi"}}` to `resources: {requests: {cpu: "400m", memory: "768Mi"}, limits: {cpu: "1", memory: "2Gi"}}`. (Note the CPU limit is `"1"`, i.e., one core; `Gi` suffixes are for memory, not CPU.)
  - **Why it works:** More allocated resources give the `api-gateway` the headroom to process incoming requests, and their outgoing calls to dependencies, without becoming a bottleneck itself.
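Laid out as it would appear in the container spec of the `api-gateway` Deployment, the change above is (values from the example; the old values are shown in comments):

```yaml
# api-gateway container spec (fragment).
resources:
  requests:
    cpu: "400m"      # was 200m
    memory: "768Mi"  # was 512Mi
  limits:
    cpu: "1"         # was 500m; "1" means one full core
    memory: "2Gi"    # was 1Gi
```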
- **Health Check Misconfiguration:** The `api-gateway`'s health checks are too aggressive, or the backend services' health checks are failing incorrectly. The `api-gateway` might be marking a healthy backend as unhealthy and stopping traffic to it.
  - **Diagnosis:** Check the `livenessProbe` and `readinessProbe` for both `api-gateway` and `user-service`: `kubectl describe pod -n production user-service-abcde`. Are they failing? Are the thresholds (e.g., `initialDelaySeconds: 30`, `periodSeconds: 10`) reasonable? Does the path being checked (`/health`) actually reflect service health?
  - **Fix:** Adjust probe parameters. For example, increase `initialDelaySeconds` if the service takes time to start. If the `/health` endpoint is problematic, fix the service's health check endpoint or change the probe path to a more reliable one. For `user-service`, change the `readinessProbe` to `httpGet: {path: /ready, port: 8080}` if `/ready` is more robust.
  - **Why it works:** Correctly configured health checks ensure that traffic is only sent to truly healthy instances and that the `api-gateway` doesn't prematurely stop communicating with a momentarily slow but ultimately healthy dependency.
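In the `user-service` pod spec, that readiness probe change might look like the following sketch (path and port are from the example; the timing values are the thresholds quoted in the diagnosis, and whether `/ready` exists on your service is an assumption):

```yaml
# user-service container spec (fragment).
readinessProbe:
  httpGet:
    path: /ready   # switched from /health, assuming /ready better reflects readiness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
```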
- **DNS Resolution Issues:** The `api-gateway` can't resolve the hostname of the `user-service`.
  - **Diagnosis:** Exec into the `api-gateway` pod and try to resolve the service name: `kubectl exec -n production api-gateway-fghij -c main -- nslookup user-service.production.svc.cluster.local`. If this fails or returns incorrect IPs, the issue is DNS. Check the CoreDNS logs in your cluster.
  - **Fix:** Restart the CoreDNS pods: `kubectl rollout restart deployment -n kube-system coredns`. Ensure your cluster's DNS configuration is correct.
  - **Why it works:** Restarting or fixing the cluster's DNS service ensures that service hostnames resolve to the correct IP addresses, enabling inter-service communication.
After fixing all these, your next alert will likely be about `user-service` experiencing 5xx errors due to a slow database query that wasn't previously obvious.