The most surprising truth about SRE Root Cause Analysis (RCA) is that it’s not primarily about finding the single bug, but about building a system that prevents any bug from causing the same type of outage twice.
Let’s watch an SRE team tackle a "service unavailable" incident.
The Incident: Users are reporting 503 Service Unavailable errors on the primary e-commerce site.
Initial Triage:
- Metrics: The `requests_per_second` metric for the `frontend-web` service has dropped to zero. The `error_rate` for `frontend-web` is `1.0`. The `cpu_usage` for `frontend-web` pods is `95%`. The `network_traffic` from `frontend-web` to `backend-api` has also dropped.
- Logs: `frontend-web` logs show repeated `dial tcp <backend-api-ip>:8080: connect: no route to host` errors.
- Alerts: An alert fires: `High CPU Usage on frontend-web pods`.
The Problem: The `frontend-web` service cannot reach the `backend-api` service, causing it to fail to process requests and return 503 errors. This is interesting because it’s not a simple deployment bug; it’s a communication breakdown under load.
Root Cause Analysis - Common Causes & Fixes:
- Network Policy Blocking Traffic:

  - Diagnosis: Check Kubernetes NetworkPolicies. Look for policies that might restrict egress from `frontend-web` to `backend-api`’s port `8080`:

    ```bash
    kubectl get networkpolicy -n <namespace> -o yaml
    ```

  - Fix: If a restrictive policy is found, update it to allow the necessary traffic. For example, if a policy was too specific:

    ```yaml
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-frontend-to-backend
      namespace: <namespace>
    spec:
      podSelector:
        matchLabels:
          app: frontend-web
      policyTypes:
        - Egress
      egress:
        - to:
            - podSelector:
                matchLabels:
                  app: backend-api
          ports:
            - protocol: TCP
              port: 8080
    ```

  - Why it works: NetworkPolicies are Kubernetes’ built-in firewall. Once the `frontend-web` pods are selected by any egress policy, all egress traffic not explicitly allowed is dropped by the CNI plugin (such as Calico or Cilium); without a rule permitting the connection to `backend-api` on port `8080`, that traffic never leaves the pod. This fix explicitly permits the connection.
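The allow/deny decision the CNI makes can be modeled in a few lines of Python. This is a deliberately simplified sketch (flat label maps, a single namespace, ports only), not Calico’s or Cilium’s actual implementation; the point is the default-allow-until-selected behavior described above:

```python
def selects(selector: dict, labels: dict) -> bool:
    """True if every key/value in the selector matches the pod's labels."""
    return all(labels.get(k) == v for k, v in selector.items())

def egress_allowed(src_labels: dict, dst_labels: dict, port: int, policies: list) -> bool:
    """Simplified NetworkPolicy egress evaluation within one namespace."""
    # Only policies that select the source pod and govern egress apply.
    applicable = [p for p in policies
                  if "Egress" in p["policyTypes"] and selects(p["podSelector"], src_labels)]
    if not applicable:
        return True  # no egress policy selects the pod -> default allow
    for p in applicable:
        for rule in p.get("egress", []):
            dst_ok = any(selects(t["podSelector"], dst_labels) for t in rule["to"])
            port_ok = any(pr["port"] == port for pr in rule["ports"])
            if dst_ok and port_ok:
                return True
    return False  # selected by a policy, but no rule matches -> drop

policy = {
    "podSelector": {"app": "frontend-web"},
    "policyTypes": ["Egress"],
    "egress": [{"to": [{"podSelector": {"app": "backend-api"}}],
                "ports": [{"protocol": "TCP", "port": 8080}]}],
}

# frontend-web -> backend-api:8080 passes; any other port is now dropped.
print(egress_allowed({"app": "frontend-web"}, {"app": "backend-api"}, 8080, [policy]))  # -> True
print(egress_allowed({"app": "frontend-web"}, {"app": "backend-api"}, 9090, [policy]))  # -> False
```

Note the sharp edge this models: applying the policy above fixes frontend-to-backend traffic but silently drops every *other* egress flow from `frontend-web` (including DNS), which is a common way this class of outage recurs.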
- DNS Resolution Failure for Backend Service:

  - Diagnosis: Exec into a `frontend-web` pod and try to resolve the `backend-api` service name:

    ```bash
    kubectl exec -it <frontend-web-pod-name> -n <namespace> -- nslookup backend-api.<namespace>.svc.cluster.local
    ```

    If this fails or times out, DNS is the issue.

  - Fix: Check the health of the CoreDNS pods in the `kube-system` namespace:

    ```bash
    kubectl get pods -n kube-system -l k8s-app=kube-dns
    kubectl logs <coredns-pod-name> -n kube-system
    ```

    If CoreDNS pods are unhealthy or logging errors, restart them:

    ```bash
    kubectl delete pod <coredns-pod-name> -n kube-system
    ```

    Or, if the configuration is bad, edit the CoreDNS ConfigMap:

    ```bash
    kubectl edit configmap coredns -n kube-system
    ```

    Ensure the `forward` directive is correctly set (e.g., `forward . /etc/resolv.conf`).

  - Why it works: Kubernetes services are typically resolved via DNS. If the cluster’s DNS service (CoreDNS) is down, overloaded, or misconfigured, pods cannot find the IP addresses of other services, so connections fail before they are even attempted; stale or wrong records can also point a pod at an address that no longer exists, which surfaces as low-level errors like "no route to host".
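The reason a short name like `backend-api` resolves at all inside a pod is the `search` list the kubelet writes into the pod’s `/etc/resolv.conf`. A rough sketch of that expansion logic (namespace `shop` and the default cluster domain are illustrative assumptions, and `ndots` handling is simplified):

```python
def candidate_fqdns(name: str, search: list, ndots: int = 5) -> list:
    """Order of lookups a resolver tries, per resolv.conf search/ndots rules (simplified)."""
    if name.endswith("."):
        return [name]  # already fully qualified, no expansion
    candidates = [f"{name}.{domain}" for domain in search]
    absolute = name + "."
    # Names with >= ndots dots are tried as-is first; short names go through the search list first.
    return [absolute] + candidates if name.count(".") >= ndots else candidates + [absolute]

# Search list the kubelet would write for a pod in a hypothetical "shop" namespace.
search = ["shop.svc.cluster.local", "svc.cluster.local", "cluster.local"]

print(candidate_fqdns("backend-api", search)[0])
# -> backend-api.shop.svc.cluster.local
```

This is also why every lookup of a short name costs several round-trips to CoreDNS: an overloaded DNS service amplifies into timeouts across the whole cluster.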
- Backend API Service Unhealthy/Down:

  - Diagnosis: Check the status of the `backend-api` pods and their readiness/liveness probes:

    ```bash
    kubectl get pods -n <namespace> -l app=backend-api
    kubectl describe pod <backend-api-pod-name> -n <namespace>
    ```

    Look for pods in `CrashLoopBackOff`, `Error`, or not-`Ready` states, and check for probe failures in the `describe` output.

  - Fix: If pods are unhealthy, investigate the `backend-api` application itself. This might involve looking at its logs, checking its dependencies, or redeploying a stable version. If probes are failing, adjust probe parameters (e.g., `initialDelaySeconds`, `periodSeconds`, `timeoutSeconds`) or fix the application bug causing the probe to fail.

    ```yaml
    # Example of adjusting a readiness probe
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
      timeoutSeconds: 5
      failureThreshold: 3
    ```

  - Why it works: If the `backend-api` service has no healthy pods available, the Kubernetes Service abstraction has no endpoints to route traffic to. The `frontend-web` service, attempting to connect to the `backend-api` Service’s ClusterIP, will time out or fail to establish a connection, manifesting as "no route to host" if the underlying network stack can’t find a valid destination.
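The readiness probe above just issues an HTTP GET and treats a 2xx response as healthy. A minimal Python sketch of the endpoint it expects — a stand-in for whatever `backend-api` actually runs, and deliberately naive (a production check should verify critical dependencies, not unconditionally return 200):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # A real check would also verify critical dependencies (DB, caches) here.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        # Keep high-frequency probe traffic out of the request log.
        pass

def serve(port: int = 8080) -> HTTPServer:
    """Bind the health endpoint; the kubelet will GET /healthz on this port."""
    return HTTPServer(("", port), Health)  # caller runs .serve_forever()
```

A failing `/healthz` removes the pod from the Service’s endpoints without killing it, which is exactly the behavior you want while the underlying bug is investigated.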
- Resource Exhaustion on Backend API:

  - Diagnosis: Monitor CPU and memory usage for `backend-api` pods:

    ```bash
    kubectl top pods -n <namespace> -l app=backend-api
    ```

    Also check whether `backend-api` pods are being OOMKilled:

    ```bash
    kubectl get events -n <namespace> | grep backend-api
    ```

    Look for `OOMKilled` events.

  - Fix: Increase the resource requests and limits for the `backend-api` deployment:

    ```yaml
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "1"
        memory: "2Gi"
    ```

    Alternatively, scale up the number of `backend-api` replicas:

    ```bash
    kubectl scale deployment backend-api --replicas=5 -n <namespace>
    ```

  - Why it works: If the `backend-api` service is overwhelmed by requests (either due to high traffic or a code inefficiency), it can consume all its allocated CPU or memory. This can leave the application unresponsive, crash it, or get it terminated by the kubelet (OOMKilled). When `frontend-web` tries to connect, there are no healthy `backend-api` instances to respond.
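Picking a replica count is back-of-the-envelope capacity arithmetic: replicas must cover the measured peak with headroom to spare. A sketch of that math (the per-pod throughput number is a made-up assumption; measure yours under load, don’t guess):

```python
import math

def replicas_needed(peak_rps: float, per_pod_rps: float, headroom: float = 0.3) -> int:
    """Pods required to absorb peak traffic while keeping spare headroom for spikes."""
    usable = per_pod_rps * (1 - headroom)  # never plan to run pods at 100% capacity
    return max(1, math.ceil(peak_rps / usable))

# Hypothetical numbers: 1200 req/s at peak, each pod saturates around 350 req/s.
print(replicas_needed(1200, 350))  # -> 5
```

Running the numbers before scaling also tells you whether scaling is even the right fix: if a single pod’s throughput collapsed recently, the root cause is a regression, not traffic.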
- Incorrect Service Definition or Endpoints:

  - Diagnosis: Verify the `backend-api` Service definition and its associated Endpoints:

    ```bash
    kubectl get svc backend-api -n <namespace> -o yaml
    kubectl get endpoints backend-api -n <namespace> -o yaml
    ```

    Ensure the `selector` in the Service matches the labels on the `backend-api` pods, and that the Endpoints object lists the correct IP addresses and ports of healthy `backend-api` pods.

  - Fix: Correct the `selector` in the Service definition if it doesn’t match the pod labels:

    ```yaml
    # Example: correcting the selector
    spec:
      selector:
        app: backend-api  # Ensure this matches the pod labels
    ```

    If the Endpoints object is empty or incorrect, it usually indicates a problem with the `backend-api` pods’ labels or their readiness probes not passing. Fix those underlying issues.

  - Why it works: The Kubernetes Service object acts as a stable IP and DNS name for a set of pods. It dynamically populates its Endpoints object with the IPs and ports of pods that match its selector and are ready to receive traffic. If the selector is wrong, no pods will be associated; if pods aren’t ready, they won’t appear in Endpoints. Without valid endpoints, the Service cannot route traffic.
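The endpoints-controller behavior just described reduces to "labels match AND pod is Ready". A toy Python model of that selection makes both failure modes concrete (pod IPs and the label typo are illustrative):

```python
def endpoints_for(service_selector: dict, pods: list) -> list:
    """IPs the Service routes to: selector labels match AND readiness is passing."""
    return [pod["ip"] for pod in pods
            if all(pod["labels"].get(k) == v for k, v in service_selector.items())
            and pod["ready"]]

pods = [
    {"ip": "10.0.1.4", "labels": {"app": "backend-api"}, "ready": True},
    {"ip": "10.0.1.5", "labels": {"app": "backend-api"}, "ready": False},  # probe failing
    {"ip": "10.0.1.6", "labels": {"app": "backend-apl"}, "ready": True},   # label typo
]

print(endpoints_for({"app": "backend-api"}, pods))   # -> ['10.0.1.4']
print(endpoints_for({"app": "backend-apii"}, pods))  # -> [] (bad selector: no endpoints at all)
```

An empty list is the state `kubectl get endpoints` reveals during this class of outage: the Service exists and resolves, but there is nowhere to send the traffic.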
- External Network Issue (Less Common in Cluster, More for External Dependencies):

  - Diagnosis: If `backend-api` itself depends on an external service, check connectivity from `backend-api` pods to that external service:

    ```bash
    kubectl exec -it <backend-api-pod-name> -n <namespace> -- curl <external-service-url>
    ```

    If `backend-api` is reporting general network instability, check the node the pods are running on for network issues.

  - Fix: Address the external dependency or node-level network problem. This might involve contacting the provider of the external service, or investigating the underlying network configuration on the Kubernetes node.

  - Why it works: While the "no route to host" error was observed from `frontend-web` to `backend-api`, the root cause sometimes lies in `backend-api`’s ability to function at all. If `backend-api` is failing to start or process requests because of its own external network issues, `frontend-web` will see it as unavailable.
After fixing these, the next error might be a 500 Internal Server Error if the `backend-api` application itself has a bug that only surfaces when it’s actually receiving traffic, or a 504 Gateway Timeout if `backend-api` is now reachable but taking too long to respond.
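That 503-versus-504 distinction is usually decided by how the frontend (or its proxy) classifies the backend call’s failure. A hedged sketch of the mapping, with the backend call stubbed out as a plain callable rather than a real HTTP client:

```python
import socket

def gateway_status(call_backend) -> int:
    """Map a backend call's outcome to the status the client sees (simplified)."""
    try:
        call_backend()
        return 200
    except socket.timeout:
        return 504  # reachable but too slow: Gateway Timeout
    except ConnectionError:
        return 503  # could not connect at all: Service Unavailable

# Stubs simulating the two failure modes discussed above.
def unreachable():
    raise ConnectionRefusedError("no route to host")  # a ConnectionError subclass

def too_slow():
    raise socket.timeout("read timed out")

print(gateway_status(unreachable))  # -> 503
print(gateway_status(too_slow))     # -> 504
```

Keeping this classification honest matters for the next incident: a 504 points you at backend latency and timeouts, while a 503 sends you back through the connectivity checklist above.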