The most surprising truth about SRE Root Cause Analysis (RCA) is that it’s not primarily about finding the single bug, but about building a system that prevents any bug from causing the same type of outage twice.

Let’s watch an SRE team tackle a "service unavailable" incident.

The Incident: Users are reporting 503 Service Unavailable errors on the primary e-commerce site.

Initial Triage:

  • Metrics: The requests_per_second metric for the frontend-web service has dropped to zero. The error_rate for frontend-web is 1.0. The cpu_usage for frontend-web pods is 95%. The network_traffic from frontend-web to backend-api has also dropped.
  • Logs: frontend-web logs show repeated dial tcp <backend-api-ip>:8080: connect: no route to host errors.
  • Alerts: An alert fires: High CPU Usage on frontend-web pods.

The Problem: The frontend-web service cannot reach the backend-api service, so it fails to process requests and returns 503 errors. This is interesting because it’s not a simple deployment bug; it’s a communication breakdown, and the high-CPU alert is a symptom rather than the cause, with frontend-web likely burning cycles retrying connections to a backend it can no longer reach.
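
Before digging into causes, it helps to see the failure mode in miniature. The sketch below is illustrative Python, not the site’s actual code; the host and port are placeholders. It shows how a frontend turns a failed TCP connection to its backend into a 503 for the user:

```python
# Illustrative sketch: frontend-web maps a failed backend connection
# to a 503 response. Host/port are placeholders, not real addresses.
import socket

def proxy_status(host: str, port: int, timeout: float = 1.0) -> int:
    """Return 200 if the backend accepts a TCP connection, else 503."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 200
    except OSError:  # "no route to host", "connection refused", timeouts
        return 503

# A port with no listener behaves like the unreachable backend-api:
print(proxy_status("127.0.0.1", 9))  # 503
```

The point is that a 503 at the edge often says nothing about where in the chain the connection actually died, which is why the diagnosis below works outward from the network.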

Root Cause Analysis - Common Causes & Fixes:

  1. Network Policy Blocking Traffic:

    • Diagnosis: Check Kubernetes Network Policies.
      kubectl get networkpolicy -n <namespace> -o yaml
      
      Look for policies that might restrict egress from frontend-web to backend-api’s port 8080.
    • Fix: If a restrictive policy is found, update it to allow the necessary traffic. For example, if a policy was too specific:
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: allow-frontend-to-backend
        namespace: <namespace>
      spec:
        podSelector:
          matchLabels:
            app: frontend-web
        policyTypes:
        - Egress
        egress:
        - to:
          - podSelector:
              matchLabels:
                app: backend-api
          ports:
          - protocol: TCP
            port: 8080
      
    • Why it works: NetworkPolicies are Kubernetes’ built-in firewall, enforced by the CNI plugin (such as Calico or Cilium). Pods allow all traffic by default, but as soon as any egress policy selects the frontend-web pods, every connection not explicitly permitted is dropped. This fix explicitly permits the connection to backend-api on port 8080.
  2. DNS Resolution Failure for Backend Service:

    • Diagnosis: Exec into a frontend-web pod and try to resolve the backend-api service name.
      kubectl exec -it <frontend-web-pod-name> -n <namespace> -- nslookup backend-api.<namespace>.svc.cluster.local
      
      If this fails or times out, DNS is the issue.
    • Fix: Check the health of the CoreDNS pods in the kube-system namespace.
      kubectl get pods -n kube-system -l k8s-app=kube-dns
      kubectl logs <coredns-pod-name> -n kube-system
      
      If CoreDNS pods are unhealthy or logging errors, restart them (the Deployment recreates each deleted pod automatically):
      kubectl delete pod <coredns-pod-name> -n kube-system
      
      Or, if the configuration is bad, edit the CoreDNS ConfigMap:
      kubectl edit configmap coredns -n kube-system
      
      Ensure the forward directive is correctly set (e.g., forward . /etc/resolv.conf).
    • Why it works: Kubernetes services are typically resolved via DNS. If the cluster’s DNS service (CoreDNS) is down, overloaded, or misconfigured, pods cannot find the IP addresses of other services. Pure resolution failures usually surface as "no such host", but stale or incorrect records that point at an unreachable address produce exactly the "no route to host" errors seen in the logs.
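
The resolution step the diagnosis above exercises can be sketched in a few lines of Python. The hostnames are illustrative; the cluster-internal service name would only resolve from inside the cluster:

```python
# Sketch of the DNS lookup an application performs before dialing a service.
import socket

def resolve(hostname: str):
    """Return the first resolved IP for hostname, or None on failure."""
    try:
        return socket.getaddrinfo(hostname, None)[0][4][0]
    except socket.gaierror:  # resolution failures ("no such host") land here
        return None

assert resolve("localhost") is not None        # resolvable via /etc/hosts
assert resolve("backend-api.invalid") is None  # .invalid never resolves (RFC 2606)
```

If `resolve` returns None for the service name from inside a pod, the problem is upstream of any connection attempt, and CoreDNS is the place to look.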
  3. Backend API Service Unhealthy/Down:

    • Diagnosis: Check the status of the backend-api pods and their readiness/liveness probes.
      kubectl get pods -n <namespace> -l app=backend-api
      kubectl describe pod <backend-api-pod-name> -n <namespace>
      
      Look for pods in CrashLoopBackOff, Error, or NotReady states. Check probe failures in the describe output.
    • Fix: If pods are unhealthy, investigate the backend-api application itself. This might involve looking at its logs, checking its dependencies, or redeploying a stable version. If probes are failing, adjust probe parameters (e.g., initialDelaySeconds, periodSeconds, timeoutSeconds) or fix the application bug causing the probe to fail.
      # Example of adjusting readiness probe
      readinessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 20
        timeoutSeconds: 5
        failureThreshold: 3
      
    • Why it works: If the backend-api service has no healthy pods available, the Kubernetes Service abstraction has no endpoints to route traffic to. The frontend-web service, attempting to connect to the backend-api service’s ClusterIP, will fail to establish a connection: typically a timeout or "connection refused", though some kube-proxy and CNI configurations reject the packet in a way that surfaces as "no route to host".
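
The probe example above calls GET /healthz on port 8080. A minimal stand-in for such a handler (illustrative Python, not backend-api’s real code) shows what the kubelet is actually checking:

```python
# Illustrative /healthz handler of the kind a readinessProbe hits.
import http.server
import threading
import urllib.request

class HealthHandler(http.server.BaseHTTPRequestHandler):
    healthy = True  # flip to False to simulate a failing probe

    def do_GET(self):
        # Mirrors the readinessProbe: GET /healthz should return 200
        if self.path == "/healthz" and self.healthy:
            self.send_response(200)
        else:
            self.send_response(503)  # kubelet would mark the pod NotReady
        self.end_headers()

    def log_message(self, *args):  # keep output quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), HealthHandler)  # ephemeral port
threading.Thread(target=server.serve_forever, daemon=True).start()

port = server.server_address[1]
status = urllib.request.urlopen(f"http://127.0.0.1:{port}/healthz").getcode()
print(status)  # 200 while healthy is True
```

Anything other than a 2xx within `timeoutSeconds`, repeated `failureThreshold` times, takes the pod out of the Service’s endpoints.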
  4. Resource Exhaustion on Backend API:

    • Diagnosis: Monitor CPU and memory usage for backend-api pods.
      kubectl top pods -n <namespace> -l app=backend-api
      
      Also, check if backend-api pods are being OOMKilled.
      kubectl get events -n <namespace> | grep backend-api
      
      Look for OOMKilled events.
    • Fix: Increase the resource requests and limits for the backend-api deployment.
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
        limits:
          cpu: "1"
          memory: "2Gi"
      
      Alternatively, scale up the number of backend-api replicas.
      kubectl scale deployment backend-api --replicas=5 -n <namespace>
      
    • Why it works: If the backend-api service is overwhelmed by requests (either due to high traffic or a code inefficiency), it can consume all its allocated CPU or memory. This can lead to the application becoming unresponsive, crashing, or being terminated by the kubelet (OOMKilled). When the frontend-web tries to connect, there are no healthy backend-api instances to respond.
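
Choosing the replica count in the scale-up fix need not be a guess. A back-of-the-envelope sizing calculation helps; the traffic and per-pod capacity numbers below are assumptions, and in practice they come from load tests:

```python
# Rough replica sizing: assumed peak traffic and per-pod capacity.
import math

def replicas_needed(peak_rps: float, rps_per_pod: float, headroom: float = 0.3) -> int:
    """Pods required to absorb peak traffic with spare headroom."""
    return math.ceil(peak_rps * (1 + headroom) / rps_per_pod)

# Assumed numbers: 2000 req/s at peak, 500 req/s per pod, 30% headroom:
print(replicas_needed(2000, 500))  # 6
```

The headroom factor is there so a single pod failure or a traffic spike doesn’t immediately push the remaining pods back into saturation.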
  5. Incorrect Service Definition or Endpoint:

    • Diagnosis: Verify the backend-api Service definition and its associated Endpoints.
      kubectl get svc backend-api -n <namespace> -o yaml
      kubectl get endpoints backend-api -n <namespace> -o yaml
      
      Ensure the selector in the Service matches the labels on the backend-api pods, and that the Endpoints object lists the correct IP addresses and ports of healthy backend-api pods.
    • Fix: Correct the selector in the Service definition if it doesn’t match the pod labels.
      # Example: Correcting selector
      spec:
        selector:
          app: backend-api # Ensure this matches pod labels
      
      If the Endpoints object is empty or incorrect, it usually indicates a problem with the backend-api pods’ labels or their readiness probes not passing. Fix those underlying issues.
    • Why it works: The Kubernetes Service object acts as a stable IP and DNS name for a set of pods. It dynamically populates its Endpoints object with the IPs and ports of pods that match its selector and are ready to receive traffic. If the selector is wrong, no pods will be associated. If pods aren’t ready, they won’t appear in Endpoints. Without valid endpoints, the Service cannot route traffic.
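
The selector-to-Endpoints mechanism described above can be modeled in a few lines. The pod records are invented for illustration:

```python
# Toy model of Service -> Endpoints population: only pods whose labels
# match every selector key AND whose readiness probe passes get traffic.

def endpoints(pods, selector):
    """Return IPs of ready pods matching all selector labels."""
    return [
        pod["ip"]
        for pod in pods
        if pod["ready"]
        and all(pod["labels"].get(k) == v for k, v in selector.items())
    ]

pods = [
    {"ip": "10.0.0.1", "labels": {"app": "backend-api"}, "ready": True},
    {"ip": "10.0.0.2", "labels": {"app": "backend-api"}, "ready": False},  # failing probe
    {"ip": "10.0.0.3", "labels": {"app": "frontend-web"}, "ready": True},  # wrong labels
]

assert endpoints(pods, {"app": "backend-api"}) == ["10.0.0.1"]
# A mistyped selector leaves Endpoints empty, so the Service routes nowhere:
assert endpoints(pods, {"app": "backend-apl"}) == []
```

This is why checking `kubectl get endpoints` is so diagnostic: an empty Endpoints object pins the problem to either the selector or readiness, before you look anywhere else.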
  6. External Network Issue (Less Common Inside the Cluster, More Relevant for External Dependencies):

    • Diagnosis: If backend-api itself depends on an external service, check connectivity from backend-api pods to that external service.
      kubectl exec -it <backend-api-pod-name> -n <namespace> -- curl <external-service-url>
      
      If backend-api is reporting general network instability, check the node the pods are running on for network issues.
    • Fix: Address the external dependency or node-level network problem. This might involve contacting the provider of the external service, or investigating the underlying network configuration on the Kubernetes node.
    • Why it works: Although the "no route to host" error was observed between frontend-web and backend-api, the root cause can sit one hop further out: if backend-api crashes or fails to start because it cannot reach its own external dependencies, its pods never become ready, and frontend-web sees the service as unavailable.
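
When backend-api does depend on a flaky external service, a common mitigation is to fail fast rather than hang. Below is a minimal circuit-breaker sketch; the threshold and the notion of an upstream "call" are illustrative assumptions, not a production implementation:

```python
# Minimal circuit-breaker sketch: after `threshold` consecutive failures,
# stop calling the flaky upstream and fail fast instead of hanging.

class CircuitBreaker:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self):  # an open circuit means: stop calling the upstream
        return self.failures >= self.threshold

    def call(self, fn):
        if self.open:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
            self.failures = 0  # any success resets the count
            return result
        except Exception:
            self.failures += 1
            raise

breaker = CircuitBreaker(threshold=2)

def flaky():
    raise ConnectionError("upstream unreachable")

for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass

print(breaker.open)  # True: further calls fail fast, never touching the upstream
```

Failing fast keeps backend-api responsive enough to pass its probes, so frontend-web sees degraded answers instead of a dead service.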

After fixing these, the next error might be a 500 Internal Server Error if the backend-api application itself has a bug that only surfaces when it’s actually receiving traffic, or a 504 Gateway Timeout if the backend-api is now reachable but taking too long to respond.

Want structured learning?

Take the full Reliability Engineering (SRE) course →