SRE toil is the manual, repetitive, tactical work that comes from running a production service.
Here’s a Kubernetes cluster experiencing intermittent Pod restarts due to CrashLoopBackOff errors. This isn’t just a simple application crash; it’s the kubelet repeatedly restarting a failing container and backing off longer between each attempt. The interesting part is that the kubelet reports the Pod as unhealthy, but the root cause is often an upstream resource constraint or misconfiguration that prevents the Pod from even starting correctly.
Here are the common culprits:
- **Resource Limits Exceeded:** The `Pod` is requesting more CPU or memory than the node can provide, or its configured limits are too low and it’s being OOMKilled.
  - Diagnosis: `kubectl describe pod <pod-name> -n <namespace>`. Look for `OOMKilled` in the `Last State` or `Reason` fields. Also, `kubectl top pod <pod-name> -n <namespace>` can show current resource usage. Check `kubectl describe node <node-name>` for overall node capacity and allocatable resources.
  - Fix: Increase resource requests and limits in the `Pod`’s YAML. For example, change `resources: { requests: { cpu: "200m", memory: "256Mi" }, limits: { cpu: "500m", memory: "512Mi" } }` to `resources: { requests: { cpu: "400m", memory: "512Mi" }, limits: { cpu: "1000m", memory: "1024Mi" } }`.
  - Why it works: This gives the container more guaranteed resources and a higher ceiling, preventing it from being terminated due to memory pressure or CPU starvation.
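As a full manifest, the corrected resource stanza might look like the sketch below (the Pod name, container name, and image are placeholders; the numbers are the example values above, which you would tune to your workload):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                 # hypothetical Pod name
spec:
  containers:
    - name: app
      image: myregistry.com/myimage:v1.2.3
      resources:
        requests:              # the scheduler guarantees these amounts
          cpu: "400m"
          memory: "512Mi"
        limits:                # the ceiling; exceeding the memory limit
          cpu: "1000m"         # gets the container OOMKilled, exceeding
          memory: "1024Mi"     # the CPU limit only throttles it
```

Note the asymmetry in the comments: breaching a memory limit is fatal to the container, while breaching a CPU limit merely throttles it, which is why OOMKilled shows up in `Last State` but CPU starvation shows up as latency.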
- **ImagePullBackOff / ErrImagePull:** The `kubelet` cannot pull the container image.
  - Diagnosis: `kubectl describe pod <pod-name> -n <namespace>`. Look for `ImagePullBackOff` or `ErrImagePull` in the `Events` section. Check `kubectl get pods -n <namespace> -o wide` to see which node the pod is scheduled on, then `ssh <node-name>` and run `sudo journalctl -u kubelet -f` to see more detailed logs from the `kubelet` trying to pull the image.
  - Fix: Ensure the image name and tag are correct in the `Pod` spec; for example, if the image is `myregistry.com/myimage:v1.2.3`, ensure it’s spelled exactly like that. If using a private registry, verify that the `imagePullSecrets` are correctly configured in the `Pod` or `ServiceAccount`, and that the referenced secret exists and is valid in the same namespace: `kubectl get secret my-registry-secret -n <namespace>`.
  - Why it works: This resolves the issue by pointing the `kubelet` to the correct image location or providing valid credentials for private repositories, allowing it to download the necessary container image.
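Wired together, a minimal private-registry sketch might look like this (the registry URL, secret name, and image are the example values from above, not real resources):

```yaml
# Assumes the secret was created beforehand, e.g.:
#   kubectl create secret docker-registry my-registry-secret \
#     --docker-server=myregistry.com \
#     --docker-username=<user> --docker-password=<password>
apiVersion: v1
kind: Pod
metadata:
  name: my-app                     # hypothetical Pod name
spec:
  imagePullSecrets:
    - name: my-registry-secret     # must exist in this namespace
  containers:
    - name: app
      image: myregistry.com/myimage:v1.2.3   # exact name and tag matter
```

If the pull secret lives on a `ServiceAccount` instead, every Pod using that account inherits it, which is usually less error-prone than repeating `imagePullSecrets` per Pod.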
- **Liveness/Readiness Probe Failure:** The application inside the `Pod` is not responding to its probes within the configured timeout. A failing readiness probe removes the `Pod` from Service endpoints; a failing liveness probe causes Kubernetes to restart the container, which is what feeds CrashLoopBackOff.
  - Diagnosis: `kubectl describe pod <pod-name> -n <namespace>`. Check the `Events` for repeated probe failures (e.g., `Readiness probe failed: HTTP request failed`). Also, `kubectl logs <pod-name> -n <namespace>` will show application logs that might indicate why it’s not ready.
  - Fix: Adjust the probe’s `initialDelaySeconds`, `periodSeconds`, `timeoutSeconds`, or `failureThreshold`. For example, increase `initialDelaySeconds` from `5` to `15` if the application takes longer to start: `livenessProbe: { httpGet: { path: /healthz, port: 8080 }, initialDelaySeconds: 15, periodSeconds: 10 }`.
  - Why it works: This allows the application more time to initialize, or increases the tolerance for temporary unresponsiveness, before Kubernetes considers the container unhealthy and restarts it.
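Expanded into a container spec, the tuned probes might look like the following sketch, assuming the application exposes an HTTP health endpoint at `/healthz` on port 8080 (both the endpoint and the timings are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                   # hypothetical Pod name
spec:
  containers:
    - name: app
      image: myregistry.com/myimage:v1.2.3
      readinessProbe:            # failure: removed from Service endpoints
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 15  # was 5; app needs longer to start
        periodSeconds: 10
      livenessProbe:             # failure: container is restarted
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 30  # keep liveness more lenient than readiness
        periodSeconds: 10
        failureThreshold: 3      # 3 consecutive failures before restart
```

Keeping the liveness probe more lenient than the readiness probe is a common design choice: a slow start should take the Pod out of rotation, not kill it.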
- **Persistent Volume Claim (PVC) Not Bound:** The `Pod` is waiting for a `PersistentVolumeClaim` to be bound to a `PersistentVolume`, but this binding is failing.
  - Diagnosis: `kubectl describe pvc <pvc-name> -n <namespace>`. Look for events indicating why it’s not binding (e.g., no matching PVs, storage class issues). Also, `kubectl get pv` and `kubectl get sc` to check available volumes and storage class configurations.
  - Fix: Ensure a `PersistentVolume` exists that matches the `storageClassName` and `accessModes` requested by the PVC, or that the `StorageClass` is correctly configured to dynamically provision volumes. For example, if the PVC requests `storageClassName: gp2` and `accessModes: [ReadWriteOnce]`, ensure a `gp2` storage class is available and can provision a `ReadWriteOnce` volume.
  - Why it works: This ensures that the underlying storage is available and correctly provisioned or bound, allowing the `Pod` to mount its required volumes and start successfully.
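A matching PVC for the `gp2` example above might look like this sketch (the claim name and size are placeholders; `gp2` must correspond to a `StorageClass` that actually exists in your cluster):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data                # hypothetical claim name
spec:
  storageClassName: gp2        # must match an existing StorageClass
  accessModes:
    - ReadWriteOnce            # the class must support this mode
  resources:
    requests:
      storage: 10Gi            # illustrative size
```

With dynamic provisioning, the `StorageClass`’s provisioner creates a matching `PersistentVolume` on demand; without it, an administrator must pre-create a PV whose class, mode, and capacity satisfy the claim.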
- **Network Policy Blocking Traffic:** A `NetworkPolicy` is preventing the `Pod` from receiving necessary inbound or outbound traffic, leading to timeouts or connection failures that trip health checks.
  - Diagnosis: `kubectl get networkpolicy -n <namespace>`. Review policies to see if they are overly restrictive. Use `tcpdump` on the node where the pod is running to inspect traffic to/from the pod’s IP address.
  - Fix: Modify the `NetworkPolicy` to allow the required traffic. For instance, if a `Pod` needs to receive traffic on port 8080 from other pods in the same namespace, a policy might combine `podSelector: {}` with `ingress: [ { from: [ { podSelector: {} } ], ports: [ { protocol: TCP, port: 8080 } ] } ]`.
  - Why it works: This relaxes network restrictions, permitting the `Pod` to communicate with other services or receive external requests essential for its operation and health checks.
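Written out as a full manifest, the same-namespace port-8080 allowance might look like this sketch (the policy name is a placeholder):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-8080-same-ns     # hypothetical policy name
spec:
  podSelector: {}              # empty selector: applies to all Pods
                               # in this namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}      # any Pod in the same namespace
      ports:
        - protocol: TCP
          port: 8080
```

One caveat worth noting: kubelet probes originate from the node, not from a peer Pod, so on most CNI implementations they are not blocked by `NetworkPolicy`; policies usually break Pod-to-Pod and Service traffic rather than the probes themselves.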
- **Init Container Failures:** The `Pod` has `initContainers` that are failing to complete their tasks before the main application containers start.
  - Diagnosis: `kubectl logs <pod-name> -c <init-container-name> -n <namespace>`. The `initContainers` run to completion before application containers. If they fail, the `Pod` will restart with `CrashLoopBackOff`.
  - Fix: Debug the `initContainer`’s logic by examining its logs. Common issues include incorrect configuration, missing dependencies, or failure to perform its setup task (e.g., initializing a database schema, downloading configuration files). For example, if an `initContainer` fails because it can’t download a config file, fix the `initContainer`’s command or mount a `ConfigMap` correctly.
  - Why it works: This ensures that prerequisite tasks are successfully completed by the `initContainers`, allowing the main application containers to start in a properly configured environment.
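The `ConfigMap`-mounting fix can be sketched as follows, assuming a hypothetical `app-config` ConfigMap that the init container stages into a shared volume for the main container (all names here are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                 # hypothetical Pod name
spec:
  initContainers:
    - name: stage-config
      image: busybox:1.36
      # Copies the config from the mounted ConfigMap into the shared
      # volume; a non-zero exit here keeps the main container from starting.
      command: ["sh", "-c", "cp /config-src/app.conf /shared/app.conf"]
      volumeMounts:
        - name: config
          mountPath: /config-src
        - name: shared
          mountPath: /shared
  containers:
    - name: app
      image: myregistry.com/myimage:v1.2.3
      volumeMounts:
        - name: shared
          mountPath: /etc/app  # app reads /etc/app/app.conf
  volumes:
    - name: config
      configMap:
        name: app-config       # hypothetical ConfigMap; must exist
    - name: shared
      emptyDir: {}
```

Because init containers run strictly in order and must each exit successfully, `kubectl logs <pod-name> -c stage-config` pinpoints exactly which setup step failed.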
The next error you’ll likely encounter after resolving CrashLoopBackOff is ImagePullBackOff if your image registry is temporarily unavailable, or a probe-related error if your application logic is still flawed.