Runbooks are typically static documents detailing steps to resolve incidents or perform routine tasks. Converting them to code transforms these passive guides into active, automated solutions that can execute resolutions directly, reducing human error and response time.

Let’s see this in action with a common SRE task: restarting a misbehaving service.

Imagine a Redis service on a Kubernetes cluster that’s become unresponsive. The traditional runbook might look like this:

  1. Check Redis status: kubectl get pods -l app=redis
  2. Identify the problematic pod: Look for CrashLoopBackOff or Error states.
  3. Delete the pod: kubectl delete pod <pod-name>
  4. Verify restart: kubectl get pods -l app=redis
  5. Check Redis logs: kubectl logs <new-pod-name>
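Even before full automation, the steps above can be captured as data rather than prose, which makes the runbook lintable, versionable, and executable by a harness. A minimal sketch (the field names and `render_runbook` helper are illustrative, not part of any standard):

```python
# A runbook expressed as data: each step pairs a description with the
# command (or manual check) that implements it. A runner can then print
# or execute the steps in order.
RUNBOOK = [
    {"step": 1, "desc": "Check Redis status",
     "cmd": "kubectl get pods -l app=redis"},
    {"step": 2, "desc": "Identify the problematic pod",
     "cmd": None},  # manual judgment: look for CrashLoopBackOff / Error
    {"step": 3, "desc": "Delete the pod",
     "cmd": "kubectl delete pod <pod-name>"},
    {"step": 4, "desc": "Verify restart",
     "cmd": "kubectl get pods -l app=redis"},
    {"step": 5, "desc": "Check Redis logs",
     "cmd": "kubectl logs <new-pod-name>"},
]

def render_runbook(runbook):
    """Return the runbook as numbered text, e.g. to paste into an incident channel."""
    lines = []
    for s in runbook:
        suffix = f": {s['cmd']}" if s["cmd"] else ""
        lines.append(f"{s['step']}. {s['desc']}{suffix}")
    return "\n".join(lines)
```

Once the steps live in a structure like this, the jump to executing them programmatically is small.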

Now, let’s turn this into an automated script using Python and the Kubernetes Python client.

from kubernetes import client, config
import time

def restart_redis_pod():
    # Load Kubernetes configuration
    config.load_kube_config() # or config.load_incluster_config() if running inside the cluster

    v1 = client.CoreV1Api()

    # Define the label selector for Redis pods
    redis_label_selector = "app=redis"
    namespace = "default" # Assuming default namespace, adjust if needed

    try:
        # 1. Check Redis status
        pods = v1.list_namespaced_pod(namespace, label_selector=redis_label_selector)

        problematic_pod_name = None
        for pod in pods.items:
            # 2. Identify the problematic pod
            if pod.status.phase == "Running" and pod.status.container_statuses:
                for container_status in pod.status.container_statuses:
                    if container_status.state.waiting and container_status.state.waiting.reason == "CrashLoopBackOff":
                        problematic_pod_name = pod.metadata.name
                        print(f"Found problematic Redis pod: {problematic_pod_name}")
                        break
                if problematic_pod_name:
                    break
            elif pod.status.phase in ["Failed", "Unknown"]:
                problematic_pod_name = pod.metadata.name
                print(f"Found problematic Redis pod (phase: {pod.status.phase}): {problematic_pod_name}")
                break

        if not problematic_pod_name:
            print("No problematic Redis pods found. Service appears healthy.")
            return

        # 3. Delete the pod
        print(f"Deleting pod: {problematic_pod_name}...")
        v1.delete_namespaced_pod(name=problematic_pod_name, namespace=namespace, body=client.V1DeleteOptions())
        print(f"Pod {problematic_pod_name} deleted.")

        # Wait for the pod to be terminated and a new one to start
        print("Waiting for new Redis pod to start...")
        time.sleep(10) # Give Kubernetes some time to reschedule and create a new pod

        # 4. Verify restart
        pods_after_restart = v1.list_namespaced_pod(namespace, label_selector=redis_label_selector)
        new_pod_found = False
        for pod in pods_after_restart.items:
            # NOTE: comparing names only works for Deployment-managed pods,
            # which get fresh generated names. StatefulSet pods are recreated
            # with the same name; compare pod.metadata.uid in that case.
            if pod.metadata.name != problematic_pod_name and pod.status.phase == "Running":
                print(f"New Redis pod running: {pod.metadata.name}")
                new_pod_found = True
                # 5. Check Redis logs (optional, often done by a separate monitoring tool)
                # print(f"Fetching logs for {pod.metadata.name}...")
                # logs = v1.read_namespaced_pod_log(name=pod.metadata.name, namespace=namespace)
                # print(logs)
                break

        if not new_pod_found:
            print("New Redis pod did not start successfully. Further investigation needed.")

    except client.ApiException as e:
        print(f"Kubernetes API error: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

if __name__ == "__main__":
    restart_redis_pod()

This script encapsulates the runbook logic. It uses the Kubernetes API to interact directly with the cluster. The config.load_kube_config() line tells the client to use your local ~/.kube/config file, assuming you have kubectl configured. If this script were running inside the Kubernetes cluster (e.g., as a Job or within a pod), you’d use config.load_incluster_config().

The core of the automation lies in identifying the problematic pod using its status and phase, then issuing a delete command. Kubernetes’ Deployment or StatefulSet controller, which is managing the Redis pods, will automatically detect the missing pod and create a replacement based on its template. This is the "magic" behind the automated restart. The script then waits briefly and checks for a running pod with a different name, confirming the replacement.
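The fixed `time.sleep(10)` is the weakest part of the script: under load, rescheduling can take longer, and on a quiet cluster it wastes time. A small polling helper makes the wait condition explicit. This is a generic sketch (the `wait_for` name and injectable `clock`/`sleep` parameters are ours, added so the helper can be tested without real delays):

```python
import time

def wait_for(predicate, timeout=60.0, interval=2.0,
             clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns a truthy value or timeout elapses.

    Returns the truthy value on success, or None on timeout.
    clock and sleep are injectable to make the helper testable.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        result = predicate()
        if result:
            return result
        sleep(interval)
    return None
```

In the script, the predicate would list the pods again and return the first Running pod that is not the one we deleted, replacing the blind sleep with a bounded, condition-based wait.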

The mental model here is shifting from "human reads instructions, human executes commands" to "script reads state, script executes commands, system (Kubernetes) reacts." The problem this solves is the inherent latency and error-proneness of manual intervention during incidents. When seconds count, an automated script can be the difference between a minor blip and a major outage.
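The "script reads state" half of that model boils down to a predicate over pod status. Factoring the script's detection criteria into a pure function makes the policy unit-testable without a cluster; a sketch (the function name is ours, not part of the Kubernetes client):

```python
def is_pod_problematic(phase, waiting_reasons=()):
    """Mirror the script's policy: a pod is problematic if its phase is
    Failed or Unknown, or if it is Running but a container is waiting
    in CrashLoopBackOff.

    phase:           the pod's status.phase string
    waiting_reasons: the state.waiting.reason values of its containers
    """
    if phase in ("Failed", "Unknown"):
        return True
    if phase == "Running" and "CrashLoopBackOff" in waiting_reasons:
        return True
    return False
```

Keeping the policy separate from the Kubernetes API calls means you can tighten or extend "what counts as broken" (say, adding ImagePullBackOff) and cover it with tests before it ever touches production.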

The levers you control are primarily the label_selector and namespace to target the correct resources, and the logic within the script to identify what constitutes a "problematic" state. You can extend this by adding more sophisticated health checks, integrating with alerting systems (e.g., Prometheus Alertmanager), or adding rollback steps.
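As an example of the alerting integration, Prometheus Alertmanager delivers webhooks as JSON payloads containing a list of alerts, each with a status and a set of labels. A small parser can turn such a payload into the namespace and label selector the restart script needs. A sketch, assuming your alerts carry `namespace` and `app` labels (your labeling scheme may differ):

```python
import json

def targets_from_alertmanager(payload_json):
    """Extract (namespace, label_selector) pairs from an Alertmanager
    webhook payload. Only alerts that are currently firing are considered."""
    payload = json.loads(payload_json)
    targets = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue
        labels = alert.get("labels", {})
        namespace = labels.get("namespace", "default")
        app = labels.get("app")
        if app:
            targets.append((namespace, f"app={app}"))
    return targets
```

A webhook receiver would call this on each POST and invoke the restart logic for every returned target, turning the alert itself into the trigger.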

One of the most powerful aspects of runbook automation is its ability to perform actions that are too dangerous or too repetitive for humans to do regularly. For instance, automatically scaling down a fleet of services during off-peak hours, or performing a complex, multi-step rollback across several dependent services based on specific failure criteria. This isn’t just about convenience; it’s about enabling a higher level of operational maturity.
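The off-peak scale-down example reduces to a small, testable policy decision, with the actual scaling performed separately through the Kubernetes Deployments API (for instance, patching `spec.replicas` via `AppsV1Api`). A sketch of just the policy, with illustrative hour boundaries and replica counts:

```python
def desired_replicas(hour, peak_replicas=10, offpeak_replicas=2,
                     peak_start=8, peak_end=20):
    """Return the replica count for a given hour of day (0-23).

    Peak hours are the half-open range [peak_start, peak_end);
    every other hour is treated as off-peak.
    """
    if peak_start <= hour < peak_end:
        return peak_replicas
    return offpeak_replicas
```

Separating the decision from the API call keeps the dangerous part (mutating the cluster) behind a policy you can review and test on its own.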

The next step in this journey is often integrating such scripts into a larger incident response framework, perhaps triggered by an alert, or building more complex workflows that orchestrate actions across multiple services.

Want structured learning?

Take the full SRE course →