The most surprising truth about SRE toil reduction is that the goal isn’t simply to eliminate toil, but to make it so boring, repetitive, and visible that it becomes impossible to ignore, and therefore impossible not to automate.

Imagine you’re on call, and every few hours, you get an alert: "Service X is slow." You log in, check the database connection pool, see it’s maxed out, restart the pool, and the alerts stop. This happens again an hour later. And again. It’s maddening, right? This is toil.

Let’s say you’re managing a fleet of web servers. You notice that every Tuesday, around 2 PM PST, you have to manually restart the web server process on about 10% of your machines because of a memory leak that only manifests under sustained load. This is toil.

Here’s how you might tackle that specific Tuesday restart problem:

  1. Identify the exact symptoms and scope: You know it’s a memory leak, affecting about 10% of servers, every Tuesday at 2 PM PST. You’ve confirmed it’s the web server process by checking top and htop for high memory usage and then systemctl status your-web-service to see the process.
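The manual confirmation in step 1 can itself be scripted so that checking stops being toil. A minimal sketch, assuming the `your-web-service` placeholder unit name used throughout this document (the fallback branch covers hosts without systemd):

```shell
#!/bin/sh
# Sketch: confirm which process is growing by reporting the unit's main PID
# and its resident memory. "your-web-service" is this document's placeholder.
SERVICE=your-web-service
PID=$(systemctl show -p MainPID --value "$SERVICE" 2>/dev/null || true)
if [ -n "$PID" ] && [ "$PID" != "0" ]; then
  # PID, resident set size (KB), elapsed run time, command name
  ps -o pid,rss,etime,comm -p "$PID"
else
  echo "unit ${SERVICE} not running (or no systemd on this host)"
fi
```

Run weekly, the RSS column alone is enough to show whether memory climbs toward Tuesday afternoon.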

  2. Gather diagnostic data: Before automating, you’d want to collect more data.

    • Command: journalctl -u your-web-service -S "2024-06-04 13:00" -U "2024-06-04 15:00" --output=json | jq '. + {host: ._HOSTNAME}' > /tmp/webserver_logs_tuesday.json
    • Purpose: This captures logs from the web service for the hour before and after your typical intervention time on the Tuesday in question (journalctl’s -S/-U flags take explicit timestamps or keywords like "yesterday", not phrases like "last Tuesday", so substitute the actual date). The jq filter tags each entry with the hostname from the journal’s _HOSTNAME field. Analyzing this JSON output with jq or a log aggregation platform can pinpoint the exact log messages preceding the memory growth.
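Once a JSON capture exists, a frequency count of normalized messages often surfaces the culprit quickly. A stand-alone sketch: it synthesizes two sample entries so the pipeline runs anywhere, and uses only POSIX sed for extraction (on real data, jq is the natural extractor):

```shell
#!/bin/sh
# Sketch: group log messages by shape (digits normalized to N) and count them.
# Sample entries are synthesized so the pipeline is runnable stand-alone.
LOGFILE=/tmp/webserver_logs_sample.json
printf '%s\n' \
  '{"MESSAGE":"cache grew to 9812 entries"}' \
  '{"MESSAGE":"cache grew to 9990 entries"}' > "$LOGFILE"

# Extract MESSAGE fields, normalize numbers, count identical patterns.
sed -n 's/.*"MESSAGE":"\([^"]*\)".*/\1/p' "$LOGFILE" \
  | sed 's/[0-9][0-9]*/N/g' \
  | sort | uniq -c | sort -rn
```

The most frequent pattern in the window before your intervention is usually the message to investigate first.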
  3. Develop a targeted fix: You discover the leak is caused by a specific caching module that doesn’t evict old entries under high load. The fix is to add a configuration parameter to the web server.

    • Configuration File: /etc/your-web-service/config.yaml
    • Change: Add the line cache.max_entries: 10000 to the config.yaml file.
    • Why it works: This explicitly limits the cache size, preventing it from growing indefinitely and thus stopping the memory leak.
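Before rolling the change out fleet-wide, it is worth verifying that the parameter actually landed in the file. A trivial sketch; the path is a stand-in for /etc/your-web-service/config.yaml, and the write step simulates the applied fix so the check is runnable anywhere:

```shell
#!/bin/sh
# Sketch: verify the cache-limit line is present before restarting anything.
# /tmp/config.yaml stands in for /etc/your-web-service/config.yaml.
CONF=/tmp/config.yaml
echo 'cache.max_entries: 10000' > "$CONF"   # simulate the applied fix
if grep -q '^cache\.max_entries: *10000$' "$CONF"; then
  echo "cache limit present"
else
  echo "cache limit MISSING" >&2
  exit 1
fi
```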
  4. Automate the deployment of the fix: You’ll use a configuration management tool like Ansible.

    • Ansible Playbook Snippet:
      ---
      - name: Apply web service memory leak fix
        hosts: webservers
        become: yes
        tasks:
          - name: Ensure cache limit is set
            ansible.builtin.lineinfile:
              path: /etc/your-web-service/config.yaml
              regexp: '^cache\.max_entries:'
              line: 'cache.max_entries: 10000'
              state: present
            notify: restart web service
      
        handlers:
          - name: restart web service
            ansible.builtin.systemd:
              name: your-web-service
              state: restarted
      
    • Why it works: This playbook ensures the cache.max_entries line exists and is set to 10000 in the configuration file on all hosts in the webservers group. If the line is added or changed, it triggers a handler to restart the your-web-service systemd unit.
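The webservers group the play targets comes from your Ansible inventory. A hypothetical YAML inventory fragment (hostnames are illustrative):

```yaml
# Hypothetical inventory defining the "webservers" group targeted by the play.
all:
  children:
    webservers:
      hosts:
        web01.example.com:
        web02.example.com:
```

Running ansible-playbook with --check --diff first previews the lineinfile edit and the handler it would trigger without changing anything.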
  5. Automate detection of the need for the fix (if the fix wasn’t permanent): If the fix was a temporary workaround (e.g., restarting a process) rather than a permanent code change, you’d automate both the detection and the remediation.

    • Monitoring Tool: Prometheus with Alertmanager.
    • PromQL Query: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
    • Alert Rule (in a Prometheus rules file; alerting rules are evaluated by Prometheus itself, and Alertmanager then routes the resulting notifications):
      - alert: LowMemoryOnWebServer
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low memory on {{ $labels.instance }}"
          description: "Instance {{ $labels.instance }} has less than 10% memory available."
      
    • Automation Action (via webhook to a script or another automation tool): When this alert fires, trigger a script that re-checks memory usage on the affected host. If usage exceeds a threshold (e.g., 80% of total RAM), the script automatically restarts the your-web-service unit.
      #!/bin/bash
      SERVICE="your-web-service"
      THRESHOLD_PERCENT=80
      
      # Get total and available memory in KB
      MEM_TOTAL=$(grep MemTotal /proc/meminfo | awk '{print $2}')
      MEM_AVAILABLE=$(grep MemAvailable /proc/meminfo | awk '{print $2}')
      MEM_USED_PERCENT=$(awk "BEGIN {printf \"%.0f\", 100 - (${MEM_AVAILABLE} * 100 / ${MEM_TOTAL})}")
      
      if [ "$MEM_USED_PERCENT" -gt "$THRESHOLD_PERCENT" ]; then
        echo "High memory usage detected: ${MEM_USED_PERCENT}%. Restarting ${SERVICE}."
        systemctl restart ${SERVICE}
      else
        echo "Memory usage acceptable: ${MEM_USED_PERCENT}%."
      fi
      
    • Why it works: Prometheus evaluates the rule and fires the alert when available memory stays below 10% for 5 minutes; Alertmanager then routes it to your webhook. The triggered script re-checks memory usage on the host at that moment and restarts the service only if usage is still above the threshold, preventing unnecessary restarts.
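The wiring between Alertmanager and the remediation script is left implicit above. One common approach is an Alertmanager webhook receiver pointed at a small HTTP listener that executes the script; a hypothetical config fragment, with the URL, port, and receiver names all assumptions:

```yaml
# Hypothetical Alertmanager routing fragment: send LowMemoryOnWebServer
# alerts to a local webhook endpoint that runs the restart script.
route:
  receiver: default
  routes:
    - matchers:
        - alertname="LowMemoryOnWebServer"
      receiver: webserver-remediation
receivers:
  - name: default
  - name: webserver-remediation
    webhook_configs:
      - url: http://127.0.0.1:9000/hooks/restart-webservice
```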
  6. The "Boring" Automation: You’ve now automated the detection and, if necessary, the remediation. This is great! But what if the original issue was that every Tuesday, you had to manually log in and check if the memory usage was high, even if it wasn’t yet critical?

    • The Toil: Manually checking a dashboard or running htop on 50 servers every Tuesday.
    • The Automation: A scheduled job that runs on a central orchestrator.
      #!/bin/bash
      # This script runs via cron every Tuesday at 1:30 PM PST
      # It checks memory usage on all web servers and logs it.
      
      WEB_SERVERS=("web01.example.com" "web02.example.com" "...") # List of your web servers
      
      for server in "${WEB_SERVERS[@]}"; do
        # SSH into the server and get memory usage percentage
        MEM_USAGE=$(ssh ${server} "free | awk '/^Mem:/ {printf \"%.0f\", \$3/\$2 * 100.0}'")
        TIMESTAMP=$(date +"%Y-%m-%d %H:%M:%S")
        echo "${TIMESTAMP} - ${server} - Memory Usage: ${MEM_USAGE}%" >> /var/log/webserver_memory_checks.log
      done
      
    • Why it works: This script runs automatically every Tuesday and logs the memory usage of every server. It doesn’t do anything proactive; it just records data. That data can then be reviewed, and if it shows a consistent pattern of memory usage climbing before it becomes critical, it makes the case for the permanent fix (step 3) or for a more aggressive automated restart (step 5). It turns a manual, attention-grabbing task into passive data collection. And it’s so simple, so repetitive, and so easy to verify that you’ll soon want to automate the review of the log file itself, leading you to the next level of automation or to the permanent fix.
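When you do automate the review of that log file, the sketch can be as small as an awk filter over the recorded lines. The 70% threshold and file path are assumptions, and two sample lines are synthesized so it runs stand-alone:

```shell
#!/bin/sh
# Sketch: flag any host whose recorded memory usage crossed 70%.
# Line format matches the cron script above: "TIMESTAMP - host - Memory Usage: NN%".
LOG=/tmp/webserver_memory_checks_sample.log
printf '%s\n' \
  '2025-01-07 13:30:01 - web01.example.com - Memory Usage: 82%' \
  '2025-01-07 13:30:02 - web02.example.com - Memory Usage: 41%' > "$LOG"

awk -F' - ' '{
  split($3, a, ": ")      # a[2] is e.g. "82%"
  pct = a[2] + 0          # awk takes the numeric prefix of "82%"
  if (pct > 70) print $2, pct "%"
}' "$LOG"
```

Pipe the output into your ticketing or chat integration and the weekly review becomes a notification you only see when something is actually trending upward.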

The next concept you’ll likely grapple with is understanding the difference between detection automation and remediation automation, and when to apply each.
