The most surprising truth about SRE toil reduction is that the goal isn’t to eliminate toil, but to make it so boring and repetitive that it’s impossible to ignore.
Imagine you’re on call, and every few hours, you get an alert: "Service X is slow." You log in, check the database connection pool, see it’s maxed out, restart the pool, and the alerts stop. A few hours later it happens again. And again. It’s maddening, right? This is toil.
Let’s say you’re managing a fleet of web servers. You notice that every Tuesday, around 2 PM PST, you have to manually restart the web server process on about 10% of your machines because of a memory leak that only manifests under sustained load. This is toil.
Here’s how you might tackle that specific Tuesday restart problem:
1. **Identify the exact symptoms and scope.** You know it’s a memory leak, affecting about 10% of servers, every Tuesday at 2 PM PST. You’ve confirmed it’s the web server process by checking `top` and `htop` for high memory usage and then `systemctl status your-web-service` to see the process.
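Even step 1 can be partly scripted: rather than eyeballing `top` on each host, capture the largest memory consumers so the evidence is repeatable and comparable across Tuesdays. A minimal sketch, assuming a Linux host with GNU `ps`:

```shell
#!/bin/bash
# List the five largest processes by resident memory (RSS, in kB),
# so you can confirm the web server really is the one growing.
ps -eo pid,comm,rss --sort=-rss | head -n 6   # header line + top 5
```

Saving this output each Tuesday gives you a paper trail that the same process tops the list every week.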
2. **Gather diagnostic data.** Before automating, you’d want to collect more data.

   - Command:

     ```shell
     # Substitute the actual date of last Tuesday; journalctl's -S/-U take
     # timestamps like "2024-06-04 13:00" or relative forms such as "-3d".
     journalctl -u your-web-service -S "2024-06-04 13:00" -U "2024-06-04 15:00" \
       --output=json | jq -c '. + {host: ._HOSTNAME}' > /tmp/webserver_logs_tuesday.json
     ```

   - Purpose: This captures logs specifically from the web service for the hour before and after your typical intervention time last Tuesday, tagging each log entry with the hostname from journald’s `_HOSTNAME` field (`jq -c` keeps the output one JSON object per line). Analyzing this JSON output with `jq` or a log aggregation platform can pinpoint the exact log messages preceding the memory growth.
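Once the logs are captured, a first-pass summary can tell you which hosts are noisiest. The sketch below needs nothing beyond `awk`, and assumes line-delimited JSON with a `host` field (e.g. produced by adding `-c` to the `jq` call); the demo input stands in for the real capture file:

```shell
#!/bin/bash
# Count captured log entries per host, highest first.
LOG_FILE=/tmp/webserver_logs_tuesday.json

# Demo input so the script is self-contained; in practice this file is
# the journalctl capture from the step above.
printf '%s\n' \
  '{"MESSAGE":"cache evict failed","host":"web01"}' \
  '{"MESSAGE":"slow request","host":"web02"}' \
  '{"MESSAGE":"cache evict failed","host":"web01"}' > "$LOG_FILE"

awk -F'"host":"' 'NF > 1 { split($2, a, "\""); counts[a[1]]++ }
                  END { for (h in counts) print counts[h], h }' "$LOG_FILE" | sort -rn
# Prints "2 web01" first, then "1 web02"
```

A skew toward a few hosts here is your first hint about which machines to canary the fix on.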
3. **Develop a targeted fix.** You discover the leak is caused by a specific caching module that doesn’t evict old entries under high load. The fix is to add a configuration parameter to the web server.

   - Configuration file: `/etc/your-web-service/config.yaml`
   - Change: Add the line `cache.max_entries: 10000` to the `config.yaml` file.
   - Why it works: This explicitly limits the cache size, preventing it from growing indefinitely and thus stopping the memory leak.
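Before automating the rollout, you might verify the change by hand on one canary host. The helper below is a minimal, hand-rolled idempotent edit of a flat `key: value` config line; the demo path is a stand-in for `/etc/your-web-service/config.yaml`:

```shell
#!/bin/bash
# Idempotently set a flat "key: value" line in a config file.
set_config() {
  local file="$1" key="$2" value="$3"
  touch "$file"
  if grep -q "^${key}:" "$file"; then
    sed -i "s|^${key}:.*|${key}: ${value}|" "$file"   # key present: rewrite it
  else
    echo "${key}: ${value}" >> "$file"                # key absent: append it
  fi
}

CONF=/tmp/demo-config.yaml   # stand-in for /etc/your-web-service/config.yaml
printf 'server.port: 8080\n' > "$CONF"
set_config "$CONF" "cache.max_entries" 10000
set_config "$CONF" "cache.max_entries" 10000   # second run is a no-op
cat "$CONF"
```

Running it twice and seeing no duplicate line is exactly the idempotence you want before handing the edit to a configuration management tool.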
4. **Automate the deployment of the fix.** You’ll use a configuration management tool like Ansible.

   - Ansible playbook snippet:

     ```yaml
     ---
     - name: Apply web service memory leak fix
       hosts: webservers
       become: yes
       tasks:
         - name: Ensure cache limit is set
           ansible.builtin.lineinfile:
             path: /etc/your-web-service/config.yaml
             regexp: '^cache\.max_entries:'
             line: 'cache.max_entries: 10000'
             state: present
           notify: restart web service

       handlers:
         - name: restart web service
           ansible.builtin.systemd:
             name: your-web-service
             state: restarted
     ```

   - Why it works: This playbook ensures the `cache.max_entries` line exists and is set to `10000` in the configuration file on all hosts in the `webservers` group. If the line is added or changed, it triggers a handler to restart the `your-web-service` systemd unit.
5. **Automate the monitoring for the need for the fix (if the fix wasn’t permanent).** If the fix wasn’t a permanent code change but rather a temporary workaround (e.g., restarting a process), you’d automate the detection and remediation.

   - Monitoring tool: Prometheus with Alertmanager.
   - PromQL query:

     ```promql
     (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
     ```

   - Alert rule (in a Prometheus rules file; Prometheus evaluates it and hands firing alerts to Alertmanager):

     ```yaml
     - alert: LowMemoryOnWebServer
       expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
       for: 5m
       labels:
         severity: warning
       annotations:
         summary: "Low memory on {{ $labels.instance }}"
         description: "Instance {{ $labels.instance }} has less than 10% memory available."
     ```

   - Automation action (via webhook to a script or another automation tool): When this alert fires, trigger a script that re-checks memory usage on the host and restarts the `your-web-service` process only if usage exceeds a threshold (e.g., 80% of RAM).

     ```shell
     #!/bin/bash
     SERVICE="your-web-service"
     THRESHOLD_PERCENT=80

     # Get total and available memory in kB from /proc/meminfo.
     MEM_TOTAL=$(grep MemTotal /proc/meminfo | awk '{print $2}')
     MEM_AVAILABLE=$(grep MemAvailable /proc/meminfo | awk '{print $2}')
     MEM_USED_PERCENT=$(awk "BEGIN {printf \"%.0f\", 100 - (${MEM_AVAILABLE} * 100 / ${MEM_TOTAL})}")

     if [ "$MEM_USED_PERCENT" -gt "$THRESHOLD_PERCENT" ]; then
       echo "High memory usage detected: ${MEM_USED_PERCENT}%. Restarting ${SERVICE}."
       systemctl restart "${SERVICE}"
     else
       echo "Memory usage acceptable: ${MEM_USED_PERCENT}%."
     fi
     ```

   - Why it works: Prometheus monitors system metrics, and the alert fires when available memory drops below 10% for 5 minutes. A separate script, triggered by this alert, re-checks memory usage on the host and restarts the service only if it is above the threshold, preventing unnecessary restarts.
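One piece the step above leaves implicit is how the firing alert actually reaches the restart script. In Alertmanager that is a route plus a webhook receiver; a minimal sketch, where the receiver name and URL are placeholders for whatever endpoint invokes your script:

```yaml
# Alertmanager configuration fragment (receiver name and webhook URL
# are illustrative placeholders, not part of the original setup).
route:
  receiver: default
  routes:
    - matchers:
        - alertname = "LowMemoryOnWebServer"
      receiver: webserver-memory-remediation

receivers:
  - name: default
  - name: webserver-memory-remediation
    webhook_configs:
      - url: http://automation.internal:9000/hooks/restart-web-service
```

Alertmanager POSTs a JSON payload describing the firing alerts to that URL; the listener behind it is what runs the restart script.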
6. **The "Boring" Automation.** You’ve now automated the detection and, if necessary, the remediation. This is great! But what if the original issue was that every Tuesday, you had to manually log in and check whether memory usage was high, even if it wasn’t yet critical?

   - The Toil: Manually checking a dashboard or running `htop` on 50 servers every Tuesday.
   - The Automation: A scheduled job that runs on a central orchestrator.

     ```shell
     #!/bin/bash
     # This script runs via cron every Tuesday at 1:30 PM PST.
     # It checks memory usage on all web servers and logs it.
     WEB_SERVERS=("web01.example.com" "web02.example.com" "...")  # list of your web servers

     for server in "${WEB_SERVERS[@]}"; do
       # SSH into the server and get its memory usage percentage.
       MEM_USAGE=$(ssh "${server}" "free | awk '/^Mem:/ {printf \"%.0f\", \$3/\$2 * 100.0}'")
       TIMESTAMP=$(date +"%Y-%m-%d %H:%M:%S")
       echo "${TIMESTAMP} - ${server} - Memory Usage: ${MEM_USAGE}%" >> /var/log/webserver_memory_checks.log
     done
     ```

   - Why it works: This script runs automatically every Tuesday and logs the memory usage for every server. It doesn’t do anything proactive; it just records data. That data can then be reviewed at leisure, and if it shows a consistent pattern of memory usage climbing before it becomes critical, it proves the need for the permanent fix (step 3) or a more aggressive automated restart (step 5). It turns a manual, attention-grabbing task into a passive data collection exercise. It’s so simple, so repetitive, and so easy to verify that you’ll want to automate the review of the log file itself, leading you to the next level of automation or a permanent fix.
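That next rung, automating the review of the log itself, can start as a one-liner that flags any reading above a soft threshold. The log format matches what the cron script writes; the path and the 70% threshold are illustrative, and demo data is included so the sketch is self-contained:

```shell
#!/bin/bash
# Flag log entries whose memory usage crossed a soft threshold (70%).
# In production, read /var/log/webserver_memory_checks.log instead.
LOG=/tmp/webserver_memory_checks.log
cat > "$LOG" <<'EOF'
2024-06-04 13:30:00 - web01.example.com - Memory Usage: 85%
2024-06-04 13:30:02 - web02.example.com - Memory Usage: 42%
EOF

awk -F'Memory Usage: ' '{ pct = $2; sub(/%/, "", pct);
                          if (pct + 0 > 70) print "REVIEW:", $0 }' "$LOG"
# Prints only the web01 line, prefixed with "REVIEW:"
```

Pipe the flagged lines into a ticket or a weekly digest and the Tuesday check disappears from a human's calendar entirely.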
The next concept you’ll likely grapple with is understanding the difference between detection automation and remediation automation, and when to apply each.