The most surprising thing about systemd’s watchdog is that it’s not there to catch crashes (systemd notices a dead process on its own) but to deal with services that are still running yet unresponsive.
Let’s see it in action. Imagine we have a simple service, my-app.service, that sometimes gets stuck.
[Unit]
Description=My Stuck Application

[Service]
ExecStart=/usr/local/bin/my-app
Restart=always
RestartSec=5
# This is the key! (systemd does not allow comments on the same line as a directive)
WatchdogSec=10

[Install]
WantedBy=multi-user.target
And the my-app script itself needs to signal its health by sending WATCHDOG=1 datagrams to the socket systemd advertises in the NOTIFY_SOCKET environment variable — this is what sd_notify(3) does under the hood (setting WatchdogSec= implies NotifyAccess=main, so the socket is passed to the service automatically):
#!/usr/bin/env python3
import os
import socket
import time

# A dummy file whose presence makes the app act "stuck"
STUCK_FILE = "/tmp/my-app-stuck"

def sd_notify(message):
    """Send a notification datagram to systemd (what sd_notify(3) does)."""
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return  # not running under systemd
    if addr.startswith("@"):
        addr = "\0" + addr[1:]  # abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), addr)

def main():
    print(f"My App started with PID {os.getpid()}")
    # Simulate a long-running task that might freeze
    for i in range(60):
        print(f"Tick {i}...")
        if os.path.exists(STUCK_FILE):
            print("Detected STUCK file. Not signaling watchdog.")
            # This is where the app *fails* to signal
            time.sleep(30)  # simulate being stuck
        else:
            # Heartbeat: tells systemd we're alive and resets its timer
            sd_notify("WATCHDOG=1")
            time.sleep(1)
    print("My App finished normally.")

if __name__ == "__main__":
    main()
To enable this, we’d create the service file:
sudo nano /etc/systemd/system/my-app.service
And the script:
sudo nano /usr/local/bin/my-app
sudo chmod +x /usr/local/bin/my-app
Now, we enable and start it:
sudo systemctl daemon-reload
sudo systemctl enable my-app
sudo systemctl start my-app
If my-app runs without the /tmp/my-app-stuck file, it will continuously send WATCHDOG=1 heartbeats over the notification socket, keeping systemd happy. You can confirm the watchdog is armed with systemctl show my-app --property=WatchdogUSec, which should report WatchdogUSec=10s.
Now, let’s simulate the app getting stuck. Create the "stuck" file:
touch /tmp/my-app-stuck
If you check systemctl status my-app again after a short wait, you’ll see the unit was killed for missing its watchdog deadline (the journal records a message along the lines of “Watchdog timeout”). After 10 seconds without a heartbeat (the WatchdogSec=10 value), systemd decides my-app is no longer responsive and sends it SIGABRT (the default WatchdogSignal=). Because Restart=always is set, systemd will then restart it after a 5-second delay (RestartSec=5).
The mental model here is that systemd is not a passive observer. When a service declares WatchdogSec, it’s essentially asking systemd to actively monitor it. The service must periodically send WATCHDOG=1 notifications over the socket that systemd makes available. If this "heartbeat" stops for longer than WatchdogSec, systemd assumes the service is hung, even if the process itself hasn’t technically crashed or exited. It then kills the service and applies the Restart= policy.
The WatchdogSec value should be set to a duration that is comfortably longer than the longest expected normal operation between heartbeats, but short enough to detect a hang in a timely manner. For instance, if your application performs a task that typically takes 5 seconds and then signals, setting WatchdogSec=10 is reasonable. If it normally signals every 30 seconds, WatchdogSec=60 might be appropriate.
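You don’t have to hard-code that interval, either: systemd exports the configured timeout to the service in the WATCHDOG_USEC environment variable. A minimal sketch of deriving the heartbeat period from it (signaling at half the timeout is the usual convention, not a requirement):

```python
import os

def watchdog_interval(default=1.0):
    """Heartbeat period in seconds: half of the WATCHDOG_USEC
    timeout that systemd exports to a watchdog-enabled service."""
    usec = os.environ.get("WATCHDOG_USEC")
    if not usec:
        return default  # no watchdog armed for this service
    return int(usec) / 1_000_000 / 2

# Simulate the environment systemd creates for WatchdogSec=10
os.environ["WATCHDOG_USEC"] = "10000000"
print(watchdog_interval())  # 5.0
```

With this in place, changing WatchdogSec= in the unit file automatically adjusts the heartbeat rate without touching the application.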
The core mechanism is the systemd notification socket. systemd sets the NOTIFY_SOCKET environment variable for the service; the service sends WATCHDOG=1 datagrams to that Unix socket, typically via sd_notify(3). Each heartbeat resets systemd’s internal timer for that service. If the timer expires, the watchdog condition is met.
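The protocol itself is just Unix datagrams, which is easy to see outside systemd. Here is a self-contained sketch that plays both sides of the exchange (the socket path is an illustrative stand-in for the real NOTIFY_SOCKET address):

```python
import os
import socket
import tempfile

# An illustrative stand-in for the address systemd would choose
sock_path = os.path.join(tempfile.mkdtemp(), "notify.sock")

# The "systemd" end: a Unix datagram socket bound to the notify address
manager = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
manager.bind(sock_path)

# The "service" end: what sd_notify(3) does under the hood
os.environ["NOTIFY_SOCKET"] = sock_path
client = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
client.sendto(b"WATCHDOG=1", os.environ["NOTIFY_SOCKET"])

# The manager receives the heartbeat and would reset its timer here
datagram = manager.recv(4096)
print(datagram.decode())  # WATCHDOG=1
manager.close()
client.close()
```

Nothing about the datagram is special; it’s the fact that systemd knows which PID sent it, and when, that makes the heartbeat meaningful.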
The one thing most people don’t realize is that the service must actively participate in the watchdog process. Simply setting WatchdogSec= in the unit file is not enough; the application itself needs to be programmed to send the WATCHDOG=1 notification. If the application doesn’t support this and can’t be modified, you can’t use systemd’s watchdog feature for it.
The next step is often configuring the kernel’s own hardware watchdog timer to achieve a similar, but system-wide, level of resilience.
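That system-wide layer works on the same keep-alive principle, except it’s systemd itself that must pet a hardware timer, and the penalty for silence is a machine reboot rather than a service restart. A sketch of the relevant options in /etc/systemd/system.conf (requires a watchdog device such as /dev/watchdog; the values here are illustrative, not recommendations):

```ini
[Manager]
# If PID 1 stops petting the hardware watchdog, the machine
# is hard-reset after this interval.
RuntimeWatchdogSec=30
# Deadline for a reboot to complete before the hardware steps in.
RebootWatchdogSec=10min
```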