The most surprising thing about systemd service restart policies is that on-failure doesn’t actually mean "restart if the service exits with a non-zero status."
Let’s see what that looks like in practice. Imagine we have a simple service that just exits:
# /etc/systemd/system/my-flaky-app.service
[Unit]
Description=A flaky application
[Service]
ExecStart=/usr/bin/my-flaky-app
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
And the app itself:
#!/usr/bin/env python3
# /usr/bin/my-flaky-app (make it executable: chmod +x /usr/bin/my-flaky-app)
import sys
import time

print("My flaky app is starting...")
time.sleep(2)
print("My flaky app is exiting with status 1.")
sys.exit(1)
We enable and start it:
sudo systemctl enable my-flaky-app.service
sudo systemctl start my-flaky-app.service
After a few seconds, we check its status:
sudo systemctl status my-flaky-app.service
You’ll see output like this:
● my-flaky-app.service - A flaky application
Loaded: loaded (/etc/systemd/system/my-flaky-app.service; enabled; vendor preset: enabled)
Active: activating (auto-restart) since Mon 2023-10-27 10:00:02 UTC; 4s ago
Main PID: 12345 (code=exited, status=1/FAILURE)
CPU: 10ms
Oct 27 10:00:00 hostname my-flaky-app.service[12345]: My flaky app is starting...
Oct 27 10:00:02 hostname my-flaky-app.service[12345]: My flaky app is exiting with status 1.
Oct 27 10:00:02 hostname systemd[1]: my-flaky-app.service: Main process exited, code=exited, status=1/FAILURE
Oct 27 10:00:02 hostname systemd[1]: my-flaky-app.service: Scheduled restart job, restart counter is at 1.
Oct 27 10:00:07 hostname systemd[1]: Stopped A flaky application.
Oct 27 10:00:07 hostname systemd[1]: Starting A flaky application...
Oct 27 10:00:07 hostname my-flaky-app.service[12346]: My flaky app is starting...
Oct 27 10:00:09 hostname my-flaky-app.service[12346]: My flaky app is exiting with status 1.
Oct 27 10:00:09 hostname systemd[1]: my-flaky-app.service: Main process exited, code=exited, status=1/FAILURE
The service is indeed restarting. Now, let’s change the app to exit with status 0:
#!/usr/bin/env python3
# /usr/bin/my-flaky-app
import sys
import time

print("My flaky app is starting...")
time.sleep(2)
print("My flaky app is exiting with status 0.")
sys.exit(0)
We restart it:
sudo systemctl restart my-flaky-app.service
And check the status again:
sudo systemctl status my-flaky-app.service
You’ll see:
● my-flaky-app.service - A flaky application
Loaded: loaded (/etc/systemd/system/my-flaky-app.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Mon 2023-10-27 10:05:02 UTC; 1s ago
Main PID: 12350 (code=exited, status=0/SUCCESS)
Oct 27 10:05:00 hostname my-flaky-app.service[12350]: My flaky app is starting...
Oct 27 10:05:02 hostname my-flaky-app.service[12350]: My flaky app is exiting with status 0.
Oct 27 10:05:02 hostname systemd[1]: my-flaky-app.service: Main process exited, code=exited, status=0/SUCCESS
Oct 27 10:05:02 hostname systemd[1]: my-flaky-app.service: Succeeded.
Oct 27 10:05:02 hostname systemd[1]: my-flaky-app.service: Consumed 10ms CPU time.
It’s inactive (dead). It didn’t restart. This is because `on-failure` doesn’t just look at the exit code; systemd’s definition of "failure" is broader. It considers a service to have failed if it terminates unexpectedly. This includes:
- Non-zero exit codes: the most common case, and what we expect.
- Termination by a signal: if the process is killed by an "unclean" signal (e.g. SIGSEGV, SIGABRT, SIGKILL), systemd treats it as a failure. You can see this in `systemctl status` as `code=killed, signal=XX`. (SIGHUP, SIGINT, SIGTERM, and SIGPIPE count as clean exits.)
- Timeouts: if the service exceeds `WatchdogSec=` (if configured), or if systemd itself times out waiting for it to start or stop (though this is less common for `Restart=`).

A clean exit with status 0, on the other hand, is treated as a successful, intentional termination, and `on-failure` leaves it alone, unless you tell systemd otherwise.
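To make the first two cases concrete, here is an illustrative Python sketch (not systemd code) of how a supervisor tells a non-zero exit apart from death by signal. On POSIX, `subprocess` encodes a signal death as a negative return code:

```python
# Illustrative: how a supervisor such as systemd distinguishes a
# non-zero exit code from termination by signal.
import subprocess
import sys

# Child exits with status 1: an unclean exit code.
exit_rc = subprocess.run(
    [sys.executable, "-c", "import sys; sys.exit(1)"]
).returncode
print(exit_rc)  # 1

# Child is killed by SIGKILL: systemd would report code=killed, signal=KILL.
kill_rc = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
).returncode
print(kill_rc)  # -9 on Linux (negative signal number)
```

Both outcomes count as "failure" for `Restart=on-failure`; only the first is a plain exit code.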
This is where the `Restart=` directive comes into play. Its possible values are:
- `Restart=no`: the default. Never restart.
- `Restart=on-success`: restart only if the service exits cleanly (status 0, or a signal regarded as clean). Useful for jobs that should run again as long as they keep succeeding, and stop for good once they fail.
- `Restart=on-failure`: restart on a non-zero exit status, an unclean signal, a timeout, or a watchdog event. This is the most common policy for daemons.
- `Restart=on-abnormal`: restart on an unclean signal, a timeout, or a watchdog event, but not on a non-zero exit status. This is a bit less common.
- `Restart=on-watchdog`: restart only if the watchdog timeout expires.
- `Restart=on-abort`: restart only if the service is killed by an uncaught signal. Note this is narrower than `on-failure`, not equivalent to it.
- `Restart=always`: always restart, regardless of exit status or signal. This is the most aggressive policy, useful for services that must be running no matter how they exit.
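The authoritative mapping from policies to trigger events is the table in systemd.service(5); as a quick mental model (a sketch, not systemd's actual code), the decision logic looks like this:

```python
# Sketch of the Restart= decision table from systemd.service(5).
# Events: "success" (clean exit), "unclean-exit" (non-zero status),
# "unclean-signal", "timeout", "watchdog".
RESTART_TABLE = {
    "no":          set(),
    "on-success":  {"success"},
    "on-failure":  {"unclean-exit", "unclean-signal", "timeout", "watchdog"},
    "on-abnormal": {"unclean-signal", "timeout", "watchdog"},
    "on-abort":    {"unclean-signal"},
    "on-watchdog": {"watchdog"},
    "always":      {"success", "unclean-exit", "unclean-signal",
                    "timeout", "watchdog"},
}

def should_restart(policy: str, event: str) -> bool:
    """Would this Restart= policy schedule a restart for this event?"""
    return event in RESTART_TABLE[policy]

print(should_restart("on-failure", "success"))       # False
print(should_restart("on-failure", "unclean-exit"))  # True
print(should_restart("always", "success"))           # True
```

Reading the table this way makes the `on-failure` gap obvious: `"success"` appears only under `on-success` and `always`.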
The key to understanding on-failure is that it doesn’t trigger on a successful exit (status 0). If you want a service to restart even if it exits with status 0, you need Restart=always.
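If `always` is too blunt, systemd also offers `RestartForceExitStatus=` (exit statuses that force a restart regardless of the `Restart=` policy) and `SuccessExitStatus=` (extra statuses to treat as success). As a sketch, worth verifying against systemd.service(5) on your version, this keeps `on-failure` semantics while restarting on a clean exit too:

```ini
[Service]
ExecStart=/usr/bin/my-flaky-app
Restart=on-failure
RestartSec=5
# Force a restart when the app exits with status 0, even though
# Restart=on-failure would otherwise ignore a clean exit.
RestartForceExitStatus=0
```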
Let’s go back to our flaky app and change Restart to always:
# /etc/systemd/system/my-flaky-app.service
[Unit]
Description=A flaky application
[Service]
ExecStart=/usr/bin/my-flaky-app
# Changed from on-failure (systemd ignores inline comments, so this goes on its own line)
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
Now, with the Python script exiting with sys.exit(0), we reload systemd to pick up the unit change and restart the service (it was already enabled earlier):
sudo systemctl daemon-reload
sudo systemctl restart my-flaky-app.service
Checking the status:
sudo systemctl status my-flaky-app.service
Output:
● my-flaky-app.service - A flaky application
Loaded: loaded (/etc/systemd/system/my-flaky-app.service; enabled; vendor preset: enabled)
Active: activating (auto-restart) since Mon 2023-10-27 10:10:02 UTC; 4s ago
Main PID: 12360 (code=exited, status=0/SUCCESS)
Oct 27 10:10:00 hostname my-flaky-app.service[12360]: My flaky app is starting...
Oct 27 10:10:02 hostname my-flaky-app.service[12360]: My flaky app is exiting with status 0.
Oct 27 10:10:02 hostname systemd[1]: my-flaky-app.service: Main process exited, code=exited, status=0/SUCCESS
Oct 27 10:10:02 hostname systemd[1]: my-flaky-app.service: Succeeded.
Oct 27 10:10:02 hostname systemd[1]: my-flaky-app.service: Scheduled restart job, restart counter is at 1.
Oct 27 10:10:07 hostname systemd[1]: Stopped A flaky application.
Oct 27 10:10:07 hostname systemd[1]: Starting A flaky application...
Oct 27 10:10:07 hostname my-flaky-app.service[12361]: My flaky app is starting...
Oct 27 10:10:09 hostname my-flaky-app.service[12361]: My flaky app is exiting with status 0.
Oct 27 10:10:09 hostname systemd[1]: my-flaky-app.service: Main process exited, code=exited, status=0/SUCCESS
Now it restarts even after a clean exit. Both behaviors are controlled by the `Restart=` directive in the `[Service]` section of the unit file, and `RestartSec=` sets the delay in seconds before each restart attempt (5 seconds here, which is why the restart appears in the journal five seconds after the exit).
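The restart loop itself is conceptually simple. Here is a toy Python supervisor approximating `Restart=always` plus `RestartSec=`; it is purely illustrative and omits things real systemd does, such as start rate limiting via `StartLimitIntervalSec=`/`StartLimitBurst=`:

```python
# Toy supervisor: Restart=always with a RestartSec-style delay.
# Illustrative only; systemd adds rate limiting, signal handling, etc.
import subprocess
import sys
import time

def supervise(cmd, restart_sec=5.0, max_runs=3):
    """Run cmd, restarting it unconditionally, up to max_runs times.

    Returns the list of exit codes observed.
    """
    codes = []
    for run in range(1, max_runs + 1):
        codes.append(subprocess.run(cmd).returncode)
        if run < max_runs:
            time.sleep(restart_sec)  # the RestartSec= delay
    return codes

if __name__ == "__main__":
    # Restart a trivially succeeding "service" twice, with no delay.
    print(supervise([sys.executable, "-c", "import sys; sys.exit(0)"],
                    restart_sec=0, max_runs=2))  # [0, 0]
```

A `Restart=on-failure` variant would simply break out of the loop when the return code is 0 and no signal was involved.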
The systemd documentation on systemd.service(5) is the definitive source, but understanding these nuances is critical for reliable service management.
The next thing you’ll likely run into is how systemd handles service dependencies and ordering when restarts are involved.