The most surprising thing about systemd service restart policies is that on-failure doesn’t actually mean "restart if the service exits with a non-zero status."

Let’s see what that looks like in practice. Imagine we have a simple service that just exits:

# /etc/systemd/system/my-flaky-app.service
[Unit]
Description=A flaky application

[Service]
ExecStart=/usr/bin/my-flaky-app
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

And the app itself:

#!/usr/bin/env python3
# /usr/bin/my-flaky-app
import sys
import time

print("My flaky app is starting...")
time.sleep(2)
print("My flaky app is exiting with status 1.")
sys.exit(1)

We make it executable, then enable and start it:

sudo chmod +x /usr/bin/my-flaky-app
sudo systemctl enable my-flaky-app.service
sudo systemctl start my-flaky-app.service

After a few seconds, we check its status:

sudo systemctl status my-flaky-app.service

You’ll see output like this:

● my-flaky-app.service - A flaky application
     Loaded: loaded (/etc/systemd/system/my-flaky-app.service; enabled; vendor preset: enabled)
     Active: activating (auto-restart) since Mon 2023-10-27 10:00:00 UTC; 4s ago
   Main PID: 12345 (code=exited, status=1/FAILURE)
        CPU: 10ms

Oct 27 10:00:00 hostname my-flaky-app.service[12345]: My flaky app is starting...
Oct 27 10:00:02 hostname my-flaky-app.service[12345]: My flaky app is exiting with status 1.
Oct 27 10:00:02 hostname systemd[1]: my-flaky-app.service: Main process exited, code=exited, status=1/FAILURE
Oct 27 10:00:02 hostname systemd[1]: my-flaky-app.service: Scheduled restart job, restart counter is at 1.
Oct 27 10:00:07 hostname systemd[1]: Stopped A flaky application.
Oct 27 10:00:07 hostname systemd[1]: Starting A flaky application...
Oct 27 10:00:07 hostname my-flaky-app.service[12346]: My flaky app is starting...
Oct 27 10:00:09 hostname my-flaky-app.service[12346]: My flaky app is exiting with status 1.
Oct 27 10:00:09 hostname systemd[1]: my-flaky-app.service: Main process exited, code=exited, status=1/FAILURE

The service is indeed restarting, with a 5-second pause between attempts thanks to RestartSec=5 (you can follow the loop live with journalctl -u my-flaky-app.service -f). Now, let’s change the app to exit with status 0:

#!/usr/bin/env python3
# /usr/bin/my-flaky-app
import sys
import time

print("My flaky app is starting...")
time.sleep(2)
print("My flaky app is exiting with status 0.")
sys.exit(0)

We restart it:

sudo systemctl restart my-flaky-app.service

And check the status again:

sudo systemctl status my-flaky-app.service

You’ll see:

● my-flaky-app.service - A flaky application
     Loaded: loaded (/etc/systemd/system/my-flaky-app.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Mon 2023-10-27 10:05:00 UTC; 1s ago
   Main PID: 12350 (code=exited, status=0/SUCCESS)

Oct 27 10:05:00 hostname my-flaky-app.service[12350]: My flaky app is starting...
Oct 27 10:05:02 hostname my-flaky-app.service[12350]: My flaky app is exiting with status 0.
Oct 27 10:05:02 hostname systemd[1]: my-flaky-app.service: Main process exited, code=exited, status=0/SUCCESS
Oct 27 10:05:02 hostname systemd[1]: my-flaky-app.service: Succeeded.
Oct 27 10:05:02 hostname systemd[1]: my-flaky-app.service: Consumed 10ms CPU time.

It’s inactive (dead). It didn’t restart. This is because on-failure doesn’t just look at the exit code. It’s more nuanced. systemd considers a service to have "failed" if it terminates unexpectedly. This includes:

  1. Non-zero exit codes: This is the most common case and what we expect.
  2. Crashes (Signals): If the process is terminated by a signal (e.g., SIGSEGV, SIGKILL), systemd treats it as a failure. You can see this in systemctl status as code=killed, signal=XX.
  3. Timeouts: If the service misses its WatchdogSec= keep-alive ping (when configured), or if systemd times out waiting for it to start or stop (TimeoutStartSec=/TimeoutStopSec=), the termination is recorded as a failure.
  4. "Clean" exit codes (0): If the service exits with status 0, systemd considers it a successful and intentional termination, unless you tell it otherwise.
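Outside of systemd, you can observe the raw distinction between these termination modes with Python’s subprocess module, whose returncode mirrors what the kernel reports to systemd (a sketch of the mechanism, not systemd’s own logic):

```python
import subprocess
import sys

# A child that exits with status 1: to systemd, a failure (on-failure restarts).
r = subprocess.run([sys.executable, "-c", "import sys; sys.exit(1)"])
print(r.returncode)  # 1

# A child killed by a signal: also a failure; systemctl status would show
# code=killed. subprocess reports it as a negative returncode.
r = subprocess.run([sys.executable, "-c",
                    "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"])
print(r.returncode)  # -11 on Linux (== -signal.SIGSEGV)

# A clean exit: status 0, which on-failure does NOT restart.
r = subprocess.run([sys.executable, "-c", "import sys; sys.exit(0)"])
print(r.returncode)  # 0
```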

This is where the different values of the Restart= directive come into play.

  • Restart=no: The default. Never restart.
  • Restart=on-success: Restart only if the service exits cleanly (status 0, or a status listed in SuccessExitStatus=). This is useful for batch-style jobs that should be re-run after each successful completion; a failed run stops the loop.
  • Restart=on-failure: Restart if the service exits with a non-zero status, is terminated by a signal, times out, or trips the watchdog. This is the most common policy for daemons.
  • Restart=on-abnormal: Restart if the service is terminated by a signal, times out, or trips the watchdog, but not on an unclean exit code.
  • Restart=on-watchdog: Restart only if the watchdog timeout expires.
  • Restart=on-abort: Restart only if the service is terminated by an uncaught signal that is not listed as a clean exit status. Despite the name, this is narrower than on-failure, not equivalent to it.
  • Restart=always: Always restart, regardless of exit status or signal. This is the most aggressive policy and is useful for services that must be running, where you don’t care how they exit as long as they come back up.
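What counts as a "clean" exit for these policies is itself configurable via SuccessExitStatus=. If your application uses a non-zero status to signal a deliberate, benign exit (status 75 here is a hypothetical choice for "nothing to do"), you can declare it clean so that on-failure leaves it alone:

```ini
[Service]
ExecStart=/usr/bin/my-flaky-app
Restart=on-failure
RestartSec=5
# Treat exit status 75, and termination by SIGUSR1, as success:
SuccessExitStatus=75 SIGUSR1
```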

The key to understanding on-failure is that it doesn’t trigger on a successful exit (status 0). If you want a service to restart even if it exits with status 0, you need Restart=always.
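To make the matrix concrete, here is a toy model (my sketch, not systemd’s actual code) of how each policy decides, ignoring the timeout and watchdog cases:

```python
# Toy model of systemd Restart= policies. exit_code/signal describe how the
# main process ended; timeouts and watchdog expiry are not modeled here.
def should_restart(policy, exit_code=0, signal=None):
    clean = signal is None and exit_code == 0          # e.g. sys.exit(0)
    unclean_code = signal is None and exit_code != 0   # e.g. sys.exit(1)
    killed = signal is not None                        # e.g. SIGSEGV, SIGKILL
    if policy == "no":
        return False
    if policy == "always":
        return True
    if policy == "on-success":
        return clean
    if policy == "on-failure":
        return unclean_code or killed
    if policy == "on-abnormal":
        return killed  # plus timeouts/watchdog in real systemd
    if policy == "on-abort":
        return killed  # uncaught signal only, narrower than on-failure
    raise ValueError(f"unknown policy: {policy}")

print(should_restart("on-failure", exit_code=1))     # True
print(should_restart("on-failure", exit_code=0))     # False
print(should_restart("always", exit_code=0))         # True
print(should_restart("on-abort", signal="SIGSEGV"))  # True
```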

Let’s go back to our flaky app and change Restart to always:

# /etc/systemd/system/my-flaky-app.service
[Unit]
Description=A flaky application

[Service]
ExecStart=/usr/bin/my-flaky-app
# Changed from on-failure; unit files do not support inline comments,
# so the note goes on its own line:
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Now, with the Python script exiting with sys.exit(0), we enable and start:

sudo systemctl daemon-reload
sudo systemctl enable my-flaky-app.service
sudo systemctl start my-flaky-app.service

Checking the status:

sudo systemctl status my-flaky-app.service

Output:

● my-flaky-app.service - A flaky application
     Loaded: loaded (/etc/systemd/system/my-flaky-app.service; enabled; vendor preset: enabled)
     Active: activating (auto-restart) since Mon 2023-10-27 10:10:00 UTC; 4s ago
   Main PID: 12360 (code=exited, status=0/SUCCESS)

Oct 27 10:10:00 hostname my-flaky-app.service[12360]: My flaky app is starting...
Oct 27 10:10:02 hostname my-flaky-app.service[12360]: My flaky app is exiting with status 0.
Oct 27 10:10:02 hostname systemd[1]: my-flaky-app.service: Main process exited, code=exited, status=0/SUCCESS
Oct 27 10:10:02 hostname systemd[1]: my-flaky-app.service: Succeeded.
Oct 27 10:10:02 hostname systemd[1]: my-flaky-app.service: Scheduled restart job, restart counter is at 1.
Oct 27 10:10:07 hostname systemd[1]: Stopped A flaky application.
Oct 27 10:10:07 hostname systemd[1]: Starting A flaky application...
Oct 27 10:10:07 hostname my-flaky-app.service[12361]: My flaky app is starting...
Oct 27 10:10:09 hostname my-flaky-app.service[12361]: My flaky app is exiting with status 0.
Oct 27 10:10:09 hostname systemd[1]: my-flaky-app.service: Main process exited, code=exited, status=0/SUCCESS

Now it restarts even after a clean exit. The Restart= directive in the [Service] section of the unit file selects the policy, and RestartSec= sets the delay (in seconds, by default) before each restart attempt.
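One caveat with aggressive policies like always: systemd rate-limits restarts. By default (tunable in system.conf), a unit started more than 5 times within 10 seconds enters the failed state with result start-limit-hit and stops being restarted. The thresholds live in the [Unit] section; a sketch, with values chosen purely for illustration:

```ini
[Unit]
Description=A flaky application
# Give up only after 10 start attempts within any 300-second window:
StartLimitIntervalSec=300
StartLimitBurst=10
```

If the limit trips, sudo systemctl reset-failed my-flaky-app.service clears the counter so the unit can be started again.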

The systemd.service(5) man page is the definitive source; understanding these nuances is what makes service management reliable in practice.

The next thing you’ll likely run into is how systemd handles service dependencies and ordering when restarts are involved.
