I use systemd user timers as a cron replacement. I have a particular program set to execute every 20 minutes. The program is not a daemon, is network-dependent, and launches a number of child processes. I've noticed however that the timer frequently stalls after a few hours (or days). The timer is still active, yet the program is no longer executed every 20 minutes.
systemctl status --user PROGRAM.service
Feb 13 15:03:45 HOSTNAME systemd: Job PROGRAM.service/start timed out.
Feb 13 15:03:45 HOSTNAME systemd: Timed out starting DESCRIPTION.
Feb 13 15:03:45 HOSTNAME systemd: Job PROGRAM.service/start failed with result 'timeout'.
ExecStart=/usr/bin/timeout 20m /path/to/program
Description=Run PROGRAM.service every 20 minutes
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT -GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID -ELFUTILS +KMOD -IDN
There are two important things in systemd which I think you are hitting in this case:
When you start a process with systemd, all of its child processes are (at least by default) part of the same control group.
If any one of those children does not exit, systemd considers the service to be still (at least partly) running.
What does that mean?
The timer description says:
Note that in case the unit to activate is already active at the time the timer elapses it is not restarted, but simply left running.
In other words, if any one of your processes is still running 20 minutes later, the timer system will not restart anything.
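To see how a child can outlive the process that started it, here is a small shell sketch. The `sh -c` wrapper and the `sleep` child are hypothetical stand-ins for your program and one of its children, not anything taken from your setup.

```shell
# The "main" process starts a background child and exits right away.
# Redirecting the child's output keeps the command substitution from waiting on it.
child=$(sh -c 'sleep 5 >/dev/null 2>&1 & echo $!')

# The main process (sh -c) is gone by now, but the child may still be alive.
if kill -0 "$child" 2>/dev/null; then
    alive=yes
else
    alive=no
fi
echo "child $child still running: $alive"

# Clean up the leftover child.
kill "$child" 2>/dev/null
```

Per the explanation above, a leftover child like this one is what would keep the service counted as running when the timer next elapses.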
Why does this make sense?!
CRON was doing exactly the same thing. If your process was still running, it would not restart it over and over again (because that would just fill up memory and possibly break many other things). However, CRON had no concept of process groups. So if your main process did die, it assumed that it could restart it.
What is the systemd solution?
Assuming you cannot just stop the child processes (although since you used /usr/bin/timeout, you probably can?), one way is to set the KillMode option to process in the service unit, although I do not recommend it.
This means that once the main process has died, the service is considered stopped. The documentation says:
If set to process, only the main process itself is killed.
You may want to test whether that really works, since the documentation does not say that it will consider the whole group as dead... But in my experience, that works.
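As a sketch of where that setting would go, assuming the service unit otherwise looks like the one in the question (only the [Service] section is shown, and the rest of your unit is not reproduced here):

```ini
# PROGRAM.service (fragment)
[Service]
ExecStart=/usr/bin/timeout 20m /path/to/program
# Assumption: with "process", only the main process is killed when the
# unit is stopped; any remaining children are left alone.
KillMode=process
```

After editing the unit, `systemctl --user daemon-reload` makes systemd pick up the change.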
What is a better solution then?
Since I do not recommend KillMode, there should be another solution. The fact is that all your processes either have 20 minutes to run (or whatever amount of time remains at the time they are spawned) or they will prevent the following run from happening, which may be okay once in a while, but certainly not if they stay around forever. So the better fix would be to edit those programs and make sure they quit after a while.
However, after a long while, it may be necessary to kill those processes, and using the timeout tool as you've done could be the best solution if the processes themselves cannot just quit on time. I would suggest one small modification, though: use 19 minutes for the timeout, because otherwise you may miss the next startup window.
ExecStart=/usr/bin/timeout 19m /path/to/program
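If you want to check how the timeout wrapper behaves before relying on it, here is a minimal sketch. It assumes GNU coreutils timeout, and `sleep 5` is only a placeholder for /path/to/program.

```shell
# GNU timeout exits with status 124 when it had to stop the command.
status=0
timeout 1s sleep 5 || status=$?
echo "exit status: $status"
```

A status of 124 in the journal would then tell you that a run hit the limit rather than finishing on its own. GNU timeout also accepts -k DURATION (--kill-after) to send KILL if the program ignores the initial TERM, which may help if something refuses to quit.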