I have a number of bash scripts that carry out many similar tasks and use some external binary programs. The problem is that these binaries often do not terminate as they should. Since my scripts run them thousands of times, idle or nearly dead instances of these processes accumulate quickly. I cannot fix these programs, so I need to make sure my bash scripts terminate them.
There are already some topics here on SE that deal with terminating the processes of bash scripts. I have applied and tested what was written there, and to some extent it works. But it does not work well enough for my case, and I don't understand why, so I am opening a new question.
My scripts have a hierarchy, here shown in a simplified manner:
Script A calls script B, and script B calls multiple instances of script C in parallel to use all the CPUs. E.g. script B runs 5 instances of script C in parallel, and whenever one instance of script C completes it starts a new one, for thousands of runs of script C altogether. Script C in turn calls several external binaries/commands which don't terminate nicely. They run in parallel in the background and communicate with each other.
However, my script C is able to detect when the external commands are done with their work, even if they have not terminated, and then my bash script exits.
In order to terminate all the external programs during completion of the bash script, I have added an exit trap:
# Exit cleanup
cleanup_exit() {
# Running the termination in an own process group to prevent it from preliminary termination. Since it will run in the background it will not cause any delays
setsid nohup bash -c "
# Trapping signals to prevent that this function is terminated preliminary
trap '' SIGINT SIGQUIT SIGTERM SIGHUP ERR
# Terminating the main processes
kill ${pids[@]} 1>/dev/null 2>&1 || true
sleep 5
kill -9 ${pids[@]} 1>/dev/null 2>&1 || true
# Terminating the child processes of the main processes
pkill -P ${pids[@]} 1>/dev/null 2>&1 || true
sleep 1
pkill -9 -P ${pids[@]} 1>/dev/null 2>&1 || true
# Terminating everything else which is still running and which was started by this script
pkill -P $$ || true
sleep 1
pkill -9 -P $$ || true
"
}
trap "cleanup_exit" SIGINT SIGQUIT SIGTERM EXIT
The xtrace output from a run looks like this:
+ echo ' * Snapshot 190 completed after 3 seconds.'
* Snapshot 190 completed after 3 seconds.
+ break
+ cleanup_exit
+ echo
+ echo ' * Cleaning up...'
* Cleaning up...
+ setsid nohup bash -c '
# Trapping signals to prevent that this function is terminated preliminary
trap '\'''\'' SIGINT SIGQUIT SIGTERM SIGHUP ERR
# Terminating the main processes
kill 31678' '32048 1>/dev/null 2>&1 || true
sleep 5
kill -9 31678' '32048 1>/dev/null 2>&1 || true
# Terminating the child processes of the main processes
pkill -P 31678' '32048 1>/dev/null 2>&1 || true
sleep 1
pkill -9 -P 31678' '32048 1>/dev/null 2>&1 || true
# Terminating everything else which is still running and which was started by this script
pkill -P 31623 || true
sleep 1
pkill -9 -P 31623 || true
'
A good sanity check for any shell script is to run ShellCheck on it:
Line 9:
kill ${pids[@]} 1>/dev/null 2>&1 || true
^-- SC2145: Argument mixes string and array. Use * or separate argument.
And indeed, your xtrace does something strange on this line:
kill 31678' '32048 1>/dev/null 2>&1 || true
^^^--- What is this?
The problem here is that your ${pids[@]} expands into multiple words, and bash -c only interprets the first word. Here's a simplified example:
pids=(2 3 4)
bash -c "echo killing ${pids[@]}"
This ends up writing killing 2 with no mention of 3 or 4. It's equivalent to running
bash -c "echo killing 2" "3" "4"
where the other pids just become the positional parameters $0 and $1 instead of being part of the executed command.
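To make the word splitting visible, here is a small sketch that hands the same quoted string to printf, which repeats its format once per argument. The variable names words and ran are only for illustration, and the default IFS is assumed.

```shell
#!/usr/bin/env bash
pids=(2 3 4)

# printf reuses the format for every argument it receives, so each
# resulting word shows up in its own pair of brackets.
words=$(printf '[%s]' "echo killing ${pids[@]}")
echo "$words"   # [echo killing 2][3][4]

# bash -c treats only the first of those words as the command string.
ran=$(bash -c "echo killing ${pids[@]}")
echo "$ran"     # killing 2
```

The first element is glued to the text before it, and every further element becomes a separate argument.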
Instead, like ShellCheck suggested, you wanted * to concatenate all the pids with spaces and insert them as a single argument:
pids=(2 3 4)
bash -c "echo killing ${pids[*]}"
which prints killing 2 3 4.
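As an alternative that is not part of the suggestion above, the pids can be kept out of the command string entirely: put the command in single quotes and pass the array as separate arguments, which the inner shell then sees as "$@". This is only a sketch. The sleep jobs are stand-ins for the real external binaries, the _ is a placeholder that fills $0, and still_running is a name chosen for the example.

```shell
#!/usr/bin/env bash
# Stand-in background jobs; the real script would start its binaries here.
sleep 60 & pids=($!)
sleep 60 & pids+=($!)

# Single quotes keep the command string literal. "_" becomes $0, and the
# pids arrive as $1, $2, ... so "$@" expands to all of them.
bash -c 'kill "$@" 2>/dev/null || true' _ "${pids[@]}"

# Reap the jobs; wait reports their termination status, which is ignored here.
wait "${pids[@]}" 2>/dev/null || true

# kill -0 sends no signal, it only checks whether the process still exists.
still_running=0
for pid in "${pids[@]}"; do
    if kill -0 "$pid" 2>/dev/null; then
        still_running=$((still_running + 1))
    fi
done
echo "still running: $still_running"
```

With this form no pid is ever parsed as part of the command text, so the inner shell receives the same list regardless of how many elements the array holds.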