Jadzia Jadzia - 3 years ago 182
Linux Question

Killing all processes started by a Bash script

I have a number of bash scripts which carry out a lot of similar tasks, and they use some external binary programs. The problem is that the binary programs often do not terminate as they should. Since my scripts run them thousands of times, a lot of idle/nearly dead instances of these processes quickly accumulate. I cannot fix these programs, therefore I need to make sure my bash scripts terminate them.

There are already some topics here on SE which deal with terminating the processes of bash scripts. I have applied and tested what was written there, and to some extent it works. But it does not work well enough for my case, and I don't understand why, therefore I am opening a new question.

My scripts have a hierarchy, here shown in a simplified manner:
Script A calls script B, and script B calls multiple instances of script C in parallel to use all the CPUs. E.g. script B runs 5 instances of script C in parallel, and when one instance of script C completes it starts a new one, altogether thousands of runs of script C. Script C in turn calls several external binaries/commands which don't terminate nicely. They run in parallel in the background and communicate with each other.

However, my script C is able to detect when the external commands are done with their work, even if they have not terminated, and then my bash script exits.

In order to terminate all the external programs when the bash script completes, I have added an exit trap:

# Exit cleanup
cleanup_exit() {
    # Running the termination in its own process group to prevent it from premature termination. Since it runs in the background it will not cause any delays.
    setsid nohup bash -c "
        # Trapping signals to prevent this cleanup from being terminated prematurely
        trap '' SIGINT SIGQUIT SIGTERM SIGHUP ERR

        # Terminating the main processes
        kill ${pids[@]} 1>/dev/null 2>&1 || true
        sleep 5
        kill -9 ${pids[@]} 1>/dev/null 2>&1 || true

        # Terminating the child processes of the main processes
        pkill -P ${pids[@]} 1>/dev/null 2>&1 || true
        sleep 1
        pkill -9 -P ${pids[@]} 1>/dev/null 2>&1 || true

        # Terminating everything else which is still running and was started by this script
        pkill -P $$ || true
        sleep 1
        pkill -9 -P $$ || true
    "
}
trap "cleanup_exit" SIGINT SIGQUIT SIGTERM EXIT


Now this seems to work if I run only very few instances of script C in parallel. If I increase the number, e.g. to 10 (the workstation is powerful and should be able to handle dozens of parallel instances of script C and the external programs), then it does not work anymore, and hundreds of instances of the external programs accumulate quickly.

But I don't understand why. For instance, the PID of one of the processes which accumulated was 32048. And in the logs I can see the execution of the exit trap:

+ echo ' * Snapshot 190 completed after 3 seconds.'
* Snapshot 190 completed after 3 seconds.
+ break
+ cleanup_exit
+ echo

+ echo ' * Cleaning up...'
* Cleaning up...
+ setsid nohup bash -c '

# Trapping signals to prevent that this function is terminated preliminary
trap '\'''\'' SIGINT SIGQUIT SIGTERM SIGHUP ERR

# Terminating the main processes
kill 31678' '32048 1>/dev/null 2>&1 || true
sleep 5
kill -9 31678' '32048 1>/dev/null 2>&1 || true

# Terminating the child processes of the main processes
pkill -P 31678' '32048 1>/dev/null 2>&1 || true
sleep 1
pkill -9 -P 31678' '32048 1>/dev/null 2>&1 || true

# Terminating everything else which is still running and which was started by this script
pkill -P 31623 || true
sleep 1
pkill -9 -P 31623 || true
'


Clearly, the PID of this process was used in the exit trap; the process just did not quit. Just for testing, I ran the kill command manually on this process, and then it did quit.

I have two questions:

  1. Why does my mechanism not work well?

  2. Which mechanism do you suggest?

Update: I just found out that even if I run only one instance of script C at a time, i.e. sequentially, it works well only for some time. Suddenly, at some point, processes are no longer terminated, but start to hang around forever and accumulate. The machine should not be overloaded with one process at a time. And in my log files the exit trap is still called properly as before, no difference there. Memory is free as well, and the CPUs are also partially free.

Answer

A good sanity check for any shell script is to run ShellCheck on it:

Line 9:
        kill ${pids[@]} 1>/dev/null 2>&1 || true
             ^-- SC2145: Argument mixes string and array. Use * or separate argument.

And indeed, your xtrace does something strange on this line:

kill 31678' '32048 1>/dev/null 2>&1 || true
          ^^^--- What is this?

The problem here is that your ${pids[@]} inside double quotes expands into multiple words, and bash -c only interprets the first resulting word as its command string. Here's a simplified example:

pids=(2 3 4)
bash -c "echo killing ${pids[@]}" 

This ends up writing killing 2 with no mention of 3 or 4. It's equivalent to running

bash -c "echo killing 2" "3" "4" 

where the other pids just become the positional parameters $0 and $1 instead of being part of the executed command.
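You can see both effects directly in a hypothetical demo (the literal PIDs 2, 3, 4 are just example values):

```shell
pids=(2 3 4)

# The unquoted [@] expansion splits into three separate arguments, and
# bash -c executes only the first one, so this prints just "command saw: 2"
bash -c "echo command saw: ${pids[@]}"

# The leftover words land in the positional parameters, which you can
# inspect explicitly: this prints "$0=3 $1=4"
bash -c 'echo "\$0=$0 \$1=$1"' 3 4
```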

Instead, like ShellCheck suggested, you wanted * to concatenate all the pids with spaces and insert them as a single argument:

pids=(2 3 4)
bash -c "echo killing ${pids[*]}" 

which prints killing 2 3 4.
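Applied to the cleanup from the question, a sketch of the fix could look like the following. This is illustrative, not the original script: the sleep processes stand in for the stubborn external binaries, the delays are shortened, and the name cleanup is an arbitrary placeholder that becomes $0 of the inner shell. Option 1 uses [*] as ShellCheck suggests; option 2 sidesteps string interpolation entirely by passing the PIDs as positional parameters:

```shell
# Stand-ins for the stubborn external binaries
sleep 100 &
sleep 100 &
pids=($(jobs -p))

# Option 1: [*] joins the array into one word, so every PID ends up
# inside the command string
setsid nohup bash -c "
    trap '' SIGINT SIGQUIT SIGTERM SIGHUP
    kill ${pids[*]} 1>/dev/null 2>&1 || true
    sleep 1
    kill -9 ${pids[*]} 1>/dev/null 2>&1 || true
"

# Option 2: pass the PIDs as arguments instead; no interpolation, no
# quoting pitfalls (the extra word 'cleanup' becomes $0 of the inner shell)
setsid nohup bash -c '
    trap "" SIGINT SIGQUIT SIGTERM SIGHUP
    kill "$@" 1>/dev/null 2>&1 || true
    sleep 1
    kill -9 "$@" 1>/dev/null 2>&1 || true
' cleanup "${pids[@]}"
```

Option 2 is generally the more robust pattern, since the PIDs never pass through a second round of shell parsing.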
