Nullpointer - 16 days ago
Linux Question

Linux context switch internals: What happens when a process exits before the timer interrupt?

How is a context switch made in the Linux kernel when a process exits before the timer interrupt?

I know that if a process is running and a timer interrupt occurs, the schedule function is called automatically if the flag is set, and schedule then selects the next process to run. In this case the schedule function runs in the context of the current process. But what happens when the process exits even before the timer interrupt? Who calls the schedule function in this case, and in what context does it run?


It's important to understand that the timer interrupt is just one of several hundred different reasons why schedule might get called. Only programs whose runtime is dominated by computation, which is rarer than you'd think, ever exhaust their time slice. It's much more common for programs to run for only a few microseconds — yes, microseconds — at a time, in between "blocking" in system calls, waiting for user input or whatever.

When a process exits in any way, ultimately a call to do_exit always happens, in the (kernel) context of that process. do_exit calls schedule as its last action, and schedule never returns to that context. Note how, at the very end of do_exit, there is a call to schedule, followed immediately by BUG(); and an infinite loop.

Just prior to this, do_exit calls exit_notify, which is responsible for sending SIGCHLD to the parent process and/or releasing it from a call to wait. So, a lot of the time, the parent process will be ready-to-run when schedule gets called, and will be selected.

do_exit also deallocates all of the user-space state and much of the kernel state associated with the process, frees memory, closes file descriptors, etc. The task_struct itself must survive until someone calls wait, and I can't figure out exactly how the kernel decides that it can now be deallocated; this code is too convoluted.

If the process called _exit, the kernel call chain is simply sys_exit_group to do_group_exit to do_exit.

If it took a fatal synchronous signal (e.g. SIGSEGV), the call chain is a lot longer and has a tricky diversion in it. The hardware trap is fielded by architecture-specific code (e.g. x86 do_trap), which goes through force_sig_info and send_signal to complete_signal; that function adjusts the task state and then tells the scheduler to wake up the offending process. The offending process wakes up, and a maze of architecture-specific signal handling logic eventually delivers it to get_signal, which calls do_group_exit, which calls do_exit.

Fatal asynchronous signals (e.g. from typing kill 12345 at a shell prompt) start at sys_kill and go through kill_something_info, group_send_sig_info, and do_send_sig_info to send_signal, after which everything proceeds as above. In both cases, all of the steps up to complete_signal may happen in any process context, but everything after "The offending process wakes up" happens in that process's context.

The only parts of this description that are Linux-specific are the names of functions in the kernel's code. Any implementation of Unix will have kernel functions that do more or less what Linux's do_exit and schedule do, and the sequences of operations involved in fielding _exit, fatal synchronous signals, and fatal async signals will be recognizably similar.