[PATCH] signal handling race condition causing reboot hangs
From: Ernie Petrides <petrides@redhat.com>

(I can't get anyone to review this, but I'm sure there's a bug in there, and Ernie's patch has been in -mm for some time).

There is a long-standing locking hole in the kernel's handling of the signals related to stopping and resuming processes. When a process handles SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU, the "sighand" lock is held while the signal is dequeued and the appropriate masks are updated. But the "sighand" lock is dropped in several cases before the task's state is changed to TASK_STOPPED (or before a group-stop is initiated).

If a process running on another CPU posts a SIGCONT or SIGKILL just after the "victim" process releases the lock but before its state is set to TASK_STOPPED, the corresponding wakeup will be lost and the victim will remain stopped despite the SIGCONT or SIGKILL. In this case, a repeated posting of SIGCONT or SIGKILL will have no effect, since the original one is already pending (which causes any repeated posting to be discarded). A SIGSTOP/SIGKILL race in which the victim has blocked all other signals will result in an unkillable process.

Although a fabricated test program can reproduce a SIGSTOP/SIGCONT race hang in less than a minute (on a 2-CPU Dell Precision 450), the scenario that has been most frequently encountered is a hang during reboot or shutdown. This occurs because /sbin/killall5 brackets its scan of /proc/* and the associated signal posting to (most of) the processes still running with kill(-1, SIGSTOP) and kill(-1, SIGCONT) calls, to temporarily freeze every process except for "init". Occasionally, its parent (running the /etc/rc6.d/S01reboot shell script) gets stuck in TASK_STOPPED state with pending SIGCONT and SIGCLD signals, but with no other process left to wake it up.

In order to fix the race condition, the locking in do_signal_stop() and get_signal_to_deliver() needed reworking to close the hole. Due to lock ordering issues between the "sighand" lock and tasklist_lock, there are two cases where the former lock needs to be released and then reacquired, thus allowing a tiny hole for a SIGCONT/SIGKILL to be posted. These two cases are resolved by rechecking for a pending SIGCONT/SIGKILL after the locks are (re)acquired in the proper order.

Anyone wanting a copy of the test program may e-mail me off-list.
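
The recheck described above can be illustrated with a small userspace sketch. This is not the kernel patch and not the test program mentioned above: the toy_* names, the two pthread mutexes standing in for tasklist_lock and the "sighand" lock, and the in_window callback that simulates a sender on another CPU are all assumptions made for illustration. It shows only the general pattern: drop the inner lock, take both locks in the proper order, then look for a pending continue before committing to the stopped state.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; not the kernel's task_struct. */
struct toy_task {
    pthread_mutex_t sighand;  /* plays the role of the "sighand" lock */
    bool cont_pending;        /* a SIGCONT/SIGKILL has been posted */
    bool stopped;             /* plays the role of TASK_STOPPED */
};

/* Plays the role of tasklist_lock; must be taken before sighand. */
static pthread_mutex_t toy_tasklist = PTHREAD_MUTEX_INITIALIZER;

/* Sender side: mark the signal pending and wake the task if stopped. */
static void toy_post_cont(struct toy_task *t)
{
    pthread_mutex_lock(&t->sighand);
    t->cont_pending = true;
    t->stopped = false;  /* has an effect only if the task already stopped */
    pthread_mutex_unlock(&t->sighand);
}

/*
 * Victim side. Called with t->sighand held, returns with it held.
 * in_window (may be NULL) runs while no lock is held, simulating a
 * sender on another CPU hitting the hole.
 * Returns true if the task committed to the stopped state.
 */
static bool toy_try_stop(struct toy_task *t,
                         void (*in_window)(struct toy_task *))
{
    /* Lock ordering forces us to drop sighand before taking tasklist. */
    pthread_mutex_unlock(&t->sighand);
    if (in_window != NULL)
        in_window(t);

    pthread_mutex_lock(&toy_tasklist);
    pthread_mutex_lock(&t->sighand);

    /* Recheck: a continue posted in the window must cancel the stop. */
    if (t->cont_pending) {
        pthread_mutex_unlock(&toy_tasklist);
        return false;
    }

    t->stopped = true;
    pthread_mutex_unlock(&toy_tasklist);
    return true;
}
```

A caller would lock t->sighand, call toy_try_stop(), and unlock afterwards. Without the cont_pending recheck, a toy_post_cont() that ran inside the window would clear stopped before it was set, and the task would then stop with nothing left to wake it, which is the lost wakeup described above.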