-
Roland McGrath authored
This is a fix made almost a month ago, during the flurry of signal changes. I didn't realize until today that this hadn't made it into 2.5. Sorry about the delay. This fix is necessary to avoid sometimes wedging in uninterruptible sleep when doing a multithreaded core dump triggered by a process signal (kill) rather than a trap. You can reproduce the problem by running your favorite multithreaded program (NPTL) and then using "kill -SEGV" on it. It will often wedge. The actual fix could be just a two line diff: + if (current->signal->group_exit) + goto dequeue; after the group_exit_task check. That is the fix that has been used in Ingo's backport for weeks and tested heavily (well, as heavily as core dumping ever gets tested, but it's been in our production systems). But I broke the hair out into a separate function. The patch below has the same effect as the two-liner, and no other difference. I have tested 2.5.64 with this patch and it works for me, though I haven't beat on it. The way the wedge happens is that for a core-dump signal group_send_sig_info does a group stop of other threads before the one thread handles the fatal signal. If the fatal thread gets into do_coredump and coredump_wait first, then other threads see the group stop and suspend with SIGKILL pending. All other fatal cases clear group_stop_count, so this is the only way this ever happens. Checking group_exit fixes it. I didn't make do_coredump clear group_stop_count because doing it with the appropriate ordering and locking doesn't fit the organization that code.
874f2e47