[PATCH] synchronize use of mm->core_waiters
From: Roland McGrath <roland@redhat.com>

I believe I have identified a failure mode that Linus saw a couple of
weeks back while tracking down some other fork/exit races. We saw it
come up on rare occasions with the RHEL3 kernel's backport of the new
code (while trying to track down other race failure modes we have yet
to fix, sigh). I am talking about the following scenario:

> Btw, even with the fix, doing a "while : ; ./crash t 10 ; done" will
> eventually result in a stuck process:
>
>   1415 tty1     D      0:00 ./crash
>
> This is some kind of deadlock: most of the fifty threads are in "D"
> state, with a trace something like
>
>  [<c011fbe3>] schedule+0x360/0x7f8
>  [<c0120539>] wait_for_completion+0xd4/0x1c3
>  [<c0128c9e>] do_exit+0x627/0x6a4
>  [<c0128ddd>] do_group_exit+0x3d/0x177
>  [<c0130c13>] dequeue_signal+0x2d/0x84
>  [<c0133911>] get_signal_to_deliver+0x390/0x575
>  [<c010a541>] do_signal+0x6c/0xf1
>  [<c01200be>] default_wake_function+0x0/0x12
>  [<c01200be>] default_wake_function+0x0/0x12
>  [<c013d50f>] do_futex+0x6d/0x7d
>  [<c013d635>] sys_futex+0x116/0x12f
>  [<c010a601>] do_notify_resume+0x3b/0x3d
>  [<c010a82e>] work_notifysig+0x13/0x15
>
> except for one that is trying to core-dump:
>
>  [<c0120539>] wait_for_completion+0xd4/0x1c3
>  [<c01200be>] default_wake_function+0x0/0x12
>  [<c01200be>] default_wake_function+0x0/0x12
>  [<c02101aa>] rwsem_wake+0x86/0x12d
>  [<c01738af>] coredump_wait+0xa8/0xaa
>  [<c0173a26>] do_coredump+0x175/0x26c
>
> and three that are just doing a regular "exit()" system call:
>
>  [<c011fbe3>] schedule+0x360/0x7f8
>  [<c011e19a>] recalc_task_prio+0x90/0x1aa
>  [<c0120539>] wait_for_completion+0xd4/0x1c3
>  [<c01200be>] default_wake_function+0x0/0x12
>  [<c01200be>] default_wake_function+0x0/0x12
>  [<c0210207>] rwsem_wake+0xe3/0x12d
>  [<c0128c9e>] do_exit+0x627/0x6a4
>  [<c0128d4d>] next_thread+0x0/0x53
>  [<c010a7e3>] syscall_call+0x7/0xb
>
> However, the rest of the system is totally unaffected by this deadlock:
> it's only deadlocked within the thread group itself, nobody else cares.

What happens here is a race between an exiting thread checking
mm->core_waiters in __exit_mm, and the thread taking the core-dump
signal (in coredump_wait) examining the first thread's ->mm pointer and
incrementing mm->core_waiters to account for it.

There is no synchronization at all in __exit_mm's use of
mm->core_waiters. If the coredump_wait thread reads tsk->mm while tsk
is in __exit_mm, between its check of mm->core_waiters and its clearing
of tsk->mm, then it will increment mm->core_waiters, and the total
count will exceed the number of threads that will ever decrement it and
synchronize. Hence the dumping thread blocks forever.

The following patch fixes the problem by using mm->mmap_sem in
__exit_mm. The read lock must be held around checking mm->core_waiters
and clearing tsk->mm, so that coredump_wait (which takes the write
lock) cannot come in between and do bogus bookkeeping.
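The patch body is not reproduced on this page. As an illustration only,
here is a minimal sketch of the locking described above, not the
verbatim patch; the completion fields (core_startup_done, core_done)
and the surrounding exit-path details are assumptions based on 2.6-era
kernels:

/*
 * Sketch, not the verbatim patch: the exit path serialized against
 * coredump_wait() via mm->mmap_sem.  coredump_wait() takes the
 * semaphore for writing while it counts threads whose ->mm still
 * points at this mm, so holding the read lock across both the
 * core_waiters check and the clearing of tsk->mm closes the window
 * in which a thread could be counted after it has already decided
 * it will never decrement core_waiters.
 */
static void __exit_mm(struct task_struct *tsk)
{
	struct mm_struct *mm = tsk->mm;

	mm_release(tsk, mm);
	if (!mm)
		return;

	down_read(&mm->mmap_sem);
	if (mm->core_waiters) {
		/* A core dump is in progress; join it.  The fields
		 * core_startup_done and core_done are assumed from
		 * 2.6-era struct mm_struct. */
		up_read(&mm->mmap_sem);
		down_write(&mm->mmap_sem);
		if (!--mm->core_waiters)
			complete(mm->core_startup_done);
		up_write(&mm->mmap_sem);

		wait_for_completion(&mm->core_done);
		down_read(&mm->mmap_sem);
	}
	atomic_inc(&mm->mm_count);
	/* Clear tsk->mm while still holding the read lock, so that
	 * coredump_wait() cannot slip in between the check above and
	 * this point and count us. */
	task_lock(tsk);
	tsk->mm = NULL;
	up_read(&mm->mmap_sem);
	enter_lazy_tlb(mm, current);
	task_unlock(tsk);
	mmput(mm);
}

Note that the sketch falls back to the write lock only when
core_waiters is already nonzero, which keeps the common exit path on
the cheaper read lock.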