    [PATCH] synchronize use of mm->core_waiters · 99365bd4
    Andrew Morton authored
    From: Roland McGrath <roland@redhat.com>
    
    I believe I have identified a failure mode that Linus saw a couple weeks
    back when tracking down some other fork/exit sorts of races.  We saw this
    come up on rare occasions with the RHEL3 kernel's backport of the new code
    (while trying to track down other race failure modes we have yet to fix, sigh).
    
    I am talking about the following scenario:
    
    > Btw, even with the fix, doing a "while : ; ./crash t 10 ; done" will
    > eventually result in a stuck process:
    >
    > 	 1415 tty1     D      0:00 ./crash
    >
    > This is some kind of deadlock: most of the fifty threads are in "D"
    > state, with a trace something like
    >
    > 	 [<c011fbe3>] schedule+0x360/0x7f8
    > 	 [<c0120539>] wait_for_completion+0xd4/0x1c3
    > 	 [<c0128c9e>] do_exit+0x627/0x6a4
    > 	 [<c0128ddd>] do_group_exit+0x3d/0x177
    > 	 [<c0130c13>] dequeue_signal+0x2d/0x84
    > 	 [<c0133911>] get_signal_to_deliver+0x390/0x575
    > 	 [<c010a541>] do_signal+0x6c/0xf1
    > 	 [<c01200be>] default_wake_function+0x0/0x12
    > 	 [<c01200be>] default_wake_function+0x0/0x12
    > 	 [<c013d50f>] do_futex+0x6d/0x7d
    > 	 [<c013d635>] sys_futex+0x116/0x12f
    > 	 [<c010a601>] do_notify_resume+0x3b/0x3d
    > 	 [<c010a82e>] work_notifysig+0x13/0x15
    >
    > except for one that is trying to core-dump:
    >
    > 	 [<c0120539>] wait_for_completion+0xd4/0x1c3
    > 	 [<c01200be>] default_wake_function+0x0/0x12
    > 	 [<c01200be>] default_wake_function+0x0/0x12
    > 	 [<c02101aa>] rwsem_wake+0x86/0x12d
    > 	 [<c01738af>] coredump_wait+0xa8/0xaa
    > 	 [<c0173a26>] do_coredump+0x175/0x26c
    >
    > and three that are just doing a regular "exit()" system call:
    >
    > 	 [<c011fbe3>] schedule+0x360/0x7f8
    > 	 [<c011e19a>] recalc_task_prio+0x90/0x1aa
    > 	 [<c0120539>] wait_for_completion+0xd4/0x1c3
    > 	 [<c01200be>] default_wake_function+0x0/0x12
    > 	 [<c01200be>] default_wake_function+0x0/0x12
    > 	 [<c0210207>] rwsem_wake+0xe3/0x12d
    > 	 [<c0128c9e>] do_exit+0x627/0x6a4
    > 	 [<c0128d4d>] next_thread+0x0/0x53
    > 	 [<c010a7e3>] syscall_call+0x7/0xb
    >
    > However, the rest of the system is totally unaffected by this deadlock:
    > it's only deadlocked within the thread group itself, nobody else cares.
    
    What happens here is a race between an exiting thread checking
    mm->core_waiters in __exit_mm, and the thread taking the core-dump signal
    (in coredump_wait) examining the first thread's ->mm pointer and
    incrementing mm->core_waiters to account for it.  There is no
    synchronization at all in __exit_mm's use of mm->core_waiters.  If the
    coredump_wait thread reads tsk->mm when tsk is in __exit_mm between
    checking mm->core_waiters and clearing tsk->mm, then it will increment
    mm->core_waiters and the total count will later exceed the number of
    threads that will ever decrement it and synchronize.  Hence the dumping
    thread blocks forever in coredump_wait, waiting for a count that never
    reaches zero.
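The interleaving described above can be replayed deterministically in userspace. The following is only a simplified sketch, not kernel code: `mm_model`, `task_model`, and `replay_unsynchronized_race` are hypothetical names, and the model keeps just the two fields the race touches. It assumes that an exiting thread decrements the count only if it saw a nonzero value when it checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; only the fields
 * involved in the race are modeled. */
struct mm_model { int core_waiters; };
struct task_model { struct mm_model *mm; };

/* Replays the bad interleaving for one exiting thread and one dumping
 * thread, single-threaded and step by step.  Returns the residual
 * core_waiters count, i.e. how many decrements the dumper is still
 * waiting for that will never arrive. */
int replay_unsynchronized_race(void)
{
    struct mm_model mm = { .core_waiters = 0 };
    struct task_model exiting = { .mm = &mm };

    /* Step 1 (exiting thread): check core_waiters with no lock held.
     * No dump has started yet, so it sees zero. */
    bool saw_dump = (exiting.mm->core_waiters != 0);

    /* Step 2 (dumping thread): scan the thread group.  The exiting
     * thread's mm pointer is still set, so it gets counted. */
    if (exiting.mm == &mm)
        mm.core_waiters++;

    /* Step 3 (exiting thread): clear the mm pointer.  Because it saw
     * zero in step 1, it never decrements the count. */
    exiting.mm = NULL;
    if (saw_dump)
        mm.core_waiters--;

    assert(exiting.mm == NULL);
    return mm.core_waiters;
}
```

A nonzero return value corresponds to the stuck state in the traces: the dumper has counted a thread that has already decided it has nothing to synchronize with.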
    
    The following patch fixes the problem by using mm->mmap_sem in __exit_mm.
    The read lock must be held around checking mm->core_waiters and clearing
    tsk->mm so that coredump_wait (which gets the write lock) cannot come in
    between and do bogus bookkeeping.