• Thomas Gleixner's avatar
    futexes: fix fault handling in futex_lock_pi · 1b7558e4
    Thomas Gleixner authored
    This patch addresses a very sporadic pi-futex related failure in
    highly threaded java apps on large SMP systems.
    
    David Holmes reported that the pi_state consistency check in
    lookup_pi_state triggered with his test application. This means that
    the kernel internal pi_state and the user space futex variable are out
    of sync. First we assumed that this is a user space data corruption,
    but deeper investigation revieled that the problem happend because the
    pi-futex code is not handling a fault in the futex_lock_pi path when
    the user space variable needs to be fixed up.
    
    The fault happens when a fork mapped the anon memory which contains
    the futex readonly for COW or the page got swapped out exactly between
    the unlock of the futex and the return of either the new futex owner
    or the task which was the expected owner but failed to acquire the
    kernel internal rtmutex. The current futex_lock_pi() code drops out
    with an inconsistent in case it faults and returns -EFAULT to user
    space. User space has no way to fixup that state.
    
    When we wrote this code we thought that we could not drop the hash
    bucket lock at this point to handle the fault.
    
    After analysing the code again it turned out to be wrong because there
    are only two tasks involved which might modify the pi_state and the
    user space variable:
    
     - the task which acquired the rtmutex
     - the pending owner of the pi_state which did not get the rtmutex
    
    Both tasks drop into the fixup_pi_state() function before returning to
    user space. The first task which acquired the hash bucket lock faults
    in the fixup of the user space variable, drops the spinlock and calls
    futex_handle_fault() to fault in the page. Now the second task could
    acquire the hash bucket lock and tries to fixup the user space
    variable as well. It either faults as well or it succeeds because the
    first task already faulted the page in.
    
    One caveat is to avoid a double fixup. After returning from the fault
    handling we reacquire the hash bucket lock and check whether the
    pi_state owner has been modified already.
    Reported-by: default avatarDavid Holmes <david.holmes@sun.com>
    Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: David Holmes <david.holmes@sun.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: <stable@kernel.org>
    Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
    
     kernel/futex.c |   93 ++++++++++++++++++++++++++++++++++++++++++++-------------
     1 file changed, 73 insertions(+), 20 deletions(-)
    1b7558e4
futex.c 50.3 KB