    mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition · 2d363e95
    Andrea Arcangeli authored
    commit 26c19178 upstream.
    
    When holding the mmap_sem for reading, pmd_offset_map_lock should only
    run on a pmd_t that has been read atomically from the pmdp pointer.
    Otherwise we may read only half of it, leading to this crash.
    
    PID: 11679  TASK: f06e8000  CPU: 3   COMMAND: "do_race_2_panic"
     #0 [f06a9dd8] crash_kexec at c049b5ec
     #1 [f06a9e2c] oops_end at c083d1c2
     #2 [f06a9e40] no_context at c0433ded
     #3 [f06a9e64] bad_area_nosemaphore at c043401a
     #4 [f06a9e6c] __do_page_fault at c0434493
     #5 [f06a9eec] do_page_fault at c083eb45
     #6 [f06a9f04] error_code (via page_fault) at c083c5d5
        EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP: 00000000
        DS:  007b     ESI: 9e201000 ES:  007b     EDI: 01fb4700 GS:  00e0
        CS:  0060     EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
     #7 [f06a9f38] _spin_lock at c083bc14
     #8 [f06a9f44] sys_mincore at c0507b7d
     #9 [f06a9fb0] system_call at c083becd
                             start           len
        EAX: ffffffda  EBX: 9e200000  ECX: 00001000  EDX: 6228537f
        DS:  007b      ESI: 00000000  ES:  007b      EDI: 003d0f00
        SS:  007b      ESP: 62285354  EBP: 62285388  GS:  0033
        CS:  0073      EIP: 00291416  ERR: 000000da  EFLAGS: 00000286
    
    This should be a longstanding bug affecting x86 32bit PAE without THP.
    Only archs with 64bit large pmd_t and 32bit unsigned long should be
    affected.
    
    With THP enabled, the barrier() in pmd_none_or_trans_huge_or_clear_bad()
    would partly hide the bug when the pmd transitions from none to stable,
    by forcing a re-read of the *pmd in pmd_offset_map_lock. But when THP is
    enabled a new set of problems arises from the fact that the pmd could
    then transition freely between any of the none, pmd_trans_huge or
    pmd_trans_stable states. So making the barrier in
    pmd_none_or_trans_huge_or_clear_bad() unconditional isn't a good idea
    and it would be a flaky solution.
    
    This should be fully fixed by introducing a pmd_read_atomic that reads
    the pmd in order with THP disabled, or by reading the pmd atomically
    with cmpxchg8b with THP enabled.
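
    As a rough user-space illustration of the "read in order" variant (not
    the kernel implementation), the sketch below reads the low word first
    and only reads the high word if the low word is non-zero. The names
    struct pmd_words and pmd_read_ordered are hypothetical, and the sketch
    assumes the THP-disabled case where an entry no longer changes once its
    low word is non-zero.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit PAE pmd_t seen as two 32-bit words. */
struct pmd_words {
    _Atomic uint32_t low;   /* present bit, flags and low address bits */
    _Atomic uint32_t high;  /* address bits above 4G */
};

/*
 * Read the low word first. A zero low word is treated as "none" and the
 * high word is never consulted. Assumption: with THP disabled, an entry
 * whose low word is non-zero is stable, so the later high read matches it.
 */
static uint64_t pmd_read_ordered(struct pmd_words *p)
{
    uint64_t val = atomic_load_explicit(&p->low, memory_order_relaxed);

    if (val) {
        /* Keep the high-word load from being ordered before the low one. */
        atomic_thread_fence(memory_order_acquire);
        val |= (uint64_t)atomic_load_explicit(&p->high,
                                              memory_order_relaxed) << 32;
    }
    return val;
}
```

    With THP enabled this ordering argument no longer holds, which is why
    the message above falls back to a single atomic 64-bit read there.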
    
    Luckily this new race condition only triggers in the places that must
    already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix
    is localized there but this bug is not related to THP.
    
    NOTE: this can only trigger on x86 32bit systems with PAE enabled and
    more than 4G of ram. Otherwise the high part of the pmd is zero at all
    times, so it can never be truncated, which in turn hides the SMP race.
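
    To make the 4G condition concrete, here is a small sketch that checks
    whether a pmd entry pointing at a given page table page has a non-zero
    high word. The helper name pmd_high_word_in_use is hypothetical, the
    0x067 flag bits are borrowed from the example entry quoted further
    down, and flag bits living in the upper word are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Build an illustrative pmd entry from the physical address of a page
 * table page and report whether its upper 32 bits are non-zero. Only
 * such entries can be observed "half read" by two 32-bit loads.
 */
static bool pmd_high_word_in_use(uint64_t pte_page_phys)
{
    uint64_t entry = pte_page_phys | 0x067; /* flag bits, illustrative */

    return (uint32_t)(entry >> 32) != 0;
}
```

    A page table page below 4G (for instance at 0xfda38000) gives a zero
    high word, while one at 0x3fda38000 does not.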
    
    This bug was discovered and fully debugged by Ulrich, quote:
    
    ----
    [..]
    pmd_none_or_trans_huge_or_clear_bad() loads the content of the PMD into
    edx and eax.
    
        static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
        {
                /* depend on compiler for an atomic pmd read */
                pmd_t pmdval = *pmd;
    
                                    // edi = pmd pointer
    0xc0507a74 <sys_mincore+548>:   mov    0x8(%esp),%edi
    ...
                                    // edx = PTE page table high address
    0xc0507a84 <sys_mincore+564>:   mov    0x4(%edi),%edx
    ...
                                    // eax = PTE page table low address
    0xc0507a8e <sys_mincore+574>:   mov    (%edi),%eax
    
    [..]
    
    Please note that the PMD is not read atomically. These are two "mov"
    instructions where the high order bits of the PMD entry are fetched
    first. Hence, the above machine code is prone to the following race.
    
    -  The PMD entry {high|low} is 0x0000000000000000.
       The "mov" at 0xc0507a84 loads 0x00000000 into edx.
    
    -  A page fault (on another CPU) sneaks in between the two "mov"
       instructions and instantiates the PMD.
    
    -  The PMD entry {high|low} is now 0x00000003fda38067.
       The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
    ----
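
    The value the racing reader ends up with can be worked out from the
    numbers quoted above. The sketch below is only an arithmetic
    illustration; torn_pmd_read is a hypothetical helper that takes the
    high word from the entry as it was before the update and the low word
    from the entry as it is after the update.

```c
#include <stdint.h>

/*
 * Model the two separate "mov" loads: the high 32 bits are sampled from
 * the old entry, the low 32 bits from the new entry.
 */
static uint64_t torn_pmd_read(uint64_t before, uint64_t after)
{
    uint32_t high = (uint32_t)(before >> 32);  /* first load, old value */
    uint32_t low  = (uint32_t)after;           /* second load, new value */

    return ((uint64_t)high << 32) | low;
}
```

    For before = 0x0000000000000000 and after = 0x00000003fda38067 this
    yields 0x00000000fda38067: a present-looking entry whose page table
    address has lost its bits above 4G.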
    Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
    Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Larry Woodman <lwoodman@redhat.com>
    Cc: Petr Matousek <pmatouse@redhat.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
pgtable-3level.h 5.65 KB