• Jann Horn's avatar
    userfaultfd: fix checks for huge PMDs · 71c186ef
    Jann Horn authored
    Patch series "userfaultfd: fix races around pmd_trans_huge() check", v2.
    
    The pmd_trans_huge() code in mfill_atomic() is wrong in three different
    ways depending on kernel version:
    
    1. The pmd_trans_huge() check is racy and can lead to a BUG_ON() (if you hit
       the right two race windows) - I've tested this in a kernel build with
       some extra mdelay() calls. See the commit message for a description
       of the race scenario.
       On older kernels (before 6.5), I think the same bug can even
       theoretically lead to accessing transhuge page contents as a page table
       if you hit the right 5 narrow race windows (I haven't tested this case).
    2. As pointed out by Qi Zheng, pmd_trans_huge() is not sufficient for
       detecting PMDs that don't point to page tables.
       On older kernels (before 6.5), you'd just have to win a single fairly
       wide race to hit this.
       I've tested this on 6.1 stable by racing migration (with a mdelay()
       patched into try_to_migrate()) against UFFDIO_ZEROPAGE - on my x86
       VM, that causes a kernel oops in ptlock_ptr().
    3. On newer kernels (>=6.5), for shmem mappings, khugepaged is allowed
       to yank page tables out from under us (though I haven't tested that),
       so I think the BUG_ON() checks in mfill_atomic() are just wrong.
    
    I decided to write two separate fixes for these (one fix for bugs 1+2, one
    fix for bug 3), so that the first fix can be backported to kernels
    affected by bugs 1+2.
    
    
    This patch (of 2):
    
    This fixes two issues.
    
    I discovered that the following race can occur:
    
      mfill_atomic                other thread
      ============                ============
                                  <zap PMD>
      pmdp_get_lockless() [reads none pmd]
      <bail if trans_huge>
      <if none:>
                                  <pagefault creates transhuge zeropage>
        __pte_alloc [no-op]
                                  <zap PMD>
      <bail if pmd_trans_huge(*dst_pmd)>
      BUG_ON(pmd_none(*dst_pmd))
    
    I have experimentally verified this in a kernel with extra mdelay() calls;
    the BUG_ON(pmd_none(*dst_pmd)) triggers.
    
    On kernels newer than commit 0d940a9b ("mm/pgtable: allow
    pte_offset_map[_lock]() to fail"), this can't lead to anything worse than
    a BUG_ON(), since the page table access helpers are actually designed to
    deal with page tables concurrently disappearing; but on older kernels
    (<=6.4), I think we could probably theoretically race past the two
    BUG_ON() checks and end up treating a hugepage as a page table.
    
    The second issue is that, as Qi Zheng pointed out, there are other types
    of huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs
    (in particular, migration PMDs).
    
    On <=6.4, this is worse than the first issue: If mfill_atomic() runs on a
    PMD that contains a migration entry (which just requires winning a single,
    fairly wide race), it will pass the PMD to pte_offset_map_lock(), which
    assumes that the PMD points to a page table.
    
    Breakage follows: First, the kernel tries to take the PTE lock (which will
    crash or maybe worse if there is no "struct page" for the address bits in
    the migration entry PMD - I think at least on X86 there usually is no
    corresponding "struct page" thanks to the PTE inversion mitigation, amd64
    looks different).
    
    If that didn't crash, the kernel would next try to write a PTE into what
    it wrongly thinks is a page table.
    
    As part of fixing these issues, get rid of the check for pmd_trans_huge()
    before __pte_alloc() - that's redundant, we're going to have to check for
    that after the __pte_alloc() anyway.
    
    Backport note: pmdp_get_lockless() is pmd_read_atomic() in older kernels.
    
    Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-0-5efa61078a41@google.com
    Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-1-5efa61078a41@google.com
    Fixes: c1a4de99 ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
    Signed-off-by: default avatarJann Horn <jannh@google.com>
    Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Pavel Emelyanov <xemul@virtuozzo.com>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    71c186ef
userfaultfd.c 45.8 KB