    mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups
    In commit 34488399 ("mm/madvise: add file and shmem support to
    MADV_COLLAPSE") we make the following change to find_pmd_or_thp_or_none():
    
    	-       if (!pmd_present(pmde))
    	-               return SCAN_PMD_NULL;
    	+       if (pmd_none(pmde))
    	+               return SCAN_PMD_NONE;
    
    This was for use by MADV_COLLAPSE file/shmem codepaths, where
    MADV_COLLAPSE might identify a pte-mapped hugepage, only to have
    khugepaged race in, free the pte table, and clear the pmd.  Such
    codepaths include:
    
    A) If we find a suitably-aligned compound page of order HPAGE_PMD_ORDER
       already in the pagecache.
    B) In retract_page_tables(), if we fail to grab mmap_lock for the target
       mm/address.
    
    In these cases, collapse_pte_mapped_thp() really does expect a none (not
    just !present) pmd, and we want to identify that case separately from the
    case where no pmd is found, or where the pmd is bad (of course, many
    things could happen once we drop mmap_lock, and the pmd could plausibly
    undergo multiple transitions due to intervening faults, splits, etc).
    Regardless, the code is prepared to install a huge-pmd only when the
    existing pmd entry is either a genuine pte-table-mapping-pmd or the
    none-pmd.
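
    To make that expectation concrete, here is a hypothetical caller-side
    sketch -- illustrative only, not the verbatim khugepaged.c code -- using
    the SCAN_* result codes of the scan/collapse paths:

    	result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
    	switch (result) {
    	case SCAN_SUCCEED:	/* genuine pte-table-mapping-pmd: collapse ptes */
    		break;
    	case SCAN_PMD_NONE:	/* khugepaged raced in, freed the pte table and
    				 * cleared the pmd: just install the huge-pmd */
    		break;
    	case SCAN_PMD_MAPPED:	/* already backed by a huge-pmd: nothing to do */
    	default:		/* no pmd, !present, or bad: bail out */
    		return result;
    	}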
    
    However, the commit introduces a logical hole; namely, that we've allowed
    !none- && !huge- && !bad-pmds to be classified as genuine
    pte-table-mapping-pmds.  One such example that could leak through is a
    swap entry.  The pmd values aren't checked again before use in
    pte_offset_map_lock(), which is expecting nothing less than a genuine
    pte-table-mapping-pmd.
    
    We want to put back the !pmd_present() check (below the pmd_none()
    check), but need to be careful to deal with subtleties in pmd transitions
    and their treatment by various architectures.
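
    A minimal sketch of the intended ordering (simplified; it assumes pmd
    points at the entry returned by mm_find_pmd() and a single lockless read
    via pmdp_get_lockless(), as in recent kernels):

    	pmd_t pmde = pmdp_get_lockless(pmd);	/* read once; test only this value */

    	if (pmd_none(pmde))		/* cases A/B above: pte table already freed */
    		return SCAN_PMD_NONE;
    	if (!pmd_present(pmde))		/* swap/migration entries now stop here */
    		return SCAN_PMD_NULL;
    	if (pmd_trans_huge(pmde))	/* huge, or mid-split (see case 1b below) */
    		return SCAN_PMD_MAPPED;
    	if (pmd_bad(pmde))
    		return SCAN_PMD_NULL;
    	return SCAN_SUCCEED;		/* genuine pte-table-mapping-pmd */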
    
    The issue is that __split_huge_pmd_locked() temporarily clears the
    present bit (or otherwise marks the entry as invalid), but pmd_present()
    and pmd_trans_huge() still need to return true while the pmd is in this
    transitory state.  For example, x86's pmd_present() also checks the
    _PAGE_PSE bit, riscv's version also checks the _PAGE_LEAF bit, and arm64
    also checks a PMD_PRESENT_INVALID bit.
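
    For reference, x86's definition (paraphrased from
    arch/x86/include/asm/pgtable.h) looks roughly like:

    	static inline int pmd_present(pmd_t pmd)
    	{
    		/*
    		 * _PAGE_PSE keeps this returning true while
    		 * __split_huge_pmd_locked() has the entry temporarily
    		 * marked non-present.
    		 */
    		return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
    	}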
    
    Covering all 4 cases for x86, with all checks done on the same pmd value
    (x86's pmd_trans_huge() is sketched after this list for reference):
    
    1) pmd_present() && pmd_trans_huge()
       All we actually know here is that the PSE bit is set. Either:
       a) We aren't racing with __split_huge_page(), and PRESENT or PROTNONE
          is set.
          => huge-pmd
       b) We are currently racing with __split_huge_page().  The danger here
          is that we proceed as if we have a huge-pmd, but really we are
          looking at a pte-table-mapping-pmd.  So, what is the risk of this
          danger?
    
          The only relevant path is:
    
    	madvise_collapse() -> collapse_pte_mapped_thp()
    
          Where we might just incorrectly report back "success", when really
          the memory isn't pmd-backed.  This is fine, since a split could
          happen immediately after an (actually) successful madvise_collapse().
          So, it should be safe to just assume huge-pmd here.
    
    2) pmd_present() && !pmd_trans_huge()
       Either:
       a) PSE not set and either PRESENT or PROTNONE is.
          => pte-table-mapping-pmd (or PROT_NONE)
       b) devmap.  This routine can be called immediately after
          unlocking/locking mmap_lock -- or called with no locks held (see
          khugepaged_scan_mm_slot()), so previous VMA checks have since been
          invalidated.
    
    3) !pmd_present() && pmd_trans_huge()
      Not possible.
    
    4) !pmd_present() && !pmd_trans_huge()
      Neither PRESENT nor PROTNONE set
      => not present
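
    The split between cases 1 and 2 comes from x86's pmd_trans_huge(), again
    paraphrased from arch/x86/include/asm/pgtable.h:

    	static inline int pmd_trans_huge(pmd_t pmd)
    	{
    		/* PSE set, but not a devmap pmd (devmap is case 2b above) */
    		return (pmd_val(pmd) & (_PAGE_PSE | _PAGE_DEVMAP)) == _PAGE_PSE;
    	}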
    
    I've checked all archs that implement pmd_trans_huge() (arm64, riscv,
    powerpc, loongarch, x86, mips, s390) and this logic roughly translates
    (though devmap treatment is unique to x86 and powerpc, and (3) doesn't
    necessarily hold in general -- but that doesn't matter since
    !pmd_present() always takes the failure path).
    
    Also, add a comment above find_pmd_or_thp_or_none() to help future
    travelers reason about the validity of the code; namely, the possible
    mutations that might happen out from under us, depending on how mmap_lock
    is held (if at all).
    
    Link: https://lkml.kernel.org/r/20230125225358.2576151-1-zokeefe@google.com
    Fixes: 34488399 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
    Signed-off-by: Zach O'Keefe <zokeefe@google.com>
    Reported-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>