    mm: thp: kvm: fix memory corruption in KVM with THP enabled · 127393fb
    Andrea Arcangeli authored
    After the THP refcounting change, obtaining a compound page from
    get_user_pages() no longer allows us to assume the entire compound page
    is immediately mappable from a secondary MMU.
    
    A secondary MMU doesn't want to call get_user_pages() more than once for
    each compound page, in order to know if it can map the whole compound
    page.  So a secondary MMU needs to know from a single get_user_pages()
    invocation when it can map immediately the entire compound page to avoid
    a flood of unnecessary secondary MMU faults and spurious
    atomic_inc()/atomic_dec() (pages don't have to be pinned by MMU notifier
    users).
    
    Ideally instead of the page->_mapcount < 1 check, get_user_pages()
    should return the granularity of the "page" mapping in the "mm" passed
    to get_user_pages().  However it's a non-trivial change to pass the "pmd"
    status belonging to the "mm" walked by get_user_pages up the stack (up
    to the caller of get_user_pages).  So the fix just checks if there is
    not a single pte mapping on the page returned by get_user_pages, and in
    turn if the caller can assume that the whole compound page is mapped in
    the current "mm" (in a pmd_trans_huge()).  In such a case the entire
    compound page is safe to map into the secondary MMU without additional
    get_user_pages() calls on the surrounding tail/head pages.  In addition
    to being faster, not having to run other get_user_pages() calls also
    reduces the memory footprint of the secondary MMU fault in case the pmd
    split happened as a result of memory pressure.
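    
    The check described above can be sketched in a small userspace model.
    This is only an illustration under stated assumptions: struct toy_page,
    its fields and both helper functions are hypothetical names invented
    here, and the plain pte_mappings counter is a simplification, not the
    kernel's struct page or its _mapcount representation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy userspace model, NOT the kernel's struct page. All names here are
 * hypothetical and exist only for illustration.
 */
struct toy_page {
	bool compound;      /* page belongs to a compound (huge) page */
	int pte_mappings;   /* number of pte-level mappings of this page */
};

/*
 * Returns true when the caller may map the whole compound page at once:
 * the page is part of a compound page and no pte maps it, so (in this
 * model) it can only be mapped through a huge pmd.
 */
static bool toy_can_map_whole_compound(const struct toy_page *page)
{
	assert(page != NULL);
	return page->compound && page->pte_mappings == 0;
}

/* Mapping granularity a secondary MMU would pick, in base pages. */
static unsigned long toy_secondary_map_pages(const struct toy_page *page,
					     unsigned long pages_per_huge)
{
	return toy_can_map_whole_compound(page) ? pages_per_huge : 1;
}
```

    In this model, with 512 base pages per huge page (as for a 2 MiB huge
    page built from 4 KiB pages), a compound page with no pte mapping
    yields 512, while the same page after a pmd split and a single pte
    fault yields 1, so the secondary MMU falls back to mapping one base
    page at a time.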
    
    Without this fix, after a MADV_DONTNEED (as invoked by QEMU during
    postcopy live migration or ballooning) or after generic swapping (with a
    failure in split_huge_page() that would only result in pmd splitting and
    not a physical page split), KVM would map the whole compound page into
    the shadow pagetables, even though regular faults or userfaults (like
    UFFDIO_COPY) may map regular pages into the primary MMU as a result of
    the pte faults, leading to the guest mode and userland mode going out of
    sync and not working on the same memory at all times.
    
    Any other secondary MMU notifier manager (KVM is just one of the many
    MMU notifier users) will need the same information if it doesn't want to
    run a flood of get_user_pages_fast and it can support multiple
    granularities in the secondary MMU mappings, so I think exposing it
    beyond KVM is justified.
    
    The other option would be to move transparent_hugepage_adjust to
    mm/huge_memory.c but that currently has all kinds of KVM data structures
    in it, so it's definitely not cut-and-paste work, and I couldn't do a
    fix as clean as this one for 4.6.
    
    Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
    Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
    Cc: "Li, Liang Z" <liang.z.li@intel.com>
    Cc: Amit Shah <amit.shah@redhat.com>
    Cc: Paolo Bonzini <pbonzini@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>