    mm/userfaultfd: fix memory corruption due to writeprotect · 6ce64428
    Nadav Amit authored
    Userfaultfd self-test fails occasionally, indicating a memory corruption.
    
    Analysis shows that this is a real bug: mwriteprotect_range() takes
    mmap_lock only for read and defers TLB flushes, and wp_page_copy() does
    not sufficiently account for concurrent deferred TLB flushes.  Although
    the PTE is flushed from the TLBs in wp_page_copy(), this flush takes
    place after the copy has already been performed, so the page can still
    change between the time of the copy and the time the PTE is flushed.
    
    To make matters worse, memory-unprotection using userfaultfd also poses a
    problem.  Although memory unprotection is logically a promotion of PTE
    permissions, and therefore should not require a TLB flush, the current
    userfaultfd code might actually cause a demotion of the architectural PTE
    permission: when userfaultfd_writeprotect() unprotects a memory region, it
    unintentionally *clears* the RW-bit if it was already set.  Note that
    unprotecting a PTE that is not write-protected is a valid use-case: the
    userfaultfd monitor might ask to unprotect a region that holds both
    write-protected and write-unprotected PTEs.
    
    The scenario that happens in selftests/vm/userfaultfd is as follows:
    
    cpu0				cpu1			cpu2
    ----				----			----
    							[ Writable PTE
    							  cached in TLB ]
    userfaultfd_writeprotect()
    [ write-*unprotect* ]
    mwriteprotect_range()
    mmap_read_lock()
    change_protection()
    
    change_protection_range()
    ...
    change_pte_range()
    [ *clear* “write”-bit ]
    [ defer TLB flushes ]
    				[ page-fault ]
    				...
    				wp_page_copy()
    				 cow_user_page()
    				  [ copy page ]
    							[ write to old
    							  page ]
    				...
    				 set_pte_at_notify()
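
    The interleaving above can be replayed in program order with a small
    userspace model.  This is only an illustrative sketch, not kernel code:
    struct toy_state, TOY_PAGE_SIZE and toy_replay_unprotect_race() are
    hypothetical names, and the "TLB" is just a cached pointer plus a cached
    write bit.  It shows why the write on cpu2 ends up in the old page and
    is missing from the page that is finally mapped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 8

/* Toy state: one virtual page, its PTE, and cpu2's cached translation. */
struct toy_state {
	char old_page[TOY_PAGE_SIZE];
	char new_page[TOY_PAGE_SIZE];
	char *pte_page;		/* page the PTE currently points to */
	int pte_writable;	/* write bit in the PTE */
	char *tlb_page;		/* page cpu2's TLB entry points to, or NULL */
	int tlb_writable;	/* write bit cached in cpu2's TLB entry */
};

/*
 * Replay the first interleaving step by step.
 * Returns 1 if cpu2's write is missing from the page that is mapped at
 * the end (the write was lost), 0 otherwise.
 */
static int toy_replay_unprotect_race(void)
{
	struct toy_state s;

	memset(&s, 0, sizeof(s));
	memset(s.old_page, 'A', TOY_PAGE_SIZE);
	s.pte_page = s.old_page;
	s.pte_writable = 1;

	/* cpu2: writable PTE cached in TLB */
	s.tlb_page = s.pte_page;
	s.tlb_writable = s.pte_writable;

	/* cpu0: change_pte_range() clears the write bit, flush is deferred */
	s.pte_writable = 0;

	/* cpu1: page fault, wp_page_copy() -> cow_user_page() copies the page */
	assert(!s.pte_writable);
	memcpy(s.new_page, s.old_page, TOY_PAGE_SIZE);

	/* cpu2: the stale writable TLB entry still lets the write through */
	if (s.tlb_page && s.tlb_writable)
		s.tlb_page[0] = 'B';

	/* cpu1: set_pte_at_notify() installs the copy; the TLB is flushed */
	s.pte_page = s.new_page;
	s.pte_writable = 1;
	s.tlb_page = NULL;

	/* The mapped page is the copy, which never saw cpu2's write. */
	return s.pte_page[0] != 'B';
}
```

    Calling toy_replay_unprotect_race() from a test main() reports the lost
    write: the copy was taken before cpu2 wrote through its stale entry.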
    
    A similar scenario can happen:
    
    cpu0		cpu1		cpu2		cpu3
    ----		----		----		----
    						[ Writable PTE
    				  		  cached in TLB ]
    userfaultfd_writeprotect()
    [ write-protect ]
    [ deferred TLB flush ]
    		userfaultfd_writeprotect()
    		[ write-unprotect ]
    		[ deferred TLB flush ]
    				[ page-fault ]
    				wp_page_copy()
    				 cow_user_page()
    				 [ copy page ]
    				 ...		[ write to page ]
    				set_pte_at_notify()
    
    This race has existed since commit 292924b2 ("userfaultfd: wp: apply
    _PAGE_UFFD_WP bit").  Yet, as Yu Zhao pointed out, these races became
    apparent only with commit 09854ba9 ("mm: do_wp_page() simplification"),
    which made wp_page_copy() more likely to take place, specifically if
    page_count(page) > 1.
    
    To resolve the aforementioned races, check whether there are pending
    flushes on uffd-write-protected VMAs, and if there are, perform a flush
    before doing the COW.
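
    The idea of the fix can be sketched with a small userspace model.  This
    is an assumption-laden illustration, not the patch itself: struct
    model_mm, model_must_flush_before_cow() and model_replay() are
    hypothetical names, and the pending-flush counter only stands in for
    the kernel's own bookkeeping of deferred flushes (the kernel has
    mm_tlb_flush_pending() for this kind of query, but this message does
    not say which helper the patch uses).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 8

/* Toy address space: one page, its PTE, one remote TLB entry. */
struct model_mm {
	char old_page[MODEL_PAGE_SIZE];
	char new_page[MODEL_PAGE_SIZE];
	char *pte_page;		/* page the PTE currently points to */
	int pte_writable;	/* write bit in the PTE */
	char *tlb_page;		/* remote CPU's cached translation, or NULL */
	int flush_pending;	/* number of deferred TLB flushes */
	int vma_uffd_wp;	/* VMA is registered for uffd write-protect */
};

/* Hypothetical predicate: flush first if a uffd-wp VMA has pending flushes. */
static int model_must_flush_before_cow(const struct model_mm *mm)
{
	return mm->vma_uffd_wp && mm->flush_pending > 0;
}

/*
 * Replay the race with or without the check.
 * Returns 1 if the remote write is lost, 0 if it survives.
 */
static int model_replay(int apply_fix)
{
	struct model_mm mm;
	int write_deferred = 0;

	memset(&mm, 0, sizeof(mm));
	memset(mm.old_page, 'A', MODEL_PAGE_SIZE);
	mm.pte_page = mm.old_page;
	mm.pte_writable = 1;
	mm.vma_uffd_wp = 1;
	mm.tlb_page = mm.pte_page;	/* cpu2 caches a writable translation */

	/* cpu0: write bit cleared, TLB flush deferred */
	mm.pte_writable = 0;
	mm.flush_pending++;

	/* cpu1: fault handler; with the fix, flush before copying */
	if (apply_fix && model_must_flush_before_cow(&mm))
		mm.tlb_page = NULL;
	memcpy(mm.new_page, mm.old_page, MODEL_PAGE_SIZE);

	/* cpu2: write hits the stale entry, or faults and retries later */
	if (mm.tlb_page)
		mm.tlb_page[0] = 'B';
	else
		write_deferred = 1;

	/* cpu1: install the copy; cpu0: the deferred flush completes */
	mm.pte_page = mm.new_page;
	mm.pte_writable = 1;
	mm.tlb_page = NULL;
	mm.flush_pending--;

	/* cpu2's retried write now lands in the page that is mapped */
	if (write_deferred) {
		assert(mm.pte_writable);
		mm.pte_page[0] = 'B';
	}

	return mm.pte_page[0] != 'B';
}
```

    In this model, model_replay(0) loses the write, while model_replay(1)
    keeps it, because the stale translation is gone before the copy starts
    and cpu2 has to fault and retry against the new page.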
    
    Further optimizations will follow to avoid unnecessary PTE
    write-protection and TLB flushes during uffd-write-unprotect.
    
    Link: https://lkml.kernel.org/r/20210304095423.3825684-1-namit@vmware.com
    Fixes: 09854ba9 ("mm: do_wp_page() simplification")
    Signed-off-by: Nadav Amit <namit@vmware.com>
    Suggested-by: Yu Zhao <yuzhao@google.com>
    Reviewed-by: Peter Xu <peterx@redhat.com>
    Tested-by: Peter Xu <peterx@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Pavel Emelyanov <xemul@openvz.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Will Deacon <will@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: <stable@vger.kernel.org>	[5.9+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>