• Mel Gorman's avatar
    mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries · f1181047
    Mel Gorman authored
    commit 3ea27719 upstream.
    
    Stable note for 4.4: The upstream patch patches madvise(MADV_FREE) but 4.4
    	does not have support for that feature. The changelog is left
    	as-is but the hunk related to madvise is omitted from the backport.
    
    Nadav Amit identified a theoritical race between page reclaim and
    mprotect due to TLB flushes being batched outside of the PTL being held.
    
    He described the race as follows:
    
            CPU0                            CPU1
            ----                            ----
                                            user accesses memory using RW PTE
                                            [PTE now cached in TLB]
            try_to_unmap_one()
            ==> ptep_get_and_clear()
            ==> set_tlb_ubc_flush_pending()
                                            mprotect(addr, PROT_READ)
                                            ==> change_pte_range()
                                            ==> [ PTE non-present - no flush ]
    
                                            user writes using cached RW PTE
            ...
    
            try_to_unmap_flush()
    
    The same type of race exists for reads when protecting for PROT_NONE and
    also exists for operations that can leave an old TLB entry behind such
    as munmap, mremap and madvise.
    
    For some operations like mprotect, it's not necessarily a data integrity
    issue but it is a correctness issue as there is a window where an
    mprotect that limits access still allows access.  For munmap, it's
    potentially a data integrity issue although the race is massive as an
    munmap, mmap and return to userspace must all complete between the
    window when reclaim drops the PTL and flushes the TLB.  However, it's
    theoritically possible so handle this issue by flushing the mm if
    reclaim is potentially currently batching TLB flushes.
    
    Other instances where a flush is required for a present pte should be ok
    as either the page lock is held preventing parallel reclaim or a page
    reference count is elevated preventing a parallel free leading to
    corruption.  In the case of page_mkclean there isn't an obvious path
    that userspace could take advantage of without using the operations that
    are guarded by this patch.  Other users such as gup as a race with
    reclaim looks just at PTEs.  huge page variants should be ok as they
    don't race with reclaim.  mincore only looks at PTEs.  userfault also
    should be ok as if a parallel reclaim takes place, it will either fault
    the page back in or read some of the data before the flush occurs
    triggering a fault.
    
    Note that a variant of this patch was acked by Andy Lutomirski but this
    was for the x86 parts on top of his PCID work which didn't make the 4.13
    merge window as expected.  His ack is dropped from this version and
    there will be a follow-on patch on top of PCID that will include his
    ack.
    
    [akpm@linux-foundation.org: tweak comments]
    [akpm@linux-foundation.org: fix spello]
    Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.deReported-by: default avatarNadav Amit <nadav.amit@gmail.com>
    Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
    Cc: Andy Lutomirski <luto@kernel.org>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    f1181047
rmap.c 48.1 KB