    mm: fix memory ordering for mm_lock_seq and vm_lock_seq · b1f02b95
    Jann Horn authored
    mm->mm_lock_seq effectively functions as a read/write lock; therefore it
    must be used with acquire/release semantics.
    
    A specific example is the interaction between userfaultfd_register() and
    lock_vma_under_rcu().
    
    userfaultfd_register() does the following from the point where it changes
    a VMA's flags to the point where concurrent readers are permitted again
    (in a simple scenario where only a single private VMA is accessed and no
    merging/splitting is involved):
    
    userfaultfd_register
      userfaultfd_set_vm_flags
        vm_flags_reset
          vma_start_write
            down_write(&vma->vm_lock->lock)
            vma->vm_lock_seq = mm_lock_seq [marks VMA as busy]
            up_write(&vma->vm_lock->lock)
          vm_flags_init
            [sets VM_UFFD_* in __vm_flags]
      vma->vm_userfaultfd_ctx.ctx = ctx
      mmap_write_unlock
        vma_end_write_all
          WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1) [unlocks VMA]
    
    There are no memory barriers between the __vm_flags update and the
    mm->mm_lock_seq update that unlocks the VMA, so the unlock can be
    reordered above the vm_flags_init() call, which means that, from the
    perspective of a concurrent reader, a VMA can be marked as a userfaultfd
    VMA while it is not VMA-locked.  That's bad; we definitely need a
    store-release for the unlock operation.
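    
    As a concrete sketch of the release side (using the existing
    vma_end_write_all() helper; treat this as illustrative rather than the
    exact merged diff):
    
    static inline void vma_end_write_all(struct mm_struct *mm)
    {
            mmap_assert_write_locked(mm);
            /*
             * Nobody can modify mm->mm_lock_seq concurrently because we hold
             * the mmap_lock for writing.  RELEASE semantics ensure that all
             * preceding stores to the VMA (vm_flags, vm_userfaultfd_ctx.ctx,
             * ...) are visible before the store that unlocks it.
             * Pairs with the ACQUIRE in vma_start_read().
             */
            smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
    }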
    
    The non-atomic write to vma->vm_lock_seq in vma_start_write() is mostly
    fine because all accesses to vma->vm_lock_seq that matter are always
    protected by the VMA lock.  There is a racy read in vma_start_read()
    though that can tolerate false-positives, so we should be using
    WRITE_ONCE() to keep things tidy and data-race-free (including for KCSAN).
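    
    For illustration, the write-locking path could then look roughly like
    this (a sketch of vma_start_write(); the open-coded "already locked"
    check stands in for whatever helper the tree uses):
    
    static inline void vma_start_write(struct vm_area_struct *vma)
    {
            int mm_lock_seq;
    
            mmap_assert_write_locked(vma->vm_mm);
            mm_lock_seq = vma->vm_mm->mm_lock_seq;
            /* Already write-locked in this mmap_lock cycle, nothing to do. */
            if (vma->vm_lock_seq == mm_lock_seq)
                    return;
    
            down_write(&vma->vm_lock->lock);
            /*
             * WRITE_ONCE() pairs with the lockless early-bailout read in
             * vma_start_read().  That read tolerates false positives, so no
             * stronger ordering is needed; the marked access just keeps the
             * store data-race-free (and KCSAN quiet).
             */
            WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
            up_write(&vma->vm_lock->lock);
    }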
    
    On the other side, lock_vma_under_rcu() works as follows in the region
    relevant to locking and the userfaultfd check:
    
    lock_vma_under_rcu
      vma_start_read
        vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq) [early bailout]
        down_read_trylock(&vma->vm_lock->lock)
        vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq) [main check]
      userfaultfd_armed
        checks vma->vm_flags & __VM_UFFD_FLAGS
    
    Here, the interesting aspect is how far down the mm->mm_lock_seq read can
    be reordered - if this read is reordered down below the vma->vm_flags
    access, this could cause lock_vma_under_rcu() to partly operate on
    information that was read while the VMA was supposed to be locked.  To
    prevent this kind of downwards bleeding of the mm->mm_lock_seq read, we
    need to read it with a load-acquire.
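    
    The reader side then becomes, again as a sketch (the merged code may
    differ in detail), an early bailout with a plain READ_ONCE() and a main
    check with a load-acquire, so that everything lock_vma_under_rcu() reads
    afterwards - including vma->vm_flags in userfaultfd_armed() - is ordered
    after the mm_lock_seq read:
    
    static inline bool vma_start_read(struct vm_area_struct *vma)
    {
            /*
             * Early bailout before taking the lock.  The result is only a
             * hint and false positives are tolerated, so READ_ONCE()
             * without ACQUIRE semantics is enough here.
             */
            if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq))
                    return false;
    
            if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
                    return false;
    
            /*
             * Main check, done with the VMA read lock held.  ACQUIRE
             * semantics order this mm_lock_seq read before all later
             * accesses to the VMA (such as the vm_flags check in
             * userfaultfd_armed()) and pair with the RELEASE in
             * vma_end_write_all().
             */
            if (unlikely(vma->vm_lock_seq == smp_load_acquire(&vma->vm_mm->mm_lock_seq))) {
                    up_read(&vma->vm_lock->lock);
                    return false;
            }
            return true;
    }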
    
    Some of the comment wording is based on suggestions by Suren.
    
    BACKPORT WARNING: One of the functions changed by this patch (which I've
    written against Linus' tree) is vma_try_start_write(), but this function
    no longer exists in mm/mm-everything.  I don't know whether the merged
    version of this patch will be ordered before or after the patch that
    removes vma_try_start_write().  If you're backporting this patch to a tree
    with vma_try_start_write(), make sure this patch changes that function.
    
    Link: https://lkml.kernel.org/r/20230721225107.942336-1-jannh@google.com
    Fixes: 5e31275c ("mm: add per-VMA lock and helper functions to control it")
    Signed-off-by: Jann Horn <jannh@google.com>
    Reviewed-by: Suren Baghdasaryan <surenb@google.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>