    mm/uffd: UFFD_FEATURE_WP_UNPOPULATED · 2bad466c
    Peter Xu authored
    Patch series "mm/uffd: Add feature bit UFFD_FEATURE_WP_UNPOPULATED", v4.
    
    The new feature bit makes anonymous memory act the same as file memory on
    userfaultfd-wp in that it'll also wr-protect none ptes.
    
    It can be useful in two cases:
    
    (1) Uffd-wp apps that need to wr-protect none ptes, like QEMU snapshot,
        can replace pre-faulting by enabling this flag, which speeds up
        protection.

    (2) It helps to implement the async uffd-wp mode that Muhammad is
        working on [1].
    
    It's debatable whether this is the ideal solution because, with the
    new feature bit set, wr-protecting a none pte needs to pre-populate the
    pgtables down to the last level (PAGE_SIZE).  But it seems fine so far
    for either purpose above, so we can leave optimizations for later.
    
    The series brings pte markers to anonymous memory too.  There are some
    changes to the common mm code path in the 1st patch; it would be great to
    have some eyes on them, but hopefully they're still relatively
    straightforward.
    
    
    This patch (of 2):
    
    This is a new feature that controls how uffd-wp handles none ptes.  When
    it's set, the kernel will handle anonymous memory the same way as file
    memory, by allowing the user to wr-protect unpopulated ptes.
    
    File memory handles none ptes consistently by allowing them to be
    wr-protected, because the kernel cannot know whether the page cache
    exists or not.  For anonymous memory it was not as consistent, because we
    used to assume that we don't need protections on none ptes or known zero
    pages.
    
    One use case of such a feature bit is VM live snapshot: without
    wr-protecting empty ptes, the snapshot can contain random rubbish in the
    holes of the anonymous memory, which can cause the guest to misbehave
    when the guest OS assumes the pages should be all zeros.
    
    QEMU worked around it by pre-populating the section with reads to fill in
    zero page entries before starting the whole snapshot process [1].
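
    As an illustration of that pre-read workaround, here is a minimal sketch.
    It is not QEMU's actual code, and the helper name pre_read_range() is
    hypothetical.  It reads one byte per page so that each none pte in the
    range gets faulted in:

```c
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: read one byte from every page-sized step in
 * [start, start + len) so that unpopulated ptes are faulted in (as zero
 * page entries for untouched anonymous memory).  Returns the sum of the
 * bytes read; the volatile access keeps the compiler from dropping the
 * loads.
 */
static unsigned long pre_read_range(const volatile unsigned char *start,
                                    size_t len)
{
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    unsigned long sum = 0;

    for (size_t off = 0; off < len; off += page_size)
        sum += start[off];

    return sum;
}
```

    One read fault per page is what makes this approach expensive on large
    ranges.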
    
    Recently another need was raised: using userfaultfd wr-protect for
    detecting dirty pages (to replace soft-dirty in some cases) [2].  In that
    case, without being able to wr-protect none ptes by default, the dirty
    info can get lost, since we cannot treat every none pte as dirty (the
    current design identifies a page as dirty based on the uffd-wp bit being
    cleared).
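
    A sketch of such a dirty check follows.  It assumes the uffd-wp state is
    read from /proc/<pid>/pagemap, where bit 57 reports a uffd-wp
    write-protected pte; the macro and helper names are hypothetical, and the
    interface discussed in [2] may expose this differently:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: bit 57 of a pagemap entry means "pte is uffd-wp protected". */
#define PAGEMAP_UFFD_WP_BIT (1ULL << 57)

/*
 * Hypothetical helper: under the "dirty == uffd-wp bit cleared" design, a
 * page counts as dirty when its pagemap entry no longer has the bit set.
 */
static bool uffd_wp_page_is_dirty(uint64_t pagemap_entry)
{
    return (pagemap_entry & PAGEMAP_UFFD_WP_BIT) == 0;
}
```

    A none pte that carries no marker has the bit clear, so under this rule
    it cannot be told apart from a page that was really written.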
    
    In general, we want to be able to wr-protect empty ptes too even for
    anonymous.
    
    This patch implements UFFD_FEATURE_WP_UNPOPULATED so that uffd-wp
    handling of none ptes is consistent no matter what the underlying memory
    type is.  It doesn't have any impact on file memory so far because we
    already have pte markers taking care of that.  So it only affects
    anonymous memory.
    
    The feature bit is off by default, so the old behavior is maintained.
    The old behavior may sometimes still be wanted, because wr-protecting
    none ptes adds overhead not only during UFFDIO_WRITEPROTECT (by applying
    pte markers to anonymous memory), but also when creating the pgtables to
    store the pte markers.  So there's potentially less chance of using thp
    on the first fault for a none pmd or larger than a pmd.
    
    The major implementation part is teaching the whole kernel to understand
    pte markers even for anonymously mapped ranges, meanwhile allowing the
    UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the
    new feature bit is set.
    
    Note that even though the patch subject starts with mm/uffd, there are a
    few small refactors to the major mm path that handles anonymous page
    faults.  But they should be straightforward.
    
    With WP_UNPOPULATED, applications like QEMU can avoid pre-read faulting
    all the memory before wr-protecting it when taking a live snapshot.
    Quoting Muhammad's test results here [3], based on a simple program [4]:
    
      (1) With huge page disabled
      echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
      ./uffd_wp_perf
      Test DEFAULT: 4
      Test PRE-READ: 1111453 (pre-fault 1101011)
      Test MADVISE: 278276 (pre-fault 266378)
      Test WP-UNPOPULATE: 11712
    
      (2) With Huge page enabled
      echo always > /sys/kernel/mm/transparent_hugepage/enabled
      ./uffd_wp_perf
      Test DEFAULT: 4
      Test PRE-READ: 22521 (pre-fault 22348)
      Test MADVISE: 4909 (pre-fault 4743)
      Test WP-UNPOPULATE: 14448
    
    There's a great perf boost for the no-thp case (about 95x faster than
    PRE-READ and about 24x faster than MADVISE in the numbers above).  With
    thp enabled, in the extreme case of all-thp-zero, WP_UNPOPULATED can be
    slower than MADVISE, but that is unlikely in reality.  Also, the MADVISE
    overhead is not reduced but postponed until a follow-up write to any huge
    zero thp, so it is potentially faster only by making the follow-up writes
    slower.
    
    [1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
    [2] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
    [3] https://lore.kernel.org/all/d0eb0a13-16dc-1ac1-653a-78b7273781e3@collabora.com/
    [4] https://github.com/xzpeter/clibs/blob/master/uffd-test/uffd-wp-perf.c
    
    [peterx@redhat.com: comment changes, oneliner fix to khugepaged]
      Link: https://lkml.kernel.org/r/ZB2/8jPhD3fpx5U8@x1n
    Link: https://lkml.kernel.org/r/20230309223711.823547-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20230309223711.823547-2-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Paul Gofman <pgofman@codeweavers.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>