1. 12 Jul, 2024 14 commits
    • fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps · ed5d583a
      Andrii Nakryiko authored
      /proc/<pid>/maps file is extremely useful in practice for various tasks
      involving figuring out process memory layout, what files are backing any
      given memory range, etc.  One important class of applications that
      absolutely rely on this are profilers/stack symbolizers (perf tool being
      one of them).  Patterns of use differ, but they generally would fall into
      two categories.
      
      In on-demand pattern, a profiler/symbolizer would normally capture stack
      trace containing absolute memory addresses of some functions, and would
      then use /proc/<pid>/maps file to find corresponding backing ELF files
      (normally, only executable VMAs are of interest), file offsets within
      them, and then continue from there to get yet more information (ELF
      symbols, DWARF information) to get human-readable symbolic information. 
      This pattern is used by Meta's fleet-wide profiler, as one example.
      
      In preprocessing pattern, application doesn't know the set of addresses of
      interest, so it has to fetch all relevant VMAs (again, probably only
      executable ones), store or cache them, then proceed with profiling and
      stack trace capture.  Once done, it would do symbolization based on stored
      VMA information.  This can happen at a much later point in time.  This
      pattern is used by the perf tool, as an example.
      
      In either case, there are both performance and correctness requirements
      involved.  This address to VMA information translation has to be done as
      efficiently as possible, but also not miss any VMA (especially in the case
      of loading/unloading shared libraries).  In practice, correctness can't be
      guaranteed (due to process dying before VMA data can be captured, or
      shared library being unloaded, etc), but any effort to maximize the chance
      of finding the VMA is appreciated.
      
      Unfortunately, for all the /proc/<pid>/maps file universality and
      usefulness, it doesn't fit the above use cases 100%.
      
      First, its main purpose is to emit all VMAs sequentially, but in practice
      captured addresses would fall only into a smaller subset of all process'
      VMAs, mainly containing executable text.  Yet, a library would need to
      parse most or all of the contents to find needed VMAs, as there is no way
      to skip VMAs that are of no use.  A well-written library can do this
      linear pass relatively efficiently, but it's definitely an overhead that
      could be avoided if there was a way to do more targeted querying of the
      relevant VMA information.
      
      Second, it's a text based interface, which makes its programmatic use from
      applications and libraries more cumbersome and inefficient due to the need
      to handle text parsing to get necessary pieces of information.  The
      overhead is actually paid both by kernel, formatting originally binary
      VMA data into text, and then by user space application, parsing it back
      into binary data for further use.
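
      To make the user space half of that overhead concrete, here is a minimal
      sketch of parsing a single /proc/<pid>/maps line back into binary
      fields.  The struct maps_entry type and the parse_maps_line() helper are
      hypothetical names used only for illustration; real symbolizers use
      their own, more optimized parsers.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields of one /proc/<pid>/maps line. */
struct maps_entry {
	uint64_t start, end, offset;
	char perms[5];			/* e.g. "r-xp" plus NUL */
	unsigned int dev_major, dev_minor;
	unsigned long long inode;
	char path[256];			/* empty for anonymous mappings */
};

/*
 * Parse one line such as
 *   "7f1c2a000000-7f1c2a021000 r-xp 00001000 08:01 1234   /usr/lib/libc.so.6"
 * Returns 0 on success, -1 if the line does not match the expected layout.
 */
int parse_maps_line(const char *line, struct maps_entry *e)
{
	int n;

	assert(line && e);
	e->path[0] = '\0';
	n = sscanf(line,
		   "%" SCNx64 "-%" SCNx64 " %4s %" SCNx64 " %x:%x %llu %255[^\n]",
		   &e->start, &e->end, e->perms, &e->offset,
		   &e->dev_major, &e->dev_minor, &e->inode, e->path);
	/* The path is optional, so seven converted fields are enough. */
	return n >= 7 ? 0 : -1;
}
```

      Every field here was binary inside the kernel, was formatted to text,
      and is converted back by sscanf(), once per VMA, whether or not the VMA
      is of any interest to the caller.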
      
      For the on-demand pattern of usage, described above, another problem when
      writing a generic stack trace symbolization library is an unfortunate
      performance-vs-correctness tradeoff that needs to be made.  The library
      has to decide either to cache parsed contents of /proc/<pid>/maps (after
      initial processing) to service future requests (if the application
      requests to symbolize another set of addresses for the same process,
      captured at some later time, which is typical for periodic/continuous
      profiling cases) and so avoid the higher cost of re-parsing this file, or
      to re-open and re-parse the file on every request to get up-to-date data.
      In the former case, more memory is used for the cache and there is a risk
      of getting stale data if the application loads or unloads shared
      libraries, or otherwise changes its set of VMAs somehow, e.g., through
      additional mmap() calls.  In the latter case, it's the performance hit
      that comes from re-opening the file and re-parsing its contents all over
      again.
      
      This patch aims to solve this problem by providing a new API built on top
      of /proc/<pid>/maps.  It's meant to address both non-selectiveness and
      text nature of /proc/<pid>/maps, by giving user more control of what sort
      of VMA(s) needs to be queried.  Being a binary interface, it also eliminates
      the overhead of text formatting (on kernel side) and parsing (on user
      space side).
      
      It's also designed to be extensible and forward/backward compatible by
      including required struct size field, which user has to provide.  We use
      established copy_struct_from_user() approach to handle extensibility.
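
      The size negotiation rule can be modeled in user space as follows.  This
      is only a sketch of the semantics as commonly described for
      copy_struct_from_user(), not the kernel helper itself; the
      copy_struct_model() name is made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * dst/ksize: the struct as known to the (modeled) kernel.
 * src/usize: the struct as passed by the caller, with usize taken from
 *            the caller-provided size field.
 * Returns 0 on success or -E2BIG if a newer caller set a field that this
 * "kernel" does not know about.
 */
int copy_struct_model(void *dst, size_t ksize, const void *src, size_t usize)
{
	size_t n = usize < ksize ? usize : ksize;
	const unsigned char *p = src;
	size_t i;

	assert(dst && src);

	/* Newer caller, older kernel: unknown trailing bytes must be zero. */
	for (i = ksize; i < usize; i++)
		if (p[i])
			return -E2BIG;

	memcpy(dst, src, n);

	/* Older caller, newer kernel: missing trailing fields read as zero. */
	if (ksize > n)
		memset((char *)dst + n, 0, ksize - n);
	return 0;
}
```

      With this rule an old binary keeps working on a new kernel, and a new
      binary works on an old kernel as long as it leaves the newer fields
      zeroed.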
      
      User has a choice to pick either getting VMA that covers provided address
      or -ENOENT if none is found (exact, least surprising, case).  Or, with an
      extra query flag (PROCMAP_QUERY_COVERING_OR_NEXT_VMA), they can get either
      VMA that covers the address (if there is one), or the closest next VMA
      (i.e., VMA with the smallest vm_start > addr).  The latter allows more
      efficient use, but, given it could be a surprising behavior, requires an
      explicit opt-in.
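
      A minimal user space sketch of the exact-match query is shown below.
      It assumes the uapi definitions (struct procmap_query and the
      PROCMAP_QUERY ioctl command) are provided by <linux/fs.h> on a
      sufficiently recent system; the field names used are taken from that
      header as I understand it, and the query_covering_vma() helper name is
      hypothetical.  On older headers the sketch compiles to a stub.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

/*
 * Look up the VMA covering addr in the calling process.
 * Returns 0 and fills *start/*end on success, or a negative errno
 * (-ENOENT if no VMA covers addr, -ENOSYS if the headers lack the API).
 */
int query_covering_vma(uint64_t addr, uint64_t *start, uint64_t *end)
{
#ifdef PROCMAP_QUERY
	struct procmap_query q;
	int fd, err = 0;

	assert(start && end);

	fd = open("/proc/self/maps", O_RDONLY);
	if (fd < 0)
		return -errno;

	memset(&q, 0, sizeof(q));
	q.size = sizeof(q);	/* required struct size field */
	q.query_flags = 0;	/* exact match: covering VMA or -ENOENT */
	q.query_addr = addr;
	/* vma_name_size and build_id_size stay zero: nothing extra fetched */

	if (ioctl(fd, PROCMAP_QUERY, &q) < 0) {
		err = -errno;
	} else {
		*start = q.vma_start;
		*end = q.vma_end;
	}
	close(fd);
	return err;
#else
	(void)addr; (void)start; (void)end;
	return -ENOSYS;
#endif
}
```

      Setting PROCMAP_QUERY_COVERING_OR_NEXT_VMA in query_flags would turn
      the -ENOENT case into "return the next VMA above addr", so a caller
      opting in has to compare vma_start against the queried address itself.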
      
      There is another query flag that is useful for some use cases. 
      PROCMAP_QUERY_FILE_BACKED_VMA instructs this API to only return
      file-backed VMAs.  Combining this with PROCMAP_QUERY_COVERING_OR_NEXT_VMA
      makes it possible to efficiently iterate only file-backed VMAs of the
      process, which is what profilers/symbolizers are normally interested in.
      
      All the above querying flags can be combined with an (also optional) set
      of desired VMA permission flags.  This makes it possible to, for example,
      iterate only the executable subset of VMAs, which is what the
      preprocessing pattern, used by perf tool, would benefit from, as the
      assumption is that captured stack traces would have addresses of
      executable code.  This saves time by efficiently skipping non-executable
      VMAs altogether.
      
      All these querying flags (modifiers) are orthogonal and can be combined in
      a semantically meaningful and natural way.
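
      As an illustration of combining the modifiers, the sketch below walks
      only file-backed executable VMAs of the calling process.  It assumes
      <linux/fs.h> provides the PROCMAP_QUERY definitions, including a
      PROCMAP_QUERY_VMA_EXECUTABLE permission flag (that flag name comes from
      the uapi header as I understand it, not from this commit message);
      count_exec_file_vmas() is a hypothetical helper.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

/*
 * Count file-backed executable VMAs of the calling process.
 * Returns the count, or a negative errno on failure
 * (-ENOSYS if the headers lack the API).
 */
long count_exec_file_vmas(void)
{
#ifdef PROCMAP_QUERY
	struct procmap_query q;
	long count = 0;
	int fd;

	fd = open("/proc/self/maps", O_RDONLY);
	if (fd < 0)
		return -errno;

	memset(&q, 0, sizeof(q));
	q.size = sizeof(q);
	q.query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA |
			PROCMAP_QUERY_FILE_BACKED_VMA |
			PROCMAP_QUERY_VMA_EXECUTABLE;
	q.query_addr = 0;	/* start from the lowest address */

	for (;;) {
		if (ioctl(fd, PROCMAP_QUERY, &q) < 0) {
			/* -ENOENT means no more matching VMAs */
			if (errno != ENOENT)
				count = -errno;
			break;
		}
		count++;
		/* Continue right after the VMA that was just returned. */
		q.query_addr = q.vma_end;
	}
	close(fd);
	return count;
#else
	return -ENOSYS;
#endif
}
```

      Each ioctl() call skips all non-matching VMAs inside the kernel, so the
      loop body runs once per VMA of interest rather than once per VMA.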
      
      Basing this ioctl()-based API on top of /proc/<pid>/maps's FD makes sense
      given it's querying the same set of VMA data.  It's also beneficial
      because permission checks for /proc/<pid>/maps are performed once, at
      open time, and the actual data read of text contents of /proc/<pid>/maps
      is done without further permission checks.  We piggyback on this pattern
      with ioctl()-based API as well, as that's a desired property, both for
      performance reasons and for security and flexibility reasons.
      
      Allowing application to open an FD for /proc/self/maps without any extra
      capabilities, and then passing it to some sort of profiling agent through
      Unix-domain socket, would allow such profiling agent to not require some
      of the capabilities that are otherwise expected when opening
      /proc/<pid>/maps file for *another* process.  This is a desirable property
      for some more restricted setups.
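
      The FD hand-off described above relies on standard SCM_RIGHTS ancillary
      messages over a Unix-domain socket.  Below is a minimal sketch of both
      sides; send_fd() and recv_fd() are hypothetical helper names, and a real
      agent would add its own framing and error reporting.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_ctrl {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send fd over a connected Unix-domain socket.  Returns 0 or -1. */
int send_fd(int sock, int fd)
{
	char byte = 0;	/* at least one byte of regular data is sent along */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctrl u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from the socket.  Returns the new fd or -1. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctrl u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

      In the scenario above, the profiled application would open
      /proc/self/maps and call send_fd(); the agent would call recv_fd() and
      issue its queries on the received descriptor.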
      
      This new ioctl-based implementation doesn't interfere with seq_file-based
      implementation of /proc/<pid>/maps textual interface, and so could be used
      together or independently without paying any price for that.
      
      Note also, that fetching VMA name (e.g., backing file path, or special
      hard-coded or user-provided names) is optional just like build ID.  If
      user sets vma_name_size to zero, kernel code won't attempt to retrieve it,
      saving resources.
      
      Earlier versions of this patch set were adding per-VMA locking, which is
      why we have a code structure that is ready for abstracting mmap_lock vs
      vm_lock differences (query_vma_setup(), query_vma_teardown(), and
      query_vma_find_by_addr()), but given anon_vma_name() is not yet compatible
      with per-VMA locking, initial implementation sticks to using only
      mmap_lock for now.  It will be easy to add back per-VMA locking once all
      the pieces are ready later on, which is why we keep existing code
      structure with setup/teardown/query helper functions.
      
      [andrii@kernel.org: improve PROCMAP_QUERY's compat mode handling]
        Link: https://lkml.kernel.org/r/20240701174805.1897344-2-andrii@kernel.org
      Link: https://lkml.kernel.org/r/20240627170900.1672542-3-andrii@kernel.org
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • fs/procfs: extract logic for getting VMA name constituents · acd4b2ec
      Andrii Nakryiko authored
      Patch series "ioctl()-based API to query VMAs from /proc/<pid>/maps", v6.
      
      Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
      applications to query VMA information more efficiently than reading *all*
      VMAs nonselectively through text-based interface of /proc/<pid>/maps file.
      
      Patch #2 goes into a lot of details and background on some common patterns
      of using /proc/<pid>/maps in the area of performance profiling and
      subsequent symbolization of captured stack traces.  As mentioned in that
      patch, patterns of VMA querying can differ depending on specific use case,
      but can generally be grouped into two main categories: the need to query a
      small subset of VMAs covering a given batch of addresses, or
      reading/storing/caching all (typically, executable) VMAs upfront for later
      processing.
      
      The new PROCMAP_QUERY ioctl() API added in this patch set was motivated by
      the former pattern of usage.  Earlier revisions had a patch adding a tool
      that faithfully reproduces an efficient VMA matching pass of a symbolizer,
      collecting a subset of covering VMAs for a given set of addresses as
      efficiently as possible.  This tool served both as a testing ground, as
      well as a benchmarking tool.  It implements everything both for currently
      existing text-based /proc/<pid>/maps interface, as well as for newly-added
      PROCMAP_QUERY ioctl().  This revision dropped the tool from the patch set
      and, once the API lands upstream, this tool might be added separately on
      GitHub as an example.
      
      Based on discussion on earlier revisions of this patch set, it turned out
      that this ioctl() API is competitive with highly-optimized text-based
      pre-processing pattern that perf tool is using.  Based on perf discussion,
      this revision adds more flexibility in specifying a subset of VMAs that
      are of interest.  Now it's possible to specify desired permissions of VMAs
      (e.g., request only executable ones) and/or restrict to only a subset of
      VMAs that have file backing.  This further improves the efficiency when
      using this new API thanks to more selective (executable VMAs only)
      querying.
      
      In addition to a custom benchmarking tool, and experimental perf
      integration (available at [0]), Daniel Mueller has since also implemented
      an experimental integration into blazesym (see [1]), a library used for
      stack trace symbolization by our server fleet-wide profiler and another
      on-device profiler agent that runs on weaker ARM devices.  The latter
      ARM-based device profiler is especially sensitive to performance, and so
      we benchmarked and compared text-based /proc/<pid>/maps solution to the
      equivalent one using PROCMAP_QUERY ioctl().
      
      Results are very encouraging, giving us 5x improvement for end-to-end
      so-called "address normalization" pass, which is the part of the
      symbolization process that happens locally on ARM device, before being
      sent out for further heavier-weight processing on more powerful remote
      server.  Note that this is not an artificial microbenchmark.  It's a full
      end-to-end API call being measured with real-world data on real-world
      device.
      
        TEXT-BASED
        ==========
        Benchmarking main/normalize_process_no_build_ids_uncached_maps
        main/normalize_process_no_build_ids_uncached_maps
      	  time:   [49.777 µs 49.982 µs 50.250 µs]
      
        IOCTL-BASED
        ===========
        Benchmarking main/normalize_process_no_build_ids_uncached_maps
        main/normalize_process_no_build_ids_uncached_maps
      	  time:   [10.328 µs 10.391 µs 10.457 µs]
      	  change: [−79.453% −79.304% −79.166%] (p = 0.00 < 0.02)
      	  Performance has improved.
      
      The numbers above show a drop from about 50µs down to about 10µs for
      exactly the same amount of work, with the same data and target process.
      
      With the aforementioned custom tool, we see about 40x improvement (it
      might vary a bit, depending on a specific captured set of addresses).  And
      even for perf-based benchmark it's on par or slightly ahead when using
      permission-based filtering (fetching only executable VMAs).
      
      Earlier revisions attempted to use per-VMA locking, if kernel was compiled
      with CONFIG_PER_VMA_LOCK=y, but it turned out that anon_vma_name() is not
      yet compatible with per-VMA locking and assumes mmap_lock to be taken,
      which makes the use of per-VMA locking for this API premature.  It was
      agreed ([2]) to continue for now with just mmap_lock, but the code
      structure is such that it should be easy to add per-VMA locking support
      once all the pieces are ready.
      
      One thing that did not change was basing this new API as an ioctl()
      command on /proc/<pid>/maps file.  An ioctl-based API on top of pidfd was
      considered, but has its own downsides.  Implementing ioctl() directly on
      pidfd will cause access permission checks on every single ioctl(), which
      leads to performance concerns and potential spam of capable() audit
      messages.  It also prevents a nice pattern, possible with
      /proc/<pid>/maps, in which application opens /proc/self/maps FD (requiring
      no additional capabilities) and passes this FD to profiling agent for
      querying.  To achieve similar pattern, a new file would have to be created
      from pidfd just for VMA querying, which is considered to be inferior to
      just querying /proc/<pid>/maps FD as proposed in current approach.  These
      aspects were discussed in the hallway track at recent LSF/MM/BPF 2024 and
      sticking to procfs ioctl() was the final agreement we arrived at.
      
        [0] https://github.com/anakryiko/linux/commits/procfs-proc-maps-ioctl-v2/
        [1] https://github.com/libbpf/blazesym/pull/675
        [2] https://lore.kernel.org/bpf/7rm3izyq2vjp5evdjc7c6z4crdd3oerpiknumdnmmemwyiwx7t@hleldw7iozi3/
      
      
      This patch (of 6):
      
      Extract generic logic to fetch relevant pieces of data to describe VMA
      name.  This could be just some string (either special constant or
      user-provided), or a string with some formatted wrapping text (e.g.,
      "[anon_shmem:<something>]"), or, commonly, file path.  seq_file-based
      logic has different methods to handle all three cases, but they are
      currently mixed in with extracting underlying sources of data.
      
      This patch splits this into data fetching and data formatting, so that
      data fetching can be reused later on.
      
      There should be no functional changes.
      
      Link: https://lkml.kernel.org/r/20240627170900.1672542-1-andrii@kernel.org
      Link: https://lkml.kernel.org/r/20240627170900.1672542-2-andrii@kernel.org
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/udmabuf: add tests to verify data after page migration · 8d42e2a9
      Vivek Kasireddy authored
      Since the memfd pages associated with a udmabuf may be migrated as part of
      udmabuf create, we need to verify the data coherency after successful
      migration.  The new tests added in this patch try to do just that using 4k
      sized pages and also 2 MB sized huge pages for the memfd.
      
      Successful completion of the tests would mean that there is no disconnect
      between the memfd pages and the ones associated with a udmabuf.  And,
      these tests can also be augmented in the future to test newer udmabuf
      features (such as handling memfd hole punch).
      
      The idea for these tests comes from a patch by Mike Kravetz here:
      https://lists.freedesktop.org/archives/dri-devel/2023-June/410623.html
      
      v1->v2: (suggestions from Shuah)
      - Use ksft_* functions to print and capture results of tests
      - Use appropriate KSFT_* status codes for exit()
      - Add Mike Kravetz's suggested-by tag
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-10-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • udmabuf: pin the pages using memfd_pin_folios() API · c6a3194c
      Vivek Kasireddy authored
      Using memfd_pin_folios() will ensure that the pages are pinned
      correctly using FOLL_PIN. And, this also ensures that we don't
      accidentally break features such as memory hotunplug as it would
      not allow pinning pages in the movable zone.
      
      Using this new API also simplifies the code as we no longer have
      to deal with extracting individual pages from their mappings or
      handle shmem and hugetlb cases separately.
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-9-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • udmabuf: convert udmabuf driver to use folios · 5e72b2b4
      Vivek Kasireddy authored
      This is mainly a preparatory patch to use memfd_pin_folios() API for
      pinning folios.  Using folios instead of pages makes sense as the udmabuf
      driver needs to handle both shmem and hugetlb cases.  And, using the
      memfd_pin_folios() API makes this easier as we no longer need to
      separately handle shmem vs hugetlb cases in the udmabuf driver.
      
      Note that the function vmap_udmabuf() still needs a list of pages, so we
      collect all the head pages into a local array in this case.
      
      Other changes in this patch include the addition of helpers for checking
      the memfd seals and exporting dmabuf.  Moving code from udmabuf_create()
      into these helpers improves readability given that udmabuf_create() is a
      bit long.
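
      From the user space side, the seal requirements can be illustrated with
      the sketch below.  The exact rule checked here (F_SEAL_SHRINK must be
      set and F_SEAL_WRITE must not be) is my assumption about what the driver
      helper enforces, and make_sealed_memfd() and seals_look_acceptable() are
      hypothetical names; this is not the driver code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a memfd of the given size and seal it against shrinking. */
int make_sealed_memfd(off_t size)
{
	int fd = memfd_create("udmabuf-example", MFD_ALLOW_SEALING);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, size) < 0 ||
	    fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Assumed rule: the memfd must not be able to shrink under the driver,
 * and must still be writable.  Returns 1 if acceptable, 0 otherwise.
 */
int seals_look_acceptable(int fd)
{
	int seals = fcntl(fd, F_GET_SEALS);

	if (seals < 0)
		return 0;
	return (seals & F_SEAL_SHRINK) && !(seals & F_SEAL_WRITE);
}
```

      A memfd prepared this way is the kind of file descriptor an application
      would then hand to the udmabuf device when creating a dma-buf.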
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-8-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • udmabuf: add back support for mapping hugetlb pages · 0c8b91ef
      Vivek Kasireddy authored
      A user or admin can configure a VMM (Qemu) Guest's memory to be backed by
      hugetlb pages for various reasons.  However, a Guest OS would still
      allocate (and pin) buffers that are backed by regular 4k sized pages.  In
      order to map these buffers and create dma-bufs for them on the Host, we
      first need to find the hugetlb pages where the buffer allocations are
      located and then determine the offsets of individual chunks (within those
      pages) and use this information to eventually populate a scatterlist.
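
      The offset arithmetic involved can be sketched as follows.  This is
      only an illustration of the idea, assuming 2 MB huge pages as in the
      test case below; struct chunk_loc and locate_chunk() are hypothetical
      names and do not come from the driver.

```c
#include <assert.h>
#include <stdint.h>

#define HPAGE_SIZE	(2ULL << 20)	/* 2 MB huge page (assumption) */
#define SUBPAGE_SIZE	4096ULL		/* regular 4k page */

/* Where a 4k chunk of a buffer lives inside the hugetlb-backed memfd. */
struct chunk_loc {
	uint64_t hpage_index;		/* which huge page holds the chunk */
	uint64_t offset_in_hpage;	/* byte offset within that huge page */
};

/* Map a byte offset within the memfd to (huge page, offset within it). */
struct chunk_loc locate_chunk(uint64_t memfd_offset)
{
	struct chunk_loc loc;

	assert(memfd_offset % SUBPAGE_SIZE == 0);
	loc.hpage_index = memfd_offset / HPAGE_SIZE;
	loc.offset_in_hpage = memfd_offset % HPAGE_SIZE;
	return loc;
}
```

      For example, a chunk at memfd offset 2 * HPAGE_SIZE + 3 * SUBPAGE_SIZE
      lands in huge page 2 at byte offset 12288, and that pair is the kind of
      information a scatterlist entry would be built from.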
      
      Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500 options
      were passed to the Host kernel and Qemu was launched with these
      relevant options: qemu-system-x86_64 -m 4096m....
      -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
      -display gtk,gl=on
      -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
      -machine memory-backend=mem1
      
      Replacing -display gtk,gl=on with -display gtk,gl=off above would
      exercise the mmap handler.
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-7-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com> (v2)
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • udmabuf: use vmf_insert_pfn and VM_PFNMAP for handling mmap · 7d79cd78
      Vivek Kasireddy authored
      Add VM_PFNMAP to vm_flags in the mmap handler to ensure that the mappings
      would be managed without using struct page.
      
      And, in the vm_fault handler, use vmf_insert_pfn to share the page's pfn
      to userspace instead of directly sharing the page (via struct page *).
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-6-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • udmabuf: add CONFIG_MMU dependency · 725553d2
      Arnd Bergmann authored
      There is no !CONFIG_MMU version of vmf_insert_pfn():
      
      arm-linux-gnueabi-ld: drivers/dma-buf/udmabuf.o: in function `udmabuf_vm_fault':
      udmabuf.c:(.text+0xaa): undefined reference to `vmf_insert_pfn'
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-5-vivek.kasireddy@intel.com
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dave Airlie <airlied@redhat.com>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: introduce memfd_pin_folios() for pinning memfd folios · 89c1905d
      Vivek Kasireddy authored
      For drivers that would like to longterm-pin the folios associated with a
      memfd, the memfd_pin_folios() API provides an option to not only pin the
      folios via FOLL_PIN but also to check and migrate them if they reside in
      movable zone or CMA block.  This API currently works with memfds but it
      should work with any files that belong to either shmemfs or hugetlbfs. 
      Files belonging to other filesystems are rejected for now.
      
      The folios need to be located first before pinning them via FOLL_PIN.  If
      they are found in the page cache, they can be immediately pinned. 
      Otherwise, they need to be allocated using the filesystem specific APIs
      and then pinned.
      
      [akpm@linux-foundation.org: improve the CONFIG_MMU=n situation, per SeongJae]
      [vivek.kasireddy@intel.com: return -EINVAL if the end offset is greater than the size of memfd]
        Link: https://lkml.kernel.org/r/IA0PR11MB71850525CBC7D541CAB45DF1F8DB2@IA0PR11MB7185.namprd11.prod.outlook.com
      Link: https://lkml.kernel.org/r/20240624063952.1572359-4-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> (v2)
      Reviewed-by: David Hildenbrand <david@redhat.com> (v3)
      Reviewed-by: Christoph Hellwig <hch@lst.de> (v6)
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: introduce check_and_migrate_movable_folios() · 53ba78de
      Vivek Kasireddy authored
      This helper is the folio equivalent of check_and_migrate_movable_pages().
      Therefore, all the rules that apply to check_and_migrate_movable_pages()
      also apply to this one.  Currently, this helper is only used by
      memfd_pin_folios().
      
      This patch also includes changes to rename and convert the internal
      functions collect_longterm_unpinnable_pages() and
      migrate_longterm_unpinnable_pages() to work on folios.  As a result,
      check_and_migrate_movable_pages() is now a wrapper around
      check_and_migrate_movable_folios().
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-3-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Vivek Kasireddy's avatar
      mm/gup: introduce unpin_folio/unpin_folios helpers · 6cc04054
      Vivek Kasireddy authored
      Patch series "mm/gup: Introduce memfd_pin_folios() for pinning memfd
      folios", v16.
      
      Currently, some drivers (e.g., Udmabuf) that want to longterm-pin the
      pages/folios associated with a memfd do so by simply taking a reference
      on them.  This is not desirable because the pages/folios may reside in
      the Movable zone or a CMA block.
      
      Therefore, having drivers use the memfd_pin_folios() API ensures that the
      folios are appropriately pinned via FOLL_PIN for longterm DMA.
      
      This patchset also introduces a few helpers and converts the Udmabuf
      driver to use folios and the memfd_pin_folios() API to longterm-pin the
      folios for DMA.  Two new Udmabuf selftests are also included to test the
      driver and the new API.
      
      
      This patch (of 9):
      
      These helpers are the folio versions of unpin_user_page/unpin_user_pages. 
      They are currently only useful for unpinning folios pinned by
      memfd_pin_folios() or other associated routines.  However, they could find
      new uses in the future, when more and more folio-only helpers are added to
      GUP.
      
      We should probably sanity check the folio as part of unpin similar to how
      it is done in unpin_user_page/unpin_user_pages but we cannot cleanly do
      that at the moment without also checking the subpage.  Therefore, sanity
      checking needs to be added to these routines once we have a way to
      determine if any given folio is anon-exclusive (via a per folio
      AnonExclusive flag).
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-1-vivek.kasireddy@intel.com
      Link: https://lkml.kernel.org/r/20240624063952.1572359-2-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Chengming Zhou's avatar
      mm/zswap: use only one pool in zswap · 8edc9c4e
      Chengming Zhou authored
      Zswap uses 32 pools to work around the locking scalability problem in zswap
      backends (mainly zsmalloc nowadays), which brings its own problems like
      memory waste and more memory fragmentation.
      
      Testing results show that we can get nearly the same performance with only
      one pool in zswap after changing zsmalloc to use per-size_class lock
      instead of pool spinlock.
      
      Testing kernel build (make bzImage -j32) on tmpfs with memory.max=1GB, and
      zswap shrinker enabled with 10GB swapfile on ext4.
      
                                      real    user    sys
      6.10.0-rc3                      138.18  1241.38 1452.73
      6.10.0-rc3-onepool              149.45  1240.45 1844.69
      6.10.0-rc3-onepool-perclass     138.23  1242.37 1469.71
      
      The same test using zbud shows slightly worse performance, as expected,
      since we don't do any locking optimization for zbud.  I think it's
      acceptable since zsmalloc has become a lot more popular than other
      backends, and we may want to support only zsmalloc in the future.
      
                                      real    user    sys
      6.10.0-rc3-zbud                 138.23  1239.58 1430.09
      6.10.0-rc3-onepool-zbud         139.64  1241.37 1516.59
      
      [chengming.zhou@linux.dev: fix error handling in zswap_pool_create(), per Dan Carpenter]
        Link: https://lkml.kernel.org/r/20240621-zsmalloc-lock-mm-everything-v2-2-d30e9cd2b793@linux.dev
      [chengming.zhou@linux.dev: fix error handling again in zswap_pool_create(), per Yosry]
        Link: https://lkml.kernel.org/r/20240625-zsmalloc-lock-mm-everything-v3-2-ad941699cb61@linux.dev
      Link: https://lkml.kernel.org/r/20240617-zsmalloc-lock-mm-everything-v1-2-5e5081ea11b3@linux.dev
      Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
      Reviewed-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Cc: Chengming Zhou <zhouchengming@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Chengming Zhou's avatar
      mm/zsmalloc: change back to per-size_class lock · 64bd0197
      Chengming Zhou authored
      Patch series "mm/zsmalloc: change back to per-size_class lock, v2".
      
      Commit c0547d0b ("zsmalloc: consolidate zs_pool's migrate_lock and
      size_class's locks") changed per-size_class lock to pool spinlock to
      prepare for reclaim support in zsmalloc.  Reclaim support in zsmalloc was
      later dropped in favor of LRU reclaim in zswap, but this locking change
      was left in place.
      
      Obviously, the scalability of pool spinlock is worse than per-size_class.
      Zswap works around this scalability problem by using 32 pools, which
      brings its own problems like memory waste and more memory fragmentation.
      
      So this series changes back to per-size_class lock and uses testing data
      from a heavily stressed situation to verify that we can use only one pool
      in zswap.  Note we only test and care about the zsmalloc backend, which
      makes sense now since zsmalloc has become a lot more popular than other
      backends.
      
      Testing kernel build (make bzImage -j32) on tmpfs with memory.max=1GB, and
      zswap shrinker enabled with 10GB swapfile on ext4.
      
                                      real    user    sys
      6.10.0-rc3                      138.18  1241.38 1452.73
      6.10.0-rc3-onepool              149.45  1240.45 1844.69
      6.10.0-rc3-onepool-perclass     138.23  1242.37 1469.71
      
      We can see from the "sys" column that per-size_class locking with only one
      pool in zswap performs close to the current 32 pools.
      
      
      This patch (of 2):
      
      This patch is almost a revert of commit c0547d0b ("zsmalloc:
      consolidate zs_pool's migrate_lock and size_class's locks"), which changed
      to use a global pool->lock instead of per-size_class lock and
      pool->migrate_lock in preparation for supporting reclaim in zsmalloc.
      Reclaim in zsmalloc was later dropped in favor of LRU reclaim in zswap.
      
      In theory, per-size_class lock is more fine-grained than pool->lock, since
      a pool can have many size_classes.  As for the additional
      pool->migrate_lock, only free() and map() need to grab it, and only in
      read lock mode, so that the handle stays stable while they look up the
      zspage.
      
      Link: https://lkml.kernel.org/r/20240625-zsmalloc-lock-mm-everything-v3-0-ad941699cb61@linux.dev
      Link: https://lkml.kernel.org/r/20240621-zsmalloc-lock-mm-everything-v2-0-d30e9cd2b793@linux.dev
      Link: https://lkml.kernel.org/r/20240617-zsmalloc-lock-mm-everything-v1-0-5e5081ea11b3@linux.dev
      Link: https://lkml.kernel.org/r/20240617-zsmalloc-lock-mm-everything-v1-1-5e5081ea11b3@linux.dev
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Chengming Zhou <chengming.zhou@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Andrew Morton's avatar
      mm/hugetlb.c: undo errant change · 998d4e2c
      Andrew Morton authored
      During conflict resolution a line was unintentionally removed by a ksm.c
      patch.
      
      Link: https://lkml.kernel.org/r/85b0d694-d1ac-8e7a-2e50-1edc03eee21a@google.com
      Fixes: ac90c56b ("mm/ksm: refactor out try_to_merge_with_zero_page()")
      Reported-by: Hugh Dickins <hughd@google.com>
      Cc: Aristeu Rozanski <aris@redhat.com>
      Cc: Chengming Zhou <chengming.zhou@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 10 Jul, 2024 20 commits
  3. 06 Jul, 2024 6 commits
    • Kefeng Wang's avatar
      mm: migrate: remove folio_migrate_copy() · 3f594937
      Kefeng Wang authored
      folio_migrate_copy() is just a wrapper around folio_copy() and
      folio_migrate_flags().  It is simple and only aio uses it for now, so
      unfold it and remove folio_migrate_copy().
      
      Link: https://lkml.kernel.org/r/20240626085328.608006-7-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Jane Chu <jane.chu@oracle.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Kefeng Wang's avatar
      fs: hugetlbfs: support poisoned recover from hugetlbfs_migrate_folio() · f00b295b
      Kefeng Wang authored
      This is similar to __migrate_folio(): use folio_mc_copy() in HugeTLB folio
      migration to avoid a panic when copying from a poisoned folio.
      
      Link: https://lkml.kernel.org/r/20240626085328.608006-6-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Kefeng Wang's avatar
      mm: migrate: support poisoned recover from migrate folio · 06091399
      Kefeng Wang authored
      Folio migration is widely used in the kernel (memory compaction, memory
      hotplug, soft offline page, NUMA balancing, memory demotion/promotion,
      etc.), but if a poisoned source folio is accessed while migrating, the
      kernel will panic.
      
      There is a mechanism in the kernel to recover from uncorrectable memory
      errors, ARCH_HAS_COPY_MC, which is already used in other core-mm paths,
      e.g., CoW, khugepaged, coredump, ksm copy; see copy_mc_to_{user,kernel}
      and copy_mc_{user_}highpage callers.
      
      To support recovery from a poisoned folio copy during folio migration, we
      chose to make folio migration tolerant of memory failures and return an
      error from folio migration.  Because folio migration is not guaranteed to
      succeed anyway, this avoids panics like the one shown below.
      
        CPU: 1 PID: 88343 Comm: test_softofflin Kdump: loaded Not tainted 6.6.0
        pc : copy_page+0x10/0xc0
        lr : copy_highpage+0x38/0x50
        ...
        Call trace:
         copy_page+0x10/0xc0
         folio_copy+0x78/0x90
         migrate_folio_extra+0x54/0xa0
         move_to_new_folio+0xd8/0x1f0
         migrate_folio_move+0xb8/0x300
         migrate_pages_batch+0x528/0x788
         migrate_pages_sync+0x8c/0x258
         migrate_pages+0x440/0x528
         soft_offline_in_use_page+0x2ec/0x3c0
         soft_offline_page+0x238/0x310
         soft_offline_page_store+0x6c/0xc0
         dev_attr_store+0x20/0x40
         sysfs_kf_write+0x4c/0x68
         kernfs_fop_write_iter+0x130/0x1c8
         new_sync_write+0xa4/0x138
         vfs_write+0x238/0x2d8
         ksys_write+0x74/0x110
      
      Note, the folio copy is moved to the beginning of __migrate_folio(), which
      simplifies the error handling since there is no turning back once
      folio_migrate_mapping() returns success.  The downside is that the folio
      is copied even if folio_migrate_mapping() then fails; as an optimization,
      check that the source folio does not have extra refs before doing the
      folio copy.
      
      Link: https://lkml.kernel.org/r/20240626085328.608006-5-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Kefeng Wang's avatar
      mm: migrate: split folio_migrate_mapping() · 52881539
      Kefeng Wang authored
      Move the folio refcount check out for both the !mapping and mapping folio
      cases, and update the comments from page to folio for
      folio_migrate_mapping().
      
      No functional change intended.
      
      Link: https://lkml.kernel.org/r/20240626085328.608006-4-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Kefeng Wang's avatar
      mm: add folio_mc_copy() · 02f4ee5a
      Kefeng Wang authored
      Add a #MC variant of folio_copy() which uses copy_mc_highpage() to support
      handling #MC during folio copy.  It will be used in folio migration soon.
      
      Link: https://lkml.kernel.org/r/20240626085328.608006-3-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Kefeng Wang's avatar
      mm: move memory_failure_queue() into copy_mc_[user]_highpage() · 28bdacbc
      Kefeng Wang authored
      Patch series "mm: migrate: support poison recover from migrate folio", v5.
      
      Folio migration is widely used in the kernel (memory compaction, memory
      hotplug, soft offline page, NUMA balancing, memory demotion/promotion,
      etc.), but if a poisoned source folio is accessed while migrating, the
      kernel will panic.
      
      There is a mechanism in the kernel to recover from uncorrectable memory
      errors, ARCH_HAS_COPY_MC (e.g., Machine Check Safe Memory Copy on x86),
      which is already used in NVDIMM or core-mm paths (e.g., CoW, khugepaged,
      coredump, ksm copy); see copy_mc_to_{user,kernel} and
      copy_mc_{user_}highpage callers.
      
      This series of patches provides a recovery mechanism for the folio copy in
      the widely used folio migration.  Please note, because folio migration is
      not guaranteed to succeed, we can choose to make folio migration tolerant
      of memory failures by adding folio_mc_copy(), a #MC version of
      folio_copy().  When a poisoned source folio is accessed, we can return an
      error and make the folio migration fail, which avoids panics like the one
      shown below.
      
        CPU: 1 PID: 88343 Comm: test_softofflin Kdump: loaded Not tainted 6.6.0
        pc : copy_page+0x10/0xc0
        lr : copy_highpage+0x38/0x50
        ...
        Call trace:
         copy_page+0x10/0xc0
         folio_copy+0x78/0x90
         migrate_folio_extra+0x54/0xa0
         move_to_new_folio+0xd8/0x1f0
         migrate_folio_move+0xb8/0x300
         migrate_pages_batch+0x528/0x788
         migrate_pages_sync+0x8c/0x258
         migrate_pages+0x440/0x528
         soft_offline_in_use_page+0x2ec/0x3c0
         soft_offline_page+0x238/0x310
         soft_offline_page_store+0x6c/0xc0
         dev_attr_store+0x20/0x40
         sysfs_kf_write+0x4c/0x68
         kernfs_fop_write_iter+0x130/0x1c8
         new_sync_write+0xa4/0x138
         vfs_write+0x238/0x2d8
         ksys_write+0x74/0x110
      
      
      This patch (of 5):
      
      There is a memory_failure_queue() call after copy_mc_[user]_highpage() in
      its callers (e.g., CoW/KSM page copy); it is used to mark the source page
      as h/w poisoned and unmap it from other tasks.  The upcoming poison
      recovery in folio migration will do a similar thing, so let's move
      memory_failure_queue() into copy_mc_[user]_highpage() instead of adding
      it to each user.  This should also enhance the handling of poisoned pages
      in khugepaged.
      
      Link: https://lkml.kernel.org/r/20240626085328.608006-1-wangkefeng.wang@huawei.com
      Link: https://lkml.kernel.org/r/20240626085328.608006-2-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>