1. 17 Sep, 2024 35 commits
    • uprobes: turn xol_area->pages[2] into xol_area->page · 2abbcc09
      Oleg Nesterov authored
      Now that xol_mapping has its own ->fault() method, we no longer need
      xol_area->pages[1] == NULL; a single page is enough.
      
      Link: https://lkml.kernel.org/r/20240911131437.GC3448@redhat.com
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • uprobes: introduce the global struct vm_special_mapping xol_mapping · 6d27a31e
      Oleg Nesterov authored
      Currently each xol_area has its own instance of vm_special_mapping, which
      is suboptimal and ugly.  Kill xol_area->xol_mapping and add a single
      global instance of vm_special_mapping; the ->fault() method can use
      area->pages rather than xol_mapping->pages.

      As a side effect this fixes the problem introduced by the recent commit
      223febc6 ("mm: add optional close() to struct vm_special_mapping"): if
      special_mapping_close() is called from the __mmput() paths, it will use
      vma->vm_private_data = &area->xol_mapping, which was already freed by
      uprobe_clear_state().
      
      Link: https://lkml.kernel.org/r/20240911131407.GB3448@redhat.com
      Fixes: 223febc6 ("mm: add optional close() to struct vm_special_mapping")
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Sven Schnelle <svens@linux.ibm.com>
      Closes: https://lore.kernel.org/all/yt9dy149vprr.fsf@linux.ibm.com/
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Revert "uprobes: use vm_special_mapping close() functionality" · ed8d5b0c
      Oleg Nesterov authored
      This reverts commit 08e28de1.
      
      A malicious application can munmap() its "[uprobes]" vma and in this case
      xol_mapping.close == uprobe_clear_state() will free the memory which can
      be used by another thread, or the same thread when it hits the uprobe bp
      afterwards.
      
      Link: https://lkml.kernel.org/r/20240911131320.GA3448@redhat.com
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: support large folios swap-in for sync io devices · 242d12c9
      Chuanhua Han authored
      Currently, we have mTHP features, but unfortunately, without support for
      large folio swap-ins, once these large folios are swapped out, they are
      lost because mTHP swap is a one-way process.  The lack of mTHP swap-in
      functionality prevents mTHP from being used on devices like Android that
      heavily rely on swap.
      
      This patch introduces mTHP swap-in support.  It starts from sync devices
      such as zRAM.  This is probably the simplest and most common use case,
      benefiting billions of Android phones and similar devices with minimal
      implementation cost.  In this straightforward scenario, large folios are
      always exclusive, eliminating the need to handle complex rmap and
      swapcache issues.
      
      It offers several benefits:
      1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
         swap-out and swap-in. Large folios in the buddy system are also
         preserved as much as possible, rather than being fragmented due
         to swap-in.
      
      2. Eliminates fragmentation in swap slots and supports successful
         THP_SWPOUT.
      
         w/o this patch (Refer to the data from Chris's and Kairui's latest
         swap allocator optimization while running ./thp_swap_allocator_test
         w/o "-a" option [1]):
      
         ./thp_swap_allocator_test
         Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
         Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53%
         Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58%
         Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34%
         Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51%
         Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84%
         Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91%
         Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05%
         Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25%
         Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
         Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01%
         Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45%
         Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98%
         Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64%
         Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36%
         Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02%
         Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07%
      
         w/ this patch (always 0%):
         Iteration 1: swpout inc: 948, swpout fallback inc: 0, Fallback percentage: 0.00%
         Iteration 2: swpout inc: 953, swpout fallback inc: 0, Fallback percentage: 0.00%
         Iteration 3: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
         Iteration 4: swpout inc: 952, swpout fallback inc: 0, Fallback percentage: 0.00%
         Iteration 5: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
         Iteration 6: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
         Iteration 7: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
         Iteration 8: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
         Iteration 9: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
         Iteration 10: swpout inc: 945, swpout fallback inc: 0, Fallback percentage: 0.00%
         Iteration 11: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
         ...
      
      3. With both mTHP swap-out and swap-in supported, we offer the option to enable
         zsmalloc compression/decompression with larger granularity[2]. The upcoming
         optimization in zsmalloc will significantly increase swap speed and improve
         compression efficiency. Tested by running 100 iterations of swapping 100MiB
         of anon memory, the swap speed improved dramatically:
                      time consumption of swapin(ms)   time consumption of swapout(ms)
           lz4 4k                  45274                    90540
           lz4 64k                 22942                    55667
           zstdn 4k                85035                    186585
           zstdn 64k               46558                    118533
      
          The compression ratio also improved, as evaluated with 1 GiB of data:
           granularity   orig_data_size   compr_data_size
           4KiB-zstd      1048576000       246876055
           64KiB-zstd     1048576000       199763892
      
         Without mTHP swap-in, the potential optimizations in zsmalloc cannot be
         realized.
      
      4. Even mTHP swap-in itself can reduce swap-in page faults by a factor
         of nr_pages. Swapping in content filled with the same data 0x11, w/o
         and w/ the patch for five rounds (Since the content is the same,
         decompression will be very fast. This primarily assesses the impact of
         reduced page faults):
      
        swp in bandwidth(bytes/ms)    w/o              w/
         round1                     624152          1127501
         round2                     631672          1127501
         round3                     620459          1139756
         round4                     606113          1139756
         round5                     624152          1152281
         avg                        621310          1137359      +83%
      
      5. With both mTHP swap-out and swap-in supported, we offer the option to enable
         hardware accelerators (Intel IAA) to do parallel decompression, with which
         Kanchana reported a 7x improvement in zRAM read latency [3].
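
The figures quoted in the list above can be rechecked with plain arithmetic. The C sketch below assumes that the fallback percentage is fallback / (swpout + fallback), which matches the printed iterations, and that "+83%" is the relative gain of the average swap-in bandwidth. The function names are illustrative and are not part of the patch.

```c
#include <assert.h>

/* Share of swap-out attempts that fell back to small pages, in percent. */
double fallback_percentage(unsigned long swpout, unsigned long fallback)
{
	unsigned long total = swpout + fallback;

	return total ? 100.0 * (double)fallback / (double)total : 0.0;
}

/* Relative gain of with_patch over without_patch, in percent. */
double relative_gain_percent(double without_patch, double with_patch)
{
	return 100.0 * (with_patch - without_patch) / without_patch;
}

/* Original size divided by compressed size. */
double compression_ratio(unsigned long long orig, unsigned long long compr)
{
	return (double)orig / (double)compr;
}
```

With the numbers from the tables, iteration 2 without the patch gives 101 / (131 + 101), roughly 43.53%, the average bandwidth gain is roughly 83%, and the zstd compression ratio moves from about 4.25 at 4KiB to about 5.25 at 64KiB.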
      
      [1] https://lore.kernel.org/all/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
      [2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
      [3] https://lore.kernel.org/all/cover.1714581792.git.andre.glover@linux.intel.com/
      
      Link: https://lkml.kernel.org/r/20240908232119.2157-4-21cnbao@gmail.com
      Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
      Co-developed-by: Barry Song <v-songbaohua@oppo.com>
      Signed-off-by: Barry Song <v-songbaohua@oppo.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kairui Song <kasong@tencent.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Usama Arif <usamaarif642@gmail.com>
      Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
      Cc: Kairui Song <ryncsn@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios · 325efb16
      Barry Song authored
      With large folios swap-in, we might need to uncharge multiple entries at
      once, so add an nr argument to mem_cgroup_swapin_uncharge_swap().
      
      For the existing two users, just pass nr=1.
      
      Link: https://lkml.kernel.org/r/20240908232119.2157-3-21cnbao@gmail.com
      Signed-off-by: Barry Song <v-songbaohua@oppo.com>
      Acked-by: Chris Li <chrisl@kernel.org>
      Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kairui Song <kasong@tencent.com>
      Cc: Kairui Song <ryncsn@gmail.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Chuanhua Han <hanchuanhua@oppo.com>
      Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
      Cc: Usama Arif <usamaarif642@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: fix swap_read_folio_zeromap() for large folios with partial zeromap · 9d57090e
      Barry Song authored
      Patch series "mm: enable large folios swap-in support", v9.
      
      Currently, we support mTHP swapout but not swapin.  This means that once
      mTHP is swapped out, it will come back as small folios when swapped in. 
      This is particularly detrimental for devices like Android, where more than
      half of the memory is in swap.
      
      The lack of mTHP swapin functionality makes mTHP a showstopper in
      scenarios that heavily rely on swap.  This patchset introduces mTHP
      swap-in support.  It starts with synchronous devices similar to zRAM,
      aiming to benefit as many users as possible with minimal changes.
      
      
      This patch (of 3):
      
      There could be a corner case where the first entry is non-zeromap, but a
      subsequent entry is zeromap.  In this case, we should not let
      swap_read_folio_zeromap() return false since we will still read corrupted
      data.
      
      Additionally, the iteration of test_bit() is unnecessary and can be
      replaced with bitmap operations, which are more efficient.
      
      We can adopt the style of swap_pte_batch() and folio_pte_batch() to
      introduce swap_zeromap_batch() which seems to provide the greatest
      flexibility for the caller.  This approach allows the caller to either
      check if the zeromap status of all entries is consistent or determine the
      number of contiguous entries with the same status.
      
      Since swap_read_folio() can't handle reading a large folio that's
      partially zeromap and partially non-zeromap, we've moved the code to
      mm/swap.h so that others, like those working on swap-in, can access it.
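
The batching idea described above can be sketched in userspace C. This is only an illustration: the names zeromap_batch() and zeromap_uniform() are hypothetical, the bitmap layout is an assumption, and the loop tests one bit at a time for portability, whereas the commit message says the kernel code uses bitmap operations instead.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Read bit nr from a bitmap stored as an array of unsigned long. */
bool bitmap_test(const unsigned long *map, size_t nr)
{
	return (map[nr / BITS_PER_ULONG] >> (nr % BITS_PER_ULONG)) & 1UL;
}

/*
 * Starting at start, count how many of the next max_nr entries share the
 * zeromap status of the first entry. That status is reported through
 * is_zeromap when the pointer is non-NULL.
 */
int zeromap_batch(const unsigned long *map, size_t start, int max_nr,
		  bool *is_zeromap)
{
	bool first;
	int nr = 1;

	if (max_nr <= 0)
		return 0;

	first = bitmap_test(map, start);
	while (nr < max_nr && bitmap_test(map, start + nr) == first)
		nr++;

	if (is_zeromap)
		*is_zeromap = first;
	return nr;
}

/* A large folio can be read in one go only if all of its entries agree. */
bool zeromap_uniform(const unsigned long *map, size_t start, int nr)
{
	return zeromap_batch(map, start, nr, NULL) == nr;
}
```

A caller that needs all-or-nothing consistency compares the returned count with the folio size, while a caller that wants the longest uniform run uses the count directly.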
      
      Link: https://lkml.kernel.org/r/20240908232119.2157-1-21cnbao@gmail.com
      Link: https://lkml.kernel.org/r/20240908232119.2157-2-21cnbao@gmail.com
      Fixes: 0ca0c24e ("mm: store zero pages to be swapped out in a bitmap")
      Signed-off-by: Barry Song <v-songbaohua@oppo.com>
      Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: Usama Arif <usamaarif642@gmail.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Chuanhua Han <hanchuanhua@oppo.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kairui Song <kasong@tencent.com>
      Cc: Kairui Song <ryncsn@gmail.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries · a0c9fd22
      Anshuman Khandual authored
      This replaces all the existing READ_ONCE() based page table accesses with
      the respective pxdp_get() helpers.  Although these helpers might also fall
      back to READ_ONCE() by default, they give platforms an opportunity to
      override them when required.  This change is a step toward replacing all
      page table entry accesses with the respective pxdp_get() helpers.
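
The override pattern mentioned above can be sketched in userspace C. The pte_t type and the READ_ONCE_ULONG() macro here are simplified stand-ins chosen for this example, not the kernel definitions; only the "#ifndef helper, provide a default" structure is the point.

```c
#include <assert.h>

/* Simplified stand-in for a page table entry type. */
typedef struct { unsigned long val; } pte_t;

/* Userspace stand-in for READ_ONCE(): a single volatile load. */
#define READ_ONCE_ULONG(x) (*(const volatile unsigned long *)&(x))

/*
 * Generic fallback. A platform header included earlier could provide its
 * own ptep_get() and "#define ptep_get ptep_get" to replace this version.
 */
#ifndef ptep_get
static inline pte_t ptep_get(pte_t *ptep)
{
	pte_t pte = { READ_ONCE_ULONG(ptep->val) };

	return pte;
}
#endif

/* Callers go through the helper instead of open-coding the read. */
unsigned long pte_value(pte_t *ptep)
{
	return ptep_get(ptep).val;
}
```

Because every caller uses the helper, a platform that needs a different access sequence only has to replace one function.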
      
      Link: https://lkml.kernel.org/r/20240910115746.514454-1-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • set_memory: add __must_check to generic stubs · 82ce8e2f
      Christophe Leroy authored
      The following query shows that architectures that don't provide
      asm/set_memory.h don't use set_memory_...() functions.
      
        $ git grep set_memory_ alpha arc csky hexagon loongarch m68k microblaze mips nios2 openrisc parisc sh sparc um xtensa
      
      The following query shows that all core users of set_memory_...()
      functions always take the returned value into account:
      
        $ git grep -w -e set_memory_ro -e set_memory_rw -e set_memory_x -e set_memory_nx -e set_memory_rox `find . -maxdepth 1 -type d | grep -v arch | grep /`
      
      set_memory_...() functions can fail, leaving the memory attributes
      unchanged. Make sure all callers check the returned code.
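
As a rough illustration of what a must-check annotation buys, the sketch below uses the GCC/Clang warn_unused_result attribute under a local macro name. The function set_region_ro() and its failure conditions are hypothetical stand-ins for a set_memory_...() call, not the kernel implementation.

```c
#include <assert.h>
#include <errno.h>

/* Local name for the compiler attribute; the compiler warns if the result is dropped. */
#define must_check __attribute__((warn_unused_result))

/* Hypothetical stand-in for a set_memory_...() call that can fail. */
must_check int set_region_ro(unsigned long addr, int numpages)
{
	if (numpages <= 0 || (addr & 0xfffUL))
		return -EINVAL;	/* attributes are left unchanged on failure */
	return 0;
}

/* The caller propagates the result instead of silently dropping it. */
int protect_region(unsigned long addr, int numpages)
{
	int ret = set_region_ro(addr, numpages);

	if (ret)
		return ret;
	return 0;
}
```

Calling set_region_ro() as a bare statement would trigger a compiler warning, which is the behavior the annotation is meant to enforce.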
      
      Link: https://github.com/KSPP/linux/issues/7
      Link: https://lkml.kernel.org/r/6a89ffc69666de84721216947c6b6c7dcca39d7d.1725723347.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Kees Cook <kees@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/vma: return the exact errno in vms_gather_munmap_vmas() · 659c55ef
      Xiao Yang authored
      __split_vma() and mas_store_gfp() return several types of errno on
      failure, so don't ignore them in vms_gather_munmap_vmas().  For example,
      __split_vma() returns -EINVAL when an unaligned huge page is unmapped.
      This issue is reproduced by the LTP memfd_create03 test.
      
      Don't initialise the error variable and assign it when a failure actually
      occurs.
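
The error-handling style described above can be sketched as follows. The helpers split_step() and store_step() are hypothetical and only exist to produce two different error codes; the point is that the error variable is assigned where a failure happens, so the exact code reaches the caller.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical helpers that fail with different error codes. */
int split_step(int aligned)
{
	return aligned ? 0 : -EINVAL;
}

int store_step(int have_memory)
{
	return have_memory ? 0 : -ENOMEM;
}

int gather(int aligned, int have_memory)
{
	int error;	/* deliberately not pre-set to a catch-all value */

	error = split_step(aligned);
	if (error)
		goto fail;

	error = store_step(have_memory);
	if (error)
		goto fail;

	return 0;

fail:
	/* Cleanup would go here; the original error code is returned as-is. */
	return error;
}
```

A caller can now tell an alignment problem (-EINVAL) apart from an allocation failure (-ENOMEM).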
      
      [akpm@linux-foundation.org: fix whitespace, per Liam]
      Link: https://lkml.kernel.org/r/20240909125621.1994-1-ice_yangxiao@163.com
      Fixes: 6898c903 ("mm/vma: extract the gathering of vmas from do_vmi_align_munmap()")
      Signed-off-by: Xiao Yang <ice_yangxiao@163.com>
      Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Closes: https://lore.kernel.org/oe-lkp/202409081536.d283a0fb-oliver.sang@intel.com
      Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memcg: cleanup with !CONFIG_MEMCG_V1 · f2c5101b
      Michal Koutný authored
      Extern declarations have no definitions with !CONFIG_MEMCG_V1 and no
      users, so drop them altogether.
      
      Link: https://lkml.kernel.org/r/20240909163223.3693529-1-mkoutny@suse.com
      Link: https://lkml.kernel.org/r/20240909163223.3693529-2-mkoutny@suse.com
      Signed-off-by: Michal Koutný <mkoutny@suse.com>
      Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Chen Ridong <chenridong@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/show_mem.c: report alloc tags in human readable units · fd00be9a
      Kent Overstreet authored
      We already do this when reporting slab info - more consistent and more
      readable.
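
For readers unfamiliar with the idea, the sketch below shows one way to turn a byte count into a human-readable string. It is a simplified userspace stand-in with an assumed output format (one decimal place, binary units); it does not reproduce the kernel helper or the exact format used in show_mem.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format a byte count with binary units into buf (at most len bytes). */
void format_bytes(unsigned long long bytes, char *buf, size_t len)
{
	static const char *const units[] = { "B", "KiB", "MiB", "GiB", "TiB" };
	double value = (double)bytes;
	size_t i = 0;

	/* Scale down until the value fits the current unit or units run out. */
	while (value >= 1024.0 && i < sizeof(units) / sizeof(units[0]) - 1) {
		value /= 1024.0;
		i++;
	}
	snprintf(buf, len, "%.1f %s", value, units[i]);
}
```

For example, 1536 bytes is rendered as "1.5 KiB", which is easier to scan in a report than the raw number.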
      
      Link: https://lkml.kernel.org/r/20240906005337.1220091-1-kent.overstreet@linux.dev
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: support poison recovery from copy_present_page() · 658be465
      Kefeng Wang authored
      Similar to other poison recovery, use copy_mc_user_highpage() to avoid a
      potential kernel panic while copying a page in copy_present_page() during
      fork.  Once the copy fails due to hwpoison in the source page, we need to
      break out of the copy in copy_pte_range() and release the prealloc folio,
      so copy_mc_user_highpage() is moved ahead of setting *prealloc to NULL.
      
      Link: https://lkml.kernel.org/r/20240906024201.1214712-3-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: support poison recovery from do_cow_fault() · aa549f92
      Kefeng Wang authored
      Patch series "mm: hwpoison: two more poison recovery".
      
      One more CoW path gains poison recovery support in do_cow_fault(), and the
      last copy_user_highpage() user is replaced with copy_mc_user_highpage() in
      copy_present_page() during fork to support poison recovery too.
      
      
      This patch (of 2):
      
      Like commit a873dfe1 ("mm, hwpoison: try to recover from copy-on
      write faults"), there is another path which could crash because it does
      not have recovery code where poison is consumed by the kernel in
      do_cow_fault().  The crash call trace below is from an old kernel, but it
      could also happen in the latest mainline code:
      
        CPU: 7 PID: 3248 Comm: mpi Kdump: loaded Tainted: G           OE     5.10.0 #1
        pc : copy_page+0xc/0xbc
        lr : copy_user_highpage+0x50/0x9c
        Call trace:
          copy_page+0xc/0xbc
          do_cow_fault+0x118/0x2bc
          do_fault+0x40/0x1a4
          handle_pte_fault+0x154/0x230
          __handle_mm_fault+0x1a8/0x38c
          handle_mm_fault+0xf0/0x250
          do_page_fault+0x184/0x454
          do_translation_fault+0xac/0xd4
          do_mem_abort+0x44/0xbc
      
      Fix it by using copy_mc_user_highpage() to handle this case and return
      VM_FAULT_HWPOISON for cow fault.
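
The control flow of this fix can be modelled in userspace. Everything in the sketch is a simulation under stated assumptions: struct page_sim, copy_mc_page_sim() and the FAULT_* values are hypothetical stand-ins, and the "poisoned" flag replaces a real hardware memory error. It only shows the idea of a copy that reports failure so the fault handler can return an error instead of crashing.

```c
#include <assert.h>
#include <string.h>

#define SIM_PAGE_SIZE 4096
#define FAULT_OK 0
#define FAULT_HWPOISON 1	/* stand-in for VM_FAULT_HWPOISON */

/* Simulated page: payload plus a flag marking a hardware memory error. */
struct page_sim {
	unsigned char data[SIM_PAGE_SIZE];
	int poisoned;
};

/* Stand-in for a machine-check-safe copy: nonzero means the copy failed. */
int copy_mc_page_sim(struct page_sim *dst, const struct page_sim *src)
{
	if (src->poisoned)
		return -1;
	memcpy(dst->data, src->data, SIM_PAGE_SIZE);
	return 0;
}

/* The fault path reports the poison instead of consuming it blindly. */
int cow_fault_sim(struct page_sim *dst, const struct page_sim *src)
{
	if (copy_mc_page_sim(dst, src))
		return FAULT_HWPOISON;
	return FAULT_OK;
}
```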
      
      [wangkefeng.wang@huawei.com: unlock/put vmf->page, per Miaohe]
        Link: https://lkml.kernel.org/r/20240910021541.234300-1-wangkefeng.wang@huawei.com
      Link: https://lkml.kernel.org/r/20240906024201.1214712-1-wangkefeng.wang@huawei.com
      Link: https://lkml.kernel.org/r/20240906024201.1214712-2-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • resource, kunit: add test case for region_intersects() · 99185c10
      Huang Ying authored
      Patch series "resource: Fix region_intersects() vs
      add_memory_driver_managed()", v3.
      
      The patchset fixes a bug of region_intersects() for systems with CXL
      memory.  The details of the bug can be found in [1/3].  To avoid similar
      bugs in the future, a kunit test case for region_intersects() is added in
      [3/3].  [2/3] is a preparation patch for [3/3].
      
      
      This patch (of 3):
      
      region_intersects() is important because it's used for /dev/mem permission
      checking.  To guard against possible bugs in region_intersects() in the
      future, a kunit test case for region_intersects() is added.
      
      Link: https://lkml.kernel.org/r/20240906030713.204292-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20240906030713.204292-4-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Alison Schofield <alison.schofield@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • resource: make alloc_free_mem_region() works for iomem_resource · bacf9c3c
      Huang Ying authored
      While developing a kunit test case for region_intersects(), some fake
      resources need to be inserted into iomem_resource.  To do that, a resource
      hole needs to be found first in iomem_resource.

      However, alloc_free_mem_region() cannot work for iomem_resource now,
      because the start address to check cannot be 0 to detect address wrapping
      0 in gfr_continue(), while iomem_resource.start == 0.  To make
      alloc_free_mem_region() work for iomem_resource, gfr_start() is changed
      to avoid returning 0 even if base->start == 0.  We don't need to check 0
      as a start address.
      
      Link: https://lkml.kernel.org/r/20240906030713.204292-3-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Alison Schofield <alison.schofield@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: z3fold: deprecate CONFIG_Z3FOLD · 7a2369b7
      Yosry Ahmed authored
      The z3fold compressed pages allocator is rarely used; most users use
      zsmalloc.  The only disadvantage of zsmalloc in comparison is the
      dependency on MMU, and zbud is a more common option for !MMU as it was the
      default zswap allocator for a long time.
      
      Historically, zsmalloc had worse latency than zbud and z3fold but offered
      better memory savings.  This is no longer the case as shown by a simple
      recent analysis [1].  That analysis showed that z3fold does not have any
      advantage over zsmalloc or zbud considering both performance and memory
      usage.  In a kernel build test on tmpfs in a limited cgroup, z3fold took
      3% more time and used 1.8% more memory.  The latency of zswap_load() was
      7% higher, and that of zswap_store() was 10% higher.  Zsmalloc is better
      in all metrics.
      
      Moreover, z3fold apparently has latent bugs, which were made noticeable by
      a recent soft lockup bug report with z3fold [2].  Switching to zsmalloc
      not only fixed the problem, but also reduced the swap usage from 6~8G to
      1~2G.  Other users have also reported being bitten by mistakenly enabling
      z3fold.
      
      Other than hurting users, z3fold is repeatedly causing wasted engineering
      effort.  Apart from investigating the above bug, it came up in multiple
      development discussions (e.g.  [3]) as something we need to handle, when
      there aren't any legit users (at least not intentionally).
      
      The natural course of action is to deprecate z3fold, and remove in a few
      cycles if no objections are raised from active users.  Next on the list
      should be zbud, as it offers marginal latency gains at the cost of huge
      memory waste when compared to zsmalloc.  That one will need to wait until
      zsmalloc does not depend on MMU.
      
      Rename the user-visible config option from CONFIG_Z3FOLD to
      CONFIG_Z3FOLD_DEPRECATED so that users with CONFIG_Z3FOLD=y get a new
      prompt with explanation during make oldconfig.  Also, remove
      CONFIG_Z3FOLD=y from defconfigs.
      
      [1] https://lore.kernel.org/lkml/CAJD7tkbRF6od-2x_L8-A1QL3=2Ww13sCj4S3i4bNndqF+3+_Vg@mail.gmail.com/
      [2] https://lore.kernel.org/lkml/EF0ABD3E-A239-4111-A8AB-5C442E759CF3@gmail.com/
      [3] https://lore.kernel.org/lkml/CAJD7tkbnmeVugfunffSovJf9FAgy9rhBVt_tx=nxUveLUfqVsA@mail.gmail.com/
      
      [arnd@arndb.de: deprecate ZSWAP_ZPOOL_DEFAULT_Z3FOLD as well]
        Link: https://lkml.kernel.org/r/20240909202625.1054880-1-arnd@kernel.org
      Link: https://lkml.kernel.org/r/20240904233343.933462-1-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Chris Down <chris@chrisdown.name>
      Acked-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vitaly Wool <vitaly.wool@konsulko.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: WANG Xuerui <kernel@xen0n.name>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7a2369b7
    • Alex Williamson's avatar
      vfio/pci: implement huge_fault support · f9e54c3a
      Alex Williamson authored
      With the addition of pfnmap support in vmf_insert_pfn_{pmd,pud}() we can
      take advantage of PMD and PUD faults to PCI BAR mmaps and create more
      efficient mappings.  PCI BARs are always a power of two and will typically
      get at least PMD alignment without userspace even trying.  Userspace
      alignment for PUD mappings is also not too difficult.
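The alignment condition described above can be sketched in user-space C. This is a toy model, not the vfio-pci code: `struct toy_vma`, `toy_can_map_order()` and the 4 KiB page size are assumptions made for illustration. It checks that the naturally aligned block of 2^order pages around the faulting address lies inside the mapping and that its backing pfn is aligned to the same order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                      /* assumption: 4 KiB base pages */
#define PAGE_SIZE  (UINT64_C(1) << PAGE_SHIFT)

/* Hypothetical stand-in for the part of a VMA this check needs. */
struct toy_vma {
    uint64_t vm_start;  /* first mapped virtual address */
    uint64_t vm_end;    /* one past the last mapped address */
    uint64_t base_pfn;  /* pfn backing vm_start (start of the BAR) */
};

/*
 * Return true if a fault at 'addr' can be served by one mapping of
 * 2^order pages: the naturally aligned block must sit entirely inside
 * the VMA and its backing pfn must be aligned to the same order.
 */
static bool toy_can_map_order(const struct toy_vma *vma, uint64_t addr,
                              unsigned int order)
{
    uint64_t size = PAGE_SIZE << order;
    uint64_t start = addr & ~(size - 1);    /* align the address down */
    uint64_t pfn;

    assert(addr >= vma->vm_start && addr < vma->vm_end);

    if (start < vma->vm_start || start + size > vma->vm_end)
        return false;

    pfn = vma->base_pfn + ((start - vma->vm_start) >> PAGE_SHIFT);
    return (pfn & ((UINT64_C(1) << order) - 1)) == 0;
}
```

With a 16 MiB mapping whose start address and base pfn are both 2 MiB aligned, order 9 (2 MiB) passes the check while order 18 (1 GiB) does not, so a handler built this way would fall back to a smaller mapping size.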
      
      Consolidate faults through a single handler with a new wrapper for
      standard single page faults.  The pre-faulting behavior of commit
      d71a989c ("vfio/pci: Insert full vma on mmap'd MMIO fault") is removed
      in this refactoring since huge_fault will cover the bulk of the faults and
      results in more efficient page table usage.  We also want to avoid that
      pre-faulted single page mappings preempt huge page mappings.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-20-peterx@redhat.com
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f9e54c3a
    • Peter Xu's avatar
      mm/arm64: support large pfn mappings · 3e509c9b
      Peter Xu authored
      Support huge pfnmaps by using bit 56 (PTE_SPECIAL) for "special" on
      pmds/puds.  Provide the pmd/pud helpers to set/get special bit.
      
      One more thing is missing for arm64: the pxx_pgprot() helpers for
      pmd/pud.  Add them too; they are mostly the same as the pte version,
      dropping the pfn field.  These helpers are needed by the new
      follow_pfnmap*() API to report valid pgprot_t results.
      
      Note that arm64 doesn't support huge PUDs yet, but it's still
      straightforward to provide the pud helpers we need at the same time.  Only
      the PMD helpers bring an immediate benefit until arm64 supports huge
      PUDs in general (e.g. in THPs).
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-19-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3e509c9b
    • Peter Xu's avatar
      mm/x86: support large pfn mappings · 75182022
      Peter Xu authored
      Add helpers to install and detect special pmd/pud entries.  In short, bit 9
      on x86 is not used for pmd/pud, so we can define it the same way as at
      the pte level.  Note that the bit is also used by _PAGE_BIT_CPA_TEST, but
      that is only used in the debug test and shouldn't conflict in this case.
      
      Also note that pxx_set|clear_flags() for pmd/pud need to be moved
      up so that they can be referenced by the new special bit helpers.
      There's no change in the code that was moved.
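As a rough illustration of the special-bit helpers, here is a user-space sketch. Only the bit position (9) comes from the commit text; the `toy_pmd_t` type and the `toy_*` function names are hypothetical and do not reproduce the kernel's helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SPECIAL_BIT 9u  /* bit position taken from the commit text */
#define TOY_SPECIAL     (UINT64_C(1) << TOY_SPECIAL_BIT)

static_assert(TOY_SPECIAL == 0x200, "bit 9 corresponds to mask 0x200");

/* Hypothetical model of a pmd entry: just the raw 64-bit value. */
typedef struct { uint64_t val; } toy_pmd_t;

/* Mark an entry as special (no struct page behind the mapping). */
static toy_pmd_t toy_pmd_mkspecial(toy_pmd_t pmd)
{
    pmd.val |= TOY_SPECIAL;
    return pmd;
}

/* Test whether an entry carries the special bit. */
static bool toy_pmd_special(toy_pmd_t pmd)
{
    return (pmd.val & TOY_SPECIAL) != 0;
}
```

A lockless walker such as gup-fast can use a test of this kind to refuse entries that have no struct page behind them, which is the same role the special bit already plays for ptes.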
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-18-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      75182022
    • Peter Xu's avatar
      mm: remove follow_pte() · b0a1c0d0
      Peter Xu authored
      follow_pte() users have been converted to follow_pfnmap*().  Remove the
      API.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-17-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b0a1c0d0
    • Peter Xu's avatar
      mm/access_process_vm: use the new follow_pfnmap API · b17269a5
      Peter Xu authored
      Use the new API that can understand huge pfn mappings.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-16-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b17269a5
    • Peter Xu's avatar
      acrn: use the new follow_pfnmap API · e6bc784c
      Peter Xu authored
      Use the new API that can understand huge pfn mappings.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-15-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e6bc784c
    • Peter Xu's avatar
      vfio: use the new follow_pfnmap API · a77f9489
      Peter Xu authored
      Use the new API that can understand huge pfn mappings.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-14-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a77f9489
    • Peter Xu's avatar
      mm/x86/pat: use the new follow_pfnmap API · cbea8536
      Peter Xu authored
      Use the new API that can understand huge pfn mappings.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-13-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cbea8536
    • Peter Xu's avatar
      s390/pci_mmio: use follow_pfnmap API · bd8c2d18
      Peter Xu authored
      Use the new API that can understand huge pfn mappings.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-12-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bd8c2d18
    • Peter Xu's avatar
      KVM: use follow_pfnmap API · 5731aacd
      Peter Xu authored
      Use the new pfnmap API to allow huge MMIO mappings for VMs.  The rest of
      the work is already handled on the other side (host_pfn_mapping_level()).
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-11-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5731aacd
    • Peter Xu's avatar
      mm: new follow_pfnmap API · 6da8e963
      Peter Xu authored
      Introduce a pair of APIs to follow pfn mappings and get entry information.
      It's very similar to what follow_pte() did before, except that
      it recognizes huge pfn mappings.
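The difference from a pte-only walker can be illustrated with a toy two-level lookup in user-space C. This is not the kernel's follow_pfnmap implementation: the `toy_*` types and names, the 4 KiB page size and the 2 MiB pmd leaf size are all assumptions. The point is only that a walker has to stop at a huge leaf and derive the pfn from the offset inside it instead of always descending to a pte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT   12u   /* assumption: 4 KiB pages */
#define PMD_SHIFT    21u   /* assumption: 2 MiB pmd leaves */
#define PTRS_PER_PTE 512u

struct toy_pte {
    bool present;
    uint64_t pfn;
};

struct toy_pmd {
    bool present;
    bool leaf;                    /* true: huge mapping, 'pfn' is its base pfn */
    uint64_t pfn;
    const struct toy_pte *table;  /* pte table, used only when !leaf */
};

/* Resolve 'addr' to a pfn. Returns 0 on success, -1 if nothing is mapped. */
static int toy_follow_pfnmap(const struct toy_pmd *pmd, uint64_t addr,
                             uint64_t *pfn)
{
    assert(pmd != NULL && pfn != NULL);

    if (!pmd->present)
        return -1;

    if (pmd->leaf) {
        /* Huge mapping: add the page offset within the 2 MiB region. */
        uint64_t offset = addr & ((UINT64_C(1) << PMD_SHIFT) - 1);
        *pfn = pmd->pfn + (offset >> PAGE_SHIFT);
        return 0;
    }

    /* Regular case: index the pte table with the low address bits. */
    const struct toy_pte *pte =
        &pmd->table[(addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)];
    if (!pte->present)
        return -1;
    *pfn = pte->pfn;
    return 0;
}
```

A walker that only understands ptes would have no pte table to descend into in the leaf case, which is why a pte-only interface cannot describe huge pfn mappings.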
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-10-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6da8e963
    • Peter Xu's avatar
      mm: always define pxx_pgprot() · 0515e022
      Peter Xu authored
      There are:
      
        - 8 archs (arc, arm64, include, mips, powerpc, s390, sh, x86) that
        support pte_pgprot().
      
        - 2 archs (x86, sparc) that support pmd_pgprot().
      
        - 1 arch (x86) that supports pud_pgprot().
      
      Always define them so they can be used in generic code without having to
      fiddle with "#ifdef"s.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-9-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0515e022
    • Peter Xu's avatar
      mm/fork: accept huge pfnmap entries · bc02afbd
      Peter Xu authored
      Teach the fork code to properly copy pfnmaps at the pmd/pud levels.  Pud is
      much easier; however, the write bit needs to be preserved for writable and
      shared pud mappings like PFNMAP ones, otherwise a follow-up write in
      either the parent or the child process will trigger a write fault.
      
      Do the same for the pmd level.
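The rule about keeping the write bit can be modelled in a few lines of user-space C. This is a sketch under stated assumptions, not the fork code itself: the flag values and the `toy_*` names are hypothetical. A private (copy-on-write) mapping gets its copied entry write-protected, while a shared mapping keeps the write bit so that neither parent nor child takes a needless write fault.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values chosen for this sketch only. */
#define TOY_VM_SHARED   0x1u
#define TOY_VM_MAYWRITE 0x2u
#define TOY_ENTRY_WRITE (UINT64_C(1) << 1)

static_assert((TOY_VM_SHARED & TOY_VM_MAYWRITE) == 0,
              "flag bits must not overlap");

/* A mapping is copy-on-write if it may be written but is not shared. */
static bool toy_is_cow_mapping(unsigned int vm_flags)
{
    return (vm_flags & (TOY_VM_SHARED | TOY_VM_MAYWRITE)) == TOY_VM_MAYWRITE;
}

/* Compute the entry value a child would inherit at fork time. */
static uint64_t toy_copy_huge_entry(uint64_t entry, unsigned int vm_flags)
{
    if (toy_is_cow_mapping(vm_flags))
        entry &= ~TOY_ENTRY_WRITE;  /* private: write-protect for CoW */
    return entry;                   /* shared: write bit is preserved */
}
```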
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-8-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bc02afbd
    • Peter Xu's avatar
      mm/pagewalk: check pfnmap for folio_walk_start() · 10d83d77
      Peter Xu authored
      Teach folio_walk_start() to recognize special pmd/pud mappings, and fail
      them properly as it means there's no folio backing them.
      
      [peterx@redhat.com: remove some stale comments, per David]
        Link: https://lkml.kernel.org/r/20240829202237.2640288-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20240826204353.2228736-7-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      10d83d77
    • Peter Xu's avatar
      mm/gup: detect huge pfnmap entries in gup-fast · ae3c99e6
      Peter Xu authored
      Since gup-fast doesn't have the vma reference, teach it to detect such huge
      pfnmaps by checking the special bit for pmd/pud too, just like ptes.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-6-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ae3c99e6
    • Peter Xu's avatar
      mm: allow THP orders for PFNMAPs · 5dd40721
      Peter Xu authored
      This enables PFNMAPs to be mapped at either the pmd or pud level.  Generalize
      the dax case into vma_is_special_huge() so as to cover both.  Meanwhile, rename
      the macro to THP_ORDERS_ALL_SPECIAL.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-5-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5dd40721
    • Peter Xu's avatar
      mm: mark special bits for huge pfn mappings when inject · 3c8e44c9
      Peter Xu authored
      We need these special bits to be present on pfnmaps.  Mark them properly in
      the !devmap case, reflecting that there's no struct page backing the entry.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-4-peterx@redhat.com
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3c8e44c9
    • Peter Xu's avatar
      mm: drop is_huge_zero_pud() · ef713ec3
      Peter Xu authored
      It has constantly returned false since 2017.  One assertion was added in
      2019, but it should never have triggered, IOW it means what is checked
      should be asserted instead.
      
      If it hasn't existed for 7 years, maybe it's a good idea to remove it and
      add it again only when it is needed.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-3-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ef713ec3
    • Peter Xu's avatar
      mm: introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud · 6857be5f
      Peter Xu authored
      Patch series "mm: Support huge pfnmaps", v2.
      
      Overview
      ========
      
      This series implements huge pfnmap support for mm in general.  Huge
      pfnmap allows e.g. VM_PFNMAP vmas to be mapped at either the PMD or PUD
      level, similar to what we do with dax / thp / hugetlb so far, to benefit
      from TLB hits.  Now we extend that idea to PFN mappings, e.g. PCI MMIO
      BARs, which can grow as large as 8GB or even bigger.
      
      Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.  The last
      patch (from Alex Williamson) will be the first user of huge pfnmap, so as
      to enable vfio-pci driver to fault in huge pfn mappings.
      
      Implementation
      ==============
      
      In reality, it's relatively simple to add such support compared to many
      other types of mappings, because a PFNMAP has no vmemmap backing it, so
      most of the kernel routines on huge mappings should simply already fail
      for them, like GUPs or old-school follow_page() (which was recently
      rewritten to be the folio_walk* APIs by David).
      
      One trick here is that PUD support is still immature in generic paths
      here and there, as DAX is so far the only user.  This patchset adds the
      2nd user.  Hugetlb can become a 3rd user if the hugetlb unification work
      goes smoothly, but that is to be discussed later.
      
      The other trick is how to allow gup-fast to work for such huge mappings
      even if there's no direct way of knowing whether it's a normal page or an
      MMIO mapping.  This series chose to keep the pte_special solution: it
      reuses a similar idea by setting a special bit on pfnmap PMDs/PUDs so
      that gup-fast will be able to identify them and fail properly.
      
      Along the way, we'll also notice that the major pgtable pfn walker,
      follow_pte(), will need to retire soon because it only works with ptes.
      A new, simple set of APIs is introduced (the follow_pfnmap* API) that can
      do whatever follow_pte() can already do, and can also process huge
      pfnmaps.  Half of this series is about that and about converting all
      existing pfnmap walkers to use the new API properly.  Hopefully the new
      API also looks better because it avoids exposing e.g.  pgtable lock
      details to the callers, so that it can be used in an even more
      straightforward way.
      
      Here, three more options will be introduced and involved in huge pfnmap:
      
        - ARCH_SUPPORTS_HUGE_PFNMAP
      
          Arch developers will need to select this option in the arch's Kconfig
          when huge pfnmap is supported.  After this patchset is applied, both
          x86_64 and arm64 will start to enable it by default.
      
        - ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP
      
          These options are for driver developers to identify whether the
          current arch / config supports huge pfnmaps, and to decide whether
          they can use the huge pfnmap APIs to inject them.  One can refer to
          the last vfio-pci patch from Alex for how to use them properly in a
          device driver.
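
      As an illustration only (not part of the original series), a driver
      fault handler could gate each huge order on these options roughly as
      follows.  This is a userspace model: the MODEL_* names and
      model_order_supported() are hypothetical, the CONFIG_* symbols are
      defined by hand instead of coming from Kconfig, and the orders assume
      4 KiB base pages.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model only. In a real kernel build these symbols come from
 * Kconfig; they are defined by hand here (an x86_64-like configuration).
 */
#define CONFIG_ARCH_SUPPORTS_PMD_PFNMAP 1
#define CONFIG_ARCH_SUPPORTS_PUD_PFNMAP 1

/* Assumed orders for 4 KiB base pages: 2 MiB PMD, 1 GiB PUD. */
#define MODEL_PMD_ORDER 9
#define MODEL_PUD_ORDER 18

static_assert(MODEL_PMD_ORDER < MODEL_PUD_ORDER,
	      "a PUD must cover more than a PMD");

/* Hypothetical helper: can a fault of this order be served directly? */
static bool model_order_supported(unsigned int order)
{
	switch (order) {
	case 0:
		return true;	/* a plain PTE mapping always works */
#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
	case MODEL_PMD_ORDER:
		return true;
#endif
#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
	case MODEL_PUD_ORDER:
		return true;
#endif
	default:
		return false;	/* caller should retry with a smaller order */
	}
}
```

      Removing one of the two defines makes the matching case disappear at
      compile time, so a request of that order takes the fallback path.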
      
      So after the whole set is applied, and if one enables some dynamic debug
      lines in vfio-pci core files, we should observe things like:
      
        vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100
        vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100
        vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100
      
      In this specific case, it says that vfio-pci faults in PMDs properly for a
      few BAR0 offsets.
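
      To relate "order = 9" to the page offsets in these lines, the
      arithmetic can be sketched as below.  This is a minimal sketch assuming
      4 KiB base pages; the MODEL_* and model_* names are hypothetical and
      not taken from the series.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB base pages */

/* Bytes covered by one fault of the given order. */
static uint64_t model_order_to_bytes(unsigned int order)
{
	assert(order < 64 - MODEL_PAGE_SHIFT);
	return (uint64_t)1 << (MODEL_PAGE_SHIFT + order);
}

/* Number of base pages covered by one fault of the given order. */
static uint64_t model_order_to_pages(unsigned int order)
{
	assert(order < 64);
	return (uint64_t)1 << order;
}
```

      Under that assumption, order 9 covers 0x200 base pages (2 MiB), which
      matches the 0x200 stride between the page offsets 0x0, 0x200 and 0x400
      logged above.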
      
      Patch Layout
      ============
      
      Patch 1:         Introduce the new options mentioned above for huge PFNMAPs
      Patch 2:         A tiny cleanup
      Patch 3-8:       Preparation patches for huge pfnmap (including
                       introducing the special bit for pmd/pud)
      Patch 9-16:      Introduce follow_pfnmap*() API, use it everywhere, and
                       then drop follow_pte() API
      Patch 17:        Add huge pfnmap support for x86_64
      Patch 18:        Add huge pfnmap support for arm64
      Patch 19:        Add vfio-pci support for all kinds of huge pfnmaps (Alex)
      
      TODO
      ====
      
      More architectures / More page sizes
      ------------------------------------
      
      Currently only x86_64 (2M+1G) and arm64 (2M) are supported.  There seems
      to be a plan to support arm64 1G later on top of this series [2].
      
      Any arch will need to first support THP / THP_1G, then provide a special
      bit in pmds/puds to support huge pfnmaps.
      
      remap_pfn_range() support
      -------------------------
      
      Currently, remap_pfn_range() still only maps PTEs.  With the new option,
      remap_pfn_range() can logically start to inject either PMDs or PUDs when
      the alignment requirements match on the VAs.
      
      When the support is there, it should silently benefit all drivers that
      use remap_pfn_range() in their mmap() handlers, giving a better TLB hit
      rate and overall faster MMIO accesses, similar to what the processor
      gets on hugepages.
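
      The alignment requirement mentioned above can be sketched as a simple
      check.  This is an illustrative userspace model, not code from
      remap_pfn_range(): model_can_map_huge() is a hypothetical helper, and
      the sizes assume x86_64-style 2 MiB PMDs and 1 GiB PUDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PMD_SIZE ((uint64_t)1 << 21)	/* assumption: 2 MiB */
#define MODEL_PUD_SIZE ((uint64_t)1 << 30)	/* assumption: 1 GiB */

/*
 * Hypothetical check: a huge mapping of `size` bytes can start at `va`
 * only if VA and PA are both size-aligned and at least `size` bytes of
 * the range remain to be mapped.
 */
static bool model_can_map_huge(uint64_t va, uint64_t pa, uint64_t len,
			       uint64_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);	/* power of two */
	return ((va | pa) & (size - 1)) == 0 && len >= size;
}
```

      A caller would try MODEL_PUD_SIZE first, then MODEL_PMD_SIZE, and fall
      back to PTEs when neither check passes.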
      
      More driver support
      -------------------
      
      VFIO is so far the only consumer of huge pfnmaps after this series is
      applied.  Besides the generic remap_pfn_range() optimization above, a
      device driver can also try to optimize its mmap() for a better VA
      alignment for either PMD or PUD sizes.  This may, iiuc, normally require
      userspace changes, as the driver doesn't normally decide the VA at which
      a BAR is mapped.  But I don't think I know all the drivers well enough
      to know the full picture.
      
      Credit goes to Alex for helping test the GPU/NIC use cases above.
      
      [0] https://lore.kernel.org/r/73ad9540-3fb8-4154-9a4f-30a0a2b03d41@lucifer.local
      [1] https://lore.kernel.org/r/20240807194812.819412-1-peterx@redhat.com
      [2] https://lore.kernel.org/r/498e0731-81a4-4f75-95b4-a8ad0bcc7665@huawei.com
      
      
      This patch (of 19):
      
      This patch introduces the option to bring the special pte bit to
      pmds/puds.  Archs can start to define pmd_special / pud_special when
      supported by selecting the new option.  Per-arch support will be added
      later.
      
      Before that, create fallbacks for these helpers so that they are always
      available.
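
      The shape of such fallbacks can be sketched as follows.  This is a
      userspace model under stated assumptions, not the patch itself: pmd_t
      is mocked as a plain struct, and the pmd_mkspecial() setter is assumed
      here as the natural counterpart of pmd_special().  The pud variants
      would look the same.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mock type for this userspace model; the kernel's pmd_t is arch-defined. */
typedef struct { uint64_t val; } pmd_t;

static_assert(sizeof(pmd_t) == sizeof(uint64_t),
	      "the mock entry is one 64-bit word");

#ifndef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
/* Fallbacks: without arch support, no pmd is ever special. */
static inline bool pmd_special(pmd_t pmd)
{
	(void)pmd;
	return false;
}

static inline pmd_t pmd_mkspecial(pmd_t pmd)
{
	return pmd;	/* nothing to set, return the entry unchanged */
}
#endif
```

      With fallbacks like these, generic code can call the helpers
      unconditionally and the compiler drops the dead branches on archs
      without support.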
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20240826204353.2228736-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6857be5f
  2. 09 Sep, 2024 5 commits