  1. 17 Sep, 2024 4 commits
    • mm: remove follow_pte() · b0a1c0d0
      Peter Xu authored
      follow_pte() users have been converted to follow_pfnmap*().  Remove the
      API.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-17-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/access_process_vm: use the new follow_pfnmap API · b17269a5
      Peter Xu authored
      Use the new API that can understand huge pfn mappings.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-16-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: new follow_pfnmap API · 6da8e963
      Peter Xu authored
      Introduce a pair of APIs to follow pfn mappings and get entry information.
      It is very similar to what follow_pte() used to do, but differs in that it
      also recognizes huge pfn mappings.  (A usage sketch follows this entry.)
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-10-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
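
      For orientation, here is a minimal kernel-C sketch of the intended usage
      pattern of the new pair (take a snapshot of the mapping while the helper
      holds the page-table lock, then release it).  The struct, field and
      function names below reflect this series as understood here, and the
      wrapper lookup_pfn_of_addr() is hypothetical; treat the exact signatures
      as assumptions rather than a verbatim copy of the merged API.

       #include <linux/mm.h>

       /* Caller must hold the mmap lock; must not sleep between start/end. */
       static int lookup_pfn_of_addr(struct vm_area_struct *vma,
                                     unsigned long addr, unsigned long *pfn_out)
       {
               struct follow_pfnmap_args args = {
                       .vma = vma,             /* pfnmap VMA to walk */
                       .address = addr,        /* user address inside that VMA */
               };
               int ret;

               ret = follow_pfnmap_start(&args);       /* locks the entry */
               if (ret)
                       return ret;                     /* nothing mapped there */

               *pfn_out = args.pfn;    /* only stable until ..._end() */

               follow_pfnmap_end(&args);               /* drops the lock */
               return 0;
       }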
    • mm/pagewalk: check pfnmap for folio_walk_start() · 10d83d77
      Peter Xu authored
      Teach folio_walk_start() to recognize special pmd/pud mappings, and fail
      them properly as it means there's no folio backing them.
      
      [peterx@redhat.com: remove some stale comments, per David]
        Link: https://lkml.kernel.org/r/20240829202237.2640288-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20240826204353.2228736-7-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 09 Sep, 2024 1 commit
  3. 04 Sep, 2024 1 commit
    • mm,memcg: provide per-cgroup counters for NUMA balancing operations · f77f0c75
      Kaiyang Zhao authored
      The ability to observe the demotion and promotion decisions made by the
      kernel on a per-cgroup basis is important for monitoring and tuning
      containerized workloads on machines equipped with tiered memory.
      
      Different containers in the system may experience drastically different
      memory tiering actions that cannot be distinguished from the global
      counters alone.
      
      For example, a container running a workload with much hotter memory
      accesses will likely see more promotions and fewer demotions, potentially
      depriving a colocated container of top-tier memory to such an extent that
      its performance degrades unacceptably.

      As another example, some containers may exhibit longer periods between
      data reuse, causing many more numa_hint_faults than numa_pages_migrated.
      In this case, tuning hot_threshold_ms may be appropriate, but the signal
      can easily be lost if only global counters are available.
      
      In the long term, we hope to introduce per-cgroup control of promotion and
      demotion actions to implement memory placement policies in tiering.
      
      This patch set adds seven counters to memory.stat in a cgroup:
      numa_pages_migrated, numa_pte_updates, numa_hint_faults, pgdemote_kswapd,
      pgdemote_khugepaged, pgdemote_direct and pgpromote_success.  pgdemote_*
      and pgpromote_success are also available in memory.numa_stat.
      
      count_memcg_events_mm() is added to count multiple event occurrences at
      once, and get_mem_cgroup_from_folio() is added because we need to get a
      reference to the memcg of a folio before it's migrated to track
      numa_pages_migrated.  The accounting of PGDEMOTE_* is moved to
      shrink_inactive_list() before being changed to per-cgroup.
      
      [kaiyang2@cs.cmu.edu: add documentation of the memcg counters in cgroup-v2.rst]
        Link: https://lkml.kernel.org/r/20240814235122.252309-1-kaiyang2@cs.cmu.edu
      Link: https://lkml.kernel.org/r/20240814174227.30639-1-kaiyang2@cs.cmu.edu
      Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: Wei Xu <weixugc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
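
      The counters listed above land in memory.stat (and partly in
      memory.numa_stat).  A small userspace sketch for reading them, assuming
      cgroup v2 is mounted at /sys/fs/cgroup and using a hypothetical cgroup
      path "mygroup":

       #include <stdio.h>
       #include <string.h>

       int main(void)
       {
               /* counter names taken from the commit message above */
               const char *keys[] = {
                       "numa_pages_migrated", "numa_pte_updates",
                       "numa_hint_faults", "pgdemote_kswapd",
                       "pgdemote_khugepaged", "pgdemote_direct",
                       "pgpromote_success",
               };
               char line[256];
               FILE *f = fopen("/sys/fs/cgroup/mygroup/memory.stat", "r");

               if (!f)
                       return 1;
               while (fgets(line, sizeof(line), f)) {
                       for (size_t i = 0; i < sizeof(keys) / sizeof(keys[0]); i++) {
                               size_t n = strlen(keys[i]);
                               /* memory.stat lines look like "<name> <value>" */
                               if (!strncmp(line, keys[i], n) && line[n] == ' ')
                                       fputs(line, stdout);
                       }
               }
               fclose(f);
               return 0;
       }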
  4. 02 Sep, 2024 7 commits
  5. 16 Aug, 2024 1 commit
  6. 26 Jul, 2024 1 commit
    • mm: fix old/young bit handling in the faulting path · 4cd7ba16
      Ram Tummala authored
      Commit 3bd786f7 ("mm: convert do_set_pte() to set_pte_range()")
      replaced do_set_pte() with set_pte_range() and that introduced a
      regression in the following faulting path of non-anonymous vmas which
      caused the PTE for the faulting address to be marked as old instead of
      young.
      
      handle_pte_fault()
        do_pte_missing()
          do_fault()
            do_read_fault() || do_cow_fault() || do_shared_fault()
              finish_fault()
                set_pte_range()
      
      The polarity of the prefault calculation is incorrect.  This leads to
      prefault being incorrectly set for the faulting address, so the following
      check incorrectly marks the PTE old rather than young.  On some
      architectures this causes a double fault just to mark the entry young when
      the access is retried.  (A sketch of the corrected polarity follows this
      entry.)
      
          if (prefault && arch_wants_old_prefaulted_pte())
              entry = pte_mkold(entry);
      
      On a subsequent fault on the same address, the faulting path will see a
      non NULL vmf->pte and instead of reaching the do_pte_missing() path, PTE
      will then be correctly marked young in handle_pte_fault() itself.
      
      Due to this bug, performance degradation in the fault handling path will
      be observed due to unnecessary double faulting.
      
      Link: https://lkml.kernel.org/r/20240710014539.746200-1-rtummala@nvidia.com
      Fixes: 3bd786f7 ("mm: convert do_set_pte() to set_pte_range()")
      Signed-off-by: Ram Tummala <rtummala@nvidia.com>
      Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
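
      A sketch of the corrected polarity, as an illustration of the fix rather
      than a verbatim copy of the upstream hunk: prefault must only be true for
      the neighbouring pages being pre-mapped, never for the faulting address
      itself.  The helper name is hypothetical.

       static void mk_old_or_young(struct vm_fault *vmf, unsigned long addr,
                                   unsigned int nr, pte_t *entry)
       {
               /* the range [addr, addr + nr * PAGE_SIZE) is being mapped */
               bool prefault = !in_range(vmf->address, addr, nr * PAGE_SIZE);

               if (prefault && arch_wants_old_prefaulted_pte())
                       *entry = pte_mkold(*entry);      /* prefaulted neighbours */
               else
                       *entry = pte_sw_mkyoung(*entry); /* faulting page stays young */
       }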
  7. 19 Jul, 2024 1 commit
    • mm: add MAP_DROPPABLE for designating always lazily freeable mappings · 9651fced
      Jason A. Donenfeld authored
      The vDSO getrandom() implementation works with a buffer allocated with a
      new system call that has certain requirements:
      
      - It shouldn't be written to core dumps.
        * Easy: VM_DONTDUMP.
      - It should be zeroed on fork.
        * Easy: VM_WIPEONFORK.
      
      - It shouldn't be written to swap.
        * Uh-oh: mlock is rlimited.
        * Uh-oh: mlock isn't inherited by forks.
      
      - It shouldn't reserve actual memory, but it also shouldn't crash when
        page faulting in memory if none is available
        * Uh-oh: VM_NORESERVE means segfaults.
      
      It turns out that the vDSO getrandom() function has three really nice
      characteristics that we can exploit to solve this problem:
      
      1) Due to being wiped during fork(), the vDSO code is already robust to
         having the contents of the pages it reads zeroed out midway through
         the function's execution.
      
      2) In the absolute worst case of whatever contingency we're coding for,
         we have the option to fallback to the getrandom() syscall, and
         everything is fine.
      
      3) The buffers the function uses are only ever useful for a maximum of
         60 seconds -- a sort of cache, rather than a long term allocation.
      
      These characteristics mean that we can introduce VM_DROPPABLE, which
      has the following semantics:
      
      a) It never is written out to swap.
      b) Under memory pressure, mm can just drop the pages (so that they're
         zero when read back again).
      c) It is inherited by fork.
      d) It doesn't count against the mlock budget, since nothing is locked.
      e) If there's not enough memory to service a page fault, it's not fatal,
         and no signal is sent.
      
      This way, allocations used by vDSO getrandom() can use:
      
          VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
      
      And there will be no problem with OOMing, crashing on overcommitment,
      using memory when not in use, not wiping on fork(), coredumps, or
      writing out to swap.
      
      In order to let vDSO getrandom() use this, expose these via mmap(2) as
      MAP_DROPPABLE.
      
      Note that this involves removing the MADV_FREE special case from
      sort_folio(), which according to Yu Zhao is unnecessary and will simply
      result in an extra call to shrink_folio_list() in the worst case. The
      chunk removed reenables the swapbacked flag, which we don't want for
      VM_DROPPABLE, and we can't conditionalize it here because there isn't a
      vma reference available.
      
      Finally, the provided self test ensures that this is working as desired.
      
      Cc: linux-mm@kvack.org
      Acked-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
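
      A minimal userspace sketch of how such a mapping can be requested,
      assuming a kernel new enough to accept the flag; the fallback #define
      mirrors the value this series adds to the uapi header, but should be
      treated as an assumption.  Data in the mapping may read back as zeroes at
      any time under memory pressure, so it must only hold recomputable cache
      contents (as the vDSO getrandom() buffers do).

       #include <stdio.h>
       #include <sys/mman.h>

       #ifndef MAP_DROPPABLE
       #define MAP_DROPPABLE 0x08      /* assumed uapi value; see linux/mman.h */
       #endif

       int main(void)
       {
               size_t len = 4096;
               void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_ANONYMOUS | MAP_DROPPABLE, -1, 0);

               if (buf == MAP_FAILED) {
                       perror("mmap(MAP_DROPPABLE)");  /* e.g. older kernel */
                       return 1;
               }
               /* ... use buf strictly as a droppable cache ... */
               munmap(buf, len);
               return 0;
       }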
  8. 12 Jul, 2024 1 commit
  9. 06 Jul, 2024 1 commit
    • mm: move memory_failure_queue() into copy_mc_[user]_highpage() · 28bdacbc
      Kefeng Wang authored
      Patch series "mm: migrate: support poison recover from migrate folio", v5.
      
      Folio migration is widely used in the kernel (memory compaction, memory
      hotplug, soft page offlining, NUMA balancing, memory demotion/promotion,
      etc.), but if a poisoned source folio is accessed while migrating, the
      kernel will panic.

      There is a mechanism in the kernel to recover from uncorrectable memory
      errors, ARCH_HAS_COPY_MC (e.g., Machine Check Safe Memory Copy on x86),
      which is already used in NVDIMM and core-mm paths (e.g., CoW, khugepaged,
      coredump, KSM copy); see the copy_mc_to_{user,kernel} and
      copy_mc_{user_}highpage callers.

      This series of patches provides that recovery mechanism for the folio copy
      done during the widely used folio migration.  Note that because folio
      migration is not guaranteed to succeed anyway, we can choose to make it
      tolerant of memory failures: folio_mc_copy(), a machine-check-safe version
      of folio_copy(), returns an error when a poisoned source folio is
      accessed, so the folio migration fails instead of hitting a panic like the
      one shown below.
      
        CPU: 1 PID: 88343 Comm: test_softofflin Kdump: loaded Not tainted 6.6.0
        pc : copy_page+0x10/0xc0
        lr : copy_highpage+0x38/0x50
        ...
        Call trace:
         copy_page+0x10/0xc0
         folio_copy+0x78/0x90
         migrate_folio_extra+0x54/0xa0
         move_to_new_folio+0xd8/0x1f0
         migrate_folio_move+0xb8/0x300
         migrate_pages_batch+0x528/0x788
         migrate_pages_sync+0x8c/0x258
         migrate_pages+0x440/0x528
         soft_offline_in_use_page+0x2ec/0x3c0
         soft_offline_page+0x238/0x310
         soft_offline_page_store+0x6c/0xc0
         dev_attr_store+0x20/0x40
         sysfs_kf_write+0x4c/0x68
         kernfs_fop_write_iter+0x130/0x1c8
         new_sync_write+0xa4/0x138
         vfs_write+0x238/0x2d8
         ksys_write+0x74/0x110
      
      
      This patch (of 5):
      
      The callers of copy_mc_[user]_highpage() (e.g., the CoW and KSM page-copy
      paths) follow it with a memory_failure_queue() call, which marks the
      source page as hardware-poisoned and unmaps it from other tasks, and the
      upcoming poison recovery for migrated folios will need to do the same
      thing.  So move the memory_failure_queue() call into
      copy_mc_[user]_highpage() itself instead of adding it to each user; this
      should also improve the handling of poisoned pages in khugepaged.  (A
      caller sketch follows this entry.)
      
      Link: https://lkml.kernel.org/r/20240626085328.608006-1-wangkefeng.wang@huawei.com
      Link: https://lkml.kernel.org/r/20240626085328.608006-2-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
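
      A sketch of the resulting caller pattern, assuming copy_mc_user_highpage()
      returns 0 on success and non-zero when the copy hits poison; with this
      change the poisoned pfn is queued inside the helper, so callers only have
      to propagate the error.  The wrapper name is hypothetical.

       #include <linux/highmem.h>

       static int cow_copy_one_page(struct page *dst, struct page *src,
                                    unsigned long addr, struct vm_area_struct *vma)
       {
               if (copy_mc_user_highpage(dst, src, addr, vma))
                       return -EHWPOISON;      /* source page was poisoned */
               /* no explicit memory_failure_queue() in the caller any more */
               return 0;
       }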
  10. 04 Jul, 2024 20 commits
    • mm/migrate: move NUMA hinting fault folio isolation + checks under PTL · ee86814b
      David Hildenbrand authored
      Currently we always take a folio reference even if migration will not even
      be tried or isolation failed, requiring us to grab+drop an additional
      reference.
      
      Further, we end up calling folio_likely_mapped_shared() while the folio
      might have already been unmapped, because after we dropped the PTL, that
      can easily happen.  We want to stop touching mapcounts and friends from
      such context, and only call folio_likely_mapped_shared() while the folio
      is still mapped: mapcount information is pretty much stale and unreliable
      otherwise.
      
      So let's move checks into numamigrate_isolate_folio(), rename that
      function to migrate_misplaced_folio_prepare(), and call that function from
      callsites where we call migrate_misplaced_folio(), but still with the PTL
      held.
      
      We can now stop taking temporary folio references, and really only take a
      reference if folio isolation succeeded.  Doing the
      folio_likely_mapped_shared() + folio isolation under PT lock is now
      similar to how we handle MADV_PAGEOUT.
      
      While at it, combine the folio_is_file_lru() checks.
      
      [david@redhat.com: fix list_del() corruption]
        Link: https://lkml.kernel.org/r/8f85c31a-e603-4578-bf49-136dae0d4b69@redhat.com
        Link: https://lkml.kernel.org/r/20240626191129.658CFC32782@smtp.kernel.org
      Link: https://lkml.kernel.org/r/20240620212935.656243-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Tested-by: Donet Tom <donettom@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/migrate: make migrate_misplaced_folio() return 0 on success · 4b88c23a
      David Hildenbrand authored
      Patch series "mm/migrate: move NUMA hinting fault folio isolation + checks
      under PTL".
      
      
      Let's just return 0 on success, which is less confusing.
      
      ...  especially because we got it wrong in the migrate.h stub where we
      have "return -EAGAIN; /* can't migrate now */" instead of "return 0;". 
      Likely this wrong return value doesn't currently matter, but it certainly
      adds confusion.
      
      We'll add migrate_misplaced_folio_prepare() next, where we want to use the
      same "return 0 on success" approach, so let's just clean this up.
      
      Link: https://lkml.kernel.org/r/20240620212935.656243-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20240620212935.656243-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Donet Tom <donettom@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory: rename pages_per_huge_page to nr_pages · 2f9f0854
      Kefeng Wang authored
      Since the callers are converted to use nr_pages naming, use it inside too.
      
      Link: https://lkml.kernel.org/r/20240618091242.2140164-5-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory: improve copy_user_large_folio() · 530dd992
      Kefeng Wang authored
      Use nr_pages instead of pages_per_huge_page and move the address alignment
      from copy_user_large_folio() into the callers since it is only needed when
      we don't know which address will be accessed.
      
      Link: https://lkml.kernel.org/r/20240618091242.2140164-4-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory: use folio in struct copy_subpage_arg · 5132633e
      Kefeng Wang authored
      Directly use folio in struct copy_subpage_arg.
      
      Link: https://lkml.kernel.org/r/20240618091242.2140164-3-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory: convert clear_huge_page() to folio_zero_user() · 78fefd04
      Kefeng Wang authored
      Patch series "mm: improve clear and copy user folio", v2.
      
      Some folio conversions.  An improvement is to move address alignment into
      the caller as it is only needed if we don't know which address will be
      accessed when clearing/copying user folios.
      
      
      This patch (of 4):
      
      Replace clear_huge_page() with folio_zero_user(), which takes a folio
      instead of a page.  The number of pages is obtained directly via
      folio_nr_pages(), removing the pages_per_huge_page argument.  Furthermore,
      the address alignment is moved from folio_zero_user() to the callers,
      since the alignment is only needed when we don't know which address will
      be accessed.  (A caller sketch follows this entry.)
      
      Link: https://lkml.kernel.org/r/20240618091242.2140164-1-wangkefeng.wang@huawei.com
      Link: https://lkml.kernel.org/r/20240618091242.2140164-2-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
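
      A sketch of a caller after the conversion, assuming the signature
      void folio_zero_user(struct folio *folio, unsigned long addr_hint): the
      helper derives the page count from the folio itself, and the faulting
      address is passed as a hint so that page is cleared last and stays
      cache-hot; aligning that hint is now the caller's business.  The wrapper
      name is hypothetical.

       #include <linux/mm.h>

       static void zero_new_large_folio(struct folio *folio, struct vm_fault *vmf)
       {
               /* was: clear_huge_page(&folio->page, addr_hint, pages_per_huge_page) */
               folio_zero_user(folio, vmf->address);
       }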
    • mm: use folio_add_new_anon_rmap() if folio_test_anon(folio)==false · 9ae2feac
      Barry Song authored
      For the !folio_test_anon(folio) case, we can now invoke
      folio_add_new_anon_rmap() with the rmap flags set to either EXCLUSIVE or
      non-EXCLUSIVE.  This action will suppress the VM_WARN_ON_FOLIO check
      within __folio_add_anon_rmap() while initiating the process of bringing up
      mTHP swapin.
      
       static __always_inline void __folio_add_anon_rmap(struct folio *folio,
                       struct page *page, int nr_pages, struct vm_area_struct *vma,
                       unsigned long address, rmap_t flags, enum rmap_level level)
       {
               ...
               if (unlikely(!folio_test_anon(folio))) {
                       VM_WARN_ON_FOLIO(folio_test_large(folio) &&
                                        level != RMAP_LEVEL_PMD, folio);
               }
               ...
       }
      
      It also improves the code's readability.  Currently, all new anonymous
      folios calling folio_add_anon_rmap_ptes() are order-0.  This ensures that
      new folios cannot be partially exclusive; they are either entirely
      exclusive or entirely shared.
      
      A useful comment from Hugh's fix:
      
      : Commit "mm: use folio_add_new_anon_rmap() if folio_test_anon(folio)==
      : false" has extended folio_add_new_anon_rmap() to use on non-exclusive
      : folios, already visible to others in swap cache and on LRU.
      : 
      : That renders its non-atomic __folio_set_swapbacked() unsafe: it risks
      : overwriting concurrent atomic operations on folio->flags, losing bits
      : added or restoring bits cleared.  Since it's only used in this risky way
      : when folio_test_locked and !folio_test_anon, many such races are excluded;
      : but, for example, isolations by folio_test_clear_lru() are vulnerable, and
      : setting or clearing active.
      : 
      : It could just use the atomic folio_set_swapbacked(); but this function
      : does try to avoid atomics where it can, so use a branch instead: just
      : avoid setting swapbacked when it is already set, that is good enough. 
      : (Swapbacked is normally stable once set: lazyfree can undo it, but only
      : later, when found anon in a page table.)
      : 
      : This fixes a lot of instability under compaction and swapping loads:
      : assorted "Bad page"s, VM_BUG_ON_FOLIO()s, apparently even page double
      : frees - though I've not worked out what races could lead to the latter.
      
      [akpm@linux-foundation.org: comment fixes, per David and akpm]
      [v-songbaohua@oppo.com: lock the folio to avoid race]
        Link: https://lkml.kernel.org/r/20240622032002.53033-1-21cnbao@gmail.com
      [hughd@google.com: folio_add_new_anon_rmap() careful __folio_set_swapbacked()]
        Link: https://lkml.kernel.org/r/f3599b1d-8323-0dc5-e9e0-fdb3cfc3dd5a@google.com
      Link: https://lkml.kernel.org/r/20240617231137.80726-3-21cnbao@gmail.com
      Signed-off-by: Barry Song <v-songbaohua@oppo.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Tested-by: Shuai Yuan <yuanshuai@oppo.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: extend rmap flags arguments for folio_add_new_anon_rmap · 15bde4ab
      Barry Song authored
      Patch series "mm: clarify folio_add_new_anon_rmap() and
      __folio_add_anon_rmap()", v2.
      
      This patchset is preparatory work for mTHP swapin.
      
      folio_add_new_anon_rmap() assumes that new anon rmaps are always
      exclusive.  However, this assumption doesn’t hold true for cases like
      do_swap_page(), where a new anon might be added to the swapcache and is
      not necessarily exclusive.
      
      The patchset extends the rmap flags to allow folio_add_new_anon_rmap() to
      handle both exclusive and non-exclusive new anon folios.  The
      do_swap_page() function is updated to use this extended API with rmap
      flags.  Consequently, all new anon folios now consistently use
      folio_add_new_anon_rmap().  The special case for !folio_test_anon() in
      __folio_add_anon_rmap() can be safely removed.
      
      In conclusion, new anon folios always use folio_add_new_anon_rmap(),
      regardless of exclusivity.  Old anon folios continue to use
      __folio_add_anon_rmap() via folio_add_anon_rmap_pmd() and
      folio_add_anon_rmap_ptes().
      
      
      This patch (of 3):
      
      In the case of a swap-in, a new anonymous folio is not necessarily
      exclusive.  This patch updates the rmap flags to allow a new anonymous
      folio to be treated as either exclusive or non-exclusive.  To maintain the
      existing behavior, we always use EXCLUSIVE as the default setting.
      
      [akpm@linux-foundation.org: cleanup and constifications per David and akpm]
      [v-songbaohua@oppo.com: fix missing doc for flags of folio_add_new_anon_rmap()]
        Link: https://lkml.kernel.org/r/20240619210641.62542-1-21cnbao@gmail.com
      [v-songbaohua@oppo.com: enhance doc for extend rmap flags arguments for folio_add_new_anon_rmap]
        Link: https://lkml.kernel.org/r/20240622030256.43775-1-21cnbao@gmail.com
      Link: https://lkml.kernel.org/r/20240617231137.80726-1-21cnbao@gmail.com
      Link: https://lkml.kernel.org/r/20240617231137.80726-2-21cnbao@gmail.com
      Signed-off-by: Barry Song <v-songbaohua@oppo.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Tested-by: Shuai Yuan <yuanshuai@oppo.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
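
      A sketch of the extended call, assuming the new signature takes an rmap_t
      flags argument with RMAP_EXCLUSIVE / RMAP_NONE: existing callers keep the
      old behaviour by passing RMAP_EXCLUSIVE, while the do_swap_page()
      swapcache path can pass RMAP_NONE for a non-exclusive new anon folio.
      The wrapper name is hypothetical.

       #include <linux/rmap.h>

       static void map_new_anon_folio(struct folio *folio, struct vm_area_struct *vma,
                                      unsigned long address, bool exclusive)
       {
               folio_add_new_anon_rmap(folio, vma, address,
                                       exclusive ? RMAP_EXCLUSIVE : RMAP_NONE);
       }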
    • mm: set pte writable while pte_soft_dirty() is true in do_swap_page() · 20dfa5b7
      Barry Song authored
      This patch leverages the new pte_needs_soft_dirty_wp() helper to optimize
      a scenario where softdirty is enabled, but the softdirty flag has already
      been set in do_swap_page().  In this situation, we can use pte_mkwrite
      instead of applying write-protection since we don't depend on write
      faults.
      
      Link: https://lkml.kernel.org/r/20240607211358.4660-3-21cnbao@gmail.com
      Signed-off-by: Barry Song <v-songbaohua@oppo.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: Kairui Song <kasong@tencent.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: swap: remove 'synchronous' argument to swap_read_folio() · b2d1f38b
      Yosry Ahmed authored
      Commit [1] introduced IO polling support during swapin to reduce swap read
      latency for block devices that can be polled.  However later commit [2]
      removed polling support.  Commit [3] removed the remnants of polling
      support from read_swap_cache_async() and __read_swap_cache_async(). 
      However, it left behind some remnants in swap_read_folio(), the
      'synchronous' argument.
      
      swap_read_folio() reads the folio synchronously if synchronous=true or if
      SWP_SYNCHRONOUS_IO is set in swap_info_struct.  The only caller that
      passes synchronous=true is in do_swap_page() in the SWP_SYNCHRONOUS_IO
      case.
      
      Hence, the argument is redundant; it is only set to true when the swap
      read would have been synchronous anyway.  Remove it.
      
      [1] Commit 23955622 ("swap: add block io poll in swapin path")
      [2] Commit 9650b453 ("block: ignore RWF_HIPRI hint for sync dio")
      [3] Commit b243dcbf ("swap: remove remnants of polling from read_swap_cache_async")
      
      Link: https://lkml.kernel.org/r/20240607045515.1836558-1-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: swap: reuse exclusive folio directly instead of wp page faults · c18160db
      Barry Song authored
      After swapping out, we perform a swap-in operation.  If we first read and
      then write, we encounter a major fault in do_swap_page for reading, along
      with additional minor faults in do_wp_page for writing.  However, the
      latter appears to be unnecessary and inefficient.  Instead, we can
      directly reuse in do_swap_page and completely eliminate the need for
      do_wp_page.
      
      This patch achieves that optimization specifically for exclusive folios. 
      The following microbenchmark demonstrates the significant reduction in
      minor faults.
      
       #include <stdio.h>
       #include <string.h>
       #include <sys/mman.h>
       #include <sys/resource.h>

       #define DATA_SIZE (2UL * 1024 * 1024)
       #define PAGE_SIZE (4UL * 1024)

       /* Read then write back every page, triggering read and write faults. */
       static void read_write_data(char *addr)
       {
               char tmp;
      
               for (int i = 0; i < DATA_SIZE; i += PAGE_SIZE) {
                       tmp = *(volatile char *)(addr + i);
                       *(volatile char *)(addr + i) = tmp;
               }
       }
      
       int main(int argc, char **argv)
       {
               struct rusage ru;
      
               char *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
                               MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
               memset(addr, 0x11, DATA_SIZE);
      
               do {
                       long old_ru_minflt, old_ru_majflt;
                       long new_ru_minflt, new_ru_majflt;
      
                       madvise(addr, DATA_SIZE, MADV_PAGEOUT);
      
                       getrusage(RUSAGE_SELF, &ru);
                       old_ru_minflt = ru.ru_minflt;
                       old_ru_majflt = ru.ru_majflt;
      
                       read_write_data(addr);
                       getrusage(RUSAGE_SELF, &ru);
                       new_ru_minflt = ru.ru_minflt;
                       new_ru_majflt = ru.ru_majflt;
      
                       printf("minor faults:%ld major faults:%ld\n",
                               new_ru_minflt - old_ru_minflt,
                               new_ru_majflt - old_ru_majflt);
               } while(0);
      
               return 0;
       }
      
      w/o patch,
      / # ~/a.out
      minor faults:512 major faults:512
      
      w/ patch,
      / # ~/a.out
      minor faults:0 major faults:512
      
      Minor faults decrease to 0!
      
      Link: https://lkml.kernel.org/r/20240602004502.26895-1-21cnbao@gmail.com
      Signed-off-by: Barry Song <v-songbaohua@oppo.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: Kairui Song <kasong@tencent.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory: extend finish_fault() to support large folio · 43e027e4
      Baolin Wang authored
      Patch series "add mTHP support for anonymous shmem", v5.
      
      Anonymous pages have already been supported for multi-size (mTHP)
      allocation through commit 19eaf449, that can allow THP to be
      configured through the sysfs interface located at
      '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
      
      However, anonymous shmem ignores the anonymous mTHP rules configured
      through the sysfs interface and can only use PMD-mapped THP, which is not
      reasonable.  Many applications implement anonymous page sharing through
      mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
      so users expect a unified mTHP strategy to apply to anonymous pages,
      including anonymous shared pages, in order to enjoy the benefits of mTHP:
      lower latency than PMD-mapped THP, smaller memory bloat than PMD-mapped
      THP, contiguous PTEs on the ARM architecture to reduce TLB misses, etc.
      
      As discussed in the bi-weekly MM meeting[1], the mTHP controls should
      control all of shmem, not only anonymous shmem, but support will be added
      iteratively.  Therefore, this patch set starts with support for anonymous
      shmem.
      
      The primary strategy is similar to supporting anonymous mTHP.  Introduce a
      new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
      which can have almost the same values as the top-level
      '/sys/kernel/mm/transparent_hugepage/shmem_enabled', adding a new
      "inherit" option and dropping the testing options 'force' and 'deny'.  By
      default all sizes are set to "never" except the PMD size, which is set to
      "inherit".  This keeps backward compatibility with the top-level anonymous
      shmem setting while also allowing independent control of anonymous shmem
      for each mTHP size.
      
      Use the page fault latency tool to measure the performance of 1G anonymous shmem
      with 32 threads on my machine environment with: ARM64 Architecture, 32 cores,
      125G memory:
      base: mm-unstable
      user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
      0.04s        3.10s         83516.416                  2669684.890
      
      mm-unstable + patchset, anon shmem mTHP disabled
      user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
      0.02s        3.14s         82936.359                  2630746.027
      
      mm-unstable + patchset, anon shmem 64K mTHP enabled
      user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
      0.08s        0.31s         678630.231                 17082522.495
      
      From the data above, the patchset has minimal impact when mTHP is not
      enabled (some fluctuations were observed during testing).  When 64K mTHP
      is enabled, there is a significant improvement in page fault latency.
      
      [1] https://lore.kernel.org/all/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com/
      
      
      This patch (of 6):
      
      Add large folio mapping establishment support for finish_fault() as a
      preparation, to support multi-size THP allocation of anonymous shmem pages
      in the following patches.
      
      Keep the same behavior (per-page fault) for non-anon shmem to avoid
      inflating the RSS unintentionally, and we can discuss what size of mapping
      to build when extending mTHP to control non-anon shmem in the future.
      
      [baolin.wang@linux.alibaba.com: avoid going beyond the PMD pagetable size]
        Link: https://lkml.kernel.org/r/b0e6a8b1-a32c-459e-ae67-fde5d28773e6@linux.alibaba.com
      [baolin.wang@linux.alibaba.com: use 'PTRS_PER_PTE' instead of 'PTRS_PER_PTE - 1']
        Link: https://lkml.kernel.org/r/e1f5767a-2c9b-4e37-afe6-1de26fe54e41@linux.alibaba.com
      Link: https://lkml.kernel.org/r/cover.1718090413.git.baolin.wang@linux.alibaba.com
      Link: https://lkml.kernel.org/r/3a190892355989d42f59cf9f2f98b94694b0d24d.1718090413.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Daniel Gomez <da.gomez@samsung.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Barry Song <v-songbaohua@oppo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: swap: entirely map large folios found in swapcache · 50875896
      Chuanhua Han authored
      When a large folio is found in the swapcache, the current implementation
      requires calling do_swap_page() nr_pages times, resulting in nr_pages page
      faults.  This patch opts to map the entire large folio at once to minimize
      page faults.  Additionally, redundant checks and early exits for ARM64 MTE
      restoring are removed.
      
      Link: https://lkml.kernel.org/r/20240529082824.150954-7-21cnbao@gmail.com
      Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
      Co-developed-by: Barry Song <v-songbaohua@oppo.com>
      Signed-off-by: Barry Song <v-songbaohua@oppo.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kairui Song <kasong@tencent.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: swap: make should_try_to_free_swap() support large-folio · 4c3f9664
      Chuanhua Han authored
      The function should_try_to_free_swap() operates under the assumption that
      swap-in always occurs at the normal page granularity, i.e.,
      folio_nr_pages() = 1.  However, in reality, for large folios,
      add_to_swap_cache() will invoke folio_ref_add(folio, nr).  To accommodate
      large folio swap-in, this patch eliminates this assumption.
      
      Link: https://lkml.kernel.org/r/20240529082824.150954-6-21cnbao@gmail.com
      Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
      Co-developed-by: Barry Song <v-songbaohua@oppo.com>
      Signed-off-by: Barry Song <v-songbaohua@oppo.com>
      Acked-by: Chris Li <chrisl@kernel.org>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kairui Song <kasong@tencent.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
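
      A sketch of the adjusted check, assuming that an exclusively owned
      swapcache folio is expected to hold one reference for our lookup plus one
      per page from the swapcache (since add_to_swap_cache() does
      folio_ref_add(folio, nr)); the exact upstream expression may differ, and
      the helper name is hypothetical.

       static inline bool swapcache_folio_is_exclusive(struct folio *folio,
                                                       unsigned int fault_flags)
       {
               return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
                      folio_ref_count(folio) == 1 + folio_nr_pages(folio);
       }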
    • mm: introduce arch_do_swap_page_nr() which allows restore metadata for nr pages · 29f252cd
      Barry Song authored
      Should do_swap_page() have the capability to directly map a large folio,
      metadata restoration becomes necessary for a specified number of pages
      denoted as nr.  It's important to highlight that metadata restoration is
      solely required by the SPARC platform, which, however, does not enable
      THP_SWAP.  Consequently, in the present kernel configuration, there exists
      no practical scenario where users necessitate the restoration of nr
      metadata.  Platforms implementing THP_SWAP might invoke this function with
      nr values exceeding 1, subsequent to do_swap_page() successfully mapping
      an entire large folio.  Nonetheless, their arch_do_swap_page_nr()
      functions remain empty.
      
      Link: https://lkml.kernel.org/r/20240529082824.150954-5-21cnbao@gmail.com
      Signed-off-by: Barry Song <v-songbaohua@oppo.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Chuanhua Han <hanchuanhua@oppo.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kairui Song <kasong@tencent.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
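
      A sketch of the generic stub, assuming the nr-aware hook keeps the
      arch_do_swap_page() arguments and adds a page count; as described above,
      platforms without per-PTE swap metadata (everything except SPARC today)
      leave it a no-op.  Treat the exact parameter list as an assumption.

       static inline void arch_do_swap_page_nr(struct mm_struct *mm,
                                               struct vm_area_struct *vma,
                                               unsigned long addr,
                                               pte_t pte, pte_t oldpte, int nr)
       {
               /* nothing to restore on architectures without swap PTE metadata */
       }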
    • mm: batch unlink_file_vma calls in free_pgd_range · 3577dbb1
      Mateusz Guzik authored
      Execs of dynamically linked binaries at 20-ish cores are bottlenecked on
      the i_mmap_rwsem semaphore, while the biggest singular contributor is
      free_pgd_range inducing the lock acquire back-to-back for all consecutive
      mappings of a given file.
      
      Tracing the count of said acquires while building the kernel shows:
      [1, 2)     799579 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
      [2, 3)          0 |                                                    |
      [3, 4)       3009 |                                                    |
      [4, 5)       3009 |                                                    |
      [5, 6)     326442 |@@@@@@@@@@@@@@@@@@@@@                               |
      
      So in particular there were 326442 opportunities to coalesce 5 acquires
      into 1.
      
      Doing so increases execs per second by 4% (~50k to ~52k) when running
      the benchmark linked below.
      
      The lock remains the main bottleneck, I have not looked at other spots
      yet.
      
      Bench can be found here:
      http://apollo.backplane.com/DFlyMisc/doexec.c
      
      $ cc -O2 -o shared-doexec doexec.c
      $ ./shared-doexec $(nproc)
      
      Note this particular test makes sure binaries are separate, but the
      loader is shared.
      
      Stats collected on the patched kernel (+ "noinline") with:
      bpftrace -e 'kprobe:unlink_file_vma_batch_process
      { @ = lhist(((struct unlink_vma_file_batch *)arg0)->count, 0, 8, 1); }'
      
      Link: https://lkml.kernel.org/r/20240521234321.359501-1-mjguzik@gmail.com
      Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: use update_mmu_tlb_range() to simplify code · 6faa49d1
      Bang Li authored
      Let us simplify the code by using update_mmu_tlb_range() (a sketch follows
      this entry).
      
      Link: https://lkml.kernel.org/r/20240522061204.117421-4-libang.li@antgroup.com
      Signed-off-by: Bang Li <libang.li@antgroup.com>
      Reviewed-by: Lance Yang <ioworker0@gmail.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
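
      A sketch of the simplification, assuming update_mmu_tlb_range() takes the
      VMA, the start address, the first PTE pointer and a PTE count: a per-page
      loop of update_mmu_tlb() calls collapses into a single ranged call.  The
      wrapper name is hypothetical.

       #include <linux/pgtable.h>

       static void notify_new_ptes(struct vm_area_struct *vma, unsigned long addr,
                                   pte_t *ptep, unsigned int nr)
       {
               /*
                * was: for (i = 0; i < nr; i++)
                *              update_mmu_tlb(vma, addr + i * PAGE_SIZE, ptep + i);
                */
               update_mmu_tlb_range(vma, addr, ptep, nr);
       }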
    • mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed() · fce831c9
      David Hildenbrand authored
      For now we only get the (small) zeropage mapped to user space in four
      cases (excluding VM_PFNMAP mappings, such as /proc/vmstat):
      
      (1) Read page faults in anonymous VMAs (MAP_PRIVATE|MAP_ANON):
          do_anonymous_page() will not refcount it and map it pte_mkspecial()
      (2) UFFDIO_ZEROPAGE on anonymous VMA or COW mapping of shmem
          (MAP_PRIVATE). mfill_atomic_pte_zeropage() will not refcount it and
          map it pte_mkspecial().
      (3) KSM in mergeable VMA (anonymous VMA or COW mapping).
          cmp_and_merge_page() will not refcount it and map it
          pte_mkspecial().
      (4) FSDAX as an optimization for holes.
          vmf_insert_mixed()->__vm_insert_mixed() might end up calling
          insert_page() without CONFIG_ARCH_HAS_PTE_SPECIAL, refcounting the
          zeropage and not mapping it pte_mkspecial(). With
          CONFIG_ARCH_HAS_PTE_SPECIAL, we'll call insert_pfn() where we will
          not refcount it and map it pte_mkspecial().
      
      In case (4), we might not have VM_MIXEDMAP set: while fs/fuse/dax.c sets
      VM_MIXEDMAP, we removed it for ext4 fsdax in commit e1fb4a08 ("dax:
      remove VM_MIXEDMAP for fsdax and device dax") and for XFS in commit
      e1fb4a08 ("dax: remove VM_MIXEDMAP for fsdax and device dax").
      
      Without CONFIG_ARCH_HAS_PTE_SPECIAL and with VM_MIXEDMAP, vm_normal_page()
      would currently return the zeropage.  We'll refcount the zeropage when
      mapping and when unmapping.
      
      Without CONFIG_ARCH_HAS_PTE_SPECIAL and without VM_MIXEDMAP,
      vm_normal_page() would currently refuse to return the zeropage.  So we'd
      refcount it when mapping but not when unmapping it ...  do we have fsdax
      without CONFIG_ARCH_HAS_PTE_SPECIAL in practice?  Hard to tell.
      
      Independent of that, we should never refcount the zeropage when we might
      be holding that reference for a long time, because even without an
      accounting imbalance we might overflow the refcount.  As there is interest
      in using the zeropage also in other VM_MIXEDMAP mappings, let's add clean
      support for that in the cases where it makes sense:
      
      (A) Never refcount the zeropage when mapping it:
      
      In insert_page(), special-case the zeropage, do not refcount it, and use
      pte_mkspecial().  Don't involve insert_pfn(), adjusting insert_page()
      looks cleaner than branching off to insert_pfn().
      
      (B) Never refcount the zeropage when unmapping it:
      
      In vm_normal_page(), also don't return the zeropage in a VM_MIXEDMAP
      mapping without CONFIG_ARCH_HAS_PTE_SPECIAL.  Add a VM_WARN_ON_ONCE()
      sanity check if we'd ever return the zeropage, which could happen if
      someone forgets to set pte_mkspecial() when mapping the zeropage. 
      Document that.
      
      (C) Allow the zeropage only where reasonable
      
      s390x never wants the zeropage in some processes running legacy KVM guests
      that make use of storage keys.  So disallow that.
      
      Further, using the zeropage in COW mappings is unproblematic (just what we
      do for other COW mappings), because FAULT_FLAG_UNSHARE can just unshare it
      and GUP with FOLL_LONGTERM would work as expected.
      
      Similarly, mappings that can never have writable PTEs (implying no write
      faults) are also not problematic, because nothing could end up mapping the
      PTE writable by mistake later.  But in case we could have writable PTEs,
      we'll only allow the zeropage in FSDAX VMAs, that are incompatible with
      GUP and are blocked there completely.
      
      We'll always require the zeropage to be mapped with pte_special(). 
      GUP-fast will reject the zeropage that way, but GUP-slow will allow it. 
      (Note that GUP does not refcount the zeropage with FOLL_PIN, because there
      were issues with overflowing the refcount in the past).
      
      Add sanity checks to can_change_pte_writable() and wp_page_reuse(), to
      catch early during testing if we'd ever find a zeropage unexpectedly in
      code that wants to upgrade write permissions.
      
      Convert the BUG_ON in vm_mixed_ok() to an ordinary check and simply fail
      with VM_FAULT_SIGBUS, like we do for other sanity checks.  Drop the stale
      comment regarding reserved pages from insert_page().
      
      Note that:
      * we won't mess with VM_PFNMAP mappings for now. remap_pfn_range() and
        vmf_insert_pfn() would allow the zeropage in some cases and
        not refcount it.
      * vmf_insert_pfn*() will reject the zeropage in VM_MIXEDMAP
        mappings and we'll leave that alone for now. People can simply use
        one of the other interfaces.
      * we won't bother with the huge zeropage for now. It's never
        PTE-mapped and also GUP does not special-case it yet.
      
      Link: https://lkml.kernel.org/r/20240522125713.775114-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vincent Donnefort <vdonnefort@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory: move page_count() check into validate_page_before_insert() · 11b914ee
      David Hildenbrand authored
      Patch series "mm/memory: cleanly support zeropage in vm_insert_page*(),
      vm_map_pages*() and vmf_insert_mixed()", v2.
      
      There is interest in mapping zeropages via vm_insert_pages() [1] into
      MAP_SHARED mappings.
      
      For now, we only get zeropages in MAP_SHARED mappings via
      vmf_insert_mixed() from FSDAX code, and I think it's a bit shaky in some
      cases because we refcount the zeropage when mapping it but not necessarily
      always when unmapping it ...  and we should actually never refcount it.
      
      It's all a bit tricky, especially how zeropages in MAP_SHARED mappings
      interact with GUP (FOLL_LONGTERM), mprotect(), write-faults and s390x
      forbidding the shared zeropage (rewrite [2] is now upstream).
      
      This series tries to take the careful approach of only allowing the
      zeropage where it is likely safe to use (which should cover the existing
      FSDAX use case and [1]), preventing that it could accidentally get mapped
      writable during a write fault, mprotect() etc, and preventing issues with
      FOLL_LONGTERM in the future with other users.
      
      Tested with a patch from Vincent that uses the zeropage in context of
      [1].
      
      [1] https://lkml.kernel.org/r/20240430111354.637356-1-vdonnefort@google.com
      [2] https://lkml.kernel.org/r/20240411161441.910170-1-david@redhat.com
      
      
      This patch (of 3):
      
      We'll now also cover the case where insert_page() is called from
      __vm_insert_mixed(), which sounds like the right thing to do.
      
      Link: https://lkml.kernel.org/r/20240522125713.775114-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vincent Donnefort <vdonnefort@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/vmscan: update stale references to shrink_page_list · 0ba5e806
      Illia Ostapyshyn authored
      Commit 49fd9b6d ("mm/vmscan: fix a lot of comments") renamed
      shrink_page_list() to shrink_folio_list().  Fix up the remaining
      references to the old name in comments and documentation.
      
      Link: https://lkml.kernel.org/r/20240517091348.1185566-1-illia@yshyn.com
      Signed-off-by: Illia Ostapyshyn <illia@yshyn.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  11. 25 Jun, 2024 1 commit
  12. 15 Jun, 2024 1 commit
    • mm: fix possible OOB in numa_rebuild_large_mapping() · cfdd12b4
      Kefeng Wang authored
      During the page fault, a large folio is mapped at a virtual address
      aligned to the folio size (which is not greater than PMD_SIZE), i.e.,
      'addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE)' in
      do_anonymous_page().  But after mremap(), the virtual address only
      requires PAGE_SIZE alignment.  The PTEs are also moved to the new location
      in move_page_tables(), so traversing the new PTEs in
      numa_rebuild_large_mapping() can hit the following issue,
      
         Unable to handle kernel paging request at virtual address 00000a80c021a788
         Mem abort info:
           ESR = 0x0000000096000004
           EC = 0x25: DABT (current EL), IL = 32 bits
           SET = 0, FnV = 0
           EA = 0, S1PTW = 0
           FSC = 0x04: level 0 translation fault
         Data abort info:
           ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
           CM = 0, WnR = 0, TnD = 0, TagAccess = 0
           GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
         user pgtable: 4k pages, 48-bit VAs, pgdp=00002040341a6000
         [00000a80c021a788] pgd=0000000000000000, p4d=0000000000000000
         Internal error: Oops: 0000000096000004 [#1] SMP
         ...
         CPU: 76 PID: 15187 Comm: git Kdump: loaded Tainted: G        W          6.10.0-rc2+ #209
         Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 1.79 08/21/2021
         pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
         pc : numa_rebuild_large_mapping+0x338/0x638
         lr : numa_rebuild_large_mapping+0x320/0x638
         sp : ffff8000b41c3b00
         x29: ffff8000b41c3b30 x28: ffff8000812a0000 x27: 00000000000a8000
         x26: 00000000000000a8 x25: 0010000000000001 x24: ffff20401c7170f0
         x23: 0000ffff33a1e000 x22: 0000ffff33a76000 x21: ffff20400869eca0
         x20: 0000ffff33976000 x19: 00000000000000a8 x18: ffffffffffffffff
         x17: 0000000000000000 x16: 0000000000000020 x15: ffff8000b41c36a8
         x14: 0000000000000000 x13: 205d373831353154 x12: 5b5d333331363732
         x11: 000000000011ff78 x10: 000000000011ff10 x9 : ffff800080273f30
         x8 : 000000320400869e x7 : c0000000ffffd87f x6 : 00000000001e6ba8
         x5 : ffff206f3fb5af88 x4 : 0000000000000000 x3 : 0000000000000000
         x2 : 0000000000000000 x1 : fffffdffc0000000 x0 : 00000a80c021a780
         Call trace:
          numa_rebuild_large_mapping+0x338/0x638
          do_numa_page+0x3e4/0x4e0
          handle_pte_fault+0x1bc/0x238
          __handle_mm_fault+0x20c/0x400
          handle_mm_fault+0xa8/0x288
          do_page_fault+0x124/0x498
          do_translation_fault+0x54/0x80
          do_mem_abort+0x4c/0xa8
          el0_da+0x40/0x110
          el0t_64_sync_handler+0xe4/0x158
          el0t_64_sync+0x188/0x190
      
      Fix it by clamping the start and end not only to the VMA range but also to
      the range covered by the page table.  (A clamping sketch follows this
      entry.)
      
      Link: https://lkml.kernel.org/r/20240612122822.4033433-1-wangkefeng.wang@huawei.com
      Fixes: d2136d74 ("mm: support multi-size THP numa balancing")
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Liu Shixin <liushixin2@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
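
      An illustrative sketch of the clamping described above, assuming
      [start, end) is the folio-sized range numa_rebuild_large_mapping() wants
      to walk: after mremap() that range may straddle a PMD page-table boundary,
      so it has to be limited to the table that vmf->pte belongs to as well as
      to the VMA.  The helper and variable names are hypothetical.

       static void clamp_rebuild_range(struct vm_fault *vmf,
                                       struct vm_area_struct *vma,
                                       unsigned long *start, unsigned long *end)
       {
               unsigned long pt_start = ALIGN_DOWN(vmf->address, PMD_SIZE);
               unsigned long pt_end = pt_start + PMD_SIZE;

               /* stay inside both the VMA and the page table vmf->pte points into */
               *start = max3(*start, pt_start, vma->vm_start);
               *end = min3(*end, pt_end, vma->vm_end);
       }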