1. 04 Jul, 2024 40 commits
    • mm/mm_init: use node's number of cpus in deferred_page_init_max_threads · 188f87f2
      Eric Chanudet authored
      x86_64 already uses the node's number of cpus as the maximum number of
      threads.  Make that the default for all archs setting
      DEFERRED_STRUCT_PAGE_INIT.
      
      This returns to the behavior prior to making the function arch-specific
      with commit ecd09650 ("mm: make deferred init's max threads
      arch-specific").
      
      Setting DEFERRED_STRUCT_PAGE_INIT and testing on a few arm64 platforms
      shows faster deferred_init_memmap completions:
      
      |         | x13s        | SA8775p-ride | Ampere R137-P31 | Ampere HR330 |
      |         | Metal, 32GB | VM, 36GB     | VM, 58GB        | Metal, 128GB |
      |         | 8cpus       | 8cpus        | 8cpus           | 32cpus       |
      |---------|-------------|--------------|-----------------|--------------|
      | threads |  ms     (%) | ms       (%) |  ms         (%) |  ms      (%) |
      |---------|-------------|--------------|-----------------|--------------|
      | 1       | 108    (0%) | 72      (0%) | 224        (0%) | 324     (0%) |
      | cpus    |  24  (-77%) | 36    (-50%) |  40      (-82%) |  56   (-82%) |
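
As a rough userspace sketch of the rule described above (not the kernel code itself): the thread limit is the number of CPUs in the node, with a floor of one. The `node_cpumask_t` type and the function names here are hypothetical stand-ins for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a node's cpumask: one bit per CPU, up to 64 CPUs. */
typedef uint64_t node_cpumask_t;

/* Count the set bits in the mask, i.e. the CPUs that belong to the node. */
static int node_cpu_count(node_cpumask_t mask)
{
    int count = 0;

    while (mask) {
        mask &= mask - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* One thread per CPU in the node, but never fewer than one thread. */
static int deferred_init_max_threads_sketch(node_cpumask_t mask)
{
    int cpus = node_cpu_count(mask);

    assert(cpus <= 64);     /* sanity check for the 64-bit stand-in mask */
    return cpus > 0 ? cpus : 1;
}
```

With a mask of eight CPUs this yields eight threads; a memory-only node with an empty mask still gets one thread.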
      
      Michael Ellerman reported:
      
      : On a machine here (1TB, 40 cores, 4KB pages) the existing code gives:
      : 
      :   [    0.500124] node 2 deferred pages initialised in 210ms
      :   [    0.515790] node 3 deferred pages initialised in 230ms
      :   [    0.516061] node 0 deferred pages initialised in 230ms
      :   [    0.516522] node 7 deferred pages initialised in 230ms
      :   [    0.516672] node 4 deferred pages initialised in 230ms
      :   [    0.516798] node 6 deferred pages initialised in 230ms
      :   [    0.517051] node 5 deferred pages initialised in 230ms
      :   [    0.523887] node 1 deferred pages initialised in 240ms
      : 
      : vs with the patch:
      : 
      :   [    0.379613] node 0 deferred pages initialised in 90ms
      :   [    0.380388] node 1 deferred pages initialised in 90ms
      :   [    0.380540] node 4 deferred pages initialised in 100ms
      :   [    0.390239] node 6 deferred pages initialised in 100ms
      :   [    0.390249] node 2 deferred pages initialised in 100ms
      :   [    0.390786] node 3 deferred pages initialised in 110ms
      :   [    0.396721] node 5 deferred pages initialised in 110ms
      :   [    0.397095] node 7 deferred pages initialised in 110ms
      : 
      : Which is a nice speedup.
      
      [echanude@redhat.com: v3]
        Link: https://lkml.kernel.org/r/20240528185455.643227-4-echanude@redhat.com
      Link: https://lkml.kernel.org/r/20240522203758.626932-4-echanude@redhat.com
      Signed-off-by: Eric Chanudet <echanude@redhat.com>
      Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
      Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: batch unlink_file_vma calls in free_pgd_range · 3577dbb1
      Mateusz Guzik authored
      Execs of dynamically linked binaries at 20-ish cores are bottlenecked on
      the i_mmap_rwsem semaphore, and the single biggest contributor is
      free_pgd_range, which acquires the lock back-to-back for all consecutive
      mappings of a given file.
      
      Tracing the count of said acquires while building the kernel shows:
      [1, 2)     799579 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
      [2, 3)          0 |                                                    |
      [3, 4)       3009 |                                                    |
      [4, 5)       3009 |                                                    |
      [5, 6)     326442 |@@@@@@@@@@@@@@@@@@@@@                               |
      
      So in particular there were 326442 opportunities to coalesce 5 acquires
      into 1.
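
The coalescing idea can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: the `struct batch` layout, the capacity of 8, and all function names are hypothetical, and the lock is simulated by a counter rather than a real rwsem.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_MAX 8   /* assumed batch capacity for this sketch */

struct batch {
    const void *file;   /* file shared by all queued mappings, NULL if empty */
    int count;          /* number of queued mappings */
    int lock_acquires;  /* how many times the simulated lock was taken */
};

static void batch_init(struct batch *b)
{
    b->file = NULL;
    b->count = 0;
    b->lock_acquires = 0;
}

/* Process everything queued under a single simulated lock acquisition. */
static void batch_flush(struct batch *b)
{
    if (b->count == 0)
        return;
    b->lock_acquires++;   /* lock; unlink all queued mappings; unlock */
    b->count = 0;
    b->file = NULL;
}

/* Queue one mapping; flush first if the file differs or the batch is full. */
static void batch_add(struct batch *b, const void *file)
{
    assert(file != NULL);
    if (b->count > 0 && (b->file != file || b->count == BATCH_MAX))
        batch_flush(b);
    b->file = file;
    b->count++;
}

/* Number of lock acquisitions needed for a sequence of file-backed mappings. */
static int count_lock_acquires(const void *const *files, size_t n)
{
    struct batch b;
    size_t i;

    batch_init(&b);
    for (i = 0; i < n; i++)
        batch_add(&b, files[i]);
    batch_flush(&b);
    return b.lock_acquires;
}
```

Five consecutive mappings of the same file followed by one mapping of another file need two acquisitions in this model instead of six.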
      
      Doing so increases execs per second by 4% (~50k to ~52k) when running
      the benchmark linked below.
      
      The lock remains the main bottleneck; I have not looked at other spots
      yet.
      
      Bench can be found here:
      http://apollo.backplane.com/DFlyMisc/doexec.c
      
      $ cc -O2 -o shared-doexec doexec.c
      $ ./shared-doexec $(nproc)
      
      Note this particular test makes sure binaries are separate, but the
      loader is shared.
      
      Stats collected on the patched kernel (+ "noinline") with:
      bpftrace -e 'kprobe:unlink_file_vma_batch_process
      { @ = lhist(((struct unlink_vma_file_batch *)arg0)->count, 0, 8, 1); }'
      
      Link: https://lkml.kernel.org/r/20240521234321.359501-1-mjguzik@gmail.com
      Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory-failure: send SIGBUS in the event of thp split fail · 1a3798de
      Jane Chu authored
      While handling hwpoison in a THP page, it is possible that
      try_to_split_thp_page() fails, for example when the THP page has been
      RDMA pinned.  At this point the kernel cannot isolate the poisoned THP
      page; all it can do is send a SIGBUS to the user process with a
      meaningful payload to give user-level recovery a chance.
      
      Link: https://lkml.kernel.org/r/20240524215306.2705454-6-jane.chu@oracle.com
      Signed-off-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <oalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory-failure: move hwpoison_filter() higher up · 9b0ab153
      Jane Chu authored
      Move hwpoison_filter() higher up, as there is no need to spend a lot of
      cycles only to find out later that the page is supposed to be skipped
      from hwpoison handling.
      
      Link: https://lkml.kernel.org/r/20240524215306.2705454-5-jane.chu@oracle.com
      Signed-off-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <oalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory-failure: improve memory failure action_result messages · b8b9488d
      Jane Chu authored
      Added two explicit MF_MSG messages describing failure in
      get_hwpoison_page.  Attempted to document the definition of various
      action names, and made a few adjustments to the action_result() calls.
      
      Link: https://lkml.kernel.org/r/20240524215306.2705454-4-jane.chu@oracle.com
      Signed-off-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <oalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/madvise: add MF_ACTION_REQUIRED to madvise(MADV_HWPOISON) · 66802526
      Jane Chu authored
      The soft hwpoison injector via madvise(MADV_HWPOISON) operates in a
      synchronous way: the injector is also a process under test, and should
      it have the poisoned page mapped in its address space, it should get
      killed just as in a real UE situation.  Doing so aligns with what the
      madvise(2) man page says: "This operation may result in the calling
      process receiving a SIGBUS and the page being unmapped."
      
      Link: https://lkml.kernel.org/r/20240524215306.2705454-3-jane.chu@oracle.com
      Signed-off-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: Oscar Salvador <oalvador@suse.de>
      Acked-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory-failure: try to send SIGBUS even if unmap failed · aa298fdf
      Jane Chu authored
      Patch series "Enhance soft hwpoison handling and injection", v4.
      
      This series is aimed at the following enhancements:
      
      - Let one hwpoison injector, madvise(MADV_HWPOISON), behave more as if
        a real UE had occurred.  The other two injectors, hwpoison-inject and
        'einj' on x86, can't, and it seems to me we need a better simulation
        of the real UE scenario.
      - For years, if the kernel is unable to unmap a hwpoisoned page, it
        sends a SIGKILL instead of a SIGBUS to prevent the user process from
        potentially accessing the page again.  But in doing so, the user
        process also loses important information for recovery: the vaddr.
        Fortunately, the kernel already has code to kill a process
        re-accessing a hwpoisoned page, so remove the '!unmap_success' check.
      - Right now, if a THP page under a GUP longterm pin is hwpoisoned and
        the kernel cannot split the THP page, memory-failure simply ignores
        the UE and returns.  That's not ideal; it could deliver a SIGBUS with
        useful information for userspace recovery.
      
      
      This patch (of 5):
      
      For years, when it comes to killing a process due to hwpoison, a SIGBUS
      has been delivered only if unmap was successful; otherwise, a SIGKILL
      is delivered.  The reason is to prevent the involved process from
      accessing the hwpoisoned page again.
      
      A lot has changed since then: a hwpoisoned page is marked, and upon
      being re-accessed, the memory-failure handler invokes
      kill_accessing_process() to kill the process immediately.  So let's
      take out the '!unmap_success' factor and try to deliver SIGBUS if
      possible.
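
To illustrate why SIGBUS is more useful to userspace than SIGKILL, here is a minimal sketch of a handler that records the payload. It assumes the Linux `si_code` values `BUS_MCEERR_AR` and `BUS_MCEERR_AO` and the faulting address in `si_addr`; the enum, variable, and function names are hypothetical, and a real recovery path would need considerably more care.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Linux si_code values for memory errors; fallbacks if the libc lacks them. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

enum mce_kind { MCE_NONE, MCE_ACTION_REQUIRED, MCE_ACTION_OPTIONAL };

/* Map a SIGBUS si_code to the kind of memory error it reports, if any. */
enum mce_kind classify_sigbus(int si_code)
{
    if (si_code == BUS_MCEERR_AR)
        return MCE_ACTION_REQUIRED;
    if (si_code == BUS_MCEERR_AO)
        return MCE_ACTION_OPTIONAL;
    return MCE_NONE;
}

/* State recorded by the handler for the main program to inspect later. */
static volatile sig_atomic_t last_kind = MCE_NONE;
static void *volatile last_fault_addr;

/* Record what kind of error arrived and which virtual address was hit. */
static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    last_kind = classify_sigbus(info->si_code);
    last_fault_addr = info->si_addr;
}

/* Install the handler with SA_SIGINFO so the siginfo payload is delivered. */
int install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}

/* Usage: install once at startup; nothing is recorded until a SIGBUS arrives. */
void example_usage(void)
{
    int rc = install_sigbus_handler();

    assert(rc == 0);
    (void)rc;
    assert(last_kind == MCE_NONE && last_fault_addr == NULL);
}
```

A SIGKILL cannot be caught at all, so none of this information would reach the process.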
      
      Link: https://lkml.kernel.org/r/20240524215306.2705454-1-jane.chu@oracle.com
      Link: https://lkml.kernel.org/r/20240524215306.2705454-2-jane.chu@oracle.com
      Signed-off-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <oalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: use update_mmu_tlb_range() to simplify code · 6faa49d1
      Bang Li authored
      Simplify the code by using update_mmu_tlb_range().
      
      Link: https://lkml.kernel.org/r/20240522061204.117421-4-libang.li@antgroup.com
      Signed-off-by: Bang Li <libang.li@antgroup.com>
      Reviewed-by: Lance Yang <ioworker0@gmail.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: implement update_mmu_tlb() using update_mmu_tlb_range() · 8f65aa32
      Bang Li authored
      Let's make update_mmu_tlb() simply a generic wrapper around
      update_mmu_tlb_range().  Only the latter can now be overridden by the
      architecture.  We can now remove __HAVE_ARCH_UPDATE_MMU_TLB as well.
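
The wrapper relationship can be sketched in userspace as below. The `vma_sketch` and `pte_sketch_t` types, the 4096-byte page size, and the `_sketch` function names are stand-ins invented for illustration; the sketch only counts the entries touched and does not model any real TLB.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL        /* assumed base page size */

/* Hypothetical stand-ins for kernel types, enough to compile in userspace. */
struct vma_sketch {
    unsigned long tlb_updates;         /* entries refreshed so far */
    unsigned long last_address;        /* last virtual address refreshed */
};
typedef unsigned long pte_sketch_t;

/* Range variant: the only hook an architecture would override in this model. */
static void update_mmu_tlb_range_sketch(struct vma_sketch *vma,
                                        unsigned long address,
                                        pte_sketch_t *ptep, unsigned int nr)
{
    unsigned int i;

    assert(vma != NULL && ptep != NULL);
    for (i = 0; i < nr; i++) {
        vma->last_address = address + i * SKETCH_PAGE_SIZE;
        vma->tlb_updates++;
    }
}

/* Single-entry variant: a generic wrapper that passes nr = 1. */
static void update_mmu_tlb_sketch(struct vma_sketch *vma,
                                  unsigned long address, pte_sketch_t *ptep)
{
    update_mmu_tlb_range_sketch(vma, address, ptep, 1);
}
```

With this shape, a caller that previously looped over the single-entry helper can make one range call covering all entries.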
      
      Link: https://lkml.kernel.org/r/20240522061204.117421-3-libang.li@antgroup.com
      Signed-off-by: Bang Li <libang.li@antgroup.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add update_mmu_tlb_range() · 23b1b44e
      Bang Li authored
      Patch series "Add update_mmu_tlb_range() to simplify code", v4.
      
      This series of commits mainly adds update_mmu_tlb_range() to batch
      update the TLB in an address range and implements update_mmu_tlb() using
      update_mmu_tlb_range().
      
      After commit 19eaf449 ("mm: thp: support allocation of anonymous
      multi-size THP"), we may need to batch update the TLB of a certain
      address range by calling update_mmu_tlb() in a loop.  Using
      update_mmu_tlb_range(), we can simplify the code and possibly avoid
      executing some unnecessary code on some architectures.
      
      
      This patch (of 3):
      
      Add update_mmu_tlb_range() so that we can batch update the TLB of an
      address range.
      
      Link: https://lkml.kernel.org/r/20240522061204.117421-1-libang.li@antgroup.com
      Link: https://lkml.kernel.org/r/20240522061204.117421-2-libang.li@antgroup.com
      Signed-off-by: Bang Li <libang.li@antgroup.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: va_high_addr_switch: dynamically initialize testcases to enable LPA2 testing · e4a4ba41
      Dev Jain authored
      Post FEAT_LPA2, the Aarch64 Linux kernel extends higher address support to
      4K and 16K translation granules.  To support testing this out, we need to
      do away with static initialization of page size, while still maintaining
      the nice array of testcases; this can be achieved by initializing and
      populating the array as a stack variable, and filling in the page size and
      hugepage size at runtime.
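
The technique can be sketched as follows: an automatic (stack) array may use runtime values in its initializers, which a static array cannot. The struct layout, the three entries, and the function names are hypothetical and much smaller than the real selftest; the hugepage size is passed in as a parameter rather than discovered.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical, trimmed-down testcase descriptor. */
struct testcase_sketch {
    const char *msg;       /* human-readable description */
    unsigned long size;    /* mapping size in bytes */
};

/* Query the base page size at runtime instead of hard-coding it. */
static unsigned long runtime_page_size(void)
{
    long v = sysconf(_SC_PAGESIZE);

    assert(v > 0);
    return (unsigned long)v;
}

/* Build the table on the stack, then copy it to the caller's array. */
static size_t init_testcases(struct testcase_sketch *out, size_t cap,
                             unsigned long page_size,
                             unsigned long hugepage_size)
{
    /* Automatic array: its initializers may depend on runtime values. */
    const struct testcase_sketch t[] = {
        { .msg = "one base page",  .size = page_size },
        { .msg = "two base pages", .size = 2 * page_size },
        { .msg = "one huge page",  .size = hugepage_size },
    };
    size_t n = sizeof(t) / sizeof(t[0]);
    size_t i;

    assert(out != NULL && cap >= n);
    for (i = 0; i < n; i++)
        out[i] = t[i];
    return n;
}
```

The table keeps its readable array-literal form while the sizes follow whatever granule the running kernel uses.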
      
      Link: https://lkml.kernel.org/r/20240522070435.773918-3-dev.jain@arm.com
      Signed-off-by: Dev Jain <dev.jain@arm.com>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: va_high_addr_switch: reduce test noise · 85e8bcb4
      Dev Jain authored
      Patch series "Restructure va_high_addr_switch".
      
      The va_high_addr_switch memory selftest tests out some corner cases
      related to allocation and page/hugepage faulting around the switch
      boundary.  Currently, the page size and hugepage size have been statically
      defined.  Post FEAT_LPA2, the Aarch64 Linux kernel adds support for 4k and
      16k translation granules on higher addresses; we restructure the test to
      support the same.  In addition, we avoid invocation of the binary twice,
      in the shell script, to reduce test noise.
      
      
      This patch (of 2):
      
      When invoking the binary with the "--run-hugetlb" flag, the testcases
      involving the base page are going to be run anyway.  Therefore, remove
      the duplication by invoking the binary only once.
      
      Link: https://lkml.kernel.org/r/20240522070435.773918-1-dev.jain@arm.com
      Link: https://lkml.kernel.org/r/20240522070435.773918-2-dev.jain@arm.com
      Signed-off-by: Dev Jain <dev.jain@arm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/rmap: sanity check that zeropages are not passed to RMAP · 6ad28e7e
      David Hildenbrand authored
      Using insert_page() we might have previously ended up passing the zeropage
      into rmap code.  Make sure that won't happen again.
      
      Note that we won't check the huge zeropage for now, which might still end
      up in RMAP code.
      
      Link: https://lkml.kernel.org/r/20240522125713.775114-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vincent Donnefort <vdonnefort@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed() · fce831c9
      David Hildenbrand authored
      For now we only get the (small) zeropage mapped to user space in four
      cases (excluding VM_PFNMAP mappings, such as /dev/mem):
      
      (1) Read page faults in anonymous VMAs (MAP_PRIVATE|MAP_ANON):
          do_anonymous_page() will not refcount it and map it pte_mkspecial()
      (2) UFFDIO_ZEROPAGE on anonymous VMA or COW mapping of shmem
          (MAP_PRIVATE). mfill_atomic_pte_zeropage() will not refcount it and
          map it pte_mkspecial().
      (3) KSM in mergeable VMA (anonymous VMA or COW mapping).
          cmp_and_merge_page() will not refcount it and map it
          pte_mkspecial().
      (4) FSDAX as an optimization for holes.
          vmf_insert_mixed()->__vm_insert_mixed() might end up calling
          insert_page() without CONFIG_ARCH_HAS_PTE_SPECIAL, refcounting the
          zeropage and not mapping it pte_mkspecial(). With
          CONFIG_ARCH_HAS_PTE_SPECIAL, we'll call insert_pfn() where we will
          not refcount it and map it pte_mkspecial().
      
      In case (4), we might not have VM_MIXEDMAP set: while fs/fuse/dax.c sets
      VM_MIXEDMAP, we removed it for ext4 fsdax and for XFS in commit
      e1fb4a08 ("dax: remove VM_MIXEDMAP for fsdax and device dax").
      
      Without CONFIG_ARCH_HAS_PTE_SPECIAL and with VM_MIXEDMAP, vm_normal_page()
      would currently return the zeropage.  We'll refcount the zeropage when
      mapping and when unmapping.
      
      Without CONFIG_ARCH_HAS_PTE_SPECIAL and without VM_MIXEDMAP,
      vm_normal_page() would currently refuse to return the zeropage.  So we'd
      refcount it when mapping but not when unmapping it ...  do we have fsdax
      without CONFIG_ARCH_HAS_PTE_SPECIAL in practice?  Hard to tell.
      
      Independent of that, we should never refcount the zeropage when we might
      be holding that reference for a long time, because even without an
      accounting imbalance we might overflow the refcount.  As there is interest
      in using the zeropage also in other VM_MIXEDMAP mappings, let's add clean
      support for that in the cases where it makes sense:
      
      (A) Never refcount the zeropage when mapping it:
      
      In insert_page(), special-case the zeropage, do not refcount it, and use
      pte_mkspecial().  Don't involve insert_pfn(), adjusting insert_page()
      looks cleaner than branching off to insert_pfn().
      
      (B) Never refcount the zeropage when unmapping it:
      
      In vm_normal_page(), also don't return the zeropage in a VM_MIXEDMAP
      mapping without CONFIG_ARCH_HAS_PTE_SPECIAL.  Add a VM_WARN_ON_ONCE()
      sanity check if we'd ever return the zeropage, which could happen if
      someone forgets to set pte_mkspecial() when mapping the zeropage. 
      Document that.
      
      (C) Allow the zeropage only where reasonable
      
      s390x never wants the zeropage in some processes running legacy KVM guests
      that make use of storage keys.  So disallow that.
      
      Further, using the zeropage in COW mappings is unproblematic (just what we
      do for other COW mappings), because FAULT_FLAG_UNSHARE can just unshare it
      and GUP with FOLL_LONGTERM would work as expected.
      
      Similarly, mappings that can never have writable PTEs (implying no write
      faults) are also not problematic, because nothing could end up mapping the
      PTE writable by mistake later.  But in case we could have writable PTEs,
      we'll only allow the zeropage in FSDAX VMAs, that are incompatible with
      GUP and are blocked there completely.
      
      We'll always require the zeropage to be mapped with pte_special(). 
      GUP-fast will reject the zeropage that way, but GUP-slow will allow it. 
      (Note that GUP does not refcount the zeropage with FOLL_PIN, because there
      were issues with overflowing the refcount in the past).
      
      Add sanity checks to can_change_pte_writable() and wp_page_reuse(), to
      catch early during testing if we'd ever find a zeropage unexpectedly in
      code that wants to upgrade write permissions.
      
      Convert the BUG_ON in vm_mixed_ok() to an ordinary check and simply fail
      with VM_FAULT_SIGBUS, like we do for other sanity checks.  Drop the stale
      comment regarding reserved pages from insert_page().
      
      Note that:
      * we won't mess with VM_PFNMAP mappings for now. remap_pfn_range() and
        vmf_insert_pfn() would allow the zeropage in some cases and
        not refcount it.
      * vmf_insert_pfn*() will reject the zeropage in VM_MIXEDMAP
        mappings and we'll leave that alone for now. People can simply use
        one of the other interfaces.
      * we won't bother with the huge zeropage for now. It's never
        PTE-mapped and also GUP does not special-case it yet.
      
      Link: https://lkml.kernel.org/r/20240522125713.775114-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vincent Donnefort <vdonnefort@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory: move page_count() check into validate_page_before_insert() · 11b914ee
      David Hildenbrand authored
      Patch series "mm/memory: cleanly support zeropage in vm_insert_page*(),
      vm_map_pages*() and vmf_insert_mixed()", v2.
      
      There is interest in mapping zeropages via vm_insert_pages() [1] into
      MAP_SHARED mappings.
      
      For now, we only get zeropages in MAP_SHARED mappings via
      vmf_insert_mixed() from FSDAX code, and I think it's a bit shaky in some
      cases because we refcount the zeropage when mapping it but not necessarily
      always when unmapping it ...  and we should actually never refcount it.
      
      It's all a bit tricky, especially how zeropages in MAP_SHARED mappings
      interact with GUP (FOLL_LONGTERM), mprotect(), write-faults and s390x
      forbidding the shared zeropage (rewrite [2] is now upstream).
      
      This series tries to take the careful approach of only allowing the
      zeropage where it is likely safe to use (which should cover the existing
      FSDAX use case and [1]), preventing that it could accidentally get mapped
      writable during a write fault, mprotect() etc, and preventing issues with
      FOLL_LONGTERM in the future with other users.
      
      Tested with a patch from Vincent that uses the zeropage in context of
      [1].
      
      [1] https://lkml.kernel.org/r/20240430111354.637356-1-vdonnefort@google.com
      [2] https://lkml.kernel.org/r/20240411161441.910170-1-david@redhat.com
      
      
      This patch (of 3):
      
      We'll now also cover the case where insert_page() is called from
      __vm_insert_mixed(), which sounds like the right thing to do.
      
      Link: https://lkml.kernel.org/r/20240522125713.775114-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vincent Donnefort <vdonnefort@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests: mm: check return values · c66b0a05
      Muhammad Usama Anjum authored
      Check the return values and return an error or skip the tests
      accordingly.
      
      Link: https://lkml.kernel.org/r/20240520185248.1801945-1-usama.anjum@collabora.com
      Fixes: 46fd75d4 ("selftests: mm: add pagemap ioctl tests")
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hugetlb: remove {Set,Clear}Hpage macros · 63818aaf
      Sidhartha Kumar authored
      All users have been converted to use the folio version of these macros,
      so we can safely remove the page-based interface.
      
      Link: https://lkml.kernel.org/r/20240520224407.110062-1-sidhartha.kumar@oracle.com
      Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/swap: reduce swap cache search space · 7aad25b4
      Kairui Song authored
      Currently we use one swap_address_space for every 64M chunk to reduce
      lock contention; this is like having a set of smaller swap files inside
      one swap device.  But when doing a swap cache lookup or insert, we still
      use the offset within the whole large swap device.  This is OK for
      correctness, as the offset (key) is unique.
      
      But Xarray is specially optimized for small indexes: it creates the
      radix tree levels lazily, just enough to fit the largest key stored in
      one Xarray.  So we are wasting tree nodes unnecessarily.
      
      For a 64M chunk it should take at most 3 levels to contain everything.
      But if we are using the offset from the whole swap device, the offset
      (key) value will be way beyond 64M, and so will the tree level.
      
      Optimize this by using a new helper swap_cache_index to get a swap entry's
      unique offset in its own 64M swap_address_space.
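
A small sketch of the arithmetic behind this, under stated assumptions: 4 KiB pages (so a 64M chunk holds 2^14 entries) and 64 slots per tree node (6 index bits per level). The `_sketch` function names and `TREE_CHUNK_SHIFT` are hypothetical names for illustration.

```c
#include <assert.h>

#define SWAP_ADDRESS_SPACE_SHIFT 14   /* 2^14 pages of 4 KiB = 64 MiB */
#define SWAP_ADDRESS_SPACE_PAGES (1UL << SWAP_ADDRESS_SPACE_SHIFT)
#define SWAP_ADDRESS_SPACE_MASK  (SWAP_ADDRESS_SPACE_PAGES - 1)
#define TREE_CHUNK_SHIFT 6            /* assumed: 64 slots per tree node */

/* Compile-time check that the chunk size matches the 64M in the text. */
static_assert(SWAP_ADDRESS_SPACE_PAGES * 4096UL == 64UL << 20,
              "one address space should cover 64 MiB of 4 KiB pages");

/* Index of an entry inside its own 64M address space. */
static unsigned long swap_cache_index_sketch(unsigned long offset)
{
    return offset & SWAP_ADDRESS_SPACE_MASK;
}

/* Which 64M address space a device-wide offset falls into. */
static unsigned long swap_address_space_nr_sketch(unsigned long offset)
{
    return offset >> SWAP_ADDRESS_SPACE_SHIFT;
}

/* Tree levels needed so that the largest stored index fits. */
static unsigned int tree_levels_sketch(unsigned long max_index)
{
    unsigned int levels = 1;

    max_index >>= TREE_CHUNK_SHIFT;
    while (max_index) {
        levels++;
        max_index >>= TREE_CHUNK_SHIFT;
    }
    return levels;
}
```

Under these assumptions an index confined to one chunk needs 3 levels, while a device-wide offset on a 128G swap device (2^25 pages) would need 5.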
      
      I see a ~1% performance gain in benchmark and actual workload with high
      memory pressure.
      
      Test with `time memhog 128G` inside a 8G memcg using 128G swap (ramdisk
      with SWP_SYNCHRONOUS_IO dropped, tested 3 times, results are stable.  The
      test result is similar but the improvement is smaller if
      SWP_SYNCHRONOUS_IO is enabled, as swap out path can never skip swap
      cache):
      
      Before:
      6.07user 250.74system 4:17.26elapsed 99%CPU (0avgtext+0avgdata 8373376maxresident)k
      0inputs+0outputs (55major+33555018minor)pagefaults 0swaps
      
      After (1.8% faster):
      6.08user 246.09system 4:12.58elapsed 99%CPU (0avgtext+0avgdata 8373248maxresident)k
      0inputs+0outputs (54major+33555027minor)pagefaults 0swaps
      
      Similar result with MySQL and sysbench using swap:
      Before:
      94055.61 qps
      
      After (0.8% faster):
      94834.91 qps
      
      Radix tree slab usage is also very slightly lower.
      
      Link: https://lkml.kernel.org/r/20240521175854.96038-12-ryncsn@gmail.com
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Barry Song <v-songbaohua@oppo.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Marc Dionne <marc.dionne@auristor.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Xiubo Li <xiubli@redhat.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: drop page_index and simplify folio_index · 05b0c7ed
      Kairui Song authored
      There are two helpers for retrieving the index within address space for
      mixed usage of swap cache and page cache:
      
      - page_index
      - folio_index
      
      This commit drops page_index, as we have eliminated all users, and
      converts folio_index's helper __page_file_index to use folio to avoid the
      page conversion.
      
      Link: https://lkml.kernel.org/r/20240521175854.96038-11-ryncsn@gmail.com
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Barry Song <v-songbaohua@oppo.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Marc Dionne <marc.dionne@auristor.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Xiubo Li <xiubli@redhat.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      05b0c7ed
    • Kairui Song's avatar
      mm: remove page_file_offset and folio_file_pos · 564a2ee9
      Kairui Song authored
      These two helpers were useful for mixed usage of swap cache and page
      cache, where they helped retrieve the corresponding file or swap device
      offset of a page or folio.
      
      They were introduced in commit f981c595 ("mm: methods for teaching
      filesystems about PG_swapcache pages") and used in commit d56b4ddf
      ("nfs: teach the NFS client how to treat PG_swapcache pages"), and were
      supposed to be used with direct_IO for swap over fs.
      
      But after commit e1209d3a ("mm: introduce ->swap_rw and use it for
      reads from SWP_FS_OPS swap-space"), swap with direct_IO is no more, and
      swap cache mapping is never exposed to fs.
      
      Now we have dropped all users of page_file_offset and folio_file_pos, so
      they can be deleted.
      
      Link: https://lkml.kernel.org/r/20240521175854.96038-10-ryncsn@gmail.com
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Barry Song <v-songbaohua@oppo.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Marc Dionne <marc.dionne@auristor.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Xiubo Li <xiubli@redhat.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      564a2ee9
    • Kairui Song's avatar
      mm/swap: get the swap device offset directly · 545ebe71
      Kairui Song authored
      folio_file_pos and page_file_offset are for mixed usage of swap cache and
      page cache.  It can't be page cache here, so introduce a new helper to get
      the swap offset in the swap device directly.
      
      swapops.h needs to be included in mm/swap.h to ensure swp_offset is
      always defined before use.
      
      Link: https://lkml.kernel.org/r/20240521175854.96038-9-ryncsn@gmail.com
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Barry Song <v-songbaohua@oppo.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Marc Dionne <marc.dionne@auristor.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Xiubo Li <xiubli@redhat.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      545ebe71
    • Kairui Song's avatar
      nfs: drop usage of folio_file_pos · 237d2907
      Kairui Song authored
      folio_file_pos is only needed for mixed usage of page cache and swap
      cache.  For pure page cache usage, the caller can just use folio_pos
      instead.
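
      The difference between the two kinds of helper can be sketched in
      userspace C.  This is only an illustrative model: struct sketch_folio,
      SKETCH_PAGE_SHIFT and both sketch_* functions are hypothetical names,
      not the kernel's definitions, and 4KB pages are assumed.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4KB pages */

/* Hypothetical stand-in for struct folio; not the kernel layout. */
struct sketch_folio {
    bool in_swap_cache;   /* models a swap cache folio */
    uint64_t index;       /* page cache index, in pages */
    uint64_t swap_offset; /* swap slot offset, in pages */
};

/* Models the mixed-usage helper: picks the index by cache type. */
static int64_t sketch_mixed_file_pos(const struct sketch_folio *f)
{
    uint64_t idx = f->in_swap_cache ? f->swap_offset : f->index;

    return (int64_t)(idx << SKETCH_PAGE_SHIFT);
}

/* Models the page-cache-only helper: no swap cache check at all. */
static int64_t sketch_folio_pos(const struct sketch_folio *f)
{
    return (int64_t)(f->index << SKETCH_PAGE_SHIFT);
}
```

      For a folio that is never in the swap cache both functions return the
      same byte position, which is why the simpler helper is sufficient in
      these callers.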
      
      After commit e1209d3a ("mm: introduce ->swap_rw and use it for reads
      from SWP_FS_OPS swap-space"), swap cache should never be exposed to nfs.
      
      So remove the usage of folio_file_pos in the following NFS functions / helpers:
      
      - nfs_vm_page_mkwrite
      
        It's only used by nfs_file_vm_ops.page_mkwrite
      
      - trace event helper: nfs_folio_event
      - trace event helper: nfs_folio_event_done
      
        These two are used through DEFINE_NFS_FOLIO_EVENT and
        DEFINE_NFS_FOLIO_EVENT_DONE, which define the following events:
      
        - trace_nfs_aop_readpage{_done}: only called by nfs_read_folio
        - trace_nfs_writeback_folio: only called by nfs_wb_folio
        - trace_nfs_invalidate_folio: only called by nfs_invalidate_folio
        - trace_nfs_launder_folio_done: only called by nfs_launder_folio
      
        None of them could possibly be used on a swap cache folio.
        nfs_read_folio is only called by:
        .write_begin -> nfs_read_folio
        .read_folio
      
        nfs_wb_folio is only called by nfs mapping:
        .release_folio -> nfs_wb_folio
        .launder_folio -> nfs_wb_folio
        .write_begin -> nfs_read_folio -> nfs_wb_folio
        .read_folio -> nfs_wb_folio
        .write_end -> nfs_update_folio -> nfs_writepage_setup -> nfs_setup_write_request -> nfs_try_to_update_request -> nfs_wb_folio
        .page_mkwrite -> nfs_update_folio -> nfs_writepage_setup -> nfs_setup_write_request -> nfs_try_to_update_request -> nfs_wb_folio
        .write_begin -> nfs_flush_incompatible -> nfs_wb_folio
        .page_mkwrite -> nfs_vm_page_mkwrite -> nfs_flush_incompatible -> nfs_wb_folio
      
        nfs_invalidate_folio is only called by .invalidate_folio.
        nfs_launder_folio is only called by .launder_folio
      
      - nfs_grow_file
      - nfs_update_folio
      
        nfs_grow_file is only called by nfs_update_folio, and all
        possible callers of them are:
      
        .write_end -> nfs_update_folio
        .page_mkwrite -> nfs_update_folio
      
      - nfs_wb_folio_cancel
      
        .invalidate_folio -> nfs_wb_folio_cancel
      
      Also, seen from the swap side, swap_rw is now the only interface calling
      into fs, and the offset info is always in iocb.ki_pos.
      
      So we can remove all these folio_file_pos calls safely.
      
      Link: https://lkml.kernel.org/r/20240521175854.96038-8-ryncsn@gmail.com
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Barry Song <v-songbaohua@oppo.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Marc Dionne <marc.dionne@auristor.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Xiubo Li <xiubli@redhat.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      237d2907
    • Kairui Song's avatar
      netfs: drop usage of folio_file_pos · 7084021c
      Kairui Song authored
      folio_file_pos is only needed for mixed usage of page cache and swap
      cache.  For pure page cache usage, the caller can just use folio_pos
      instead.
      
      It can't be a swap cache page here.  Swap mapping may only call into fs
      through swap_rw and that is not supported for netfs.  So just drop it and
      use folio_pos instead.
      
      Link: https://lkml.kernel.org/r/20240521175854.96038-7-ryncsn@gmail.com
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Barry Song <v-songbaohua@oppo.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Marc Dionne <marc.dionne@auristor.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Xiubo Li <xiubli@redhat.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7084021c
    • Kairui Song's avatar
      afs: drop usage of folio_file_pos · d4f43986
      Kairui Song authored
      folio_file_pos is only needed for mixed usage of page cache and swap
      cache.  For pure page cache usage, the caller can just use folio_pos
      instead.
      
      It can't be a swap cache page here.  Swap mapping may only call into fs
      through swap_rw and that is not supported for afs.  So just drop it and
      use folio_pos instead.
      
      Link: https://lkml.kernel.org/r/20240521175854.96038-6-ryncsn@gmail.com
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Marc Dionne <marc.dionne@auristor.com>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Barry Song <v-songbaohua@oppo.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Xiubo Li <xiubli@redhat.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d4f43986
    • Kairui Song's avatar
      NFS: remove nfs_page_length and usage of page_index · 8586be3d
      Kairui Song authored
      This function is no longer used after commit 4fa7a717 ("NFS: Fix up
      nfs_vm_page_mkwrite() for folios").  All users have been converted to use
      folio instead, so just delete it to remove the usage of page_index.
      
      Link: https://lkml.kernel.org/r/20240521175854.96038-5-ryncsn@gmail.com
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Barry Song <v-songbaohua@oppo.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Marc Dionne <marc.dionne@auristor.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Xiubo Li <xiubli@redhat.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8586be3d
    • Kairui Song's avatar
      ceph: drop usage of page_index · 5e425300
      Kairui Song authored
      page_index is needed for mixed usage of page cache and swap cache.  For
      pure page cache usage, the caller can just use page->index instead.
      
      It can't be a swap cache page here, so just drop it.
      
      Link: https://lkml.kernel.org/r/20240521175854.96038-4-ryncsn@gmail.com
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Cc: Xiubo Li <xiubli@redhat.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Barry Song <v-songbaohua@oppo.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Marc Dionne <marc.dionne@auristor.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5e425300
    • Kairui Song's avatar
      nilfs2: drop usage of page_index · 1f49c147
      Kairui Song authored
      Patch series "mm/swap: clean up and optimize swap cache index", v6.
      
      Currently we use one swap_address_space for every 64M chunk to reduce lock
      contention.  This is like having a set of smaller files inside a swap
      device.  But when doing swap cache lookup or insert, we are still using
      the offset of the whole large swap device.  This is OK for correctness, as
      the offset (key) is unique.
      
      But Xarray is specially optimized for small indexes: it creates the radix
      tree levels lazily to be just enough to fit the largest key stored in one
      Xarray.  So we are wasting tree nodes unnecessarily.
      
      For a 64M chunk it should only take at most 3 levels to contain
      everything.  But if we are using the offset from the whole swap device,
      the offset (key) value will be way beyond 64M, and so will the tree level.
      
      Optimize this by reducing the swap cache search space to the 64M scope.
      
      Test with `time memhog 128G` inside an 8G memcg using 128G swap (ramdisk
      with SWP_SYNCHRONOUS_IO dropped, tested 3 times, results are stable.  The
      test result is similar but the improvement is smaller if
      SWP_SYNCHRONOUS_IO is enabled, as the swap out path can never skip the
      swap cache):
      
      Before:
      6.07user 250.74system 4:17.26elapsed 99%CPU (0avgtext+0avgdata 8373376maxresident)k
      0inputs+0outputs (55major+33555018minor)pagefaults 0swaps
      
      After (+1.8% faster):
      6.08user 246.09system 4:12.58elapsed 99%CPU (0avgtext+0avgdata 8373248maxresident)k
      0inputs+0outputs (54major+33555027minor)pagefaults 0swaps
      
      Similar result with MySQL and sysbench using swap:
      Before:
      94055.61 qps
      
      After (+0.8% faster):
      94834.91 qps
      
      There is also a very slight drop in radix tree node slab usage:
      Before: 303952K
      After:  302224K
      
      For this series:
      
      There are multiple places that expect mixed types of pages (page cache or
      swap cache), e.g. migration and huge memory split.  There are four helpers
      for that:
      
      - page_index
      - page_file_offset
      - folio_index
      - folio_file_pos
      
      To keep the code clean and compatible, this series first cleans up usage
      of them.
      
      page_file_offset and folio_file_pos are historical helpers that can be
      simply dropped after the cleanup.  And page_index can be all converted to
      folio_index or folio->index.
      
      Then introduce two new helpers swap_cache_index and swap_dev_pos for swap.
      Replace swp_offset with swap_cache_index when used to retrieve folio from
      swap cache, and use swap_dev_pos when needed to retrieve the device
      position of a swap entry.  This way, swap_cache_index can return the
      optimized value with no compatibility issue.
      
      The result is better performance and reduced LOC.
      
      Ideally, in the future, we may want to reduce SWAP_ADDRESS_SPACE_SHIFT
      from 14 to 12: the default Xarray chunk offset is 6, so we have 3 level
      trees instead of 2 level trees just for 2 extra bits.  But swap cache is
      based on the address_space struct; with 4 times more metadata sparsely
      distributed in memory it wastes more cache lines, and the performance gain
      from this series is almost canceled according to my test.  So first, just
      have a cleaner separation of offsets and a smaller search space.
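
      The arithmetic behind the two new helpers and the tree depth can be
      sketched in userspace C.  This is a rough model under stated
      assumptions, not the series' code: the sketch_* names and SKETCH_*
      macros are hypothetical, 4KB pages are assumed, and the level count
      simply assumes 6 index bits per tree level as described above.

```c
#include <limits.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12     /* assumption: 4KB pages */
#define SKETCH_AS_SHIFT 14       /* address space shift of 14, per the text */
#define SKETCH_AS_MASK ((1UL << SKETCH_AS_SHIFT) - 1)
#define SKETCH_XA_CHUNK_SHIFT 6  /* Xarray chunk offset of 6, per the text */

/* Index inside one 64M swap address space: low 14 bits of the offset. */
static unsigned long sketch_swap_cache_index(unsigned long swp_offset)
{
    return swp_offset & SKETCH_AS_MASK;
}

/* Byte position on the swap device: the full offset scaled to bytes. */
static int64_t sketch_swap_dev_pos(unsigned long swp_offset)
{
    return (int64_t)swp_offset << SKETCH_PAGE_SHIFT;
}

/* Tree levels needed to hold max_index with 6 index bits per level. */
static unsigned int sketch_xa_levels(unsigned long max_index)
{
    unsigned int levels = 1;

    /* The width check keeps the shift count below the type width. */
    while (levels * SKETCH_XA_CHUNK_SHIFT < sizeof(max_index) * CHAR_BIT &&
           (max_index >> (levels * SKETCH_XA_CHUNK_SHIFT)) != 0)
        levels++;
    return levels;
}
```

      With these assumptions a 14-bit index space needs 3 levels and a 12-bit
      one needs 2, while indexing by the full offset of a 128G device (2^25
      pages) would need 5.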
      
      
      This patch (of 10):
      
      page_index is only for mixed usage of page cache and swap cache.  For pure
      page cache usage, the caller can just use page->index instead.
      
      It can't be a swap cache page here (being part of buffer head), so just
      drop it.  And while we are at it, optimize the code by retrieving the
      offset of the buffer head within the folio directly using bh_offset, and
      get rid of the loop and usage of page helpers.
      
      Link: https://lkml.kernel.org/r/20240521175854.96038-1-ryncsn@gmail.com
      Link: https://lkml.kernel.org/r/20240521175854.96038-3-ryncsn@gmail.com
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Barry Song <v-songbaohua@oppo.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Marc Dionne <marc.dionne@auristor.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Xiubo Li <xiubli@redhat.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1f49c147
    • Kemeng Shi's avatar
      writeback: factor out balance_wb_limits to remove repeated code · 8246291e
      Kemeng Shi authored
      Factor out balance_wb_limits to remove repeated code
      
      [shikemeng@huaweicloud.com: add comment]
        Link: https://lkml.kernel.org/r/20240606033547.344376-1-shikemeng@huaweicloud.com
      [akpm@linux-foundation.org: s/fileds/fields/ in comment]
      Link: https://lkml.kernel.org/r/20240514125254.142203-9-shikemeng@huaweicloud.com
      Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8246291e
    • Kemeng Shi's avatar
      writeback: factor out wb_dirty_exceeded to remove repeated code · 236d0f16
      Kemeng Shi authored
      Factor out wb_dirty_exceeded to remove repeated code
      
      Link: https://lkml.kernel.org/r/20240514125254.142203-8-shikemeng@huaweicloud.com
      Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      236d0f16
    • Kemeng Shi's avatar
      writeback: factor out balance_domain_limits to remove repeated code · 8c9918de
      Kemeng Shi authored
      Factor out balance_domain_limits to remove repeated code.
      
      Link: https://lkml.kernel.org/r/20240514125254.142203-7-shikemeng@huaweicloud.com
      Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8c9918de
    • Kemeng Shi's avatar
      writeback: factor out wb_dirty_freerun to remove more repeated freerun code · 2530e239
      Kemeng Shi authored
      Factor out wb_dirty_freerun to remove more repeated freerun code.
      
      Link: https://lkml.kernel.org/r/20240514125254.142203-6-shikemeng@huaweicloud.com
      Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2530e239
    • Kemeng Shi's avatar
      writeback: factor out code of freerun to remove repeated code · 9bb48a70
      Kemeng Shi authored
      Factor out code of freerun into new helper functions domain_poll_intv and
      domain_dirty_freerun to remove repeated code.
      
      Link: https://lkml.kernel.org/r/20240514125254.142203-5-shikemeng@huaweicloud.com
      Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9bb48a70
    • Kemeng Shi's avatar
      writeback: factor out domain_over_bg_thresh to remove repeated code · 6e208329
      Kemeng Shi authored
      Factor out domain_over_bg_thresh from wb_over_bg_thresh to remove repeated
      code.
      
      Link: https://lkml.kernel.org/r/20240514125254.142203-4-shikemeng@huaweicloud.com
      Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6e208329
    • Kemeng Shi's avatar
      writeback: add general function domain_dirty_avail to calculate dirty and avail of domain · ba62d5cf
      Kemeng Shi authored
      Add general function domain_dirty_avail to calculate dirty and avail for
      either dirty limit or background writeback in either global domain or wb
      domain.
      
      Link: https://lkml.kernel.org/r/20240514125254.142203-3-shikemeng@huaweicloud.com
      Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ba62d5cf
    • Kemeng Shi's avatar
      writeback: factor out wb_bg_dirty_limits to remove repeated code · 7c0c629b
      Kemeng Shi authored
      Patch series "Add helper functions to remove repeated code and improve
      readability of cgroup writeback", v2.
      
      This series adds a lot of helpers to remove repeated code between domain
      and wb; dirty limit and dirty background; global domain and wb domain. 
      The helpers also improve readability.  More details can be found in the
      respective patches.
      
      A simple domain hierarchy is tested:
      global domain (> 20G)
      	|
      cgroup domain1(10G)
      	|
      	wb1
      	|
      	fio
      
      Test steps:
      /* make it easy to observe */
      echo 300000 > /proc/sys/vm/dirty_expire_centisecs
      echo 3000 > /proc/sys/vm/dirty_writeback_centisecs
      
      /* create cgroup domain */
      cd /sys/fs/cgroup
      echo "+memory +io" > cgroup.subtree_control
      mkdir group1
      cd group1
      echo 10G > memory.high
      echo 10G > memory.max
      echo $$ > cgroup.procs
      mkfs.ext4 -F /dev/vdb
      mount /dev/vdb /bdi1/
      
      /* run fio to generate dirty pages */
      fio -name test -filename=/bdi1/file -size=xxx -ioengine=libaio -bs=4K \
      -iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0
      
      When fio size is 1G, the wb is in freerun state and dirty pages are only
      written back when dirty inode is expired after 30 seconds.  When fio size
      is 2G, the dirty pages keep being written back and bandwidth of fio is
      limited.
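
      One plausible reading of these two results, sketched in userspace C.
      The assumptions are loud ones: the default vm.dirty_ratio (20) and
      vm.dirty_background_ratio (10) are taken to be in effect, the whole 10G
      of the cgroup is treated as dirtyable memory, and the freerun ceiling is
      taken as the midpoint of the two thresholds.  All sketch_* names are
      hypothetical and sizes are in MB.

```c
/* Threshold as a percentage of dirtyable memory (MB in, MB out). */
static unsigned long sketch_ratio_thresh(unsigned long avail_mb,
                                         unsigned int ratio_pct)
{
    return avail_mb * ratio_pct / 100;
}

/* Freerun ceiling: midpoint of the dirty and background thresholds. */
static unsigned long sketch_freerun_ceiling(unsigned long thresh,
                                            unsigned long bg_thresh)
{
    return (thresh + bg_thresh) / 2;
}

/* Nonzero when dirty_mb is at or below the freerun ceiling. */
static int sketch_in_freerun(unsigned long dirty_mb, unsigned long avail_mb,
                             unsigned int dirty_ratio, unsigned int bg_ratio)
{
    unsigned long thresh = sketch_ratio_thresh(avail_mb, dirty_ratio);
    unsigned long bg_thresh = sketch_ratio_thresh(avail_mb, bg_ratio);

    return dirty_mb <= sketch_freerun_ceiling(thresh, bg_thresh);
}
```

      Under these assumptions the thresholds are 2048M and 1024M, so the
      ceiling is 1536M: a 1G file stays below it, while a 2G file does not.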
      
      
      This patch (of 8):
      
      Similar to wb_dirty_limits which calculates dirty and thresh of wb,
      wb_bg_dirty_limits calculates background dirty and background thresh of
      wb.  With wb_bg_dirty_limits, we could remove repeated code in
      wb_over_bg_thresh.
      
      Link: https://lkml.kernel.org/r/20240514125254.142203-1-shikemeng@huaweicloud.com
      Link: https://lkml.kernel.org/r/20240514125254.142203-2-shikemeng@huaweicloud.com
      Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7c0c629b
    • Shakeel Butt's avatar
      mm: vmscan: reset sc->priority on retry · 462966dc
      Shakeel Butt authored
      Commit 6be5e186fd65 ("mm: vmscan: restore incremental cgroup
      iteration") added a retry reclaim heuristic to iterate all the cgroups
      before returning an unsuccessful reclaim, but failed to reset
      sc->priority.  Let's fix it.
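
      The effect of the missing reset can be modelled in userspace C.  This is
      a toy sketch, not the vmscan code: struct sketch_scan_control and
      sketch_retry_start_priority() are hypothetical, the starting priority of
      12 is an assumption, and the loop body does no reclaim so that the retry
      path is always taken.

```c
#define SKETCH_DEF_PRIORITY 12 /* assumption: starting scan priority */

/* Hypothetical subset of the scan control state. */
struct sketch_scan_control {
    int priority;
    int memcg_full_walk;
};

/* Returns the priority at which the retry pass begins. */
static int sketch_retry_start_priority(int reset_on_retry)
{
    struct sketch_scan_control sc = { .priority = SKETCH_DEF_PRIORITY };
    int initial_priority = sc.priority;
    int retry_start = sc.priority;
    unsigned long passes = 0;

retry:
    do {
        passes++; /* a real pass would shrink LRUs at this priority */
    } while (--sc.priority >= 0);

    /* The loop always exits with sc.priority == -1. */
    if (!sc.memcg_full_walk) {
        if (reset_on_retry)
            sc.priority = initial_priority; /* the reset being added */
        sc.memcg_full_walk = 1;
        retry_start = sc.priority;
        goto retry;
    }
    (void)passes;
    return retry_start;
}
```

      Without the reset the retry pass starts at priority -1, so any code that
      uses the priority as a shift count would see a negative value; with the
      reset it starts again from the initial priority.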
      
      Link: https://lkml.kernel.org/r/20240529154911.3008025-1-shakeel.butt@linux.dev
      Fixes: 6be5e186fd65 ("mm: vmscan: restore incremental cgroup iteration")
      Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
      Reported-by: syzbot+17416257cb95200cba44@syzkaller.appspotmail.com
      Tested-by: syzbot+17416257cb95200cba44@syzkaller.appspotmail.com
      Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      462966dc
    • Johannes Weiner's avatar
      mm: vmscan: restore incremental cgroup iteration · b82b5307
      Johannes Weiner authored
      Currently, reclaim always walks the entire cgroup tree in order to ensure
      fairness between groups.  While overreclaim is limited in shrink_lruvec(),
      many of our systems have a sizable number of active groups, and an even
      bigger number of idle cgroups with cache left behind by previous jobs; the
      mere act of walking all these cgroups can impose significant latency on
      direct reclaimers.
      
      In the past, we've used a save-and-restore iterator that enabled
      incremental tree walks over multiple reclaim invocations.  This ensured
      fairness, while keeping the work of individual reclaimers small.
      
      However, in edge cases with a lot of reclaim concurrency, individual
      reclaimers would sometimes not see enough of the cgroup tree to make
      forward progress and (prematurely) declare OOM.  Consequently we switched
      to comprehensive walks in 1ba6fc9a ("mm: vmscan: do not share cgroup
      iteration between reclaimers").
      
      To address the latency problem without bringing back the premature OOM
      issue, reinstate the shared iteration, but with a restart condition to do
      the full walk in the OOM case - similar to what we do for memory.low
      enforcement and active page protection.
      
      In the worst case, we do one more full tree walk before declaring
      OOM. But the vast majority of direct reclaim scans can then finish
      much quicker, while fairness across the tree is maintained:
      
      - Before this patch, we observed that direct reclaim always takes more
        than 100us and most direct reclaim time is spent in reclaim cycles
        lasting between 1ms and 1 second. Almost 40% of direct reclaim time
        was spent on reclaim cycles exceeding 100ms.
      
      - With this patch, almost all page reclaim cycles last less than 10ms,
        and a good amount of direct page reclaim finishes in under 100us. No
        page reclaim cycles lasting over 100ms were observed anymore.
      
      The shared iterator state is maintained inside the target cgroup, so
      fair and incremental walks are performed during both global reclaim
      and cgroup limit reclaim of complex subtrees.
      
      Link: https://lkml.kernel.org/r/20240514202641.2821494-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Reported-by: Rik van Riel <riel@surriel.com>
      Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
      Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Facebook Kernel Team <kernel-team@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b82b5307
    • Ran Xiaokai's avatar
      mm/huge_memory: mark racy access on huge_anon_orders_always · 7f83bf14
      Ran Xiaokai authored
      huge_anon_orders_always is accessed locklessly, so it is better to use
      the READ_ONCE() wrapper.  This does not fix any visible bug, but it
      should silence some KCSAN complaints in the future.  Also do that for
      huge_anon_orders_madvise.
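
      As a rough userspace illustration of the pattern: the macro below is a
      simplified approximation of a read-once accessor (the kernel's
      READ_ONCE() is more elaborate), and sketch_orders_always and
      sketch_order_enabled() are hypothetical stand-ins.  It relies on the
      __typeof__ extension available in GCC and Clang.

```c
/* Simplified approximation: force a single volatile load of x. */
#define SKETCH_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical stand-in for a bitmask that may be updated concurrently. */
static unsigned long sketch_orders_always;

/* Snapshot the mask once, then test the bit on the local copy. */
static int sketch_order_enabled(int order)
{
    unsigned long orders = SKETCH_READ_ONCE(sketch_orders_always);

    return (int)((orders >> order) & 1UL);
}
```

      Taking one snapshot means the compiler cannot re-read or tear the load,
      and it documents to race detectors that the lockless read is
      intentional.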
      
      Link: https://lkml.kernel.org/r/20240515104754889HqrahFPePOIE1UlANHVAh@zte.com.cn
      Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Lu Zhongjun <lu.zhongjun@zte.com.cn>
      Reviewed-by: xu xin <xu.xin16@zte.com.cn>
      Cc: Yang Yang <yang.yang29@zte.com.cn>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7f83bf14
    • Kefeng Wang's avatar
      mm: shmem: use folio_alloc_mpol() in shmem_alloc_folio() · 6f775463
      Kefeng Wang authored
      Let's change shmem_alloc_folio() to take an order and use the
      folio_alloc_mpol() helper, then use it directly for both normal and large
      folios to clean up the code.
      
      Link: https://lkml.kernel.org/r/20240515070709.78529-5-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6f775463
    • Kefeng Wang's avatar
      mm: mempolicy: use folio_alloc_mpol() in alloc_migration_target_by_mpol() · 1d9cb785
      Kefeng Wang authored
      Convert to use folio_alloc_mpol() so that vma_alloc_folio_noprof() uses
      folios throughout.
      
      Link: https://lkml.kernel.org/r/20240515070709.78529-4-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1d9cb785