1. 04 Sep, 2024 13 commits
    • MIPS: sgi-ip27: drop HAVE_ARCH_NODEDATA_EXTENSION · 6c701269
      Mike Rapoport (Microsoft) authored
      Commit f8f9f21c ("MIPS: Fix build error for loongson64 and sgi-ip27")
      added HAVE_ARCH_NODEDATA_EXTENSION to sgi-ip27 to silence a compilation
      error that happened because sgi-ip27 did not define an array of
      pg_data_t named node_data, as most other architectures do.
      
      Now that sgi-ip27 has a node_data array that matches other architectures
      and offline nodes no longer appear in node_possible_map, it is safe to
      drop arch_alloc_nodedata() and HAVE_ARCH_NODEDATA_EXTENSION from
      sgi-ip27.
      
      Link: https://lkml.kernel.org/r/20240807064110.1003856-5-rppt@kernel.org
      Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
      Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Cc: Rob Herring (Arm) <robh@kernel.org>
      Cc: Samuel Holland <samuel.holland@sifive.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • MIPS: sgi-ip27: ensure node_possible_map only contains valid nodes · 0c445078
      Mike Rapoport (Microsoft) authored
      For SGI IP27 machines node_possible_map is statically set to NODE_MASK_ALL
      and it is not updated during NUMA initialization.
      
      Ensure that it only contains nodes present in the system.
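      
      A minimal userspace sketch of the idea, assuming a 64-bit mask in place
      of the kernel's nodemask_t.  The names build_possible_map(),
      node_possible_sketch() and NODE_MASK_ALL_SKETCH are hypothetical and are
      not taken from the patch; the sketch only contrasts a statically "all
      nodes" map with one built from the nodes found during initialization.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NUMNODES 64              /* illustrative limit for this sketch */

typedef uint64_t nodemask_sketch_t;  /* stand-in for the kernel's nodemask_t */

/* Static "every node is possible" map, the situation being fixed. */
static const nodemask_sketch_t NODE_MASK_ALL_SKETCH = ~(uint64_t)0;

/* Build the possible map from the nodes actually discovered at init time. */
static nodemask_sketch_t build_possible_map(const int *present, int n)
{
    nodemask_sketch_t map = 0;

    for (int i = 0; i < n; i++) {
        assert(present[i] >= 0 && present[i] < MAX_NUMNODES);
        map |= (uint64_t)1 << present[i];
    }
    return map;
}

/* Return 1 if node nid is set in the map, 0 otherwise. */
static int node_possible_sketch(nodemask_sketch_t map, int nid)
{
    assert(nid >= 0 && nid < MAX_NUMNODES);
    return (int)((map >> nid) & 1);
}
```

      With the static map every node id looks possible, while a map built
      from the discovered nodes reports only those nodes.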
      
      Link: https://lkml.kernel.org/r/20240807064110.1003856-4-rppt@kernel.org
      Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
      Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Cc: Rob Herring (Arm) <robh@kernel.org>
      Cc: Samuel Holland <samuel.holland@sifive.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • MIPS: sgi-ip27: make NODE_DATA() the same as on all other architectures · bc5c8ad3
      Mike Rapoport (Microsoft) authored
      sgi-ip27 is the only system that defines NODE_DATA() differently from
      the rest of the NUMA machines.
      
      Add a node_data array of struct pglist_data pointers that point to
      __node_data[node]->pglist and redefine NODE_DATA() to use the node_data
      array.
      
      This will allow pulling the declaration of node_data into the generic mm
      code in the next commit.
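      
      A minimal userspace sketch of that layout, not the kernel code.  Only
      node_data, NODE_DATA() and __node_data[node]->pglist come from the
      commit message; struct node_data_priv, node_storage and setup_node()
      are hypothetical names, and the one-field struct pglist_data is a
      stand-in for the real structure.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NUMNODES 4                 /* illustrative value only */

struct pglist_data { int node_id; };   /* stand-in for the kernel's pg_data_t */

/* Platform-private per-node structure that embeds the pglist_data. */
struct node_data_priv { struct pglist_data pglist; };

static struct node_data_priv node_storage[MAX_NUMNODES];
static struct node_data_priv *__node_data[MAX_NUMNODES];

/* Array of pointers with the same shape other architectures use. */
static struct pglist_data *node_data[MAX_NUMNODES];

#define NODE_DATA(nid) (node_data[(nid)])

/* Hypothetical init helper: point node_data[nid] at the embedded pglist. */
static void setup_node(int nid)
{
    assert(nid >= 0 && nid < MAX_NUMNODES);
    __node_data[nid] = &node_storage[nid];
    __node_data[nid]->pglist.node_id = nid;
    node_data[nid] = &__node_data[nid]->pglist;
}
```

      Once every platform exposes the same node_data array, a single generic
      declaration can serve all of them.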
      
      Link: https://lkml.kernel.org/r/20240807064110.1003856-3-rppt@kernel.org
      Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
      Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Cc: Rob Herring (Arm) <robh@kernel.org>
      Cc: Samuel Holland <samuel.holland@sifive.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: move kernel/numa.c to mm/ · 0e8b6798
      Mike Rapoport (Microsoft) authored
      Patch series "mm: introduce numa_memblks", v4.
      
      Following the discussion about handling of CXL fixed memory windows on
      arm64 [1] I decided to bite the bullet and move numa_memblks from x86 to
      the generic code so they will be available on arm64/riscv and maybe on
      loongarch sometime later.
      
      While it could be possible to use memblock to describe CXL memory windows,
      memblock currently lacks the notion of unpopulated memory ranges, which
      numa_memblks does implement.
      
      Another reason to make numa_memblks generic is that both arch_numa (arm64
      and riscv) and loongarch use a trimmed copy of the x86 code although there
      is no fundamental reason why the same code cannot be used on all these
      platforms.  Having numa_memblks in mm/ will make its interaction with
      ACPI and FDT more consistent and I believe will reduce the maintenance
      burden.
      
      And with generic numa_memblks it is (almost) straightforward to enable
      NUMA emulation on arm64 and riscv.
      
      The first 9 commits in this series are cleanups that are not strictly
      related to numa_memblks.
      Commits 10-16 slightly reorder code in x86 to allow extracting numa_memblks
      and NUMA emulation to the generic code.
      Commits 17-19 actually move the code from arch/x86/ to mm/ and commits 20-22
      do some follow-up cleanups.
      Commit 23 updates of_numa_init() to return an error if no NUMA nodes were
      found in the device tree.
      Commit 24 switches arch_numa to numa_memblks.
      Commit 25 enables usage of phys_to_target_node() and
      memory_add_physaddr_to_nid() with numa_memblks.
      Commit 26 moves the description for numa=fake from x86 to admin-guide.
      
      [1] https://lore.kernel.org/all/20240529171236.32002-1-Jonathan.Cameron@huawei.com/
      
      
      This patch (of 26):
      
      The stub functions in kernel/numa.c belong to mm/ rather than to kernel/.
      
      Link: https://lkml.kernel.org/r/20240807064110.1003856-1-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20240807064110.1003856-2-rppt@kernel.org
      Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
      Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Cc: Rob Herring (Arm) <robh@kernel.org>
      Cc: Samuel Holland <samuel.holland@sifive.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: swap: add a adaptive full cluster cache reclaim · 2cacbdfd
      Kairui Song authored
      Link all full clusters on one full list, and reclaim from it when the
      allocator has run out of all usable clusters.
      
      There are many reasons a folio can end up in the swap cache while having
      no swap count reference.  So the best way to search for such slots is
      still to iterate over the swap clusters.
      
      With the list acting as an LRU, iterating from the oldest cluster and
      keeping the clusters rotating is a simple and clean way to free up
      clusters that are potentially no longer in use.
      
      On any allocation failure, try to reclaim and rotate only one cluster.
      This is adaptive: high order allocations can tolerate fallback, so this
      avoids latency and gives the full cluster list a fair chance to get
      reclaimed.  It relieves the usage pressure for the fallback order 0
      allocation or for a following high order allocation.
      
      If the swap device is getting very full, reclaim more aggressively to
      ensure no OOM will happen.  This ensures an order 0 heavy workload won't
      go OOM, as order 0 won't fail if any cluster still has any space.
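      
      The reclaim-and-rotate behaviour described above can be sketched as
      follows.  This is a simplified userspace model under stated
      assumptions, not the kernel implementation: struct cluster, struct
      full_list, the reclaimable field and reclaim_full_clusters() are
      hypothetical, and reclaim is assumed to always succeed.

```c
#include <assert.h>
#include <stddef.h>

struct cluster {
    int reclaimable;        /* slots held only by the swap cache (assumed field) */
    int used;               /* slots in use, including reclaimable ones */
    struct cluster *next;   /* full list link, head is the oldest cluster */
};

struct full_list { struct cluster *head, *tail; };

static void full_list_add_tail(struct full_list *l, struct cluster *c)
{
    c->next = NULL;
    if (l->tail)
        l->tail->next = c;
    else
        l->head = c;
    l->tail = c;
}

static struct cluster *full_list_pop_head(struct full_list *l)
{
    struct cluster *c = l->head;

    if (!c)
        return NULL;
    l->head = c->next;
    if (!l->head)
        l->tail = NULL;
    c->next = NULL;
    return c;
}

/*
 * Visit at most `budget` clusters starting from the oldest one and return
 * the number of slots freed.  A cluster that stays full is rotated to the
 * tail; a cluster that gained free slots leaves the full list.
 */
static int reclaim_full_clusters(struct full_list *l, int budget)
{
    int freed = 0;

    while (budget-- > 0) {
        struct cluster *c = full_list_pop_head(l);
        int n;

        if (!c)
            break;
        n = c->reclaimable;
        c->used -= n;
        c->reclaimable = 0;
        freed += n;
        if (n == 0)
            full_list_add_tail(l, c);   /* still full: rotate */
    }
    return freed;
}
```

      In this model a caller would pass a budget of 1 after an ordinary
      allocation failure and a larger budget when the device is nearly full.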
      
      [ryncsn@gmail.com: fix discard of full cluster]
        Link: https://lkml.kernel.org/r/CAMgjq7CWwK75_2Zi5P40K08pk9iqOcuWKL6khu=x4Yg_nXaQag@mail.gmail.com
      Link: https://lkml.kernel.org/r/20240730-swap-allocator-v5-9-cb9c148b9297@kernel.org
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Reported-by: Barry Song <21cnbao@gmail.com>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Kairui Song <ryncsn@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: swap: relaim the cached parts that got scanned · 661383c6
      Kairui Song authored
      This commit implements reclaim during scan for the cluster allocator.
      
      Cluster scanning was unable to reuse SWAP_HAS_CACHE slots, which could
      result in a low allocation success rate or early OOM.
      
      So to ensure the maximum allocation success rate, integrate reclaiming
      with scanning.  If a range of suitable swap slots is found but is
      fragmented by HAS_CACHE slots, just try to reclaim those slots.
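      
      A minimal sketch of scanning with integrated reclaim, assuming a plain
      array of slot states.  enum slot_state and scan_and_reclaim() are
      hypothetical names; SLOT_CACHE_ONLY models a slot that is only held by
      the swap cache, and the sketch assumes reclaiming such a slot always
      succeeds, which a real reclaim does not guarantee.

```c
#include <assert.h>
#include <stddef.h>

enum slot_state { SLOT_FREE, SLOT_IN_USE, SLOT_CACHE_ONLY };

/*
 * Look for an aligned range of `nr` slots that contains no in-use slot.
 * Cache-only slots inside such a range are reclaimed (modelled here as
 * simply becoming free).  Returns the start index, or -1 if none is found.
 */
static int scan_and_reclaim(enum slot_state *slots, size_t nr_slots, size_t nr)
{
    assert(nr > 0);
    for (size_t start = 0; start + nr <= nr_slots; start += nr) {
        size_t i;

        for (i = 0; i < nr; i++)
            if (slots[start + i] == SLOT_IN_USE)
                break;
        if (i < nr)
            continue;                       /* range cannot be reclaimed */
        for (i = 0; i < nr; i++)
            slots[start + i] = SLOT_FREE;   /* model a successful reclaim */
        return (int)start;
    }
    return -1;
}
```

      Without the reclaim step, the range containing a cache-only slot would
      be skipped even though no swap count reference pins it.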
      
      Link: https://lkml.kernel.org/r/20240730-swap-allocator-v5-8-cb9c148b9297@kernel.org
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Reported-by: Barry Song <21cnbao@gmail.com>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: swap: add a fragment cluster list · 477cb7ba
      Kairui Song authored
      The swap cluster allocator now arranges the clusters in LRU style, so the
      "cold" clusters that stay at the head of the nonfull lists are the ones
      that were used for allocation a long time ago and are still partially
      occupied.  So if the allocator can't find enough contiguous slots in them
      to satisfy a high order allocation, it's unlikely that slots will be
      freed on them to satisfy the allocation, at least in the short term.
      
      As a result, nonfull cluster scanning will waste time repeatedly scanning
      the unusable head of the list.
      
      Also, multiple CPUs could contend on the same head cluster of a nonfull
      list.  Unlike a free cluster, which is removed from the list when any CPU
      starts using it, a nonfull cluster stays at the head.
      
      So introduce a new list, the frag list.  All scanned nonfull clusters
      are moved to this list, both to avoid repeated scanning and to avoid
      contention.
      
      The frag list is still used as a fallback for allocations, so if one CPU
      fails to allocate slots of a given order, it can still steal other CPUs'
      clusters.  And order 0 will favor the fragmented clusters to better
      protect nonfull clusters.
      
      If any slot of a cluster on the frag list is freed, move that cluster
      back to the nonfull list, indicating it is worth another scan.
      Compared to scanning upon freeing a slot, this keeps the scanning lazy
      and saves some CPU if there are still other clusters to use.
      
      It may seem unnecessary to keep the fragmented clusters on a list at all
      if they can't be used for a specific order allocation.  But this will
      start to make sense once reclaim during scanning is ready.
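      
      The list transitions described above can be sketched as a small state
      machine.  This is an illustration under assumptions, not the kernel
      code: enum cluster_list, struct cluster_state, on_scan_failed() and
      on_slot_freed() are hypothetical names, and moving a fully emptied
      cluster to the free list is an assumption based on the rest of the
      series.

```c
#include <assert.h>

enum cluster_list { CLUSTER_FREE, CLUSTER_NONFULL, CLUSTER_FRAG, CLUSTER_FULL };

struct cluster_state {
    enum cluster_list list;   /* which list the cluster currently sits on */
    int used;                 /* slots in use */
};

/* A nonfull cluster was scanned and could not satisfy the requested order. */
static void on_scan_failed(struct cluster_state *c)
{
    if (c->list == CLUSTER_NONFULL)
        c->list = CLUSTER_FRAG;
}

/* A slot was freed: the cluster is worth another scan, or is now empty. */
static void on_slot_freed(struct cluster_state *c)
{
    assert(c->used > 0);
    c->used--;
    if (c->used == 0)
        c->list = CLUSTER_FREE;
    else
        c->list = CLUSTER_NONFULL;
}
```

      The freeing path only changes list membership; no scan happens until
      an allocation actually needs the cluster, which keeps scanning lazy.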
      
      Link: https://lkml.kernel.org/r/20240730-swap-allocator-v5-7-cb9c148b9297@kernel.org
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Reported-by: Barry Song <21cnbao@gmail.com>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: swap: allow cache reclaim to skip slot cache · 862590ac
      Kairui Song authored
      Currently we free the reclaimed slots through the slot cache even if the
      slot is required to be empty immediately.  As a result the reclaim caller
      will see the slot still occupied even after a successful reclaim, and
      needs to keep reclaiming until the slot cache gets flushed.  This causes
      ineffective or excessive reclaim when swap is under stress.
      
      So introduce a new flag allowing the slot to be emptied, bypassing the
      slot cache.
      
      [21cnbao@gmail.com: small folios should have nr_pages == 1 but not nr_page == 0]
        Link: https://lkml.kernel.org/r/20240805015324.45134-1-21cnbao@gmail.com
      Link: https://lkml.kernel.org/r/20240730-swap-allocator-v5-6-cb9c148b9297@kernel.org
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Reported-by: Barry Song <21cnbao@gmail.com>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: swap: skip slot cache on freeing for mTHP · 650975d2
      Kairui Song authored
      Currently when we are freeing mTHP folios from the swap cache, we free
      them one by one and put each entry into the swap slot cache.  The slot
      cache is designed to reduce overhead by batching the freeing, but mTHP
      swap entries are already contiguous, so they can be batch freed without
      it.  Going through the slot cache saves little overhead, or even
      increases overhead for larger mTHP.
      
      What's more, mTHP entries could stay in the swap cache for a while.
      Contiguous swap entries are a rather rare resource, so releasing them
      directly can help improve the mTHP allocation success rate when under
      pressure.
      
      Link: https://lkml.kernel.org/r/20240730-swap-allocator-v5-5-cb9c148b9297@kernel.org
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Reported-by: Barry Song <21cnbao@gmail.com>
      Acked-by: Barry Song <baohua@kernel.org>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: swap: clean up initialization helper · 3b2561b5
      Kairui Song authored
      At this point, alloc_cluster is no longer called, and
      inc_cluster_info_page is called by initialization only, so a lot of dead
      code can be dropped.
      
      Link: https://lkml.kernel.org/r/20240730-swap-allocator-v5-4-cb9c148b9297@kernel.org
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Reported-by: Barry Song <21cnbao@gmail.com>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: swap: separate SSD allocation from scan_swap_map_slots() · 5f843a9a
      Chris Li authored
      Previously the SSD and HDD shared the same swap_map scan loop in
      scan_swap_map_slots().  This function is complex and its execution flow
      is hard to follow.
      
      scan_swap_map_try_ssd_cluster() can already do most of the heavy lifting
      to locate the candidate swap range in the cluster.  However it needs to go
      back to scan_swap_map_slots() to check for conflicts and then perform the
      allocation.
      
      When scan_swap_map_try_ssd_cluster() failed, it still depended on
      scan_swap_map_slots() to do brute force scanning of the swap_map.  When
      the swapfile is large and almost full, it takes some CPU time to go
      through the swap_map array.
      
      Get rid of the cluster allocation dependency on the swap_map scan loop in
      scan_swap_map_slots().  Streamline the cluster allocation code path.  No
      more conflict checks.
      
      For an order 0 swap entry, when the free and nonfull lists run out, the
      allocator will allocate from the higher order nonfull cluster lists.
      
      Users should see less CPU time spent on searching for a free swap slot
      when the swapfile is almost full.
      
      [ryncsn@gmail.com: fix array-bounds error with CONFIG_THP_SWAP=n]
        Link: https://lkml.kernel.org/r/CAMgjq7Bz0DY+rY0XgCoH7-Q=uHLdo3omi8kUr4ePDweNyofsbQ@mail.gmail.com
      Link: https://lkml.kernel.org/r/20240730-swap-allocator-v5-3-cb9c148b9297@kernel.org
      Signed-off-by: Chris Li <chrisl@kernel.org>
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Reported-by: Barry Song <21cnbao@gmail.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: swap: mTHP allocate swap entries from nonfull list · d07a46a4
      Chris Li authored
      Track the nonfull clusters as well as the empty clusters on lists.  Each
      order has one nonfull cluster list.
      
      The cluster will remember which order it was used for during new cluster
      allocation.
      
      When a cluster has a free entry, add it to the nonfull[order] list.  When
      the free cluster list is empty, also allocate from the nonfull list of
      that order.
      
      This improves the mTHP swap allocation success rate.
      
      There are limitations if the distribution of the numbers of different
      orders of mTHP changes a lot, e.g. there are a lot of nonfull clusters
      assigned to order A while later there are a lot of order B allocations
      and very few allocations in order A.  Currently a cluster used by order A
      will not be reused by order B unless the cluster is 100% empty.
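      
      A minimal sketch of the selection order and of the limitation just
      described, assuming clusters are only counted rather than linked on
      real lists.  struct cluster_pool, enum alloc_source, NR_ORDERS and
      pick_cluster() are hypothetical names, and the sketch assumes a
      cluster taken from the free list is not filled by a single allocation.

```c
#include <assert.h>

#define NR_ORDERS 4   /* illustrative number of orders */

struct cluster_pool {
    int nr_free;                 /* completely empty clusters */
    int nr_nonfull[NR_ORDERS];   /* partially used clusters, one list per order */
};

enum alloc_source { FROM_FREE, FROM_NONFULL, ALLOC_FAILED };

/* Pick a cluster for an allocation of the given order. */
static enum alloc_source pick_cluster(struct cluster_pool *p, int order)
{
    assert(order >= 0 && order < NR_ORDERS);
    if (p->nr_free > 0) {
        p->nr_free--;
        p->nr_nonfull[order]++;   /* the cluster now remembers `order` */
        return FROM_FREE;
    }
    if (p->nr_nonfull[order] > 0)
        return FROM_NONFULL;
    return ALLOC_FAILED;          /* clusters of other orders are not reused */
}
```

      In this model an order with no nonfull clusters of its own fails once
      the free list is empty, even if other orders still have partially
      used clusters.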
      
      Link: https://lkml.kernel.org/r/20240730-swap-allocator-v5-2-cb9c148b9297@kernel.org
      Signed-off-by: Chris Li <chrisl@kernel.org>
      Reported-by: Barry Song <21cnbao@gmail.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kairui Song <kasong@tencent.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: swap: swap cluster switch to double link list · 73ed0baa
      Chris Li authored
      Patch series "mm: swap: mTHP swap allocator base on swap cluster order",
      v5.
      
      This is the short term solution "swap cluster order" listed on slide 8 of
      my "Swap Abstraction" discussion at the recent LSF/MM conference.
      
      When commit 845982eb "mm: swap: allow storage of all mTHP orders" was
      introduced, it only allocated the mTHP swap entries from the new empty
      cluster list.  It has a fragmentation issue reported by Barry.
      
      https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
      
      The reason is that all the empty clusters have been exhausted while there
      are plenty of free swap entries in clusters that are not 100% free.
      
      Remember the swap allocation order in the cluster.  Keep track of a per
      order nonfull cluster list for later allocation.
      
      This series gives the swap SSD allocation a new code path separate from
      the HDD allocation.  The new allocator uses cluster lists only and no
      longer scans swap_map[] globally without a lock.
      
      This streamlines the swap allocation for SSD.  The code matches the
      execution flow much better.
      
      User impact: for users that allocate and free mixed order mTHP swap
      entries, it greatly improves the success rate of the mTHP swap allocation
      after the initial phase.
      
      It also performs faster when the swapfile is close to full, because the
      allocator can get a nonfull cluster from a list rather than scanning a
      lot of swap_map entries.
      
      With Barry's mthp test program V2:
      
      Without:
      $ ./thp_swap_allocator_test -a
      Iteration 1: swpout inc: 32, swpout fallback inc: 192, Fallback percentage: 85.71%
      Iteration 2: swpout inc: 0, swpout fallback inc: 231, Fallback percentage: 100.00%
      Iteration 3: swpout inc: 0, swpout fallback inc: 227, Fallback percentage: 100.00%
      ...
      Iteration 98: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
      Iteration 99: swpout inc: 0, swpout fallback inc: 215, Fallback percentage: 100.00%
      Iteration 100: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
      
      $ ./thp_swap_allocator_test -a -s
      Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
      Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
      Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
      ..
      Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
      Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
      Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
      
      $ ./thp_swap_allocator_test -s
      Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
      Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
      Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
      ..
      Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
      Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
      Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
      
      $ ./thp_swap_allocator_test
      Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
      Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
      Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
      ..
      Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
      Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
      Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
      
      With (lines showing 0.00% filtered out):
      $ ./thp_swap_allocator_test -a | grep -v "0.00%"
      $ # all results are 0.00%
      
      $ ./thp_swap_allocator_test -a -s | grep -v "0.00%"
      Iteration 14: swpout inc: 223, swpout fallback inc: 3, Fallback percentage: 1.33%
      Iteration 19: swpout inc: 219, swpout fallback inc: 7, Fallback percentage: 3.10%
      Iteration 28: swpout inc: 225, swpout fallback inc: 1, Fallback percentage: 0.44%
      Iteration 29: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
      Iteration 34: swpout inc: 220, swpout fallback inc: 8, Fallback percentage: 3.51%
      Iteration 35: swpout inc: 222, swpout fallback inc: 11, Fallback percentage: 4.72%
      Iteration 38: swpout inc: 217, swpout fallback inc: 4, Fallback percentage: 1.81%
      Iteration 40: swpout inc: 222, swpout fallback inc: 6, Fallback percentage: 2.63%
      Iteration 42: swpout inc: 221, swpout fallback inc: 2, Fallback percentage: 0.90%
      Iteration 43: swpout inc: 215, swpout fallback inc: 7, Fallback percentage: 3.15%
      Iteration 47: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
      Iteration 49: swpout inc: 217, swpout fallback inc: 1, Fallback percentage: 0.46%
      Iteration 52: swpout inc: 221, swpout fallback inc: 8, Fallback percentage: 3.49%
      Iteration 56: swpout inc: 224, swpout fallback inc: 4, Fallback percentage: 1.75%
      Iteration 58: swpout inc: 214, swpout fallback inc: 5, Fallback percentage: 2.28%
      Iteration 62: swpout inc: 220, swpout fallback inc: 3, Fallback percentage: 1.35%
      Iteration 64: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
      Iteration 67: swpout inc: 221, swpout fallback inc: 1, Fallback percentage: 0.45%
      Iteration 75: swpout inc: 220, swpout fallback inc: 9, Fallback percentage: 3.93%
      Iteration 82: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
      Iteration 86: swpout inc: 211, swpout fallback inc: 12, Fallback percentage: 5.38%
      Iteration 89: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
      Iteration 93: swpout inc: 220, swpout fallback inc: 1, Fallback percentage: 0.45%
      Iteration 94: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
      Iteration 96: swpout inc: 221, swpout fallback inc: 6, Fallback percentage: 2.64%
      Iteration 98: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
      Iteration 99: swpout inc: 227, swpout fallback inc: 3, Fallback percentage: 1.30%
      
      $ ./thp_swap_allocator_test
      Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
      Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53%
      Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58%
      Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34%
      Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51%
      Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84%
      Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91%
      Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05%
      Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25%
      Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
      Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01%
      Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45%
      Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98%
      Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64%
      Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36%
      Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02%
      Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07%
      
      $ ./thp_swap_allocator_test -s
      Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
      Iteration 2: swpout inc: 97, swpout fallback inc: 135, Fallback percentage: 58.19%
      Iteration 3: swpout inc: 42, swpout fallback inc: 192, Fallback percentage: 82.05%
      Iteration 4: swpout inc: 19, swpout fallback inc: 214, Fallback percentage: 91.85%
      Iteration 5: swpout inc: 12, swpout fallback inc: 213, Fallback percentage: 94.67%
      Iteration 6: swpout inc: 11, swpout fallback inc: 217, Fallback percentage: 95.18%
      Iteration 7: swpout inc: 9, swpout fallback inc: 214, Fallback percentage: 95.96%
      Iteration 8: swpout inc: 8, swpout fallback inc: 213, Fallback percentage: 96.38%
      Iteration 9: swpout inc: 2, swpout fallback inc: 223, Fallback percentage: 99.11%
      Iteration 10: swpout inc: 2, swpout fallback inc: 228, Fallback percentage: 99.13%
      Iteration 11: swpout inc: 4, swpout fallback inc: 214, Fallback percentage: 98.17%
      Iteration 12: swpout inc: 5, swpout fallback inc: 226, Fallback percentage: 97.84%
      Iteration 13: swpout inc: 3, swpout fallback inc: 212, Fallback percentage: 98.60%
      Iteration 14: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
      Iteration 15: swpout inc: 3, swpout fallback inc: 222, Fallback percentage: 98.67%
      Iteration 16: swpout inc: 4, swpout fallback inc: 223, Fallback percentage: 98.24%
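
      The fallback percentage printed above appears to be the fallback
      increment divided by the sum of both increments; for example, iteration 2
      of the last run gives 135 / (97 + 135) = 58.19%.  A minimal sketch of
      that calculation follows.  The helper name is hypothetical and is not
      taken from the test program:

      ```c
      /*
       * Hypothetical helper: share of swap-out attempts that fell back to
       * splitting the large folio, in percent.  Returns 0 when nothing was
       * swapped out in the interval.
       */
      static double fallback_percentage(unsigned long swpout_inc,
                                        unsigned long fallback_inc)
      {
          unsigned long total = swpout_inc + fallback_inc;

          if (total == 0)
              return 0.0;
          return 100.0 * (double)fallback_inc / (double)total;
      }
      ```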
      
      =========
      Kernel compile under tmpfs with cgroup memory.max = 470M.
      12 cores, 24 hyperthreads, 32 jobs.  10 runs for each group.
      
      SSD swap 10 runs average, 20G swap partition:
      With:
      user    2929.064
      system  1479.381 : 1376.89 1398.22 1444.64 1477.39 1479.04 1497.27
      1504.47 1531.4 1532.92 1551.57
      real    1441.324
      
      Without:
      user    2910.872
      system  1482.732 : 1440.01 1451.4 1462.01 1467.47 1467.51 1469.3
      1470.19 1496.32 1544.1 1559.01
      real    1580.822
      
      Two zram swap: zram0 3.0G zram1 20G.
      
      The idea is to force zram0 to become almost full and then overflow to zram1:
      
      With:
      user    4320.301
      system  4272.403 : 4236.24 4262.81 4264.75 4269.13 4269.44 4273.06
      4279.85 4285.98 4289.64 4293.13
      real    431.759
      
      Without:
      user    4301.393
      system  4387.672 : 4374.47 4378.3 4380.95 4382.84 4383.06 4388.05
      4389.76 4397.16 4398.23 4403.9
      real    433.979
      
      ------ More test results from Kairui ----------
      
      Test: build the Linux kernel using a 4G ZRAM with a 1G memory.max limit, on top of shmem:
      
      System info: 32 Core AMD Zen2, 64G total memory.
      
      Test 3 times using only 4K pages:
      =================================
      
      With:
      -----
      1838.74user 2411.21system 2:37.86elapsed 2692%CPU (0avgtext+0avgdata 847060maxresident)k
      1839.86user 2465.77system 2:39.35elapsed 2701%CPU (0avgtext+0avgdata 847060maxresident)k
      1840.26user 2454.68system 2:39.43elapsed 2693%CPU (0avgtext+0avgdata 847060maxresident)k
      
      Summary (~4.6% improvement of system time):
      User: 1839.62
      System: 2443.89: 2465.77 2454.68 2411.21
      Real: 158.88
      
      Without:
      --------
      1837.99user 2575.95system 2:43.09elapsed 2706%CPU (0avgtext+0avgdata 846520maxresident)k
      1838.32user 2555.15system 2:42.52elapsed 2709%CPU (0avgtext+0avgdata 846520maxresident)k
      1843.02user 2561.55system 2:43.35elapsed 2702%CPU (0avgtext+0avgdata 846520maxresident)k
      
      Summary:
      User: 1839.78
      System: 2564.22: 2575.95 2555.15 2561.55
      Real: 162.99
      
      Test 5 times with all mTHP sizes enabled:
      =========================================
      
      With:
      -----
      1796.44user 2937.33system 2:59.09elapsed 2643%CPU (0avgtext+0avgdata 846936maxresident)k
      1802.55user 3002.32system 2:54.68elapsed 2750%CPU (0avgtext+0avgdata 847072maxresident)k
      1806.59user 2986.53system 2:55.17elapsed 2736%CPU (0avgtext+0avgdata 847092maxresident)k
      1803.27user 2982.40system 2:54.49elapsed 2742%CPU (0avgtext+0avgdata 846796maxresident)k
      1807.43user 3036.08system 2:56.06elapsed 2751%CPU (0avgtext+0avgdata 846488maxresident)k
      
      Summary (~8.4% improvement of system time):
      User: 1803.25
      System: 2988.93: 2937.33 3002.32 2986.53 2982.40 3036.08
      Real: 175.90
      
      mTHP swapout status:
      /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout:347721
      /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout_fallback:3110
      /sys/kernel/mm/transparent_hugepage/hugepages-512kB/stats/swpout:3365
      /sys/kernel/mm/transparent_hugepage/hugepages-512kB/stats/swpout_fallback:8269
      /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/stats/swpout:24
      /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/stats/swpout_fallback:3341
      /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/stats/swpout:145
      /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/stats/swpout_fallback:5038
      /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout:322737
      /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback:36808
      /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout:380455
      /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout_fallback:1010
      /sys/kernel/mm/transparent_hugepage/hugepages-256kB/stats/swpout:24973
      /sys/kernel/mm/transparent_hugepage/hugepages-256kB/stats/swpout_fallback:13223
      /sys/kernel/mm/transparent_hugepage/hugepages-128kB/stats/swpout:197348
      /sys/kernel/mm/transparent_hugepage/hugepages-128kB/stats/swpout_fallback:80541
      
      Without:
      --------
      1794.41user 3151.29system 3:05.97elapsed 2659%CPU (0avgtext+0avgdata 846704maxresident)k
      1810.27user 3304.48system 3:05.38elapsed 2759%CPU (0avgtext+0avgdata 846636maxresident)k
      1809.84user 3254.85system 3:03.83elapsed 2755%CPU (0avgtext+0avgdata 846952maxresident)k
      1813.54user 3259.56system 3:04.28elapsed 2752%CPU (0avgtext+0avgdata 846848maxresident)k
      1829.97user 3338.40system 3:07.32elapsed 2759%CPU (0avgtext+0avgdata 847024maxresident)k
      
      Summary:
      User: 1811.61
      System: 3261.72 : 3151.29 3304.48 3254.85 3259.56 3338.40
      Real: 185.356
      
      mTHP swapout status:
      hugepages-32kB/stats/swpout:35630
      hugepages-32kB/stats/swpout_fallback:1809908
      hugepages-512kB/stats/swpout:523
      hugepages-512kB/stats/swpout_fallback:55235
      hugepages-2048kB/stats/swpout:53
      hugepages-2048kB/stats/swpout_fallback:17264
      hugepages-1024kB/stats/swpout:85
      hugepages-1024kB/stats/swpout_fallback:24979
      hugepages-64kB/stats/swpout:30117
      hugepages-64kB/stats/swpout_fallback:1825399
      hugepages-16kB/stats/swpout:42775
      hugepages-16kB/stats/swpout_fallback:1951123
      hugepages-256kB/stats/swpout:2326
      hugepages-256kB/stats/swpout_fallback:170165
      hugepages-128kB/stats/swpout:17925
      hugepages-128kB/stats/swpout_fallback:1309757
      
      
      This patch (of 9):
      
      Previously, the swap cluster used a cluster index as a pointer to
      construct a custom singly linked list type "swap_cluster_list".  The next
      cluster pointer is shared with cluster->count, which prevents putting a
      non-free cluster on a list.
      
      Change the cluster to use the standard doubly linked list instead.  This
      allows tracking the nonfull clusters in the follow-up patch, which makes
      it faster to find a nonfull cluster of a given order.
      
      Remove the cluster getter/setter for accessing the cluster struct member.
      
      The list operation is protected by the swap_info_struct->lock.
      
      Change the cluster code to use "struct swap_cluster_info *" to reference
      a cluster rather than an index.  That is more consistent with the list
      manipulation, avoids repeatedly adding the index to cluster_info, and
      makes the code easier to understand.
      
      Remove the flag that marks the cluster next pointer as NULL; the doubly
      linked list handles the empty list well.
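
      To illustrate why no NULL flag is needed, here is a minimal userspace
      sketch of a circular doubly linked list in the style of the kernel's
      list_head.  All names below (list_node, cluster, cluster_move, and so on)
      are hypothetical stand-ins, not the kernel's definitions.  An empty list
      is simply a head that points to itself, and a cluster referenced by
      pointer can be moved between lists in constant time:

      ```c
      #include <stdbool.h>
      #include <stddef.h>

      /* Hypothetical stand-in for the kernel's struct list_head. */
      struct list_node {
          struct list_node *prev, *next;
      };

      /* Hypothetical, simplified cluster; the real struct has other members. */
      struct cluster {
          unsigned int count;     /* used entries in this cluster */
          struct list_node list;  /* links the cluster into one list */
      };

      /* An empty list (or an unlinked node) points to itself. */
      static void list_init(struct list_node *head)
      {
          head->prev = head->next = head;
      }

      static bool list_is_empty(const struct list_node *head)
      {
          return head->next == head;
      }

      static void list_add_tail(struct list_node *node, struct list_node *head)
      {
          node->prev = head->prev;
          node->next = head;
          head->prev->next = node;
          head->prev = node;
      }

      /* Unlink a node; no list head or index lookup is required. */
      static void list_del(struct list_node *node)
      {
          node->prev->next = node->next;
          node->next->prev = node->prev;
          list_init(node);
      }

      /* Recover the cluster that embeds a given list node. */
      static struct cluster *cluster_of(struct list_node *node)
      {
          return (struct cluster *)((char *)node - offsetof(struct cluster, list));
      }

      /* Move a cluster from whatever list it is on to the tail of another. */
      static void cluster_move(struct cluster *ci, struct list_node *to)
      {
          list_del(&ci->list);
          list_add_tail(&ci->list, to);
      }
      ```

      In this sketch a cluster taken from a free list can be moved to a
      nonfull list with a single cluster_move() call, which is the kind of
      operation the shared next/count field made impossible.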
      
      The "swap_cluster_info" struct becomes two pointers bigger.  Because 512
      swap entries share one swap_cluster_info struct, this has very little
      impact on the average memory usage per swap entry.  For a 1TB swapfile,
      the swap cluster data structure increases from 8MB to 24MB.
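
      The totals can be checked with simple arithmetic.  The sketch below is
      only illustrative: the helper name is hypothetical, and the 4KiB page
      size, 8-byte pointers and per-struct sizes are assumptions that the
      message does not state.  Under those assumptions, a struct growing from
      8 to 24 bytes (two extra pointers) gives exactly 8MB and 24MB for 1TB of
      swap with 256 entries per cluster; with 512 entries per cluster both
      totals would be half as large.

      ```c
      #include <stdint.h>

      /*
       * Hypothetical helper: total bytes of per-cluster metadata for one
       * swap device.  Every parameter is an assumption supplied by the
       * caller.
       */
      static uint64_t cluster_metadata_bytes(uint64_t swap_bytes,
                                             uint64_t page_bytes,
                                             uint64_t entries_per_cluster,
                                             uint64_t struct_bytes)
      {
          uint64_t entries = swap_bytes / page_bytes;
          uint64_t clusters = entries / entries_per_cluster;

          return clusters * struct_bytes;
      }
      ```

      For example, cluster_metadata_bytes(1ULL << 40, 4096, 256, 24) is
      24 << 20 bytes, which spread over 256 entries amounts to well under one
      extra bit per swapped page.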
      
      Other than the list conversion, there is no real functional change in
      this patch.
      
      Link: https://lkml.kernel.org/r/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org
      Link: https://lkml.kernel.org/r/20240730-swap-allocator-v5-1-cb9c148b9297@kernel.org
      Signed-off-by: Chris Li <chrisl@kernel.org>
      Reported-by: Barry Song <21cnbao@gmail.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kairui Song <kasong@tencent.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      73ed0baa
  2. 02 Sep, 2024 27 commits