1. 26 Apr, 2024 1 commit
    • mm: page_alloc: control latency caused by zone PCP draining · 55f77df7
      Lucas Stach authored
      Patch series "mm/treewide: Remove pXd_huge() API", v2.
      
      In previous work [1], we removed the pXd_large() API, which is arch
      specific.  This patchset further removes the hugetlb pXd_huge() API.
      
      Hugetlb was never special in creating huge mappings compared with other
      huge mappings.  Having a standalone API just to detect such pgtable
      entries is more or less redundant, especially now that the pXd_leaf()
      API set is available with or without CONFIG_HUGETLB_PAGE.

      While looking at this, a few issues were also exposed showing that we
      don't have a clear definition of the *_huge() variant APIs.  This
      patchset starts by cleaning up those issues, then replaces all *_huge()
      users with *_leaf(), and finally drops all *_huge() code.
      
      On x86/sparc, swap entries are reported "true" by pXd_huge(), while all
      the other archs report "false".  This part is handled in patches 1-5; I
      suspect patch 1 can be seen as a bug fix, but I'll leave that to the hmm
      experts to decide.

      Besides, three archs (arm, arm64, powerpc) have slightly different
      definitions between the *_huge() and *_leaf() variants.  I tackled them
      separately so that it will be easier for arch experts to chime in when
      necessary.  This part is done in patches 6-9.
      
      The final patches 10-14 do the removal itself.  Since *_leaf() will be
      the ultimate API going forward, and there has been quite some confusion
      about how the *_huge() APIs should be defined, they also provide a rich
      comment for the *_leaf() API set to define it properly and avoid future
      misuse.  Hopefully that will also help new archs start supporting huge
      mappings while avoiding the traps (such as swap entries or PROT_NONE
      entry checks).
      
      [1] https://lore.kernel.org/r/20240305043750.93762-1-peterx@redhat.com
      
      
      This patch (of 14):
      
      When the complete PCP is drained, a much larger number of pages than the
      usual batch size might be freed at once, causing large IRQ and preemption
      latency spikes, as they are all freed while holding the pcp and zone
      spinlocks.
      
      To avoid those latency spikes, limit the number of pages freed in a single
      bulk operation to common batch limits.
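
      A minimal sketch of the approach, assuming the drain loop reuses the
      existing per-chunk batch limit (pcp->batch scaled by
      CONFIG_PCP_BATCH_SCALE_MAX) and drops the pcp lock between chunks; the
      exact upstream hunk may differ:

      static void drain_pages_zone(unsigned int cpu, struct zone *zone)
      {
      	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
      	int count;

      	do {
      		/* Free at most one scaled batch per lock hold. */
      		spin_lock(&pcp->lock);
      		count = pcp->count;
      		if (count) {
      			int to_drain = min(count,
      				pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);

      			free_pcppages_bulk(zone, to_drain, pcp, 0);
      			count -= to_drain;
      		}
      		spin_unlock(&pcp->lock);
      	} while (count);
      }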
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20240318200736.2835502-1-l.stach@pengutronix.de
      
      Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      55f77df7
  2. 06 Mar, 2024 1 commit
  3. 05 Mar, 2024 14 commits
  4. 24 Feb, 2024 3 commits
  5. 22 Feb, 2024 1 commit
  6. 08 Jan, 2024 2 commits
  7. 29 Dec, 2023 1 commit
  8. 20 Dec, 2023 1 commit
  9. 11 Dec, 2023 4 commits
    • mm: page_alloc: unreserve highatomic page blocks before oom · ac3f3b0a
      Charan Teja Kalla authored
      __alloc_pages_direct_reclaim() is called from the slowpath allocation
      where high atomic reserves can be unreserved after there is progress in
      reclaim and yet no suitable page is found.  Later, should_reclaim_retry()
      gets called from the slowpath allocation to decide if reclaim needs to be
      retried before the OOM kill path is taken.
      
      should_reclaim_retry() checks the available memory (reclaimable + free
      pages) against the min wmark levels of a zone and returns:

      a) true, if it is above the min wmark, so that the slowpath allocation
         will do the reclaim retries.

      b) false, so the slowpath allocation takes the OOM kill path.
      
      should_reclaim_retry() can also unreserve the high atomic reserves **but
      only after all the reclaim retries are exhausted.**
      
      In a case where there is almost no reclaimable memory and the free pages
      consist mostly of high atomic reserves that the allocation context cannot
      use, the available memory drops below the min wmark levels, so
      should_reclaim_retry() returns false and the allocation request takes the
      OOM kill path.  This can turn into an early OOM kill if the high atomic
      reserves are holding a lot of free memory and unreserving them is never
      attempted.
      
      An (early) OOM is encountered on a VM with the below state:
      [  295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB
      high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB
      active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB
      present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kB
      local_pcp:492kB free_cma:0kB
      [  295.998656] lowmem_reserve[]: 0 32
      [  295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH)
      33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
      0*4096kB = 7752kB
      
      Per the above log, ~7MB of free memory sitting in the high atomic
      reserves is not freed up before falling back to the OOM kill path.
      
      Fix it by trying to unreserve the high atomic reserves in
      should_reclaim_retry() before __alloc_pages_direct_reclaim() can fall
      back to the OOM kill path.
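
      A rough sketch of the idea in should_reclaim_retry(), assuming the
      existing unreserve_highatomic_pageblock() helper is reused (not the
      exact upstream diff):

      	/*
      	 * Before concluding that reclaim cannot make progress, try to
      	 * release the high atomic reserves so their pages become usable
      	 * for this allocation, and retry in that case.
      	 */
      	if (!__zone_watermark_ok(zone, order, min_wmark_pages(zone),
      				 ac->highest_zoneidx, alloc_flags, available)) {
      		if (unreserve_highatomic_pageblock(ac, false))
      			return true;	/* reserves released, worth retrying */
      	}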
      
      Link: https://lkml.kernel.org/r/1700823445-27531-1-git-send-email-quic_charante@quicinc.com
      Fixes: 0aaa29a5
      
       ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand")
      Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
      Reported-by: Chris Goldsworthy <quic_cgoldswo@quicinc.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Chris Goldsworthy <quic_cgoldswo@quicinc.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ac3f3b0a
    • mm: page_alloc: enforce minimum zone size to do high atomic reserves · 9cd20f3f
      Charan Teja Kalla authored
      High atomic reserves are set to roughly 1% of the zone at maximum and one
      pageblock size at minimum.  We encountered a system with the below
      configuration:
      Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB
      reserved_highatomic:8192KB managed:49224kB
      
      On such systems, even a single pageblock makes the high atomic reserves
      ~8% of the zone memory.  This high value can easily exert pressure on
      the zone.
      
      Per discussion with Michal and Mel, it is not very useful to reserve
      memory for high atomic allocations on such small systems [1].  Since the
      minimum size for high atomic reserves is always going to be a pageblock,
      don't reserve memory for high atomic allocations if 1% of the zone
      managed pages is below the pageblock size.  Thanks to Michal for this
      suggestion [2].
      
      Since no memory will now be reserved for high atomic allocations on such
      systems, this patch can be reverted if the respective allocation failures
      are seen.
      
      [1] https://lore.kernel.org/linux-mm/20231117161956.d3yjdxhhm4rhl7h2@techsingularity.net/
      [2] https://lore.kernel.org/linux-mm/ZVYRJMUitykepLRy@tiehlicka/
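
      A minimal sketch of the early bail-out described above, assuming it sits
      at the top of reserve_highatomic_pageblock():

      	/*
      	 * On zones so small that 1% of the managed pages is less than one
      	 * pageblock, skip high atomic reservation entirely.
      	 */
      	if (zone_managed_pages(zone) / 100 < pageblock_nr_pages)
      		return;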
      
      Link: https://lkml.kernel.org/r/c3a2a48e2cfe08176a80eaf01c110deb9e918055.1700821416.git.quic_charante@quicinc.com
      
      Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9cd20f3f
    • mm: page_alloc: correct high atomic reserve calculations · d68e39fc
      Charan Teja Kalla authored
      Patch series "mm: page_alloc: fixes for high atomic reserve
      caluculations", v3.
      
      The state of the system where the issue was exposed, as shown in the OOM
      kill logs:
      
      [  295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kB local_pcp:492kB free_cma:0kB
      [  295.998656] lowmem_reserve[]: 0 32
      [  295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH)
      33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 7752kB
      
      From the above, it is seen that ~8MB of memory (about 16% of the zone) is
      reserved for high atomic reserves, against the expectation of ~1%
      reserves; this is fixed in the 1st patch.
      
      Don't reserve the high atomic page blocks if 1% of zone memory size is
      below a pageblock size.
      
      
      This patch (of 2):
      
      reserve_highatomic_pageblock() aims to reserve 1% of the managed pages
      of a zone for high-order atomic allocations.
      
      It uses the below calculation to reserve:
      static void reserve_highatomic_pageblock(struct page *page, ....) {
      
         .......
         max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages;
      
         if (zone->nr_reserved_highatomic >= max_managed)
             goto out;
      
         zone->nr_reserved_highatomic += pageblock_nr_pages;
         set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
         move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);
      
      out:
         ....
      }
      
      Since we always add 1% of the zone managed pages count to
      pageblock_nr_pages, the effective minimum turns into 2 pageblocks,
      because nr_reserved_highatomic is incremented/decremented in
      pageblock-sized units.
      
      Encountered a system (actually a VM running on the Linux kernel) with the
      below zone configuration:
      Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB
      reserved_highatomic:8192KB managed:49224kB
      
      The existing calculation makes it reserve 8MB (with a pageblock size of
      4MB), i.e. ~16% of the zone managed memory.  Reserving such a high amount
      of memory can easily exert memory pressure on the system and thus may
      lead to unnecessary reclaim until the high atomic reserves are
      unreserved.
      
      Since high atomic reserves are managed in pageblock-sized granules
      (MIGRATE_HIGHATOMIC is set per pageblock), fix the calculation so that
      the minimum is one pageblock and the maximum is approximately 1% of the
      zone managed pages.
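
      A sketch of the corrected limit, assuming ALIGN() is used to round 1% of
      the managed pages up to whole pageblocks (so the effective minimum is one
      pageblock instead of two):

      	/*
      	 * Cap the reserves at ~1% of the zone, rounded up to a whole
      	 * pageblock, instead of 1% plus an extra pageblock.
      	 */
      	max_managed = ALIGN(zone_managed_pages(zone) / 100,
      			    pageblock_nr_pages);

      	if (zone->nr_reserved_highatomic >= max_managed)
      		goto out;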
      
      Link: https://lkml.kernel.org/r/cover.1700821416.git.quic_charante@quicinc.com
      Link: https://lkml.kernel.org/r/1660034138397b82a0a8b6ae51cbe96bd583d89e.1700821416.git.quic_charante@quicinc.com
      
      Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d68e39fc
    • mm/page_alloc: dedupe some memcg uncharging logic · 17b46e7b
      Brendan Jackman authored
      The duplication makes it seem like some work is required before uncharging
      in the !PageHWPoison case.  But it isn't, so we can simplify the code a
      little.
      
      Note the PageMemcgKmem check is redundant, but I've left it in as it
      avoids an unnecessary function call.
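
      A hedged sketch of the shape of the deduplicated path (the helper names
      are the ones used elsewhere in mm; the actual hunk in
      free_pages_prepare() may differ):

      	/*
      	 * Uncharge the kmem page once, up front, instead of repeating the
      	 * same uncharge in both the hwpoison and the regular free paths.
      	 */
      	if (memcg_kmem_online() && PageMemcgKmem(page))
      		__memcg_kmem_uncharge_page(page, order);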
      
      Link: https://lkml.kernel.org/r/20231108164920.3401565-1-jackmanb@google.com
      
      Signed-off-by: Brendan Jackman <jackmanb@google.com>
      Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      17b46e7b
  10. 26 Nov, 2023 1 commit
    • mm/page_pool: catch page_pool memory leaks · dba1b8a7
      Jesper Dangaard Brouer authored
      
      Pages belonging to a page_pool (PP) instance must be freed through the
      PP APIs in order to correctly release any DMA mappings and drop the
      refcount on the DMA device when the PP instance is freed.  When PP
      releases a page (page_pool_release_page()), the page->pp_magic value is
      cleared.
      
      This patch detects a leaked PP page in free_page_is_bad() via the
      unexpected state of the page->pp_magic value still being PP_SIGNATURE.
      
      We choose to report and treat it as a bad page.  It would be possible
      to release the page by returning it to the PP instance, as the
      page->pp pointer is likely still valid.
      
      Notice this code is only active when CONFIG_PAGE_POOL is enabled and
      the kernel is either compiled with CONFIG_DEBUG_VM or booted with
      debug_pagealloc=on on the command line.
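
      A sketch of the kind of check described, assuming it is gated on
      CONFIG_PAGE_POOL and compares page->pp_magic against PP_SIGNATURE; the
      exact masking and placement inside free_page_is_bad() may differ:

      #ifdef CONFIG_PAGE_POOL
      	/*
      	 * A page still carrying the page_pool signature was freed through
      	 * the normal page allocator path instead of the PP APIs -- report
      	 * it as a bad page ("page_pool leak").
      	 */
      	if (unlikely((page->pp_magic & ~0x3UL) == PP_SIGNATURE))
      		bad_page(page, "page_pool leak");
      #endif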
      
      Reduced example output of leak with PP_SIGNATURE = dead000000000040:
      
       BUG: Bad page state in process swapper/4  pfn:141fa6
       page:000000006dbf8062 refcount:0 mapcount:0 mapping:0000000000000000 index:0x141fa6000 pfn:0x141fa6
       flags: 0x2fffff80000000(node=0|zone=2|lastcpupid=0x1fffff)
       page_type: 0xffffffff()
       raw: 002fffff80000000 dead000000000040 ffff88814888a000 0000000000000000
       raw: 0000000141fa6000 0000000000000001 00000000ffffffff 0000000000000000
       page dumped because: page_pool leak
       [...]
       Call Trace:
        <IRQ>
        dump_stack_lvl+0x32/0x50
        bad_page+0x70/0xf0
        free_unref_page_prepare+0x263/0x430
        free_unref_page+0x34/0x130
        mlx5e_free_rx_mpwqe+0x190/0x1c0 [mlx5_core]
        mlx5e_post_rx_mpwqes+0x1ac/0x280 [mlx5_core]
        mlx5e_napi_poll+0x12b/0x710 [mlx5_core]
        ? skb_free_head+0x4f/0x90
        __napi_poll+0x2b/0x1c0
        net_rx_action+0x27b/0x360
      
      The advantage is that the Call Trace points directly to the function
      leaking the PP page, which in this case is an on-purpose bug introduced
      into the mlx5 driver to test this code change.
      
      Currently PP will periodically print the warning "stalled pool shutdown"
      from page_pool_release_retry(), which cannot be directly correlated to a
      leak and might just as well be a false positive due to SKBs being stuck
      on a socket for an extended period.  After this patch we should be able
      to remove that printk.
      Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dba1b8a7
  11. 25 Oct, 2023 11 commits
    • mm: add page_rmappable_folio() wrapper · 23e48832
      Hugh Dickins authored
      folio_prep_large_rmappable() is being used repeatedly along with a
      conversion from page to folio, a check for non-NULL, and a check for
      order > 1: wrap it all up into struct folio *page_rmappable_folio(struct
      page *).
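
      A sketch of the wrapper being described, assuming it simply folds the
      page-to-folio cast, the NULL check, and the order > 1 check together:

      static inline struct folio *page_rmappable_folio(struct page *page)
      {
      	struct folio *folio = (struct folio *)page;

      	/* Only large (order > 1) folios need the rmappable preparation. */
      	if (folio && folio_order(folio) > 1)
      		folio_prep_large_rmappable(folio);
      	return folio;
      }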
      
      Link: https://lkml.kernel.org/r/8d92c6cf-eebe-748-e29c-c8ab224c741@google.com
      
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tejun heo <tj@kernel.org>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      23e48832
    • mm: page_alloc: check the order of compound page even when the order is zero · 76f26535
      Hyesoo Yu authored
      For compound pages, the head sets the PG_head flag and the tail sets the
      compound_head to indicate the head page.  If a user allocates a compound
      page and frees it with a different order, the compound page information
      will not be properly initialized.  To detect this problem,
      compound_order(page) and the order argument are compared, but this is not
      checked when the order argument is zero.  That error should be checked
      regardless of the order.
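
      One way to express the intended check in the free path, as a hedged
      sketch (the upstream hunk restructures the checks in
      free_pages_prepare(); the names here are the generic ones):

      	/*
      	 * A compound page freed with a mismatched order -- including order
      	 * zero -- means the caller is freeing it incorrectly.
      	 */
      	if (unlikely(PageCompound(page) && compound_order(page) != order))
      		bad_page(page, "unexpected compound order on free");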
      
      Link: https://lkml.kernel.org/r/20231023083217.1866451-1-hyesoo.yu@samsung.com
      
      Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
      Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      76f26535
    • mm: page_alloc: skip memoryless nodes entirely · c2baef39
      Qi Zheng authored
      Patch series "handle memoryless nodes more appropriately", v3.
      
      Currently, during initialization or memory offlining, memoryless nodes
      are still built into their own fallback lists or those of other nodes.

      This is not what we expect, so this patch series removes memoryless
      nodes from the fallback lists entirely.
      
      
      This patch (of 2):
      
      In find_next_best_node(), we skipped memoryless nodes when building the
      zonelists of other normal nodes (N_NORMAL), but did not skip the
      memoryless node itself when building its own zonelist.  This causes it
      to be traversed at runtime.
      
      For example, say we have node0 and node1, where node0 is a memoryless
      node; the fallback order of node0 and node1 is then as follows:
      
      [    0.153005] Fallback order for Node 0: 0 1
      [    0.153564] Fallback order for Node 1: 1
      
      After this patch, we skip memoryless node0 entirely, and the fallback
      order of node0 and node1 becomes:
      
      [    0.155236] Fallback order for Node 0: 1
      [    0.155806] Fallback order for Node 1: 1
      
      So node0 becomes completely invisible, which reduces runtime overhead.
      
      In this way, we will not try to allocate pages from memoryless node0, so
      the panic mentioned in [1] is also fixed.  Even though that problem has
      been solved by dropping the NODE_MIN_SIZE constraint on x86 [2], it is
      better to fix it in core MM as well.
      
      [1]. https://lore.kernel.org/all/20230212110305.93670-1-zhengqi.arch@bytedance.com/
      [2]. https://lore.kernel.org/all/20231017062215.171670-1-rppt@kernel.org/
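
      A sketch of the fix in find_next_best_node(), assuming the local-node
      shortcut simply gains an N_MEMORY check so that a memoryless local node
      falls through to the normal search:

      	/*
      	 * Use the local node only if it actually has memory; otherwise
      	 * fall back to the best remote node that has memory.
      	 */
      	if (!node_isset(node, *used_node_mask) &&
      	    node_state(node, N_MEMORY)) {
      		node_set(node, *used_node_mask);
      		return node;
      	}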
      
      [zhengqi.arch@bytedance.com: update comment, per Ingo]
        Link: https://lkml.kernel.org/r/7300fc00a057eefeb9a68c8ad28171c3f0ce66ce.1697799303.git.zhengqi.arch@bytedance.com
      Link: https://lkml.kernel.org/r/cover.1697799303.git.zhengqi.arch@bytedance.com
      Link: https://lkml.kernel.org/r/cover.1697711415.git.zhengqi.arch@bytedance.com
      Link: https://lkml.kernel.org/r/157013e978468241de4a4c05d5337a44638ecb0e.1697711415.git.zhengqi.arch@bytedance.com
      
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c2baef39
    • mm, pcp: reduce detecting time of consecutive high order page freeing · 6ccdcb6d
      Huang Ying authored
      In the current PCP auto-tuning design, if the number of pages allocated
      is much larger than the number of pages freed on a CPU, the PCP high may
      become the maximal value even if the allocating/freeing depth is small,
      for example, in the sender of network workloads.  If a CPU was
      originally used as a sender and is then used as a receiver after a
      context switch, we need to fill the whole PCP up to the maximal high
      before PCP draining is triggered for consecutive high-order freeing.
      This hurts the performance of some network workloads.
      
      To solve this issue, in this patch, we track consecutive page freeing
      with a counter instead of relying on PCP draining, so we can detect
      consecutive page freeing much earlier.
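
      A hedged, simplified sketch of the counter-based detection (the
      free_count field appears in the struct layout shown later in this
      series; the real condition also factors in the flags tracked by the
      earlier patches):

      	/*
      	 * Count pages freed back-to-back on this CPU; once more than one
      	 * batch has been freed consecutively, treat further high-order
      	 * frees as bulk freeing without waiting for pcp->count to reach
      	 * pcp->high.
      	 */
      	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
      		pcp->free_count += (1 << order);
      	free_high = pcp->free_count >= batch;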
      
      On a 2-socket Intel server with 128 logical CPU, we tested
      SCTP_STREAM_MANY test case of netperf test suite with 64-pair processes. 
      With the patch, the network bandwidth improves 5.0%.  This restores the
      performance drop caused by PCP auto-tuning.
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-10-ying.huang@intel.com
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6ccdcb6d
    • mm, pcp: decrease PCP high if free pages < high watermark · 57c0419c
      Huang Ying authored
      One target of PCP is to minimize pages in PCP if the system free pages
      are too few.  To reach that target, when page reclaim is active for the
      zone (ZONE_RECLAIM_ACTIVE), we stop increasing PCP high in the
      allocating path, and decrease PCP high and free some pages in the
      freeing path.  But this may be too late, because the background page
      reclaim may already introduce latency for some workloads.  So, in this
      patch, during page allocation we detect whether the number of free pages
      of the zone is below the high watermark.  If so, we stop increasing PCP
      high in the allocating path, and decrease PCP high and free some pages
      in the freeing path.  With this, we can reduce the possibility of
      premature background page reclaim caused by a too large PCP.
      
      The high watermark check is done in the allocating path to reduce the
      overhead in the hotter freeing path.
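
      A hedged sketch of the allocation-path check (the zone flag name is
      taken from the description of this series and is an assumption here):

      	/*
      	 * Note on the zone when its free pages drop below the high
      	 * watermark, so PCP tuning stops growing pcp->high and the freeing
      	 * path starts shrinking it.
      	 */
      	if (zone_page_state(zone, NR_FREE_PAGES) < high_wmark_pages(zone))
      		set_bit(ZONE_BELOW_HIGH, &zone->flags);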
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-9-ying.huang@intel.com
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      57c0419c
    • mm: tune PCP high automatically · 51a755c5
      Huang Ying authored
      The targets of tuning PCP high automatically are as follows:

      - Minimize allocation/freeing from/to the shared zone

      - Minimize idle pages in PCP

      - Minimize pages in PCP if the system free pages are too few

      To reach these targets, the following tuning algorithm is designed:

      - When we refill the PCP by allocating from the zone, increase PCP high,
        because with a larger PCP we could have avoided allocating from the
        zone.
      
      - In periodic vmstat updating kworker (via refresh_cpu_vm_stats()),
        decrease PCP high to try to free possible idle PCP pages.
      
      - When page reclaiming is active for the zone, stop increasing PCP
        high in allocating path, decrease PCP high and free some pages in
        freeing path.
      
      So, the PCP high can be tuned to the page allocating/freeing depth of
      workloads eventually.
      
      One issue with the algorithm is that if the number of pages allocated is
      much larger than the number of pages freed on a CPU, the PCP high may
      reach the maximal value even if the allocating/freeing depth is small.
      But this isn't a severe issue, because there are no idle pages in this
      case.
      
      An alternative choice is to increase PCP high when we drain the PCP by
      trying to free pages to the zone, but not to increase PCP high during
      PCP refilling.  This can avoid the issue above.  But if the number of
      pages allocated is much smaller than the number of pages freed on a CPU,
      there will be many idle pages in the PCP and it is hard to free these
      idle pages.
      
      PCP high will be decreased by 1/8 (>> 3) periodically.  The value 1/8 is
      somewhat arbitrary; it is just there to make sure that idle PCP pages
      will be freed eventually.
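
      A minimal sketch of the periodic decay, assuming it runs from the vmstat
      kworker and never drops below pcp->high_min:

      	/* Decay pcp->high by 1/8 toward high_min on every vmstat update,
      	 * so idle PCP pages are eventually given back to the zone. */
      	if (pcp->high > pcp->high_min)
      		pcp->high = max(pcp->high - (pcp->high >> 3), pcp->high_min);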
      
      On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances
      in parallel (each with `make -j 28`) in 8 cgroup.  This simulates the
      kbuild server that is used by 0-Day kbuild service.  With the patch, the
      build time decreases 3.5%.  The cycles% of the spinlock contention (mostly
      for zone lock) decreases from 11.0% to 0.5%.  The number of PCP draining
      for high order pages freeing (free_high) decreases 65.6%.  The number of
      pages allocated from zone (instead of from PCP) decreases 83.9%.
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-8-ying.huang@intel.com
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Suggested-by: Mel Gorman <mgorman@techsingularity.net>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      51a755c5
    • mm: add framework for PCP high auto-tuning · 90b41691
      Huang Ying authored
      The page allocation performance requirements of different workloads are
      usually different.  So, we need to tune PCP (per-CPU pageset) high to
      optimize the workload page allocation performance.  Now, we have a system
      wide sysctl knob (percpu_pagelist_high_fraction) to tune PCP high by hand.
      But, it's hard to find out the best value by hand.  And one global
      configuration may not work best for the different workloads that run on
      the same system.  One solution to these issues is to tune PCP high of each
      CPU automatically.
      
      This patch adds the framework for PCP high auto-tuning.  With it,
      pcp->high of each CPU will be changed automatically by the tuning
      algorithm at runtime.  The minimal high (pcp->high_min) is the original
      PCP high value calculated based on the low watermark pages, while the
      maximal high (pcp->high_max) is the PCP high value when the
      percpu_pagelist_high_fraction sysctl knob is set to
      MIN_PERCPU_PAGELIST_HIGH_FRACTION, that is, the maximal pcp->high that
      can be set via the sysctl knob by hand.
      
      It's possible that PCP high auto-tuning doesn't work well for some
      workloads.  So, when PCP high is tuned by hand via the sysctl knob, the
      auto-tuning will be disabled.  The PCP high set by hand will be used
      instead.
      
      This patch only adds the framework, so pcp->high will always be set to
      pcp->high_min (the original default).  The actual auto-tuning algorithm
      will be added in the following patches in the series.
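
      A sketch of the invariant the framework maintains, assuming the tuned
      value is always clamped between the two bounds described above:

      	/*
      	 * Whatever the tuning algorithm proposes, pcp->high stays within
      	 * [high_min, high_max]; setting the sysctl knob by hand disables
      	 * auto-tuning and pins pcp->high to the hand-set value instead.
      	 */
      	pcp->high = clamp(new_high, pcp->high_min, pcp->high_max);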
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-7-ying.huang@intel.com
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      90b41691
    • mm, page_alloc: scale the number of pages that are batch allocated · c0a24239
      Huang Ying authored
      When a task is allocating a large number of order-0 pages, it may
      acquire the zone->lock multiple times, allocating pages in batches.
      This may cause unnecessary contention on the zone lock when allocating a
      very large number of pages.  This patch adapts the size of the batch
      based on the recent allocation pattern, scaling the batch size for
      subsequent allocations.
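
      A hedged sketch of the scaling idea (the alloc_factor field appears in
      the struct layout shown later in this series; the exact formula used
      upstream may differ):

      	/*
      	 * Grow the allocation batch exponentially while allocations keep
      	 * arriving in bulk, bounded by the configured maximum scale factor,
      	 * so the zone lock is taken fewer times.
      	 */
      	batch = min(pcp->batch << pcp->alloc_factor,
      		    pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
      	if (pcp->alloc_factor < CONFIG_PCP_BATCH_SCALE_MAX)
      		pcp->alloc_factor++;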
      
      On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances
      in parallel (each with `make -j 28`) in 8 cgroup.  This simulates the
      kbuild server that is used by 0-Day kbuild service.  With the patch, the
      cycles% of the spinlock contention (mostly for zone lock) decreases from
      12.6% to 11.0% (with PCP size == 367).
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-6-ying.huang@intel.com
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Suggested-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c0a24239
    • mm: restrict the pcp batch scale factor to avoid too long latency · 52166607
      Huang Ying authored
      In the page allocator, the PCP (Per-CPU Pageset) is refilled and drained
      in batches to increase page allocation throughput, reduce page
      allocation/freeing latency per page, and reduce zone lock contention.
      But a too large batch size will cause too long a maximal
      allocation/freeing latency, which may punish arbitrary users.  So the
      default batch size is chosen carefully (in zone_batchsize(), the value
      is 63 for zones > 1GB) to avoid that.
      
      In commit 3b12e7e9 ("mm/page_alloc: scale the number of pages that are
      batch freed"), the batch size is scaled up when a large number of pages
      are freed, to improve page freeing performance and reduce zone lock
      contention.  A similar optimization can be used for allocating a large
      number of pages too.
      
      To find a suitable max batch scale factor (that is, max effective batch
      size), some tests and measurements were done on several machines, as
      follows.
      
      A set of debug patches are implemented as follows,
      
      - Set PCP high to be 2 * batch to reduce the effect of PCP high
      
      - Disable free batch size scaling to get the raw performance.
      
      - The code with zone lock held is extracted from rmqueue_bulk() and
        free_pcppages_bulk() to 2 separate functions to make it easy to
        measure the function run time with ftrace function_graph tracer.
      
      - The batch size is hard coded to be 63 (default), 127, 255, 511,
        1023, 2047, 4095.
      
      Then will-it-scale/page_fault1 is used to generate the page
      allocation/freeing workload.  The page allocation/freeing throughput
      (page/s) is measured via will-it-scale.  The page allocation/freeing
      average latency (alloc/free latency avg, in us) and allocation/freeing
      latency at 99 percentile (alloc/free latency 99%, in us) are measured with
      ftrace function_graph tracer.
      
      The test results are as follows,
      
      Sapphire Rapids Server
      ======================
      Batch	throughput	free latency	free latency	alloc latency	alloc latency
      	page/s		avg / us	99% / us	avg / us	99% / us
      -----	----------	------------	------------	-------------	-------------
        63	513633.4	 2.33		 3.57		 2.67		  6.83
       127	517616.7	 4.35		 6.65		 4.22		 13.03
       255	520822.8	 8.29		13.32		 7.52		 25.24
       511	524122.0	15.79		23.42		14.02		 49.35
      1023	525980.5	30.25		44.19		25.36		 94.88
      2047	526793.6	59.39		84.50		45.22		140.81
      
      Ice Lake Server
      ===============
      Batch	throughput	free latency	free latency	alloc latency	alloc latency
      	page/s		avg / us	99% / us	avg / us	99% / us
      -----	----------	------------	------------	-------------	-------------
        63	620210.3	 2.21		 3.68		 2.02		 4.35
       127	627003.0	 4.09		 6.86		 3.51		 8.28
       255	630777.5	 7.70		13.50		 6.17		15.97
       511	633651.5	14.85		22.62		11.66		31.08
      1023	637071.1	28.55		42.02		20.81		54.36
      2047	638089.7	56.54		84.06		39.28		91.68
      
      Cascade Lake Server
      ===================
      Batch	throughput	free latency	free latency	alloc latency	alloc latency
      	page/s		avg / us	99% / us	avg / us	99% / us
      -----	----------	------------	------------	-------------	-------------
        63	404706.7	 3.29		  5.03		 3.53		  4.75
       127	422475.2	 6.12		  9.09		 6.36		  8.76
       255	411522.2	11.68		 16.97		10.90		 16.39
       511	428124.1	22.54		 31.28		19.86		 32.25
      1023	414718.4	43.39		 62.52		40.00		 66.33
      2047	429848.7	86.64		120.34		71.14		106.08
      
      Comet Lake Desktop
      ==================
      Batch	throughput	free latency	free latency	alloc latency	alloc latency
      	page/s		avg / us	99% / us	avg / us	99% / us
      -----	----------	------------	------------	-------------	-------------
        63	795183.13	 2.18		 3.55		 2.03		 3.05
       127	803067.85	 3.91		 6.56		 3.85		 5.52
       255	812771.10	 7.35		10.80		 7.14		10.20
       511	817723.48	14.17		27.54		13.43		30.31
      1023	818870.19	27.72		40.10		27.89		46.28
      
      Coffee Lake Desktop
      ===================
      Batch	throughput	free latency	free latency	alloc latency	alloc latency
      	page/s		avg / us	99% / us	avg / us	99% / us
      -----	----------	------------	------------	-------------	-------------
        63	510542.8	 3.13		  4.40		 2.48		 3.43
       127	514288.6	 5.97		  7.89		 4.65		 6.04
       255	516889.7	11.86		 15.58		 8.96		12.55
       511	519802.4	23.10		 28.81		16.95		26.19
      1023	520802.7	45.30		 52.51		33.19		45.95
      2047	519997.1	90.63		104.00		65.26		81.74
      
      From the above data, to restrict the allocation/freeing latency to less
      than 100 us in most cases, the max batch scale factor needs to be less
      than or equal to 5.

      Although it is reasonable to use 5 as the max batch scale factor for the
      systems tested, there are also slower systems, where a smaller value
      should be used to constrain the page allocation/freeing latency.
      
      So, in this patch, a new kconfig option (PCP_BATCH_SCALE_MAX) is added
      to set the max batch scale factor.  Its default value is 5, and users
      can reduce it when necessary.
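
      A sketch of how the new knob caps the effective batch, using the default
      scale factor of 5 (at most 63 << 5 = 2016 pages per zone-lock hold); the
      exact call sites differ:

      	/*
      	 * The scaled batch used for bulk freeing is never allowed to exceed
      	 * batch << CONFIG_PCP_BATCH_SCALE_MAX.
      	 */
      	int max_batch = pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX;
      	int to_free = min(pcp->count, max_batch);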
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-5-ying.huang@intel.com
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      52166607
    • mm, pcp: reduce lock contention for draining high-order pages · 362d37a1
      Huang Ying authored
      In commit f26b3fa0 ("mm/page_alloc: limit number of high-order pages
      on PCP during bulk free"), the PCP (Per-CPU Pageset) is drained when the
      PCP is mostly used for high-order page freeing, to improve the reuse of
      cache-hot pages between the page allocating and freeing CPUs.
      
      On systems with a small per-CPU data cache slice, pages shouldn't be
      cached before draining, to guarantee they stay cache-hot.  But on
      systems with a large per-CPU data cache slice, some pages can be cached
      before draining to reduce zone lock contention.
      
      So, in this patch, instead of draining without any caching, "pcp->batch"
      pages will be cached in the PCP before draining if the size of the
      per-CPU data cache slice is more than "3 * batch".

      In theory, if the size of the per-CPU data cache slice is more than
      "2 * batch", we can reuse cache-hot pages between CPUs.  But considering
      other usage of the cache (code, other data accesses, etc.), "3 * batch"
      is used.
      
      Note: "3 * batch" is chosen to make sure the optimization works on recent
      x86_64 server CPUs.  If you want to increase it, please check whether it
      breaks the optimization.
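
      A hedged sketch of the condition (per_cpu_data_cache_slice() is a
      stand-in name here; the real per-CPU data cache slice value is added by
      the cacheinfo patch of this series):

      	/*
      	 * If the per-CPU data cache slice is large enough to keep roughly
      	 * 3 batches of pages hot, leave pcp->batch pages cached in the PCP
      	 * when draining for free_high; otherwise drain fully.
      	 */
      	if (per_cpu_data_cache_slice(cpu) > 3 * pcp->batch * PAGE_SIZE)
      		high = pcp->batch;	/* keep one batch cached */
      	else
      		high = 0;		/* drain everything */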
      
      On a 2-socket Intel server with 128 logical CPU, with the patch, the
      network bandwidth of the UNIX (AF_UNIX) test case of lmbench test suite
      with 16-pair processes increase 70.5%.  The cycles% of the spinlock
      contention (mostly for zone lock) decreases from 46.1% to 21.3%.  The
      number of PCP draining for high order pages freeing (free_high) decreases
      89.9%.  The cache miss rate keeps 0.2%.
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-4-ying.huang@intel.com
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      362d37a1
    • mm, pcp: avoid to drain PCP when process exit · ca71fe1a
      Huang Ying authored
      Patch series "mm: PCP high auto-tuning", v3.
      
      The page allocation performance requirements of different workloads are
      often different.  So, we need to tune the PCP (Per-CPU Pageset) high on
      each CPU automatically to optimize the page allocation performance.
      
      The list of patches in series is as follows,
      
      [1/9] mm, pcp: avoid to drain PCP when process exit
      [2/9] cacheinfo: calculate per-CPU data cache size
      [3/9] mm, pcp: reduce lock contention for draining high-order pages
      [4/9] mm: restrict the pcp batch scale factor to avoid too long latency
      [5/9] mm, page_alloc: scale the number of pages that are batch allocated
      [6/9] mm: add framework for PCP high auto-tuning
      [7/9] mm: tune PCP high automatically
      [8/9] mm, pcp: decrease PCP high if free pages < high watermark
      [9/9] mm, pcp: reduce detecting time of consecutive high order page freeing
      
      Patch [1/9], [2/9], [3/9] optimize the PCP draining for consecutive
      high-order pages freeing.
      
      Patch [4/9], [5/9] optimize batch freeing and allocating.
      
      Patch [6/9], [7/9], [8/9] implement and optimize a PCP high
      auto-tuning method.
      
      Patch [9/9] optimize the PCP draining for consecutive high order page
      freeing based on PCP high auto-tuning.
      
      The test results for patches with performance impact are as follows,
      
      kbuild
      ======
      
      On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances
      in parallel (each with `make -j 28`) in 8 cgroup.  This simulates the
      kbuild server that is used by 0-Day kbuild service.
      
      	build time   lock contend%	free_high	alloc_zone
      	----------	----------	---------	----------
      base	     100.0	      14.0          100.0            100.0
      patch1	      99.5	      12.8	     19.5	      95.6
      patch3	      99.4	      12.6	      7.1	      95.6
      patch5	      98.6	      11.0	      8.1	      97.1
      patch7	      95.1	       0.5	      2.8	      15.6
      patch9	      95.0	       1.0	      8.8	      20.0
      
      The PCP draining optimization (patches [1/9], [3/9]) and the PCP batch
      allocation optimization (patch [5/9]) reduce zone lock contention a
      little.  The PCP high auto-tuning (patches [7/9], [9/9]) reduces build
      time visibly: the tuning target, the number of pages allocated from the
      zone, is reduced greatly, so the zone lock contention cycles% drops
      greatly.
      
      With the PCP tuning patches (patches [7/9], [9/9]), the average used
      memory during the test increases by up to 18.4% because more pages are
      cached in the PCP.  But at the end of the test, the amount of used
      memory decreases to the same level as that of the base kernel.  That is,
      the pages cached in the PCP are released back to the zone once they are
      no longer actively used.
      
      netperf SCTP_STREAM_MANY
      ========================
      
      On a 2-socket Intel server with 128 logical CPU, we tested
      SCTP_STREAM_MANY test case of netperf test suite with 64-pair processes.
      
      	     score   lock contend%	free_high	alloc_zone  cache miss rate%
      	     -----	----------	---------	----------  ----------------
      base	     100.0	       2.1          100.0            100.0	         1.3
      patch1	      99.4	       2.1	     99.4	      99.4		 1.3
      patch3	     106.4	       1.3	     13.3	     106.3		 1.3
      patch5	     106.0	       1.2	     13.2	     105.9		 1.3
      patch7	     103.4	       1.9	      6.7	      90.3		 7.6
      patch9	     108.6	       1.3	     13.7	     108.6		 1.3
      
      The PCP draining optimization (patches [1/9]+[3/9]) improves
      performance.  The PCP high auto-tuning (patch [7/9]) reduces performance
      a little because PCP draining sometimes cannot be triggered in time, so
      the cache miss rate% increases.  The further PCP draining optimization
      (patch [9/9]) based on PCP tuning restores the performance.
      
      lmbench3 UNIX (AF_UNIX)
      =======================
      
      On a 2-socket Intel server with 128 logical CPU, we tested UNIX
      (AF_UNIX socket) test case of lmbench3 test suite with 16-pair
      processes.
      
      	     score   lock contend%	free_high	alloc_zone  cache miss rate%
      	     -----	----------	---------	----------  ----------------
      base	     100.0	      51.4          100.0            100.0	         0.2
      patch1	     116.8	      46.1           69.5	     104.3	         0.2
      patch3	     199.1	      21.3            7.0	     104.9	         0.2
      patch5	     200.0	      20.8            7.1	     106.9	         0.3
      patch7	     191.6	      19.9            6.8	     103.8	         2.8
      patch9	     193.4	      21.7            7.0	     104.7	         2.1
      
      The PCP draining optimization (patches [1/9], [3/9]) improves
      performance significantly.  The PCP tuning (patch [7/9]) reduces
      performance a little because PCP draining sometimes cannot be triggered
      in time.  The further PCP draining optimization (patch [9/9]) based on
      PCP tuning partly restores the performance.
      
      The patchset adds several fields in struct per_cpu_pages.  The struct
      layout before/after the patchset is as follows,
      
      base
      ====
      
      struct per_cpu_pages {
      	spinlock_t                 lock;                 /*     0     4 */
      	int                        count;                /*     4     4 */
      	int                        high;                 /*     8     4 */
      	int                        batch;                /*    12     4 */
      	short int                  free_factor;          /*    16     2 */
      	short int                  expire;               /*    18     2 */
      
      	/* XXX 4 bytes hole, try to pack */
      
      	struct list_head           lists[13];            /*    24   208 */
      
      	/* size: 256, cachelines: 4, members: 7 */
      	/* sum members: 228, holes: 1, sum holes: 4 */
      	/* padding: 24 */
      } __attribute__((__aligned__(64)));
      
      patched
      =======
      
      struct per_cpu_pages {
      	spinlock_t                 lock;                 /*     0     4 */
      	int                        count;                /*     4     4 */
      	int                        high;                 /*     8     4 */
      	int                        high_min;             /*    12     4 */
      	int                        high_max;             /*    16     4 */
      	int                        batch;                /*    20     4 */
      	u8                         flags;                /*    24     1 */
      	u8                         alloc_factor;         /*    25     1 */
      	u8                         expire;               /*    26     1 */
      
      	/* XXX 1 byte hole, try to pack */
      
      	short int                  free_count;           /*    28     2 */
      
      	/* XXX 2 bytes hole, try to pack */
      
      	struct list_head           lists[13];            /*    32   208 */
      
      	/* size: 256, cachelines: 4, members: 11 */
      	/* sum members: 237, holes: 2, sum holes: 3 */
      	/* padding: 16 */
      } __attribute__((__aligned__(64)));
      
      The size of the struct doesn't change with the patchset.
      
      
      This patch (of 9):
      
      In commit f26b3fa0 ("mm/page_alloc: limit number of high-order pages
      on PCP during bulk free"), the PCP (Per-CPU Pageset) is drained when the
      PCP is mostly used for high-order page freeing, to improve the reuse of
      cache-hot pages between the page allocating and freeing CPUs.
      
      But the PCP draining mechanism may be triggered unexpectedly when a
      process exits.  With some customized trace points, it was found that PCP
      draining (free_high == true) was triggered by an order-1 page free with
      the following call stack,
      
       => free_unref_page_commit
       => free_unref_page
       => __mmdrop
       => exit_mm
       => do_exit
       => do_group_exit
       => __x64_sys_exit_group
       => do_syscall_64
      
      Checking the source code, this is the page table PGD freeing
      (mm_free_pgd()).  It is an order-1 page free if
      CONFIG_PAGE_TABLE_ISOLATION=y, which is a common configuration for
      security.
      
      Just before that, page freeing with the following call stack was found,
      
       => free_unref_page_commit
       => free_unref_page_list
       => release_pages
       => tlb_batch_pages_flush
       => tlb_finish_mmu
       => exit_mmap
       => __mmput
       => exit_mm
       => do_exit
       => do_group_exit
       => __x64_sys_exit_group
       => do_syscall_64
      
      So, when a process exits,

      - a large number of user pages of the process will be freed without
        page allocation, so it's highly possible that pcp->free_factor becomes
        > 0.  In fact, this is expected behavior to improve process exit
        performance.

      - after freeing all user pages, the PGD will be freed, which is an
        order-1 page free, so the PCP will be drained.

      All in all, when a process exits, it's highly possible that the PCP will
      be drained.  This is unexpected behavior.
      
      To avoid this, in this patch, PCP draining will only be triggered by 2
      consecutive high-order page freeing operations.
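
      A hedged sketch of the two-consecutive-frees condition (the flags field
      and the PCPF_PREV_FREE_HIGH_ORDER bit are assumed to be introduced by
      this patch, per the struct layout above; the exact hunk in
      free_unref_page_commit() may differ):

      	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
      		/*
      		 * Only treat this as bulk high-order freeing if the previous
      		 * free on this CPU was also high-order, so a single order-1
      		 * PGD free at process exit no longer drains the PCP.
      		 */
      		free_high = pcp->free_factor &&
      			    (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER);
      		pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
      	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
      		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
      	}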
      
      On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances
      in parallel (each with `make -j 28`) in 8 cgroup.  This simulates the
      kbuild server that is used by 0-Day kbuild service.  With the patch, the
      cycles% of the spinlock contention (mostly for zone lock) decreases from
      14.0% to 12.8% (with PCP size == 367).  The number of PCP draining for
      high order pages freeing (free_high) decreases 80.5%.
      
      This helps network workload too for reduced zone lock contention.  On a
      2-socket Intel server with 128 logical CPU, with the patch, the network
      bandwidth of the UNIX (AF_UNIX) test case of lmbench test suite with
      16-pair processes increase 16.8%.  The cycles% of the spinlock contention
      (mostly for zone lock) decreases from 51.4% to 46.1%.  The number of PCP
      draining for high order pages freeing (free_high) decreases 30.5%.  The
      cache miss rate keeps 0.2%.
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20231016053002.756205-2-ying.huang@intel.com
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ca71fe1a