1. 06 Apr, 2023 27 commits
  2. 28 Mar, 2023 13 commits
    • mm/thp: rename TRANSPARENT_HUGEPAGE_NEVER_DAX to _UNSUPPORTED · 3c556d24
      Peter Xu authored
      TRANSPARENT_HUGEPAGE_NEVER_DAX has nothing to do with DAX.  It is set when
      has_transparent_hugepage() returns false, is checked in
      hugepage_vma_check(), and disables THP completely when set.  Rename it to
      TRANSPARENT_HUGEPAGE_UNSUPPORTED to reflect its real purpose.
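
      A minimal sketch of how the renamed flag is set and consulted, assuming the
      surrounding code in mm/huge_memory.c looks roughly as it did before the rename:
      
      ```
      enum transparent_hugepage_flag {
              TRANSPARENT_HUGEPAGE_UNSUPPORTED,       /* was TRANSPARENT_HUGEPAGE_NEVER_DAX */
              /* ... other flags ... */
      };
      
      static int __init hugepage_init(void)
      {
              if (!has_transparent_hugepage()) {
                      /* The hardware cannot do THP at all: mark it unsupported and bail. */
                      transparent_hugepage_flags = 1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED;
                      return -EINVAL;
              }
              /* ... normal initialisation ... */
              return 0;
      }
      ```
      
      hugepage_vma_check() then refuses THP for every VMA whenever this flag is set.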
      
      [peterx@redhat.com: fix comment, per David]
        Link: https://lkml.kernel.org/r/ZBMzQW674oHQJV7F@x1n
      Link: https://lkml.kernel.org/r/20230315171642.1244625-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory-failure: directly use IS_ENABLED(CONFIG_HWPOISON_INJECT) · 611b9fd8
      Kefeng Wang authored
      It is clearer and simpler to use IS_ENABLED(CONFIG_HWPOISON_INJECT) to
      check whether the HWPoison injector is enabled, instead of testing
      CONFIG_HWPOISON_INJECT and CONFIG_HWPOISON_INJECT_MODULE separately.
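
      As a hedged illustration (the exact context in mm/memory-failure.c may
      differ), the pattern being replaced is the tristate-aware double check,
      since IS_ENABLED() already covers both built-in (=y) and modular (=m)
      configurations:
      
      ```
      /* Before: a tristate option defines one of two symbols, =y or =m. */
      #if defined(CONFIG_HWPOISON_INJECT) || defined(CONFIG_HWPOISON_INJECT_MODULE)
      /* hwpoison injector / filter support */
      #endif
      
      /* After: IS_ENABLED() evaluates to 1 for both =y and =m. */
      #if IS_ENABLED(CONFIG_HWPOISON_INJECT)
      /* hwpoison injector / filter support */
      #endif
      ```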
      
      Link: https://lkml.kernel.org/r/20230313053929.84607-1-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: shrinkers: convert shrinker_rwsem to mutex · cf2e309e
      Qi Zheng authored
      Now there are no readers of shrinker_rwsem, so we can simply replace it
      with a mutex.
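
      A hedged sketch of what the conversion looks like; the actual patch renames
      the lock and updates every remaining writer-side call site:
      
      ```
      /* Before: a read-write semaphore that no longer has any readers. */
      static DECLARE_RWSEM(shrinker_rwsem);
      
      /* After: a plain mutex; every down_write()/up_write() pair becomes
       * mutex_lock()/mutex_unlock(). */
      static DEFINE_MUTEX(shrinker_mutex);
      
      void register_shrinker_prepared(struct shrinker *shrinker)
      {
              mutex_lock(&shrinker_mutex);            /* was down_write(&shrinker_rwsem) */
              list_add_tail_rcu(&shrinker->list, &shrinker_list);
              shrinker->flags |= SHRINKER_REGISTERED;
              mutex_unlock(&shrinker_mutex);          /* was up_write(&shrinker_rwsem) */
      }
      ```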
      
      Link: https://lkml.kernel.org/r/20230313112819.38938-9-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Kirill Tkhai <tkhai@ya.ru>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sultan Alsawaf <sultan@kerneltoast.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers() · 1643db98
      Qi Zheng authored
      Currently, synchronize_shrinkers() is only used by the TTM pool.  It only
      requires that no shrinkers run in parallel, and doesn't care about
      registering and unregistering of shrinkers.
      
      Since slab shrink is protected by SRCU, synchronize_srcu() is sufficient
      to ensure that no shrinker is running in parallel.  So the shrinker_rwsem
      in synchronize_shrinkers() is no longer needed; just remove it.
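
      Under that assumption (every do_shrink_slab() call already runs inside
      srcu_read_lock(&shrinker_srcu)), the function reduces to a single SRCU
      grace-period wait; a sketch rather than the literal diff:
      
      ```
      /*
       * synchronize_shrinkers - wait for all currently running shrinkers to finish
       */
      void synchronize_shrinkers(void)
      {
              /* No shrinker_rwsem needed any more: waiting for the SRCU grace
               * period covers every in-flight do_shrink_slab() call. */
              synchronize_srcu(&shrinker_srcu);
      }
      EXPORT_SYMBOL(synchronize_shrinkers);
      ```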
      
      Link: https://lkml.kernel.org/r/20230313112819.38938-8-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Kirill Tkhai <tkhai@ya.ru>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sultan Alsawaf <sultan@kerneltoast.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmscan: hold write lock to reparent shrinker nr_deferred · b3cabea3
      Qi Zheng authored
      For now, reparent_shrinker_deferred() is the only holder of the read lock
      on shrinker_rwsem.  It already holds the global cgroup_mutex, so it cannot
      be called in parallel with itself.
      
      Therefore, to allow converting shrinker_rwsem to shrinker_mutex later,
      switch to holding the write lock of shrinker_rwsem while reparenting.
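
      A hedged sketch of the change in reparent_shrinker_deferred(); only the
      lock flavour changes, the reparenting logic itself stays as it is:
      
      ```
      void reparent_shrinker_deferred(struct mem_cgroup *memcg)
      {
              /* cgroup_mutex is already held, so this never runs concurrently
               * with itself; taking the write lock removes the last reader of
               * shrinker_rwsem, which can then become a mutex. */
              down_write(&shrinker_rwsem);            /* was down_read() */
              /* ... move the child memcg's nr_deferred counts to the parent ... */
              up_write(&shrinker_rwsem);              /* was up_read() */
      }
      ```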
      
      Link: https://lkml.kernel.org/r/20230313112819.38938-7-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Kirill Tkhai <tkhai@ya.ru>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sultan Alsawaf <sultan@kerneltoast.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: shrinkers: make count and scan in shrinker debugfs lockless · 20cd1892
      Qi Zheng authored
      Like global and memcg slab shrink, use SRCU to make the count and scan
      operations in the shrinker debugfs interface lockless as well.
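
      A hedged sketch of the debugfs count path after this change (details of
      shrinker_debugfs_count_show() in mm/shrinker_debug.c are elided):
      
      ```
      static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
      {
              struct shrinker *shrinker = m->private;
              int srcu_idx;
      
              /* SRCU pins the shrinker instead of down_read(&shrinker_rwsem),
               * so a slow shrinker can no longer block (un)registration. */
              srcu_idx = srcu_read_lock(&shrinker_srcu);
              /* ... walk online nodes/memcgs and print count_objects() results ... */
              srcu_read_unlock(&shrinker_srcu, srcu_idx);
              return 0;
      }
      ```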
      
      Link: https://lkml.kernel.org/r/20230313112819.38938-6-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Kirill Tkhai <tkhai@ya.ru>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sultan Alsawaf <sultan@kerneltoast.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmscan: add shrinker_srcu_generation · 475733dd
      Kirill Tkhai authored
      After we make slab shrink lockless with SRCU, the longest sleep in
      unregister_shrinker() will be the wait for all in-flight do_shrink_slab()
      calls to finish.
      
      To avoid a long, unbreakable wait in unregister_shrinker(), add
      shrinker_srcu_generation to restore a check similar to the
      rwsem_is_contended() check that we had before.
      
      For memcg slab shrink, we unlock SRCU and continue the iteration from the
      next shrinker id.
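
      A hedged sketch of the mechanism: unregister_shrinker() bumps a global
      generation counter, and the shrink loops re-check it so they can break out
      (or, for memcg, resume from the next shrinker id) instead of keeping a long
      SRCU read-side section alive:
      
      ```
      static atomic_t shrinker_srcu_generation = ATOMIC_INIT(0);
      
      void unregister_shrinker(struct shrinker *shrinker)
      {
              /* ... unlink the shrinker from shrinker_list ... */
              atomic_inc(&shrinker_srcu_generation);
              synchronize_srcu(&shrinker_srcu);
              /* ... free nr_deferred ... */
      }
      
      /* And in shrink_slab(), roughly: */
      int generation = atomic_read(&shrinker_srcu_generation);
      
      list_for_each_entry_srcu(shrinker, &shrinker_list, list,
                               srcu_read_lock_held(&shrinker_srcu)) {
              /* ... build a struct shrink_control and call do_shrink_slab() ... */
              if (atomic_read(&shrinker_srcu_generation) != generation) {
                      /* Someone is unregistering: stop early, as the old
                       * rwsem_is_contended() check used to do. */
                      freed = freed ? : 1;
                      break;
              }
      }
      ```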
      
      Link: https://lkml.kernel.org/r/20230313112819.38938-5-zhengqi.arch@bytedance.com
      Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sultan Alsawaf <sultan@kerneltoast.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmscan: make memcg slab shrink lockless · caa05325
      Qi Zheng authored
      Like global slab shrink, this commit also uses SRCU to make memcg slab
      shrink lockless.
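
      A hedged sketch of the memcg path after the change (details of
      shrink_slab_memcg() in mm/vmscan.c are elided):
      
      ```
      static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
                                             struct mem_cgroup *memcg, int priority)
      {
              struct shrinker_info *info;
              unsigned long freed = 0;
              int srcu_idx, i;
      
              /* SRCU replaces down_read_trylock(&shrinker_rwsem): the walk can
               * no longer fail under contention or block writers for long. */
              srcu_idx = srcu_read_lock(&shrinker_srcu);
              info = srcu_dereference(memcg->nodeinfo[nid]->shrinker_info,
                                      &shrinker_srcu);
              if (!info)
                      goto unlock;
              for_each_set_bit(i, info->map, info->map_nr_max) {
                      /* ... look up the shrinker by id and call do_shrink_slab() ... */
              }
      unlock:
              srcu_read_unlock(&shrinker_srcu, srcu_idx);
              return freed;
      }
      ```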
      
      We can reproduce the down_read_trylock() hotspot through the
      following script:
      
      ```
      #!/bin/bash
      
      DIR="/root/shrinker/memcg/mnt"
      
      do_create()
      {
          mkdir -p /sys/fs/cgroup/memory/test
          mkdir -p /sys/fs/cgroup/perf_event/test
          echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
          for i in `seq 0 $1`;
          do
              mkdir -p /sys/fs/cgroup/memory/test/$i;
              echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
              echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
              mkdir -p $DIR/$i;
          done
      }
      
      do_mount()
      {
          for i in `seq $1 $2`;
          do
              mount -t tmpfs $i $DIR/$i;
          done
      }
      
      do_touch()
      {
          for i in `seq $1 $2`;
          do
              echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
              echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
                  dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
          done
      }
      
      case "$1" in
        touch)
          do_touch $2 $3
          ;;
        test)
            do_create 4000
          do_mount 0 4000
          do_touch 0 3000
          ;;
        *)
          exit 1
          ;;
      esac
      ```
      
      Save the above script, then run test and touch commands.
      Then we can use the following perf command to view hotspots:
      
      perf top -U -F 999
      
      1) Before applying this patchset:
      
        32.31%  [kernel]           [k] down_read_trylock
        19.40%  [kernel]           [k] pv_native_safe_halt
        16.24%  [kernel]           [k] up_read
        15.70%  [kernel]           [k] shrink_slab
         4.69%  [kernel]           [k] _find_next_bit
         2.62%  [kernel]           [k] shrink_node
         1.78%  [kernel]           [k] shrink_lruvec
         0.76%  [kernel]           [k] do_shrink_slab
      
      2) After applying this patchset:
      
        27.83%  [kernel]           [k] _find_next_bit
        16.97%  [kernel]           [k] shrink_slab
        15.82%  [kernel]           [k] pv_native_safe_halt
         9.58%  [kernel]           [k] shrink_node
         8.31%  [kernel]           [k] shrink_lruvec
         5.64%  [kernel]           [k] do_shrink_slab
         3.88%  [kernel]           [k] mem_cgroup_iter
      
      At the same time, we use the following perf command to capture
      IPC information:
      
      perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
      
      1) Before applying this patchset:
      
       Performance counter stats for 'system wide' (5 runs):
      
            454187219766      cycles                    test                    ( +-  1.84% )
             78896433101      instructions              test #    0.17  insn per cycle           ( +-  0.44% )
      
              10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )
      
      2) After applying this patchset:
      
       Performance counter stats for 'system wide' (5 runs):
      
            841954709443      cycles                    test                    ( +- 15.80% )  (98.69%)
            527258677936      instructions              test #    0.63  insn per cycle           ( +- 15.11% )  (98.68%)
      
                10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )
      
      We can see that IPC drops sharply when down_read_trylock() is called at
      high frequency.  After switching to SRCU, IPC returns to a normal level.
      
      Link: https://lkml.kernel.org/r/20230313112819.38938-4-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Acked-by: Kirill Tkhai <tkhai@ya.ru>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sultan Alsawaf <sultan@kerneltoast.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmscan: make global slab shrink lockless · f95bdb70
      Qi Zheng authored
      The shrinker_rwsem is a global read-write lock in shrinkers subsystem,
      which protects most operations such as slab shrink, registration and
      unregistration of shrinkers, etc.  This can easily cause problems in the
      following cases.
      
      1) When the memory pressure is high and there are many
         filesystems mounted or unmounted at the same time,
         slab shrink will be affected (down_read_trylock()
         failed).
      
         Such as the real workload mentioned by Kirill Tkhai:
      
         ```
         One of the real workloads from my experience is start
         of an overcommitted node containing many starting
         containers after node crash (or many resuming containers
         after reboot for kernel update). In these cases memory
         pressure is huge, and the node goes round in long reclaim.
         ```
      
      2) If a shrinker is blocked (such as the case mentioned
         in [1]) and a writer comes in (such as mount a fs),
         then this writer will be blocked and cause all
         subsequent shrinker-related operations to be blocked.
      
      Even if there is no competitor when shrinking slab, there may still be a
      problem.  If we have a long shrinker list and we do not reclaim enough
      memory with each shrinker, then the down_read_trylock() may be called with
      high frequency.  Because of the poor multicore scalability of atomic
      operations, this can lead to a significant drop in IPC (instructions per
      cycle).
      
      Many attempts have been made over the years ([2],[3],[4],[5]) to replace
      the shrinker_rwsem trylock with SRCU in the slab shrink path, but all of
      those patches were abandoned because SRCU was not unconditionally enabled.
      
      But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"), SRCU is
      unconditionally enabled.  So it's time to use SRCU to protect readers
      who previously held shrinker_rwsem.
      
      This commit uses SRCU to make global slab shrink lockless; the memcg slab
      shrink is handled in a subsequent patch.
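
      A hedged sketch of the core change in shrink_slab() (surrounding details
      elided): the rwsem trylock goes away and the shrinker list is walked under
      SRCU instead:
      
      ```
      /* Before:
       *      if (!down_read_trylock(&shrinker_rwsem))
       *              goto out;
       *      list_for_each_entry(shrinker, &shrinker_list, list) { ... }
       *      up_read(&shrinker_rwsem);
       */
      
      srcu_idx = srcu_read_lock(&shrinker_srcu);
      list_for_each_entry_srcu(shrinker, &shrinker_list, list,
                               srcu_read_lock_held(&shrinker_srcu)) {
              struct shrink_control sc = {
                      .gfp_mask = gfp_mask,
                      .nid = nid,
                      .memcg = memcg,
              };
      
              freed += do_shrink_slab(&sc, shrinker, priority);
      }
      srcu_read_unlock(&shrinker_srcu, srcu_idx);
      ```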
      
      [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
      [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
      [3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
      [4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
      [5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
      
      Link: https://lkml.kernel.org/r/20230313112819.38938-3-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Kirill Tkhai <tkhai@ya.ru>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmscan: add a map_nr_max field to shrinker_info · 42c9db39
      Qi Zheng authored
      Patch series "make slab shrink lockless", v5.
      
      This patch series aims to make slab shrink lockless.
      
      1. Background
      =============
      
      On our servers, we often find the following system cpu hotspots:
      
        52.22% [kernel]        [k] down_read_trylock
        19.60% [kernel]        [k] up_read
         8.86% [kernel]        [k] shrink_slab
         2.44% [kernel]        [k] idr_find
         1.25% [kernel]        [k] count_shadow_nodes
         1.18% [kernel]        [k] shrink_lruvec
         0.71% [kernel]        [k] mem_cgroup_iter
         0.71% [kernel]        [k] shrink_node
         0.55% [kernel]        [k] find_next_bit
      
      And we used bpftrace to capture its calltrace as follows:
      
      @[
          down_read_trylock+1
          shrink_slab+128
          shrink_node+371
          do_try_to_free_pages+232
          try_to_free_pages+243
          __alloc_pages_slowpath+771
          __alloc_pages_nodemask+702
          pagecache_get_page+255
          filemap_fault+1361
          ext4_filemap_fault+44
          __do_fault+76
          handle_mm_fault+3543
          do_user_addr_fault+442
          do_page_fault+48
          page_fault+62
      ]: 1161690
      @[
          down_read_trylock+1
          shrink_slab+128
          shrink_node+371
          balance_pgdat+690
          kswapd+389
          kthread+246
          ret_from_fork+31
      ]: 8424884
      @[
          down_read_trylock+1
          shrink_slab+128
          shrink_node+371
          do_try_to_free_pages+232
          try_to_free_pages+243
          __alloc_pages_slowpath+771
          __alloc_pages_nodemask+702
          __do_page_cache_readahead+244
          filemap_fault+1674
          ext4_filemap_fault+44
          __do_fault+76
          handle_mm_fault+3543
          do_user_addr_fault+442
          do_page_fault+48
          page_fault+62
      ]: 20917631
      
      We can see that down_read_trylock() of shrinker_rwsem is being called with
      high frequency at that time.  Because of the poor multicore scalability of
      atomic operations, this can lead to a significant drop in IPC
      (instructions per cycle).
      
      And more, the shrinker_rwsem is a global read-write lock in shrinkers
      subsystem, which protects most operations such as slab shrink,
      registration and unregistration of shrinkers, etc.  This can easily cause
      problems in the following cases.
      
      1) When the memory pressure is high and there are many filesystems
         mounted or unmounted at the same time, slab shrink will be affected
         (down_read_trylock() failed).
      
         Such as the real workload mentioned by Kirill Tkhai:
      
         ```
         One of the real workloads from my experience is start of an
         overcommitted node containing many starting containers after node crash
         (or many resuming containers after reboot for kernel update).  In these
         cases memory pressure is huge, and the node goes round in long reclaim.
         ```
      
      2) If a shrinker is blocked (such as the case mentioned in [1]) and a
         writer comes in (such as mount a fs), then this writer will be blocked
         and cause all subsequent shrinker-related operations to be blocked.
      
      [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
      
      All the above cases can be solved by replacing the shrinker_rwsem trylocks
      with SRCU.
      
      2. Survey
      =========
      
      Before doing the code implementation, I found that there were many similar
      submissions in the community:
      
      a. Davidlohr Bueso submitted a patch in 2015.
         Subject: [PATCH -next v2] mm: srcu-ify shrinkers
         Link: https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
         Result: It was finally merged into the linux-next branch,
                 but failed on arm allnoconfig (without CONFIG_SRCU)
      
      b. Tetsuo Handa submitted a patchset in 2017.
         Subject: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
         Link: https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
         Result: Finally chose the current simple approach (break out when
                 rwsem_is_contended()).  Christoph Hellwig suggested using
                 SRCU, but SRCU was not unconditionally enabled at the time.
      
      c. Kirill Tkhai submitted a patchset in 2018.
         Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab()
         Link: https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
         Result: At that time, SRCU was not unconditionally enabled, and there
                 were some objections to enabling it.  Later, because Kirill's
                 focus moved to other things, this patchset was not updated
                 further.
      
      d. Sultan Alsawaf submitted a patch in 2021.
         Subject: [PATCH] mm: vmscan: Replace shrinker_rwsem trylocks with SRCU protection
         Link: https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
         Result: Rejected because SRCU was not unconditionally enabled.
      
      We can find that almost all of these historical attempts were abandoned
      because SRCU was not unconditionally enabled.  But now SRCU has been
      unconditionally enabled by Paul E. McKenney in 2023 [2], so it's time to
      replace the shrinker_rwsem trylocks with SRCU.
      
      [2] https://lore.kernel.org/lkml/20230105003759.GA1769545@paulmck-ThinkPad-P17-Gen-1/
      
      3. Reproduction and testing
      ===========================
      
      We can reproduce the down_read_trylock() hotspot through the following script:
      
      ```
      #!/bin/bash
      
      DIR="/root/shrinker/memcg/mnt"
      
      do_create()
      {
          mkdir -p /sys/fs/cgroup/memory/test
          mkdir -p /sys/fs/cgroup/perf_event/test
          echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
          for i in `seq 0 $1`;
          do
              mkdir -p /sys/fs/cgroup/memory/test/$i;
              echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
              echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
              mkdir -p $DIR/$i;
          done
      }
      
      do_mount()
      {
          for i in `seq $1 $2`;
          do
              mount -t tmpfs $i $DIR/$i;
          done
      }
      
      do_touch()
      {
          for i in `seq $1 $2`;
          do
              echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
              echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
                  dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
          done
      }
      
      case "$1" in
        touch)
          do_touch $2 $3
          ;;
        test)
            do_create 4000
          do_mount 0 4000
          do_touch 0 3000
          ;;
        *)
          exit 1
          ;;
      esac
      ```
      
      Save the above script, then run test and touch commands.  Then we can use
      the following perf command to view hotspots:
      
      perf top -U -F 999
      
      1) Before applying this patchset:
      
        32.31%  [kernel]           [k] down_read_trylock
        19.40%  [kernel]           [k] pv_native_safe_halt
        16.24%  [kernel]           [k] up_read
        15.70%  [kernel]           [k] shrink_slab
         4.69%  [kernel]           [k] _find_next_bit
         2.62%  [kernel]           [k] shrink_node
         1.78%  [kernel]           [k] shrink_lruvec
         0.76%  [kernel]           [k] do_shrink_slab
      
      2) After applying this patchset:
      
        27.83%  [kernel]           [k] _find_next_bit
        16.97%  [kernel]           [k] shrink_slab
        15.82%  [kernel]           [k] pv_native_safe_halt
         9.58%  [kernel]           [k] shrink_node
         8.31%  [kernel]           [k] shrink_lruvec
         5.64%  [kernel]           [k] do_shrink_slab
         3.88%  [kernel]           [k] mem_cgroup_iter
      
      At the same time, we use the following perf command to capture IPC
      information:
      
      perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
      
      1) Before applying this patchset:
      
       Performance counter stats for 'system wide' (5 runs):
      
            454187219766      cycles                    test                    ( +-  1.84% )
             78896433101      instructions              test #    0.17  insn per cycle           ( +-  0.44% )
      
              10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )
      
      2) After applying this patchset:
      
       Performance counter stats for 'system wide' (5 runs):
      
            841954709443      cycles                    test                    ( +- 15.80% )  (98.69%)
            527258677936      instructions              test #    0.63  insn per cycle           ( +- 15.11% )  (98.68%)
      
                10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )
      
      We can see that IPC drops sharply when down_read_trylock() is called at
      high frequency.  After switching to SRCU, IPC returns to a normal level.
      
      
      This patch (of 8):
      
      To prepare for the subsequent lockless memcg slab shrink, add a map_nr_max
      field to struct shrinker_info to record its own real shrinker_nr_max.
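
      A hedged sketch of the struct after this patch (in include/linux/memcontrol.h;
      other details elided):
      
      ```
      struct shrinker_info {
              struct rcu_head rcu;
              atomic_long_t *nr_deferred;
              unsigned long *map;
              int map_nr_max; /* new: this info's own upper bound for the bitmap */
      };
      ```
      
      Iteration over the per-memcg bitmap can then be bounded by the info's own
      size rather than the global shrinker_nr_max, roughly
      for_each_set_bit(i, info->map, info->map_nr_max).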
      
      Link: https://lkml.kernel.org/r/20230313112819.38938-1-zhengqi.arch@bytedance.com
      Link: https://lkml.kernel.org/r/20230313112819.38938-2-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Suggested-by: Kirill Tkhai <tkhai@ya.ru>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Kirill Tkhai <tkhai@ya.ru>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sultan Alsawaf <sultan@kerneltoast.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: prefer xxx_page() alloc/free functions for order-0 pages · dcc1be11
      Lorenzo Stoakes authored
      Update instances of alloc_pages(..., 0), __get_free_pages(..., 0) and
      __free_pages(..., 0) to use alloc_page(), __get_free_page() and
      __free_page() respectively in core code.
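
      For example (a representative pattern, not a specific hunk from the patch):
      
      ```
      struct page *page;
      unsigned long addr;
      
      /* Before: order-0 spelled out explicitly. */
      page = alloc_pages(GFP_KERNEL, 0);
      addr = __get_free_pages(GFP_KERNEL, 0);
      __free_pages(page, 0);
      
      /* After: the dedicated single-page helpers. */
      page = alloc_page(GFP_KERNEL);
      addr = __get_free_page(GFP_KERNEL);
      __free_page(page);
      ```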
      
      Link: https://lkml.kernel.org/r/50c48ca4789f1da2a65795f2346f5ae3eff7d665.1678710232.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kasan: remove PG_skip_kasan_poison flag · 0a54864f
      Peter Collingbourne authored
      Code inspection reveals that PG_skip_kasan_poison is redundant with
      kasantag, because the former is intended to be set iff the latter is the
      match-all tag.  It can also be observed that it's basically pointless to
      poison pages which have kasantag=0, because any pages with this tag would
      have been pointed to by pointers with match-all tags, so poisoning the
      pages would have little to no effect in terms of bug detection. 
      Therefore, change the condition in should_skip_kasan_poison() to check
      kasantag instead, and remove PG_skip_kasan_poison and associated flags.
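
      A hedged sketch of the resulting check (the generic-KASAN and deferred-init
      details of should_skip_kasan_poison() in mm/page_alloc.c are elided):
      
      ```
      static bool should_skip_kasan_poison(struct page *page)
      {
              /* Instead of testing the removed PG_skip_kasan_poison flag, skip
               * poisoning when the page carries the match-all tag: such pages
               * are only reachable through match-all pointers, so poisoning
               * them would not help bug detection. */
              return page_kasan_tag(page) == KASAN_TAG_KERNEL;
      }
      ```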
      
      Link: https://lkml.kernel.org/r/20230310042914.3805818-3-pcc@google.com
      Link: https://linux-review.googlesource.com/id/I57f825f2eaeaf7e8389d6cf4597c8a5821359838
      Signed-off-by: Peter Collingbourne <pcc@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • io-mapping: don't disable preempt on RT in io_mapping_map_atomic_wc(). · 7eb16f23
      Sebastian Andrzej Siewior authored
      io_mapping_map_atomic_wc() disables preemption and pagefaults for
      historical reasons.  The conversion to io_mapping_map_local_wc(), which
      only disables migration, cannot be done wholesale because quite a few call
      sites need to be updated to accommodate the changed semantics.
      
      On PREEMPT_RT enabled kernels the io_mapping_map_atomic_wc() semantics are
      problematic due to the implicit disabling of preemption which makes it
      impossible to acquire 'sleeping' spinlocks within the mapped atomic
      sections.
      
      PREEMPT_RT has replaced the preempt_disable() with a migrate_disable() for
      more than a decade.  It could be argued that this is a justification to do
      this unconditionally, but PREEMPT_RT covers only a limited number of
      architectures and it disables some functionality which limits the coverage
      further.
      
      Limit the replacement to PREEMPT_RT for now.  This is also how
      kmap_atomic() handles it.
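
      A hedged sketch of the resulting helper (include/linux/io-mapping.h; the
      exact form in the patch may differ):
      
      ```
      static inline void __iomem *
      io_mapping_map_atomic_wc(struct io_mapping *mapping, unsigned long offset)
      {
              /* Mirror kmap_atomic(): on PREEMPT_RT only migration is disabled,
               * so sleeping spinlocks may still be taken inside the section. */
              if (IS_ENABLED(CONFIG_PREEMPT_RT))
                      migrate_disable();
              else
                      preempt_disable();
              pagefault_disable();
              return io_mapping_map_wc(mapping, offset, PAGE_SIZE);
      }
      ```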
      
      Link: https://lkml.kernel.org/r/20230310162905.O57Pj7hh@linutronix.de
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reported-by: Richard Weinberger <richard.weinberger@gmail.com>
        Link: https://lore.kernel.org/CAFLxGvw0WMxaMqYqJ5WgvVSbKHq2D2xcXTOgMCpgq9nDC-MWTQ@mail.gmail.com
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>