    mm: vmscan: add a map_nr_max field to shrinker_info · 42c9db39
    Qi Zheng authored
    Patch series "make slab shrink lockless", v5.
    
    This patch series aims to make slab shrink lockless.
    
    1. Background
    =============
    
    On our servers, we often see the following system CPU hotspots:
    
      52.22% [kernel]        [k] down_read_trylock
      19.60% [kernel]        [k] up_read
       8.86% [kernel]        [k] shrink_slab
       2.44% [kernel]        [k] idr_find
       1.25% [kernel]        [k] count_shadow_nodes
       1.18% [kernel]        [k] shrink_lruvec
       0.71% [kernel]        [k] mem_cgroup_iter
       0.71% [kernel]        [k] shrink_node
       0.55% [kernel]        [k] find_next_bit
    
    We used bpftrace to capture the corresponding call stacks as follows:
    
    @[
        down_read_trylock+1
        shrink_slab+128
        shrink_node+371
        do_try_to_free_pages+232
        try_to_free_pages+243
        __alloc_pages_slowpath+771
        __alloc_pages_nodemask+702
        pagecache_get_page+255
        filemap_fault+1361
        ext4_filemap_fault+44
        __do_fault+76
        handle_mm_fault+3543
        do_user_addr_fault+442
        do_page_fault+48
        page_fault+62
    ]: 1161690
    @[
        down_read_trylock+1
        shrink_slab+128
        shrink_node+371
        balance_pgdat+690
        kswapd+389
        kthread+246
        ret_from_fork+31
    ]: 8424884
    @[
        down_read_trylock+1
        shrink_slab+128
        shrink_node+371
        do_try_to_free_pages+232
        try_to_free_pages+243
        __alloc_pages_slowpath+771
        __alloc_pages_nodemask+702
        __do_page_cache_readahead+244
        filemap_fault+1674
        ext4_filemap_fault+44
        __do_fault+76
        handle_mm_fault+3543
        do_user_addr_fault+442
        do_page_fault+48
        page_fault+62
    ]: 20917631
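
    For reference, stacks of this form can be collected with a bpftrace one-liner
    roughly like the following (a sketch: probing down_read_trylock() directly
    with no extra filtering is an assumption about how the capture was done):

    ```
    # Count kernel stacks that reach down_read_trylock(); totals are printed on Ctrl-C.
    bpftrace -e 'kprobe:down_read_trylock { @[kstack] = count(); }'
    ```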
    
    We can see that down_read_trylock() on shrinker_rwsem is called at very
    high frequency during reclaim.  Because of the poor multicore scalability
    of atomic operations, this leads to a significant drop in IPC
    (instructions per cycle).
    
    Moreover, shrinker_rwsem is a global read-write lock in the shrinker
    subsystem, which protects most operations such as slab shrinking and the
    registration and unregistration of shrinkers.  This can easily cause
    problems in the following cases.
    
    1) When memory pressure is high and many filesystems are mounted or
       unmounted at the same time, slab shrinking is affected because
       down_read_trylock() fails and shrink_slab() skips the shrinkers.
    
       Consider the real workload mentioned by Kirill Tkhai:
    
       ```
       One of the real workloads from my experience is start of an
       overcommitted node containing many starting containers after node crash
       (or many resuming containers after reboot for kernel update).  In these
       cases memory pressure is huge, and the node goes round in long reclaim.
       ```
    
    2) If a shrinker is blocked (such as the case mentioned in [1]) and a
       writer comes in (for example, mounting a filesystem), then this writer
       will be blocked, and all subsequent shrinker-related operations will be
       blocked behind it.
    
    [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
    
    All the above cases can be solved by replacing the shrinker_rwsem trylocks
    with SRCU.
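
    To illustrate the idea, below is a simplified sketch of the reader and
    writer sides (not the literal diff of this series): shrinker_list,
    shrinker_rwsem and do_shrink_slab() already exist in mm/vmscan.c, while
    shrinker_srcu and the exact call sites shown here are assumptions about
    how the conversion looks.

    ```
    /* Sketch only: pattern of the conversion, not compilable as-is. */
    #include <linux/srcu.h>
    #include <linux/rculist.h>

    DEFINE_SRCU(shrinker_srcu);

    /* Old reader side in shrink_slab(): reclaim silently skips all
     * shrinkers whenever the rwsem is contended. */
    if (!down_read_trylock(&shrinker_rwsem))
            goto out;                       /* nothing shrunk this round */
    list_for_each_entry(shrinker, &shrinker_list, list)
            freed += do_shrink_slab(&sc, shrinker, priority);
    up_read(&shrinker_rwsem);

    /* New reader side: never fails and never blocks shrinker registration. */
    idx = srcu_read_lock(&shrinker_srcu);
    list_for_each_entry_srcu(shrinker, &shrinker_list, list,
                             srcu_read_lock_held(&shrinker_srcu))
            freed += do_shrink_slab(&sc, shrinker, priority);
    srcu_read_unlock(&shrinker_srcu, idx);

    /* Writer side (e.g. unregister_shrinker()): unlink the shrinker, then
     * wait for all SRCU readers before it can be freed. */
    list_del_rcu(&shrinker->list);
    synchronize_srcu(&shrinker_srcu);
    ```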
    
    2. Survey
    =========
    
    Before starting on the implementation, I found that there had already been
    several similar submissions in the community:
    
    a. Davidlohr Bueso submitted a patch in 2015.
       Subject: [PATCH -next v2] mm: srcu-ify shrinkers
       Link: https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
       Result: It was finally merged into the linux-next branch,
               but failed on arm allnoconfig (without CONFIG_SRCU)
    
    b. Tetsuo Handa submitted a patchset in 2017.
       Subject: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
       Link: https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
       Result: He finally chose the current simple approach (break out when
               rwsem_is_contended()).  Christoph Hellwig suggested using SRCU,
               but SRCU was not unconditionally enabled at the time.
    
    c. Kirill Tkhai submitted a patchset in 2018.
       Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab()
       Link: https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
       Result: At that time, SRCU was not unconditionally enabled, and there
               were some objections to enabling it.  Later, Kirill's focus
               moved to other things and the patchset was not updated further.
    
    d. Sultan Alsawaf submitted a patch in 2021.
       Subject: [PATCH] mm: vmscan: Replace shrinker_rwsem trylocks with SRCU protection
       Link: https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
       Result: Rejected because SRCU was not unconditionally enabled.
    
    We can see that almost all of these historical submissions were abandoned
    because SRCU was not unconditionally enabled.  Now that SRCU has been
    unconditionally enabled by Paul E. McKenney in 2023 [2], it is time to
    replace the shrinker_rwsem trylocks with SRCU.
    
    [2] https://lore.kernel.org/lkml/20230105003759.GA1769545@paulmck-ThinkPad-P17-Gen-1/
    
    3. Reproduction and testing
    ===========================
    
    We can reproduce the down_read_trylock() hotspot through the following script:
    
    ```
    #!/bin/bash
    
    DIR="/root/shrinker/memcg/mnt"
    
    do_create()
    {
        mkdir -p /sys/fs/cgroup/memory/test
        mkdir -p /sys/fs/cgroup/perf_event/test
        echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
        for i in `seq 0 $1`;
        do
            mkdir -p /sys/fs/cgroup/memory/test/$i;
            echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
            echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
            mkdir -p $DIR/$i;
        done
    }
    
    do_mount()
    {
        for i in `seq $1 $2`;
        do
            mount -t tmpfs $i $DIR/$i;
        done
    }
    
    do_touch()
    {
        for i in `seq $1 $2`;
        do
            echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
            echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
            dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
        done
    }
    
    case "$1" in
      touch)
        do_touch $2 $3
        ;;
      test)
        do_create 4000
        do_mount 0 4000
        do_touch 0 3000
        ;;
      *)
        exit 1
        ;;
    esac
    ```
    
    Save the above script, then run the test and touch commands.  Then we can
    use the following perf command to view hotspots:
    
    perf top -U -F 999
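
    For example, assuming the script is saved as shrinker_test.sh (the file
    name here is just an illustration), a run could look like:

    ```
    chmod +x shrinker_test.sh
    ./shrinker_test.sh test           # create cgroups, mount 4000 tmpfs instances, dirty 3000 of them
    ./shrinker_test.sh touch 0 3000   # re-touch the files to keep generating memory pressure
    perf top -U -F 999                # observe the down_read_trylock() hotspot
    ```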
    
    1) Before applying this patchset:
    
      32.31%  [kernel]           [k] down_read_trylock
      19.40%  [kernel]           [k] pv_native_safe_halt
      16.24%  [kernel]           [k] up_read
      15.70%  [kernel]           [k] shrink_slab
       4.69%  [kernel]           [k] _find_next_bit
       2.62%  [kernel]           [k] shrink_node
       1.78%  [kernel]           [k] shrink_lruvec
       0.76%  [kernel]           [k] do_shrink_slab
    
    2) After applying this patchset:
    
      27.83%  [kernel]           [k] _find_next_bit
      16.97%  [kernel]           [k] shrink_slab
      15.82%  [kernel]           [k] pv_native_safe_halt
       9.58%  [kernel]           [k] shrink_node
       8.31%  [kernel]           [k] shrink_lruvec
       5.64%  [kernel]           [k] do_shrink_slab
       3.88%  [kernel]           [k] mem_cgroup_iter
    
    At the same time, we use the following perf command to capture IPC
    information:
    
    perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
    
    1) Before applying this patchset:
    
     Performance counter stats for 'system wide' (5 runs):
    
          454187219766      cycles                    test                    ( +-  1.84% )
           78896433101      instructions              test #    0.17  insn per cycle           ( +-  0.44% )
    
            10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )
    
    2) After applying this patchset:
    
     Performance counter stats for 'system wide' (5 runs):
    
          841954709443      cycles                    test                    ( +- 15.80% )  (98.69%)
          527258677936      instructions              test #    0.63  insn per cycle           ( +- 15.11% )  (98.68%)
    
              10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )
    
    We can see that IPC drops severely when down_read_trylock() is called at
    high frequency.  After switching to SRCU, IPC returns to a normal level.
    
    
    This patch (of 8):
    
    To prepare for the subsequent lockless memcg slab shrinking, add a
    map_nr_max field to struct shrinker_info to record its own real
    shrinker_nr_max.
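
    Roughly, the struct then looks like the following (a sketch based on this
    patch's description; field order and the surrounding code in the real
    header may differ):

    ```
    struct shrinker_info {
            struct rcu_head rcu;
            atomic_long_t *nr_deferred;
            unsigned long *map;
            int map_nr_max; /* shrinker_nr_max at allocation time: this map's own capacity */
    };
    ```

    With this, code that walks a memcg's shrinker bitmap can bound the walk by
    info->map_nr_max instead of the global shrinker_nr_max, which may have
    grown since this particular shrinker_info was allocated.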
    
    Link: https://lkml.kernel.org/r/20230313112819.38938-1-zhengqi.arch@bytedance.com
    Link: https://lkml.kernel.org/r/20230313112819.38938-2-zhengqi.arch@bytedance.com
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Suggested-by: Kirill Tkhai <tkhai@ya.ru>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Kirill Tkhai <tkhai@ya.ru>
    Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Sultan Alsawaf <sultan@kerneltoast.com>
    Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>