    mm: vmalloc: optimize vmap_lazy_nr arithmetic when purging each vmap_area · 409faf8c
    Adrian Huang authored
    When running the vmalloc stress test on a 448-core system, the average
    latency of purge_vmap_node() is about 2 seconds, as measured with the
    eBPF/bcc 'funclatency.py' tool [1].
    
      # /your-git-repo/bcc/tools/funclatency.py -u purge_vmap_node & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1
    
         usecs             : count    distribution
            0 -> 1         : 0       |                                        |
            2 -> 3         : 29      |                                        |
            4 -> 7         : 19      |                                        |
            8 -> 15        : 56      |                                        |
           16 -> 31        : 483     |****                                    |
           32 -> 63        : 1548    |************                            |
           64 -> 127       : 2634    |*********************                   |
          128 -> 255       : 2535    |*********************                   |
          256 -> 511       : 1776    |**************                          |
          512 -> 1023      : 1015    |********                                |
         1024 -> 2047      : 573     |****                                    |
         2048 -> 4095      : 488     |****                                    |
         4096 -> 8191      : 1091    |*********                               |
         8192 -> 16383     : 3078    |*************************               |
        16384 -> 32767     : 4821    |****************************************|
        32768 -> 65535     : 3318    |***************************             |
        65536 -> 131071    : 1718    |**************                          |
       131072 -> 262143    : 2220    |******************                      |
       262144 -> 524287    : 1147    |*********                               |
       524288 -> 1048575   : 1179    |*********                               |
      1048576 -> 2097151   : 822     |******                                  |
      2097152 -> 4194303   : 906     |*******                                 |
      4194304 -> 8388607   : 2148    |*****************                       |
      8388608 -> 16777215  : 4497    |*************************************   |
     16777216 -> 33554431  : 289     |**                                      |
    
      avg = 2041714 usecs, total: 78381401772 usecs, count: 38390
    
      The worst case falls in the 16-33 second bucket, so a soft lockup is
      triggered [2].
    
    [Root Cause]
    1) Each purge_list is very long. The following shows the number of
       vmap_area entries purged:
    
       crash> p vmap_nodes
       vmap_nodes = $27 = (struct vmap_node *) 0xff2de5a900100000
       crash> vmap_node 0xff2de5a900100000 128 | grep nr_purged
         nr_purged = 663070
         ...
         nr_purged = 821670
         nr_purged = 692214
         nr_purged = 726808
         ...
    
    2) atomic_long_sub() uses the 'lock' prefix to make the update atomic,
       and it is executed once for each purged vmap_area. The loop iterates
       over more than 600000 vmap_area entries (see 'nr_purged' above).
    
       Here is objdump output:
    
         $ objdump -D vmlinux
         ffffffff813e8c80 <purge_vmap_node>:
         ...
         ffffffff813e8d70:  f0 48 29 2d 68 0c bb  lock sub %rbp,0x2bb0c68(%rip)
         ...
    
       Quote from the "Instruction tables" PDF [3]:
         Instructions with a LOCK prefix have a long latency that depends on
         cache organization and possibly RAM speed. If there are multiple
         processors or cores or direct memory access (DMA) devices, then all
         locked instructions will lock a cache line for exclusive access,
         which may involve RAM access. A LOCK prefix typically costs more
         than a hundred clock cycles, even on single-processor systems.
    
       That's why the latency of purge_vmap_node() increases dramatically
       on a many-core system: one core is busy purging each vmap_area of
       the *long* purge_list and executing atomic_long_sub() for each
       vmap_area, while other cores free vmalloc allocations and execute
       atomic_long_add_return() in free_vmap_area_noflush().
    
    [Solution]
    Use a local variable to accumulate the total number of purged pages, and
    execute atomic_long_sub() once after the traversal of the purge_list is
    done. The experiment results show a latency improvement of over 99%.
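
    The idea can be illustrated with a minimal userspace sketch using C11
    atomics. This is not the kernel patch itself: 'lazy_nr', 'struct area',
    'purge_per_area()' and 'purge_batched()' are hypothetical stand-ins for
    vmap_lazy_nr, vmap_area and the loop in purge_vmap_node().

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's global vmap_lazy_nr counter. */
static atomic_long lazy_nr;

/* Hypothetical stand-in for a vmap_area: only its size in pages matters. */
struct area {
    long nr_pages;
};

/* Before: one atomic read-modify-write per area in the list. */
static void purge_per_area(const struct area *areas, size_t n)
{
    for (size_t i = 0; i < n; i++)
        atomic_fetch_sub(&lazy_nr, areas[i].nr_pages);
}

/* After: accumulate in a plain local, then do a single atomic subtraction. */
static void purge_batched(const struct area *areas, size_t n)
{
    long nr_purged_pages = 0;

    for (size_t i = 0; i < n; i++)
        nr_purged_pages += areas[i].nr_pages;

    atomic_fetch_sub(&lazy_nr, nr_purged_pages);
}
```

    Both functions leave the counter at the same final value. The batched
    variant issues one atomic operation per list instead of one per entry,
    so the shared cache line is contended far less often.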
    
    [Experiment Result]
    1) System Configuration: Three servers (with HT enabled) were tested.
         * 72-core server: 3rd Gen Intel Xeon Scalable Processor*1
         * 192-core server: 5th Gen Intel Xeon Scalable Processor*2
         * 448-core server: AMD Zen 4 Processor*2
    
    2) Kernel Config
         * CONFIG_KASAN is disabled
    
    3) The data in columns "w/o patch" and "w/ patch"
         * Unit: microseconds (us)
         * Each value is the average of three measurements
    
             System        w/o patch (us)   w/ patch (us)    Improvement (%)
         ---------------   --------------   -------------    -------------
         72-core server          2194              14            99.36%
         192-core server       143799            1139            99.21%
         448-core server      1992122            6883            99.65%
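
    The "Improvement (%)" column is consistent with the relative reduction
    in latency, (before - after) / before. A small sketch of that
    arithmetic, where the helper name 'improvement_pct()' is hypothetical:

```c
/* Percentage reduction going from 'before_us' to 'after_us' (microseconds). */
static double improvement_pct(double before_us, double after_us)
{
    return (before_us - after_us) / before_us * 100.0;
}
```

    For example, improvement_pct(2194, 14) is roughly 99.36, matching the
    72-core row.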
    
    [1] https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
    [2] https://gist.github.com/AdrianHuang/37c15f67b45407b83c2d32f918656c12
    [3] https://www.agner.org/optimize/instruction_tables.pdf
    
    Link: https://lkml.kernel.org/r/20240829130633.2184-1-ahuang12@lenovo.com
    Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
    Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>