    arm64: support batched/deferred tlb shootdown during page reclamation/migration · 43b3dfdd
    Barry Song authored
    On x86, batched and deferred tlb shootdown has led to a 90% performance
    increase on tlb shootdown.  On arm64, the HW can do tlb shootdown without
    a software IPI, but a synchronous tlbi is still quite expensive.
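    For context, the batched/deferred path in the generic reclaim code is
    two-staged.  A simplified sketch of that flow (illustrative only, not the
    literal mm/rmap.c code) looks like:

     /* stage 1: while unmapping each page, only record that a flush is pending */
     static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
                                           unsigned long uaddr)
     {
             struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

             arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
             tlb_ubc->flush_required = true;
     }

     /* stage 2: flush once, before the unmapped pages are actually reused */
     void try_to_unmap_flush(void)
     {
             struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

             if (!tlb_ubc->flush_required)
                     return;

             arch_tlbbatch_flush(&tlb_ubc->arch);
             tlb_ubc->flush_required = false;
     }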
    
    Even running the simplest program that requires swapout can
    prove this is true:
     #include <sys/types.h>
     #include <unistd.h>
     #include <sys/mman.h>
     #include <string.h>
    
     int main()
     {
     #define SIZE (1 * 1024 * 1024)
             volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    
             memset(p, 0x88, SIZE);
    
             for (int k = 0; k < 10000; k++) {
                     /* swap in */
                     for (int i = 0; i < SIZE; i += 4096) {
                             (void)p[i];
                     }
    
                     /* swap out */
                     madvise(p, SIZE, MADV_PAGEOUT);
             }
     }
    
    Perf result on a Snapdragon 888 with 8 cores, using zRAM
    as the swap block device:
    
     ~ # perf record taskset -c 4 ./a.out
     [ perf record: Woken up 10 times to write data ]
     [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
     ~ # perf report
     # To display the perf.data header info, please use --header/--header-only options.
     #
     #
     # Total Lost Samples: 0
     #
     # Samples: 60K of event 'cycles'
     # Event count (approx.): 35706225414
     #
     # Overhead  Command  Shared Object      Symbol
     # ........  .......  .................  ......
     #
        21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
         8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
         6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
         6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
         5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
         3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
         3.49%  a.out    [kernel.kallsyms]  [k] memset64
         1.63%  a.out    [kernel.kallsyms]  [k] clear_page
         1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
         1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
         1.23%  a.out    [kernel.kallsyms]  [k] xas_load
         1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock
    
    ptep_clear_flush() takes 5.36% of CPU time in this micro-benchmark, which
    swaps in/out a page mapped by only one process.  If the page is mapped by
    multiple processes (typically more than 100 on a phone), the overhead is
    much higher, as we have to run the tlb flush 100 times for one single page.
    In addition, tlb flush overhead grows with the number of CPU cores due to
    the poor scalability of tlb shootdown in HW, so ARM64 servers should expect
    much higher overhead.
    
    Further perf annotate shows that 95% of the cpu time in ptep_clear_flush()
    is actually spent in the final dsb() waiting for the tlb flush to complete.
    This gives us a very good chance to leverage the existing batched tlb flush
    in the kernel.  The minimal modification is to issue only the asynchronous
    tlbi in the first stage, and to issue the dsb in the second stage, when we
    actually have to synchronize.
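    On arm64 the two stages map naturally onto the arch hooks of the
    batched-unmap interface.  A minimal sketch, assuming helpers along the
    lines of what this series wires up (illustrative, not a verbatim copy of
    the patch):

     /* stage 1: broadcast the invalidation, but do not wait for it */
     static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
                                                  struct mm_struct *mm,
                                                  unsigned long uaddr)
     {
             /* dsb(ishst) + tlbi vale1is broadcast, without the trailing dsb(ish) */
             __flush_tlb_page_nosync(mm, uaddr);
     }

     /* stage 2: a single barrier waits for all pending invalidations at once */
     static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
     {
             dsb(ish);
     }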
    
    With the above simplest micro-benchmark, the elapsed time to finish the
    program decreases by around 5%.
    
    Typical elapsed time w/o patch:
     ~ # time taskset -c 4 ./a.out
     0.21user 14.34system 0:14.69elapsed
    w/ patch:
     ~ # time taskset -c 4 ./a.out
     0.22user 13.45system 0:13.80elapsed
    
    Also tested with the benchmark in this commit on a Kunpeng920 arm64 server,
    where an improvement of around 12.5% is observed with the command
    `time ./swap_bench`.
            w/o             w/
    real    0m13.460s       0m11.771s
    user    0m0.248s        0m0.279s
    sys     0m12.039s       0m11.458s
    
    Originally, a 16.99% overhead of ptep_clear_flush() was observed, which
    has been eliminated by this patch:
    
    [root@localhost yang]# perf record -- ./swap_bench && perf report
    [...]
    16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush
    
    It is tested on 4-, 8- and 128-CPU platforms and shows a benefit on large
    systems, but may show no improvement on small systems such as a 4-CPU
    platform.
    
    This patch also improves the performance of page migration.  Using pmbench
    and migrating the pages of pmbench between node 0 and node 1 100 times for
    1G of memory, this patch decreases the time used by around 20% (from
    18.338318910 sec to 13.981866350 sec) and saves the time spent in
    ptep_clear_flush().
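    The migration driver used for that test is not included here; a
    hypothetical stand-in (assuming libnuma and a pmbench process already
    running with ~1G resident; build with -lnuma) could look roughly like:

     #include <numa.h>
     #include <stdio.h>
     #include <stdlib.h>

     int main(int argc, char **argv)
     {
             if (argc < 2)
                     return 1;

             int pid = atoi(argv[1]);        /* pid of the pmbench process */
             struct bitmask *node0 = numa_parse_nodestring("0");
             struct bitmask *node1 = numa_parse_nodestring("1");

             for (int i = 0; i < 100; i++) {
                     /* move the target's pages node 0 -> node 1, then back */
                     if (numa_migrate_pages(pid, node0, node1) < 0)
                             perror("migrate 0->1");
                     if (numa_migrate_pages(pid, node1, node0) < 0)
                             perror("migrate 1->0");
             }
             return 0;
     }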
    
    Link: https://lkml.kernel.org/r/20230717131004.12662-5-yangyicong@huawei.com
    Tested-by: Yicong Yang <yangyicong@hisilicon.com>
    Tested-by: Xin Hao <xhao@linux.alibaba.com>
    Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
    Signed-off-by: Barry Song <v-songbaohua@oppo.com>
    Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
    Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Darren Hart <darren@os.amperecomputing.com>
    Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
    Cc: lipeifeng <lipeifeng@oppo.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Steven Miao <realmz6@gmail.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Zeng Tao <prime.zeng@hisilicon.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>