    mm: multi-gen LRU: support page table walks · bd74fdae
    Yu Zhao authored
    To further exploit spatial locality, the aging prefers to walk page tables
    to search for young PTEs and promote hot pages.  A kill switch will be
    added in the next patch to disable this behavior.  When disabled, the
    aging relies on the rmap only.
    
    NB: this behavior has nothing in common with the page table scanning in
    the 2.4 kernel [1], which searches page tables for old PTEs, adds cold
    pages to the swapcache and unmaps them.
    
    To avoid confusion, the term "iteration" specifically means the traversal
    of an entire mm_struct list; the term "walk" will be applied to page
    tables and the rmap, as usual.
    
    An mm_struct list is maintained for each memcg, and an mm_struct follows
    its owner task to the new memcg when this task is migrated.  Given an
    lruvec, the aging iterates lruvec_memcg()->mm_list and calls
    walk_page_range() with each mm_struct on this list to promote hot pages
    before it increments max_seq.
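
    As a rough, userspace-only illustration of this loop (not the kernel
    implementation; mm_entry, walk_one_mm() and age_lruvec() are made-up
    names standing in for mm_struct, walk_page_range() and the aging):

      #include <stdio.h>

      /* Stand-ins for mm_struct and the per-memcg mm_struct list. */
      struct mm_entry {
              int id;                 /* identifies one address space */
              struct mm_entry *next;  /* singly linked per-memcg list */
      };

      struct mm_list {
              struct mm_entry *head;
              unsigned long max_seq;  /* bumped only after a full iteration */
      };

      /* Stand-in for walk_page_range() on one address space. */
      static void walk_one_mm(struct mm_entry *mm)
      {
              printf("walking mm %d\n", mm->id);
      }

      /* One aging iteration: walk every mm_struct, then increment max_seq. */
      static void age_lruvec(struct mm_list *list)
      {
              struct mm_entry *mm;

              for (mm = list->head; mm; mm = mm->next)
                      walk_one_mm(mm);
              list->max_seq++;
      }

      int main(void)
      {
              struct mm_entry c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
              struct mm_list list = { &a, 0 };

              age_lruvec(&list);
              printf("max_seq is now %lu\n", list.max_seq);
              return 0;
      }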
    
    When multiple page table walkers iterate the same list, each of them gets
    a unique mm_struct; therefore they can run concurrently.  Page table
    walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
    pages it left in the previous memcg will not be promoted when its current
    memcg is under reclaim.  Similarly, page table walkers will not promote
    pages from nodes other than the one under reclaim.
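
    The hand-off above can be modeled in userspace with a lock-protected
    cursor that gives out one item at a time, so no two walkers ever
    process the same mm_struct; the names (mm_cursor, take_next_mm()) are
    invented for this sketch and the kernel uses its own per-memcg state
    instead:

      #include <pthread.h>
      #include <stdio.h>

      #define NR_MM      6
      #define NR_WALKERS 3

      /* Hypothetical cursor over the per-memcg mm list. */
      struct mm_cursor {
              pthread_mutex_t lock;
              int next;               /* next mm index to hand out */
      };

      static struct mm_cursor cursor = { PTHREAD_MUTEX_INITIALIZER, 0 };

      /* Hand out a unique mm index, or -1 when the iteration is done. */
      static int take_next_mm(void)
      {
              int mm = -1;

              pthread_mutex_lock(&cursor.lock);
              if (cursor.next < NR_MM)
                      mm = cursor.next++;
              pthread_mutex_unlock(&cursor.lock);
              return mm;
      }

      static void *walker(void *arg)
      {
              long id = (long)arg;
              int mm;

              while ((mm = take_next_mm()) >= 0)
                      printf("walker %ld: walking mm %d\n", id, mm);
              return NULL;
      }

      int main(void)
      {
              pthread_t threads[NR_WALKERS];
              long i;

              for (i = 0; i < NR_WALKERS; i++)
                      pthread_create(&threads[i], NULL, walker, (void *)i);
              for (i = 0; i < NR_WALKERS; i++)
                      pthread_join(threads[i], NULL);
              return 0;
      }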
    
    This patch uses the following optimizations when walking page tables:
    1. It tracks the usage of mm_struct's between context switches so that
       page table walkers can skip processes that have been sleeping since
       the last iteration.
    2. It uses generational Bloom filters to record populated branches so
       that page table walkers can reduce their search space based on the
       query results, e.g., to skip page tables containing mostly holes or
       misplaced pages (a simplified model of these filters follows this
       list).
    3. It takes advantage of the accessed bit in non-leaf PMD entries when
       CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
    4. It does not zigzag between a PGD table and the same PMD table
       spanning multiple VMAs. IOW, it finishes all the VMAs within the
       range of the same PMD table before it returns to a PGD table. This
       improves the cache performance for workloads that have large
       numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
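
    For illustration, a self-contained model of such a generational Bloom
    filter: two small bit arrays alternate by generation parity, the array
    for the new generation is cleared at the start of an iteration, and
    items (here, fake PMD-table addresses) are recorded and queried with
    two hash functions. The sizes, hashes and indexing below are arbitrary
    choices for the model, not the kernel's:

      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      #define FILTER_BITS 1024u       /* arbitrary size for this model */

      /* Two filters alternating by generation parity: walkers query the one
       * filled during the previous iteration and record populated branches
       * into the one for the current iteration. */
      static uint8_t filters[2][FILTER_BITS / 8];

      /* Two toy hash functions over an item key, e.g. a PMD table address. */
      static unsigned int hash1(uint64_t key)
      {
              return (unsigned int)((key * 0x9e3779b1u) >> 16) % FILTER_BITS;
      }

      static unsigned int hash2(uint64_t key)
      {
              return (unsigned int)((key * 0xc2b2ae35u) >> 24) % FILTER_BITS;
      }

      static void set_bit8(uint8_t *f, unsigned int bit)
      {
              f[bit / 8] |= 1u << (bit % 8);
      }

      static int test_bit8(const uint8_t *f, unsigned int bit)
      {
              return f[bit / 8] >> (bit % 8) & 1;
      }

      /* Clear the filter that the new generation will fill. */
      static void reset_filter(unsigned long gen)
      {
              memset(filters[gen & 1], 0, sizeof(filters[0]));
      }

      /* Record a populated branch seen while walking during @gen. */
      static void record_item(unsigned long gen, uint64_t key)
      {
              set_bit8(filters[gen & 1], hash1(key));
              set_bit8(filters[gen & 1], hash2(key));
      }

      /* Query the previous generation's filter; a miss means the branch was
       * not populated last time and can be skipped. False positives only
       * cost an unnecessary walk, never a missed promotion. */
      static int maybe_populated(unsigned long gen, uint64_t key)
      {
              const uint8_t *f = filters[(gen - 1) & 1];

              return test_bit8(f, hash1(key)) && test_bit8(f, hash2(key));
      }

      int main(void)
      {
              unsigned long gen = 1;

              reset_filter(gen);
              record_item(gen, 0x1000);       /* branch seen in generation 1 */

              gen++;                          /* start generation 2 */
              reset_filter(gen);
              printf("0x1000 worth walking: %d\n", maybe_populated(gen, 0x1000));
              printf("0x9000 worth walking: %d\n", maybe_populated(gen, 0x9000));
              return 0;
      }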
    
    Server benchmark results:
      Single workload:
        fio (buffered I/O): no change
    
      Single workload:
        memcached (anon): +[8, 10]%
                    Ops/sec      KB/sec
          patch1-7: 1147696.57   44640.29
          patch1-8: 1245274.91   48435.66
    
      Configurations:
        no change
    
    Client benchmark results:
      kswapd profiles:
        patch1-7
          48.16%  lzo1x_1_do_compress (real work)
           8.20%  page_vma_mapped_walk (overhead)
           7.06%  _raw_spin_unlock_irq
           2.92%  ptep_clear_flush
           2.53%  __zram_bvec_write
           2.11%  do_raw_spin_lock
           2.02%  memmove
           1.93%  lru_gen_look_around
           1.56%  free_unref_page_list
           1.40%  memset
    
        patch1-8
          49.44%  lzo1x_1_do_compress (real work)
           6.19%  page_vma_mapped_walk (overhead)
           5.97%  _raw_spin_unlock_irq
           3.13%  get_pfn_folio
           2.85%  ptep_clear_flush
           2.42%  __zram_bvec_write
           2.08%  do_raw_spin_lock
           1.92%  memmove
           1.44%  alloc_zspage
           1.36%  memset
    
      Configurations:
        no change
    
    Thanks to the following developers for their efforts [3].
      kernel test robot <lkp@intel.com>
    
    [1] https://lwn.net/Articles/23732/
    [2] https://llvm.org/docs/ScudoHardenedAllocator.html
    [3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/
    
    Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>