• Huang Ying's avatar
    swap: reduce lock contention on swap cache from swap slots allocation · 49070588
    Huang Ying authored
    In some swap scalability test, it is found that there are heavy lock
    contention on swap cache even if we have split one swap cache radix tree
    per swap device to one swap cache radix tree every 64 MB trunk in commit
    4b3ef9da ("mm/swap: split swap cache into 64MB trunks").
    
    The reason is as follow.  After the swap device becomes fragmented so
    that there's no free swap cluster, the swap device will be scanned
    linearly to find the free swap slots.  swap_info_struct->cluster_next is
    the next scanning base that is shared by all CPUs.  So nearby free swap
    slots will be allocated for different CPUs.  The probability for
    multiple CPUs to operate on the same 64 MB trunk is high.  This causes
    the lock contention on the swap cache.
    
    To solve the issue, in this patch, for SSD swap device, a percpu version
    next scanning base (cluster_next_cpu) is added.  Every CPU will use its
    own per-cpu next scanning base.  And after finishing scanning a 64MB
    trunk, the per-cpu scanning base will be changed to the beginning of
    another randomly selected 64MB trunk.  In this way, the probability for
    multiple CPUs to operate on the same 64 MB trunk is reduced greatly.
    Thus the lock contention is reduced too.  For HDD, because sequential
    access is more important for IO performance, the original shared next
    scanning base is used.
    
    To test the patch, we have run 16-process pmbench memory benchmark on a
    2-socket server machine with 48 cores.  One ram disk is configured as the
    swap device per socket.  The pmbench working-set size is much larger than
    the available memory so that swapping is triggered.  The memory read/write
    ratio is 80/20 and the accessing pattern is random.  In the original
    implementation, the lock contention on the swap cache is heavy.  The perf
    profiling data of the lock contention code path is as following,
    
     _raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list:      7.91
     _raw_spin_lock_irqsave.__remove_mapping.shrink_page_list:               7.11
     _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 2.51
     _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap:     1.66
     _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:      1.29
     _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:         1.03
     _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:        0.93
    
    After applying this patch, it becomes,
    
     _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 3.58
     _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:      2.3
     _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap:     2.26
     _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:        1.8
     _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:         1.19
    
    The lock contention on the swap cache is almost eliminated.
    
    And the pmbench score increases 18.5%.  The swapin throughput increases
    18.7% from 2.96 GB/s to 3.51 GB/s.  While the swapout throughput increases
    18.5% from 2.99 GB/s to 3.54 GB/s.
    
    We need really fast disk to show the benefit.  I have tried this on 2
    Intel P3600 NVMe disks.  The performance improvement is only about 1%.
    The improvement should be better on the faster disks, such as Intel Optane
    disk.
    
    [ying.huang@intel.com: fix cluster_next_cpu allocation and freeing, per Daniel]
      Link: http://lkml.kernel.org/r/20200525002648.336325-1-ying.huang@intel.com
    [ying.huang@intel.com: v4]
      Link: http://lkml.kernel.org/r/20200529010840.928819-1-ying.huang@intel.comSigned-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Reviewed-by: default avatarDaniel Jordan <daniel.m.jordan@oracle.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Tim Chen <tim.c.chen@linux.intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Link: http://lkml.kernel.org/r/20200520031502.175659-1-ying.huang@intel.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    49070588
swapfile.c 96.7 KB