    mm/free_pcppages_bulk: prefetch buddy while not holding lock
    When a page is freed back to the global pool, its buddy will be checked
    to see if it's possible to do a merge.  This requires accessing the
    buddy's page structure, and that access could take a long time if it's
    cache cold.
    
    This patch adds a prefetch of the to-be-freed page's buddy outside of
    zone->lock, in the hope that accessing the buddy's page structure later
    under zone->lock will be faster.  Since we *always* do buddy merging
    and check an order-0 page's buddy to try to merge it when it goes into
    the main allocator, the cacheline will always come in, i.e.  the
    prefetched data will never be unused.
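    
    As a concrete illustration, here is a minimal sketch of such a helper,
    assuming the existing page_to_pfn(), __find_buddy_pfn() and prefetch()
    kernel primitives (prefetch_buddy() is the helper this patch adds; the
    body below is a sketch of the idea, not necessarily the exact final
    code):
    
        /* Prefetch the 'struct page' of an order-0 page's buddy. */
        static inline void prefetch_buddy(struct page *page)
        {
                unsigned long pfn = page_to_pfn(page);
                unsigned long buddy_pfn = __find_buddy_pfn(pfn, 0);
                struct page *buddy = page + (buddy_pfn - pfn);
    
                prefetch(buddy);
        }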
    
    Normally, the number of prefetches will be pcp->batch (default 31,
    with an upper limit of (PAGE_SHIFT * 8) = 96 on x86_64), but when the
    pcp's pages all get drained, it will be pcp->count, which has an upper
    limit of pcp->high.  Although pcp->high has a default value of 186
    (pcp->batch=31 * 6), it can be changed by the user through
    /proc/sys/vm/percpu_pagelist_fraction and has no software upper limit,
    so it could be large, e.g. several thousand.  For this reason, only the
    buddies of the first pcp->batch pages are prefetched, to avoid
    excessive prefetching.
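    
    In free_pcppages_bulk(), that limit can be applied with a simple local
    counter while pages are taken off the pcp lists, still outside
    zone->lock.  A rough sketch (prefetch_nr is a hypothetical local
    counter; prefetch_buddy() is the helper sketched above):
    
        /*
         * Prefetch the buddy of at most pcp->batch pages while we are
         * still outside zone->lock, so the later merge under the lock
         * does not stall on a cache-cold 'struct page'.
         */
        if (prefetch_nr++ < pcp->batch)
                prefetch_buddy(page);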
    
    In the meantime, there are two concerns:
    
     1. the prefetch could potentially evict existing cachelines,
        especially from the L1D cache since it is not huge
    
     2. there is some additional instruction overhead, namely calculating
        buddy pfn twice
    
    For 1, it's hard to say; the microbenchmark below shows good results,
    but the actual benefit of this patch will be workload/CPU dependent;
    
    For 2, since the calculation is an XOR of two local variables, it is
    expected that in many cases the cycles spent will be offset by reduced
    memory latency later.  This is especially true for NUMA machines where
    multiple CPUs are contending on zone->lock, and the most time-consuming
    part under zone->lock is waiting for the 'struct page' cachelines of
    the to-be-freed pages and their buddies.
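    
    For reference, the buddy pfn calculation in question is a single XOR,
    as in the kernel's __find_buddy_pfn() helper (mm/internal.h):
    
        static inline unsigned long
        __find_buddy_pfn(unsigned long page_pfn, unsigned int order)
        {
                return page_pfn ^ (1 << order);
        }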
    
    Test with will-it-scale/page_fault1 full load:
    
      kernel        Broadwell(2S)    Skylake(2S)      Broadwell(4S)    Skylake(4S)
      v4.16-rc2+     9034215          7971818         13667135         15677465
      patch2/3       9536374 +5.6%    8314710 +4.3%   14070408 +3.0%   16675866 +6.4%
      this patch    10180856 +6.8%    8506369 +2.3%   14756865 +4.9%   17325324 +3.9%
    
    Note: this patch's performance improvement percent is against patch2/3.
    
    (Changelog stolen from Dave Hansen and Mel Gorman's comments at
    http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8ccd@intel.com)
    
    [aaron.lu@intel.com: use helper function, avoid disordering pages]
      Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
      Link: http://lkml.kernel.org/r/20180320113146.GB24737@intel.com
    [aaron.lu@intel.com: v4]
      Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
      Link: http://lkml.kernel.org/r/20180309082431.GB30868@intel.com
    Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
    Signed-off-by: Aaron Lu <aaron.lu@intel.com>
    Suggested-by: Ying Huang <ying.huang@intel.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Kemi Wang <kemi.wang@intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Tim Chen <tim.c.chen@linux.intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>