• Johannes Weiner's avatar
    mm: fix LRU balancing effect of new transparent huge pages · 5df74196
    Johannes Weiner authored
    The reclaim code that balances between swapping and cache reclaim tries to
    predict likely reuse based on in-memory reference patterns alone.  This
    works in many cases, but when it fails it cannot detect when the cache is
    thrashing pathologically, or when we're in the middle of a swap storm.
    
    The high seek cost of rotational drives under which the algorithm evolved
    also meant that mistakes could quickly result in lockups from too
    aggressive swapping (which is predominantly random IO).  As a result, the
    balancing code has been tuned over time to a point where it mostly goes
    for page cache and defers swapping until the VM is under significant
    memory pressure.
    
    The resulting strategy doesn't make optimal caching decisions - where
    optimal is the least amount of IO required to execute the workload.
    
    The proliferation of fast random IO devices such as SSDs, in-memory
    compression such as zswap, and persistent memory technologies on the
    horizon, has made this undesirable behavior very noticable: Even in the
    presence of large amounts of cold anonymous memory and a capable swap
    device, the VM refuses to even seriously scan these pages, and can leave
    the page cache thrashing needlessly.
    
    This series sets out to address this.  Since commit ("a528910e mm:
    thrash detection-based file cache sizing") we have exact tracking of
    refault IO - the ultimate cost of reclaiming the wrong pages.  This allows
    us to use an IO cost based balancing model that is more aggressive about
    scanning anonymous memory when the cache is thrashing, while being able to
    avoid unnecessary swap storms.
    
    These patches base the LRU balance on the rate of refaults on each list,
    times the relative IO cost between swap device and filesystem
    (swappiness), in order to optimize reclaim for least IO cost incurred.
    
    	History
    
    I floated these changes in 2016.  At the time they were incomplete and
    full of workarounds due to a lack of infrastructure in the reclaim code:
    We didn't have PageWorkingset, we didn't have hierarchical cgroup
    statistics, and problems with the cgroup swap controller.  As swapping
    wasn't too high a priority then, the patches stalled out.  With all
    dependencies in place now, here we are again with much cleaner,
    feature-complete patches.
    
    I kept the acks for patches that stayed materially the same :-)
    
    Below is a series of test results that demonstrate certain problematic
    behavior of the current code, as well as showcase the new code's more
    predictable and appropriate balancing decisions.
    
    	Test #1: No convergence
    
    This test shows an edge case where the VM currently doesn't converge at
    all on a new file workingset with a stale anon/tmpfs set.
    
    The test sets up a cold anon set the size of 3/4 RAM, then tries to
    establish a new file set half the size of RAM (flat access pattern).
    
    The vanilla kernel refuses to even scan anon pages and never converges.
    The file set is perpetually served from the filesystem.
    
    The first test kernel is with the series up to the workingset patch
    applied.  This allows thrashing page cache to challenge the anonymous
    workingset.  The VM then scans the lists based on the current
    scanned/rotated balancing algorithm.  It converges on a stable state where
    all cold anon pages are pushed out and the fileset is served entirely from
    cache:
    
    			    noconverge/5.7-rc5-mm	noconverge/5.7-rc5-mm-workingset
    Scanned			417719308.00 (    +0.00%)		64091155.00 (   -84.66%)
    Reclaimed		417711094.00 (    +0.00%)		61640308.00 (   -85.24%)
    Reclaim efficiency %	      100.00 (    +0.00%)		      96.18 (    -3.78%)
    Scanned file		417719308.00 (    +0.00%)		59211118.00 (   -85.83%)
    Scanned anon			0.00 (    +0.00%)	         4880037.00 (          )
    Swapouts			0.00 (    +0.00%)	         2439957.00 (          )
    Swapins				0.00 (    +0.00%)		     257.00 (          )
    Refaults		415246605.00 (    +0.00%)		59183722.00 (   -85.75%)
    Restore refaults		0.00 (    +0.00%)	        54988252.00 (          )
    
    The second test kernel is with the full patch series applied, which
    replaces the scanned/rotated ratios with refault/swapin rate-based
    balancing.  It evicts the cold anon pages more aggressively in the
    presence of a thrashing cache and the absence of swapins, and so converges
    with about 60% of the IO and reclaim activity:
    
    			noconverge/5.7-rc5-mm-workingset	noconverge/5.7-rc5-mm-lrubalance
    Scanned				64091155.00 (    +0.00%)		37579741.00 (   -41.37%)
    Reclaimed			61640308.00 (    +0.00%)		35129293.00 (   -43.01%)
    Reclaim efficiency %		      96.18 (    +0.00%)		      93.48 (    -2.78%)
    Scanned file			59211118.00 (    +0.00%)		32708385.00 (   -44.76%)
    Scanned anon			 4880037.00 (    +0.00%)		 4871356.00 (    -0.18%)
    Swapouts			 2439957.00 (    +0.00%)		 2435565.00 (    -0.18%)
    Swapins				     257.00 (    +0.00%)		     262.00 (    +1.94%)
    Refaults			59183722.00 (    +0.00%)		32675667.00 (   -44.79%)
    Restore refaults		54988252.00 (    +0.00%)		28480430.00 (   -48.21%)
    
    We're triggering this case in host sideloading scenarios: When a host's
    primary workload is not saturating the machine (primary load is usually
    driven by user activity), we can optimistically sideload a batch job; if
    user activity picks up and the primary workload needs the whole host
    during this time, we freeze the sideload and rely on it getting pushed to
    swap.  Frequently that swapping doesn't happen and the completely inactive
    sideload simply stays resident while the expanding primary worklad is
    struggling to gain ground.
    
    	Test #2: Kernel build
    
    This test is a a kernel build that is slightly memory-restricted (make -j4
    inside a 400M cgroup).
    
    Despite the very aggressive swapping of cold anon pages in test #1, this
    test shows that the new kernel carefully balances swap against cache
    refaults when both the file and the cache set are pressured.
    
    It shows the patched kernel to be slightly better at finding the coldest
    memory from the combined anon and file set to evict under pressure.  The
    result is lower aggregate reclaim and paging activity:
    
    z				    5.7-rc5-mm	5.7-rc5-mm-lrubalance
    Real time		   210.60 (    +0.00%)	   210.97 (    +0.18%)
    User time		   745.42 (    +0.00%)	   746.48 (    +0.14%)
    System time		    69.78 (    +0.00%)	    69.79 (    +0.02%)
    Scanned file		354682.00 (    +0.00%)	293661.00 (   -17.20%)
    Scanned anon		465381.00 (    +0.00%)	378144.00 (   -18.75%)
    Swapouts		185920.00 (    +0.00%)	147801.00 (   -20.50%)
    Swapins			 34583.00 (    +0.00%)	 32491.00 (    -6.05%)
    Refaults		212664.00 (    +0.00%)	172409.00 (   -18.93%)
    Restore refaults	 48861.00 (    +0.00%)	 80091.00 (   +63.91%)
    Total paging IO		433167.00 (    +0.00%)	352701.00 (   -18.58%)
    
    	Test #3: Overload
    
    This next test is not about performance, but rather about the
    predictability of the algorithm.  The current balancing behavior doesn't
    always lead to comprehensible results, which makes performance analysis
    and parameter tuning (swappiness e.g.) very difficult.
    
    The test shows the balancing behavior under equivalent anon and file
    input.  Anon and file sets are created of equal size (3/4 RAM), have the
    same access patterns (a hot-cold gradient), and synchronized access rates.
    Swappiness is raised from the default of 60 to 100 to indicate equal IO
    cost between swap and cache.
    
    With the vanilla balancing code, anon scans make up around 9% of the total
    pages scanned, or a ~1:10 ratio.  This is a surprisingly skewed ratio, and
    it's an outcome that is hard to explain given the input parameters to the
    VM.
    
    The new balancing model targets a 1:2 balance: All else being equal,
    reclaiming a file page costs one page IO - the refault; reclaiming an anon
    page costs two IOs - the swapout and the swapin.  In the test we observe a
    ~1:3 balance.
    
    The scanned and paging IO numbers indicate that the anon LRU algorithm we
    have in place right now does a slightly worse job at picking the coldest
    pages compared to the file algorithm.  There is ongoing work to improve
    this, like Joonsoo's anon workingset patches; however, it's difficult to
    compare the two aging strategies when the balancing between them is
    behaving unintuitively.
    
    The slightly less efficient anon reclaim results in a deviation from the
    optimal 1:2 scan ratio we would like to see here - however, 1:3 is much
    closer to what we'd want to see in this test than the vanilla kernel's
    aging of 10+ cache pages for every anonymous one:
    
    			overload-100/5.7-rc5-mm-workingset	overload-100/5.7-rc5-mm-lrubalance-realfile
    Scanned				 533633725.00 (    +0.00%)			  595687785.00 (   +11.63%)
    Reclaimed			 494325440.00 (    +0.00%)			  518154380.00 (    +4.82%)
    Reclaim efficiency %			92.63 (    +0.00%)				 86.98 (    -6.03%)
    Scanned file			 484532894.00 (    +0.00%)			  456937722.00 (    -5.70%)
    Scanned anon			  49100831.00 (    +0.00%)			  138750063.00 (  +182.58%)
    Swapouts			   8096423.00 (    +0.00%)			   48982142.00 (  +504.98%)
    Swapins				  10027384.00 (    +0.00%)			   62325044.00 (  +521.55%)
    Refaults			 479819973.00 (    +0.00%)			  451309483.00 (    -5.94%)
    Restore refaults		 426422087.00 (    +0.00%)			  399914067.00 (    -6.22%)
    Total paging IO			 497943780.00 (    +0.00%)			  562616669.00 (   +12.99%)
    
    	Test #4: Parallel IO
    
    It's important to note that these patches only affect the situation where
    the kernel has to reclaim workingset memory, which is usually a
    transitionary period.  The vast majority of page reclaim occuring in a
    system is from trimming the ever-expanding page cache.
    
    These patches don't affect cache trimming behavior.  We never swap as long
    as we only have use-once cache moving through the file LRU, we only
    consider swapping when the cache is actively thrashing.
    
    The following test demonstrates this.  It has an anon workingset that
    takes up half of RAM and then writes a file that is twice the size of RAM
    out to disk.
    
    As the cache is funneled through the inactive file list, no anon pages are
    scanned (aside from apparently some background noise of 10 pages):
    
    					  5.7-rc5-mm		          5.7-rc5-mm-lrubalance
    Scanned			    10714722.00 (    +0.00%)		       10723445.00 (    +0.08%)
    Reclaimed		    10703596.00 (    +0.00%)		       10712166.00 (    +0.08%)
    Reclaim efficiency %		  99.90 (    +0.00%)			     99.89 (    -0.00%)
    Scanned file		    10714722.00 (    +0.00%)		       10723435.00 (    +0.08%)
    Scanned anon			   0.00 (    +0.00%)			     10.00 (          )
    Swapouts			   0.00 (    +0.00%)			      7.00 (          )
    Swapins				   0.00 (    +0.00%)			      0.00 (    +0.00%)
    Refaults			  92.00 (    +0.00%)			     41.00 (   -54.84%)
    Restore refaults		   0.00 (    +0.00%)			      0.00 (    +0.00%)
    Total paging IO			  92.00 (    +0.00%)			     48.00 (   -47.31%)
    
    This patch (of 14):
    
    Currently, THP are counted as single pages until they are split right
    before being swapped out.  However, at that point the VM is already in the
    middle of reclaim, and adjusting the LRU balance then is useless.
    
    Always account THP by the number of basepages, and remove the fixup from
    the splitting path.
    Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: default avatarShakeel Butt <shakeelb@google.com>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Reviewed-by: default avatarRik van Riel <riel@surriel.com>
    Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
    Acked-by: default avatarMichal Hocko <mhocko@suse.com>
    Acked-by: default avatarMinchan Kim <minchan@kernel.org>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Link: http://lkml.kernel.org/r/20200520232525.798933-1-hannes@cmpxchg.org
    Link: http://lkml.kernel.org/r/20200520232525.798933-2-hannes@cmpxchg.orgSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    5df74196
swap.c 32.1 KB