• Mel Gorman's avatar
    mm: numa: quickly fail allocations for NUMA balancing on full nodes · 18b75e0b
    Mel Gorman authored
    commit 8479eba7 upstream.
    
    Commit 4167e9b2 ("mm: remove GFP_THISNODE") removed the GFP_THISNODE
    flag combination due to confusing semantics.  It noted that
    alloc_misplaced_dst_page() was one such user after changes made by
    commit e97ca8e5 ("mm: fix GFP_THISNODE callers and clarify").
    
    Unfortunately when GFP_THISNODE was removed, users of
    alloc_misplaced_dst_page() started waking kswapd and entering direct
    reclaim because the wrong GFP flags are cleared.  The consequence is
    that workloads that used to fit into memory now get reclaimed which is
    addressed by this patch.
    
    The problem can be demonstrated with "mutilate" that exercises memcached
    which is software dedicated to memory object caching.  The configuration
    uses 80% of memory and is run 3 times for varying numbers of clients.
    The results on a 4-socket NUMA box are
    
    mutilate
                                4.4.0                 4.4.0
                              vanilla           numaswap-v1
    Hmean    1      8394.71 (  0.00%)     8395.32 (  0.01%)
    Hmean    4     30024.62 (  0.00%)    34513.54 ( 14.95%)
    Hmean    7     32821.08 (  0.00%)    70542.96 (114.93%)
    Hmean    12    55229.67 (  0.00%)    93866.34 ( 69.96%)
    Hmean    21    39438.96 (  0.00%)    85749.21 (117.42%)
    Hmean    30    37796.10 (  0.00%)    50231.49 ( 32.90%)
    Hmean    47    18070.91 (  0.00%)    38530.13 (113.22%)
    
    The metric is queries/second with the more the better.  The results are
    way outside of the noise and the reason for the improvement is obvious
    from some of the vmstats
    
                                     4.4.0       4.4.0
                                   vanillanumaswap-v1r1
    Minor Faults                1929399272  2146148218
    Major Faults                  19746529        3567
    Swap Ins                      57307366        9913
    Swap Outs                     50623229       17094
    Allocation stalls                35909         443
    DMA allocs                           0           0
    DMA32 allocs                  72976349   170567396
    Normal allocs               5306640898  5310651252
    Movable allocs                       0           0
    Direct pages scanned         404130893      799577
    Kswapd pages scanned         160230174           0
    Kswapd pages reclaimed        55928786           0
    Direct pages reclaimed         1843936       41921
    Page writes file                  2391           0
    Page writes anon              50623229       17094
    
    The vanilla kernel is swapping like crazy with large amounts of direct
    reclaim and kswapd activity.  The figures are aggregate but it's known
    that the bad activity is throughout the entire test.
    
    Note that simple streaming anon/file memory consumers also see this
    problem but it's not as obvious.  In those cases, kswapd is awake when
    it should not be.
    
    As there are at least two reclaim-related bugs out there, it's worth
    spelling out the user-visible impact.  This patch only addresses bugs
    related to excessive reclaim on NUMA hardware when the working set is
    larger than a NUMA node.  There is a bug related to high kswapd CPU
    usage but the reports are against laptops and other UMA hardware and is
    not addressed by this patch.
    Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: David Rientjes <rientjes@google.com>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    18b75e0b
migrate.c 47.6 KB