    mm: proactive compaction · facdaa91
    Nitin Gupta authored
    For some applications, we need to allocate almost all memory as hugepages.
    However, on a running system, higher-order allocations can fail if the
    memory is fragmented.  The Linux kernel currently does on-demand compaction
    as we request more hugepages, but this style of compaction incurs very high
    latency.  Experiments with one-time full memory compaction (followed by
    hugepage allocations) show that the kernel is able to restore a highly
    fragmented memory state to a fairly compacted memory state within <1 sec
    for a 32G system.  Such data suggests that more proactive compaction can
    help us allocate a large fraction of memory as hugepages while keeping
    allocation latencies low.
    
    For a more proactive compaction, the approach taken here is to define a
    new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
    external fragmentation which kcompactd tries to maintain.
    
    The tunable takes a value in range [0, 100], with a default of 20.
    
    Note that a previous version of this patch [1] was found to introduce too
    many tunables (per-order extfrag{low, high}), but this one reduces them to
    just one sysctl.  Also, the new tunable is an opaque value instead of
    asking for specific bounds of "external fragmentation", which would have
    been difficult to estimate.  The internal interpretation of this opaque
    value allows for future fine-tuning.
    
    Currently, we use a simple translation from this tunable to [low, high]
    "fragmentation score" thresholds (low=100-proactiveness, high=low+10).
    The score for a node is defined as the weighted mean of per-zone external
    fragmentation.  A zone's present_pages determines its weight.
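
    As a minimal sketch of the translation and score described above (not the
    code in this patch): the struct and function names are hypothetical, high
    is taken as low+10 to match the default of low=80 and high=90, and the
    clamping at 100 is an assumption.

```c
#include <stddef.h>

/* Hypothetical per-zone sample; only the two fields named in the text. */
struct zone_sample {
    unsigned long present_pages; /* weight of this zone */
    unsigned int extfrag;        /* external fragmentation, 0..100 */
};

/* low = 100 - proactiveness; input clamped to [0, 100] (assumption). */
unsigned int frag_low_threshold(unsigned int proactiveness)
{
    if (proactiveness > 100u)
        proactiveness = 100u;
    return 100u - proactiveness;
}

/* high = low + 10, capped at 100 (the cap is an assumption). */
unsigned int frag_high_threshold(unsigned int proactiveness)
{
    unsigned int high = frag_low_threshold(proactiveness) + 10u;

    return high > 100u ? 100u : high;
}

/* Node score: mean of per-zone extfrag, weighted by present_pages. */
unsigned int node_frag_score(const struct zone_sample *zones, size_t n)
{
    unsigned long long weighted = 0, total = 0;

    for (size_t i = 0; i < n; i++) {
        weighted += (unsigned long long)zones[i].present_pages *
                    zones[i].extfrag;
        total += zones[i].present_pages;
    }
    return total ? (unsigned int)(weighted / total) : 0u;
}
```

    For example, two zones with (present_pages, extfrag) of (300, 40) and
    (100, 80) give a node score of 50 in this sketch.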
    
    To periodically check per-node score, we reuse per-node kcompactd threads,
    which are woken up every 500 milliseconds to check the same.  If a node's
    score exceeds its high threshold (as derived from user-provided
    proactiveness value), proactive compaction is started until its score
    reaches its low threshold value.  By default, proactiveness is set to 20,
    which implies threshold values of low=80 and high=90.
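
    The start/stop rule above is a hysteresis between the two thresholds.  A
    rough sketch of that rule, with hypothetical names and no claim to mirror
    the actual kcompactd code:

```c
#include <stdbool.h>

/* Hypothetical state for one node's proactive compaction decision. */
struct frag_hysteresis {
    unsigned int low;   /* stop compacting once score <= low */
    unsigned int high;  /* start compacting once score > high */
    bool compacting;
};

/*
 * Called on each periodic check with the node's current score.
 * Returns true while proactive compaction should be running.
 */
bool should_compact(struct frag_hysteresis *h, unsigned int score)
{
    if (!h->compacting && score > h->high)
        h->compacting = true;       /* score exceeded the high threshold */
    else if (h->compacting && score <= h->low)
        h->compacting = false;      /* score reached the low threshold */
    return h->compacting;
}
```

    With low=80 and high=90, a score of 85 does not start compaction on its
    own, but compaction that was started at 91 keeps running at 85 and stops
    only at 80.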
    
    This patch is largely based on ideas from Michal Hocko [2].  See also the
    LWN article [3].
    
    Performance data
    ================
    
    System: x86_64, 1T RAM, 80 CPU threads.
    Kernel: 5.6.0-rc3 + this patch
    
    echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
    echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
    
    Before starting the driver, the system was fragmented from a userspace
    program that allocates all memory and then for each 2M aligned section,
    frees 3/4 of base pages using munmap.  The workload is mainly anonymous
    userspace pages, which are easy to move around.  I intentionally avoided
    unmovable pages in this test to see how much latency we incur when
    hugepage allocations hit direct compaction.
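
    The fragmentation step can be approximated as follows.  This is only a
    sketch, not the program used for the numbers below: the function name is
    hypothetical, it works on a caller-chosen number of sections rather than
    all memory, and freeing the first 3/4 of each section is an assumption
    (the text does not say which base pages are freed).

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define SECTION_SIZE (2UL << 20) /* 2M, the hugepage size used in the text */

/*
 * Map `sections` 2M-aligned sections of anonymous memory, touch every base
 * page, then munmap the first 3/4 of each section.  Returns the number of
 * bytes left mapped inside the sections, or 0 on failure.  The remaining
 * mappings (and the alignment slack) are intentionally left in place.
 */
size_t fragment_sections(size_t sections)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = (sections + 1) * SECTION_SIZE; /* one extra for alignment */
    size_t hole = SECTION_SIZE / 4 * 3;
    size_t kept = 0;
    unsigned char *raw, *base;

    if (page <= 0)
        return 0;

    raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return 0;

    /* Round up to the next 2M boundary. */
    base = (unsigned char *)(((uintptr_t)raw + SECTION_SIZE - 1) &
                             ~((uintptr_t)SECTION_SIZE - 1));

    for (size_t i = 0; i < sections; i++) {
        unsigned char *sec = base + i * SECTION_SIZE;

        /* Touch each base page so it is actually backed by memory. */
        for (size_t off = 0; off < SECTION_SIZE; off += (size_t)page)
            sec[off] = 1;

        if (munmap(sec, hole) != 0)
            return 0;
        kept += SECTION_SIZE - hole;
    }
    return kept;
}
```

    Calling fragment_sections(2) leaves 512K mapped at the end of each of two
    2M sections, so free memory is interleaved with in-use anonymous pages.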
    
    1. Kernel hugepage allocation latencies
    
    With the system in such a fragmented state, a kernel driver then allocates
    as many hugepages as possible and measures allocation latency:
    
    (all latency values are in microseconds)
    
    - With vanilla 5.6.0-rc3
    
      percentile latency
      –––––––––– –––––––
    	   5    7894
    	  10    9496
    	  25   12561
    	  30   15295
    	  40   18244
    	  50   21229
    	  60   27556
    	  75   30147
    	  80   31047
    	  90   32859
    	  95   33799
    
    Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
    total free => 98% of free memory could be allocated as hugepages)
    
    - With 5.6.0-rc3 + this patch, with proactiveness=20
    
    sysctl -w vm.compaction_proactiveness=20
    
      percentile latency
      –––––––––– –––––––
    	   5       2
    	  10       2
    	  25       3
    	  30       3
    	  40       3
    	  50       4
    	  60       4
    	  75       4
    	  80       4
    	  90       5
    	  95     429
    
    Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
    total free => 98% of free memory could be allocated as hugepages)
    
    2. Java heap allocation
    
    In this test, we first fragment memory using the same method as for (1).
    
    Then, we start a Java process with a heap size set to 700G and request the
    heap to be allocated with THP hugepages.  We also set THP to madvise to
    allow hugepage backing of this heap.
    
    /usr/bin/time
     java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
    
    The above command allocates 700G of Java heap using hugepages.
    
    - With vanilla 5.6.0-rc3
    
    17.39user 1666.48system 27:37.89elapsed
    
    - With 5.6.0-rc3 + this patch, with proactiveness=20
    
    8.35user 194.58system 3:19.62elapsed
    
    Elapsed time remains around 3:15 as proactiveness is further increased.
    
    Note that proactive compaction happens throughout the runtime of these
    workloads.  The situation of one-time compaction, sufficient to supply
    hugepages for the following allocation stream, can probably happen for
    more extreme proactiveness values, like 80 or 90.
    
    In the above Java workload, proactiveness is set to 20.  The test starts
    with a node's score of 80 or higher, depending on the delay between the
    fragmentation step and starting the benchmark, which gives more-or-less
    time for the initial round of compaction.  As the benchmark consumes
    hugepages, the node's score quickly rises above the high threshold (90) and
    proactive compaction starts again, which brings down the score to the low
    threshold level (80).  Repeat.
    
    bpftrace also confirms proactive compaction running 20+ times during the
    runtime of this Java benchmark.  A kcompactd thread consumes 100% of one
    of the CPUs while it tries to bring a node's score within thresholds.
    
    Backoff behavior
    ================
    
    The above workloads produce a memory state which is easy to compact.  However,
    if memory is filled with unmovable pages, proactive compaction should
    essentially back off.  To test this aspect:
    
    - Created a kernel driver that allocates almost all memory as hugepages
      followed by freeing the first 3/4 of each hugepage.
    - Set proactiveness=40
    - Note that proactive_compact_node() is deferred the maximum number of times
      with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
      (=> ~30 seconds between retries).
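
    The "~30 seconds" figure can be sanity-checked with a small sketch of a
    deferral counter.  The names and the maximum defer shift of 6 (64 skipped
    checks) are assumptions chosen for illustration; only the 500 ms check
    interval comes from the description above.

```c
#include <stdbool.h>

#define CHECK_INTERVAL_MSEC 500u /* kcompactd wakeup interval from the text */
#define MAX_DEFER_SHIFT 6u       /* assumption: defer 1 << 6 = 64 checks */

/* Hypothetical per-node backoff state. */
struct backoff {
    unsigned int defer; /* number of upcoming checks to skip */
};

/* Called when a proactive compaction round failed to lower the score. */
void backoff_no_progress(struct backoff *b)
{
    b->defer = 1u << MAX_DEFER_SHIFT;
}

/*
 * Called on every periodic wakeup.  Returns true if compaction may be
 * attempted now, false if this wakeup is skipped.
 */
bool backoff_tick(struct backoff *b)
{
    if (b->defer) {
        b->defer--;
        return false;
    }
    return true;
}
```

    Under these assumptions, 64 skipped checks at 500 ms each amount to 32
    seconds between retries, in line with the figure quoted above.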
    
    [1] https://patchwork.kernel.org/patch/11098289/
    [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
    [3] https://lwn.net/Articles/817905/
    
    Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
    Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Nitin Gupta <ngupta@nitingupta.dev>
    Cc: Oleksandr Natalenko <oleksandr@redhat.com>
    Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>