    mm: migration: take a reference to the anon_vma before migrating · 3f6c8272
    Mel Gorman authored
    This patchset is a memory compaction mechanism that reduces external
    memory fragmentation by moving GFP_MOVABLE pages to a smaller number of
    pageblocks.  The term "compaction" was chosen as there are a number of
    mechanisms that are not mutually exclusive that can be used to defragment
    memory.  For example, lumpy reclaim is a form of defragmentation as was
    slub "defragmentation" (really a form of targeted reclaim).  Hence, this
    is called "compaction" to distinguish it from other forms of
    defragmentation.
    
    In this implementation, a full compaction run involves two scanners
    operating within a zone - a migration and a free scanner.  The migration
    scanner starts at the beginning of a zone and finds all movable pages
    within one pageblock_nr_pages-sized area and isolates them on a
    migratepages list.  The free scanner begins at the end of the zone and
    searches on a per-area basis for enough free pages to migrate all the
    pages on the migratepages list.  As each area is respectively migrated or
    exhausted of free pages, the scanners are advanced one area.  A compaction
    run completes within a zone when the two scanners meet.
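
    The two-scanner walk described above can be sketched in userspace C.
    This is a toy model and not the kernel code: the page states, the
    AREA_PAGES constant (a stand-in for pageblock_nr_pages) and the function
    name are assumptions made for illustration, and "migration" is reduced
    to swapping array entries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page states for a toy zone; not kernel types. */
enum page_state { PAGE_FREE, PAGE_MOVABLE, PAGE_UNMOVABLE };

#define AREA_PAGES 4	/* stand-in for pageblock_nr_pages */

/* Compact a toy zone in place and return the number of pages migrated. */
size_t compact_zone_sketch(enum page_state *zone, size_t nr_pages)
{
	size_t migrate_pfn = 0;		/* migration scanner, from the start */
	size_t free_pfn = nr_pages;	/* free scanner, from the end */
	size_t moved = 0;

	assert(zone != NULL);

	while (migrate_pfn < free_pfn) {
		size_t isolated[AREA_PAGES];
		size_t nr_isolated = 0;
		size_t area_end = migrate_pfn + AREA_PAGES;

		if (area_end > free_pfn)
			area_end = free_pfn;

		/* Migration scanner: isolate the movable pages of one area. */
		for (; migrate_pfn < area_end; migrate_pfn++)
			if (zone[migrate_pfn] == PAGE_MOVABLE)
				isolated[nr_isolated++] = migrate_pfn;

		/* Free scanner: walk backwards for one target per isolated page. */
		for (size_t i = 0; i < nr_isolated; i++) {
			while (free_pfn > migrate_pfn &&
			       zone[free_pfn - 1] != PAGE_FREE)
				free_pfn--;
			if (free_pfn <= migrate_pfn)
				return moved;	/* the scanners met */
			free_pfn--;
			zone[free_pfn] = PAGE_MOVABLE;
			zone[isolated[i]] = PAGE_FREE;
			moved++;
		}
	}
	return moved;
}
```

    Movable pages drift towards the end of the toy zone and free pages
    collect at the start.  Unmovable pages stay where they are.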
    
    This method is a bit primitive but is easy to understand.  Greater
    sophistication would require maintenance of counters on a per-pageblock
    basis, which would slow the allocator fast-paths purely to improve
    compaction.  That is a poor trade-off.
    
    It also does not try to relocate virtually contiguous pages to be physically
    contiguous.  However, assuming transparent hugepages were in use, a
    hypothetical khugepaged might reuse compaction code to isolate free pages,
    split them and relocate userspace pages for promotion.
    
    Memory compaction can be triggered in one of three ways.  It may be
    triggered explicitly by writing any value to /proc/sys/vm/compact_memory,
    which compacts all of memory.  It can be triggered on a per-node basis by
    writing any value to /sys/devices/system/node/nodeN/compact where N is the
    node ID to be compacted.  When a process fails to allocate a high-order
    page, it may compact memory in an attempt to satisfy the allocation
    instead of entering direct reclaim.  Explicit compaction does not finish
    until the two scanners meet and direct compaction ends if a suitable page
    becomes available that would meet watermarks.
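
    The two explicit triggers can be driven from a small userspace helper.
    The paths are the ones named above.  The helper names are made up for
    this sketch, and the write needs sufficient privileges on a kernel with
    compaction built in.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write a value to a trigger file.  Returns 0 on success, -1 on failure. */
int trigger_compaction(const char *path)
{
	ssize_t n;
	int rc;
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	n = write(fd, "1\n", 2);	/* any value triggers compaction */
	rc = close(fd);
	return (n == 2 && rc == 0) ? 0 : -1;
}

/* Build the per-node trigger path.  Returns -1 if buf is too small. */
int node_compact_path(char *buf, size_t len, int node)
{
	int n = snprintf(buf, len,
			 "/sys/devices/system/node/node%d/compact", node);

	return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

    Calling trigger_compaction("/proc/sys/vm/compact_memory") requests
    compaction of all of memory.  Passing a path built by
    node_compact_path() restricts it to one node.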
    
    The series is in 14 patches.  The first three are not "core" to the series
    but are important pre-requisites.
    
    Patch 1 reference counts anon_vma for rmap_walk_anon(). Without this
    	patch, it's possible to use anon_vma after free if the caller is
    	not holding a VMA or mmap_sem for the pages in question. While
    	there should be no existing user that causes this problem,
    	it's a requirement for memory compaction to be stable. The patch
    	is at the start of the series for bisection reasons.
    Patch 2 merges the KSM and migrate counts. It could be merged with patch 1
    	but would be slightly harder to review.
    Patch 3 skips over unmapped anon pages during migration as there are no
    	guarantees about the anon_vma existing. There is a window between
    	when a page was isolated and migration started during which anon_vma
    	could disappear.
    Patch 4 notes that PageSwapCache pages can still be migrated even if they
    	are unmapped.
    Patch 5 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
    Patch 6 exports a "unusable free space index" via debugfs. It's
    	a measure of external fragmentation that takes the size of the
    	allocation request into account. It can also be calculated from
    	userspace so can be dropped if requested
    Patch 7 exports a "fragmentation index" which only has meaning when an
    	allocation request fails. It determines if an allocation failure
    	would be due to a lack of memory or external fragmentation.
    Patch 8 moves the definition for LRU isolation modes for use by compaction
    Patch 9 is the compaction mechanism although it's unreachable at this point
    Patch 10 adds a means of compacting all of memory with a proc trigger
    Patch 11 adds a means of compacting a specific node with a sysfs trigger
    Patch 12 adds "direct compaction" before "direct reclaim" if it is
    	determined there is a good chance of success.
    Patch 13 adds a sysctl that allows tuning of the threshold at which the
    	kernel will compact or direct reclaim
    Patch 14 temporarily disables compaction if an allocation failure occurs
    	after compaction.
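
    As an illustration of the kind of measure patch 6 exports, the sketch
    below computes an unusable free space index from per-order free block
    counts, such as those shown in /proc/buddyinfo.  It assumes the index
    is the fraction of free memory held in blocks too small to satisfy a
    request of the given order.  The function name, the NR_ORDERS constant
    and the floating-point result are choices made for this sketch, and the
    scaling used by the patch itself may differ.

```c
#include <assert.h>

#define NR_ORDERS 11	/* assumed number of buddy orders, 0..10 */

/*
 * free_blocks[i] is the number of free blocks of order i.
 * Returns a value between 0.0 (all free memory is usable for the
 * request) and 1.0 (none of it is).
 */
double unusable_free_index(const unsigned long free_blocks[NR_ORDERS],
			   unsigned int order)
{
	unsigned long total_free = 0;	/* all free pages */
	unsigned long usable = 0;	/* free pages in large enough blocks */

	assert(order < NR_ORDERS);

	for (unsigned int i = 0; i < NR_ORDERS; i++) {
		unsigned long pages = free_blocks[i] << i;

		total_free += pages;
		if (i >= order)
			usable += pages;
	}

	if (total_free == 0)
		return 1.0;	/* nothing free, so nothing is usable */

	return (double)(total_free - usable) / (double)total_free;
}
```

    A value near 1.0 for the huge page order while plenty of memory is free
    indicates that the free memory is externally fragmented.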
    
    Testing of compaction was in three stages.  For the test, debugging,
    preempt, the sleep watchdog and lockdep were all enabled but nothing nasty
    popped out.  min_free_kbytes was tuned as recommended by hugeadm to help
    fragmentation avoidance and high-order allocations.  It was tested on X86,
    X86-64 and PPC64.
    
    The first test represents one of the easiest cases that can be faced for
    lumpy reclaim or memory compaction.
    
    1. Machine freshly booted and configured for hugepage usage with
    	a) hugeadm --create-global-mounts
    	b) hugeadm --pool-pages-max DEFAULT:8G
    	c) hugeadm --set-recommended-min_free_kbytes
    	d) hugeadm --set-recommended-shmmax
    
    	The min_free_kbytes here is important. Anti-fragmentation works best
    	when pageblocks don't mix. hugeadm knows how to calculate a value that
    	will significantly reduce the worst of external-fragmentation-related
    	events as reported by the mm_page_alloc_extfrag tracepoint.
    
    2. Load up memory
    	a) Start updatedb
    	b) Create in parallel X files of pagesize*128 in size. Wait
    	   until files are created. By parallel, I mean that 4096 instances
    	   of dd were launched, one after the other using &. The crude
    	   objective being to mix filesystem metadata allocations with
    	   the buffer cache.
    	c) Delete every second file so that pageblocks are likely to
    	   have holes
    	d) kill updatedb if it's still running
    
    	At this point, the system is quiet, memory is full but it's full with
    	clean filesystem metadata and clean buffer cache that is unmapped.
    	This is readily migrated or discarded so you'd expect lumpy reclaim
    	to have no significant advantage over compaction but this is at
    	the POC stage.
    
    3. In increments, attempt to allocate 5% of memory as hugepages.
    	   Measure how long it took, how successful it was, how many
    	   direct reclaims took place and how many compactions. Note
    	   the compaction figures might not fully add up as compactions
    	   can take place for orders other than the hugepage size
    
    X86                             vanilla         compaction
    Final page count:                   913                916 (attempted 1002)
    Total pages reclaimed:            68296               9791
    
    X86-64                          vanilla         compaction
    Final page count:                   901                902 (attempted 1002)
    Total pages reclaimed:           112599              53234
    
    PPC64                           vanilla         compaction
    Final page count:                    93                 94 (attempted 110)
    Total pages reclaimed:           103216              61838
    
    There was not a dramatic improvement in success rates but it wouldn't be
    expected in this case either.  What was important is that fewer pages were
    reclaimed in all cases reducing the amount of IO required to satisfy a
    huge page allocation.
    
    The second tests were all performance related - kernbench, netperf, iozone
    and sysbench.  None showed anything too remarkable.
    
    The last test was a high-order allocation stress test.  Many kernel
    compiles are started to fill memory with a pressured mix of unmovable and
    movable allocations.  During this, an attempt is made to allocate 90% of
    memory as huge pages - one at a time with small delays between attempts to
    avoid flooding the IO queue.
    
                                                 vanilla   compaction
    Percentage of request allocated X86               98           99
    Percentage of request allocated X86-64            95           98
    Percentage of request allocated PPC64             55           70
    
    This patch:
    
    rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
    locking an anon_vma and it does not appear to have sufficient locking to
    ensure the anon_vma does not disappear from under it.
    
    This patch copies an approach used by KSM to take a reference on the
    anon_vma while pages are being migrated.  This should prevent rmap_walk()
    running into nasty surprises later because anon_vma has been freed.
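
    The idea can be modelled in userspace as follows.  This is a
    single-threaded sketch of the reference-holding pattern and not the
    kernel implementation: the structure, its fields and the function
    names are hypothetical, there is no locking, and a "freed" flag stands
    in for the actual free so the effect can be observed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an anon_vma; not the kernel structure. */
struct anon_vma_sketch {
	int external_refcount;	/* references held by e.g. migration */
	bool has_vmas;		/* true while a VMA is still attached */
	bool freed;		/* set where the real code would free */
};

/* Free only when no VMA is attached and nobody holds a reference. */
static void sketch_maybe_free(struct anon_vma_sketch *av)
{
	if (!av->has_vmas && av->external_refcount == 0)
		av->freed = true;
}

/* Taken before migration starts on a page using this anon_vma. */
void sketch_get(struct anon_vma_sketch *av)
{
	assert(!av->freed);
	av->external_refcount++;
}

/* Dropped once migration, including the rmap walk, has finished. */
void sketch_put(struct anon_vma_sketch *av)
{
	assert(av->external_refcount > 0);
	av->external_refcount--;
	sketch_maybe_free(av);
}

/* The last VMA goes away, e.g. because the process unmapped it. */
void sketch_unlink_last_vma(struct anon_vma_sketch *av)
{
	av->has_vmas = false;
	sketch_maybe_free(av);
}
```

    If the last VMA is unlinked between sketch_get() and sketch_put(), the
    object survives until the put, which is the window a walker needs.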
    
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Acked-by: Rik van Riel <riel@redhat.com>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>