1. 01 Aug, 2012 37 commits
    • cpusets: avoid looping when storing to mems_allowed if one node remains set · 6b63ea81
      David Rientjes authored
      commit 89e8a244 upstream.
      
      Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is
      	extremely expensive and severely impacted page allocator performance.
      	This is part of a series of patches that reduce page allocator
      	overhead.
      
      {get,put}_mems_allowed() exist so that general kernel code may locklessly
      access a task's set of allowable nodes without having the chance that a
      concurrent write will cause the nodemask to be empty on configurations
      where MAX_NUMNODES > BITS_PER_LONG.
      
      This could incur a significant delay, however, especially in low memory
      conditions because the page allocator is blocking and reclaim requires
      get_mems_allowed() itself.  It is not atypical to see writes to
      cpuset.mems take over 2 seconds to complete, for example.  In low memory
      conditions, this is problematic because it's one of the most important
      times to change cpuset.mems in the first place!
      
      The only way a task's set of allowable nodes may change is through
      cpusets, either by writing to cpuset.mems or by attaching the task to a
      cpuset with a different set of mems.  Today the writer waits until
      generic code is not reading the nodemask with get_mems_allowed() at the
      same time before storing the new nodemask and clearing all the old
      nodes.  This prevents the possibility that a reader will see an empty
      nodemask at the same time the writer is storing a new nodemask.
      
      If at least one node remains unchanged, though, it's possible to simply
      set all new nodes and then clear all the old nodes.  Changing a task's
      nodemask is protected by cgroup_mutex so it's guaranteed that two threads
      are not changing the same task's nodemask at the same time, so the
      nodemask is guaranteed to be stored before another thread changes it and
      determines whether a node remains set or not.
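
      As a rough illustration of the idea (a simplified, userspace-style
      sketch with a plain bitmask standing in for nodemask_t; the helper name
      is made up, not the kernel's): when the old and new masks share a node,
      a two-step store never exposes an empty mask to a lockless reader.

      	#include <stdint.h>

      	/* tsk_mems stands in for tsk->mems_allowed; one bit per node. */
      	static void store_mems_allowed(uint64_t *tsk_mems, uint64_t newmems)
      	{
      		if (*tsk_mems & newmems) {
      			/* At least one node stays set: no looping needed. */
      			*tsk_mems |= newmems;	/* step 1: set all new nodes   */
      			*tsk_mems &= newmems;	/* step 2: clear all old nodes */
      		} else {
      			/* Disjoint masks: fall back to the slow, reader-aware
      			 * path, e.g. slow_rebind(tsk_mems, newmems) -- hypothetical. */
      		}
      	}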
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Paul Menage <paul@paulmenage.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6b63ea81
    • mm: vmscan: convert global reclaim to per-memcg LRU lists · 4d01a2e3
      Johannes Weiner authored
      commit b95a2f2d upstream - WARNING: this is a substitute patch.
      
      Stable note: Not tracked in Bugzilla. This is a partial backport of an
      	upstream commit addressing a completely different issue
      	that accidentally contained an important fix. The workload
      	this patch helps was memcached when IO is started in the
      	background. memcached should stay resident but without this patch
      	it gets swapped. Sometimes this manifests as a drop in throughput
      	but mostly it was observed through /proc/vmstat.
      
      Commit [246e87a9: memcg: fix get_scan_count() for small targets] was meant
      to fix a problem whereby small scan targets on memcg were ignored causing
      priority to rise too sharply. It forced scanning to take place if the
      target was small, memcg or kswapd.
      
      From the time it was introduced it caused excessive reclaim by kswapd
      with workloads being pushed to swap that previously would have stayed
      resident. This was accidentally fixed in commit [b95a2f2d: mm: vmscan:
      convert global reclaim to per-memcg LRU lists] by making it harder for
      kswapd to force scan small targets but that patchset is not suitable for
      backporting. This was later changed again by commit [90126375: mm/vmscan:
      push lruvec pointer into get_scan_count()] into a format that looks
      like it would be a straight-forward backport but there is a subtle
      difference due to the use of lruvecs.
      
      The impact of the accidental fix is to make it harder for kswapd to force
      scan small targets by taking zone->all_unreclaimable into account. This
      patch is the closest equivalent available based on what is backported.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4d01a2e3
    • mm: test PageSwapBacked in lumpy reclaim · d2b02236
      Hugh Dickins authored
      commit 043bcbe5 upstream.
      
      Stable note: Not tracked in Bugzilla. There were reports of shared
      	mapped pages being unfairly reclaimed in comparison to older kernels.
      	This is being addressed over time. Even though the subject
      	refers to lumpy reclaim, it impacts compaction as well.
      
      Lumpy reclaim does well to stop at a PageAnon when there's no swap, but
      better is to stop at any PageSwapBacked, which includes shmem/tmpfs too.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      d2b02236
    • mm/vmscan.c: consider swap space when deciding whether to continue reclaim · 503e973c
      Minchan Kim authored
      commit 86cfd3a4 upstream.
      
      Stable note: Not tracked in Bugzilla. This patch reduces kswapd CPU
      	usage on swapless systems with high anonymous memory usage.
      
      It's pointless to continue reclaiming when we have no swap space and lots
      of anon pages in the inactive list.
      
      Without this patch, it is possible when swap is disabled to continue
      trying to reclaim when there are only anonymous pages in the system even
      though that will not make any progress.
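
      A hedged, simplified sketch of the decision (plain parameters rather
      than the kernel's scan_control; not the actual code): anon pages only
      count as reclaimable when there is swap space left to put them in.

      	/* Simplified illustration: anon pages are only reclaim candidates
      	 * if there is swap space left to put them in. */
      	static int should_continue_reclaim(long inactive_file, long inactive_anon,
      					   long free_swap_pages, long pages_to_reclaim)
      	{
      		long reclaimable = inactive_file;

      		if (free_swap_pages > 0)
      			reclaimable += inactive_anon;

      		return reclaimable > pages_to_reclaim;
      	}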
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      503e973c
    • vmscan: activate executable pages after first usage · 4391b5f4
      Konstantin Khlebnikov authored
      commit c909e993 upstream.
      
      Stable note: Not tracked in Bugzilla. There were reports of shared
      	mapped pages being unfairly reclaimed in comparison to older kernels.
      	This is being addressed over time.
      
      Logic added in commit 8cab4754 ("vmscan: make mapped executable pages
      the first class citizen") was noticeably weakened in commit
      64574746 ("vmscan: detect mapped file pages used only once").
      
      Currently these pages can become "first class citizens" only after a
      second usage.  After this patch, page_check_references() will activate
      them after the first usage, and executable code gets a yet better chance
      to stay in memory.
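
      A simplified sketch of the idea (illustrative names, not the kernel's
      page_check_references()): an executable mapped page is activated on its
      first observed reference instead of waiting for a second one.

      	#define VM_EXEC_FLAG  0x1	/* stand-in for the kernel's VM_EXEC */

      	enum pageref { PAGEREF_RECLAIM, PAGEREF_KEEP, PAGEREF_ACTIVATE };

      	/* Simplified: referenced_ptes counts ptes that referenced the page. */
      	static enum pageref check_references(int referenced_ptes, unsigned long vm_flags)
      	{
      		if (referenced_ptes) {
      			if (vm_flags & VM_EXEC_FLAG)
      				return PAGEREF_ACTIVATE;	/* executable: first use is enough */
      			return PAGEREF_KEEP;			/* otherwise wait for a second use */
      		}
      		return PAGEREF_RECLAIM;
      	}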
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4391b5f4
    • vmscan: promote shared file mapped pages · 03722816
      Konstantin Khlebnikov authored
      commit 34dbc67a upstream.
      
      Stable note: Not tracked in Bugzilla. There were reports of shared
      	mapped pages being unfairly reclaimed in comparison to older kernels.
      	This is being addressed over time. The specific workload being
      	addressed here is described in paragraph four and, while paragraph
      	five says it did not help performance as such, it made a difference
      	to major page faults. I'm aware of at least one bug for a large
      	vendor that was due to increased major faults.
      
      Commit 64574746 ("vmscan: detect mapped file pages used only once")
      greatly decreases the lifetime of mapped file pages that are used only
      once.  Unfortunately it also decreases the lifetime of all shared mapped
      file pages, because after commit bf3f3bc5 ("mm: don't mark_page_accessed
      in fault path") the page-fault handler does not mark the page active or
      even referenced.
      
      Thus page_check_references() activates a file page only if it was used
      twice while it sat on the inactive list, while it activates anon pages
      after the first access.  The inactive list can be small enough that the
      reclaimer accidentally throws away a widely used page if it was not used
      twice within a short period.
      
      After this patch, page_check_references() also activates a mapped file
      page on its first inactive list scan if the page is already used multiple
      times via several ptes.
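
      Again as a hedged sketch (illustrative, not the actual kernel code): a
      file page referenced through more than one pte is promoted on the first
      scan instead of needing two references.

      	/* Simplified illustration, not the kernel's page_check_references():
      	 * a mapped file page referenced through several ptes is treated as
      	 * widely shared and activated on its first pass through the list. */
      	static int activate_file_page(int referenced_ptes, int seen_before)
      	{
      		if (referenced_ptes > 1)
      			return 1;			/* shared mapping: promote now */
      		return referenced_ptes && seen_before;	/* otherwise require two uses */
      	}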
      
      I found this while trying to fix a degradation in rhel6 (~2.6.32)
      relative to rhel5 (~2.6.18).  The setup is a complete mess with >100
      web/mail/spam/ftp containers; they share all their files but there are a
      lot of anonymous pages: ~500mb of shared file mapped memory and 15-20Gb
      of non-shared anonymous memory.  In this situation major page faults are
      very costly, because all containers share the same page.  In my load the
      kernel created disproportionate pressure on the file memory compared
      with the anonymous memory; they only evened out if I raised swappiness
      up to 150 =)
      
      These patches did not actually help my problem much, but I saw a
      noticeable (10-20 times) reduction in the count and average time of
      major page faults in file-mapped areas.
      
      Actually both patches are fixes for commit v2.6.33-5448-g64574746: it
      was aimed at one scenario (singly used pages), but it breaks the logic
      in other scenarios (shared and/or executable pages).
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Acked-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      03722816
    • mm: vmscan: check if reclaim should really abort even if compaction_ready() is true for one zone · 9cad5d6a
      Mel Gorman authored
      commit 0cee34fd upstream.
      
      Stable note: Not tracked on Bugzilla. THP and compaction were found to
      	aggressively reclaim pages and stall systems under different
      	situations; this was addressed piecemeal over time.
      
      If compaction can proceed for a given zone, shrink_zones() does not
      reclaim any more pages from it.  After commit [e0c23279: vmscan: abort
      reclaim/compaction if compaction can proceed], do_try_to_free_pages()
      tries to finish as soon as possible once one zone can compact.
      
      This was intended to prevent slabs being shrunk unnecessarily but there
      are side-effects.  One is that a small zone that is ready for compaction
      will abort reclaim even if the chances of successfully allocating a THP
      from that zone is small.  It also means that reclaim can return too early
      even though sc->nr_to_reclaim pages were not reclaimed.
      
      This partially reverts the commit until it is proven that slabs are really
      being shrunk unnecessarily but preserves the check to return 1 to avoid
      OOM if reclaim was aborted prematurely.
      
      [aarcange@redhat.com: This patch replaces a revert from Andrea]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9cad5d6a
    • mm: vmscan: do not OOM if aborting reclaim to start compaction · da0dc52b
      Mel Gorman authored
      commit 7335084d upstream.
      
      Stable note: Not tracked in Bugzilla. This patch makes later patches
      	easier to apply but otherwise has little to justify it. The
      	problem it fixes was never observed but the source of the
      	theoretical problem did not exist for very long.
      
      During direct reclaim it is possible that reclaim will be aborted so that
      compaction can be attempted to satisfy a high-order allocation.  If this
      decision is made before any pages are reclaimed, it is possible that 0 is
      returned to the page allocator potentially triggering an OOM.  This has
      not been observed but it is a possibility so this patch addresses it.
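
      A minimal sketch of the fix's effect on the value handed back to the
      page allocator (simplified, not the actual do_try_to_free_pages() code):

      	/* Simplified: what the reclaim path conceptually returns. */
      	static unsigned long reclaim_result(unsigned long nr_reclaimed,
      					    int aborted_for_compaction)
      	{
      		if (nr_reclaimed)
      			return nr_reclaimed;

      		/* Aborted before reclaiming anything: claim minimal progress so
      		 * the caller tries compaction rather than triggering the OOM
      		 * killer. */
      		if (aborted_for_compaction)
      			return 1;

      		return 0;
      	}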
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      da0dc52b
    • mm: vmscan: when reclaiming for compaction, ensure there are sufficient free pages available · d50462a3
      Mel Gorman authored
      commit fe4b1b24 upstream.
      
      Stable note: Not tracked on Bugzilla. THP and compaction were found to
      	aggressively reclaim pages and stall systems under different
      	situations; this was addressed piecemeal over time. This patch
      	addresses a problem where the fix regressed THP allocation
      	success rates.
      
      In commit e0887c19 ("vmscan: limit direct reclaim for higher order
      allocations"), Rik noted that reclaim was too aggressive when THP was
      enabled.  In his initial patch he used the number of free pages to decide
      if reclaim should abort for compaction.  My feedback was that reclaim and
      compaction should be using the same logic when deciding if reclaim should
      be aborted.
      
      Unfortunately, this had the effect of reducing THP success rates when the
      workload included something like streaming reads that continually
      allocated pages.  The window during which compaction could run and return
      a THP was too small.
      
      This patch combines Rik's two patches.  compaction_suitable() is still
      used to decide if reclaim should be aborted to allow compaction to go
      ahead.  However, it will also ensure that there is a reasonable buffer of
      free pages available.  This improves upon the THP allocation success rates
      but bounds the number of pages that are freed for compaction.
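
      A hedged sketch of the stop condition (plain parameters, not the
      kernel's watermark helpers): reclaim only hands over to compaction once
      free pages exceed the watermark plus a buffer scaled to the allocation
      order.

      	/* Simplified: stop reclaiming for compaction only when there is a
      	 * cushion of free pages (watermark plus roughly 2 << order pages),
      	 * so compaction has room to work and a chance to succeed. */
      	static int enough_free_for_compaction(unsigned long free_pages,
      					      unsigned long high_watermark,
      					      unsigned int order)
      	{
      		unsigned long target = high_watermark + (2UL << order);

      		return free_pages >= target;
      	}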
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel<riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d50462a3
    • mm: compaction: introduce sync-light migration for use by compaction · f869774c
      Mel Gorman authored
      commit a6bc32b8 upstream.
      
      Stable note: Not tracked in Bugzilla. This was part of a series that
      	reduced interactivity stalls experienced when THP was enabled.
      	These stalls were particularly noticeable when copying data
      	to a USB stick but the experiences for users varied a lot.
      
      This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
      mode that avoids writing back pages to backing storage.  Async compaction
      maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
      For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
      used.
      
      This avoids sync compaction stalling for an excessive length of time,
      particularly when copying files to a USB stick where there might be a
      large number of dirty pages backed by a filesystem that does not support
      ->writepages.
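
      The three modes map roughly as follows (a paraphrase of the enum this
      patch introduces; the comments are a summary, not the upstream text):

      	enum migrate_mode {
      		MIGRATE_ASYNC,		/* never block; used by async compaction       */
      		MIGRATE_SYNC_LIGHT,	/* may block briefly, but never calls          */
      					/* ->writepage; used by sync compaction        */
      		MIGRATE_SYNC,		/* may block and write back; used by other     */
      					/* migrate_pages callers such as memory hotplug */
      	};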
      
      [aarcange@redhat.com: This patch is heavily based on Andrea's work]
      [akpm@linux-foundation.org: fix fs/nfs/write.c build]
      [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f869774c
    • kswapd: assign new_order and new_classzone_idx after wakeup in sleeping · 9203b3fa
      Alex Shi authored
      commit f0dfcde0 upstream.
      
      Stable note: Fixes https://bugzilla.redhat.com/show_bug.cgi?id=712019.  This
      	patch reduces kswapd CPU usage.
      
      There are two places where kswapd reads pgdat: one on return from a
      successful balance, the other when it is woken from sleep.  The new_order
      and new_classzone_idx represent the requested balance order and
      classzone_idx.

      But currently new_order and new_classzone_idx are not assigned after
      kswapd_try_to_sleep(), which causes a bug in the following scenario.
      
      1: after a successful balance, kswapd goes to sleep, and new_order = 0;
         new_classzone_idx = __MAX_NR_ZONES - 1;
      
      2: kswapd is woken up with order = 3 and classzone_idx = ZONE_NORMAL

      3: while balance_pgdat() is running, a new balance wakeup happens with
         order = 5 and classzone_idx = ZONE_NORMAL

      4: the first wakeup (order = 3) finishes successfully and returns order = 3,
         but new_order is still 0, so this balancing is treated as a failed
         balance and the second, tighter balancing is missed.
      
      So, to avoid the above problem, the new_order and new_classzone_idx need
      to be assigned for later successful comparison.
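
      A simplified sketch of the bookkeeping this fixes (the structure and
      helper here are illustrative stand-ins, not the kernel's kswapd loop):

      	struct pgdat_req { int kswapd_max_order; int classzone_idx; };

      	/* Simplified shape of the decision in kswapd's main loop. */
      	static void kswapd_pick_next(struct pgdat_req *pgdat,
      				     int *order, int *classzone_idx,
      				     int *new_order, int *new_classzone_idx)
      	{
      		if (*order < *new_order || *classzone_idx > *new_classzone_idx) {
      			/* A tighter request arrived during balancing: retry it. */
      			*order = *new_order;
      			*classzone_idx = *new_classzone_idx;
      		} else {
      			/* Sleep, then pick up whatever request woke us... */
      			*order = pgdat->kswapd_max_order;
      			*classzone_idx = pgdat->classzone_idx;
      			/* ...and (the fix) update the comparison values too, so
      			 * the next balance is not misjudged as a failure. */
      			*new_order = *order;
      			*new_classzone_idx = *classzone_idx;
      		}
      	}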
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Tested-by: Pádraig Brady <P@draigBrady.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9203b3fa
    • kswapd: avoid unnecessary rebalance after an unsuccessful balancing · 5d62e5ca
      Alex Shi authored
      commit d2ebd0f6 upstream.
      
      Stable note: Fixes https://bugzilla.redhat.com/show_bug.cgi?id=712019.  This
      	patch reduces kswapd CPU usage.
      
      In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat
      when reclaiming successfully"), Mel Gorman said it is better for kswapd to
      sleep after an unsuccessful balancing if a tighter reclaim request is
      pending in the balancing.  But in the following scenario, kswapd does
      something that does not match our expectation.  This patch fixes the issue.
      
      1, Read pgdat request A (classzone_idx, order = 3)
      2, balance_pgdat()
      3, During the balance, a new pgdat request B (classzone_idx, order = 5) is placed
      4, balance_pgdat() returns, but it failed since the returned order = 0
      5, The pgdat of request A is handed to balance_pgdat() and balancing is done
         again, while the expected behavior is that kswapd should try to sleep.
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Tested-by: Pádraig Brady <P@draigBrady.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5d62e5ca
    • mm: compaction: make isolate_lru_page() filter-aware again · a7e32d7a
      Mel Gorman authored
      commit c8244935 upstream.
      
      Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging
      	information by reducing LRU list churning had the side-effect of
      	reducing THP allocation success rates. This was part of a series
      	to restore the success rates while preserving the reclaim fix.
      
      Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
      noted that compaction does not migrate dirty or writeback pages and that
      it was meaningless to pick the page and re-add it to the LRU list.  This
      had to be partially reverted because some dirty pages can be migrated by
      compaction without blocking.
      
      This patch updates "mm: compaction: make isolate_lru_page" by skipping
      over pages that migration has no possibility of migrating to minimise LRU
      disruption.
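
      A hedged sketch of the isolation filter (boolean parameters standing in
      for page flags; not the actual isolate_lru_page() code): in async mode,
      pages that could only be migrated by blocking are skipped at isolation
      time.

      	/* Simplified: decide whether async compaction should bother isolating
      	 * this page, given that it will refuse to block during migration. */
      	static int worth_isolating_async(int page_dirty, int page_writeback,
      					 int can_migrate_without_blocking)
      	{
      		if (page_writeback)
      			return 0;	/* would have to wait for IO: skip */
      		if (page_dirty && !can_migrate_without_blocking)
      			return 0;	/* would have to write the page out: skip */
      		return 1;
      	}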
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel<riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a7e32d7a
    • mm: page allocator: do not call direct reclaim for THP allocations while compaction is deferred · c17a3665
      Mel Gorman authored
      commit 66199712 upstream.
      
      Stable note: Not tracked in Bugzilla. This was part of a series that
      	reduced interactivity stalls experienced when THP was enabled.
      
      If compaction is deferred, direct reclaim is used to try to free enough
      pages for the allocation to succeed.  For small high-orders, this has a
      reasonable chance of success.  However, if the caller has specified
      __GFP_NO_KSWAPD to limit the disruption to the system, it makes more sense
      to fail the allocation rather than stall the caller in direct reclaim.
      This patch skips direct reclaim if compaction is deferred and the caller
      specifies __GFP_NO_KSWAPD.
      
      Async compaction only considers a subset of pages so it is possible for
      compaction to be deferred prematurely and not enter direct reclaim even in
      cases where it should.  To compensate for this, this patch also defers
      compaction only if sync compaction failed.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Rik van Riel<riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c17a3665
    • mm: compaction: determine if dirty pages can be migrated without blocking within ->migratepage · 397d9c50
      Mel Gorman authored
      commit b969c4ab upstream.
      
      Stable note: Not tracked in Bugzilla. A fix aimed at preserving page
      	aging information by reducing LRU list churning had the side-effect
      	of reducing THP allocation success rates. This was part of a series
      	to restore the success rates while preserving the reclaim fix.
      
      Asynchronous compaction is used when allocating transparent hugepages to
      avoid blocking for long periods of time.  Due to reports of stalling,
      there was a debate on disabling synchronous compaction but this severely
      impacted allocation success rates.  Part of the reason was that many dirty
      pages are skipped in asynchronous compaction by the following check;
      
      	if (PageDirty(page) && !sync &&
      		mapping->a_ops->migratepage != migrate_page)
      			rc = -EBUSY;
      
      This skips over all mapping aops using buffer_migrate_page() even though
      it is possible to migrate some of these pages without blocking.  This
      patch updates the ->migratepage callback with a "sync" parameter.  It is
      the responsibility of the callback to fail gracefully if migration would
      block.
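
      A hedged sketch of a callback honouring the new contract (illustrative
      shape only, not a real filesystem's ->migratepage): without sync it must
      fail gracefully, e.g. with -EBUSY, rather than block.

      	#include <errno.h>

      	/* Illustrative: the callback itself decides whether migrating this
      	 * page would block, and bails out if the caller asked for async. */
      	static int example_migratepage(int sync, int buffers_busy, int needs_writeout)
      	{
      		if (!sync && (buffers_busy || needs_writeout))
      			return -EBUSY;	/* asynchronous caller: fail gracefully */

      		/* Blocking path: wait for buffers / write the page, then move it. */
      		return 0;
      	}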
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      397d9c50
    • mm: compaction: allow compaction to isolate dirty pages · ec46a9e8
      Mel Gorman authored
      commit a77ebd33 upstream.
      
      Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging
      	information by reducing LRU list churning had the side-effect of
      	reducing THP allocation success rates. This was part of a series
      	to restore the success rates while preserving the reclaim fix.
      
      Short summary: There are severe stalls when a USB stick using VFAT is
      used with THP enabled that are reduced by this series.  If you are
      experiencing this problem, please test and report back and considering I
      have seen complaints from openSUSE and Fedora users on this as well as a
      few private mails, I'm guessing it's a widespread issue.  This is a new
      type of USB-related stall because it is due to synchronous compaction
      writing, whereas in the past the big problem was dirty pages reaching
      the end of the LRU and being written by reclaim.
      
      Am cc'ing Andrew this time and this series would replace
      mm-do-not-stall-in-synchronous-compaction-for-thp-allocations.patch.
      I'm also cc'ing Dave Jones as he might have merged that patch to Fedora
      for wider testing and ideally it would be reverted and replaced by this
      series.
      
      That said, the later patches could really do with some review.  If this
      series is not the answer then a new direction needs to be discussed
      because as it is, the stalls are unacceptable as the results in this
      leader show.
      
      For testers that try backporting this to 3.1, it won't work because
      there is a non-obvious dependency on not writing back pages in direct
      reclaim so you need those patches too.
      
      Changelog since V5
      o Rebase to 3.2-rc5
      o Tidy up the changelogs a bit
      
      Changelog since V4
      o Added reviewed-bys, credited Andrea properly for sync-light
      o Allow dirty pages without mappings to be considered for migration
      o Bound the number of pages freed for compaction
      o Isolate PageReclaim pages on their own LRU list
      
      This is against 3.2-rc5 and follows on from discussions on "mm: Do
      not stall in synchronous compaction for THP allocations" and "[RFC
      PATCH 0/5] Reduce compaction-related stalls". Initially, the proposed
      patch eliminated stalls due to compaction which sometimes resulted in
      user-visible interactivity problems on browsers by simply never using
      sync compaction. The downside was that THP success allocation rates
      were lower because dirty pages were not being migrated as reported by
      Andrea. His approach to fixing this was nacked on the grounds that
      it reverted fixes from Rik, merged to reduce the amount of pages
      reclaimed, as it severely impacted his workload's performance.
      
      This series attempts to reconcile the requirements of maximising THP
      usage, without stalling in a user-visible fashion due to compaction
      or cheating by reclaiming an excessive number of pages.
      
      Patch 1 partially reverts commit 39deaf85 to allow migration to isolate
      	dirty pages. This is because migration can move some dirty
      	pages without blocking.
      
      Patch 2 notes that the /proc/sys/vm/compact_memory handler is not using
      	synchronous compaction when it should be. This is unrelated
      	to the reported stalls but is worth fixing.
      
      Patch 3 checks if we isolated a compound page during lumpy scan and
      	accounts for it properly. For the most part, this affects
      	tracing so it's unrelated to the stalls but worth fixing.
      
      Patch 4 notes that it is possible to abort reclaim early for compaction
      	and return 0 to the page allocator potentially entering the
      	"may oom" path. This has not been observed in practice but
      	the rest of the series potentially makes it easier to happen.
      
      Patch 5 adds a sync parameter to the migratepage callback and gives
      	the callback responsibility for migrating the page without
      	blocking if sync==false. For example, fallback_migrate_page
      	will not call writepage if sync==false. This increases the
      	number of pages that can be handled by asynchronous compaction
      	thereby reducing stalls.
      
      Patch 6 restores filter-awareness to isolate_lru_page for migration.
      	In practice, it means that pages under writeback and pages
      	without a ->migratepage callback will not be isolated
      	for migration.
      
      Patch 7 avoids calling direct reclaim if compaction is deferred but
      	makes sure that compaction is only deferred if sync
      	compaction was used.
      
      Patch 8 introduces a sync-light migration mechanism that sync compaction
      	uses. The objective is to allow some stalls but to not call
      	->writepage which can lead to significant user-visible stalls.
      
      Patch 9 notes that while we want to abort reclaim ASAP to allow
      	compaction to go ahead, we leave a very small window of
      	opportunity for compaction to run. This patch allows more pages
      	to be freed by reclaim but bounds the number to a reasonable
      	level based on the high watermark on each zone.
      
      Patch 10 allows slabs to be shrunk even after compaction_ready() is
      	true for one zone. This is to avoid a problem whereby a single
      	small zone can abort reclaim even though no pages have been
      	reclaimed and no suitably large zone is in a usable state.
      
      Patch 11 fixes a problem with the rate of page scanning. As reclaim is
      	rarely stalling on pages under writeback it means that scan
      	rates are very high. This is particularly true for direct
      	reclaim which is not calling writepage. The vmstat figures
      	implied that much of this was busy work with PageReclaim pages
      	marked for immediate reclaim. This patch is a prototype that
      	moves these pages to their own LRU list.
      
      This has been tested and other than 2 USB keys getting trashed,
      nothing horrible fell out. That said, I am a bit unhappy with the
      rescue logic in patch 11 but did not find a better way around it. It
      does significantly reduce scan rates and System CPU time indicating
      it is the right direction to take.
      
      What is of critical importance is that stalls due to compaction
      are massively reduced even though sync compaction was still
      allowed. Testing from people complaining about stalls copying to USBs
      with THP enabled are particularly welcome.
      
      The following tests all involve THP usage and USB keys in some
      way. Each test follows this type of pattern
      
      1. Read from some fast storage, be it raw device or file. Each time
         the copy finishes, start again until the test ends
      2. Write a large file to a filesystem on a USB stick. Each time the copy
         finishes, start again until the test ends
      3. When memory is low, start an alloc process that creates a mapping
         the size of physical memory to stress THP allocation. This is the
         "real" part of the test and the part that is meant to trigger
         stalls when THP is enabled. Copying continues in the background.
      4. Record the CPU usage and time to execute of the alloc process
      5. Record the number of THP allocs and fallbacks as well as the number of THP
         pages in use at the end of the test just before alloc exited
      6. Run the test 5 times to get an idea of variability
      7. Between each run, sync is run and caches dropped and the test
         waits until nr_dirty is a small number to avoid interference
         or caching between iterations that would skew the figures.
      
      The individual tests were then
      
      writebackCPDeviceBasevfat
      	Disable THP, read from a raw device (sda), vfat on USB stick
      writebackCPDeviceBaseext4
      	Disable THP, read from a raw device (sda), ext4 on USB stick
      writebackCPDevicevfat
      	THP enabled, read from a raw device (sda), vfat on USB stick
      writebackCPDeviceext4
      	THP enabled, read from a raw device (sda), ext4 on USB stick
      writebackCPFilevfat
      	THP enabled, read from a file on fast storage and USB, both vfat
      writebackCPFileext4
      	THP enabled, read from a file on fast storage and USB, both ext4
      
      The kernels tested were
      
      3.1		3.1
      vanilla		3.2-rc5
      freemore	Patches 1-10
      immediate	Patches 1-11
      andrea		The 8 patches Andrea posted as a basis of comparison
      
      The results are very long unfortunately. I'll start with the case
      where we are not using THP at all
      
      writebackCPDeviceBasevfat
                         3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
      System Time         1.28 (    0.00%)   54.49 (-4143.46%)   48.63 (-3687.69%)    4.69 ( -265.11%)   51.88 (-3940.81%)
      +/-                 0.06 (    0.00%)    2.45 (-4305.55%)    4.75 (-8430.57%)    7.46 (-13282.76%)    4.76 (-8440.70%)
      User Time           0.09 (    0.00%)    0.05 (   40.91%)    0.06 (   29.55%)    0.07 (   15.91%)    0.06 (   27.27%)
      +/-                 0.02 (    0.00%)    0.01 (   45.39%)    0.02 (   25.07%)    0.00 (   77.06%)    0.01 (   52.24%)
      Elapsed Time      110.27 (    0.00%)   56.38 (   48.87%)   49.95 (   54.70%)   11.77 (   89.33%)   53.43 (   51.54%)
      +/-                 7.33 (    0.00%)    3.77 (   48.61%)    4.94 (   32.63%)    6.71 (    8.50%)    4.76 (   35.03%)
      THP Active          0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      +/-                 0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      Fault Alloc         0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      +/-                 0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      Fault Fallback      0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      +/-                 0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      
      The THP figures are obviously all 0 because THP was disabled. The
      main thing to watch is the elapsed times and how they compare to
      times when THP is enabled later. It's also important to note that
      elapsed time is improved by this series as System CPU time is much
      reduced.
      
      writebackCPDevicevfat
      
                         3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
      System Time         1.22 (    0.00%)   13.89 (-1040.72%)   46.40 (-3709.20%)    4.44 ( -264.37%)   47.37 (-3789.33%)
      +/-                 0.06 (    0.00%)   22.82 (-37635.56%)    3.84 (-6249.44%)    6.48 (-10618.92%)    6.60 (-10818.53%)
      User Time           0.06 (    0.00%)    0.06 (   -6.90%)    0.05 (   17.24%)    0.05 (   13.79%)    0.04 (   31.03%)
      +/-                 0.01 (    0.00%)    0.01 (   33.33%)    0.01 (   33.33%)    0.01 (   39.14%)    0.01 (   25.46%)
      Elapsed Time     10445.54 (    0.00%) 2249.92 (   78.46%)   70.06 (   99.33%)   16.59 (   99.84%)  472.43 (   95.48%)
      +/-               643.98 (    0.00%)  811.62 (  -26.03%)   10.02 (   98.44%)    7.03 (   98.91%)   59.99 (   90.68%)
      THP Active         15.60 (    0.00%)   35.20 (  225.64%)   65.00 (  416.67%)   70.80 (  453.85%)   62.20 (  398.72%)
      +/-                18.48 (    0.00%)   51.29 (  277.59%)   15.99 (   86.52%)   37.91 (  205.18%)   22.02 (  119.18%)
      Fault Alloc       121.80 (    0.00%)   76.60 (   62.89%)  155.40 (  127.59%)  181.20 (  148.77%)  286.60 (  235.30%)
      +/-                73.51 (    0.00%)   61.11 (   83.12%)   34.89 (   47.46%)   31.88 (   43.36%)   68.13 (   92.68%)
      Fault Fallback    881.20 (    0.00%)  926.60 (   -5.15%)  847.60 (    3.81%)  822.00 (    6.72%)  716.60 (   18.68%)
      +/-                73.51 (    0.00%)   61.26 (   16.67%)   34.89 (   52.54%)   31.65 (   56.94%)   67.75 (    7.84%)
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)       3540.88   1945.37    716.04     64.97   1937.03
      Total Elapsed Time (seconds)              52417.33  11425.90    501.02    230.95   2520.28
      
      The first thing to note is the "Elapsed Time" for the vanilla kernels
      of 2249 seconds versus 56 with THP disabled which might explain the
      reports of USB stalls with THP enabled. Applying the patches brings
      performance in line with THP-disabled performance while isolating
      pages for immediate reclaim from the LRU cuts down System CPU time.
      
      The "Fault Alloc" success rate figures are also improved. The vanilla
      kernel only managed to allocate 76.6 pages on average over the course
      of 5 iterations whereas applying the series allocated 181.20 on
      average, albeit well within variance. It's worth noting that
      applying the series at least decreases the amount of variance, which
      implies an improvement.
      
      Andrea's series had a higher success rate for THP allocations but
      at a severe cost to elapsed time which is still better than vanilla
      but still much worse than disabling THP altogether. One can bring my
      series close to Andrea's by removing this check
      
              /*
               * If compaction is deferred for high-order allocations, it is because
               * sync compaction recently failed. In this is the case and the caller
               * has requested the system not be heavily disrupted, fail the
               * allocation now instead of entering direct reclaim
               */
              if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
                      goto nopage;
      
      I didn't include a patch that removed the above check because hurting
      overall performance to improve the THP figure is not what the average
      user wants. It's something to consider though if someone really wants
      to maximise THP usage no matter what it does to the workload initially.
      
      This is summary of vmstat figures from the same test.
      
                                             3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
      Page Ins                                  3257266139  1111844061    17263623    10901575   161423219
      Page Outs                                   81054922    30364312     3626530     3657687     8753730
      Swap Ins                                        3294        2851        6560        4964        4592
      Swap Outs                                     390073      528094      620197      790912      698285
      Direct pages scanned                      1077581700  3024951463  1764930052   115140570  5901188831
      Kswapd pages scanned                        34826043     7112868     2131265     1686942     1893966
      Kswapd pages reclaimed                      28950067     4911036     1246044      966475     1497726
      Direct pages reclaimed                     805148398   280167837     3623473     2215044    40809360
      Kswapd efficiency                                83%         69%         58%         57%         79%
      Kswapd velocity                              664.399     622.521    4253.852    7304.360     751.490
      Direct efficiency                                74%          9%          0%          1%          0%
      Direct velocity                            20557.737  264745.137 3522673.849  498551.938 2341481.435
      Percentage direct scans                          96%         99%         99%         98%         99%
      Page writes by reclaim                        722646      529174      620319      791018      699198
      Page writes file                              332573        1080         122         106         913
      Page writes anon                              390073      528094      620197      790912      698285
      Page reclaim immediate                             0  2552514720  1635858848   111281140  5478375032
      Page rescued immediate                             0           0           0       87848           0
      Slabs scanned                                  23552       23552        9216        8192        9216
      Direct inode steals                              231           0           0           0           0
      Kswapd inode steals                                0           0           0           0           0
      Kswapd skipped wait                            28076         786           0          61           6
      THP fault alloc                                  609         383         753         906        1433
      THP collapse alloc                                12           6           0           0           6
      THP splits                                       536         211         456         593        1136
      THP fault fallback                              4406        4633        4263        4110        3583
      THP collapse fail                                120         127           0           0           4
      Compaction stalls                               1810         728         623         779        3200
      Compaction success                               196          53          60          80         123
      Compaction failures                             1614         675         563         699        3077
      Compaction pages moved                        193158       53545      243185      333457      226688
      Compaction move failure                         9952        9396       16424       23676       45070
      
      The main things to look at are
      
      1. Page In/out figures are much reduced by the series.
      
      2. Direct page scanning is incredibly high (264745.137 pages scanned
         per second on the vanilla kernel) but isolating PageReclaim pages
         on their own list reduces the number of pages scanned significantly.
      
      3. The fact that "Page rescued immediate" is a positive number implies
         that we sometimes race removing pages from the LRU_IMMEDIATE list
         that need to be put back on a normal LRU but it happens only for
         0.07% of the pages marked for immediate reclaim.
      
      writebackCPDeviceext4
                         3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
      System Time         1.51 (    0.00%)    1.77 (  -17.66%)    1.46 (    2.92%)    1.15 (   23.77%)    1.89 (  -25.63%)
      +/-                 0.27 (    0.00%)    0.67 ( -148.52%)    0.33 (  -22.76%)    0.30 (  -11.15%)    0.19 (   30.16%)
      User Time           0.03 (    0.00%)    0.04 (  -37.50%)    0.05 (  -62.50%)    0.07 ( -112.50%)    0.04 (  -18.75%)
      +/-                 0.01 (    0.00%)    0.02 ( -146.64%)    0.02 (  -97.91%)    0.02 (  -75.59%)    0.02 (  -63.30%)
      Elapsed Time      124.93 (    0.00%)  114.49 (    8.36%)   96.77 (   22.55%)   27.48 (   78.00%)  205.70 (  -64.65%)
      +/-                20.20 (    0.00%)   74.39 ( -268.34%)   59.88 ( -196.48%)    7.72 (   61.79%)   25.03 (  -23.95%)
      THP Active        161.80 (    0.00%)   83.60 (   51.67%)  141.20 (   87.27%)   84.60 (   52.29%)   82.60 (   51.05%)
      +/-                71.95 (    0.00%)   43.80 (   60.88%)   26.91 (   37.40%)   59.02 (   82.03%)   52.13 (   72.45%)
      Fault Alloc       471.40 (    0.00%)  228.60 (   48.49%)  282.20 (   59.86%)  225.20 (   47.77%)  388.40 (   82.39%)
      +/-                88.07 (    0.00%)   87.42 (   99.26%)   73.79 (   83.78%)  109.62 (  124.47%)   82.62 (   93.81%)
      Fault Fallback    531.60 (    0.00%)  774.60 (  -45.71%)  720.80 (  -35.59%)  777.80 (  -46.31%)  614.80 (  -15.65%)
      +/-                88.07 (    0.00%)   87.26 (    0.92%)   73.79 (   16.22%)  109.62 (  -24.47%)   82.29 (    6.56%)
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)         50.22     33.76     30.65     24.14    128.45
      Total Elapsed Time (seconds)               1113.73   1132.19   1029.45    759.49   1707.26
      
      Similar test but the USB stick is using ext4 instead of vfat. As
      ext4 does not use writepage for migration, the large stalls due to
      compaction when THP is enabled are not observed. Still, isolating
      PageReclaim pages on their own list helped completion time largely
      by reducing the number of pages scanned by direct reclaim although
      time spent in congestion_wait could also be a factor.
      
      Again, Andrea's series had far higher success rates for THP allocation
      at the cost of elapsed time. I didn't look too closely but a quick
      look at the vmstat figures tells me kswapd reclaimed 8 times more pages
      than the patch series and direct reclaim reclaimed roughly three times
      as many pages. It follows that if memory is aggressively reclaimed,
      there will be more available for THP.
      
      writebackCPFilevfat
                         3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
      System Time         1.76 (    0.00%)   29.10 (-1555.52%)   46.01 (-2517.18%)    4.79 ( -172.35%)   54.89 (-3022.53%)
      +/-                 0.14 (    0.00%)   25.61 (-18185.17%)    2.15 (-1434.83%)    6.60 (-4610.03%)    9.75 (-6863.76%)
      User Time           0.05 (    0.00%)    0.07 (  -45.83%)    0.05 (   -4.17%)    0.06 (  -29.17%)    0.06 (  -16.67%)
      +/-                 0.02 (    0.00%)    0.02 (   20.11%)    0.02 (   -3.14%)    0.01 (   31.58%)    0.01 (   47.41%)
      Elapsed Time     22520.79 (    0.00%) 1082.85 (   95.19%)   73.30 (   99.67%)   32.43 (   99.86%)  291.84 (  98.70%)
      +/-              7277.23 (    0.00%)  706.29 (   90.29%)   19.05 (   99.74%)   17.05 (   99.77%)  125.55 (   98.27%)
      THP Active         83.80 (    0.00%)   12.80 (   15.27%)   15.60 (   18.62%)   13.00 (   15.51%)    0.80 (    0.95%)
      +/-                66.81 (    0.00%)   20.19 (   30.22%)    5.92 (    8.86%)   15.06 (   22.54%)    1.17 (    1.75%)
      Fault Alloc       171.00 (    0.00%)   67.80 (   39.65%)   97.40 (   56.96%)  125.60 (   73.45%)  133.00 (   77.78%)
      +/-                82.91 (    0.00%)   30.69 (   37.02%)   53.91 (   65.02%)   55.05 (   66.40%)   21.19 (   25.56%)
      Fault Fallback    832.00 (    0.00%)  935.20 (  -12.40%)  906.00 (   -8.89%)  877.40 (   -5.46%)  870.20 (   -4.59%)
      +/-                82.91 (    0.00%)   30.69 (   62.98%)   54.01 (   34.86%)   55.05 (   33.60%)   20.91 (   74.78%)
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)       7229.81    928.42    704.52     80.68   1330.76
      Total Elapsed Time (seconds)             112849.04   5618.69    571.11    360.54   1664.28
      
      In this case, the test is reading/writing only from filesystems but as
      it's vfat, it's slow due to calling writepage during compaction. Little
      to observe really - the time to complete the test goes way down
      with the series applied and THP allocation success rates go up in
      comparison to 3.2-rc5.  The success rates are lower than 3.1.0 but
      the elapsed time for that kernel is abysmal so it is not really a
      sensible comparison.
      
      As before, Andrea's series allocates more THPs at the cost of overall
      performance.
      
      writebackCPFileext4
                         3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
      System Time         1.51 (    0.00%)    1.77 (  -17.66%)    1.46 (    2.92%)    1.15 (   23.77%)    1.89 (  -25.63%)
      +/-                 0.27 (    0.00%)    0.67 ( -148.52%)    0.33 (  -22.76%)    0.30 (  -11.15%)    0.19 (   30.16%)
      User Time           0.03 (    0.00%)    0.04 (  -37.50%)    0.05 (  -62.50%)    0.07 ( -112.50%)    0.04 (  -18.75%)
      +/-                 0.01 (    0.00%)    0.02 ( -146.64%)    0.02 (  -97.91%)    0.02 (  -75.59%)    0.02 (  -63.30%)
      Elapsed Time      124.93 (    0.00%)  114.49 (    8.36%)   96.77 (   22.55%)   27.48 (   78.00%)  205.70 (  -64.65%)
      +/-                20.20 (    0.00%)   74.39 ( -268.34%)   59.88 ( -196.48%)    7.72 (   61.79%)   25.03 (  -23.95%)
      THP Active        161.80 (    0.00%)   83.60 (   51.67%)  141.20 (   87.27%)   84.60 (   52.29%)   82.60 (   51.05%)
      +/-                71.95 (    0.00%)   43.80 (   60.88%)   26.91 (   37.40%)   59.02 (   82.03%)   52.13 (   72.45%)
      Fault Alloc       471.40 (    0.00%)  228.60 (   48.49%)  282.20 (   59.86%)  225.20 (   47.77%)  388.40 (   82.39%)
      +/-                88.07 (    0.00%)   87.42 (   99.26%)   73.79 (   83.78%)  109.62 (  124.47%)   82.62 (   93.81%)
      Fault Fallback    531.60 (    0.00%)  774.60 (  -45.71%)  720.80 (  -35.59%)  777.80 (  -46.31%)  614.80 (  -15.65%)
      +/-                88.07 (    0.00%)   87.26 (    0.92%)   73.79 (   16.22%)  109.62 (  -24.47%)   82.29 (    6.56%)
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)         50.22     33.76     30.65     24.14    128.45
      Total Elapsed Time (seconds)               1113.73   1132.19   1029.45    759.49   1707.26
      
      Same type of story - elapsed times go down. In this case, allocation
      success rates are roughly the same. As before, Andrea's series has higher
      success rates but takes a lot longer.
      
      Overall the series does reduce latencies and while the tests are
      inherently racy as alloc competes with the cp processes, the variability
      was included. The THP allocation rates are not as high as they could
      be but that is because we would have to be more aggressive about
      reclaim and compaction impacting overall performance.
      
      This patch:
      
      Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
      noted that compaction does not migrate dirty or writeback pages and that
      it was meaningless to pick the page and re-add it to the LRU list.
      
      What was missed during review is that asynchronous migration moves dirty
      pages if their ->migratepage callback is migrate_page() because these can
      be moved without blocking.  This potentially impacted hugepage allocation
      success rates by a factor depending on how many dirty pages are in the
      system.
      
      This patch partially reverts 39deaf85 to allow migration to isolate dirty
      pages again.  This increases how much compaction disrupts the LRU but that
      is addressed later in the series.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ec46a9e8
    • Minchan Kim's avatar
      mm: migration: clean up unmap_and_move() · 331fae62
      Minchan Kim authored
      commit 0dabec93 upstream.
      
      Stable note: Not tracked in Bugzilla. This patch makes later patches
      	easier to apply but has no other impact.
      
      unmap_and_move() is one big, messy function.  Clean it up.
      Signed-off-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      331fae62
    • Minchan Kim's avatar
      mm: zone_reclaim: make isolate_lru_page() filter-aware · 5e02dde6
      Minchan Kim authored
      commit f80c0673 upstream.
      
      Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU list
      	leading to poor reclaim decisions which has a variable
      	performance impact.
      
      In the __zone_reclaim case, we don't want to shrink mapped pages.
      Nonetheless, we currently isolate mapped pages and re-add them to the
      head of the LRU.  That is unnecessary CPU overhead and causes LRU churning.
      
      Of course, when we isolate a page it might be mapped, but by the time we
      try to reclaim it the page may no longer be mapped, so it could still be
      reclaimed.  That race is rare, and even when it happens it's no big deal.
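      
      A minimal stand-alone sketch of the filtering idea (hypothetical names,
      not the kernel implementation): when reclaim is not allowed to touch
      mapped pages, reject them at isolation time instead of isolating them
      and putting them straight back.
      
        #include <stdbool.h>
        
        struct fake_page {
        	bool mapped;
        };
        
        /* Modelled on the ISOLATE_UNMAPPED idea this patch introduces. */
        static bool worth_isolating_for_zone_reclaim(const struct fake_page *page,
        					     bool may_touch_mapped)
        {
        	if (!may_touch_mapped && page->mapped)
        		return false;	/* leave it on the LRU: no pointless churn */
        	return true;
        }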
      Signed-off-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5e02dde6
    • Minchan Kim's avatar
      mm: compaction: make isolate_lru_page() filter-aware · 19faec05
      Minchan Kim authored
      commit 39deaf85 upstream.
      
      Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU
      	list leading to poor reclaim decisions which has a variable
      	performance impact.
      
      In async mode, compaction doesn't migrate dirty or writeback pages, so
      it's meaningless to pick such a page and re-add it to the LRU list.
      
      Of course, when we isolate the page in compaction it might be dirty or
      under writeback, but by the time we try to migrate it the page may no
      longer be, so it could still be migrated.  That is very unlikely, though,
      as the isolate and migration cycle is much faster than writeout.
      
      So, this patch reduces CPU overhead and prevents unnecessary LRU churning.
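      
      A rough sketch of the filter being added (illustrative flag value and
      names; the real check sits in __isolate_lru_page() and is driven by the
      isolation mode):
      
        #include <stdbool.h>
        
        #define ISOLATE_CLEAN_FLAG	0x1	/* async caller: clean, non-writeback only */
        
        struct fake_page {
        	bool dirty;
        	bool writeback;
        };
        
        static bool worth_isolating_for_compaction(const struct fake_page *page,
        					   unsigned int mode)
        {
        	if ((mode & ISOLATE_CLEAN_FLAG) && (page->dirty || page->writeback))
        		return false;	/* async compaction cannot migrate it anyway */
        	return true;
        }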
      Signed-off-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      19faec05
    • Minchan Kim's avatar
      mm: change isolate mode from #define to bitwise type · a15a3971
      Minchan Kim authored
      commit 4356f21d upstream.
      
      Stable note: Not tracked in Bugzilla. This patch makes later patches
      	easier to apply but has no other impact.
      
      Change the ISOLATE_XXX macros to a bitwise isolate_mode_t type.  Macros
      are generally not recommended here because they are type-unsafe and make
      debugging harder, as the symbol cannot be passed through to the debugger.
      
      Quote from Johannes:
      " Hmm, it would probably be cleaner to fully convert the isolation mode
      into independent flags.  INACTIVE, ACTIVE, BOTH is currently a
      tri-state among flags, which is a bit ugly."
      
      This patch moves the isolate mode from swap.h to mmzone.h so that it can
      be used by memcontrol.h.
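      
      A compressed sketch of the shape of the conversion (illustrative values
      only, not the exact kernel definitions): untyped #define constants become
      a named bitwise type whose independent flag bits can be combined and
      inspected in a debugger.
      
        /* Before: untyped macros, invisible to the type system and debugger. */
        #define OLD_ISOLATE_INACTIVE	0
        #define OLD_ISOLATE_ACTIVE	1
        #define OLD_ISOLATE_BOTH	2
        
        /* After: a dedicated bitwise type with independent flag bits. */
        typedef unsigned int isolate_mode_t;
        
        #define ISOLATE_INACTIVE	((isolate_mode_t)0x1)	/* isolate inactive pages */
        #define ISOLATE_ACTIVE		((isolate_mode_t)0x2)	/* isolate active pages */
        #define ISOLATE_CLEAN		((isolate_mode_t)0x4)	/* only clean pages */
        #define ISOLATE_UNMAPPED	((isolate_mode_t)0x8)	/* only unmapped pages */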
      Signed-off-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      a15a3971
    • Minchan Kim's avatar
      mm: compaction: trivial clean up in acct_isolated() · f665a680
      Minchan Kim authored
      commit b9e84ac1 upstream.
      
      Stable note: Not tracked in Bugzilla. This patch makes later patches
      	easier to apply but has no other impact.
      
      acct_isolated() in compaction uses page_lru_base_type(), which returns
      only the base type of the LRU list, so it never returns LRU_ACTIVE_ANON
      or LRU_ACTIVE_FILE.  In addition, cc->nr_[anon|file] is used only in
      acct_isolated(), so the fields are not needed in compact_control.
      
      This patch removes those fields from compact_control and makes the role
      of acct_isolated() clear: it counts the number of anon|file pages isolated.
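      
      Conceptually (hypothetical stand-in types; the real code walks
      cc->migratepages and classifies pages with page_lru_base_type()), the
      cleaned-up accounting just counts the isolated pages by anon/file base
      type on the fly:
      
        #include <stddef.h>
        
        struct fake_page {
        	int file_backed;	/* 1 for the file LRU, 0 for the anon LRU */
        };
        
        struct isolated_counts {
        	unsigned long nr_anon;
        	unsigned long nr_file;
        };
        
        static struct isolated_counts count_isolated(const struct fake_page *pages,
        					     size_t n)
        {
        	struct isolated_counts c = { 0, 0 };
        	size_t i;
        
        	for (i = 0; i < n; i++) {
        		if (pages[i].file_backed)
        			c.nr_file++;
        		else
        			c.nr_anon++;
        	}
        	return c;
        }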
      Signed-off-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f665a680
    • Mel Gorman's avatar
      vmscan: abort reclaim/compaction if compaction can proceed · 4682e89d
      Mel Gorman authored
      commit e0c23279 upstream.
      
      Stable note: Not tracked on Bugzilla. THP and compaction was found to
      	aggressively reclaim pages and stall systems under different
      	situations that was addressed piecemeal over time.
      
      If compaction can proceed, shrink_zones() stops doing any work but its
      callers still call shrink_slab() which raises the priority and potentially
      sleeps.  This is unnecessary and wasteful so this patch aborts direct
      reclaim/compaction entirely if compaction can proceed.
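      
      A minimal sketch of the control-flow change (stand-in helpers, not the
      vmscan code): once compaction is viable, the whole reclaim loop bails out
      rather than skipping the zone work but still paying for shrink_slab()
      and a priority bump.
      
        #include <stdbool.h>
        
        /* Stand-in helpers; the real logic lives in shrink_zones()/shrink_slab(). */
        static bool compaction_can_proceed(void) { return true; }
        static void shrink_zones_stub(void)      { }
        static void shrink_slab_stub(void)       { }
        
        static void try_to_free_pages_sketch(void)
        {
        	int priority;
        
        	for (priority = 12; priority >= 0; priority--) {
        		if (compaction_can_proceed())
        			return;		/* abort and let compaction run */
        		shrink_zones_stub();
        		shrink_slab_stub();	/* previously still ran even when aborting */
        	}
        }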
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Acked-by: default avatarJohannes Weiner <jweiner@redhat.com>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4682e89d
    • Rik van Riel's avatar
      vmscan: limit direct reclaim for higher order allocations · 4d472406
      Rik van Riel authored
      commit e0887c19 upstream.
      
      Stable note: Not tracked on Bugzilla. THP and compaction was found to
      	aggressively reclaim pages and stall systems under different
      	situations that was addressed piecemeal over time.  Paragraph
      	3 of this changelog is the motivation for this patch.
      
      When suffering from memory fragmentation due to unfreeable pages, THP page
      faults will repeatedly try to compact memory.  Due to the unfreeable
      pages, compaction fails.
      
      Needless to say, at that point page reclaim also fails to create free
      contiguous 2MB areas.  However, that doesn't stop the current code from
      trying, over and over again, and freeing a minimum of 4MB (2UL <<
      sc->order pages) at every single invocation.
      
      This resulted in my 12GB system having 2-3GB free memory, a corresponding
      amount of used swap and very sluggish response times.
      
      This can be avoided by having the direct reclaim code not reclaim from
      zones that already have plenty of free memory available for compaction.
      
      If compaction still fails due to unmovable memory, doing additional
      reclaim will only hurt the system, not help.
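      
      As a sketch (stand-in parameter for the zone check; PAGE_ALLOC_COSTLY_ORDER
      is the kernel's usual threshold for "costly" allocations), the per-zone
      decision amounts to:
      
        #include <stdbool.h>
        
        #define PAGE_ALLOC_COSTLY_ORDER	3
        
        /*
         * Skip reclaiming from a zone for a costly high-order allocation when
         * the zone already has enough free memory for compaction to work with;
         * more reclaim there only hurts if compaction fails for other reasons
         * (unmovable pages).
         */
        static bool skip_zone_for_reclaim(int order, bool zone_ready_for_compaction)
        {
        	return order > PAGE_ALLOC_COSTLY_ORDER && zone_ready_for_compaction;
        }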
      
      [jweiner@redhat.com: change comment to explain the order check]
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarJohannes Weiner <jweiner@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarJohannes Weiner <jweiner@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4d472406
    • Dave Chinner's avatar
      vmscan: reduce wind up shrinker->nr when shrinker can't do work · 7554e344
      Dave Chinner authored
      commit 3567b59a upstream.
      
      Stable note: Not tracked in Bugzilla. This patch reduces excessive
      	reclaim of slab objects, reducing the amount of information that
      	has to be brought back in from disk. The third and fourth paragraphs
      	of this changelog describe the impact.
      
      When a shrinker returns -1 to shrink_slab() to indicate it cannot do
      any work given the current memory reclaim requirements, it adds the
      entire total_scan count to shrinker->nr. The idea behind this is that
      when the shrinker is next called and can do work, it will do the work
      of the previously aborted shrinker call as well.
      
      However, if a filesystem is doing lots of allocation with GFP_NOFS
      set, then we get many, many more aborts from the shrinkers than we
      do successful calls. The result is that shrinker->nr winds up to
      its maximum permissible value (twice the current cache size) and
      then when the next shrinker call that can do work is issued, it
      has enough scan count built up to free the entire cache twice over.
      
      This manifests itself in the cache going from full to empty in a
      matter of seconds, even when only a small part of the cache is
      needed to be emptied to free sufficient memory.
      
      Under metadata intensive workloads on ext4 and XFS, I'm seeing the
      VFS caches increase memory consumption up to 75% of memory (no page
      cache pressure) over a period of 30-60s, and then the shrinker
      empties them down to zero in the space of 2-3s. This cycle repeats
      over and over again, with the shrinker completely trashing the inode
      and dentry caches every minute or so for as long as the workload continues.
      
      This behaviour was made obvious by the shrink_slab tracepoints added
      earlier in the series, and made worse by the patch that corrected
      the concurrent accounting of shrinker->nr.
      
      To avoid this problem, stop repeated small increments of the total
      scan value from winding shrinker->nr up to a value that can cause
      the entire cache to be freed. We still need to allow it to wind up,
      so use the delta as the "large scan" threshold check - if the delta
      is more than a quarter of the entire cache size, then it is a large
      scan and allowed to cause lots of windup because we are clearly
      needing to free lots of memory.
      
      If it isn't a large scan then limit the total scan to half the size
      of the cache so that windup never increases to consume the whole
      cache. Reducing the total scan limit further does not allow enough
      wind-up to maintain the current levels of performance, whilst a
      higher threshold does not prevent the windup from freeing the entire
      cache under sustained workloads.
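      
      The arithmetic described above, as a stand-alone sketch (names follow the
      changelog: delta is this call's newly computed scan count, max_pass the
      current cache size, total_scan includes the wound-up deferred count):
      
        /*
         * Small deltas are capped so windup can never grow into
         * "free the whole cache" territory; a delta of a quarter of the
         * cache or more is treated as a genuine large scan and allowed
         * to wind up freely.
         */
        static unsigned long clamp_total_scan(unsigned long total_scan,
        				      unsigned long delta,
        				      unsigned long max_pass)
        {
        	if (delta < max_pass / 4 && total_scan > max_pass / 2)
        		total_scan = max_pass / 2;
        	return total_scan;
        }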
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7554e344
    • Dave Chinner's avatar
      vmscan: shrinker->nr updates race and go wrong · 6a5091a0
      Dave Chinner authored
      commit acf92b48 upstream.
      
      Stable note: Not tracked in Bugzilla. This patch reduces excessive
      	reclaim of slab objects reducing the amount of information
      	that has to be brought back in from disk.
      
      shrink_slab() allows shrinkers to be called in parallel so the
      struct shrinker can be updated concurrently. It does not provide any
      exclusion for such updates, so we can get the shrinker->nr value
      increasing or decreasing incorrectly.
      
      As a result, when a shrinker repeatedly returns a value of -1 (e.g.
      a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire,
      sometimes updating with the scan count that wasn't used, sometimes
      losing it altogether. Worse is when a shrinker does work and that
      update is lost due to racy updates, which means the shrinker will do
      the work again!
      
      Fix this by making the total_scan calculations independent of
      shrinker->nr, and making the shrinker->nr updates atomic w.r.t. to
      other updates via cmpxchg loops.
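      
      A sketch of the cmpxchg-style update in portable C11 atomics (the kernel
      uses its own cmpxchg() on shrinker->nr; this is just the shape of the
      loop): only fold the leftover scan count back in if nobody else changed
      the value in the meantime, otherwise retry against the fresh value.
      
        #include <stdatomic.h>
        
        static void add_deferred_scan(_Atomic unsigned long *shrinker_nr,
        			      unsigned long leftover)
        {
        	unsigned long old = atomic_load(shrinker_nr);
        
        	/* On failure, 'old' is refreshed with the current value. */
        	while (!atomic_compare_exchange_weak(shrinker_nr, &old,
        					     old + leftover))
        		;	/* retry */
        }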
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6a5091a0
    • Dave Chinner's avatar
      vmscan: add shrink_slab tracepoints · 5e5b3d2e
      Dave Chinner authored
      commit 09576073 upstream.
      
      Stable note: This patch makes later patches easier to apply but otherwise
              has little to justify it. It is a diagnostic patch that was part
              of a series addressing excessive slab shrinking after GFP_NOFS
              failures. There is detailed information on the series' motivation
              at https://lkml.org/lkml/2011/6/2/42 .
      
      It is impossible to understand what the shrinkers are actually doing
      without instrumenting the code, so add some tracepoints to allow
      insight to be gained.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      5e5b3d2e
    • Shaohua Li's avatar
      vmscan: clear ZONE_CONGESTED for zone with good watermark · 564ea9dd
      Shaohua Li authored
      commit 439423f6 upstream.
      
      Stable note: Not tracked in Bugzilla. kswapd is responsible for clearing
      	ZONE_CONGESTED after it balances a zone and this patch fixes a bug
      	where that was failing to happen. Without this patch, processes
      	can stall in wait_iff_congested unnecessarily. For users, this can
      	look like an interactivity stall but some workloads would see it
      	as sudden drop in throughput.
      
      ZONE_CONGESTED is only cleared in kswapd, but pages can be freed in any
      task.  It's possible ZONE_CONGESTED isn't cleared in some cases:
      
       1. the zone is already balanced when entering balance_pgdat() for
          order-0 because concurrent tasks free memory.  In this case, the later
          check will skip the zone as it's balanced so the flag isn't cleared.
      
       2. high order balance falls back to order-0.  Quote from Mel: At the
          end of balance_pgdat(), kswapd uses the following logic;
      
      	If reclaiming at high order {
      		for each zone {
      			if all_unreclaimable
      				skip
      			if watermark is not met
      				order = 0
      				loop again
      
      			/* watermark is met */
      			clear congested
      		}
      	}
      
          i.e. it clears ZONE_CONGESTED if the zone is balanced.  If not,
          it restarts balancing at order-0.  However, if the higher zones are
          balanced for order-0, kswapd will miss clearing ZONE_CONGESTED as
          that only happens after a zone is shrunk.  This can mean that
          wait_iff_congested() stalls unnecessarily.
      
      This patch makes kswapd clear ZONE_CONGESTED during its initial
      highmem->dma scan for zones that are already balanced.
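      
      A stand-alone sketch of the fix (hypothetical types; the real code checks
      the zone watermark and clears the ZONE_CONGESTED flag): during the
      initial highmem->dma scan, already-balanced zones get their congested
      flag cleared, because the later shrink path that used to clear it will
      skip them.
      
        #include <stdbool.h>
        
        struct fake_zone {
        	bool congested;
        	bool balanced_for_order0;
        };
        
        static void initial_scan_clear_congested(struct fake_zone *zones, int nr_zones)
        {
        	int i;
        
        	/* highmem -> dma order, as in balance_pgdat()'s first pass */
        	for (i = nr_zones - 1; i >= 0; i--) {
        		if (zones[i].balanced_for_order0)
        			zones[i].congested = false;
        	}
        }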
      Signed-off-by: default avatarShaohua Li <shaohua.li@intel.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      564ea9dd
    • Johannes Weiner's avatar
      mm: vmscan: fix force-scanning small targets without swap · 33c17eaf
      Johannes Weiner authored
      commit a4d3e9e7 upstream.
      
      Stable note: Not tracked in Bugzilla. This patch augments an earlier commit
              that avoids scanning priority being artificially raised. The older
      	fix was particularly important for small memcgs to avoid calling
      	wait_iff_congested() unnecessarily.
      
      Without swap, anonymous pages are not scanned.  As such, they should not
      count when considering force-scanning a small target if there is no swap.
      
      Otherwise, targets are not force-scanned even when their effective scan
      number is zero and the other conditions--kswapd/memcg--apply.
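      
      A minimal sketch of the corrected test (stand-in parameters;
      SWAP_CLUSTER_MAX matches the kernel constant): without swap only the
      file pages can be scanned, so only they count toward the "target too
      small for proportional scanning" check.
      
        #include <stdbool.h>
        
        #define SWAP_CLUSTER_MAX	32UL
        
        static bool should_force_scan(unsigned long anon, unsigned long file,
        			      int priority, bool swap_available)
        {
        	unsigned long scannable = file + (swap_available ? anon : 0);
        
        	return (scannable >> priority) < SWAP_CLUSTER_MAX;
        }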
      
      This fixes 246e87a9 ("memcg: fix get_scan_count() for small
      targets").
      
      [akpm@linux-foundation.org: fix comment]
      Signed-off-by: default avatarJohannes Weiner <jweiner@redhat.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      33c17eaf
    • Mel Gorman's avatar
      mm: reduce the amount of work done when updating min_free_kbytes · 71a07f4c
      Mel Gorman authored
      commit 938929f1 upstream.
      
      Stable note: Fixes https://bugzilla.novell.com/show_bug.cgi?id=726210 .
              Large machines with 1TB or more of RAM take a long time to boot
              without this patch and may spew out soft lockup warnings.
      
      When min_free_kbytes is updated, some pageblocks are marked
      MIGRATE_RESERVE.  Ordinarily, this work is unnoticeable as it happens early
      in boot but on large machines with 1TB of memory, this has been reported
      to delay boot times, probably due to the NUMA distances involved.
      
      The bulk of the work is due to calling pageblock_is_reserved() an
      unnecessary number of times and accessing far more struct page metadata
      than is necessary.  This patch significantly reduces the amount of work
      done by setup_zone_migrate_reserve() improving boot times on 1TB machines.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      71a07f4c
    • Mel Gorman's avatar
      mm: memory hotplug: Check if pages are correctly reserved on a per-section basis · 1126e709
      Mel Gorman authored
      commit 2bbcb878 upstream.
      
      Stable note: Fixes https://bugzilla.novell.com/show_bug.cgi?id=721039 .
              Without the patch, memory hot-add can fail for kernel configurations
              that do not set CONFIG_SPARSEMEM_VMEMMAP.
      
      (Resending as I am not seeing it in -next so maybe it got lost)
      
      It is expected that memory being brought online is PageReserved
      similar to what happens when the page allocator is being brought up.
      Memory is onlined in "memory blocks" which consist of one or more
      sections. Unfortunately, the code that verifies PageReserved is
      currently assuming that the memmap backing all these pages is virtually
      contiguous which is only the case when CONFIG_SPARSEMEM_VMEMMAP is set.
      As a result, memory hot-add is failing on those configurations with
      the message;
      
      kernel: section number XXX page number 256 not reserved, was it already online?
      
      This patch updates the PageReserved check to lookup struct page once
      per section to guarantee the correct struct page is being checked.
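      
      A simplified stand-alone sketch of the per-section lookup (hypothetical
      types; the real code resolves the struct page with pfn_to_page() once per
      section): because the memmap of a block's sections need not be virtually
      contiguous without SPARSEMEM_VMEMMAP, the pointer must be re-derived for
      every section rather than indexed from the block's first page.
      
        struct fake_page { int reserved; };
        
        /* 'lookup' stands in for pfn_to_page() on a section's first pfn. */
        static int block_pages_reserved(unsigned long first_section,
        				unsigned long nr_sections,
        				unsigned long pages_per_section,
        				struct fake_page *(*lookup)(unsigned long))
        {
        	unsigned long s, i;
        
        	for (s = 0; s < nr_sections; s++) {
        		struct fake_page *map = lookup(first_section + s);
        
        		for (i = 0; i < pages_per_section; i++) {
        			if (!map[i].reserved)
        				return 0;	/* not reserved: refuse to online */
        		}
        	}
        	return 1;
        }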
      
      [Check pages within sections properly: rientjes@google.com]
      [original patch by: nfont@linux.vnet.ibm.com]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Tested-by: default avatarNathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@suse.de>
      1126e709
    • Dimitri Sivanich's avatar
      mm/vmstat.c: cache align vm_stat · 9116bc4f
      Dimitri Sivanich authored
      commit a1cb2c60 upstream.
      
      Stable note: Not tracked on Bugzilla. This patch is known to make a big
              difference to tmpfs performance on larger machines.
      
      False sharing of the vm_stat array was found to adversely affect tmpfs
      I/O performance.
      
      Tests run on a 640 cpu UV system.
      
      With 120 threads doing parallel writes, each to different tmpfs mounts:
      No patch:		~300 MB/sec
      With vm_stat alignment:	~430 MB/sec
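      
      The change itself is essentially an alignment annotation on the counter
      array; a hedged stand-alone illustration (64-byte line size assumed, and
      a generic attribute instead of the kernel's ____cacheline_aligned_in_smp
      macro):
      
        #define NR_STAT_ITEMS	32	/* illustrative size */
        
        /*
         * Aligning the hot, globally updated counter array to a cache line
         * keeps it from sharing a line with unrelated data that other CPUs
         * write, which is the false sharing that was costing tmpfs throughput.
         */
        static long stat_counters[NR_STAT_ITEMS] __attribute__((aligned(64)));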
      Signed-off-by: default avatarDimitri Sivanich <sivanich@sgi.com>
      Acked-by: default avatarChristoph Lameter <cl@gentwo.org>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      9116bc4f
    • Mikulas Patocka's avatar
      dm raid1: fix crash with mirror recovery and discard · fbb41f55
      Mikulas Patocka authored
      commit 751f188d upstream.
      
      This patch fixes a crash when a discard request is sent during mirror
      recovery.
      
      Firstly, some background.  Generally, the following sequence happens during
      mirror synchronization:
      - function do_recovery is called
      - do_recovery calls dm_rh_recovery_prepare
      - dm_rh_recovery_prepare uses a semaphore to limit the number
        simultaneously recovered regions (by default the semaphore value is 1,
        so only one region at a time is recovered)
      - dm_rh_recovery_prepare calls __rh_recovery_prepare,
        __rh_recovery_prepare asks the log driver for the next region to
        recover. Then, it sets the region state to DM_RH_RECOVERING. If there
        are no pending I/Os on this region, the region is added to
        quiesced_regions list. If there are pending I/Os, the region is not
        added to any list. It is added to the quiesced_regions list later (by
        dm_rh_dec function) when all I/Os finish.
      - when the region is on quiesced_regions list, there are no I/Os in
        flight on this region. The region is popped from the list in
        dm_rh_recovery_start function. Then, a kcopyd job is started in the
        recover function.
      - when the kcopyd job finishes, recovery_complete is called. It calls
        dm_rh_recovery_end. dm_rh_recovery_end adds the region to
        recovered_regions or failed_recovered_regions list (depending on
        whether the copy operation was successful or not).
      
      The above mechanism assumes that if the region is in DM_RH_RECOVERING
      state, no new I/Os are started on this region. When I/O is started,
      dm_rh_inc_pending is called, which increases reg->pending count. When
      I/O is finished, dm_rh_dec is called. It decreases reg->pending count.
      If the count is zero and the region was in DM_RH_RECOVERING state,
      dm_rh_dec adds it to the quiesced_regions list.
      
      Consequently, if we call dm_rh_inc_pending/dm_rh_dec while the region is
      in DM_RH_RECOVERING state, it could be added to quiesced_regions list
      multiple times or it could be added to this list when kcopyd is copying
      data (it is assumed that the region is not on any list while kcopyd does
      its jobs). This results in memory corruption and crash.
      
      There already exist bypasses for REQ_FLUSH requests: REQ_FLUSH requests
      do not belong to any region, so they are always added to the sync list
      in do_writes. dm_rh_inc_pending does not increase count for REQ_FLUSH
      requests. In mirror_end_io, dm_rh_dec is never called for REQ_FLUSH
      requests. These bypasses avoid the crash possibility described above.
      
      These bypasses were improperly implemented for REQ_DISCARD when
      the mirror target gained discard support in commit
      5fc2ffea (dm raid1: support discard).
      
      In do_writes, REQ_DISCARD requests are always added to the sync queue and
      immediately dispatched (even if the region is in DM_RH_RECOVERING).  However,
      dm_rh_inc and dm_rh_dec are called for REQ_DISCARD requests.  So it violates
      the rule that no I/Os are started on DM_RH_RECOVERING regions, and causes the
      list corruption described above.
      
      This patch changes it so that REQ_DISCARD requests follow the same path
      as REQ_FLUSH. This avoids the crash.
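      
      Schematically (illustrative flag bits, not the dm-raid1 code), the routing
      decision after the fix treats discards exactly like flushes, so no region
      accounting is ever done for them:
      
        #include <stdbool.h>
        
        #define RQ_FLUSH	0x1	/* stand-ins for REQ_FLUSH / REQ_DISCARD */
        #define RQ_DISCARD	0x2
        
        /* Requests that bypass dm_rh_inc_pending()/dm_rh_dec() entirely. */
        static bool bypasses_region_accounting(unsigned int rq_flags)
        {
        	return (rq_flags & (RQ_FLUSH | RQ_DISCARD)) != 0;
        }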
      
      Reference: https://bugzilla.redhat.com/837607
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fbb41f55
    • Artem Bityutskiy's avatar
      UBIFS: fix a bug in empty space fix-up · cd050f56
      Artem Bityutskiy authored
      commit c6727932 upstream.
      
      UBIFS has a feature called "empty space fix-up" which is a quirk to work-around
      limitations of dumb flasher programs - namely, those that are unable
      to skip NAND pages full of 0xFFs while flashing, which leaves the empty space
      at the end of half-filled eraseblocks unusable for UBIFS. This feature is
      relatively new (introduced in v3.0).
      
      The fix-up routine (fixup_free_space()) is executed only once at the very first
      mount if the superblock has the 'space_fixup' flag set (can be done with -F
      option of mkfs.ubifs). It basically reads all the UBIFS data and metadata and
      writes it back to the same LEB. The routine assumes the image is pristine and
      does not have anything in the journal.
      
      There was a bug in 'fixup_free_space()' where it fixed up the log incorrectly.
      All but one LEB of the log of a pristine file-system are empty; the one
      remaining LEB contains just a commit start node.  'fixup_free_space()' simply
      unmapped this LEB, which wiped the commit start node. As a result, some users
      were unable to mount the file-system next time with the following symptom:
      
      UBIFS error (pid 1): replay_log_leb: first log node at LEB 3:0 is not CS node
      UBIFS error (pid 1): replay_log_leb: log error detected while replaying the log at LEB 3:0
      
      The root cause of this bug was that 'fixup_free_space()' wrongly assumed
      that the beginning of empty space in the log head (c->lhead_offs) was known
      on mount. However, that is not the case - it was always 0. UBIFS does not
      store it in the master node and instead finds it out by scanning the log on
      every mount.
      
      The fix is simple - just pass commit start node size instead of 0 to
      'fixup_leb()'.
      Signed-off-by: default avatarArtem Bityutskiy <Artem.Bityutskiy@linux.intel.com>
      Reported-by: default avatarIwo Mergler <Iwo.Mergler@netcommwireless.com>
      Tested-by: default avatarIwo Mergler <Iwo.Mergler@netcommwireless.com>
      Reported-by: default avatarJames Nute <newten82@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cd050f56
    • David Daney's avatar
      MIPS: Properly align the .data..init_task section. · 689415c1
      David Daney authored
      commit 7b1c0d26 upstream.
      
      Improper alignment can lead to unbootable systems and/or random
      crashes.
      
      [ralf@linux-mips.org: This is a long-standing bug since
      6eb10bc9 (kernel.org) rsp.
      c422a10917f75fd19fa7fe070aaaa23e384dae6f (lmo) [MIPS: Clean up linker script
      using new linker script macros.] so dates back to 2.6.32.]
      Signed-off-by: default avatarDavid Daney <david.daney@cavium.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/3881/
      Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      689415c1
    • Aaditya Kumar's avatar
      mm: fix lost kswapd wakeup in kswapd_stop() · 6d40de83
      Aaditya Kumar authored
      commit 1c7e7f6c upstream.
      
      Offlining memory may block forever, waiting for kswapd() to wake up
      because kswapd() does not check the event kthread->should_stop before
      sleeping.
      
      The proper pattern, from Documentation/memory-barriers.txt, is:
      
         ---  waker  ---
         event_indicated = 1;
         wake_up_process(event_daemon);
      
         ---  sleeper  ---
         for (;;) {
            set_current_state(TASK_UNINTERRUPTIBLE);
            if (event_indicated)
               break;
            schedule();
         }
      
         set_current_state() may be wrapped by:
            prepare_to_wait();
      
      In the kswapd() case, event_indicated is kthread->should_stop.
      
        === offlining memory (waker) ===
         kswapd_stop()
            kthread_stop()
               kthread->should_stop = 1
               wake_up_process()
               wait_for_completion()
      
        ===  kswapd_try_to_sleep (sleeper) ===
         kswapd_try_to_sleep()
            prepare_to_wait()
                 .
                 .
            schedule()
                 .
                 .
            finish_wait()
      
      The schedule() needs to be protected by a test of kthread->should_stop,
      which is wrapped by kthread_should_stop().
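      
      In the same pseudocode style as above, the sleeper side with that check
      in place becomes:
      
        ===  kswapd_try_to_sleep (sleeper, fixed) ===
         kswapd_try_to_sleep()
            prepare_to_wait()
            if (!kthread_should_stop())
               schedule()
            finish_wait()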
      
      Reproducer:
         Do heavy file I/O in background.
         Do a memory offline/online in a tight loop
      Signed-off-by: default avatarAaditya Kumar <aaditya.kumar@ap.sony.com>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6d40de83
    • John Stultz's avatar
      ntp: Fix STA_INS/DEL clearing bug · dccecc64
      John Stultz authored
      commit 6b1859db upstream.
      
      In commit 6b43ae8a, I
      introduced a bug that kept the STA_INS or STA_DEL bit
      from being cleared from time_status via adjtimex()
      without forcing STA_PLL first.
      
      Usually once the STA_INS is set, it isn't cleared
      until the leap second is applied, so it's unlikely this
      affected anyone. However, during testing I noticed it
      took some effort to cancel a leap second once STA_INS
      was set.
      Signed-off-by: default avatarJohn Stultz <johnstul@us.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Link: http://lkml.kernel.org/r/1342156917-25092-2-git-send-email-john.stultz@linaro.org
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dccecc64
    • Jeff Layton's avatar
      cifs: always update the inode cache with the results from a FIND_* · adccea44
      Jeff Layton authored
      commit cd60042c upstream.
      
      When we get back a FIND_FIRST/NEXT result, we have some info about the
      dentry that we use to instantiate a new inode. We were ignoring and
      discarding that info when we had an existing dentry in the cache.
      
      Fix this by updating the inode in place when we find an existing dentry
      and the uniqueid is the same.
      Reported-and-Tested-by: default avatarAndrew Bartlett <abartlet@samba.org>
      Reported-by: default avatarBill Robertson <bill_robertson@debortoli.com.au>
      Reported-by: default avatarDion Edwards <dion_edwards@debortoli.com.au>
      Signed-off-by: default avatarJeff Layton <jlayton@redhat.com>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      adccea44
  2. 19 Jul, 2012 3 commits