1. 03 May, 2017 40 commits
    • mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx · e716f2eb
      Mel Gorman authored
      kswapd is woken to reclaim a node based on a failed allocation request
      from any eligible zone.  Once reclaiming in balance_pgdat(), it will
      continue reclaiming until there is an eligible zone available for the
      zone it was woken for.  kswapd tracks what zone it was recently woken
      for in pgdat->kswapd_classzone_idx.  If it has not been woken recently,
      this zone will be 0.
      
      However, the decision on whether to sleep is made on
      kswapd_classzone_idx which is 0 without a recent wakeup request and that
      classzone does not account for lowmem reserves.  This allows kswapd to
      sleep when a small lowmem zone such as ZONE_DMA is balanced for a GFP_DMA
      request even if a stream of allocations cannot use that zone.  While
      kswapd may be woken again shortly, there are two consequences -- the
      pgdat bits that control congestion are cleared prematurely and direct
      reclaim is more likely because kswapd slept prematurely.
      
      This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an
      invalid index) when there have been no recent wakeups.  If there are no
      wakeups, it'll decide whether to sleep based on the highest possible
      zone available (MAX_NR_ZONES - 1).  It then becomes critical that the
      "pgdat balanced" decisions during reclaim and when deciding to sleep are
      the same.  If there is a mismatch, kswapd can stay awake continually
      trying to balance tiny zones.
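
      As a rough sketch of that idea (not the exact upstream hunk; the helper
      name and field follow the description above), the stored index can be
      resolved against the zone the caller cares about:

          /*
           * Illustrative sketch only: MAX_NR_ZONES means "no recent wakeup",
           * in which case the caller's classzone_idx (or MAX_NR_ZONES - 1
           * when deciding whether to sleep) is used instead.
           */
          static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
                                                     enum zone_type classzone_idx)
          {
                  if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
                          return classzone_idx;

                  return pgdat->kswapd_classzone_idx;
          }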
      
      simoop was used to evaluate it again.  Two of the preparation patches
      regressed the workload so they are included as the second set of
      results.  Otherwise this patch looks artificially excellent.
      
                                               4.11.0-rc1            4.11.0-rc1            4.11.0-rc1
                                                  vanilla              clear-v2          keepawake-v2
      Amean    p50-Read             21670074.18 (  0.00%) 19786774.76 (  8.69%) 22668332.52 ( -4.61%)
      Amean    p95-Read             25456267.64 (  0.00%) 24101956.27 (  5.32%) 26738688.00 ( -5.04%)
      Amean    p99-Read             29369064.73 (  0.00%) 27691872.71 (  5.71%) 30991404.52 ( -5.52%)
      Amean    p50-Write                1390.30 (  0.00%)     1011.91 ( 27.22%)      924.91 ( 33.47%)
      Amean    p95-Write              412901.57 (  0.00%)    34874.98 ( 91.55%)     1362.62 ( 99.67%)
      Amean    p99-Write             6668722.09 (  0.00%)   575449.60 ( 91.37%)    16854.04 ( 99.75%)
      Amean    p50-Allocation          78714.31 (  0.00%)    84246.26 ( -7.03%)    74729.74 (  5.06%)
      Amean    p95-Allocation         175533.51 (  0.00%)   400058.43 (-127.91%)   101609.74 ( 42.11%)
      Amean    p99-Allocation         247003.02 (  0.00%) 10905600.00 (-4315.17%)   125765.57 ( 49.08%)
      
      With this patch on top, write and allocation latencies are massively
      improved.  The read latencies are slightly impaired but it's worth
      noting that this is mostly due to the IO scheduler and not directly
      related to reclaim.  The vmstats are a bit of a mix but the relevant
      ones are as follows:
      
                                  4.10.0-rc7  4.10.0-rc7  4.10.0-rc7
                                mmots-20170209 clear-v1r25keepawake-v1r25
      Swap Ins                             0           0           0
      Swap Outs                            0         608           0
      Direct pages scanned           6910672     3132699     6357298
      Kswapd pages scanned          57036946    82488665    56986286
      Kswapd pages reclaimed        55993488    63474329    55939113
      Direct pages reclaimed         6905990     2964843     6352115
      Kswapd efficiency                  98%         76%         98%
      Kswapd velocity              12494.375   17597.507   12488.065
      Direct efficiency                  99%         94%         99%
      Direct velocity               1513.835     668.306    1393.148
      Page writes by reclaim           0.000 4410243.000       0.000
      Page writes file                     0     4409635           0
      Page writes anon                     0         608           0
      Page reclaim immediate         1036792    14175203     1042571
      
                                  4.11.0-rc1  4.11.0-rc1  4.11.0-rc1
                                     vanilla  clear-v2  keepawake-v2
      Swap Ins                             0          12           0
      Swap Outs                            0         838           0
      Direct pages scanned           6579706     3237270     6256811
      Kswapd pages scanned          61853702    79961486    54837791
      Kswapd pages reclaimed        60768764    60755788    53849586
      Direct pages reclaimed         6579055     2987453     6256151
      Kswapd efficiency                  98%         75%         98%
      Page writes by reclaim           0.000 4389496.000       0.000
      Page writes file                     0     4388658           0
      Page writes anon                     0         838           0
      Page reclaim immediate         1073573    14473009      982507
      
      Swap-outs are equivalent to baseline.
      
      Direct reclaim is reduced but not eliminated.  It's worth noting that
      there are two periods of direct reclaim for this workload.  The first is
      when the workload switches from preparing the files to the actual test
      itself.  It's a lot of file IO followed by a lot of allocs that reclaims
      heavily for a brief window.  While direct reclaim is lower with clear-v2,
      that is due to kswapd scanning aggressively and trying to reclaim the
      world, which is not the right thing to do.  With the patches applied,
      there is still direct reclaim, but it is due to the phase change from
      "creating work files" to starting multiple threads that allocate a lot
      of anonymous memory faster than kswapd can reclaim.
      
      Scanning/reclaim efficiency is restored by this patch.
      
      Page writes from reclaim context are back at 0 which is ideal.
      
      The number of pages immediately reclaimed after IO completes is slightly
      improved but it is expected this will vary slightly.
      
      On UMA, there is almost no change so this is not expected to be a
      universal win.
      
      [mgorman@suse.de: fix ->kswapd_classzone_idx initialization]
        Link: http://lkml.kernel.org/r/20170406174538.5msrznj6nt6qpbx5@suse.de
      Link: http://lkml.kernel.org/r/20170309075657.25121-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shantanu Goel <sgoel01@yahoo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e716f2eb
    • mm, vmscan: only clear pgdat congested/dirty/writeback state when balanced · 631b6e08
      Mel Gorman authored
      A pgdat tracks if recent reclaim encountered too many dirty, writeback
      or congested pages.  The flags control whether kswapd writes pages back
      from reclaim context, tags pages for immediate reclaim when IO
      completes, whether processes block on wait_iff_congested and whether
      kswapd blocks when too many pages marked for immediate reclaim are
      encountered.
      
      The state is cleared in a check function with side-effects.  With the
      patch "mm, vmscan: fix zone balance check in prepare_kswapd_sleep", the
      timing of when the bits get cleared changed.  Due to the way the check
      works, it'll clear the bits if ZONE_DMA is balanced for a GFP_DMA
      allocation because it does not account for lowmem reserves properly.
      
      For the simoop workload, kswapd is not stalling when it should due to
      the premature clearing, writing pages from reclaim context like crazy
      and generally being unhelpful.
      
      This patch resets the pgdat bits related to page reclaim only when
      kswapd is going to sleep.  The comparison with simoop is then
      
                                               4.11.0-rc1            4.11.0-rc1            4.11.0-rc1
                                                  vanilla           fixcheck-v2              clear-v2
      Amean    p50-Read             21670074.18 (  0.00%) 20464344.18 (  5.56%) 19786774.76 (  8.69%)
      Amean    p95-Read             25456267.64 (  0.00%) 25721423.64 ( -1.04%) 24101956.27 (  5.32%)
      Amean    p99-Read             29369064.73 (  0.00%) 30174230.76 ( -2.74%) 27691872.71 (  5.71%)
      Amean    p50-Write                1390.30 (  0.00%)     1395.28 ( -0.36%)     1011.91 ( 27.22%)
      Amean    p95-Write              412901.57 (  0.00%)    37737.74 ( 90.86%)    34874.98 ( 91.55%)
      Amean    p99-Write             6668722.09 (  0.00%)   666489.04 ( 90.01%)   575449.60 ( 91.37%)
      Amean    p50-Allocation          78714.31 (  0.00%)    86286.22 ( -9.62%)    84246.26 ( -7.03%)
      Amean    p95-Allocation         175533.51 (  0.00%)   351812.27 (-100.42%)   400058.43 (-127.91%)
      Amean    p99-Allocation         247003.02 (  0.00%)  6291171.56 (-2447.00%) 10905600.00 (-4315.17%)
      
      Read latency is improved, write latency is mostly improved but
      allocation latency is regressed.  kswapd is still reclaiming
      inefficiently, pages are being written back from writeback context and a
      host of other issues.  However, given the change, it needed to be
      spelled out why the side-effect was moved.
      
      Link: http://lkml.kernel.org/r/20170309075657.25121-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shantanu Goel <sgoel01@yahoo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      631b6e08
    • mm, vmscan: fix zone balance check in prepare_kswapd_sleep · 333b0a45
      Shantanu Goel authored
      Patch series "Reduce amount of time kswapd sleeps prematurely", v2.
      
      The series is unusual in that the first patch fixes one problem and
      introduces other issues that are noted in the changelog.  Patch 2 makes
      a minor modification that is worth considering on its own but leaves the
      kernel in a state where it behaves badly.  It's not until patch 3 that
      there is an improvement against baseline.
      
      This was mostly motivated by examining Chris Mason's "simoop" benchmark
      which puts the VM under similar pressure to HADOOP.  It has been
      reported that the benchmark has regressed severely over the last several
      releases.  While I cannot reproduce all the same problems Chris
      experienced due to hardware limitations, there were a number of problems
      on a 2-socket machine with a single disk.
      
      simoop latencies
                                               4.11.0-rc1            4.11.0-rc1
                                                  vanilla          keepawake-v2
      Amean    p50-Read             21670074.18 (  0.00%) 22668332.52 ( -4.61%)
      Amean    p95-Read             25456267.64 (  0.00%) 26738688.00 ( -5.04%)
      Amean    p99-Read             29369064.73 (  0.00%) 30991404.52 ( -5.52%)
      Amean    p50-Write                1390.30 (  0.00%)      924.91 ( 33.47%)
      Amean    p95-Write              412901.57 (  0.00%)     1362.62 ( 99.67%)
      Amean    p99-Write             6668722.09 (  0.00%)    16854.04 ( 99.75%)
      Amean    p50-Allocation          78714.31 (  0.00%)    74729.74 (  5.06%)
      Amean    p95-Allocation         175533.51 (  0.00%)   101609.74 ( 42.11%)
      Amean    p99-Allocation         247003.02 (  0.00%)   125765.57 ( 49.08%)
      
      These are latencies.  Read/write are threads reading fixed-size random
      blocks from a simulated database.  The allocation latency is mmaping and
      faulting regions of memory.  The p50, p95 and p99 figures report the
      worst latency seen for 50%, 95% and 99% of the samples respectively.

      For example, the report indicates that while the test was running, the
      p99 write latency was 99.75% lower.  It's worth noting that on a UMA
      machine no difference in performance with simoop was observed, so
      mileage will vary.
      
      It's noted that there is a slight impact to read latencies but it's
      mostly due to IO scheduler decisions and offset by the large reduction
      in other latencies.
      
      This patch (of 3):
      
      The check in prepare_kswapd_sleep needs to match the one in
      balance_pgdat since the latter will return as soon as any one of the
      zones in the classzone is above the watermark.  This is specially
      important for higher order allocations since balance_pgdat will
      typically reset the order to zero relying on compaction to create the
      higher order pages.  Without this patch, prepare_kswapd_sleep fails to
      wake up kcompactd since the zone balance check fails.
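
      A hedged sketch of the shared "node is balanced" idea both callers need
      to agree on (zone_watermark_ok_safe(), high_wmark_pages() and
      managed_zone() are existing helpers; the loop itself is illustrative
      rather than the upstream code):

          static bool pgdat_balanced_sketch(pg_data_t *pgdat, int order,
                                            int classzone_idx)
          {
                  int i;

                  /* One eligible zone above its high watermark is enough */
                  for (i = 0; i <= classzone_idx; i++) {
                          struct zone *zone = pgdat->node_zones + i;

                          if (!managed_zone(zone))
                                  continue;

                          if (zone_watermark_ok_safe(zone, order,
                                                     high_wmark_pages(zone),
                                                     classzone_idx))
                                  return true;
                  }

                  return false;
          }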
      
      It was first reported against 4.9.7 that kswapd is failing to wake up
      kcompactd due to a mismatch in the zone balance check between
      balance_pgdat() and prepare_kswapd_sleep().
      
      balance_pgdat() returns as soon as a single zone satisfies the
      allocation but prepare_kswapd_sleep() requires all zones to do the
      same.  This causes prepare_kswapd_sleep() to never succeed except in the
      order == 0 case and consequently, wakeup_kcompactd() is never called.
      For the machine that originally motivated this patch, the state of
      compaction from /proc/vmstat looked this way after a day and a half of
      uptime:
      
      compact_migrate_scanned 240496
      compact_free_scanned 76238632
      compact_isolated 123472
      compact_stall 1791
      compact_fail 29
      compact_success 1762
      compact_daemon_wake 0
      
      After applying the patch and about 10 hours of uptime the state looks
      like this:
      
      compact_migrate_scanned 59927299
      compact_free_scanned 2021075136
      compact_isolated 640926
      compact_stall 4
      compact_fail 2
      compact_success 2
      compact_daemon_wake 5160
      
      Further notes from Mel that motivated him to pick this patch up and
      resend it:
      
      It was observed for the simoop workload (pressures the VM similar to
      HADOOP) that kswapd was failing to keep ahead of direct reclaim.  The
      investigation noted that there was a need to rationalise kswapd
      decisions to reclaim with kswapd decisions to sleep.  With this patch on
      a 2-socket box, there was a 49% reduction in direct reclaim scanning.
      
      However, the impact otherwise is extremely negative.  Kswapd reclaim
      efficiency dropped from 98% to 76%.  simoop has three latency-related
      metrics for read, write and allocation (an anonymous mmap and fault).
      
                                               4.11.0-rc1            4.11.0-rc1
                                                  vanilla           fixcheck-v2
      Amean    p50-Read             21670074.18 (  0.00%) 20464344.18 (  5.56%)
      Amean    p95-Read             25456267.64 (  0.00%) 25721423.64 ( -1.04%)
      Amean    p99-Read             29369064.73 (  0.00%) 30174230.76 ( -2.74%)
      Amean    p50-Write                1390.30 (  0.00%)     1395.28 ( -0.36%)
      Amean    p95-Write              412901.57 (  0.00%)    37737.74 ( 90.86%)
      Amean    p99-Write             6668722.09 (  0.00%)   666489.04 ( 90.01%)
      Amean    p50-Allocation          78714.31 (  0.00%)    86286.22 ( -9.62%)
      Amean    p95-Allocation         175533.51 (  0.00%)   351812.27 (-100.42%)
      Amean    p99-Allocation         247003.02 (  0.00%)  6291171.56 (-2447.00%)
      
      Of greater concern is that the patch causes swapping, and page writes
      from kswapd context rose from 0 pages to 4189753 pages during the hour
      the workload ran for.  By and large, the patch has very bad behaviour
      but it is easily missed as the impact on a UMA machine is negligible.
      
      This patch is included with the data in case a bisection leads to this
      area.  This patch is also a pre-requisite for the rest of the series.
      
      Link: http://lkml.kernel.org/r/20170309075657.25121-2-mgorman@techsingularity.net
      Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      333b0a45
    • mm: do not use double negation for testing page flags · 2948be5a
      Minchan Kim authored
      Following the discussion in [1], it seems that every PageFlags function
      returns bool at this point, so we don't need double negation any more.
      Although it's not a problem to keep it, it may confuse future users into
      using double negation too.

      Remove that possibility.
      
      [1] https://marc.info/?l=linux-kernel&m=148881578820434
      
      Frankly speaking, I would like every PageFlags function to return bool
      instead of int.  It would make things clearer.  AFAIR, Chen Gang had
      tried it but I don't know why it was not merged at that time.
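
      For illustration, a caller-side change looks like this (hypothetical
      snippet, not from the patch):

          /* Before: double negation to squeeze the flag test into 0/1 */
          int was_dirty = !!PageDirty(page);

          /* After: PageDirty() and friends already return bool */
          bool is_dirty = PageDirty(page);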
      
      http://lkml.kernel.org/r/1469336184-1904-1-git-send-email-chengang@emindsoft.com.cn
      
      Link: http://lkml.kernel.org/r/1488868597-32222-1-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2948be5a
    • mm: remove rodata_test_data export, add pr_fmt · 056b9d8a
      Kees Cook authored
      Since commit 3ad38ceb ("x86/mm: Remove CONFIG_DEBUG_NX_TEST"),
      nothing is using the exported rodata_test_data variable, so drop the
      export.
      
      This additionally updates the pr_fmt to avoid redundant strings and
      adjusts some whitespace.
      
      Link: http://lkml.kernel.org/r/20170307005313.GA85809@beast
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Jinbum Park <jinb.park7@gmail.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      056b9d8a
    • mm: tighten up the fault path a little · 9ab2594f
      Matthew Wilcox authored
      The round_up() macro generates a couple of unnecessary instructions
      in this usage:
      
          48cd:       49 8b 47 50             mov    0x50(%r15),%rax
          48d1:       48 83 e8 01             sub    $0x1,%rax
          48d5:       48 0d ff 0f 00 00       or     $0xfff,%rax
          48db:       48 83 c0 01             add    $0x1,%rax
          48df:       48 c1 f8 0c             sar    $0xc,%rax
          48e3:       48 39 c3                cmp    %rax,%rbx
          48e6:       72 2e                   jb     4916 <filemap_fault+0x96>
      
      If we change round_up() to ((x) + __round_mask(x, y)) & ~__round_mask(x, y)
      then GCC can see through it and remove the mask (because that would be
      dead code given the subsequent shift):
      
          48cd:       49 8b 47 50             mov    0x50(%r15),%rax
          48d1:       48 05 ff 0f 00 00       add    $0xfff,%rax
          48d7:       48 c1 e8 0c             shr    $0xc,%rax
          48db:       48 39 c3                cmp    %rax,%rbx
          48de:       72 2e                   jb     490e <filemap_fault+0x8e>
      
      But that's problematic because we'd evaluate 'y' twice.  Converting
      round_up into an inline function prevents it from being used in other
      definitions.  The easiest thing to do is just change these three usages
      of round_up to use DIV_ROUND_UP.  Also add an unlikely() because GCC's
      heuristic is wrong in this case.
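
      A before/after sketch of the kind of conversion described (variable
      names are illustrative, not the exact filemap.c hunk):

          /* Before: round up to a page boundary, then shift down */
          max_idx = round_up(i_size_read(inode), PAGE_SIZE) >> PAGE_SHIFT;

          /* After: same result, simpler code, plus the branch hint */
          max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
          if (unlikely(offset >= max_idx))
                  return VM_FAULT_SIGBUS;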
      
      Link: http://lkml.kernel.org/r/20170207192812.5281-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9ab2594f
    • jbd2: make the whole kjournald2 kthread NOFS safe · eb52da3f
      Michal Hocko authored
      kjournald2 is central to the transaction commit processing.  As such any
      potential allocation from this kernel thread has to be GFP_NOFS.  Make
      sure to mark the whole kernel thread GFP_NOFS with memalloc_nofs_save().
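
      A simplified sketch of what marking the whole kthread looks like with
      the scope API (the real kjournald2() is of course much larger):

          static int kjournald2(void *arg)
          {
                  journal_t *journal = arg;

                  /*
                   * From here on every allocation made by this thread
                   * implicitly behaves as GFP_NOFS; the flags are never
                   * restored because the thread exists only to commit
                   * transactions.  Sketch only.
                   */
                  memalloc_nofs_save();

                  /* ... commit processing loop ... */

                  return 0;
          }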
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20170306131408.9828-8-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb52da3f
    • jbd2: mark the transaction context with the scope GFP_NOFS context · 81378da6
      Michal Hocko authored
      Now that we have the memalloc_nofs_{save,restore} API we can mark the whole
      transaction context as implicitly GFP_NOFS.  All allocations will
      automatically inherit GFP_NOFS this way.  This means that we do not have
      to mark any of those requests with GFP_NOFS and moreover all the
      ext4_kv[mz]alloc(GFP_NOFS) are also safe now because even the hardcoded
      GFP_KERNEL allocations deep inside the vmalloc will be NOFS now.
      
      [akpm@linux-foundation.org: tweak comments]
      Link: http://lkml.kernel.org/r/20170306131408.9828-7-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81378da6
    • xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio* · 9ba1fb2c
      Michal Hocko authored
      kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
      API to prevent from reclaim recursion into the fs because vmalloc can
      invoke unconditional GFP_KERNEL allocations and these functions might be
      called from the NOFS contexts.  The memalloc_noio_save will enforce
      GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
      unnecessary.  Let's use memalloc_nofs_{save,restore} instead as it
      should provide exactly what we need here - implicit GFP_NOFS context.
      
      Link: http://lkml.kernel.org/r/20170306131408.9828-6-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9ba1fb2c
    • mm: introduce memalloc_nofs_{save,restore} API · 7dea19f9
      Michal Hocko authored
      GFP_NOFS context is used for the following 5 reasons currently:
      
       - to prevent from deadlocks when the lock held by the allocation
         context would be needed during the memory reclaim
      
       - to prevent from stack overflows during the reclaim because the
         allocation is performed from a deep context already
      
       - to prevent lockups when the allocation context depends on other
         reclaimers to make a forward progress indirectly
      
       - just in case because this would be safe from the fs POV
      
       - silence lockdep false positives
      
      Unfortunately overuse of this allocation context brings some problems to
      the MM.  Memory reclaim is much weaker (especially during heavy FS
      metadata workloads), and the OOM killer cannot be invoked because the MM layer
      doesn't have enough information about how much memory is freeable by the
      FS layer.
      
      In many cases it is far from clear why the weaker context is even used
      and so it might be used unnecessarily.  We would like to get rid of
      those as much as possible.  One way to do that is to use the flag in
      scopes rather than isolated cases.  Such a scope is declared when really
      necessary, tracked per task and all the allocation requests from within
      the context will simply inherit the GFP_NOFS semantic.
      
      Not only is this easier to understand and maintain, because there are
      far fewer problematic contexts than specific allocation requests, but it
      also helps code paths where the FS layer interacts with other layers (e.g.
      crypto, security modules, MM etc...) and there is no easy way to convey
      the allocation context between the layers.
      
      Introduce memalloc_nofs_{save,restore} API to control the scope of
      GFP_NOFS allocation context.  This is basically copying
      memalloc_noio_{save,restore} API we have for other restricted allocation
      context GFP_NOIO.  The PF_MEMALLOC_NOFS flag already exists and it is
      just an alias for PF_FSTRANS which has been xfs specific until recently.
      There are no PF_FSTRANS users anymore so let's just drop it.
      
      PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
      implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO.  memalloc_noio_flags
      is renamed to current_gfp_context because it now cares about both
      PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts.  Xfs code paths preserve
      their semantic.  kmem_flags_convert() doesn't need to evaluate the flag
      anymore.
      
      This patch shouldn't introduce any functional changes.
      
      Let's hope that filesystems will drop direct GFP_NOFS (resp.  ~__GFP_FS)
      usage as much as possible and only use properly documented
      memalloc_nofs_{save,restore} checkpoints where they are appropriate.
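
      A minimal usage sketch of the scope API described above (the function
      and its caller are hypothetical):

          static void *fs_alloc_in_transaction(size_t size)
          {
                  unsigned int nofs_flags;
                  void *buf;

                  /*
                   * Every allocation inside this scope, including the
                   * GFP_KERNEL allocations vmalloc() performs internally,
                   * implicitly has __GFP_FS dropped.
                   */
                  nofs_flags = memalloc_nofs_save();
                  buf = vmalloc(size);
                  memalloc_nofs_restore(nofs_flags);

                  return buf;
          }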
      
      [akpm@linux-foundation.org: fix comment typo, reflow comment]
      Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7dea19f9
    • xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS · 9070733b
      Michal Hocko authored
      xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
      some time ago.  We would like to make this concept more generic and use
      it for other filesystems as well.  Let's start by giving the flag a more
      generic name PF_MEMALLOC_NOFS which is in line with the existing
      PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
      contexts.  Replace all PF_FSTRANS usage from the xfs code in the first
      step before we introduce a full API for it as xfs uses the flag directly
      anyway.
      
      This patch doesn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170306131408.9828-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9070733b
    • lockdep: allow to disable reclaim lockup detection · 7e784422
      Michal Hocko authored
      The current implementation of the reclaim lockup detection can lead to
      false positives.  Those do happen in practice and usually lead to
      tweaking the code to silence lockdep by using GFP_NOFS even though the
      context can use __GFP_FS just fine.
      
      See
      
        http://lkml.kernel.org/r/20160512080321.GA18496@dastard
      
      as an example.
      
        =================================
        [ INFO: inconsistent lock state ]
        4.5.0-rc2+ #4 Tainted: G           O
        ---------------------------------
        inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
        kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
      
        (&xfs_nondir_ilock_class){++++-+}, at: xfs_ilock+0x177/0x200 [xfs]
      
        {RECLAIM_FS-ON-R} state was registered at:
          mark_held_locks+0x79/0xa0
          lockdep_trace_alloc+0xb3/0x100
          kmem_cache_alloc+0x33/0x230
          kmem_zone_alloc+0x81/0x120 [xfs]
          xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
          __xfs_refcount_find_shared+0x75/0x580 [xfs]
          xfs_refcount_find_shared+0x84/0xb0 [xfs]
          xfs_getbmap+0x608/0x8c0 [xfs]
          xfs_vn_fiemap+0xab/0xc0 [xfs]
          do_vfs_ioctl+0x498/0x670
          SyS_ioctl+0x79/0x90
          entry_SYSCALL_64_fastpath+0x12/0x6f
      
               CPU0
               ----
          lock(&xfs_nondir_ilock_class);
          <Interrupt>
            lock(&xfs_nondir_ilock_class);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/543:
      
        stack backtrace:
        CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4
        Call Trace:
         lock_acquire+0xd8/0x1e0
         down_write_nested+0x5e/0xc0
         xfs_ilock+0x177/0x200 [xfs]
         xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
         xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
         evict+0xc5/0x190
         dispose_list+0x39/0x60
         prune_icache_sb+0x4b/0x60
         super_cache_scan+0x14f/0x1a0
         shrink_slab.part.63.constprop.79+0x1e9/0x4e0
         shrink_zone+0x15e/0x170
         kswapd+0x4f1/0xa80
         kthread+0xf2/0x110
         ret_from_fork+0x3f/0x70
      
      To quote Dave:
       "Ignoring whether reflink should be doing anything or not, that's a
        "xfs_refcountbt_init_cursor() gets called both outside and inside
        transactions" lockdep false positive case. The problem here is lockdep
        has seen this allocation from within a transaction, hence a GFP_NOFS
        allocation, and now it's seeing it in a GFP_KERNEL context. Also note
        that we have an active reference to this inode.
      
        So, because the reclaim annotations overload the interrupt level
        detections and it's seen the inode ilock been taken in reclaim
        ("interrupt") context, this triggers a reclaim context warning where
        it thinks it is unsafe to do this allocation in GFP_KERNEL context
        holding the inode ilock..."
      
      This sounds like a fundamental problem of the reclaim lock detection.
      It is really impossible to annotate such a special usecase IMHO unless
      the reclaim lockup detection is reworked completely.  Until then it is
      much better to provide a way to add an "I know what I am doing" flag and
      mark problematic places.  This would prevent abuse of the GFP_NOFS flag,
      which has a runtime effect even on configurations which have lockdep
      disabled.
      
      Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
      skip the current allocation request.
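
      An illustrative (not upstream) call site: the allocation keeps full
      GFP_KERNEL reclaim behaviour and only lockdep's reclaim tracking is
      suppressed.  The cache and variable names are hypothetical:

          /* Sketch: annotate a known-safe site instead of degrading to GFP_NOFS */
          obj = kmem_cache_alloc(my_cache, GFP_KERNEL | __GFP_NOLOCKDEP);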
      
      While we are at it, also make sure that the radix tree doesn't
      accidentally override tags stored in the upper part of the gfp_mask.
      
      Link: http://lkml.kernel.org/r/20170306131408.9828-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e784422
    • lockdep: teach lockdep about memalloc_noio_save · 6d7225f0
      Nikolay Borisov authored
      Patch series "scope GFP_NOFS api", v5.
      
      This patch (of 7):
      
      Commit 21caf2fc ("mm: teach mm by current context info to not do I/O
      during memory allocation") added the memalloc_noio_(save|restore)
      functions to enable people to modify the MM behavior by disabling I/O
      during memory allocation.
      
      This was further extended in commit 934f3072 ("mm: clear __GFP_FS
      when PF_MEMALLOC_NOIO is set").
      
      memalloc_noio_* functions prevent allocation paths recursing back into
      the filesystem without explicitly changing the flags for every
      allocation site.
      
      However, lockdep hasn't been keeping up with the changes and it entirely
      misses handling the memalloc_noio adjustments.  Instead, it is left to
      the callers of __lockdep_trace_alloc to call the function after they
      have shaved the respective GFP flags, which can lead to false positives:
      
        =================================
         [ INFO: inconsistent lock state ]
         4.10.0-nbor #134 Not tainted
         ---------------------------------
         inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
         fsstress/3365 [HC0[0]:SC0[0]:HE1:SE1] takes:
          (&xfs_nondir_ilock_class){++++?.}, at: xfs_ilock+0x141/0x230
         {IN-RECLAIM_FS-W} state was registered at:
           __lock_acquire+0x62a/0x17c0
           lock_acquire+0xc5/0x220
           down_write_nested+0x4f/0x90
           xfs_ilock+0x141/0x230
           xfs_reclaim_inode+0x12a/0x320
           xfs_reclaim_inodes_ag+0x2c8/0x4e0
           xfs_reclaim_inodes_nr+0x33/0x40
           xfs_fs_free_cached_objects+0x19/0x20
           super_cache_scan+0x191/0x1a0
           shrink_slab+0x26f/0x5f0
           shrink_node+0xf9/0x2f0
           kswapd+0x356/0x920
           kthread+0x10c/0x140
           ret_from_fork+0x31/0x40
         irq event stamp: 173777
         hardirqs last  enabled at (173777): __local_bh_enable_ip+0x70/0xc0
         hardirqs last disabled at (173775): __local_bh_enable_ip+0x37/0xc0
         softirqs last  enabled at (173776): _xfs_buf_find+0x67a/0xb70
         softirqs last disabled at (173774): _xfs_buf_find+0x5db/0xb70
      
         other info that might help us debug this:
          Possible unsafe locking scenario:
      
                CPU0
                ----
           lock(&xfs_nondir_ilock_class);
           <Interrupt>
             lock(&xfs_nondir_ilock_class);
      
          *** DEADLOCK ***
      
         4 locks held by fsstress/3365:
          #0:  (sb_writers#10){++++++}, at: mnt_want_write+0x24/0x50
          #1:  (&sb->s_type->i_mutex_key#12){++++++}, at: vfs_setxattr+0x6f/0xb0
          #2:  (sb_internal#2){++++++}, at: xfs_trans_alloc+0xfc/0x140
          #3:  (&xfs_nondir_ilock_class){++++?.}, at: xfs_ilock+0x141/0x230
      
         stack backtrace:
         CPU: 0 PID: 3365 Comm: fsstress Not tainted 4.10.0-nbor #134
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
         Call Trace:
          kmem_cache_alloc_node_trace+0x3a/0x2c0
          vm_map_ram+0x2a1/0x510
          _xfs_buf_map_pages+0x77/0x140
          xfs_buf_get_map+0x185/0x2a0
          xfs_attr_rmtval_set+0x233/0x430
          xfs_attr_leaf_addname+0x2d2/0x500
          xfs_attr_set+0x214/0x420
          xfs_xattr_set+0x59/0xb0
          __vfs_setxattr+0x76/0xa0
          __vfs_setxattr_noperm+0x5e/0xf0
          vfs_setxattr+0xae/0xb0
          setxattr+0x15e/0x1a0
          path_setxattr+0x8f/0xc0
          SyS_lsetxattr+0x11/0x20
          entry_SYSCALL_64_fastpath+0x23/0xc6
      
      Let's fix this by making lockdep explicitly do the shaving of respective
      GFP flags.
      
      Fixes: 934f3072 ("mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set")
      Link: http://lkml.kernel.org/r/20170306131408.9828-2-mhocko@kernel.org
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d7225f0
    • mm, vmstat: suppress pcp stats for unpopulated zones in zoneinfo · 7dfb8bf3
      David Rientjes authored
      After "mm, vmstat: print non-populated zones in zoneinfo",
      /proc/zoneinfo will show unpopulated zones.
      
      The per-cpu pageset statistics are not relevant for unpopulated zones
      and can be potentially lengthy, so suppress them when they are not
      interesting.
      
      Also move the lowmem reserve protection information above the pcp stats
      since it is relevant for all zones per vm.lowmem_reserve_ratio.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1703061400500.46428@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7dfb8bf3
    • mm, vmstat: print non-populated zones in zoneinfo · b2bd8598
      David Rientjes authored
      Initscripts can use the information (protection levels) from
      /proc/zoneinfo to configure vm.lowmem_reserve_ratio at boot.
      
      vm.lowmem_reserve_ratio is an array of ratios for each configured zone
      on the system.  If a zone is not populated on an arch, /proc/zoneinfo
      suppresses its output.
      
      This results in there not being a 1:1 mapping between the set of zones
      emitted by /proc/zoneinfo and the zones configured by
      vm.lowmem_reserve_ratio.
      
      This patch shows statistics for non-populated zones in /proc/zoneinfo.
      The zones exist and hold a spot in the vm.lowmem_reserve_ratio array.
      Without this patch, it is not possible to determine which index in the
      array controls which zone if one or more zones on the system are not
      populated.
      
      Remaining users of walk_zones_in_node() are unchanged.  Files such as
      /proc/pagetypeinfo require certain zone data to be initialized properly
      for display, which is not done for unpopulated zones.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1703031451310.98023@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b2bd8598
    • mm: use is_migrate_isolate_page() to simplify the code · bbf9ce97
      Xishi Qiu authored
      Use is_migrate_isolate_page() to simplify the code, no functional
      changes.
      
      Link: http://lkml.kernel.org/r/58B94FB1.8020802@huawei.com
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bbf9ce97
    • mm: use is_migrate_highatomic() to simplify the code · a6ffdc07
      Xishi Qiu authored
      Introduce two helpers, is_migrate_highatomic() and is_migrate_highatomic_page().
      
      Simplify the code, no functional changes.
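
      Per the [akpm] note below, the helpers are static inlines; a sketch
      along the lines of the changelog:

          static inline bool is_migrate_highatomic(enum migratetype migratetype)
          {
                  return migratetype == MIGRATE_HIGHATOMIC;
          }

          static inline bool is_migrate_highatomic_page(struct page *page)
          {
                  return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC;
          }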
      
      [akpm@linux-foundation.org: use static inlines rather than macros, per mhocko]
      Link: http://lkml.kernel.org/r/58B94F15.6060606@huawei.com
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6ffdc07
    • mm, swap: Fix a race in free_swap_and_cache() · 322b8afe
      Huang Ying authored
      Before using cluster lock in free_swap_and_cache(), the
      swap_info_struct->lock will be held during freeing the swap entry and
      acquiring page lock, so the page swap count will not change when testing
      page information later.  But after using cluster lock, the cluster lock
      (or swap_info_struct->lock) will be held only during freeing the swap
      entry.  So before acquiring the page lock, the page swap count may be
      changed in another thread.  If the page swap count is not 0, we should
      not delete the page from the swap cache.  This is fixed via checking
      page swap count again after acquiring the page lock.
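
      A hedged sketch of the "re-check under the page lock" part of the fix,
      simplified from the real free_swap_and_cache() flow (page_swapcount()
      and delete_from_swap_cache() are existing helpers):

          /*
           * Illustrative only: another thread may have re-added a swap
           * count after the entry was freed, so only drop the page from
           * the swap cache if the count is still zero under the page lock.
           */
          if (page && trylock_page(page)) {
                  if (page_swapcount(page) == 0 && !PageWriteback(page))
                          delete_from_swap_cache(page);
                  unlock_page(page);
                  put_page(page);         /* ref from the earlier lookup */
          }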
      
      I found the race when reviewing the code, so I didn't trigger it via a
      test program.  If the race occurs for an anonymous page shared by
      multiple processes via fork, multiple pages will be allocated and
      swapped in from the swap device for the previously shared one page.
      That is, the user-visible runtime effect is that more memory is used and
      the access latency for the page is higher, i.e.  a performance
      regression.
      
      Link: http://lkml.kernel.org/r/20170301143905.12846-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      322b8afe
    • mm: memcontrol: provide shmem statistics · 9a4caf1e
      Johannes Weiner authored
      Cgroups currently don't report how much shmem they use, which can be
      useful data to have, in particular since shmem is included in the
      cache/file item while being reclaimed like anonymous memory.
      
      Add a counter to track shmem pages during charging and uncharging.
      
      Link: http://lkml.kernel.org/r/20170221164343.32252-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Chris Down <cdown@fb.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a4caf1e
    • proc: show MADV_FREE pages info in smaps · cf8496ea
      Shaohua Li authored
      Show MADV_FREE pages info of each vma in smaps.  The interface is for
      diagnosis or monitoring purposes; userspace could use it to understand
      what happens in the application.  Since userspace can dirty MADV_FREE
      pages without notice from the kernel, this interface is the only place
      we can get accurate accounting info about MADV_FREE pages.
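
      For example, a userspace sketch of how a monitoring tool might exercise
      this: mark a region lazyfree and then read the per-VMA counter from
      /proc/self/smaps (the field added by the patch is named "LazyFree"; the
      sizes here are arbitrary):

          #include <string.h>
          #include <sys/mman.h>
          #include <unistd.h>

          int main(void)
          {
                  size_t len = 64UL << 20;
                  char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                  memset(buf, 1, len);            /* fault the pages in */
                  madvise(buf, len, MADV_FREE);   /* mark them lazyfree */

                  /* "grep LazyFree /proc/self/smaps" now accounts these pages */
                  pause();
                  return 0;
          }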
      
      [mhocko@kernel.org: update Documentation/filesystems/proc.txt]
      Link: http://lkml.kernel.org/r/89efde633559de1ec07444f2ef0f4963a97a2ce8.1487965799.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf8496ea
    • mm: enable MADV_FREE for swapless system · 93e06c7a
      Shaohua Li authored
      Now MADV_FREE pages can be easily reclaimed even on a swapless system.
      We can safely enable MADV_FREE for all systems.
      
      Link: http://lkml.kernel.org/r/155648585589300bfae1d45078e7aebb3d988b87.1487965799.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93e06c7a
    • mm: fix lazyfree BUG_ON check in try_to_unmap_one() · eb94a878
      Minchan Kim authored
      If a page is swapbacked, it means it should be in swapcache in
      try_to_unmap_one's path.
      
      If a page is !swapbacked, it means it shouldn't be in the swapcache in
      try_to_unmap_one's path.

      Check both cases at once and, if the check fails, warn and return
      SWAP_FAIL.  Such a bug never means we should shut down the kernel.
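
      A hedged sketch of the shape of the check inside try_to_unmap_one()
      (SWAP_FAIL and the page flag tests are existing symbols; the exact
      upstream condition may differ):

          /*
           * Illustrative only: a swapbacked anonymous page should be in the
           * swap cache here and a lazyfree one should not.  A mismatch is a
           * bug worth warning about, but not worth a BUG_ON.
           */
          if (WARN_ON_ONCE(PageSwapBacked(page) != PageSwapCache(page)))
                  return SWAP_FAIL;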
      
      [minchan@kernel.org: do not use VM_WARN_ON_ONCE as if condition]
        Link: http://lkml.kernel.org/r/20170309060226.GB854@bbox
      Link: http://lkml.kernel.org/r/20170307055551.GC29458@bbox
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb94a878
    • mm: reclaim MADV_FREE pages · 802a3a92
      Shaohua Li authored
      When memory pressure is high, we free MADV_FREE pages.  If the pages are
      not dirty in pte, the pages could be freed immediately.  Otherwise we
      can't reclaim them.  We put the pages back on the anonymous LRU list (by
      setting the SwapBacked flag) and the pages will be reclaimed via the
      normal swapout path.
      
      We use the normal page reclaim policy.  Since MADV_FREE pages are put on
      the inactive file list, such pages and inactive file pages are reclaimed
      according to their age.  This is expected, because we don't want to
      reclaim too many MADV_FREE pages before used-once file pages.
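
      A hedged sketch of that decision as it would sit in the reclaim path
      (simplified; the labels stand in for the surrounding shrink_page_list()
      flow):

          if (PageAnon(page) && !PageSwapBacked(page)) {
                  /* Clean lazyfree page: can be freed right away */
                  if (!PageDirty(page))
                          goto free_it;

                  /*
                   * Dirtied since MADV_FREE: put it back on the anonymous
                   * LRU and let normal swapout reclaim it.
                   */
                  SetPageSwapBacked(page);
                  goto keep_locked;
          }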
      
      Based on Minchan's original patch
      
      [minchan@kernel.org: clean up lazyfree page handling]
        Link: http://lkml.kernel.org/r/20170303025237.GB3503@bbox
      Link: http://lkml.kernel.org/r/14b8eb1d3f6bf6cc492833f183ac8c304e560484.1487965799.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      802a3a92
    • mm: move MADV_FREE pages into LRU_INACTIVE_FILE list · f7ad2a6c
      Shaohua Li authored
      madvise()'s MADV_FREE indicates pages are 'lazyfree'.  They are still
      anonymous pages, but they can be freed without pageout.  To distinguish
      them from normal anonymous pages, we clear their SwapBacked flag.
      
      MADV_FREE pages could be freed without pageout, so they are pretty much
      like used-once file pages.  For such pages, we'd like to reclaim them
      once there is memory pressure.  Also it might be unfair to always
      reclaim MADV_FREE pages before used-once file pages, and we definitely
      want to reclaim them before other anonymous and file pages.
      
      To speed up MADV_FREE pages reclaim, we put the pages into
      LRU_INACTIVE_FILE list.  The rationale is LRU_INACTIVE_FILE list is tiny
      nowadays and should be full of used-once file pages.  Reclaiming
      MADV_FREE pages will not interfere much with anonymous and active file
      pages.  And the inactive file pages and MADV_FREE pages will be
      reclaimed according to their age, so we don't reclaim too many MADV_FREE
      pages too soon.  Putting the MADV_FREE pages on the LRU_INACTIVE_FILE list also
      means we can reclaim the pages without swap support.  This idea is
      suggested by Johannes.
      
      This patch doesn't move MADV_FREE pages to LRU_INACTIVE_FILE list yet to
      avoid bisect failure, next patch will do it.
      
      The patch is based on Minchan's original patch.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f7ad2a6c
    • mm: don't assume anonymous pages have SwapBacked flag · d44d363f
      Shaohua Li authored
      There are a few places the code assumes anonymous pages should have
      SwapBacked flag set.  MADV_FREE pages are anonymous pages but we are
      going to add them to LRU_INACTIVE_FILE list and clear SwapBacked flag
      for them.  The assumption doesn't hold any more, so fix them.
      
      Link: http://lkml.kernel.org/r/3945232c0df3dd6c4ef001976f35a95f18dcb407.1487965799.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d44d363f
    • Shaohua Li's avatar
      mm: delete unnecessary TTU_* flags · a128ca71
      Shaohua Li authored
      Patch series "mm: fix some MADV_FREE issues", v5.
      
      We are trying to use MADV_FREE in jemalloc, and several issues were
      found.  Without solving these issues, jemalloc can't use the MADV_FREE
      feature.
      
       - It doesn't support systems without swap enabled.  If swap is off, we
         can't age anonymous pages, or can't do so efficiently, and since
         MADV_FREE pages are mixed with other anonymous pages we can't
         reclaim them.  In the current implementation, MADV_FREE falls back
         to MADV_DONTNEED when swap is not enabled.  But in our environment a
         lot of machines don't enable swap, which prevents our setup from
         using MADV_FREE.
      
       - It increases memory pressure.  Page reclaim is biased toward
         reclaiming file pages rather than anonymous pages.  This doesn't
         make sense for MADV_FREE pages, because those pages can be freed
         easily and refilled with only a slight penalty.  Even if page
         reclaim weren't biased toward file pages, there would still be an
         issue, because MADV_FREE pages and other anonymous pages are mixed
         together: to reclaim a MADV_FREE page we probably have to scan a lot
         of other anonymous pages, which is inefficient.  In our tests we
         usually see OOMs with MADV_FREE enabled and none without it.
      
       - Accounting.  There are two accounting problems.  We don't have
         global accounting, so if the system misbehaves we don't know whether
         MADV_FREE is the cause.  The other problem is RSS accounting:
         MADV_FREE pages are accounted as normal anonymous pages and
         reclaimed lazily, so an application's RSS becomes bigger.  This
         confuses our workloads.  We have a monitoring daemon running, and if
         it finds an application's RSS becoming abnormal it will kill the
         application, even though the kernel could reclaim the memory easily.
      
      To address the first two issues, we can either put MADV_FREE pages into
      a separate LRU list (Minchan's previous patches and the V1 patches), or
      put them on the LRU_INACTIVE_FILE list (suggested by Johannes).  This
      patchset uses the second idea.  The reason is that the
      LRU_INACTIVE_FILE list is tiny nowadays and should be full of used-once
      file pages, so we can still efficiently reclaim MADV_FREE pages there
      without interfering with other anonymous and active file pages.
      Putting the pages on the inactive file list also allows page reclaim to
      prioritize MADV_FREE pages and used-once file pages.  MADV_FREE pages
      are put on the LRU list with their SwapBacked flag cleared, so
      PageAnon(page) && !PageSwapBacked(page) indicates a MADV_FREE page.
      Such pages are freed directly without pageout if they are clean;
      otherwise normal swap reclaims them.
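
      A hedged sketch of the reclaim-side decision this enables (illustrative
      only, not the exact shrink_page_list() diff; the free_it label is
      assumed from the surrounding reclaim loop):
      
      	/* During reclaim of a page taken off the inactive file LRU: */
      	if (PageAnon(page) && !PageSwapBacked(page)) {
      		/* lazyfree page */
      		if (!PageDirty(page)) {
      			/* still clean since MADV_FREE: drop it without pageout */
      			goto free_it;
      		}
      		/* dirtied again: it is live data, so treat it as normal
      		 * anonymous memory and let swap handle it */
      		SetPageSwapBacked(page);
      	}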
      
      For the third issue, the previous post added global accounting and a
      separate RSS count for MADV_FREE pages.  The problem is that we never
      get accurate accounting for MADV_FREE pages: the pages are mapped into
      userspace and can be dirtied without the kernel noticing.  To get
      accurate accounting we could write-protect the pages, but that adds
      page-fault overhead, which people don't want to pay.  The jemalloc
      developers have concerns about the inaccurate accounting, so this post
      drops the accounting patches temporarily.  The information exported to
      /proc/pid/smaps for MADV_FREE pages is kept, as that is the only place
      we can get accurate accounting right now.
      
      This patch (of 6):
      
      Johannes pointed out that TTU_LZFREE is unnecessary.  This is true
      because the flag is always set when we want to do an unmap; in the
      cases where we don't unmap, the TTU_LZFREE part of the code should
      never run.
      
      TTU_UNMAP is also unnecessary: if no other flag (for example,
      TTU_MIGRATION) is set, an unmap is implied.
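
      The caller-visible effect, as a hedged sketch (flag names as they
      existed before this patch; not the verbatim diff):
      
      	/* Before: callers spelled out the "just unmap" mode explicitly. */
      	try_to_unmap(page, TTU_UNMAP | TTU_IGNORE_ACCESS);
      
      	/* After: a plain unmap is the default when no mode flag is given. */
      	try_to_unmap(page, TTU_IGNORE_ACCESS);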
      
      The patch includes Johannes's cleanup and the removal of the dead
      TTU_ACTION macro.
      
      Link: http://lkml.kernel.org/r/4be3ea1bc56b26fd98a54d0a6f70bec63f6d8980.1487965799.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a128ca71
    • Geliang Tang's avatar
      mm/page-writeback.c: use setup_deferrable_timer · 0a372d09
      Geliang Tang authored
      Use setup_deferrable_timer() instead of init_timer_deferrable() to
      simplify the code.
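
      A minimal sketch of what such a conversion looks like (my_timer and
      my_timer_fn are hypothetical names; the callback uses the pre-4.15
      unsigned long signature):
      
      	#include <linux/timer.h>
      
      	static struct timer_list my_timer;
      
      	static void my_timer_fn(unsigned long data)
      	{
      		/* ... */
      	}
      
      	/* Before: three separate steps. */
      	init_timer_deferrable(&my_timer);
      	my_timer.function = my_timer_fn;
      	my_timer.data = 0;
      
      	/* After: one call sets up the same deferrable timer. */
      	setup_deferrable_timer(&my_timer, my_timer_fn, 0);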
      
      Link: http://lkml.kernel.org/r/e8e3d4280a34facbc007346f31df833cec28801e.1488070291.git.geliangtang@gmail.com
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a372d09
    • Johannes Weiner's avatar
      mm: remove unnecessary back-off function when retrying page reclaim · 491d79ae
      Johannes Weiner authored
      The backoff mechanism is not needed.  If we have MAX_RECLAIM_RETRIES
      loops without progress, we'll OOM anyway; backing off might cut one or
      two iterations off that in the rare OOM case.  If we have intermittent
      success reclaiming a few pages, the backoff function gets reset also,
      and so is of little help in these scenarios.
      
      We might want a backoff function for when there IS progress, but not
      enough to be satisfactory.  But this isn't that.  Remove it.
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-10-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      491d79ae
    • Johannes Weiner's avatar
      Revert "mm, vmscan: account for skipped pages as a partial scan" · 3db65812
      Johannes Weiner authored
      This reverts commit d7f05528.
      
      Now that reclaimability of a node is no longer based on the ratio
      between pages scanned and theoretically reclaimable pages, we can remove
      accounting tricks for pages skipped due to zone constraints.
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-9-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3db65812
    • Johannes Weiner's avatar
      mm: delete NR_PAGES_SCANNED and pgdat_reclaimable() · c822f622
      Johannes Weiner authored
      NR_PAGES_SCANNED counts the number of pages scanned since the last page free
      event in the allocator.  This was used primarily to measure the
      reclaimability of zones and nodes, and determine when reclaim should
      give up on them.  In that role, it has been replaced in the preceding
      patches by a different mechanism.
      
      Being implemented as an efficient vmstat counter, it was automatically
      exported to userspace as well.  It's however unlikely that anyone
      outside the kernel is using this counter in any meaningful way.
      
      Remove the counter and the unused pgdat_reclaimable().
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c822f622
    • Johannes Weiner's avatar
      mm: don't avoid high-priority reclaim on memcg limit reclaim · 688035f7
      Johannes Weiner authored
      Commit 246e87a9 ("memcg: fix get_scan_count() for small targets")
      sought to avoid high reclaim priorities for memcg by forcing it to scan
      a minimum amount of pages when lru_pages >> priority yielded nothing.
      This was done at a time when reclaim decisions like dirty throttling
      were tied to the priority level.
      
      Nowadays, the only meaningful thing still tied to priority dropping
      below DEF_PRIORITY - 2 is gating whether laptop_mode=1 is generally
      allowed to write.  But that is from an era where direct reclaim was
      still allowed to call ->writepage, and kswapd nowadays avoids writes
      until it's scanned every clean page in the system.  Potential changes to
      how quickly sc->may_writepage could trigger are of little concern.
      
      Remove the force_scan stuff, as well as the ugly multi-pass target
      calculation that it necessitated.
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-7-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      688035f7
    • Johannes Weiner's avatar
      mm: don't avoid high-priority reclaim on unreclaimable nodes · a2d7f8e4
      Johannes Weiner authored
      Commit 246e87a9 ("memcg: fix get_scan_count() for small targets")
      sought to avoid high reclaim priorities for kswapd by forcing it to scan
      a minimum amount of pages when lru_pages >> priority yielded nothing.
      
      Commit b95a2f2d ("mm: vmscan: convert global reclaim to per-memcg
      LRU lists"), due to switching global reclaim to a round-robin scheme
      over all cgroups, had to restrict this forceful behavior to
      unreclaimable zones in order to prevent massive overreclaim with many
      cgroups.
      
      The latter patch effectively neutered the behavior completely for all
      but extreme memory pressure.  But in those situations we might as well
      drop the reclaimers to lower priority levels.  Remove the check.
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-6-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2d7f8e4
    • Johannes Weiner's avatar
      mm: remove unnecessary reclaimability check from NUMA balancing target · 15038d0d
      Johannes Weiner authored
      NUMA balancing already checks the watermarks of the target node to
      decide whether it's a suitable balancing target.  Whether the node is
      reclaimable or not is irrelevant when we don't intend to reclaim.
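
      Roughly what the target check amounts to once the reclaimability test
      is gone (a sketch modelled on the migration-target balance check;
      target_node_has_room_sketch is an illustrative name, not the verbatim
      kernel code):
      
      	/* Is the target node balanced enough to take nr_migrate_pages? */
      	static bool target_node_has_room_sketch(pg_data_t *pgdat,
      						unsigned long nr_migrate_pages)
      	{
      		int z;
      
      		for (z = pgdat->nr_zones - 1; z >= 0; z--) {
      			struct zone *zone = pgdat->node_zones + z;
      
      			if (!populated_zone(zone))
      				continue;
      
      			/* Only the watermark matters; no reclaimability check. */
      			if (zone_watermark_ok(zone, 0,
      					      high_wmark_pages(zone) + nr_migrate_pages,
      					      0, 0))
      				return true;
      		}
      		return false;
      	}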
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-5-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      15038d0d
    • Johannes Weiner's avatar
      mm: remove seemingly spurious reclaimability check from laptop_mode gating · 047d72c3
      Johannes Weiner authored
      Commit 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of
      nodes") allowed laptop_mode=1 to start writing not just when the
      priority drops to DEF_PRIORITY - 2 but also when the node is
      unreclaimable.
      
      That appears to be a spurious change in this patch, as I doubt the
      series was tested with laptop_mode, and that particular change is not
      mentioned in the changelog either.  Remove it while the change is still
      recent.
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-4-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      047d72c3
    • Johannes Weiner's avatar
      mm: fix check for reclaimable pages in PF_MEMALLOC reclaim throttling · d450abd8
      Johannes Weiner authored
      PF_MEMALLOC direct reclaimers get throttled on a node when the sum of
      all free pages in each zone falls below half the min watermark.  During
      the summation, we want to exclude zones that don't have any reclaimable
      pages; checking the same pgdat over and over again doesn't make sense.
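
      A sketch of the intended check (modelled on the direct-reclaim
      throttling test; illustrative only, not the verbatim
      allow_direct_reclaim() code, and the function name is made up):
      
      	static bool node_pfmemalloc_watermark_ok_sketch(pg_data_t *pgdat)
      	{
      		unsigned long free_pages = 0, pfmemalloc_reserve = 0;
      		int i;
      
      		for (i = 0; i <= ZONE_NORMAL; i++) {
      			struct zone *zone = &pgdat->node_zones[i];
      
      			if (!managed_zone(zone))
      				continue;
      
      			/* Skip this zone, not the whole pgdat, when it has
      			 * nothing reclaimable. */
      			if (!zone_reclaimable_pages(zone))
      				continue;
      
      			pfmemalloc_reserve += min_wmark_pages(zone);
      			free_pages += zone_page_state(zone, NR_FREE_PAGES);
      		}
      
      		return free_pages > pfmemalloc_reserve / 2;
      	}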
      
      Fixes: 599d0c95 ("mm, vmscan: move LRU lists to node")
      Link: http://lkml.kernel.org/r/20170228214007.5621-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d450abd8
    • Johannes Weiner's avatar
      mm: fix 100% CPU kswapd busyloop on unreclaimable nodes · c73322d0
      Johannes Weiner authored
      Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
      cleanups".
      
      Jia reported a scenario in which the kswapd of a node indefinitely spins
      at 100% CPU usage.  We have seen similar cases at Facebook.
      
      The kernel's current method of judging its ability to reclaim a node (or
      whether to back off and sleep) is based on the amount of scanned pages
      in proportion to the amount of reclaimable pages.  In Jia's and our
      scenarios, there are no reclaimable pages in the node, however, and the
      condition for backing off is never met.  Kswapd busyloops in an attempt
      to restore the watermarks while having nothing to work with.
      
      This series reworks the definition of an unreclaimable node based not on
      scanning but on whether kswapd is able to actually reclaim pages in
      MAX_RECLAIM_RETRIES (16) consecutive runs.  This is the same criterion
      the page allocator uses for giving up on direct reclaim and invoking the
      OOM killer.  If it cannot free any pages, kswapd will go to sleep and
      leave further attempts to direct reclaim invocations, which will either
      make progress and re-enable kswapd, or invoke the OOM killer.
      
      Patch #1 fixes the immediate problem Jia reported, the remainder are
      smaller fixlets, cleanups, and overall phasing out of the old method.
      
      Patch #6 is the odd one out.  It's a nice cleanup to get_scan_count(),
      and directly related to #5, but in itself not relevant to the series.
      
      If the whole series is too ambitious for 4.11, I would consider the
      first three patches fixes, the rest cleanups.
      
      This patch (of 9):
      
      Jia He reports a problem with kswapd spinning at 100% CPU when
      requesting more hugepages than memory available in the system:
      
      $ echo 4000 >/proc/sys/vm/nr_hugepages
      
      top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
      Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
      %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
      KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
      KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
      
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
      
      At that time, there are no reclaimable pages left in the node, but as
      kswapd fails to restore the high watermarks it refuses to go to sleep.
      
      Kswapd needs to back away from nodes that fail to balance.  Up until
      commit 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of
      nodes") kswapd had such a mechanism.  It considered zones whose
      theoretically reclaimable pages it had reclaimed six times over as
      unreclaimable and backed away from them.  This guard was erroneously
      removed as the patch changed the definition of a balanced node.
      
      However, simply restoring this code wouldn't help in the case reported
      here: there *are* no reclaimable pages that could be scanned until the
      threshold is met.  Kswapd would stay awake anyway.
      
      Introduce a new and much simpler way of backing off.  If kswapd runs
      through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
      page, make it back off from the node.  This is the same number of shots
      direct reclaim takes before declaring OOM.  Kswapd will go to sleep on
      that node until a direct reclaimer manages to reclaim some pages, thus
      proving the node reclaimable again.
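
      The core of the new back-off, as a hedged sketch (the per-node counter
      is shown here as kswapd_failures, and nr_reclaimed stands for the pages
      freed during one balancing run; a sketch of the logic, not the exact
      diff):
      
      	/* In kswapd, after one full balancing run over the node: */
      	if (nr_reclaimed == 0)
      		pgdat->kswapd_failures++;
      	else
      		pgdat->kswapd_failures = 0;
      
      	/* When deciding whether to wake or keep kswapd running: */
      	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
      		return;	/* back off; successful direct reclaim resets the counter */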
      
      [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
        Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
      [shakeelb@google.com: fix condition for throttle_direct_reclaim]
        Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: Jia He <hejianet@gmail.com>
      Tested-by: Jia He <hejianet@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c73322d0
    • Greg Thelen's avatar
      slab: avoid IPIs when creating kmem caches · a87c75fb
      Greg Thelen authored
      Each slab kmem cache has per cpu array caches.  The array caches are
      created when the kmem_cache is created, either via kmem_cache_create()
      or lazily when the first object is allocated in context of a kmem
      enabled memcg.  Array caches are replaced by writing to /proc/slabinfo.
      
      Array caches are protected by holding slab_mutex or disabling
      interrupts.  Array cache allocation and replacement is done by
      __do_tune_cpucache(), which holds slab_mutex and calls
      kick_all_cpus_sync() to interrupt all remote processors, which confirms
      that there are no references to the old array caches.
      
      IPIs are needed when replacing array caches.  But when creating a new
      array cache, there's no need to send IPIs because there cannot be any
      references to the new cache.  Outside of memcg kmem accounting these
      IPIs occur at boot time, so they're not a problem.  But with memcg kmem
      accounting each container can create kmem caches, so the IPIs are
      wasteful.
      
      Avoid unnecessary IPIs when creating array caches.
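
      The essence of the change, as a hedged sketch of __do_tune_cpucache()
      (illustrative, not the verbatim diff; prev and cpu_cache are the old
      and newly allocated per-cpu array caches):
      
      	/* Swap in the newly allocated per-cpu array caches. */
      	prev = cachep->cpu_cache;
      	cachep->cpu_cache = cpu_cache;
      
      	/* A brand-new cache has no users on other CPUs, so the IPI
      	 * broadcast is only needed when an old cache is being replaced. */
      	if (prev)
      		kick_all_cpus_sync();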
      
      A test that reports the IPI count when allocating slabs in 10000 memcgs:
      
      	import os
      
      	def ipi_count():
      		with open("/proc/interrupts") as f:
      			for l in f:
      				if 'Function call interrupts' in l:
      					return int(l.split()[1])
      
      	def echo(val, path):
      		with open(path, "w") as f:
      			f.write(val)
      
      	n = 10000
      	os.chdir("/mnt/cgroup/memory")
      	pid = str(os.getpid())
      	a = ipi_count()
      	for i in range(n):
      		os.mkdir(str(i))
      		echo("1G\n", "%d/memory.limit_in_bytes" % i)
      		echo("1G\n", "%d/memory.kmem.limit_in_bytes" % i)
      		echo(pid, "%d/cgroup.procs" % i)
      		open("/tmp/x", "w").close()
      		os.unlink("/tmp/x")
      	b = ipi_count()
      	print "%d loops: %d => %d (+%d ipis)" % (n, a, b, b-a)
      	echo(pid, "cgroup.procs")
      	for i in range(n):
      		os.rmdir(str(i))
      
      patched:   10000 loops: 1069 => 1170 (+101 ipis)
      unpatched: 10000 loops: 1192 => 48933 (+47741 ipis)
      
      Link: http://lkml.kernel.org/r/20170416214544.109476-1-gthelen@google.com
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a87c75fb
    • Geliang Tang's avatar
      fs/ocfs2/cluster: use offset_in_page() macro · d47736fa
      Geliang Tang authored
      Use offset_in_page() macro instead of open-coding.
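
      For illustration (a hedged sketch; vaddr is a hypothetical kernel
      virtual address):
      
      	#include <linux/mm.h>	/* offset_in_page(), PAGE_SIZE */
      
      	/* Open-coded byte offset within a page: */
      	unsigned long off = (unsigned long)vaddr & (PAGE_SIZE - 1);
      
      	/* The helper computes the same thing with clearer intent: */
      	unsigned long off2 = offset_in_page(vaddr);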
      
      Link: http://lkml.kernel.org/r/4dbc77ccaaed98b183cf4dba58a4fa325fd65048.1492758503.git.geliangtang@gmail.com
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d47736fa
    • Junxiao Bi's avatar
      ocfs2: o2hb: revert hb threshold to keep compatible · 33496c3c
      Junxiao Bi authored
      Configfs is the interface ocfs2-tools uses to pass configuration to the
      kernel, and $configfs_dir/cluster/$clustername/heartbeat/dead_threshold
      is the attribute used to configure the heartbeat dead threshold.  The
      kernel has a default value for it, but users can set
      O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb to override it.
      
      Commit 45b99773 ("ocfs2/cluster: use per-attribute show and store
      methods") changed the heartbeat dead threshold attribute name while
      ocfs2-tools did not, so ocfs2-tools no longer sets this configurable and
      the default value is always used.  So revert the rename.
      
      Fixes: 45b99773 ("ocfs2/cluster: use per-attribute show and store methods")
      Link: http://lkml.kernel.org/r/1490665245-15374-1-git-send-email-junxiao.bi@oracle.com
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Acked-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33496c3c
    • Geliang Tang's avatar