1. 03 May, 2017 40 commits
    • Minchan Kim's avatar
      mm: make ttu's return boolean · 666e5a40
      Minchan Kim authored
      try_to_unmap() returns SWAP_SUCCESS or SWAP_FAIL so it's suitable for
      boolean return.  This patch changes it.
      
      Link: http://lkml.kernel.org/r/1489555493-14659-8-git-send-email-minchan@kernel.orgSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      666e5a40
    • Minchan Kim's avatar
      mm: remove SWAP_AGAIN in ttu · 33fc80e2
      Minchan Kim authored
      In 2002, [1] introduced SWAP_AGAIN.  At that time, try_to_unmap_one used
      spin_trylock(&mm->page_table_lock) so it's really easy to contend and
      fail to hold a lock so SWAP_AGAIN to keep LRU status makes sense.
      
      However, now we changed it to mutex-based lock and be able to block
      without skip pte so there is few of small window to return SWAP_AGAIN so
      remove SWAP_AGAIN and just return SWAP_FAIL.
      
      [1] c48c43e, minimal rmap
      
      Link: http://lkml.kernel.org/r/1489555493-14659-7-git-send-email-minchan@kernel.orgSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      33fc80e2
    • Minchan Kim's avatar
      mm: remove SWAP_MLOCK in ttu · ad6b6704
      Minchan Kim authored
      ttu doesn't need to return SWAP_MLOCK.  Instead, just return SWAP_FAIL
      because it means the page is not-swappable so it should move to another
      LRU list(active or unevictable).  putback friends will move it to right
      list depending on the page's LRU flag.
      
      Link: http://lkml.kernel.org/r/1489555493-14659-6-git-send-email-minchan@kernel.orgSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ad6b6704
    • Minchan Kim's avatar
      mm: make try_to_munlock() return void · 192d7232
      Minchan Kim authored
      try_to_munlock returns SWAP_MLOCK if the one of VMAs mapped the page has
      VM_LOCKED flag.  In that time, VM set PG_mlocked to the page if the page
      is not pte-mapped THP which cannot be mlocked, either.
      
      With that, __munlock_isolated_page can use PageMlocked to check whether
      try_to_munlock is successful or not without relying on try_to_munlock's
      retval.  It helps to make try_to_unmap/try_to_unmap_one simple with
      upcoming patches.
      
      [minchan@kernel.org: remove PG_Mlocked VM_BUG_ON check]
        Link: http://lkml.kernel.org/r/20170411025615.GA6545@bbox
      Link: http://lkml.kernel.org/r/1489555493-14659-5-git-send-email-minchan@kernel.orgSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Sasha Levin <alexander.levin@verizon.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      192d7232
    • Minchan Kim's avatar
      mm: remove SWAP_MLOCK check for SWAP_SUCCESS in ttu · 22ffb33f
      Minchan Kim authored
      If the page is mapped and rescue in try_to_unmap_one, the
      page_mapcount() of a page cannot be zero, so the page_mapcount check in
      try_to_unmap is enough to return SWAP_SUCCESS.  IOW, SWAP_MLOCK check is
      redundant so remove it.
      
      Link: http://lkml.kernel.org/r/1489555493-14659-4-git-send-email-minchan@kernel.orgSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      22ffb33f
    • Minchan Kim's avatar
      mm: remove SWAP_DIRTY in ttu · 18863d3a
      Minchan Kim authored
      If we found lazyfree page is dirty, try_to_unmap_one can just
      SetPageSwapBakced in there like PG_mlocked page and just return with
      SWAP_FAIL which is very natural because the page is not swappable right
      now so that vmscan can activate it.  There is no point to introduce new
      return value SWAP_DIRTY in try_to_unmap at the moment.
      
      Link: http://lkml.kernel.org/r/1489555493-14659-3-git-send-email-minchan@kernel.orgSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      18863d3a
    • Minchan Kim's avatar
      mm: remove unncessary ret in page_referenced · c24f386c
      Minchan Kim authored
      Nobody uses ret variable. Remove it.
      
      Link: http://lkml.kernel.org/r/1489555493-14659-2-git-send-email-minchan@kernel.orgSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c24f386c
    • Yisheng Xie's avatar
      mm/vmscan: more restrictive condition for retry in do_try_to_free_pages · d6622f63
      Yisheng Xie authored
      By reviewing code, I find that when enter do_try_to_free_pages, the
      may_thrash is always clear, and it will retry shrink zones to tap
      cgroup's reserves memory by setting may_thrash when the former
      shrink_zones reclaim nothing.
      
      However, when memcg is disabled or on legacy hierarchy, or there do not
      have any memcg protected by low limit, it should not do this useless
      retry at all, for we do not have any cgroup's reserves memory to tap,
      and we have already done hard work but made no progress, which as Michal
      pointed out in former version, we are trying hard to control the retry
      logical of page alloctor, and the current additional round of reclaim is
      just lame.
      
      Therefore, to avoid this unneeded retrying and make code more readable,
      we remove the may_thrash field in scan_control, instead, introduce
      memcg_low_reclaim and memcg_low_skipped, and only retry when
      memcg_low_skipped, by setting memcg_low_reclaim.
      
      [xieyisheng1@huawei.com: remove may_thrash field, introduce mem_cgroup_reclaim]
        Link: http://lkml.kernel.org/r/1490191893-5923-1-git-send-email-ysxie@foxmail.com
      Link: http://lkml.kernel.org/r/1490191893-5923-1-git-send-email-ysxie@foxmail.comSigned-off-by: default avatarYisheng Xie <xieyisheng1@huawei.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Suggested-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Suggested-by: default avatarMichal Hocko <mhocko@kernel.org>
      Suggested-by: default avatarShakeel Butt <shakeelb@google.com>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d6622f63
    • Yisheng Xie's avatar
      mm/compaction: ignore block suitable after check large free page · 1ef36db2
      Yisheng Xie authored
      By reviewing code, I find that if the migrate target is a large free
      page and we ignore suitable, it may splite large target free page into
      smaller block which is not good for defrag.  So move the ignore block
      suitable after check large free page.
      
      As Vlastimil pointed out in RFC version that this patch is just based on
      logical analyses which might be better for future-proofing the function
      and it is most likely won't have any visible effect right now, for
      direct compaction shouldn't have to be called if there's a
      >=pageblock_order page already available.
      
      Link: http://lkml.kernel.org/r/1489490743-5364-1-git-send-email-xieyisheng1@huawei.comSigned-off-by: default avatarYisheng Xie <xieyisheng1@huawei.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1ef36db2
    • Wei Yang's avatar
      mm/sparse: refine usemap_size() a little · 60a7a88d
      Wei Yang authored
      The current implementation calculates usemap_size in two steps:
          * calculate number of bytes to cover these bits
          * calculate number of "unsigned long" to cover these bytes
      
      It would be more clear by:
          * calculate number of "unsigned long" to cover these bits
          * multiple it with sizeof(unsigned long)
      
      This patch refine usemap_size() a little to make it more easy to
      understand.
      
      Link: http://lkml.kernel.org/r/20170310043713.96871-1-richard.weiyang@gmail.comSigned-off-by: default avatarWei Yang <richard.weiyang@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      60a7a88d
    • Johannes Weiner's avatar
      mm: page_alloc: __GFP_NOWARN shouldn't suppress stall warnings · 82251963
      Johannes Weiner authored
      __GFP_NOWARN, which is usually added to avoid warnings from callsites
      that expect to fail and have fallbacks, currently also suppresses
      allocation stall warnings.  These trigger when an allocation is stuck
      inside the allocator for 10 seconds or longer.
      
      But there is no class of allocations that can get legitimately stuck in
      the allocator for this long.  This always indicates a problem.
      
      Always emit stall warnings.  Restrict __GFP_NOWARN to alloc failures.
      
      Link: http://lkml.kernel.org/r/20170125181150.GA16398@cmpxchg.orgSigned-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      82251963
    • Mel Gorman's avatar
      mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx · e716f2eb
      Mel Gorman authored
      kswapd is woken to reclaim a node based on a failed allocation request
      from any eligible zone.  Once reclaiming in balance_pgdat(), it will
      continue reclaiming until there is an eligible zone available for the
      zone it was woken for.  kswapd tracks what zone it was recently woken
      for in pgdat->kswapd_classzone_idx.  If it has not been woken recently,
      this zone will be 0.
      
      However, the decision on whether to sleep is made on
      kswapd_classzone_idx which is 0 without a recent wakeup request and that
      classzone does not account for lowmem reserves.  This allows kswapd to
      sleep when a low small zone such as ZONE_DMA is balanced for a GFP_DMA
      request even if a stream of allocations cannot use that zone.  While
      kswapd may be woken again shortly in the near future there are two
      consequences -- the pgdat bits that control congestion are cleared
      prematurely and direct reclaim is more likely as kswapd slept
      prematurely.
      
      This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an
      invalid index) when there has been no recent wakeups.  If there are no
      wakeups, it'll decide whether to sleep based on the highest possible
      zone available (MAX_NR_ZONES - 1).  It then becomes critical that the
      "pgdat balanced" decisions during reclaim and when deciding to sleep are
      the same.  If there is a mismatch, kswapd can stay awake continually
      trying to balance tiny zones.
      
      simoop was used to evaluate it again.  Two of the preparation patches
      regressed the workload so they are included as the second set of
      results.  Otherwise this patch looks artifically excellent
      
                                               4.11.0-rc1            4.11.0-rc1            4.11.0-rc1
                                                  vanilla              clear-v2          keepawake-v2
      Amean    p50-Read             21670074.18 (  0.00%) 19786774.76 (  8.69%) 22668332.52 ( -4.61%)
      Amean    p95-Read             25456267.64 (  0.00%) 24101956.27 (  5.32%) 26738688.00 ( -5.04%)
      Amean    p99-Read             29369064.73 (  0.00%) 27691872.71 (  5.71%) 30991404.52 ( -5.52%)
      Amean    p50-Write                1390.30 (  0.00%)     1011.91 ( 27.22%)      924.91 ( 33.47%)
      Amean    p95-Write              412901.57 (  0.00%)    34874.98 ( 91.55%)     1362.62 ( 99.67%)
      Amean    p99-Write             6668722.09 (  0.00%)   575449.60 ( 91.37%)    16854.04 ( 99.75%)
      Amean    p50-Allocation          78714.31 (  0.00%)    84246.26 ( -7.03%)    74729.74 (  5.06%)
      Amean    p95-Allocation         175533.51 (  0.00%)   400058.43 (-127.91%)   101609.74 ( 42.11%)
      Amean    p99-Allocation         247003.02 (  0.00%) 10905600.00 (-4315.17%)   125765.57 ( 49.08%)
      
      With this patch on top, write and allocation latencies are massively
      improved.  The read latencies are slightly impaired but it's worth
      noting that this is mostly due to the IO scheduler and not directly
      related to reclaim.  The vmstats are a bit of a mix but the relevant
      ones are as follows;
      
                                  4.10.0-rc7  4.10.0-rc7  4.10.0-rc7
                                mmots-20170209 clear-v1r25keepawake-v1r25
      Swap Ins                             0           0           0
      Swap Outs                            0         608           0
      Direct pages scanned           6910672     3132699     6357298
      Kswapd pages scanned          57036946    82488665    56986286
      Kswapd pages reclaimed        55993488    63474329    55939113
      Direct pages reclaimed         6905990     2964843     6352115
      Kswapd efficiency                  98%         76%         98%
      Kswapd velocity              12494.375   17597.507   12488.065
      Direct efficiency                  99%         94%         99%
      Direct velocity               1513.835     668.306    1393.148
      Page writes by reclaim           0.000 4410243.000       0.000
      Page writes file                     0     4409635           0
      Page writes anon                     0         608           0
      Page reclaim immediate         1036792    14175203     1042571
      
                                  4.11.0-rc1  4.11.0-rc1  4.11.0-rc1
                                     vanilla  clear-v2  keepawake-v2
      Swap Ins                             0          12           0
      Swap Outs                            0         838           0
      Direct pages scanned           6579706     3237270     6256811
      Kswapd pages scanned          61853702    79961486    54837791
      Kswapd pages reclaimed        60768764    60755788    53849586
      Direct pages reclaimed         6579055     2987453     6256151
      Kswapd efficiency                  98%         75%         98%
      Page writes by reclaim           0.000 4389496.000       0.000
      Page writes file                     0     4388658           0
      Page writes anon                     0         838           0
      Page reclaim immediate         1073573    14473009      982507
      
      Swap-outs are equivalent to baseline.
      
      Direct reclaim is reduced but not eliminated.  It's worth noting that
      there are two periods of direct reclaim for this workload.  The first is
      when it switches from preparing the files for the actual test itself.
      It's a lot of file IO followed by a lot of allocs that reclaims heavily
      for a brief window.  While direct reclaim is lower with clear-v2, it is
      due to kswapd scanning aggressively and trying to reclaim the world
      which is not the right thing to do.  With the patches applied, there is
      still direct reclaim but the phase change from "creating work files" to
      starting multiple threads that allocate a lot of anonymous memory faster
      than kswapd can reclaim.
      
      Scanning/reclaim efficiency is restored by this patch.
      
      Page writes from reclaim context are back at 0 which is ideal.
      
      Pages immediately reclaimed after IO completes is slightly improved but
      it is expected this will vary slightly.
      
      On UMA, there is almost no change so this is not expected to be a
      universal win.
      
      [mgorman@suse.de: fix ->kswapd_classzone_idx initialization]
        Link: http://lkml.kernel.org/r/20170406174538.5msrznj6nt6qpbx5@suse.de
      Link: http://lkml.kernel.org/r/20170309075657.25121-4-mgorman@techsingularity.netSigned-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shantanu Goel <sgoel01@yahoo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e716f2eb
    • Mel Gorman's avatar
      mm, vmscan: only clear pgdat congested/dirty/writeback state when balanced · 631b6e08
      Mel Gorman authored
      A pgdat tracks if recent reclaim encountered too many dirty, writeback
      or congested pages.  The flags control whether kswapd writes pages back
      from reclaim context, tags pages for immediate reclaim when IO
      completes, whether processes block on wait_iff_congested and whether
      kswapd blocks when too many pages marked for immediate reclaim are
      encountered.
      
      The state is cleared in a check function with side-effects.  With the
      patch "mm, vmscan: fix zone balance check in prepare_kswapd_sleep", the
      timing of when the bits get cleared changed.  Due to the way the check
      works, it'll clear the bits if ZONE_DMA is balanced for a GFP_DMA
      allocation because it does not account for lowmem reserves properly.
      
      For the simoop workload, kswapd is not stalling when it should due to
      the premature clearing, writing pages from reclaim context like crazy
      and generally being unhelpful.
      
      This patch resets the pgdat bits related to page reclaim only when
      kswapd is going to sleep.  The comparison with simoop is then
      
                                               4.11.0-rc1            4.11.0-rc1            4.11.0-rc1
                                                  vanilla           fixcheck-v2              clear-v2
      Amean    p50-Read             21670074.18 (  0.00%) 20464344.18 (  5.56%) 19786774.76 (  8.69%)
      Amean    p95-Read             25456267.64 (  0.00%) 25721423.64 ( -1.04%) 24101956.27 (  5.32%)
      Amean    p99-Read             29369064.73 (  0.00%) 30174230.76 ( -2.74%) 27691872.71 (  5.71%)
      Amean    p50-Write                1390.30 (  0.00%)     1395.28 ( -0.36%)     1011.91 ( 27.22%)
      Amean    p95-Write              412901.57 (  0.00%)    37737.74 ( 90.86%)    34874.98 ( 91.55%)
      Amean    p99-Write             6668722.09 (  0.00%)   666489.04 ( 90.01%)   575449.60 ( 91.37%)
      Amean    p50-Allocation          78714.31 (  0.00%)    86286.22 ( -9.62%)    84246.26 ( -7.03%)
      Amean    p95-Allocation         175533.51 (  0.00%)   351812.27 (-100.42%)   400058.43 (-127.91%)
      Amean    p99-Allocation         247003.02 (  0.00%)  6291171.56 (-2447.00%) 10905600.00 (-4315.17%)
      
      Read latency is improved, write latency is mostly improved but
      allocation latency is regressed.  kswapd is still reclaiming
      inefficiently, pages are being written back from writeback context and a
      host of other issues.  However, given the change, it needed to be
      spelled out why the side-effect was moved.
      
      Link: http://lkml.kernel.org/r/20170309075657.25121-3-mgorman@techsingularity.netSigned-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shantanu Goel <sgoel01@yahoo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      631b6e08
    • Shantanu Goel's avatar
      mm, vmscan: fix zone balance check in prepare_kswapd_sleep · 333b0a45
      Shantanu Goel authored
      Patch series "Reduce amount of time kswapd sleeps prematurely", v2.
      
      The series is unusual in that the first patch fixes one problem and
      introduces other issues that are noted in the changelog.  Patch 2 makes
      a minor modification that is worth considering on its own but leaves the
      kernel in a state where it behaves badly.  It's not until patch 3 that
      there is an improvement against baseline.
      
      This was mostly motivated by examining Chris Mason's "simoop" benchmark
      which puts the VM under similar pressure to HADOOP.  It has been
      reported that the benchmark has regressed severely during the last
      number of releases.  While I cannot reproduce all the same problems
      Chris experienced due to hardware limitations, there was a number of
      problems on a 2-socket machine with a single disk.
      
      simoop latencies
                                               4.11.0-rc1            4.11.0-rc1
                                                  vanilla          keepawake-v2
      Amean    p50-Read             21670074.18 (  0.00%) 22668332.52 ( -4.61%)
      Amean    p95-Read             25456267.64 (  0.00%) 26738688.00 ( -5.04%)
      Amean    p99-Read             29369064.73 (  0.00%) 30991404.52 ( -5.52%)
      Amean    p50-Write                1390.30 (  0.00%)      924.91 ( 33.47%)
      Amean    p95-Write              412901.57 (  0.00%)     1362.62 ( 99.67%)
      Amean    p99-Write             6668722.09 (  0.00%)    16854.04 ( 99.75%)
      Amean    p50-Allocation          78714.31 (  0.00%)    74729.74 (  5.06%)
      Amean    p95-Allocation         175533.51 (  0.00%)   101609.74 ( 42.11%)
      Amean    p99-Allocation         247003.02 (  0.00%)   125765.57 ( 49.08%)
      
      These are latencies.  Read/write are threads reading fixed-size random
      blocks from a simulated database.  The allocation latency is mmaping and
      faulting regions of memory.  The p50, 95 and p99 reports the worst
      latencies for 50% of the samples, 95% and 99% respectively.
      
      For example, the report indicates that while the test was running 99% of
      writes completed 99.75% faster.  It's worth noting that on a UMA machine
      that no difference in performance with simoop was observed so milage
      will vary.
      
      It's noted that there is a slight impact to read latencies but it's
      mostly due to IO scheduler decisions and offset by the large reduction
      in other latencies.
      
      This patch (of 3):
      
      The check in prepare_kswapd_sleep needs to match the one in
      balance_pgdat since the latter will return as soon as any one of the
      zones in the classzone is above the watermark.  This is specially
      important for higher order allocations since balance_pgdat will
      typically reset the order to zero relying on compaction to create the
      higher order pages.  Without this patch, prepare_kswapd_sleep fails to
      wake up kcompactd since the zone balance check fails.
      
      It was first reported against 4.9.7 that kswapd is failing to wake up
      kcompactd due to a mismatch in the zone balance check between
      balance_pgdat() and prepare_kswapd_sleep().
      
      balance_pgdat() returns as soon as a single zone satisfies the
      allocation but prepare_kswapd_sleep() requires all zones to do +the
      same.  This causes prepare_kswapd_sleep() to never succeed except in the
      order == 0 case and consequently, wakeup_kcompactd() is never called.
      For the machine that originally motivated this patch, the state of
      compaction from /proc/vmstat looked this way after a day and a half +of
      uptime:
      
      compact_migrate_scanned 240496
      compact_free_scanned 76238632
      compact_isolated 123472
      compact_stall 1791
      compact_fail 29
      compact_success 1762
      compact_daemon_wake 0
      
      After applying the patch and about 10 hours of uptime the state looks
      like this:
      
      compact_migrate_scanned 59927299
      compact_free_scanned 2021075136
      compact_isolated 640926
      compact_stall 4
      compact_fail 2
      compact_success 2
      compact_daemon_wake 5160
      
      Further notes from Mel that motivated him to pick this patch up and
      resend it;
      
      It was observed for the simoop workload (pressures the VM similar to
      HADOOP) that kswapd was failing to keep ahead of direct reclaim.  The
      investigation noted that there was a need to rationalise kswapd
      decisions to reclaim with kswapd decisions to sleep.  With this patch on
      a 2-socket box, there was a 49% reduction in direct reclaim scanning.
      
      However, the impact otherwise is extremely negative.  Kswapd reclaim
      efficiency dropped from 98% to 76%.  simoop has three latency-related
      metrics for read, write and allocation (an anonymous mmap and fault).
      
                                               4.11.0-rc1            4.11.0-rc1
                                                  vanilla           fixcheck-v2
      Amean    p50-Read             21670074.18 (  0.00%) 20464344.18 (  5.56%)
      Amean    p95-Read             25456267.64 (  0.00%) 25721423.64 ( -1.04%)
      Amean    p99-Read             29369064.73 (  0.00%) 30174230.76 ( -2.74%)
      Amean    p50-Write                1390.30 (  0.00%)     1395.28 ( -0.36%)
      Amean    p95-Write              412901.57 (  0.00%)    37737.74 ( 90.86%)
      Amean    p99-Write             6668722.09 (  0.00%)   666489.04 ( 90.01%)
      Amean    p50-Allocation          78714.31 (  0.00%)    86286.22 ( -9.62%)
      Amean    p95-Allocation         175533.51 (  0.00%)   351812.27 (-100.42%)
      Amean    p99-Allocation         247003.02 (  0.00%)  6291171.56 (-2447.00%)
      
      Of greater concern is that the patch causes swapping and page writes
      from kswapd context rose from 0 pages to 4189753 pages during the hour
      the workload ran for.  By and large, the patch has very bad behaviour
      but easily missed as the impact on a UMA machine is negligible.
      
      This patch is included with the data in case a bisection leads to this
      area.  This patch is also a pre-requisite for the rest of the series.
      
      Link: http://lkml.kernel.org/r/20170309075657.25121-2-mgorman@techsingularity.netSigned-off-by: default avatarShantanu Goel <sgoel01@yahoo.com>
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      333b0a45
    • Minchan Kim's avatar
      mm: do not use double negation for testing page flags · 2948be5a
      Minchan Kim authored
      With the discussion[1], I found it seems there are every PageFlags
      functions return bool at this moment so we don't need double negation
      any more.  Although it's not a problem to keep it, it makes future users
      confused to use double negation for them, too.
      
      Remove such possibility.
      
      [1] https://marc.info/?l=linux-kernel&m=148881578820434
      
      Frankly sepaking, I like every PageFlags to return bool instead of int.
      It will make it clear.  AFAIR, Chen Gang had tried it but don't know why
      it was not merged at that time.
      
      http://lkml.kernel.org/r/1469336184-1904-1-git-send-email-chengang@emindsoft.com.cn
      
      Link: http://lkml.kernel.org/r/1488868597-32222-1-git-send-email-minchan@kernel.orgSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2948be5a
    • Kees Cook's avatar
      mm: remove rodata_test_data export, add pr_fmt · 056b9d8a
      Kees Cook authored
      Since commit 3ad38ceb ("x86/mm: Remove CONFIG_DEBUG_NX_TEST"),
      nothing is using the exported rodata_test_data variable, so drop the
      export.
      
      This additionally updates the pr_fmt to avoid redundant strings and
      adjusts some whitespace.
      
      Link: http://lkml.kernel.org/r/20170307005313.GA85809@beastSigned-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Jinbum Park <jinb.park7@gmail.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      056b9d8a
    • Matthew Wilcox's avatar
      mm: tighten up the fault path a little · 9ab2594f
      Matthew Wilcox authored
      The round_up() macro generates a couple of unnecessary instructions
      in this usage:
      
          48cd:       49 8b 47 50             mov    0x50(%r15),%rax
          48d1:       48 83 e8 01             sub    $0x1,%rax
          48d5:       48 0d ff 0f 00 00       or     $0xfff,%rax
          48db:       48 83 c0 01             add    $0x1,%rax
          48df:       48 c1 f8 0c             sar    $0xc,%rax
          48e3:       48 39 c3                cmp    %rax,%rbx
          48e6:       72 2e                   jb     4916 <filemap_fault+0x96>
      
      If we change round_up() to ((x) + __round_mask(x, y)) & ~__round_mask(x, y)
      then GCC can see through it and remove the mask (because that would be
      dead code given the subsequent shift):
      
          48cd:       49 8b 47 50             mov    0x50(%r15),%rax
          48d1:       48 05 ff 0f 00 00       add    $0xfff,%rax
          48d7:       48 c1 e8 0c             shr    $0xc,%rax
          48db:       48 39 c3                cmp    %rax,%rbx
          48de:       72 2e                   jb     490e <filemap_fault+0x8e>
      
      But that's problematic because we'd evaluate 'y' twice.  Converting
      round_up into an inline function prevents it from being used in other
      definitions.  The easiest thing to do is just change these three usages
      of round_up to use DIV_ROUND_UP.  Also add an unlikely() because GCC's
      heuristic is wrong in this case.
      
      Link: http://lkml.kernel.org/r/20170207192812.5281-1-willy@infradead.orgSigned-off-by: default avatarMatthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9ab2594f
    • Michal Hocko's avatar
      jbd2: make the whole kjournald2 kthread NOFS safe · eb52da3f
      Michal Hocko authored
      kjournald2 is central to the transaction commit processing.  As such any
      potential allocation from this kernel thread has to be GFP_NOFS.  Make
      sure to mark the whole kernel thread GFP_NOFS by the memalloc_nofs_save.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20170306131408.9828-8-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Suggested-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      eb52da3f
    • Michal Hocko's avatar
      jbd2: mark the transaction context with the scope GFP_NOFS context · 81378da6
      Michal Hocko authored
      now that we have memalloc_nofs_{save,restore} api we can mark the whole
      transaction context as implicitly GFP_NOFS.  All allocations will
      automatically inherit GFP_NOFS this way.  This means that we do not have
      to mark any of those requests with GFP_NOFS and moreover all the
      ext4_kv[mz]alloc(GFP_NOFS) are also safe now because even the hardcoded
      GFP_KERNEL allocations deep inside the vmalloc will be NOFS now.
      
      [akpm@linux-foundation.org: tweak comments]
      Link: http://lkml.kernel.org/r/20170306131408.9828-7-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      81378da6
    • Michal Hocko's avatar
      xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio* · 9ba1fb2c
      Michal Hocko authored
      kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
      API to prevent from reclaim recursion into the fs because vmalloc can
      invoke unconditional GFP_KERNEL allocations and these functions might be
      called from the NOFS contexts.  The memalloc_noio_save will enforce
      GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
      unnecessary.  Let's use memalloc_nofs_{save,restore} instead as it
      should provide exactly what we need here - implicit GFP_NOFS context.
      
      Link: http://lkml.kernel.org/r/20170306131408.9828-6-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9ba1fb2c
    • Michal Hocko's avatar
      mm: introduce memalloc_nofs_{save,restore} API · 7dea19f9
      Michal Hocko authored
      GFP_NOFS context is used for the following 5 reasons currently:
      
       - to prevent from deadlocks when the lock held by the allocation
         context would be needed during the memory reclaim
      
       - to prevent from stack overflows during the reclaim because the
         allocation is performed from a deep context already
      
       - to prevent lockups when the allocation context depends on other
         reclaimers to make a forward progress indirectly
      
       - just in case because this would be safe from the fs POV
      
       - silence lockdep false positives
      
      Unfortunately overuse of this allocation context brings some problems to
      the MM.  Memory reclaim is much weaker (especially during heavy FS
      metadata workloads), OOM killer cannot be invoked because the MM layer
      doesn't have enough information about how much memory is freeable by the
      FS layer.
      
      In many cases it is far from clear why the weaker context is even used
      and so it might be used unnecessarily.  We would like to get rid of
      those as much as possible.  One way to do that is to use the flag in
      scopes rather than isolated cases.  Such a scope is declared when really
      necessary, tracked per task and all the allocation requests from within
      the context will simply inherit the GFP_NOFS semantic.
      
      Not only this is easier to understand and maintain because there are
      much less problematic contexts than specific allocation requests, this
      also helps code paths where FS layer interacts with other layers (e.g.
      crypto, security modules, MM etc...) and there is no easy way to convey
      the allocation context between the layers.
      
      Introduce memalloc_nofs_{save,restore} API to control the scope of
      GFP_NOFS allocation context.  This is basically copying
      memalloc_noio_{save,restore} API we have for other restricted allocation
      context GFP_NOIO.  The PF_MEMALLOC_NOFS flag already exists and it is
      just an alias for PF_FSTRANS which has been xfs specific until recently.
      There are no more PF_FSTRANS users anymore so let's just drop it.
      
      PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
      implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO.  memalloc_noio_flags
      is renamed to current_gfp_context because it now cares about both
      PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts.  Xfs code paths preserve
      their semantic.  kmem_flags_convert() doesn't need to evaluate the flag
      anymore.
      
      This patch shouldn't introduce any functional changes.
      
      Let's hope that filesystems will drop direct GFP_NOFS (resp.  ~__GFP_FS)
      usage as much as possible and only use a properly documented
      memalloc_nofs_{save,restore} checkpoints where they are appropriate.
      
      [akpm@linux-foundation.org: fix comment typo, reflow comment]
      Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7dea19f9
    • Michal Hocko's avatar
      xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS · 9070733b
      Michal Hocko authored
      xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
      some time ago.  We would like to make this concept more generic and use
      it for other filesystems as well.  Let's start by giving the flag a more
      generic name PF_MEMALLOC_NOFS which is in line with an exiting
      PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
      contexts.  Replace all PF_FSTRANS usage from the xfs code in the first
      step before we introduce a full API for it as xfs uses the flag directly
      anyway.
      
      This patch doesn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170306131408.9828-4-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9070733b
    • Michal Hocko's avatar
      lockdep: allow to disable reclaim lockup detection · 7e784422
      Michal Hocko authored
      The current implementation of the reclaim lockup detection can lead to
      false positives and those even happen and usually lead to tweak the code
      to silence the lockdep by using GFP_NOFS even though the context can use
      __GFP_FS just fine.
      
      See
      
        http://lkml.kernel.org/r/20160512080321.GA18496@dastard
      
      as an example.
      
        =================================
        [ INFO: inconsistent lock state ]
        4.5.0-rc2+ #4 Tainted: G           O
        ---------------------------------
        inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
        kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
      
        (&xfs_nondir_ilock_class){++++-+}, at: xfs_ilock+0x177/0x200 [xfs]
      
        {RECLAIM_FS-ON-R} state was registered at:
          mark_held_locks+0x79/0xa0
          lockdep_trace_alloc+0xb3/0x100
          kmem_cache_alloc+0x33/0x230
          kmem_zone_alloc+0x81/0x120 [xfs]
          xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
          __xfs_refcount_find_shared+0x75/0x580 [xfs]
          xfs_refcount_find_shared+0x84/0xb0 [xfs]
          xfs_getbmap+0x608/0x8c0 [xfs]
          xfs_vn_fiemap+0xab/0xc0 [xfs]
          do_vfs_ioctl+0x498/0x670
          SyS_ioctl+0x79/0x90
          entry_SYSCALL_64_fastpath+0x12/0x6f
      
               CPU0
               ----
          lock(&xfs_nondir_ilock_class);
          <Interrupt>
            lock(&xfs_nondir_ilock_class);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/543:
      
        stack backtrace:
        CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4
        Call Trace:
         lock_acquire+0xd8/0x1e0
         down_write_nested+0x5e/0xc0
         xfs_ilock+0x177/0x200 [xfs]
         xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
         xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
         evict+0xc5/0x190
         dispose_list+0x39/0x60
         prune_icache_sb+0x4b/0x60
         super_cache_scan+0x14f/0x1a0
         shrink_slab.part.63.constprop.79+0x1e9/0x4e0
         shrink_zone+0x15e/0x170
         kswapd+0x4f1/0xa80
         kthread+0xf2/0x110
         ret_from_fork+0x3f/0x70
      
      To quote Dave:
       "Ignoring whether reflink should be doing anything or not, that's a
        "xfs_refcountbt_init_cursor() gets called both outside and inside
        transactions" lockdep false positive case. The problem here is lockdep
        has seen this allocation from within a transaction, hence a GFP_NOFS
        allocation, and now it's seeing it in a GFP_KERNEL context. Also note
        that we have an active reference to this inode.
      
        So, because the reclaim annotations overload the interrupt level
        detections and it's seen the inode ilock been taken in reclaim
        ("interrupt") context, this triggers a reclaim context warning where
        it thinks it is unsafe to do this allocation in GFP_KERNEL context
        holding the inode ilock..."
      
      This sounds like a fundamental problem of the reclaim lock detection.
      It is really impossible to annotate such a special usecase IMHO unless
      the reclaim lockup detection is reworked completely.  Until then it is
      much better to provide a way to add "I know what I am doing flag" and
      mark problematic places.  This would prevent from abusing GFP_NOFS flag
      which has a runtime effect even on configurations which have lockdep
      disabled.
      
      Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
      skip the current allocation request.
      
      While we are at it also make sure that the radix tree doesn't
      accidentaly override tags stored in the upper part of the gfp_mask.
      
      Link: http://lkml.kernel.org/r/20170306131408.9828-3-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7e784422
    • Nikolay Borisov's avatar
      lockdep: teach lockdep about memalloc_noio_save · 6d7225f0
      Nikolay Borisov authored
      Patch series "scope GFP_NOFS api", v5.
      
      This patch (of 7):
      
      Commit 21caf2fc ("mm: teach mm by current context info to not do I/O
      during memory allocation") added the memalloc_noio_(save|restore)
      functions to enable people to modify the MM behavior by disabling I/O
      during memory allocation.
      
      This was further extended in commit 934f3072 ("mm: clear __GFP_FS
      when PF_MEMALLOC_NOIO is set").
      
      memalloc_noio_* functions prevent allocation paths recursing back into
      the filesystem without explicitly changing the flags for every
      allocation site.
      
      However, lockdep hasn't been keeping up with the changes and it entirely
      misses handling the memalloc_noio adjustments.  Instead, it is left to
      the callers of __lockdep_trace_alloc to call the function after they
      have shaven the respective GFP flags which can lead to false positives:
      
        =================================
         [ INFO: inconsistent lock state ]
         4.10.0-nbor #134 Not tainted
         ---------------------------------
         inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
         fsstress/3365 [HC0[0]:SC0[0]:HE1:SE1] takes:
          (&xfs_nondir_ilock_class){++++?.}, at: xfs_ilock+0x141/0x230
         {IN-RECLAIM_FS-W} state was registered at:
           __lock_acquire+0x62a/0x17c0
           lock_acquire+0xc5/0x220
           down_write_nested+0x4f/0x90
           xfs_ilock+0x141/0x230
           xfs_reclaim_inode+0x12a/0x320
           xfs_reclaim_inodes_ag+0x2c8/0x4e0
           xfs_reclaim_inodes_nr+0x33/0x40
           xfs_fs_free_cached_objects+0x19/0x20
           super_cache_scan+0x191/0x1a0
           shrink_slab+0x26f/0x5f0
           shrink_node+0xf9/0x2f0
           kswapd+0x356/0x920
           kthread+0x10c/0x140
           ret_from_fork+0x31/0x40
         irq event stamp: 173777
         hardirqs last  enabled at (173777): __local_bh_enable_ip+0x70/0xc0
         hardirqs last disabled at (173775): __local_bh_enable_ip+0x37/0xc0
         softirqs last  enabled at (173776): _xfs_buf_find+0x67a/0xb70
         softirqs last disabled at (173774): _xfs_buf_find+0x5db/0xb70
      
         other info that might help us debug this:
          Possible unsafe locking scenario:
      
                CPU0
                ----
           lock(&xfs_nondir_ilock_class);
           <Interrupt>
             lock(&xfs_nondir_ilock_class);
      
          *** DEADLOCK ***
      
         4 locks held by fsstress/3365:
          #0:  (sb_writers#10){++++++}, at: mnt_want_write+0x24/0x50
          #1:  (&sb->s_type->i_mutex_key#12){++++++}, at: vfs_setxattr+0x6f/0xb0
          #2:  (sb_internal#2){++++++}, at: xfs_trans_alloc+0xfc/0x140
          #3:  (&xfs_nondir_ilock_class){++++?.}, at: xfs_ilock+0x141/0x230
      
         stack backtrace:
         CPU: 0 PID: 3365 Comm: fsstress Not tainted 4.10.0-nbor #134
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
         Call Trace:
          kmem_cache_alloc_node_trace+0x3a/0x2c0
          vm_map_ram+0x2a1/0x510
          _xfs_buf_map_pages+0x77/0x140
          xfs_buf_get_map+0x185/0x2a0
          xfs_attr_rmtval_set+0x233/0x430
          xfs_attr_leaf_addname+0x2d2/0x500
          xfs_attr_set+0x214/0x420
          xfs_xattr_set+0x59/0xb0
          __vfs_setxattr+0x76/0xa0
          __vfs_setxattr_noperm+0x5e/0xf0
          vfs_setxattr+0xae/0xb0
          setxattr+0x15e/0x1a0
          path_setxattr+0x8f/0xc0
          SyS_lsetxattr+0x11/0x20
          entry_SYSCALL_64_fastpath+0x23/0xc6
      
      Let's fix this by making lockdep explicitly do the shaving of respective
      GFP flags.
      
      Fixes: 934f3072 ("mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set")
      Link: http://lkml.kernel.org/r/20170306131408.9828-2-mhocko@kernel.orgSigned-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6d7225f0
    • David Rientjes's avatar
      mm, vmstat: suppress pcp stats for unpopulated zones in zoneinfo · 7dfb8bf3
      David Rientjes authored
      After "mm, vmstat: print non-populated zones in zoneinfo",
      /proc/zoneinfo will show unpopulated zones.
      
      The per-cpu pageset statistics are not relevant for unpopulated zones
      and can be potentially lengthy, so supress them when they are not
      interesting.
      
      Also moves lowmem reserve protection information above pcp stats since
      it is relevant for all zones per vm.lowmem_reserve_ratio.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1703061400500.46428@chino.kir.corp.google.comSigned-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7dfb8bf3
    • David Rientjes's avatar
      mm, vmstat: print non-populated zones in zoneinfo · b2bd8598
      David Rientjes authored
      Initscripts can use the information (protection levels) from
      /proc/zoneinfo to configure vm.lowmem_reserve_ratio at boot.
      
      vm.lowmem_reserve_ratio is an array of ratios for each configured zone
      on the system.  If a zone is not populated on an arch, /proc/zoneinfo
      suppresses its output.
      
      This results in there not being a 1:1 mapping between the set of zones
      emitted by /proc/zoneinfo and the zones configured by
      vm.lowmem_reserve_ratio.
      
      This patch shows statistics for non-populated zones in /proc/zoneinfo.
      The zones exist and hold a spot in the vm.lowmem_reserve_ratio array.
      Without this patch, it is not possible to determine which index in the
      array controls which zone if one or more zones on the system are not
      populated.
      
      Remaining users of walk_zones_in_node() are unchanged.  Files such as
      /proc/pagetypeinfo require certain zone data to be initialized properly
      for display, which is not done for unpopulated zones.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1703031451310.98023@chino.kir.corp.google.comSigned-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarAnshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b2bd8598
    • Xishi Qiu's avatar
      mm: use is_migrate_isolate_page() to simplify the code · bbf9ce97
      Xishi Qiu authored
      Use is_migrate_isolate_page() to simplify the code, no functional
      changes.
      
      Link: http://lkml.kernel.org/r/58B94FB1.8020802@huawei.comSigned-off-by: default avatarXishi Qiu <qiuxishi@huawei.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bbf9ce97
    • Xishi Qiu's avatar
      mm: use is_migrate_highatomic() to simplify the code · a6ffdc07
      Xishi Qiu authored
      Introduce two helpers, is_migrate_highatomic() and is_migrate_highatomic_page().
      
      Simplify the code, no functional changes.
      
      [akpm@linux-foundation.org: use static inlines rather than macros, per mhocko]
      Link: http://lkml.kernel.org/r/58B94F15.6060606@huawei.comSigned-off-by: default avatarXishi Qiu <qiuxishi@huawei.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a6ffdc07
    • Huang Ying's avatar
      mm, swap: Fix a race in free_swap_and_cache() · 322b8afe
      Huang Ying authored
      Before using cluster lock in free_swap_and_cache(), the
      swap_info_struct->lock will be held during freeing the swap entry and
      acquiring page lock, so the page swap count will not change when testing
      page information later.  But after using cluster lock, the cluster lock
      (or swap_info_struct->lock) will be held only during freeing the swap
      entry.  So before acquiring the page lock, the page swap count may be
      changed in another thread.  If the page swap count is not 0, we should
      not delete the page from the swap cache.  This is fixed via checking
      page swap count again after acquiring the page lock.
      
      I found the race when I review the code, so I didn't trigger the race
      via a test program.  If the race occurs for an anonymous page shared by
      multiple processes via fork, multiple pages will be allocated and
      swapped in from the swap device for the previously shared one page.
      That is, the user-visible runtime effect is more memory will be used and
      the access latency for the page will be higher, that is, the performance
      regression.
      
      Link: http://lkml.kernel.org/r/20170301143905.12846-1-ying.huang@intel.comSigned-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      322b8afe
    • Johannes Weiner's avatar
      mm: memcontrol: provide shmem statistics · 9a4caf1e
      Johannes Weiner authored
      Cgroups currently don't report how much shmem they use, which can be
      useful data to have, in particular since shmem is included in the
      cache/file item while being reclaimed like anonymous memory.
      
      Add a counter to track shmem pages during charging and uncharging.
      
      Link: http://lkml.kernel.org/r/20170221164343.32252-1-hannes@cmpxchg.orgSigned-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarChris Down <cdown@fb.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9a4caf1e
    • Shaohua Li's avatar
      proc: show MADV_FREE pages info in smaps · cf8496ea
      Shaohua Li authored
      Show MADV_FREE pages info of each vma in smaps.  The interface is for
      diganose or monitoring purpose, userspace could use it to understand
      what happens in the application.  Since userspace could dirty MADV_FREE
      pages without notice from kernel, this interface is the only place we
      can get accurate accounting info about MADV_FREE pages.
      
      [mhocko@kernel.org: update Documentation/filesystems/proc.txt]
      Link: http://lkml.kernel.org/r/89efde633559de1ec07444f2ef0f4963a97a2ce8.1487965799.git.shli@fb.comSigned-off-by: default avatarShaohua Li <shli@fb.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cf8496ea
    • Shaohua Li's avatar
      mm: enable MADV_FREE for swapless system · 93e06c7a
      Shaohua Li authored
      Now MADV_FREE pages can be easily reclaimed even for swapless system.
      We can safely enable MADV_FREE for all systems.
      
      Link: http://lkml.kernel.org/r/155648585589300bfae1d45078e7aebb3d988b87.1487965799.git.shli@fb.comSigned-off-by: default avatarShaohua Li <shli@fb.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      93e06c7a
    • Minchan Kim's avatar
      mm: fix lazyfree BUG_ON check in try_to_unmap_one() · eb94a878
      Minchan Kim authored
      If a page is swapbacked, it means it should be in swapcache in
      try_to_unmap_one's path.
      
      If a page is !swapbacked, it mean it shouldn't be in swapcache in
      try_to_unmap_one's path.
      
      Check both two cases all at once and if it fails, warn and return
      SWAP_FAIL.  Such bug never mean we should shut down the kernel.
      
      [minchan@kernel.org: do not use VM_WARN_ON_ONCE as if condition[
        Link: http://lkml.kernel.org/r/20170309060226.GB854@bbox
      Link: http://lkml.kernel.org/r/20170307055551.GC29458@bboxSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Suggested-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      eb94a878
    • Shaohua Li's avatar
      mm: reclaim MADV_FREE pages · 802a3a92
      Shaohua Li authored
      When memory pressure is high, we free MADV_FREE pages.  If the pages are
      not dirty in pte, the pages could be freed immediately.  Otherwise we
      can't reclaim them.  We put the pages back to anonumous LRU list (by
      setting SwapBacked flag) and the pages will be reclaimed in normal
      swapout way.
      
      We use normal page reclaim policy.  Since MADV_FREE pages are put into
      inactive file list, such pages and inactive file pages are reclaimed
      according to their age.  This is expected, because we don't want to
      reclaim too many MADV_FREE pages before used once pages.
      
      Based on Minchan's original patch
      
      [minchan@kernel.org: clean up lazyfree page handling]
        Link: http://lkml.kernel.org/r/20170303025237.GB3503@bbox
      Link: http://lkml.kernel.org/r/14b8eb1d3f6bf6cc492833f183ac8c304e560484.1487965799.git.shli@fb.comSigned-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      802a3a92
    • Shaohua Li's avatar
      mm: move MADV_FREE pages into LRU_INACTIVE_FILE list · f7ad2a6c
      Shaohua Li authored
      madv()'s MADV_FREE indicate pages are 'lazyfree'.  They are still
      anonymous pages, but they can be freed without pageout.  To distinguish
      these from normal anonymous pages, we clear their SwapBacked flag.
      
      MADV_FREE pages could be freed without pageout, so they pretty much like
      used once file pages.  For such pages, we'd like to reclaim them once
      there is memory pressure.  Also it might be unfair reclaiming MADV_FREE
      pages always before used once file pages and we definitively want to
      reclaim the pages before other anonymous and file pages.
      
      To speed up MADV_FREE pages reclaim, we put the pages into
      LRU_INACTIVE_FILE list.  The rationale is LRU_INACTIVE_FILE list is tiny
      nowadays and should be full of used once file pages.  Reclaiming
      MADV_FREE pages will not have much interfere of anonymous and active
      file pages.  And the inactive file pages and MADV_FREE pages will be
      reclaimed according to their age, so we don't reclaim too many MADV_FREE
      pages too.  Putting the MADV_FREE pages into LRU_INACTIVE_FILE_LIST also
      means we can reclaim the pages without swap support.  This idea is
      suggested by Johannes.
      
      This patch doesn't move MADV_FREE pages to LRU_INACTIVE_FILE list yet to
      avoid bisect failure, next patch will do it.
      
      The patch is based on Minchan's original patch.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.comSigned-off-by: default avatarShaohua Li <shli@fb.com>
      Suggested-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f7ad2a6c
    • Shaohua Li's avatar
      mm: don't assume anonymous pages have SwapBacked flag · d44d363f
      Shaohua Li authored
      There are a few places the code assumes anonymous pages should have
      SwapBacked flag set.  MADV_FREE pages are anonymous pages but we are
      going to add them to LRU_INACTIVE_FILE list and clear SwapBacked flag
      for them.  The assumption doesn't hold any more, so fix them.
      
      Link: http://lkml.kernel.org/r/3945232c0df3dd6c4ef001976f35a95f18dcb407.1487965799.git.shli@fb.comSigned-off-by: default avatarShaohua Li <shli@fb.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d44d363f
    • Shaohua Li's avatar
      mm: delete unnecessary TTU_* flags · a128ca71
      Shaohua Li authored
      Patch series "mm: fix some MADV_FREE issues", v5.
      
      We are trying to use MADV_FREE in jemalloc.  Several issues are found.
      Without solving the issues, jemalloc can't use the MADV_FREE feature.
      
       - Doesn't support system without swap enabled. Because if swap is off,
         we can't or can't efficiently age anonymous pages. And since
         MADV_FREE pages are mixed with other anonymous pages, we can't
         reclaim MADV_FREE pages. In current implementation, MADV_FREE will
         fallback to MADV_DONTNEED without swap enabled. But in our
         environment, a lot of machines don't enable swap. This will prevent
         our setup using MADV_FREE.
      
       - Increases memory pressure. page reclaim bias file pages reclaim
         against anonymous pages. This doesn't make sense for MADV_FREE pages,
         because those pages could be freed easily and refilled with very
         slight penality. Even page reclaim doesn't bias file pages, there is
         still an issue, because MADV_FREE pages and other anonymous pages are
         mixed together. To reclaim a MADV_FREE page, we probably must scan a
         lot of other anonymous pages, which is inefficient. In our test, we
         usually see oom with MADV_FREE enabled and nothing without it.
      
       - Accounting. There are two accounting problems. We don't have a global
         accounting. If the system is abnormal, we don't know if it's a
         problem from MADV_FREE side. The other problem is RSS accounting.
         MADV_FREE pages are accounted as normal anon pages and reclaimed
         lazily, so application's RSS becomes bigger. This confuses our
         workloads. We have monitoring daemon running and if it finds
         applications' RSS becomes abnormal, the daemon will kill the
         applications even kernel can reclaim the memory easily.
      
      To address the first the two issues, we can either put MADV_FREE pages
      into a separate LRU list (Minchan's previous patches and V1 patches), or
      put them into LRU_INACTIVE_FILE list (suggested by Johannes).  The
      patchset use the second idea.  The reason is LRU_INACTIVE_FILE list is
      tiny nowadays and should be full of used once file pages.  So we can
      still efficiently reclaim MADV_FREE pages there without interference
      with other anon and active file pages.  Putting the pages into inactive
      file list also has an advantage which allows page reclaim to prioritize
      MADV_FREE pages and used once file pages.  MADV_FREE pages are put into
      the lru list and clear SwapBacked flag, so PageAnon(page) &&
      !PageSwapBacked(page) will indicate a MADV_FREE pages.  These pages will
      directly freed without pageout if they are clean, otherwise normal swap
      will reclaim them.
      
      For the third issue, the previous post adds global accounting and a
      separate RSS count for MADV_FREE pages.  The problem is we never get
      accurate accounting for MADV_FREE pages.  The pages are mapped to
      userspace, can be dirtied without notice from kernel side.  To get
      accurate accounting, we could write protect the page, but then there is
      extra page fault overhead, which people don't want to pay.  Jemalloc
      guys have concerns about the inaccurate accounting, so this post drops
      the accounting patches temporarily.  The info exported to
      /proc/pid/smaps for MADV_FREE pages are kept, which is the only place we
      can get accurate accounting right now.
      
      This patch (of 6):
      
      Johannes pointed out TTU_LZFREE is unnecessary.  It's true because we
      always have the flag set if we want to do an unmap.  For cases we don't
      do an unmap, the TTU_LZFREE part of code should never run.
      
      Also the TTU_UNMAP is unnecessary.  If no other flags set (for example,
      TTU_MIGRATION), an unmap is implied.
      
      The patch includes Johannes's cleanup and dead TTU_ACTION macro removal
      code
      
      Link: http://lkml.kernel.org/r/4be3ea1bc56b26fd98a54d0a6f70bec63f6d8980.1487965799.git.shli@fb.comSigned-off-by: default avatarShaohua Li <shli@fb.com>
      Suggested-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a128ca71
    • Geliang Tang's avatar
      mm/page-writeback.c: use setup_deferrable_timer · 0a372d09
      Geliang Tang authored
      Use setup_deferrable_timer() instead of init_timer_deferrable() to
      simplify the code.
      
      Link: http://lkml.kernel.org/r/e8e3d4280a34facbc007346f31df833cec28801e.1488070291.git.geliangtang@gmail.comSigned-off-by: default avatarGeliang Tang <geliangtang@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a372d09
    • Johannes Weiner's avatar
      mm: remove unnecessary back-off function when retrying page reclaim · 491d79ae
      Johannes Weiner authored
      The backoff mechanism is not needed.  If we have MAX_RECLAIM_RETRIES
      loops without progress, we'll OOM anyway; backing off might cut one or
      two iterations off that in the rare OOM case.  If we have intermittent
      success reclaiming a few pages, the backoff function gets reset also,
      and so is of little help in these scenarios.
      
      We might want a backoff function for when there IS progress, but not
      enough to be satisfactory.  But this isn't that.  Remove it.
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-10-hannes@cmpxchg.orgSigned-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      491d79ae
    • Johannes Weiner's avatar
      Revert "mm, vmscan: account for skipped pages as a partial scan" · 3db65812
      Johannes Weiner authored
      This reverts commit d7f05528.
      
      Now that reclaimability of a node is no longer based on the ratio
      between pages scanned and theoretically reclaimable pages, we can remove
      accounting tricks for pages skipped due to zone constraints.
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-9-hannes@cmpxchg.orgSigned-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3db65812