1. 17 Mar, 2016 40 commits
    • tools/vm/page-types.c: avoid memset() in walk_pfn() when count == 1 · d9b2ddf8
      Naoya Horiguchi authored
      I found that page-types is very slow and my testing shows many timeout
      errors.  Here's an example with a simple program allocating 1000 thps.
      
        $ time ./page-types -p $(pgrep -f test_alloc)
        ...
        real    0m17.201s
        user    0m16.889s
        sys     0m0.312s
      
      Most of the time is spent in memset().  Currently memset() clears the
      whole buffer on every walk_pfn() call, which is inefficient when
      walk_pfn() is called from walk_vma(), because in that case walk_pfn()
      is called once for each pfn.  So this patch limits the zero
      initialization to the first element in that case.
      
        $ time ./page-types.patched -p $(pgrep -f test_alloc)
        ...
        real    0m0.182s
        user    0m0.046s
        sys     0m0.135s
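
      A rough cost model suggests why the per-call memset() dominates.  The
      sketch below is only a back-of-envelope estimate: the 512 base pages per
      THP, the 64Ki-entry buffer and the 8-byte entry size are assumptions made
      for illustration and are not stated in the commit message.

```python
# Back-of-envelope model of the bytes zeroed by walk_pfn() for 1000 THPs.
# All three constants are assumptions for illustration, not values taken
# from the commit message.
PAGES_PER_THP = 512        # assumed: 2 MiB huge page / 4 KiB base page
BUFFER_ENTRIES = 64 << 10  # assumed: entries in the buffer cleared per call
ENTRY_BYTES = 8            # assumed: one 64-bit value per pfn


def bytes_zeroed(calls, entries_per_call):
    """Total bytes cleared when each call zeroes `entries_per_call` entries."""
    return calls * entries_per_call * ENTRY_BYTES


# walk_vma() calls walk_pfn() once per pfn, so count == 1 on every call.
calls = 1000 * PAGES_PER_THP

before = bytes_zeroed(calls, BUFFER_ENTRIES)  # whole buffer on every call
after = bytes_zeroed(calls, 1)                # only the first element

print(f"before: {before / 2**30:.0f} GiB zeroed")
print(f"after:  {after / 2**20:.1f} MiB zeroed")
print(f"ratio:  {before // after}x")
```

      Under these assumptions the unpatched tool zeroes hundreds of GiB for a
      single scan, which is consistent with almost all of the time being spent
      in user mode.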
      
      Fixes: 954e95584579 ("tools/vm/page-types.c: add memory cgroup dumping and filtering")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Suggested-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • powerpc/mm: enable page parallel initialisation · 7f2bd006
      Li Zhang authored
      Parallel initialisation has been enabled for X86, where boot time is
      improved greatly.  On Power8, it is improved greatly for small memory.
      Here is the result from my test on a Power8 platform:
      
      For 4GB of memory, boot time is improved by 59%, from 24.5s to 10s.
      
      For 50GB memory, boot time is improved by 22%, from 56.8s to 43.8s.
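
      The two percentages can be re-derived from the quoted boot times.  This
      is a small arithmetic check, with a hypothetical helper name; the quoted
      59% and 22% correspond to the computed values with the fraction dropped.

```python
def improvement_percent(before_s, after_s):
    """Relative reduction in boot time, in percent."""
    return (before_s - after_s) / before_s * 100.0


small = improvement_percent(24.5, 10.0)  # 4GB case from the text
large = improvement_percent(56.8, 43.8)  # 50GB case from the text

print(f"4GB:  {small:.2f}%")
print(f"50GB: {large:.2f}%")
```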
      Signed-off-by: Li Zhang <zhlcindy@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: meminit: initialise more memory for inode/dentry hash tables in early boot · 987b3095
      Li Zhang authored
      Upstream has supported page parallel initialisation for X86 and the boot
      time is improved greatly.  Some tests have been done for Power.

      Here are the results of my tests with different memory sizes.
      
      * 4GB memory:
          boot time is as the following:
          with patch vs without patch: 10.4s vs 24.5s
          boot time is improved 57%
      * 200GB memory:
          boot time looks the same with and without patches.
          boot time is about 38s
      * 32TB memory:
          boot time looks the same with and without patches
          boot time is about 160s.
          The boot time is much shorter than X86 with 24TB memory.
          From community discussion, it costs about 694s for X86 24T system.
      
      Parallel initialisation defers memory initialisation to kswapd with N
      kthreads, so in theory it should improve performance.

      In testing on X86, performance is improved greatly with huge memory.  But
      on the Power platform, it is improved greatly only with less than 100GB
      of memory.  For huge memory, it is not improved greatly.  But it at
      least saves some time by using several threads, as the following
      information shows (32TB system log):
      
      [   22.648169] node 9 initialised, 16607461 pages in 280ms
      [   22.783772] node 3 initialised, 23937243 pages in 410ms
      [   22.858877] node 6 initialised, 29179347 pages in 490ms
      [   22.863252] node 2 initialised, 29179347 pages in 490ms
      [   22.907545] node 0 initialised, 32049614 pages in 540ms
      [   22.920891] node 15 initialised, 32212280 pages in 550ms
      [   22.923236] node 4 initialised, 32306127 pages in 550ms
      [   22.923384] node 12 initialised, 32314319 pages in 550ms
      [   22.924754] node 8 initialised, 32314319 pages in 550ms
      [   22.940780] node 13 initialised, 33353677 pages in 570ms
      [   22.940796] node 11 initialised, 33353677 pages in 570ms
      [   22.941700] node 5 initialised, 33353677 pages in 570ms
      [   22.941721] node 10 initialised, 33353677 pages in 570ms
      [   22.941876] node 7 initialised, 33353677 pages in 570ms
      [   22.944946] node 14 initialised, 33353677 pages in 570ms
      [   22.946063] node 1 initialised, 33345485 pages in 580ms
      
      This saves about 550*16 ms at least, although that is negligible
      compared with the total boot time of about 160 seconds.  What's more,
      for huge-memory machines the boot time is much shorter on Power than on
      x86, even without the patches.

      So it is still worth enabling this patchset for Power.
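
      The "550*16 ms" estimate can be compared against the per-node figures in
      the log above.  The sketch below assumes that all 16 nodes are
      initialised fully concurrently, which is an idealisation; the timings
      are copied from the log.

```python
# Per-node initialisation times in ms, copied from the 32TB log above.
node_ms = [280, 410, 490, 490, 540,
           550, 550, 550, 550,
           570, 570, 570, 570, 570, 570,
           580]

serial_ms = sum(node_ms)    # cost if the nodes were initialised one by one
parallel_ms = max(node_ms)  # wall time if all nodes run concurrently (assumed)
saved_ms = serial_ms - parallel_ms

print(f"serial:   {serial_ms} ms")
print(f"parallel: {parallel_ms} ms")
print(f"saved:    {saved_ms} ms (rough estimate in the text: {550 * 16} ms)")
```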
      
      This patch (of 2):
      
      This patch is based on Mel Gorman's old patch from the mailing list,
      https://lkml.org/lkml/2015/5/5/280, which was discussed, but the problem
      was instead fixed with a completion that waits for all memory to be
      initialised in page_alloc_init_late().  That was to fix the OOM problem
      on X86 with 24TB memory, which allocates memory in late initialisation.
      But on the Power platform with 32TB memory, it causes a call trace in
      vfs_caches_init->inode_init(), and the inode hash table needs more
      memory.  So this patch allocates 1GB for 0.25TB/node for large systems,
      as mentioned in https://lkml.org/lkml/2015/5/1/627

      This call trace is found on Power with 32TB memory, 1024 CPUs and 16
      nodes.  Currently, it only allocates 2GB*16=32GB for early
      initialisation.  But the dentry cache hash table needs 16GB and the inode
      cache hash table needs 16GB.  So the system does not have enough memory
      for them.  The dmesg log is as follows:
      
        Dentry cache hash table entries: 2147483648 (order: 18,17179869184 bytes)
        vmalloc: allocation failure, allocated 16021913600 of 17179934720 bytes
        swapper/0: page allocation failure: order:0,mode:0x2080020
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.0-0-ppc64
        Call Trace:
          .dump_stack+0xb4/0xb664 (unreliable)
          .warn_alloc_failed+0x114/0x160
          .__vmalloc_area_node+0x1a4/0x2b0
          .__vmalloc_node_range+0xe4/0x110
          .__vmalloc_node+0x40/0x50
          .alloc_large_system_hash+0x134/0x2a4
          .inode_init+0xa4/0xf0
          .vfs_caches_init+0x80/0x144
          .start_kernel+0x40c/0x4e0
          start_here_common+0x20/0x4a4
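
      The sizes in the log and in the description can be cross-checked with a
      few lines of arithmetic.  This is only a sanity check on the quoted
      numbers; it treats "GB" as GiB and derives the page size from the
      reported order, both of which are assumptions.

```python
GIB = 1 << 30

# Values copied from the "Dentry cache hash table entries" line above.
dentry_entries = 2147483648
dentry_bytes = 17179869184
order = 18

bytes_per_entry = dentry_bytes // dentry_entries
implied_page_size = dentry_bytes // (1 << order)  # page size implied by the order

# Figures from the description: 2GB per node on 16 nodes, two 16GB tables.
early_pool = 2 * GIB * 16
needed = 16 * GIB + 16 * GIB
headroom = early_pool - needed

print(f"bytes per hash entry: {bytes_per_entry}")
print(f"implied page size:    {implied_page_size // 1024} KiB")
print(f"dentry table size:    {dentry_bytes // GIB} GiB")
print(f"headroom after both tables: {headroom // GIB} GiB")
```

      With no headroom left after the two hash tables, any other early
      allocation pushes the system over the limit, which matches the vmalloc
      failure shown in the trace.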
      Signed-off-by: Li Zhang <zhlcindy@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: fix deadlock in split_huge_pmd() · 5f737714
      Kirill A. Shutemov authored
      split_huge_pmd() tries to munlock page with munlock_vma_page().  That
      requires the page to be locked.

      If the page is locked by the caller, we would get a deadlock:
      
      	Unable to find swap-space signature
      	INFO: task trinity-c85:1907 blocked for more than 120 seconds.
      	      Not tainted 4.4.0-00032-gf19d0bdced41-dirty #1606
      	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      	trinity-c85     D ffff88084d997608     0  1907    309 0x00000000
      	Call Trace:
      	  schedule+0x9f/0x1c0
      	  schedule_timeout+0x48e/0x600
      	  io_schedule_timeout+0x1c3/0x390
      	  bit_wait_io+0x29/0xd0
      	  __wait_on_bit_lock+0x94/0x140
      	  __lock_page+0x1d4/0x280
      	  __split_huge_pmd+0x5a8/0x10f0
      	  split_huge_pmd_address+0x1d9/0x230
      	  try_to_unmap_one+0x540/0xc70
      	  rmap_walk_anon+0x284/0x810
      	  rmap_walk_locked+0x11e/0x190
      	  try_to_unmap+0x1b1/0x4b0
      	  split_huge_page_to_list+0x49d/0x18a0
      	  follow_page_mask+0xa36/0xea0
      	  SyS_move_pages+0xaf3/0x1570
      	  entry_SYSCALL_64_fastpath+0x12/0x6b
      	2 locks held by trinity-c85/1907:
      	 #0:  (&mm->mmap_sem){++++++}, at:  SyS_move_pages+0x933/0x1570
      	 #1:  (&anon_vma->rwsem){++++..}, at:  split_huge_page_to_list+0x402/0x18a0
      
      I don't think the deadlock is triggerable without the split_huge_page()
      simplification patchset.

      But munlock_vma_page() here is wrong: we want to munlock the page
      unconditionally, with no need for the rmap lookup that munlock_vma_page()
      does.
      
      Let's use clear_page_mlock() instead.  It can be called under ptl.
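
      The shape of the deadlock can be modelled in user space.  The following
      is a toy sketch, not kernel code: a non-re-entrant threading.Lock stands
      in for the page lock, all function names are hypothetical, and a timeout
      replaces the indefinite wait so that the demo terminates.

```python
import threading

# Hypothetical stand-in for the page lock; threading.Lock is not re-entrant.
page_lock = threading.Lock()


def split_pmd_taking_page_lock():
    """Model of a split path that locks the page before munlocking it."""
    # The kernel would block here indefinitely; the timeout keeps the demo finite.
    if not page_lock.acquire(timeout=0.1):
        return "deadlock"
    try:
        return "munlocked"
    finally:
        page_lock.release()


def split_pmd_without_page_lock():
    """Model of a split path that clears the mlock state without the page lock."""
    return "munlocked"


def caller_holding_page_lock(split_pmd):
    """Model of a caller that already holds the page lock."""
    with page_lock:
        return split_pmd()


old_result = caller_holding_page_lock(split_pmd_taking_page_lock)
new_result = caller_holding_page_lock(split_pmd_without_page_lock)
print(old_result)
print(new_result)
```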
      
      Fixes: e90309c9 ("thp: allow mlocked THP again")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers · fec89c10
      Kirill A. Shutemov authored
      The freeze_page() and unfreeze_page() helpers have evolved into rather
      complex beasts.  It would be nice to cut the complexity of this code.
      
      This patch rewrites freeze_page() using standard try_to_unmap().
      unfreeze_page() is rewritten with remove_migration_ptes().
      
      The result is much simpler.
      
      But the new variant is somewhat slower for PTE-mapped THPs.  The current
      helpers iterate over the VMAs the compound page is mapped to, and then
      over the ptes within each VMA.  The new helpers iterate over small pages,
      then over the VMAs each small page is mapped to, and only then find the
      relevant pte.

      We have a shortcut for PMD-mapped THP: we directly install migration
      entries on PMD split.

      I don't think the slowdown is critical, considering how much simpler
      the result is and that split_huge_page() is quite rare nowadays.  It only
      happens due to memory pressure or migration.
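
      The difference between the two iteration orders can be illustrated with
      a toy model.  This sketch assumes 512 small pages per compound page and
      that every VMA maps all of them; the function names and the notion of
      counting "rmap walks" are illustrative only and do not mirror the kernel
      code.

```python
SUBPAGES = 512  # assumed: small pages per compound page


def old_order(vmas):
    """Per VMA mapping the compound page, then per pte within that VMA."""
    visits = [(vma, idx) for vma in vmas for idx in range(SUBPAGES)]
    rmap_walks = 1  # a single pass over the VMAs of the compound page
    return visits, rmap_walks


def new_order(vmas):
    """Per small page, then per VMA mapping it, then the relevant pte."""
    visits = [(vma, idx) for idx in range(SUBPAGES) for vma in vmas]
    rmap_walks = SUBPAGES  # one rmap walk for every small page
    return visits, rmap_walks


old_visits, old_walks = old_order(["vma_a", "vma_b"])
new_visits, new_walks = new_order(["vma_a", "vma_b"])

print(f"ptes visited: old={len(old_visits)} new={len(new_visits)}")
print(f"rmap walks:   old={old_walks} new={new_walks}")
```

      Both orders touch the same ptes in this model; the new order simply
      repeats the VMA lookup once per small page, which is where the extra
      cost for PTE-mapped THPs would come from.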
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make remove_migration_ptes() beyond mm/migration.c · e388466d
      Kirill A. Shutemov authored
      Make remove_migration_ptes() available to be used in split_huge_page().
      
      New parameter 'locked' added: as with try_to_unmap() we need a way to
      indicate that the caller holds the rmap lock.

      We also shouldn't try to mlock() pte-mapped huge pages: pte-mapped THP
      pages are never mlocked.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rmap: extend try_to_unmap() to be usable by split_huge_page() · 2a52bcbc
      Kirill A. Shutemov authored
      Add support for two ttu_flags:
      
        - TTU_SPLIT_HUGE_PMD would split PMD if it's there, before trying to
          unmap page;
      
        - TTU_RMAP_LOCKED indicates that caller holds relevant rmap lock;
      
      Also, change rwc->done to !page_mapcount() instead of !page_mapped().
      try_to_unmap() works on pte level, so we are really interested in the
      mappedness of this small page rather than of the compound page it's a
      part of.
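
      The distinction between the two "done" checks can be shown with a toy
      model.  The helpers below borrow the kernel names for readability but
      are simplified stand-ins operating on a plain list of per-small-page map
      counts; the real helpers work on struct page.

```python
def page_mapped(subpage_mapcounts):
    """Compound-page view: mapped if any small page is still mapped."""
    return any(count > 0 for count in subpage_mapcounts)


def page_mapcount(subpage_mapcounts, index):
    """Small-page view: number of mappings of one small page only."""
    return subpage_mapcounts[index]


mapcounts = [0, 2, 1, 0]  # toy compound page made of four small pages
target = 0                # the small page that was just unmapped

done_old = not page_mapped(mapcounts)            # siblings are still mapped
done_new = not page_mapcount(mapcounts, target)  # this small page is unmapped

print(f"done with !page_mapped():   {done_old}")
print(f"done with !page_mapcount(): {done_new}")
```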
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rmap: introduce rmap_walk_locked() · b9773199
      Kirill A. Shutemov authored
      This patchset rewrites freeze_page() and unfreeze_page() using
      try_to_unmap() and remove_migration_ptes().  Result is much simpler, but
      somewhat slower.
      
      Migration 8GiB worth of PMD-mapped THP:
      
        Baseline	20.21 +/- 0.393
        Patched	20.73 +/- 0.082
        Slowdown	1.03x
      
      It's 3% slower, compared to 14% in v1.  I don't think it should be a
      stopper.
      
      Splitting of PTE-mapped pages slowed more.  But this is not a common
      case.
      
      Migration 8GiB worth of PTE-mapped THP:
      
        Baseline	20.39 +/- 0.225
        Patched	22.43 +/- 0.496
        Slowdown	1.10x
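
      The slowdown factors in the two tables above follow directly from the
      baseline and patched means.  A minimal check, ignoring the quoted
      +/- spread:

```python
def slowdown(baseline, patched):
    """Ratio of patched to baseline time; values above 1.0 mean slower."""
    return patched / baseline


first = slowdown(20.21, 20.73)   # first table
second = slowdown(20.39, 22.43)  # second table

print(f"first table:  {first:.2f}x")
print(f"second table: {second:.2f}x")
```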
      
      rmap_walk_locked() is the same as rmap_walk(), but the caller takes care
      of the relevant rmap lock.
      
      This is preparation for switching THP splitting from custom rmap walk in
      freeze_page()/unfreeze_page() to the generic one.
      
      There is no support for KSM pages for now: not clear which lock is
      implied.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: ZONE_DEVICE depends on SPARSEMEM_VMEMMAP · 99490f16
      Dan Williams authored
      The primary use case for devm_memremap_pages() is to allocate a memmap
      array from persistent memory.  That capability requires vmem_altmap which
      requires SPARSEMEM_VMEMMAP.
      
      Also, without SPARSEMEM_VMEMMAP the addition of ZONE_DEVICE expands
      ZONES_WIDTH and triggers the:
      
      "Unfortunate NUMA and NUMA Balancing config, growing page-frame for
      last_cpupid."
      
      ...warning in mm/memory.c.  SPARSEMEM_VMEMMAP=n && ZONE_DEVICE=y is not
      a configuration we should worry about supporting.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove VM_FAULT_MINOR · 0e8fb931
      Jan Kara authored
      The define has a comment from Nick Piggin from 2007:
      
       /* For backwards compat. Remove me quickly. */
      
      I guess 9 years should not be too hurried a sense of 'quickly', even by
      kernel measures.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: percpu: use pr_fmt to prefix output · 870d4b12
      Joe Perches authored
      Use the normal mechanism to make the logging output consistently
      "percpu:" instead of a mix of "PERCPU:" and "percpu:"
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: convert printk(KERN_<LEVEL> to pr_<level> · 1170532b
      Joe Perches authored
      Most of the mm subsystem uses pr_<level> so make it consistent.
      
      Miscellanea:
      
       - Realign arguments
       - Add missing newline to format
       - kmemleak-test.c has a "kmemleak: " prefix added to the
         "Kmemleak testing" logging message via pr_fmt
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: coalesce split strings · 756a025f
      Joe Perches authored
      Kernel style prefers a single string over split strings when the string is
      'user-visible'.
      
      Miscellanea:
      
       - Add a missing newline
       - Realign arguments
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: convert pr_warning to pr_warn · 598d8091
      Joe Perches authored
      There is a mixture of pr_warning and pr_warn uses in mm.  Use pr_warn
      consistently.
      
      Miscellanea:
      
       - Coalesce formats
       - Realign arguments
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: exclude ZONE_DEVICE from GFP_ZONE_TABLE · b11a7b94
      Dan Williams authored
      ZONE_DEVICE (merged in 4.3) and ZONE_CMA (proposed) are examples of new
      mm zones that are bumping up against the current maximum limit of 4
      zones, i.e.  2 bits in page->flags for the GFP_ZONE_TABLE.
      
      The GFP_ZONE_TABLE poses an interesting constraint since
      include/linux/gfp.h gets included by the 32-bit portion of a 64-bit
      build.  We need to be careful to only build the table for zones that
      have a corresponding gfp_t flag.  GFP_ZONES_SHIFT is introduced for this
      purpose.  This patch does not attempt to solve the problem of adding a
      new zone that also has a corresponding GFP_ flag.
      
      Vlastimil points out that ZONE_DEVICE, by depending on x86_64 and
      SPARSEMEM_VMEMMAP, implies that SECTIONS_WIDTH is zero.  In other words,
      even though ZONE_DEVICE does not fit in GFP_ZONE_TABLE, it is free to
      consume another bit in page->flags (expand ZONES_WIDTH) with room to
      spare.
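
      The bit counts mentioned above can be reproduced with a short
      calculation.  This sketch assumes that the number of bits is simply the
      smallest width able to index that many zones; the helper and variable
      names are hypothetical.

```python
def bits_for(count):
    """Bits needed to index `count` distinct values (0 for a single value)."""
    return (count - 1).bit_length()


GFP_TABLE_ZONES = 4                      # limit quoted in the text
zones_with_device = GFP_TABLE_ZONES + 1  # the same zones plus ZONE_DEVICE

gfp_bits = bits_for(GFP_TABLE_ZONES)
zone_bits = bits_for(zones_with_device)

print(f"bits to index {GFP_TABLE_ZONES} zones: {gfp_bits}")
print(f"bits to index {zones_with_device} zones: {zone_bits}")
```

      Keeping ZONE_DEVICE out of the table leaves the table at two bits per
      entry, while the zone field in page->flags grows by one bit.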
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931
      Fixes: 033fbae9 ("mm: ZONE_DEVICE for "device memory"")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Mark <markk@clara.co.uk>
      Reported-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: cleanup css_reset callback · d334c9bc
      Vladimir Davydov authored
      - Do not take memcg_limit_mutex for resetting limits - the cgroup cannot
        be altered from userspace anymore, so no need to protect them.

      - Use plain page_counter_limit() for resetting ->memory and ->memsw
        limits instead of mem_cgroup_resize_* helpers - we enlarge the limits,
        so there is no need for special handling.
      
      - Reset ->swap and ->tcpmem limits as well.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory hotplug: print debug message in the proper way for online_pages · e33e33b4
      Chen Yucong authored
      online_pages() simply returns an error value if
      memory_notify(MEM_GOING_ONLINE, &arg) returns a value that is not what we
      want for successfully onlining target pages.  This patch aims to print
      more failure information in online_pages(), like offline_pages() does.

      This patch also converts printk(KERN_<LEVEL>) to pr_<level>(), and changes
      __offline_pages() to not print failure information with KERN_INFO,
      according to David Rientjes's suggestion[1].
      
      [1] https://lkml.org/lkml/2016/2/24/1094

      Signed-off-by: Chen Yucong <slaoub@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove __GFP_NOFAIL is deprecated comment · 0f352e53
      Michal Hocko authored
      Commit 64775719 ("mm: clarify __GFP_NOFAIL deprecation status") was
      incomplete and didn't remove the comment about __GFP_NOFAIL being
      deprecated in buffered_rmqueue.
      
      Let's get rid of this leftover but keep the WARN_ON_ONCE for order > 1,
      because we should really discourage the use of __GFP_NOFAIL with
      higher-order allocations: those are just too subtle.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Nikolay Borisov <kernel@kyup.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_ref: add tracepoint to track down page reference manipulation · 95813b8f
      Joonsoo Kim authored
      CMA allocation should be guaranteed to succeed by definition, but,
      unfortunately, it sometimes fails.  It is hard to track down the
      problem, because it is related to page reference manipulation and we
      don't have any facility to analyze it.

      This patch adds tracepoints to track down page reference manipulation.
      With it, we can find the exact reason for a failure and fix the problem.
      Following is an example of tracepoint output.  (Note: this example is
      from a stale version that prints flags as a number.  The recent version
      prints them as a human-readable string.)
      
      <...>-9018  [004]    92.678375: page_ref_set:         pfn=0x17ac9 flags=0x0 count=1 mapcount=0 mapping=(nil) mt=4 val=1
      <...>-9018  [004]    92.678378: kernel_stack:
       => get_page_from_freelist (ffffffff81176659)
       => __alloc_pages_nodemask (ffffffff81176d22)
       => alloc_pages_vma (ffffffff811bf675)
       => handle_mm_fault (ffffffff8119e693)
       => __do_page_fault (ffffffff810631ea)
       => trace_do_page_fault (ffffffff81063543)
       => do_async_page_fault (ffffffff8105c40a)
       => async_page_fault (ffffffff817581d8)
      [snip]
      <...>-9018  [004]    92.678379: page_ref_mod:         pfn=0x17ac9 flags=0x40048 count=2 mapcount=1 mapping=0xffff880015a78dc1 mt=4 val=1
      [snip]
      ...
      ...
      <...>-9131  [001]    93.174468: test_pages_isolated:  start_pfn=0x17800 end_pfn=0x17c00 fin_pfn=0x17ac9 ret=fail
      [snip]
      <...>-9018  [004]    93.174843: page_ref_mod_and_test: pfn=0x17ac9 flags=0x40068 count=0 mapcount=0 mapping=0xffff880015a78dc1 mt=4 val=-1 ret=1
       => release_pages (ffffffff8117c9e4)
       => free_pages_and_swap_cache (ffffffff811b0697)
       => tlb_flush_mmu_free (ffffffff81199616)
       => tlb_finish_mmu (ffffffff8119a62c)
       => exit_mmap (ffffffff811a53f7)
       => mmput (ffffffff81073f47)
       => do_exit (ffffffff810794e9)
       => do_group_exit (ffffffff81079def)
       => SyS_exit_group (ffffffff81079e74)
       => entry_SYSCALL_64_fastpath (ffffffff817560b6)
      
      This output shows that the problem comes from the exit path.  In the exit
      path, to improve performance, pages are not freed immediately.  They are
      gathered and processed in batches.  During this process, migration is
      not possible and CMA allocation fails.  This problem is hard to find
      without this page reference tracepoint facility.

      Enabling this feature bloats kernel text by 30 KB in my configuration.
      
          text     data      bss       dec     hex  filename
      12127327  2243616  1507328  15878271  f2487f  vmlinux_disabled
      12157208  2258880  1507328  15923416  f2f8d8  vmlinux_enabled
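
      The size figures above are internally consistent, which a short check
      confirms.  The numbers are copied from the table; the variable names are
      illustrative.

```python
# (text, data, bss, dec, hex) rows copied from the size output above.
rows = {
    "vmlinux_disabled": (12127327, 2243616, 1507328, 15878271, "f2487f"),
    "vmlinux_enabled": (12157208, 2258880, 1507328, 15923416, "f2f8d8"),
}

# Each row: the three sections should add up to "dec", and "hex" should match.
consistent = all(
    text + data + bss == dec and int(hex_total, 16) == dec
    for text, data, bss, dec, hex_total in rows.values()
)

text_growth = rows["vmlinux_enabled"][0] - rows["vmlinux_disabled"][0]
data_growth = rows["vmlinux_enabled"][1] - rows["vmlinux_disabled"][1]

print(f"rows consistent: {consistent}")
print(f"text growth: {text_growth} bytes ({text_growth / 1000:.1f} KB)")
print(f"data growth: {data_growth} bytes")
```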
      
      Note that, due to a header file dependency problem between mm.h and
      tracepoint.h, this feature has to open code the static key functions for
      tracepoints.  This was proposed by Steven Rostedt at the following link.
      
      https://lkml.org/lkml/2015/12/9/699
      
      [arnd@arndb.de: crypto/async_pq: use __free_page() instead of put_page()]
      [iamjoonsoo.kim@lge.com: fix build failure for xtensa]
      [akpm@linux-foundation.org: tweak Kconfig text, per Vlastimil]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce page reference manipulation functions · fe896d18
      Joonsoo Kim authored
      The success of CMA allocation largely depends on the success of
      migration, and a key factor in that is the page reference count.  Until
      now, the page reference has been manipulated by calling atomic functions
      directly, so we cannot follow who manipulates it and where.  That makes
      it hard to find the actual reason for a CMA allocation failure.  CMA
      allocation should be guaranteed to succeed, so finding the offending
      place is really important.

      In this patch, call sites where the page reference is manipulated are
      converted to the newly introduced wrapper functions.  This is a
      preparation step for adding a tracepoint to each page reference
      manipulation function.  With this facility, we can easily find the reason
      for a CMA allocation failure.  There is no functional change in this
      patch.

      In addition, this patch also converts reference read sites.  It will
      help a second step that renames page._count to something else and
      prevents later attempts to access it directly (suggested by Andrew).
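
      The idea of routing every reference count change through a wrapper can
      be sketched in a few lines.  This is a user-space toy, not the kernel
      implementation: the counter is a plain integer rather than an atomic,
      the hook list is a hypothetical stand-in for tracepoints, and the
      function names only echo the style of such wrappers.

```python
class Page:
    """Toy page with a reference count that callers do not touch directly."""

    def __init__(self):
        self._count = 0


hooks = []  # hypothetical stand-in for tracepoints attached later


def _notify(event, page, val):
    for hook in hooks:
        hook(event, page._count, val)


def page_ref_count(page):
    """Read site: the only place that reads the counter."""
    return page._count


def page_ref_inc(page):
    page._count += 1
    _notify("page_ref_mod", page, 1)


def page_ref_dec_and_test(page):
    page._count -= 1
    _notify("page_ref_mod_and_test", page, -1)
    return page._count == 0


# Usage: record every manipulation once a hook is attached.
log = []
hooks.append(lambda event, count, val: log.append((event, count, val)))

page = Page()
page_ref_inc(page)
page_ref_inc(page)
first_drop = page_ref_dec_and_test(page)
last_drop = page_ref_dec_and_test(page)
print(log)
```

      Because callers never touch the counter field directly, the hook sees
      every change, and the field itself could later be renamed without
      touching the call sites.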
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: set THP defrag by default to madvise and add a stall-free defrag option · 444eb2a4
      Mel Gorman authored
      THP defrag is enabled by default to direct reclaim/compact but not wake
      kswapd in the event of a THP allocation failure.  The problem is that
      THP allocation requests potentially enter reclaim/compaction.  This
      potentially incurs a severe stall that is not guaranteed to be offset by
      reduced TLB misses.  While there has been considerable effort to reduce
      the impact of reclaim/compaction, it is still a high cost and workloads
      that should fit in memory fail to do so.  Specifically, a simple
      anon/file streaming workload will enter direct reclaim on NUMA at least
      even though the working set size is 80% of RAM.  It's been years and
      it's time to throw in the towel.
      
      First, this patch defines THP defrag as follows:

       madvise: A failed allocation will direct reclaim/compact if the application requests it
       never:   Neither reclaim/compact nor wake kswapd
       defer:   A failed allocation will wake kswapd/kcompactd
       always:  A failed allocation will direct reclaim/compact (historical behaviour)
                khugepaged defrag will enter direct reclaim but not wake kswapd.
      
      Next it sets the default defrag option to be "madvise" to only enter
      direct reclaim/compaction for applications that specifically requested
      it.
      
      Lastly, it removes a check from the page allocator slowpath that is
      related to __GFP_THISNODE to allow "defer" to work.  The callers that
      really care are slub/slab and they are updated accordingly.  The slab
      one may be surprising because it also corrects a comment as kswapd was
      never woken up by that path.
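
      The four defrag settings defined above can be summarised as a small
      decision function.  This is a sketch of the description only, not of the
      allocator code: the function name and return strings are hypothetical,
      khugepaged is not modelled, and treating a region without madvise()
      under "madvise" as an immediate fallback is an assumption.

```python
DEFAULT_DEFRAG = "madvise"  # new default according to the description


def thp_failure_action(defrag=DEFAULT_DEFRAG, madvised=False):
    """What a THP fault does when a huge page is not immediately available."""
    if defrag == "always":
        return "direct reclaim/compact"
    if defrag == "defer":
        return "wake kswapd/kcompactd"
    if defrag == "madvise":
        # Assumption: without madvise() the fault falls back immediately.
        return "direct reclaim/compact" if madvised else "fall back"
    if defrag == "never":
        return "fall back"
    raise ValueError(f"unknown defrag setting: {defrag}")


for setting in ("always", "defer", "madvise", "never"):
    print(f"{setting:8s} plain={thp_failure_action(setting)!r} "
          f"madvised={thp_failure_action(setting, madvised=True)!r}")
```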
      
      This means that a THP fault will no longer stall for most applications
      by default, which is the ideal for most users: they get THP if they are
      immediately available.  There are still options for users that prefer a
      stall at startup of a new application by either restoring historical
      behaviour with "always" or picking a half-way point with "defer" where
      kswapd does some of the work in the background and wakes kcompactd if
      necessary.  THP defrag for khugepaged remains enabled and will enter
      direct reclaim but not wake kswapd or kcompactd.

      After this patch a THP allocation failure will quickly fall back and rely
      on khugepaged to recover the situation at some time in the future.  In
      some cases, this will reduce THP usage but the benefit of THP is hard to
      measure and not a universal win, whereas a stall to reclaim/compaction
      is definitely measurable and can be painful.
      
      The first test for this is using "usemem" to read a large file and write
      a large anonymous mapping (to avoid the zero page) multiple times.  The
      total size of the mappings is 80% of RAM and the benchmark simply
      measures how long it takes to complete.  It uses multiple threads to see
      if that is a factor.  On UMA, the performance is almost identical so is
      not reported but on NUMA, we see this
      
      usemem
                                         4.4.0                 4.4.0
                                kcompactd-v1r1         nodefrag-v1r3
      Amean    System-1       102.86 (  0.00%)       46.81 ( 54.50%)
      Amean    System-4        37.85 (  0.00%)       34.02 ( 10.12%)
      Amean    System-7        48.12 (  0.00%)       46.89 (  2.56%)
      Amean    System-12       51.98 (  0.00%)       56.96 ( -9.57%)
      Amean    System-21       80.16 (  0.00%)       79.05 (  1.39%)
      Amean    System-30      110.71 (  0.00%)      107.17 (  3.20%)
      Amean    System-48      127.98 (  0.00%)      124.83 (  2.46%)
      Amean    Elapsd-1       185.84 (  0.00%)      105.51 ( 43.23%)
      Amean    Elapsd-4        26.19 (  0.00%)       25.58 (  2.33%)
      Amean    Elapsd-7        21.65 (  0.00%)       21.62 (  0.16%)
      Amean    Elapsd-12       18.58 (  0.00%)       17.94 (  3.43%)
      Amean    Elapsd-21       17.53 (  0.00%)       16.60 (  5.33%)
      Amean    Elapsd-30       17.45 (  0.00%)       17.13 (  1.84%)
      Amean    Elapsd-48       15.40 (  0.00%)       15.27 (  0.82%)
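
The percentages in parentheses are relative improvements over the baseline column, with negative values marking regressions. A minimal sketch of that arithmetic, using two Elapsd rows from the table above; the function name is hypothetical, and because the table entries are themselves rounded, recomputing other rows may differ in the last digit.

```python
def improvement_percent(baseline, patched):
    """Relative improvement of `patched` over `baseline`; negative means slower."""
    return (baseline - patched) / baseline * 100.0


# Values copied from the usemem table (Amean Elapsd-1 and Elapsd-4).
print("Elapsd-1: %.2f%%" % improvement_percent(185.84, 105.51))
print("Elapsd-4: %.2f%%" % improvement_percent(26.19, 25.58))
```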
      
      For a single thread, the benchmark completes 43.23% faster with this
      patch applied with smaller benefits as the thread count increases.
      Similarly, notice the large reduction in most cases in system CPU usage.
      The overall CPU time is
      
                     4.4.0       4.4.0
              kcompactd-v1r1 nodefrag-v1r3
      User        10357.65    10438.33
      System       3988.88     3543.94
      Elapsed      2203.01     1634.41
      
      Which is substantial. Now, the reclaim figures
      
                                       4.4.0       4.4.0
                                   kcompactd-v1r1 nodefrag-v1r3
      Minor Faults                 128458477   278352931
      Major Faults                   2174976         225
      Swap Ins                      16904701           0
      Swap Outs                     17359627           0
      Allocation stalls                43611           0
      DMA allocs                           0           0
      DMA32 allocs                  19832646    19448017
      Normal allocs                614488453   580941839
      Movable allocs                       0           0
      Direct pages scanned          24163800           0
      Kswapd pages scanned                 0           0
      Kswapd pages reclaimed               0           0
      Direct pages reclaimed        20691346           0
      Compaction stalls                42263           0
      Compaction success                 938           0
      Compaction failures              41325           0
      
      This patch eliminates almost all swapping and direct reclaim activity.
      There is still overhead but it's from NUMA balancing which does not
      identify that it's pointless trying to do anything with this workload.
      
      I also tried the thpscale benchmark which forces a corner case where
      compaction can be used heavily and measures fault latency depending on
      whether base or huge pages were used
      
      thpscale Fault Latencies
                                             4.4.0                 4.4.0
                                    kcompactd-v1r1         nodefrag-v1r3
      Amean    fault-base-1      5288.84 (  0.00%)     2817.12 ( 46.73%)
      Amean    fault-base-3      6365.53 (  0.00%)     3499.11 ( 45.03%)
      Amean    fault-base-5      6526.19 (  0.00%)     4363.06 ( 33.15%)
      Amean    fault-base-7      7142.25 (  0.00%)     4858.08 ( 31.98%)
      Amean    fault-base-12    13827.64 (  0.00%)    10292.11 ( 25.57%)
      Amean    fault-base-18    18235.07 (  0.00%)    13788.84 ( 24.38%)
      Amean    fault-base-24    21597.80 (  0.00%)    24388.03 (-12.92%)
      Amean    fault-base-30    26754.15 (  0.00%)    19700.55 ( 26.36%)
      Amean    fault-base-32    26784.94 (  0.00%)    19513.57 ( 27.15%)
      Amean    fault-huge-1      4223.96 (  0.00%)     2178.57 ( 48.42%)
      Amean    fault-huge-3      2194.77 (  0.00%)     2149.74 (  2.05%)
      Amean    fault-huge-5      2569.60 (  0.00%)     2346.95 (  8.66%)
      Amean    fault-huge-7      3612.69 (  0.00%)     2997.70 ( 17.02%)
      Amean    fault-huge-12     3301.75 (  0.00%)     6727.02 (-103.74%)
      Amean    fault-huge-18     6696.47 (  0.00%)     6685.72 (  0.16%)
      Amean    fault-huge-24     8000.72 (  0.00%)     9311.43 (-16.38%)
      Amean    fault-huge-30    13305.55 (  0.00%)     9750.45 ( 26.72%)
      Amean    fault-huge-32     9981.71 (  0.00%)    10316.06 ( -3.35%)
      
      The average time to fault pages is substantially reduced in the majority
      of cases but with the obvious caveat that fewer THPs are actually used
      in this adverse workload
      
                                         4.4.0                 4.4.0
                                kcompactd-v1r1         nodefrag-v1r3
      Percentage huge-1         0.71 (  0.00%)       14.04 (1865.22%)
      Percentage huge-3        10.77 (  0.00%)       33.05 (206.85%)
      Percentage huge-5        60.39 (  0.00%)       38.51 (-36.23%)
      Percentage huge-7        45.97 (  0.00%)       34.57 (-24.79%)
      Percentage huge-12       68.12 (  0.00%)       40.07 (-41.17%)
      Percentage huge-18       64.93 (  0.00%)       47.82 (-26.35%)
      Percentage huge-24       62.69 (  0.00%)       44.23 (-29.44%)
      Percentage huge-30       43.49 (  0.00%)       55.38 ( 27.34%)
      Percentage huge-32       50.72 (  0.00%)       51.90 (  2.35%)
      
                                       4.4.0       4.4.0
                                   kcompactd-v1r1 nodefrag-v1r3
      Minor Faults                  37429143    47564000
      Major Faults                      1916        1558
      Swap Ins                          1466        1079
      Swap Outs                      2936863      149626
      Allocation stalls                62510           3
      DMA allocs                           0           0
      DMA32 allocs                   6566458     6401314
      Normal allocs                216361697   216538171
      Movable allocs                       0           0
      Direct pages scanned          25977580       17998
      Kswapd pages scanned                 0     3638931
      Kswapd pages reclaimed               0      207236
      Direct pages reclaimed         8833714          88
      Compaction stalls               103349           5
      Compaction success                 270           4
      Compaction failures             103079           1
      
      Note again that while this does swap as it's an aggressive workload, the
      direct reclaim activity and allocation stalls are substantially reduced.
      There is some kswapd activity but ftrace showed that the kswapd activity
      was due to normal wakeups from 4K pages being allocated.
      Compaction-related stalls and activity are almost eliminated.
      
      I also tried the stutter benchmark.  For this, I do not have figures for
      NUMA but it's something that does impact UMA so I'll report what is
      available
      
      stutter
                                       4.4.0                 4.4.0
                              kcompactd-v1r1         nodefrag-v1r3
      Min         mmap      7.3571 (  0.00%)      7.3438 (  0.18%)
      1st-qrtle   mmap      7.5278 (  0.00%)     17.9200 (-138.05%)
      2nd-qrtle   mmap      7.6818 (  0.00%)     21.6055 (-181.25%)
      3rd-qrtle   mmap     11.0889 (  0.00%)     21.8881 (-97.39%)
      Max-90%     mmap     27.8978 (  0.00%)     22.1632 ( 20.56%)
      Max-93%     mmap     28.3202 (  0.00%)     22.3044 ( 21.24%)
      Max-95%     mmap     28.5600 (  0.00%)     22.4580 ( 21.37%)
      Max-99%     mmap     29.6032 (  0.00%)     25.5216 ( 13.79%)
      Max         mmap   4109.7289 (  0.00%)   4813.9832 (-17.14%)
      Mean        mmap     12.4474 (  0.00%)     19.3027 (-55.07%)
      
      This benchmark is trying to fault an anonymous mapping while there is a
      heavy IO load -- a scenario that desktop users used to complain about
      frequently.  This shows a mix because the ideal case of mapping with THP
      is not hit as often.  However, note that 99% of the mappings complete
      13.79% faster.  The CPU usage here is particularly interesting
      
                     4.4.0       4.4.0
              kcompactd-v1r1 nodefrag-v1r3
      User           67.50        0.99
      System       1327.88       91.30
      Elapsed      2079.00     2128.98
      
      And once again we look at the reclaim figures
      
                                       4.4.0       4.4.0
                                   kcompactd-v1r1 nodefrag-v1r3
      Minor Faults                 335241922  1314582827
      Major Faults                       715         819
      Swap Ins                             0           0
      Swap Outs                            0           0
      Allocation stalls               532723           0
      DMA allocs                           0           0
      DMA32 allocs                1822364341  1177950222
      Normal allocs               1815640808  1517844854
      Movable allocs                       0           0
      Direct pages scanned          21892772           0
      Kswapd pages scanned          20015890    41879484
      Kswapd pages reclaimed        19961986    41822072
      Direct pages reclaimed        21892741           0
      Compaction stalls              1065755           0
      Compaction success                 514           0
      Compaction failures            1065241           0
      
      Allocation stalls and all direct reclaim activity are eliminated as well
      as compaction-related stalls.
      
      THP gives impressive gains in some cases but only if they are quickly
      available.  We're not going to reach the point where they are completely
      free so let's take the costs out of the fast paths finally and defer the
      cost to kswapd, kcompactd and khugepaged where it belongs.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      444eb2a4
    • David Rientjes's avatar
      mm, mempool: only set __GFP_NOMEMALLOC if there are free elements · f9054c70
      David Rientjes authored
      If an oom killed thread calls mempool_alloc(), it is possible that it'll
      loop forever if there are no elements on the freelist since
      __GFP_NOMEMALLOC prevents it from accessing needed memory reserves in
      oom conditions.
      
      Only set __GFP_NOMEMALLOC if there are elements on the freelist.  If
      there are no free elements, allow allocations without the bit set so
      that memory reserves can be accessed if needed.
      
      Additionally, using mempool_alloc() with __GFP_NOMEMALLOC is not
      supported since the implementation can loop forever without accessing
      memory reserves when needed.
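
The rule described above can be modelled in a few lines. This is a sketch only: the bit values and the function name are hypothetical and do not match the kernel's constants, and rejecting a caller-supplied flag with an exception is a choice of this sketch to reflect the "not supported" statement.

```python
# Hypothetical bit values for illustration; the real kernel constants differ.
GFP_NOMEMALLOC = 1 << 0
GFP_OTHER = 1 << 1  # stand-in for whatever flags the caller passes


def first_attempt_flags(gfp_mask, free_elements):
    """Flags for the allocation attempt made before falling back to the pool."""
    if gfp_mask & GFP_NOMEMALLOC:
        # Callers are not supposed to pass this flag themselves.
        raise ValueError("mempool allocation with NOMEMALLOC is not supported")
    if free_elements > 0:
        # The pool can still satisfy the request, so keep reserves untouched.
        return gfp_mask | GFP_NOMEMALLOC
    # Pool is empty: leave the bit clear so memory reserves may be used.
    return gfp_mask


print(bool(first_attempt_flags(GFP_OTHER, 4) & GFP_NOMEMALLOC))
print(bool(first_attempt_flags(GFP_OTHER, 0) & GFP_NOMEMALLOC))
```
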
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f9054c70
    • Satoru Takeuchi's avatar
      mm: remove unnecessary description about a non-exist gfp flag · b14a1ef5
      Satoru Takeuchi authored
      Since __GFP_NOACCOUNT was removed by commit 20b5c303 ("Revert 'gfp:
      add __GFP_NOACCOUNT'"), its description is not necessary.
      Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b14a1ef5
    • Johannes Weiner's avatar
      mm: scale kswapd watermarks in proportion to memory · 795ae7a0
      Johannes Weiner authored
      In machines with 140G of memory and enterprise flash storage, we have
      seen read and write bursts routinely exceed the kswapd watermarks and
      cause thundering herds in direct reclaim.  Unfortunately, the only way
      to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
      system's emergency reserves - which is entirely unrelated to the
      system's latency requirements.  In order to get kswapd to maintain a
      250M buffer of free memory, the emergency reserves need to be set to 1G.
      That is a lot of memory wasted for no good reason.
      
      On the other hand, it's reasonable to assume that allocation bursts and
      overall allocation concurrency scale with memory capacity, so it makes
      sense to make kswapd aggressiveness a function of that as well.
      
      Change the kswapd watermark scale factor from the currently fixed 25% of
      the tunable emergency reserve to a tunable 0.1% of memory.
      
      Beyond 1G of memory, this will produce bigger watermark steps than the
      current formula in default settings.  Ensure that the new formula never
      chooses steps smaller than that, i.e.  25% of the emergency reserve.
      
      On a 140G machine, this raises the default watermark steps - the
      distance between min and low, and low and high - from 16M to 143M.
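
The numbers above can be checked with a small model of the two formulas. This is a sketch under stated assumptions: the function name is hypothetical, the scale factor is expressed in units of 1/10000 so that 10 means 0.1%, and the 64M emergency reserve is an assumed value chosen because 25% of it matches the 16M step quoted above.

```python
def watermark_step_mb(memory_mb, reserve_mb, scale_factor=10):
    """Distance between adjacent watermarks in MB under the new rule.

    scale_factor is in units of 1/10000 of memory, so 10 means 0.1%.
    """
    old_step = reserve_mb / 4.0                      # 25% of the emergency reserve
    scaled_step = memory_mb * scale_factor / 10000.0  # fraction of total memory
    # The new formula never picks a step smaller than the old one.
    return max(old_step, scaled_step)


memory_mb = 140 * 1024
print(watermark_step_mb(memory_mb, reserve_mb=64))  # scaled step wins on a big machine
print(watermark_step_mb(512, reserve_mb=8))         # old 25% rule wins on a small one
print(1024 / 4.0)  # old rule: a 1G reserve was needed for a roughly 250M buffer
```
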
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      795ae7a0
    • Kirill A. Shutemov's avatar
      mm: cleanup *pte_alloc* interfaces · 3ed3a4f0
      Kirill A. Shutemov authored
      There are a few things about *pte_alloc*() helpers worth cleaning up:
      
       - 'vma' argument is unused, let's drop it;
      
       - most __pte_alloc() callers do speculative check for pmd_none(),
         before taking ptl: let's introduce pte_alloc() macro which does
         the check.
      
         The only direct user of __pte_alloc left is userfaultfd, which has
         different expectation about atomicity wrt pmd.
      
       - pte_alloc_map() and pte_alloc_map_lock() are redefined using
         pte_alloc().
      
      [sudeep.holla@arm.com: fix build for arm64 hugetlbpage]
      [sfr@canb.auug.org.au: fix arch/arm/mm/mmu.c some more]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ed3a4f0
    • Igor Redko's avatar
      virtio_balloon: export 'available' memory to balloon statistics · 5057dcd0
      Igor Redko authored
      Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
      statistics protocol, corresponding to 'Available' in /proc/meminfo.
      
      It indicates to the hypervisor how big the balloon can be inflated
      without pushing the guest system to swap.
      Signed-off-by: Igor Redko <redkoi@virtuozzo.com>
      Signed-off-by: Denis V. Lunev <den@openvz.org>
      Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5057dcd0
    • Igor Redko's avatar
      mm/page_alloc.c: calculate 'available' memory in a separate function · d02bd27b
      Igor Redko authored
      Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
      statistics protocol, corresponding to 'Available' in /proc/meminfo.
      
      It indicates to the hypervisor how big the balloon can be inflated
      without pushing the guest system to swap.  This metric would be very
      useful in VM orchestration software to improve memory management of
      different VMs under overcommit.
      
      This patch (of 2):
      
      Factor out calculation of the available memory counter into a separate
      exportable function, in order to be able to use it in other parts of the
      kernel.
      
      In particular, it appears to be a relevant metric to report to the
      hypervisor via the virtio-balloon statistics interface (in a followup patch).
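
For orientation, the kind of estimate being factored out can be sketched as below. This is modelled on the heuristic commonly described for 'MemAvailable' in /proc/meminfo and is an assumption of this sketch, not a copy of the kernel function; all names are hypothetical and all quantities are in pages.

```python
def estimate_available(free, low_watermark, active_file, inactive_file,
                       slab_reclaimable):
    """Rough estimate of pages usable by new workloads without swapping."""
    # Free pages below the low watermark are not really available.
    available = free - low_watermark
    # Assume at least half the page cache, or the low watermark's worth,
    # has to stay resident to avoid thrashing.
    pagecache = active_file + inactive_file
    pagecache -= min(pagecache // 2, low_watermark)
    available += pagecache
    # Same reasoning for reclaimable slab.
    available += slab_reclaimable - min(slab_reclaimable // 2, low_watermark)
    return max(available, 0)


# Made-up page counts purely for illustration.
print(estimate_available(free=1000, low_watermark=200, active_file=300,
                         inactive_file=500, slab_reclaimable=100))
```
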
      Signed-off-by: Igor Redko <redkoi@virtuozzo.com>
      Signed-off-by: Denis V. Lunev <den@openvz.org>
      Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d02bd27b
    • Yang Shi's avatar
      mm/Kconfig: remove redundant arch depend for memory hotplug · 7eb50292
      Yang Shi authored
      MEMORY_HOTPLUG already depends on ARCH_ENABLE_MEMORY_HOTPLUG which is
      selected by the supported architectures, so the following arch depend is
      unnecessary.
      Signed-off-by: Yang Shi <yang.shi@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7eb50292
    • Vineet Gupta's avatar
      ARC, thp: remove infrastructure for handling splitting PMDs · 01609ec2
      Vineet Gupta authored
      With THP refcounting work, no need to mark PMDs splitting.
      
      (ARC got missed under the sweeping arch change as THP support was likely
      not present in orig baseline)
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01609ec2
    • Aneesh Kumar K.V's avatar
      mm/thp/migration: switch from flush_tlb_range to flush_pmd_tlb_range · 458aa76d
      Aneesh Kumar K.V authored
      We remove one instance of flush_tlb_range here.  That was added by commit
      f714f4f2 ("mm: numa: call MMU notifiers on THP migration").  But the
      pmdp_huge_clear_flush_notify should have done the required flush for us.
      Hence remove the extra flush.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      458aa76d
    • Kirill A. Shutemov's avatar
      mm, tracing: refresh __def_vmaflag_names · bcf66917
      Kirill A. Shutemov authored
      Get list of VMA flags up-to-date and sort it to match VM_* definition
      order.
      
      [vbabka@suse.cz: add a note above vmaflag definitions to update the names when changing]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bcf66917
    • Andrey Ryabinin's avatar
      mm: deduplicate memory overcommitment code · 39a1aa8e
      Andrey Ryabinin authored
      Currently we have two copies of the same code which implements memory
      overcommitment logic.  Let's move it into mm/util.c and hence avoid
      duplication.  No functional changes here.
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      39a1aa8e
    • Andrey Ryabinin's avatar
      mm: move max_map_count bits into mm.h · ea606cf5
      Andrey Ryabinin authored
      The max_map_count sysctl is unrelated to the scheduler.  Move its bits
      from include/linux/sched/sysctl.h to include/linux/mm.h.
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea606cf5
    • Kirill A. Shutemov's avatar
      thp, vmstats: count deferred split events · f9719a03
      Kirill A. Shutemov authored
      Count how many times we put a THP in split queue.  Currently, it happens
      on partial unmap of a THP.
      
      A rapidly growing value can indicate that an application behaves
      unfriendly wrt THP: it often faults in a huge page and then unmaps part
      of it.  This leads to unnecessary memory fragmentation and the
      application may require tuning.
      
      The event also can help with debugging kernel [mis-]behaviour.
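
One way to watch for a rapidly growing value is to sample /proc/vmstat twice and compute a rate. The sketch below makes assumptions: the counter is taken to be exported as "thp_deferred_split_page", the helper names are hypothetical, and the sample snapshots are made up rather than captured from a real system.

```python
def parse_vmstat(text):
    """Parse 'name value' lines, as found in /proc/vmstat, into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def deferred_split_rate(before, after, seconds,
                        key="thp_deferred_split_page"):
    """Events per second between two snapshots; a missing counter reads as 0."""
    return (after.get(key, 0) - before.get(key, 0)) / float(seconds)


# Made-up snapshots taken 10 seconds apart, for illustration only.
first = parse_vmstat("thp_fault_alloc 900\nthp_deferred_split_page 100\n")
second = parse_vmstat("thp_fault_alloc 960\nthp_deferred_split_page 150\n")
print(deferred_split_rate(first, second, seconds=10))
```
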
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f9719a03
    • Vladimir Davydov's avatar
      mm: workingset: make shadow node shrinker memcg aware · 0a6b76dd
      Vladimir Davydov authored
      Workingset code was recently made memcg aware, but shadow node shrinker
      is still global.  As a result, one small cgroup can consume all memory
      available for shadow nodes, possibly hurting other cgroups by reclaiming
      their shadow nodes, even though reclaim distances stored in its shadow
      nodes have no effect.  To avoid this, we need to make shadow node
      shrinker memcg aware.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a6b76dd
    • Vladimir Davydov's avatar
      mm: workingset: size shadow nodes lru basing on file cache size · cdcbb72e
      Vladimir Davydov authored
      A page is activated on refault if the refault distance stored in the
      corresponding shadow entry is less than the number of active file pages.
      Since active file pages can't occupy more than half memory, we assume
      that the maximal effective refault distance can't be greater than half
      the number of present pages and size the shadow nodes lru list
      appropriately.  Generally speaking, this assumption is correct, but it
      can result in wasting a considerable chunk of memory on stale shadow
      nodes in case the portion of file pages is small, e.g.  if a workload
      mostly uses anonymous memory.
      
      To sort this out, we need to compute the size of shadow nodes lru basing
      not on the maximal possible, but the current size of file cache.  We
      could take the size of active file lru for the maximal refault distance,
      but active lru is pretty unstable - it can shrink dramatically at
      runtime possibly disrupting workingset detection logic.
      
      Instead we assume that the maximal refault distance equals half the
      total number of file cache pages.  This will protect us against active
      file lru size fluctuations while still being correct, because size of
      active lru is normally maintained lower than size of inactive lru.
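
The difference between the two bounds is easy to see with numbers. This is a sketch with hypothetical function names and made-up page counts; it only restates the rules given above and does not model how the shadow node list itself is sized from the bound.

```python
def activates_on_refault(refault_distance, active_file_pages):
    """A refaulting page is activated if its distance is below the active file size."""
    return refault_distance < active_file_pages


def old_max_refault_distance(present_pages):
    """Previous assumption: active file pages can fill at most half of memory."""
    return present_pages // 2


def new_max_refault_distance(file_cache_pages):
    """New assumption: half of the current file cache size."""
    return file_cache_pages // 2


# A mostly-anonymous workload: 1,000,000 present pages, 100,000 file pages.
print(old_max_refault_distance(1000000))
print(new_max_refault_distance(100000))
```

With these figures the bound drops by a factor of ten, which is the case where the old sizing kept many stale shadow nodes around.
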
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cdcbb72e
    • Vladimir Davydov's avatar
      radix-tree: account radix_tree_node to memory cgroup · 58e698af
      Vladimir Davydov authored
      Allocation of radix_tree_node objects can be easily triggered from
      userspace, so we should account them to memory cgroup.  Besides, we need
      them accounted for making shadow node shrinker per memcg (see
      mm/workingset.c).
      
      A tricky thing about accounting radix_tree_node objects is that they are
      mostly allocated through radix_tree_preload(), so we can't just set
      SLAB_ACCOUNT for radix_tree_node_cachep - that would likely result in a
      lot of unrelated cgroups using objects from each other's caches.
      
      One way to overcome this would be making radix tree preloads per memcg,
      but that would probably look cumbersome and overcomplicated.
      
      Instead, we make radix_tree_node_alloc() first try to allocate from the
      cache with __GFP_ACCOUNT, no matter if the caller has preloaded or not,
      and only if it fails fall back on using per cpu preloads.  This should
      make most allocations accounted.
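
The allocation order described above can be sketched as a toy model. Everything here is hypothetical: the accounted allocator is passed in as a callable that returns None on failure, and the per-cpu preload is modelled as a plain list.

```python
def radix_tree_node_alloc(try_accounted_alloc, preload_stack):
    """Return (node, accounted): prefer an accounted allocation, else use a preload."""
    node = try_accounted_alloc()
    if node is not None:
        # Charged to the caller's memory cgroup, even if a preload was available.
        return node, True
    if preload_stack:
        # Fall back to a node set aside earlier; this one is not accounted.
        return preload_stack.pop(), False
    return None, False


preloaded = ["preloaded-node"]
print(radix_tree_node_alloc(lambda: "fresh-node", preloaded))
print(radix_tree_node_alloc(lambda: None, preloaded))
print(radix_tree_node_alloc(lambda: None, preloaded))
```
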
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58e698af
    • Vladimir Davydov's avatar
      mm: memcontrol: zap memcg_kmem_online helper · b6ecd2de
      Vladimir Davydov authored
      As kmem accounting is now either enabled for all cgroups or disabled
      system-wide, there's no point in having memcg_kmem_online() helper -
      instead one can use memcg_kmem_enabled() and mem_cgroup_online(), as
      shrink_slab() now does.
      
      There are only two places left where this helper is used -
      __memcg_kmem_charge() and memcg_create_kmem_cache().  The former can
      only be called if memcg_kmem_enabled() returned true.  Since the cgroup
      it operates on is online, mem_cgroup_is_root() check will be enough.
      
      memcg_create_kmem_cache() can't use mem_cgroup_online() helper instead
      of memcg_kmem_online(), because it relies on the fact that in
      memcg_offline_kmem() memcg->kmem_state is changed before
      memcg_deactivate_kmem_caches() is called, but there we can just
      open-code the check.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6ecd2de
    • Vladimir Davydov's avatar
      mm: vmscan: pass root_mem_cgroup instead of NULL to memcg aware shrinker · 0fc9f58a
      Vladimir Davydov authored
      It's just convenient to implement a memcg aware shrinker when you know
      that shrink_control->memcg != NULL unless memcg_kmem_enabled() returns
      false.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0fc9f58a
    • Vladimir Davydov's avatar
      mm: memcontrol: enable kmem accounting for all cgroups in the legacy hierarchy · b313aeee
      Vladimir Davydov authored
      Workingset code was recently made memcg aware, but shadow node shrinker
      is still global.  As a result, one small cgroup can consume all memory
      available for shadow nodes, possibly hurting other cgroups by reclaiming
      their shadow nodes, even though reclaim distances stored in its shadow
      nodes have no effect.  To avoid this, we need to make shadow node
      shrinker memcg aware.
      
      The actual work is done in patch 6 of the series.  Patches 1 and 2
      prepare memcg/shrinker infrastructure for the change.  Patch 3 is just a
      collateral cleanup.  Patch 4 makes radix_tree_node accounted, which is
      necessary for making shadow node shrinker memcg aware.  Patch 5 reduces
      shadow nodes overhead in case workload mostly uses anonymous pages.
      
      This patch:
      
      Currently, in the legacy hierarchy kmem accounting is off for all
      cgroups by default and must be enabled explicitly by writing something
      to memory.kmem.limit_in_bytes.  Since we don't support reclaim on
      hitting kmem limit, nor do we have any plans to implement it, this is
      likely to be -1, just to enable kmem accounting and limit kernel memory
      consumption by the memory.limit_in_bytes along with user memory.
      
      This user API was introduced when the implementation of kmem accounting
      lacked slab shrinker support and hence was useless in practice.  Things
      have changed since then - slab shrinkers were made memcg aware, the
      accounting overhead seems to be negligible, and a failure to charge a
      kmem allocation should not have critical consequences, because we only
      account those kernel objects that should be safe to fail.  That's why
      kmem accounting is enabled by default for all cgroups in the default
      hierarchy, which will eventually replace the legacy one.
      
      The ability to enable kmem accounting for some cgroups while keeping it
      disabled for others is getting difficult to maintain.  E.g.  to make
      shadow node shrinker memcg aware (see mm/workingset.c), we need to know
      the relationship between the number of shadow nodes allocated for a
      cgroup and the size of its lru list.  If kmem accounting is enabled for
      all cgroups there is no problem, but what should we do if kmem
      accounting is enabled only for half of cgroups? We've no other choice
      but use global lru stats while scanning root cgroup's shadow nodes, but
      that would be wrong if kmem accounting was enabled for all cgroups
      (which is the case if the unified hierarchy is used), in which case we
      should use lru stats of the root cgroup's lruvec.
      
      That being said, let's enable kmem accounting for all memory cgroups by
      default.  If one finds it unstable or too costly, it can always be
      disabled system-wide by passing cgroup.memory=nokmem to the kernel at
      boot time.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b313aeee