1. 17 Mar, 2016 40 commits
    • Rob Landley's avatar
      include/uapi/linux/elf-em.h: remove v850 · faeb50b9
      Rob Landley authored
      The v850 port was removed by commits f606ddf4 and 07a887d3 in
      2008.  These #defines are not used in the current kernel.
      Signed-off-by: default avatarRob Landley <rob@landley.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      faeb50b9
    • Christoph Lameter's avatar
      fix Christoph's email addresses · 93e205a7
      Christoph Lameter authored
      There are various email addresses for me throughout the kernel.  Use the
      one that will always be valid.
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      93e205a7
    • Steven Rostedt's avatar
      bug: set warn variable before calling WARN() · dfbf2897
      Steven Rostedt authored
      This has hit me a couple of times already.  I would be debugging code
      and the system would simply hang and then reboot.  Finally, I found that
      the problem was caused by WARN_ON_ONCE() and friends.
      
      The macro WARN_ON_ONCE(condition) is defined as:
      
      	static bool __section(.data.unlikely) __warned;
      	int __ret_warn_once = !!(condition);
      
      	if (unlikely(__ret_warn_once))
      		if (WARN_ON(!__warned))
      			__warned = true;
      
      	unlikely(__ret_warn_once);
      
      Which looks great and all.  But what I have hit, is an issue when
      WARN_ON() itself hits the same WARN_ON_ONCE() code.  Because, the
      variable __warned is not yet set.  Then it too calls WARN_ON() and that
      triggers the warning again.  It keeps doing this until the stack is
      overflowed and the system crashes.
      
      By setting __warned first before calling WARN_ON() makes the original
      WARN_ON_ONCE() really only warn once, and not an infinite amount of
      times if the WARN_ON() also triggers the warning.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dfbf2897
    • Andrew Morton's avatar
      arch/mn10300/kernel/fpu-nofpu.c: needs asm/elf.h · c60f1692
      Andrew Morton authored
      arch/mn10300/kernel/fpu-nofpu.c:27:36: error: unknown type name 'elf_fpregset_t'
          int dump_fpu(struct pt_regs *regs, elf_fpregset_t *fpreg)
      Reported-by: default avatarkbuild test robot <lkp@intel.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c60f1692
    • Andrew Morton's avatar
      mn10300, c6x: CONFIG_GENERIC_BUG must depend on CONFIG_BUG · 8b9e6d58
      Andrew Morton authored
      CONFIG_BUG=n && CONFIG_GENERIC_BUG=y make no sense and things break:
      
         In file included from include/linux/page-flags.h:9:0,
                          from kernel/bounds.c:9:
         include/linux/bug.h:91:47: warning: 'struct bug_entry' declared inside parameter list
          static inline int is_warning_bug(const struct bug_entry *bug)
                                                        ^
         include/linux/bug.h:91:47: warning: its scope is only this definition or declaration, which is probably not what you want
         include/linux/bug.h: In function 'is_warning_bug':
      >> include/linux/bug.h:93:12: error: dereferencing pointer to incomplete type
           return bug->flags & BUGFLAG_WARNING;
      Reported-by: default avatarkbuild test robot <fengguang.wu@intel.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8b9e6d58
    • Dave Young's avatar
      proc-vmcore: wrong data type casting fix · 0b50a2d8
      Dave Young authored
      On i686 PAE enabled machine the contiguous physical area could be large
      and it can cause trimming down variables in below calculation in
      read_vmcore() and mmap_vmcore():
      
      	tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
      
      That is, the types being used is like below on i686:
      m->offset: unsigned long long int
      m->size:   unsigned long long int
      *fpos:     loff_t (long long int)
      buflen:    size_t (unsigned int)
      
      So casting (m->offset + m->size - *fpos) by size_t means truncating a
      given value by 4GB.
      
      Suppose (m->offset + m->size - *fpos) being truncated to 0, buflen >0
      then we will get tsz = 0.  It is of course not an expected result.
      Similarly we could also get other truncated values less than buflen.
      Then the real size passed down is not correct any more.
      
      If (m->offset + m->size - *fpos) is above 4GB, read_vmcore or
      mmap_vmcore use the min_t result with truncated values being compared to
      buflen.  Then, fpos proceeds with the wrong value so that we reach below
      bugs:
      
      1) read_vmcore will refuse to continue so makedumpfile fails.
      2) mmap_vmcore will trigger BUG_ON() in remap_pfn_range().
      
      Use unsigned long long in min_t instead so that the variables in are not
      truncated.
      Signed-off-by: default avatarBaoquan He <bhe@redhat.com>
      Signed-off-by: default avatarDave Young <dyoung@redhat.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jianyu Zhan <nasa4836@gmail.com>
      Cc: Minfei Huang <mhuang@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0b50a2d8
    • Minfei Huang's avatar
      proc/base: make prompt shell start from new line after executing "cat /proc/$pid/wchan" · 7e2bc81d
      Minfei Huang authored
      It is not elegant that prompt shell does not start from new line after
      executing "cat /proc/$pid/wchan".  Make prompt shell start from new
      line.
      Signed-off-by: default avatarMinfei Huang <mnfhuang@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7e2bc81d
    • Eric Engestrom's avatar
      procfs: add conditional compilation check · b5946bea
      Eric Engestrom authored
      `proc_timers_operations` is only used when CONFIG_CHECKPOINT_RESTORE is
      enabled.
      Signed-off-by: default avatarEric Engestrom <eric.engestrom@imgtec.com>
      Acked-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b5946bea
    • John Stultz's avatar
      proc: add /proc/<pid>/timerslack_ns interface · 5de23d43
      John Stultz authored
      This patch provides a proc/PID/timerslack_ns interface which exposes a
      task's timerslack value in nanoseconds and allows it to be changed.
      
      This allows power/performance management software to set timer slack for
      other threads according to its policy for the thread (such as when the
      thread is designated foreground vs.  background activity)
      
      If the value written is non-zero, slack is set to that value.  Otherwise
      sets it to the default for the thread.
      
      This interface checks that the calling task has permissions to to use
      PTRACE_MODE_ATTACH_FSCREDS on the target task, so that we can ensure
      arbitrary apps do not change the timer slack for other apps.
      Signed-off-by: default avatarJohn Stultz <john.stultz@linaro.org>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Oren Laadan <orenl@cellrox.com>
      Cc: Ruchi Kandoi <kandoiruchi@google.com>
      Cc: Rom Lemarchand <romlem@android.com>
      Cc: Android Kernel Team <kernel-team@android.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5de23d43
    • John Stultz's avatar
      timer: convert timer_slack_ns from unsigned long to u64 · da8b44d5
      John Stultz authored
      This patchset introduces a /proc/<pid>/timerslack_ns interface which
      would allow controlling processes to be able to set the timerslack value
      on other processes in order to save power by avoiding wakeups (Something
      Android currently does via out-of-tree patches).
      
      The first patch tries to fix the internal timer_slack_ns usage which was
      defined as a long, which limits the slack range to ~4 seconds on 32bit
      systems.  It converts it to a u64, which provides the same basically
      unlimited slack (500 years) on both 32bit and 64bit machines.
      
      The second patch introduces the /proc/<pid>/timerslack_ns interface
      which allows the full 64bit slack range for a task to be read or set on
      both 32bit and 64bit machines.
      
      With these two patches, on a 32bit machine, after setting the slack on
      bash to 10 seconds:
      
      $ time sleep 1
      
      real    0m10.747s
      user    0m0.001s
      sys     0m0.005s
      
      The first patch is a little ugly, since I had to chase the slack delta
      arguments through a number of functions converting them to u64s.  Let me
      know if it makes sense to break that up more or not.
      
      Other than that things are fairly straightforward.
      
      This patch (of 2):
      
      The timer_slack_ns value in the task struct is currently a unsigned
      long.  This means that on 32bit applications, the maximum slack is just
      over 4 seconds.  However, on 64bit machines, its much much larger (~500
      years).
      
      This disparity could make application development a little (as well as
      the default_slack) to a u64.  This means both 32bit and 64bit systems
      have the same effective internal slack range.
      
      Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specify
      the interface as a unsigned long, so we preserve that limitation on
      32bit systems, where SET_TIMERSLACK can only set the slack to a unsigned
      long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is
      actually larger then what can be stored by an unsigned long.
      
      This patch also modifies hrtimer functions which specified the slack
      delta as a unsigned long.
      Signed-off-by: default avatarJohn Stultz <john.stultz@linaro.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Oren Laadan <orenl@cellrox.com>
      Cc: Ruchi Kandoi <kandoiruchi@google.com>
      Cc: Rom Lemarchand <romlem@android.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Android Kernel Team <kernel-team@android.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      da8b44d5
    • Tetsuo Handa's avatar
      mm,oom: do not loop !__GFP_FS allocation if the OOM killer is disabled · 0a687aac
      Tetsuo Handa authored
      After the OOM killer is disabled during suspend operation, any
      !__GFP_NOFAIL && __GFP_FS allocations are forced to fail.  Thus, any
      !__GFP_NOFAIL && !__GFP_FS allocations should be forced to fail as well.
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a687aac
    • Tetsuo Handa's avatar
      mm,oom: make oom_killer_disable() killable · 6afcf289
      Tetsuo Handa authored
      While oom_killer_disable() is called by freeze_processes() after all
      user threads except the current thread are frozen, it is possible that
      kernel threads invoke the OOM killer and sends SIGKILL to the current
      thread due to sharing the thawed victim's memory.  Therefore, checking
      for SIGKILL is preferable than TIF_MEMDIE.
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6afcf289
    • Sergey Senozhatsky's avatar
      mm/zsmalloc: add `freeable' column to pool stat · 1120ed54
      Sergey Senozhatsky authored
      Add a new column to pool stats, which will tell how many pages ideally
      can be freed by class compaction, so it will be easier to analyze
      zsmalloc fragmentation.
      
      At the moment, we have only numbers of FULL and ALMOST_EMPTY classes,
      but they don't tell us how badly the class is fragmented internally.
      
      The new /sys/kernel/debug/zsmalloc/zramX/classes output look as follows:
      
       class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
      [..]
          12   224           0            2           146          5          8                4        4
          13   240           0            0             0          0          0                1        0
          14   256           1           13          1840       1672        115                1       10
          15   272           0            0             0          0          0                1        0
      [..]
          49   816           0            3           745        735        149                1        2
          51   848           3            4           361        306         76                4        8
          52   864          12           14           378        268         81                3       21
          54   896           1           12           117         57         26                2       12
          57   944           0            0             0          0          0                3        0
      [..]
       Total                26          131         12709      10994       1071                       134
      
      For example, from this particular output we can easily conclude that
      class-896 is heavily fragmented -- it occupies 26 pages, 12 can be freed
      by compaction.
      Signed-off-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1120ed54
    • YiPing Xu's avatar
      zsmalloc: drop unused member 'mapping_area->huge' · a82cbf07
      YiPing Xu authored
      When unmapping a huge class page in zs_unmap_object, the page will be
      unmapped by kmap_atomic.  the "!area->huge" branch in __zs_unmap_object
      is alway true, and no code set "area->huge" now, so we can drop it.
      Signed-off-by: default avatarYiPing Xu <xuyiping@huawei.com>
      Reviewed-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a82cbf07
    • Shawn Lin's avatar
      mm/vmalloc: use PAGE_ALIGNED() to check PAGE_SIZE alignment · a1c0b1a0
      Shawn Lin authored
      We have PAGE_ALIGNED() in mm.h, so let's use it instead of IS_ALIGNED()
      for checking PAGE_SIZE aligned case.
      Signed-off-by: default avatarShawn Lin <shawn.lin@rock-chips.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a1c0b1a0
    • Vladimir Davydov's avatar
      mm: memcontrol: zap oom_info_lock · e0775d10
      Vladimir Davydov authored
      mem_cgroup_print_oom_info is always called under oom_lock, so
      oom_info_lock is redundant.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e0775d10
    • Johannes Weiner's avatar
      mm: memcontrol: clarify the uncharge_list() loop · 8b592656
      Johannes Weiner authored
      uncharge_list() does an unusual list walk because the function can take
      regular lists with dedicated list_heads as well as singleton lists where
      a single page is passed via the page->lru list node.
      
      This can sometimes lead to confusion as well as suggestions to replace
      the loop with a list_for_each_entry(), which wouldn't work.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8b592656
    • Johannes Weiner's avatar
      mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage · b6e6edcf
      Johannes Weiner authored
      Setting the original memory.limit_in_bytes hardlimit is subject to a
      race condition when the desired value is below the current usage.  The
      code tries a few times to first reclaim and then see if the usage has
      dropped to where we would like it to be, but there is no locking, and
      the workload is free to continue making new charges up to the old limit.
      Thus, attempting to shrink a workload relies on pure luck and hope that
      the workload happens to cooperate.
      
      To fix this in the cgroup2 memory.max knob, do it the other way round:
      set the limit first, then try enforcement.  And if reclaim is not able
      to succeed, trigger OOM kills in the group.  Keep going until the new
      limit is met, we run out of OOM victims and there's only unreclaimable
      memory left, or the task writing to memory.max is killed.  This allows
      users to shrink groups reliably, and the behavior is consistent with
      what happens when new charges are attempted in excess of memory.max.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b6e6edcf
    • Johannes Weiner's avatar
      mm: memcontrol: reclaim when shrinking memory.high below usage · 588083bb
      Johannes Weiner authored
      When setting memory.high below usage, nothing happens until the next
      charge comes along, and then it will only reclaim its own charge and not
      the now potentially huge excess of the new memory.high.  This can cause
      groups to stay in excess of their memory.high indefinitely.
      
      To fix that, when shrinking memory.high, kick off a reclaim cycle that
      goes after the delta.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      588083bb
    • Naoya Horiguchi's avatar
      tools/vm/page-types.c: avoid memset() in walk_pfn() when count == 1 · d9b2ddf8
      Naoya Horiguchi authored
      I found that page-types is very slow and my testing shows many timeout
      errors.  Here's an example with a simple program allocating 1000 thps.
      
        $ time ./page-types -p $(pgrep -f test_alloc)
        ...
        real    0m17.201s
        user    0m16.889s
        sys     0m0.312s
      
      Most of time is spent in memset().  Currently memset() clears over whole
      buffer for every walk_pfn() call, which is inefficient when walk_pfn()
      is called from walk_vma(), because in that case walk_pfn() is called for
      each pfn.  So this patch limits the zero initialization only for the
      first element.
      
        $ time ./page-types.patched -p $(pgrep -f test_alloc)
        ...
        real    0m0.182s
        user    0m0.046s
        sys     0m0.135s
      
      Fixes: 954e95584579 ("tools/vm/page-types.c: add memory cgroup dumping and filtering")
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Suggested-by: default avatarKonstantin Khlebnikov <koct9i@gmail.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d9b2ddf8
    • Li Zhang's avatar
      powerpc/mm: enable page parallel initialisation · 7f2bd006
      Li Zhang authored
      Parallel initialisation has been enabled for X86, boot time is improved
      greatly.  On Power8, it is improved greatly for small memory.  Here is
      the result from my test on Power8 platform:
      
      For 4GB of memory, boot time is improved by 59%, from 24.5s to 10s.
      
      For 50GB memory, boot time is improved by 22%, from 56.8s to 43.8s.
      Signed-off-by: default avatarLi Zhang <zhlcindy@linux.vnet.ibm.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7f2bd006
    • Li Zhang's avatar
      mm: meminit: initialise more memory for inode/dentry hash tables in early boot · 987b3095
      Li Zhang authored
      Upstream has supported page parallel initialisation for X86 and the boot
      time is improved greately.  Some tests have been done for Power.
      
      Here is the result I have done with different memory size.
      
      * 4GB memory:
          boot time is as the following:
          with patch vs without patch: 10.4s vs 24.5s
          boot time is improved 57%
      * 200GB memory:
          boot time looks the same with and without patches.
          boot time is about 38s
      * 32TB memory:
          boot time looks the same with and without patches
          boot time is about 160s.
          The boot time is much shorter than X86 with 24TB memory.
          From community discussion, it costs about 694s for X86 24T system.
      
      Parallel initialisation improves the performance by deferring memory
      initilisation to kswap with N kthreads, it should improve the performance
      therotically.
      
      In testing on X86, performance is improved greatly with huge memory.  But
      on Power platform, it is improved greatly with less than 100GB memory.
      For huge memory, it is not improved greatly.  But it saves the time with
      several threads at least, as the following information shows(32TB system
      log):
      
      [   22.648169] node 9 initialised, 16607461 pages in 280ms
      [   22.783772] node 3 initialised, 23937243 pages in 410ms
      [   22.858877] node 6 initialised, 29179347 pages in 490ms
      [   22.863252] node 2 initialised, 29179347 pages in 490ms
      [   22.907545] node 0 initialised, 32049614 pages in 540ms
      [   22.920891] node 15 initialised, 32212280 pages in 550ms
      [   22.923236] node 4 initialised, 32306127 pages in 550ms
      [   22.923384] node 12 initialised, 32314319 pages in 550ms
      [   22.924754] node 8 initialised, 32314319 pages in 550ms
      [   22.940780] node 13 initialised, 33353677 pages in 570ms
      [   22.940796] node 11 initialised, 33353677 pages in 570ms
      [   22.941700] node 5 initialised, 33353677 pages in 570ms
      [   22.941721] node 10 initialised, 33353677 pages in 570ms
      [   22.941876] node 7 initialised, 33353677 pages in 570ms
      [   22.944946] node 14 initialised, 33353677 pages in 570ms
      [   22.946063] node 1 initialised, 33345485 pages in 580ms
      
      It saves the time about 550*16 ms at least, although it can be ignore to
      compare the boot time about 160 seconds.  What's more, the boot time is
      much shorter on Power even without patches than x86 for huge memory
      machine.
      
      So this patchset is still necessary to be enabled for Power.
      
      This patch (of 2):
      
      This patch is based on Mel Gorman's old patch in the mailing list,
      https://lkml.org/lkml/2015/5/5/280 which is discussed but it is fixed with
      a completion to wait for all memory initialised in page_alloc_init_late().
      It is to fix the OOM problem on X86 with 24TB memory which allocates
      memory in late initialisation.  But for Power platform with 32TB memory,
      it causes a call trace in vfs_caches_init->inode_init() and inode hash
      table needs more memory.  So this patch allocates 1GB for 0.25TB/node for
      large system as it is mentioned in https://lkml.org/lkml/2015/5/1/627
      
      This call trace is found on Power with 32TB memory, 1024CPUs, 16nodes.
      Currently, it only allocates 2GB*16=32GB for early initialisation.  But
      Dentry cache hash table needes 16GB and Inode cache hash table needs 16GB.
      So the system have no enough memory for it.  The log from dmesg as the
      following:
      
        Dentry cache hash table entries: 2147483648 (order: 18,17179869184 bytes)
        vmalloc: allocation failure, allocated 16021913600 of 17179934720 bytes
        swapper/0: page allocation failure: order:0,mode:0x2080020
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.0-0-ppc64
        Call Trace:
          .dump_stack+0xb4/0xb664 (unreliable)
          .warn_alloc_failed+0x114/0x160
          .__vmalloc_area_node+0x1a4/0x2b0
          .__vmalloc_node_range+0xe4/0x110
          .__vmalloc_node+0x40/0x50
          .alloc_large_system_hash+0x134/0x2a4
          .inode_init+0xa4/0xf0
          .vfs_caches_init+0x80/0x144
          .start_kernel+0x40c/0x4e0
          start_here_common+0x20/0x4a4
      Signed-off-by: default avatarLi Zhang <zhlcindy@linux.vnet.ibm.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      987b3095
    • Kirill A. Shutemov's avatar
      thp: fix deadlock in split_huge_pmd() · 5f737714
      Kirill A. Shutemov authored
      split_huge_pmd() tries to munlock page with munlock_vma_page().  That
      requires the page to locked.
      
      If the is locked by caller, we would get a deadlock:
      
      	Unable to find swap-space signature
      	INFO: task trinity-c85:1907 blocked for more than 120 seconds.
      	      Not tainted 4.4.0-00032-gf19d0bdced41-dirty #1606
      	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      	trinity-c85     D ffff88084d997608     0  1907    309 0x00000000
      	Call Trace:
      	  schedule+0x9f/0x1c0
      	  schedule_timeout+0x48e/0x600
      	  io_schedule_timeout+0x1c3/0x390
      	  bit_wait_io+0x29/0xd0
      	  __wait_on_bit_lock+0x94/0x140
      	  __lock_page+0x1d4/0x280
      	  __split_huge_pmd+0x5a8/0x10f0
      	  split_huge_pmd_address+0x1d9/0x230
      	  try_to_unmap_one+0x540/0xc70
      	  rmap_walk_anon+0x284/0x810
      	  rmap_walk_locked+0x11e/0x190
      	  try_to_unmap+0x1b1/0x4b0
      	  split_huge_page_to_list+0x49d/0x18a0
      	  follow_page_mask+0xa36/0xea0
      	  SyS_move_pages+0xaf3/0x1570
      	  entry_SYSCALL_64_fastpath+0x12/0x6b
      	2 locks held by trinity-c85/1907:
      	 #0:  (&mm->mmap_sem){++++++}, at:  SyS_move_pages+0x933/0x1570
      	 #1:  (&anon_vma->rwsem){++++..}, at:  split_huge_page_to_list+0x402/0x18a0
      
      I don't think the deadlock is triggerable without split_huge_page()
      simplifilcation patchset.
      
      But munlock_vma_page() here is wrong: we want to munlock the page
      unconditionally, no need in rmap lookup, that munlock_vma_page() does.
      
      Let's use clear_page_mlock() instead.  It can be called under ptl.
      
      Fixes: e90309c9 ("thp: allow mlocked THP again")
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5f737714
    • Kirill A. Shutemov's avatar
      thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers · fec89c10
      Kirill A. Shutemov authored
      freeze_page() and unfreeze_page() helpers evolved in rather complex
      beasts.  It would be nice to cut complexity of this code.
      
      This patch rewrites freeze_page() using standard try_to_unmap().
      unfreeze_page() is rewritten with remove_migration_ptes().
      
      The result is much simpler.
      
      But the new variant is somewhat slower for PTE-mapped THPs.  Current
      helpers iterates over VMAs the compound page is mapped to, and then over
      ptes within this VMA.  New helpers iterates over small page, then over
      VMA the small page mapped to, and only then find relevant pte.
      
      We have short cut for PMD-mapped THP: we directly install migration
      entries on PMD split.
      
      I don't think the slowdown is critical, considering how much simpler
      result is and that split_huge_page() is quite rare nowadays.  It only
      happens due memory pressure or migration.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fec89c10
    • Kirill A. Shutemov's avatar
      mm: make remove_migration_ptes() beyond mm/migration.c · e388466d
      Kirill A. Shutemov authored
      Make remove_migration_ptes() available to be used in split_huge_page().
      
      New parameter 'locked' added: as with try_to_umap() we need a way to
      indicate that caller holds rmap lock.
      
      We also shouldn't try to mlock() pte-mapped huge pages: pte-mapeed THP
      pages are never mlocked.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e388466d
    • Kirill A. Shutemov's avatar
      rmap: extend try_to_unmap() to be usable by split_huge_page() · 2a52bcbc
      Kirill A. Shutemov authored
      Add support for two ttu_flags:
      
        - TTU_SPLIT_HUGE_PMD would split PMD if it's there, before trying to
          unmap page;
      
        - TTU_RMAP_LOCKED indicates that caller holds relevant rmap lock;
      
      Also, change rwc->done to !page_mapcount() instead of !page_mapped().
      try_to_unmap() works on pte level, so we are really interested in the
      mappedness of this small page rather than of the compound page it's a
      part of.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2a52bcbc
    • Kirill A. Shutemov's avatar
      rmap: introduce rmap_walk_locked() · b9773199
      Kirill A. Shutemov authored
      This patchset rewrites freeze_page() and unfreeze_page() using
      try_to_unmap() and remove_migration_ptes().  Result is much simpler, but
      somewhat slower.
      
      Migration 8GiB worth of PMD-mapped THP:
      
        Baseline	20.21 +/- 0.393
        Patched	20.73 +/- 0.082
        Slowdown	1.03x
      
      It's 3% slower, comparing to 14% in v1.  I don't it should be a stopper.
      
      Splitting of PTE-mapped pages slowed more.  But this is not a common
      case.
      
      Migration 8GiB worth of PMD-mapped THP:
      
        Baseline	20.39 +/- 0.225
        Patched	22.43 +/- 0.496
        Slowdown	1.10x
      
      rmap_walk_locked() is the same as rmap_walk(), but the caller takes care
      of the relevant rmap lock.
      
      This is preparation for switching THP splitting from custom rmap walk in
      freeze_page()/unfreeze_page() to the generic one.
      
      There is no support for KSM pages for now: not clear which lock is
      implied.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b9773199
    • Dan Williams's avatar
      mm: ZONE_DEVICE depends on SPARSEMEM_VMEMMAP · 99490f16
      Dan Williams authored
      The primary use case for devm_memremap_pages() is to allocate an memmap
      array from persistent memory.  That capabilty requires vmem_altmap which
      requires SPARSEMEM_VMEMMAP.
      
      Also, without SPARSEMEM_VMEMMAP the addition of ZONE_DEVICE expands
      ZONES_WIDTH and triggers the:
      
      "Unfortunate NUMA and NUMA Balancing config, growing page-frame for
      last_cpupid."
      
      ...warning in mm/memory.c.  SPARSEMEM_VMEMMAP=n && ZONE_DEVICE=y is not
      a configuration we should worry about supporting.
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Reported-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      99490f16
    • Jan Kara's avatar
      mm: remove VM_FAULT_MINOR · 0e8fb931
      Jan Kara authored
      The define has a comment from Nick Piggin from 2007:
      
       /* For backwards compat. Remove me quickly. */
      
      I guess 9 years should not be too hurried sense of 'quickly' even for
      kernel measures.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0e8fb931
    • Joe Perches's avatar
      mm: percpu: use pr_fmt to prefix output · 870d4b12
      Joe Perches authored
      Use the normal mechanism to make the logging output consistently
      "percpu:" instead of a mix of "PERCPU:" and "percpu:"
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      870d4b12
    • Joe Perches's avatar
      mm: convert printk(KERN_<LEVEL> to pr_<level> · 1170532b
      Joe Perches authored
      Most of the mm subsystem uses pr_<level> so make it consistent.
      
      Miscellanea:
      
       - Realign arguments
       - Add missing newline to format
       - kmemleak-test.c has a "kmemleak: " prefix added to the
         "Kmemleak testing" logging message via pr_fmt
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
      Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1170532b
    • Joe Perches's avatar
      mm: coalesce split strings · 756a025f
      Joe Perches authored
      Kernel style prefers a single string over split strings when the string is
      'user-visible'.
      
      Miscellanea:
      
       - Add a missing newline
       - Realign arguments
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
      Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      756a025f
    • Joe Perches's avatar
      mm: convert pr_warning to pr_warn · 598d8091
      Joe Perches authored
      There are a mixture of pr_warning and pr_warn uses in mm.  Use pr_warn
      consistently.
      
      Miscellanea:
      
       - Coalesce formats
       - Realign arguments
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
      Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      598d8091
    • Dan Williams's avatar
      mm: exclude ZONE_DEVICE from GFP_ZONE_TABLE · b11a7b94
      Dan Williams authored
      ZONE_DEVICE (merged in 4.3) and ZONE_CMA (proposed) are examples of new
      mm zones that are bumping up against the current maximum limit of 4
      zones, i.e.  2 bits in page->flags for the GFP_ZONE_TABLE.
      
      The GFP_ZONE_TABLE poses an interesting constraint since
      include/linux/gfp.h gets included by the 32-bit portion of a 64-bit
      build.  We need to be careful to only build the table for zones that
      have a corresponding gfp_t flag.  GFP_ZONES_SHIFT is introduced for this
      purpose.  This patch does not attempt to solve the problem of adding a
      new zone that also has a corresponding GFP_ flag.
      
      Vlastimil points out that ZONE_DEVICE, by depending on x86_64 and
      SPARSEMEM_VMEMMAP implies that SECTIONS_WIDTH is zero.  In other words
      even though ZONE_DEVICE does not fit in GFP_ZONE_TABLE it is free to
      consume another bit in page->flags (expand ZONES_WIDTH) with room to
      spare.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931
      Fixes: 033fbae9 ("mm: ZONE_DEVICE for "device memory"")
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Reported-by: default avatarMark <markk@clara.co.uk>
      Reported-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b11a7b94
    • Vladimir Davydov's avatar
      mm: memcontrol: cleanup css_reset callback · d334c9bc
      Vladimir Davydov authored
      - Do not take memcg_limit_mutex for resetting limits - the cgroup cannot
        be altered from userspace anymore, so no need to protect them.
      
      - Use plain page_counter_limit() for resetting ->memory and ->memsw
        limits instead of mem_cgrouop_resize_* helpers - we enlarge the limits,
        so no need in special handling.
      
      - Reset ->swap and ->tcpmem limits as well.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d334c9bc
    • Chen Yucong's avatar
      mm, memory hotplug: print debug message in the proper way for online_pages · e33e33b4
      Chen Yucong authored
      online_pages() simply returns an error value if
      memory_notify(MEM_GOING_ONLINE, &arg) return a value that is not what we
      want for successfully onlining target pages.  This patch arms to print
      more failure information like offline_pages() in online_pages.
      
      This patch also converts printk(KERN_<LEVEL>) to pr_<level>(), and moves
      __offline_pages() to not print failure information with KERN_INFO
      according to David Rientjes's suggestion[1].
      
      [1] https://lkml.org/lkml/2016/2/24/1094Signed-off-by: default avatarChen Yucong <slaoub@gmail.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e33e33b4
    • Michal Hocko's avatar
      mm: remove __GFP_NOFAIL is deprecated comment · 0f352e53
      Michal Hocko authored
      Commit 64775719 ("mm: clarify __GFP_NOFAIL deprecation status") was
      incomplete and didn't remove the comment about __GFP_NOFAIL being
      deprecated in buffered_rmqueue.
      
      Let's get rid of this leftover but keep the WARN_ON_ONCE for order > 1
      because we should really discourage from using __GFP_NOFAIL with higher
      order allocations because those are just too subtle.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarNikolay Borisov <kernel@kyup.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0f352e53
    • Joonsoo Kim's avatar
      mm/page_ref: add tracepoint to track down page reference manipulation · 95813b8f
      Joonsoo Kim authored
      CMA allocation should be guaranteed to succeed by definition, but,
      unfortunately, it would be failed sometimes.  It is hard to track down
      the problem, because it is related to page reference manipulation and we
      don't have any facility to analyze it.
      
      This patch adds tracepoints to track down page reference manipulation.
      With it, we can find exact reason of failure and can fix the problem.
      Following is an example of tracepoint output.  (note: this example is
      stale version that printing flags as the number.  Recent version will
      print it as human readable string.)
      
      <...>-9018  [004]    92.678375: page_ref_set:         pfn=0x17ac9 flags=0x0 count=1 mapcount=0 mapping=(nil) mt=4 val=1
      <...>-9018  [004]    92.678378: kernel_stack:
       => get_page_from_freelist (ffffffff81176659)
       => __alloc_pages_nodemask (ffffffff81176d22)
       => alloc_pages_vma (ffffffff811bf675)
       => handle_mm_fault (ffffffff8119e693)
       => __do_page_fault (ffffffff810631ea)
       => trace_do_page_fault (ffffffff81063543)
       => do_async_page_fault (ffffffff8105c40a)
       => async_page_fault (ffffffff817581d8)
      [snip]
      <...>-9018  [004]    92.678379: page_ref_mod:         pfn=0x17ac9 flags=0x40048 count=2 mapcount=1 mapping=0xffff880015a78dc1 mt=4 val=1
      [snip]
      ...
      ...
      <...>-9131  [001]    93.174468: test_pages_isolated:  start_pfn=0x17800 end_pfn=0x17c00 fin_pfn=0x17ac9 ret=fail
      [snip]
      <...>-9018  [004]    93.174843: page_ref_mod_and_test: pfn=0x17ac9 flags=0x40068 count=0 mapcount=0 mapping=0xffff880015a78dc1 mt=4 val=-1 ret=1
       => release_pages (ffffffff8117c9e4)
       => free_pages_and_swap_cache (ffffffff811b0697)
       => tlb_flush_mmu_free (ffffffff81199616)
       => tlb_finish_mmu (ffffffff8119a62c)
       => exit_mmap (ffffffff811a53f7)
       => mmput (ffffffff81073f47)
       => do_exit (ffffffff810794e9)
       => do_group_exit (ffffffff81079def)
       => SyS_exit_group (ffffffff81079e74)
       => entry_SYSCALL_64_fastpath (ffffffff817560b6)
      
      This output shows that problem comes from exit path.  In exit path, to
      improve performance, pages are not freed immediately.  They are gathered
      and processed by batch.  During this process, migration cannot be
      possible and CMA allocation is failed.  This problem is hard to find
      without this page reference tracepoint facility.
      
      Enabling this feature bloat kernel text 30 KB in my configuration.
      
         text    data     bss     dec     hex filename
      12127327        2243616 1507328 15878271         f2487f vmlinux_disabled
      12157208        2258880 1507328 15923416         f2f8d8 vmlinux_enabled
      
      Note that, due to header file dependency problem between mm.h and
      tracepoint.h, this feature has to open code the static key functions for
      tracepoints.  Proposed by Steven Rostedt in following link.
      
      https://lkml.org/lkml/2015/12/9/699
      
      [arnd@arndb.de: crypto/async_pq: use __free_page() instead of put_page()]
      [iamjoonsoo.kim@lge.com: fix build failure for xtensa]
      [akpm@linux-foundation.org: tweak Kconfig text, per Vlastimil]
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarMichal Nazarewicz <mina86@mina86.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Acked-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      95813b8f
    • Joonsoo Kim's avatar
      mm: introduce page reference manipulation functions · fe896d18
      Joonsoo Kim authored
      The success of CMA allocation largely depends on the success of
      migration and key factor of it is page reference count.  Until now, page
      reference is manipulated by direct calling atomic functions so we cannot
      follow up who and where manipulate it.  Then, it is hard to find actual
      reason of CMA allocation failure.  CMA allocation should be guaranteed
      to succeed so finding offending place is really important.
      
      In this patch, call sites where page reference is manipulated are
      converted to introduced wrapper function.  This is preparation step to
      add tracepoint to each page reference manipulation function.  With this
      facility, we can easily find reason of CMA allocation failure.  There is
      no functional change in this patch.
      
      In addition, this patch also converts reference read sites.  It will
      help a second step that renames page._count to something else and
      prevents later attempt to direct access to it (Suggested by Andrew).
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarMichal Nazarewicz <mina86@mina86.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fe896d18
    • Mel Gorman's avatar
      mm: thp: set THP defrag by default to madvise and add a stall-free defrag option · 444eb2a4
      Mel Gorman authored
      THP defrag is enabled by default to direct reclaim/compact but not wake
      kswapd in the event of a THP allocation failure.  The problem is that
      THP allocation requests potentially enter reclaim/compaction.  This
      potentially incurs a severe stall that is not guaranteed to be offset by
      reduced TLB misses.  While there has been considerable effort to reduce
      the impact of reclaim/compaction, it is still a high cost and workloads
      that should fit in memory fail to do so.  Specifically, a simple
      anon/file streaming workload will enter direct reclaim on NUMA at least
      even though the working set size is 80% of RAM.  It's been years and
      it's time to throw in the towel.
      
      First, this patch defines THP defrag as follows;
      
       madvise: A failed allocation will direct reclaim/compact if the application requests it
       never:   Neither reclaim/compact nor wake kswapd
       defer:   A failed allocation will wake kswapd/kcompactd
       always:  A failed allocation will direct reclaim/compact (historical behaviour)
                khugepaged defrag will enter direct/reclaim but not wake kswapd.
      
      Next it sets the default defrag option to be "madvise" to only enter
      direct reclaim/compaction for applications that specifically requested
      it.
      
      Lastly, it removes a check from the page allocator slowpath that is
      related to __GFP_THISNODE to allow "defer" to work.  The callers that
      really cares are slub/slab and they are updated accordingly.  The slab
      one may be surprising because it also corrects a comment as kswapd was
      never woken up by that path.
      
      This means that a THP fault will no longer stall for most applications
      by default and the ideal for most users that get THP if they are
      immediately available.  There are still options for users that prefer a
      stall at startup of a new application by either restoring historical
      behaviour with "always" or pick a half-way point with "defer" where
      kswapd does some of the work in the background and wakes kcompactd if
      necessary.  THP defrag for khugepaged remains enabled and will enter
      direct/reclaim but no wakeup kswapd or kcompactd.
      
      After this patch a THP allocation failure will quickly fallback and rely
      on khugepaged to recover the situation at some time in the future.  In
      some cases, this will reduce THP usage but the benefit of THP is hard to
      measure and not a universal win where as a stall to reclaim/compaction
      is definitely measurable and can be painful.
      
      The first test for this is using "usemem" to read a large file and write
      a large anonymous mapping (to avoid the zero page) multiple times.  The
      total size of the mappings is 80% of RAM and the benchmark simply
      measures how long it takes to complete.  It uses multiple threads to see
      if that is a factor.  On UMA, the performance is almost identical so is
      not reported but on NUMA, we see this
      
      usemem
                                         4.4.0                 4.4.0
                                kcompactd-v1r1         nodefrag-v1r3
      Amean    System-1       102.86 (  0.00%)       46.81 ( 54.50%)
      Amean    System-4        37.85 (  0.00%)       34.02 ( 10.12%)
      Amean    System-7        48.12 (  0.00%)       46.89 (  2.56%)
      Amean    System-12       51.98 (  0.00%)       56.96 ( -9.57%)
      Amean    System-21       80.16 (  0.00%)       79.05 (  1.39%)
      Amean    System-30      110.71 (  0.00%)      107.17 (  3.20%)
      Amean    System-48      127.98 (  0.00%)      124.83 (  2.46%)
      Amean    Elapsd-1       185.84 (  0.00%)      105.51 ( 43.23%)
      Amean    Elapsd-4        26.19 (  0.00%)       25.58 (  2.33%)
      Amean    Elapsd-7        21.65 (  0.00%)       21.62 (  0.16%)
      Amean    Elapsd-12       18.58 (  0.00%)       17.94 (  3.43%)
      Amean    Elapsd-21       17.53 (  0.00%)       16.60 (  5.33%)
      Amean    Elapsd-30       17.45 (  0.00%)       17.13 (  1.84%)
      Amean    Elapsd-48       15.40 (  0.00%)       15.27 (  0.82%)
      
      For a single thread, the benchmark completes 43.23% faster with this
      patch applied with smaller benefits as the thread increases.  Similar,
      notice the large reduction in most cases in system CPU usage.  The
      overall CPU time is
      
                     4.4.0       4.4.0
              kcompactd-v1r1 nodefrag-v1r3
      User        10357.65    10438.33
      System       3988.88     3543.94
      Elapsed      2203.01     1634.41
      
      Which is substantial. Now, the reclaim figures
      
                                       4.4.0       4.4.0
                                kcompactd-v1r1nodefrag-v1r3
      Minor Faults                 128458477   278352931
      Major Faults                   2174976         225
      Swap Ins                      16904701           0
      Swap Outs                     17359627           0
      Allocation stalls                43611           0
      DMA allocs                           0           0
      DMA32 allocs                  19832646    19448017
      Normal allocs                614488453   580941839
      Movable allocs                       0           0
      Direct pages scanned          24163800           0
      Kswapd pages scanned                 0           0
      Kswapd pages reclaimed               0           0
      Direct pages reclaimed        20691346           0
      Compaction stalls                42263           0
      Compaction success                 938           0
      Compaction failures              41325           0
      
      This patch eliminates almost all swapping and direct reclaim activity.
      There is still overhead but it's from NUMA balancing which does not
      identify that it's pointless trying to do anything with this workload.
      
      I also tried the thpscale benchmark which forces a corner case where
      compaction can be used heavily and measures the latency of whether base
      or huge pages were used
      
      thpscale Fault Latencies
                                             4.4.0                 4.4.0
                                    kcompactd-v1r1         nodefrag-v1r3
      Amean    fault-base-1      5288.84 (  0.00%)     2817.12 ( 46.73%)
      Amean    fault-base-3      6365.53 (  0.00%)     3499.11 ( 45.03%)
      Amean    fault-base-5      6526.19 (  0.00%)     4363.06 ( 33.15%)
      Amean    fault-base-7      7142.25 (  0.00%)     4858.08 ( 31.98%)
      Amean    fault-base-12    13827.64 (  0.00%)    10292.11 ( 25.57%)
      Amean    fault-base-18    18235.07 (  0.00%)    13788.84 ( 24.38%)
      Amean    fault-base-24    21597.80 (  0.00%)    24388.03 (-12.92%)
      Amean    fault-base-30    26754.15 (  0.00%)    19700.55 ( 26.36%)
      Amean    fault-base-32    26784.94 (  0.00%)    19513.57 ( 27.15%)
      Amean    fault-huge-1      4223.96 (  0.00%)     2178.57 ( 48.42%)
      Amean    fault-huge-3      2194.77 (  0.00%)     2149.74 (  2.05%)
      Amean    fault-huge-5      2569.60 (  0.00%)     2346.95 (  8.66%)
      Amean    fault-huge-7      3612.69 (  0.00%)     2997.70 ( 17.02%)
      Amean    fault-huge-12     3301.75 (  0.00%)     6727.02 (-103.74%)
      Amean    fault-huge-18     6696.47 (  0.00%)     6685.72 (  0.16%)
      Amean    fault-huge-24     8000.72 (  0.00%)     9311.43 (-16.38%)
      Amean    fault-huge-30    13305.55 (  0.00%)     9750.45 ( 26.72%)
      Amean    fault-huge-32     9981.71 (  0.00%)    10316.06 ( -3.35%)
      
      The average time to fault pages is substantially reduced in the majority
      of caseds but with the obvious caveat that fewer THPs are actually used
      in this adverse workload
      
                                         4.4.0                 4.4.0
                                kcompactd-v1r1         nodefrag-v1r3
      Percentage huge-1         0.71 (  0.00%)       14.04 (1865.22%)
      Percentage huge-3        10.77 (  0.00%)       33.05 (206.85%)
      Percentage huge-5        60.39 (  0.00%)       38.51 (-36.23%)
      Percentage huge-7        45.97 (  0.00%)       34.57 (-24.79%)
      Percentage huge-12       68.12 (  0.00%)       40.07 (-41.17%)
      Percentage huge-18       64.93 (  0.00%)       47.82 (-26.35%)
      Percentage huge-24       62.69 (  0.00%)       44.23 (-29.44%)
      Percentage huge-30       43.49 (  0.00%)       55.38 ( 27.34%)
      Percentage huge-32       50.72 (  0.00%)       51.90 (  2.35%)
      
                                       4.4.0       4.4.0
                                kcompactd-v1r1nodefrag-v1r3
      Minor Faults                  37429143    47564000
      Major Faults                      1916        1558
      Swap Ins                          1466        1079
      Swap Outs                      2936863      149626
      Allocation stalls                62510           3
      DMA allocs                           0           0
      DMA32 allocs                   6566458     6401314
      Normal allocs                216361697   216538171
      Movable allocs                       0           0
      Direct pages scanned          25977580       17998
      Kswapd pages scanned                 0     3638931
      Kswapd pages reclaimed               0      207236
      Direct pages reclaimed         8833714          88
      Compaction stalls               103349           5
      Compaction success                 270           4
      Compaction failures             103079           1
      
      Note again that while this does swap as it's an aggressive workload, the
      direct relcim activity and allocation stalls is substantially reduced.
      There is some kswapd activity but ftrace showed that the kswapd activity
      was due to normal wakeups from 4K pages being allocated.
      Compaction-related stalls and activity are almost eliminated.
      
      I also tried the stutter benchmark.  For this, I do not have figures for
      NUMA but it's something that does impact UMA so I'll report what is
      available
      
      stutter
                                       4.4.0                 4.4.0
                              kcompactd-v1r1         nodefrag-v1r3
      Min         mmap      7.3571 (  0.00%)      7.3438 (  0.18%)
      1st-qrtle   mmap      7.5278 (  0.00%)     17.9200 (-138.05%)
      2nd-qrtle   mmap      7.6818 (  0.00%)     21.6055 (-181.25%)
      3rd-qrtle   mmap     11.0889 (  0.00%)     21.8881 (-97.39%)
      Max-90%     mmap     27.8978 (  0.00%)     22.1632 ( 20.56%)
      Max-93%     mmap     28.3202 (  0.00%)     22.3044 ( 21.24%)
      Max-95%     mmap     28.5600 (  0.00%)     22.4580 ( 21.37%)
      Max-99%     mmap     29.6032 (  0.00%)     25.5216 ( 13.79%)
      Max         mmap   4109.7289 (  0.00%)   4813.9832 (-17.14%)
      Mean        mmap     12.4474 (  0.00%)     19.3027 (-55.07%)
      
      This benchmark is trying to fault an anonymous mapping while there is a
      heavy IO load -- a scenario that desktop users used to complain about
      frequently.  This shows a mix because the ideal case of mapping with THP
      is not hit as often.  However, note that 99% of the mappings complete
      13.79% faster.  The CPU usage here is particularly interesting
      
                     4.4.0       4.4.0
              kcompactd-v1r1nodefrag-v1r3
      User           67.50        0.99
      System       1327.88       91.30
      Elapsed      2079.00     2128.98
      
      And once again we look at the reclaim figures
      
                                       4.4.0       4.4.0
                                kcompactd-v1r1nodefrag-v1r3
      Minor Faults                 335241922  1314582827
      Major Faults                       715         819
      Swap Ins                             0           0
      Swap Outs                            0           0
      Allocation stalls               532723           0
      DMA allocs                           0           0
      DMA32 allocs                1822364341  1177950222
      Normal allocs               1815640808  1517844854
      Movable allocs                       0           0
      Direct pages scanned          21892772           0
      Kswapd pages scanned          20015890    41879484
      Kswapd pages reclaimed        19961986    41822072
      Direct pages reclaimed        21892741           0
      Compaction stalls              1065755           0
      Compaction success                 514           0
      Compaction failures            1065241           0
      
      Allocation stalls and all direct reclaim activity is eliminated as well
      as compaction-related stalls.
      
      THP gives impressive gains in some cases but only if they are quickly
      available.  We're not going to reach the point where they are completely
      free so lets take the costs out of the fast paths finally and defer the
      cost to kswapd, kcompactd and khugepaged where it belongs.
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      444eb2a4