1. 10 Oct, 2014 40 commits
    • Andrew Morton's avatar
      include/linux/migrate.h: remove migrate_page #define · 1c93923c
      Andrew Morton authored
      This is designed to avoid a few ifdefs in .c files but it's obnoxious
      because it can cause unsuspecting "migrate_page" symbols to get turned into
      "NULL".
      
      Just nuke it and use the ifdefs.
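
      For context, the pattern being removed looks roughly like this (a sketch
      of the !CONFIG_MIGRATION stub, not the exact header contents):

      	/* sketch: with !CONFIG_MIGRATION the symbol was defined away, so any
      	 * unsuspecting "migrate_page" identifier in a .c file silently
      	 * became NULL */
      	#ifdef CONFIG_MIGRATION
      	extern int migrate_page(struct address_space *mapping,
      			struct page *newpage, struct page *page,
      			enum migrate_mode mode);
      	#else
      	#define migrate_page NULL
      	#endif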
      
      Cc: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c93923c
    • Oleg Nesterov's avatar
      mempolicy: unexport get_vma_policy() and remove its "task" arg · dd6eecb9
      Oleg Nesterov authored
      - get_vma_policy(task) is not safe if task != current, remove this
        argument.
      
      - get_vma_policy() no longer has callers outside of mempolicy.c,
        make it static.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd6eecb9
    • Oleg Nesterov's avatar
      mempolicy: kill do_set_mempolicy()->down_write(&mm->mmap_sem) · 2c7c3a7d
      Oleg Nesterov authored
      Remove down_write(&mm->mmap_sem) in do_set_mempolicy(). This logic
      was never correct and it is no longer needed, see the previous patch.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c7c3a7d
    • Oleg Nesterov's avatar
      mempolicy: fix show_numa_map() vs exec() + do_set_mempolicy() race · 498f2371
      Oleg Nesterov authored
      9e781440 "hold task->mempolicy while numa_maps scans." fixed the
      race with the exiting task but this is not enough.
      
      The current code assumes that get_vma_policy(task) should either see
      task->mempolicy == NULL or it should be equal to ->task_mempolicy saved
      by hold_task_mempolicy(), so we can never race with __mpol_put(). But
      this can only work if we can't race with do_set_mempolicy(), and thus
      we can't race with another do_set_mempolicy() or do_exit() after that.
      
      However, do_set_mempolicy()->down_write(mmap_sem) cannot prevent this
      race. This task can exec, change its ->mm, and call do_set_mempolicy()
      after that; in this case the two callers take two different locks.
      
      Change hold_task_mempolicy() to use get_task_policy(), it never returns
      NULL, and change show_numa_map() to use __get_vma_policy() or fall back
      to proc_priv->task_mempolicy.
      
      Note: this is the minimal fix; we will clean up this code later. I think
      hold_task_mempolicy() and release_task_mempolicy() should die; we can
      move this logic into show_numa_map(). Or we can at least move
      get_task_policy() outside of ->mmap_sem and the !CONFIG_NUMA code.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      498f2371
    • Oleg Nesterov's avatar
      mempolicy: introduce __get_vma_policy(), export get_task_policy() · 74d2c3a0
      Oleg Nesterov authored
      Extract the code which looks for vma's policy from get_vma_policy()
      into the new helper, __get_vma_policy(). Export get_task_policy().
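
      Roughly, the resulting split looks like this (a sketch of the shape after
      this series, not the exact diff): __get_vma_policy() returns only the
      vma's own policy, possibly NULL, while get_vma_policy() keeps the
      fallback to the task policy:

      	/* sketch of the wrapper left in mm/mempolicy.c */
      	static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
      			unsigned long addr)
      	{
      		struct mempolicy *pol = __get_vma_policy(vma, addr);

      		if (!pol)
      			pol = get_task_policy(current);

      		return pol;
      	}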
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      74d2c3a0
    • Oleg Nesterov's avatar
      mempolicy: remove the "task" arg of vma_policy_mof() and simplify it · 6b6482bb
      Oleg Nesterov authored
      1. vma_policy_mof(task) is simply not safe unless task == current;
         it can race with do_exit()->mpol_put(). Remove this arg and update
         its single caller.
      
      2. vma cannot be NULL; remove this check and simplify the code.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b6482bb
    • Oleg Nesterov's avatar
      mempolicy: sanitize the usage of get_task_policy() · 8d90274b
      Oleg Nesterov authored
      Cleanup + preparation. Every user of get_task_policy() calls it
      unconditionally, even if it is not going to use the result.
      
      get_task_policy() is cheap, but this still does not look clean, and
      the code looks simpler if get_task_policy() is called only when it
      is really needed.
      
      Note: I hope this is correct, but it is not clear why vma_policy_mof()
      doesn't fall back to get_task_policy() if ->get_policy() returns NULL.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8d90274b
    • Oleg Nesterov's avatar
      mempolicy: change get_task_policy() to return default_policy rather than NULL · f15ca78e
      Oleg Nesterov authored
      Every caller of get_task_policy() falls back to default_policy if it
      returns NULL. Change get_task_policy() to do this.
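
      The resulting shape is roughly (a simplified sketch; the real function
      also special-cases the boot-time interleave policy):

      	/* sketch: the fallback to default_policy moves from the callers here */
      	struct mempolicy *get_task_policy(struct task_struct *p)
      	{
      		struct mempolicy *pol = p->mempolicy;

      		return pol ? pol : &default_policy;
      	}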
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f15ca78e
    • Oleg Nesterov's avatar
      mempolicy: change alloc_pages_vma() to use mpol_cond_put() · 2386740d
      Oleg Nesterov authored
      Trivial cleanup. alloc_pages_vma() can use mpol_cond_put().
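
      For reference, mpol_cond_put() wraps the conditional drop that was being
      open-coded (a sketch of the helper, as declared in mempolicy.h):

      	/* sketch: drop the refcount only for policies that need it */
      	static inline void mpol_cond_put(struct mempolicy *pol)
      	{
      		if (mpol_needs_cond_ref(pol))
      			__mpol_put(pol);
      	}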
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2386740d
    • Johannes Weiner's avatar
      mm: remove noisy remainder of the scan_unevictable interface · 1f13ae39
      Johannes Weiner authored
      The deprecation warnings for the scan_unevictable interface are triggered
      by scripts doing `sysctl -a | grep something else'.  This is annoying and
      not helpful.
      
      The interface has been defunct since 264e56d8 ("mm: disable user
      interface to manually rescue unevictable pages"), which was in 2011, and
      there haven't been any reports of use cases for it, only reports that the
      deprecation warnings are annoying.  It's unlikely that anybody is using
      this interface specifically at this point, so remove it.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f13ae39
    • Cyrill Gorcunov's avatar
      prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation · f606b77f
      Cyrill Gorcunov authored
      During development of c/r we've noticed that if we need to support user
      namespaces we face a problem with capabilities in the prctl(PR_SET_MM,
      ...) call; in particular, once a new user namespace is created,
      capable(CAP_SYS_RESOURCE) no longer passes.
      
      An approach is to eliminate the CAP_SYS_RESOURCE check but pass all new
      values in one bundle, which would allow the kernel to perform a more
      thorough sanity check of the values and at the same time allow us to
      support checkpoint/restore of user namespaces.
      
      Thus a new command, PR_SET_MM_MAP, is introduced. It takes a pointer to a
      prctl_mm_map structure which carries all the members to be updated.
      
      	prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size)
      
      	struct prctl_mm_map {
      		__u64	start_code;
      		__u64	end_code;
      		__u64	start_data;
      		__u64	end_data;
      		__u64	start_brk;
      		__u64	brk;
      		__u64	start_stack;
      		__u64	arg_start;
      		__u64	arg_end;
      		__u64	env_start;
      		__u64	env_end;
      		__u64	*auxv;
      		__u32	auxv_size;
      		__u32	exe_fd;
      	};
      
      All members except @exe_fd correspond to fields of struct mm_struct.  To
      clarify which values these members may take, here are the meanings of the
      members.
      
       - start_code, end_code: represent bounds of executable code area
       - start_data, end_data: represent bounds of data area
       - start_brk, brk: used to calculate bounds for brk() syscall
       - start_stack: used when accounting space needed for command
         line arguments, environment and shmat() syscall
       - arg_start, arg_end, env_start, env_end: represent memory area
         supplied for command line arguments and environment variables
       - auxv, auxv_size: carry the auxiliary vector (ELF format specifics)
       - exe_fd: file descriptor number for executable link (/proc/self/exe)
      
      Thus we apply the following requirements to the values:
      
      1) Any member except @auxv, @auxv_size and @exe_fd is an address in
         user space and thus must lie inside the [mmap_min_addr, mmap_max_addr)
         interval.
      
      2) While @[start|end]_code and @[start|end]_data may point to nonexistent
         VMAs (say, a program maps its own new .text and .data segments during
         execution), the rest of the members should belong to a VMA which must exist.
      
      3) Addresses must be ordered, i.e. a @start_ member must not be greater
         than or equal to the corresponding @end_ member.
      
      4) As in the regular ELF loading procedure, we require that @start_brk and
         @brk be greater than @end_data.
      
      5) If the RLIMIT_DATA rlimit is not set to infinity, the new values should
         not exceed the existing limit. The same applies to RLIMIT_STACK.
      
      6) The auxiliary vector size must not exceed the existing one (which is
         predefined as AT_VECTOR_SIZE and depends on the architecture).
      
      7) The file descriptor passed in @exe_fd should point to an executable
         file (because we use the existing prctl_set_mm_exe_file_locked helper,
         it is ensured that the file we are going to use as the exe link has all
         required permissions granted).
      
      Now about where these members are involved inside kernel code:
      
       - @start_code and @end_code are used in /proc/$pid/[stat|statm] output;
      
       - @start_data and @end_data are used in /proc/$pid/[stat|statm] output;
         they are also considered when checking whether there is enough space
         for the brk() syscall result if RLIMIT_DATA is set;
      
       - @start_brk is shown in /proc/$pid/stat output and accounted in the
         brk() syscall if RLIMIT_DATA is set; this member is also tested to
         find a symbolic name of an mmap event for the perf system (we decide
         whether the event is generated for the "heap" area); one more
         application is SELinux -- we test whether a process has the
         PROCESS__EXECHEAP permission when trying to make the heap area
         executable with the mprotect() syscall;
      
       - @brk is the current value for the brk() syscall, which lies inside the
         heap area; it is shown in /proc/$pid/stat. When the brk() syscall
         successfully provides a new memory area to user space, mm::brk is
         updated upon brk() completion to carry the new value;
      
         Both @start_brk and @brk are actively used in /proc/$pid/maps
         and /proc/$pid/smaps output to find the symbolic name "heap" for
         the VMA being scanned;
      
       - @start_stack is printed out in /proc/$pid/stat and used to
         find the symbolic name "stack" for the task and its threads in
         /proc/$pid/maps and /proc/$pid/smaps output, and, the same as
         with @start_brk, the perf system uses it for event naming.
         The kernel also treats this member as the start address of where
         to map vDSO pages and checks whether there is enough space
         for the shmat() syscall;
      
       - @arg_start, @arg_end, @env_start and @env_end are printed out
         in /proc/$pid/stat. Another way to access the data these members
         represent is to read /proc/$pid/environ or /proc/$pid/cmdline.
         The kernel tests any attempt to read these areas with the
         access_process_vm helper, so a user must have sufficient rights
         for this action;
      
       - @auxv and @auxv_size may be read from /proc/$pid/auxv. Strictly
         speaking, the kernel doesn't care much about exactly which data is
         sitting there because it is solely for userspace;
      
       - @exe_fd is referenced from /proc/$pid/exe and when generating a
         coredump. We use the prctl_set_mm_exe_file_locked helper to update
         this member, so exe-file link modification remains a one-shot
         action.
      
      Note that updating the exe-file link no longer requires the sys-resource
      capability; after all, there is not much profit in preventing a task from
      setting up its own file link (there are a number of ways to execute one's
      own code -- ptrace, ld-preload -- so the only reliable way to find out
      exactly which code is executed is to inspect the running program's
      memory).  Still, we require the caller to be at least the user-namespace
      root user.
      
      I believe the old interface should be deprecated and ripped out in a
      couple of kernel releases if no one is against it.
      
      To test whether the new interface is implemented in the kernel one can
      pass the PR_SET_MM_MAP_SIZE opcode, and the kernel returns the size of
      the currently supported struct prctl_mm_map.
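
      A rough userspace sketch of probing for the new interface (the prctl
      constant values are taken from this patch's uapi header and guarded in
      case older headers lack them; a real restore would then fill a struct
      prctl_mm_map and call prctl(PR_SET_MM, PR_SET_MM_MAP, &map, size, 0)):

      	#include <stdio.h>
      	#include <sys/prctl.h>

      	#ifndef PR_SET_MM
      	# define PR_SET_MM		35
      	#endif
      	#ifndef PR_SET_MM_MAP_SIZE
      	# define PR_SET_MM_MAP_SIZE	15
      	#endif

      	int main(void)
      	{
      		unsigned int size = 0;

      		/* kernels with this patch write back the supported struct size */
      		if (prctl(PR_SET_MM, PR_SET_MM_MAP_SIZE, &size, 0, 0) == 0)
      			printf("struct prctl_mm_map size: %u\n", size);
      		else
      			printf("PR_SET_MM_MAP is not supported\n");

      		return 0;
      	}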
      
      [akpm@linux-foundation.org: fix 80-col wordwrap in macro definitions]
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: Andrew Vagin <avagin@openvz.org>
      Tested-by: Andrew Vagin <avagin@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Julien Tinnes <jln@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f606b77f
    • Cyrill Gorcunov's avatar
      prctl: PR_SET_MM -- factor out mmap_sem when updating mm::exe_file · 71fe97e1
      Cyrill Gorcunov authored
      Instead of taking mm->mmap_sem inside prctl_set_mm_exe_file(), move it out
      and rename the helper to prctl_set_mm_exe_file_locked().  This will allow
      us to reuse this function in the next patch.
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Julien Tinnes <jln@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71fe97e1
    • Cyrill Gorcunov's avatar
      mm: use may_adjust_brk helper · 8764b338
      Cyrill Gorcunov authored
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Julien Tinnes <jln@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8764b338
    • Cyrill Gorcunov's avatar
      mm: introduce check_data_rlimit helper · 9c599024
      Cyrill Gorcunov authored
      To eliminate code duplication let's introduce the check_data_rlimit
      helper, which we will use in the brk() and prctl() syscalls.
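
      A minimal sketch of the helper's shape, assuming the semantics described
      above (compare the prospective data size against RLIMIT_DATA):

      	/* sketch: shared by brk() and prctl(PR_SET_MM*) to validate that the
      	 * new brk/data span still fits within RLIMIT_DATA */
      	static inline int check_data_rlimit(unsigned long rlim,
      					    unsigned long new,
      					    unsigned long start,
      					    unsigned long end_data,
      					    unsigned long start_data)
      	{
      		if (rlim < RLIM_INFINITY) {
      			if (((new - start) + (end_data - start_data)) > rlim)
      				return -ENOSPC;
      		}

      		return 0;
      	}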
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Julien Tinnes <jln@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c599024
    • David Rientjes's avatar
      mm, compaction: pass gfp mask to compact_control · 6d7ce559
      David Rientjes authored
      struct compact_control currently converts the gfp mask to a migratetype,
      but we need the entire gfp mask in a follow-up patch.
      
      Pass the entire gfp mask as part of struct compact_control.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d7ce559
    • David Rientjes's avatar
      mm: rename allocflags_to_migratetype for clarity · 43e7a34d
      David Rientjes authored
      The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like
      ALLOC_CPUSET) that have separate semantics.
      
      The function allocflags_to_migratetype() actually takes gfp flags, not
      alloc flags, and returns a migratetype.  Rename it to
      gfpflags_to_migratetype().
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43e7a34d
    • Vlastimil Babka's avatar
      mm, compaction: skip buddy pages by their order in the migrate scanner · 99c0fd5e
      Vlastimil Babka authored
      The migration scanner skips PageBuddy pages, but does not consider their
      order as checking page_order() is generally unsafe without holding the
      zone->lock, and acquiring the lock just for the check wouldn't be a good
      tradeoff.
      
      Still, this could avoid some iterations over the rest of the buddy page,
      and if we are careful, the race window between PageBuddy() check and
      page_order() is small, and the worst thing that can happen is that we skip
      too much and miss some isolation candidates.  This is not that bad, as
      compaction can already fail for many other reasons like parallel
      allocations, and those have a much larger race window.
      
      This patch therefore makes the migration scanner obtain the buddy page
      order and use it to skip the whole buddy page, if the order appears to be
      in the valid range.
      
      It's important that the page_order() is read only once, so that the value
      used in the checks and in the pfn calculation is the same.  But in theory
      the compiler can replace the local variable by multiple inlines of
      page_order().  Therefore, the patch introduces page_order_unsafe() that
      uses ACCESS_ONCE to prevent this.
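
      The helper is roughly (a sketch):

      	/*
      	 * Like page_order(), but for pages where PageBuddy() was observed
      	 * without holding zone->lock: read the order exactly once so the
      	 * compiler cannot re-read a value that may change under us.
      	 */
      	static inline unsigned int page_order_unsafe(struct page *page)
      	{
      		return ACCESS_ONCE(page_private(page));
      	}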
      
      Testing with stress-highalloc from mmtests shows a 15% reduction in number
      of pages scanned by migration scanner.  The reduction is >60% with
      __GFP_NO_KSWAPD allocations, along with success rates better by few
      percent.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      99c0fd5e
    • Vlastimil Babka's avatar
      mm, compaction: remember position within pageblock in free pages scanner · e14c720e
      Vlastimil Babka authored
      Unlike the migration scanner, the free scanner remembers the beginning of
      the last scanned pageblock in cc->free_pfn.  It might therefore be
      rescanning pages uselessly when called several times during a single
      compaction.  This might have been useful when pages were returned to the
      buddy allocator after a failed migration, but this is no longer the case.
      
      This patch changes the meaning of cc->free_pfn so that if it points to the
      middle of a pageblock, that pageblock is scanned only from cc->free_pfn to
      the end.  isolate_freepages_block() will record the pfn of the last page
      it looked at, which is then used to update cc->free_pfn.
      
      In the mmtests stress-highalloc benchmark, this has resulted in lowering
      the ratio between pages scanned by both scanners, from 2.5 free pages per
      migrate page, to 2.25 free pages per migrate page, without affecting
      success rates.
      
      With __GFP_NO_KSWAPD allocations, this appears to result in a worse ratio
      (2.1 instead of 1.8), but page migration successes increased by 10%, so
      this could mean that more useful work can be done until need_resched()
      aborts this kind of compaction.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e14c720e
    • Vlastimil Babka's avatar
      mm, compaction: skip rechecks when lock was already held · 69b7189f
      Vlastimil Babka authored
      Compaction scanners try to take the zone locks as late as possible by
      checking many page or pageblock properties opportunistically without lock
      and skipping them if unsuitable.  For pages that pass the initial checks,
      some properties have to be checked again safely under lock.  However, if
      the lock was already held from a previous iteration in the initial checks,
      the rechecks are unnecessary.
      
      This patch therefore skips the rechecks when the lock was already held.
      This is now possible to do, since we don't (potentially) drop and
      reacquire the lock between the initial checks and the safe rechecks
      anymore.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69b7189f
    • Vlastimil Babka's avatar
      mm, compaction: periodically drop lock and restore IRQs in scanners · 8b44d279
      Vlastimil Babka authored
      Compaction scanners regularly check for lock contention and need_resched()
      through the compact_checklock_irqsave() function.  However, if there is no
      contention, the lock can be held and IRQs disabled for a potentially long
      time.
      
      This has been addressed by commit b2eef8c0 ("mm: compaction: minimise
      the time IRQs are disabled while isolating pages for migration") for the
      migration scanner.  However, the refactoring done by commit 2a1402aa
      ("mm: compaction: acquire the zone->lru_lock as late as possible") has
      changed the conditions so that the lock is dropped only when there's
      contention on the lock or need_resched() is true.  Also, need_resched() is
      checked only when the lock is already held.  The comment "give a chance to
      irqs before checking need_resched" is therefore misleading, as IRQs remain
      disabled when the check is done.
      
      This patch restores the behavior intended by commit b2eef8c0 and also
      tries to better balance and make more deterministic the time spent by
      checking for contention vs the time the scanners might run between the
      checks.  It also avoids situations where checking has not been done often
      enough before.  The result should be avoiding both too frequent and too
      infrequent contention checking, and especially the potentially
      long-running scans with IRQs disabled and no checking of need_resched() or
      for fatal signal pending, which can happen when many consecutive pages or
      pageblocks fail the preliminary tests and do not reach the later call site
      to compact_checklock_irqsave(), as explained below.
      
      Before the patch:
      
      In the migration scanner, compact_checklock_irqsave() was called each
      loop, if reached.  If not reached, some lower-frequency checking could
      still be done if the lock was already held, but this would not result in
      aborting contended async compaction until reaching
      compact_checklock_irqsave() or end of pageblock.  In the free scanner, it
      was similar but completely without the periodical checking, so lock can be
      potentially held until reaching the end of pageblock.
      
      After the patch, in both scanners:
      
      The periodical check is done as the first thing in the loop on each
      SWAP_CLUSTER_MAX aligned pfn, using the new compact_unlock_should_abort()
      function, which always unlocks the lock (if locked) and aborts async
      compaction if scheduling is needed.  It also aborts any type of compaction
      when a fatal signal is pending.
      
      The compact_checklock_irqsave() function is replaced with a slightly
      different compact_trylock_irqsave().  The biggest difference is that the
      function is not called at all if the lock is already held.  The periodical
      need_resched() checking is left solely to compact_unlock_should_abort().
      The lock contention avoidance for async compaction is achieved by the
      periodical unlock by compact_unlock_should_abort() and by using trylock in
      compact_trylock_irqsave() and aborting when trylock fails.  Sync
      compaction does not use trylock.
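
      A condensed sketch of the new periodic check (simplified; the real
      function also deals with the saved IRQ flags bookkeeping):

      	/* sketch: called on each SWAP_CLUSTER_MAX aligned pfn by both scanners */
      	static bool compact_unlock_should_abort(spinlock_t *lock,
      			unsigned long flags, bool *locked,
      			struct compact_control *cc)
      	{
      		if (*locked) {
      			spin_unlock_irqrestore(lock, flags);
      			*locked = false;
      		}

      		if (fatal_signal_pending(current)) {
      			cc->contended = true;
      			return true;
      		}

      		if (need_resched()) {
      			if (cc->mode == MIGRATE_ASYNC) {
      				cc->contended = true;
      				return true;
      			}
      			cond_resched();
      		}

      		return false;
      	}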
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b44d279
    • Vlastimil Babka's avatar
      mm, compaction: khugepaged should not give up due to need_resched() · 1f9efdef
      Vlastimil Babka authored
      Async compaction aborts when it detects zone lock contention or
      need_resched() is true.  David Rientjes has reported that in practice,
      most direct async compactions for THP allocation abort due to
      need_resched().  This means that a second direct compaction is never
      attempted, which might be OK for a page fault, but khugepaged is intended
      to attempt a sync compaction in such case and in these cases it won't.
      
      This patch replaces "bool contended" in compact_control with an int that
      distinguishes between aborting due to need_resched() and aborting due to
      lock contention.  This allows propagating the abort through all compaction
      functions as before, but passing the abort reason up to
      __alloc_pages_slowpath() which decides when to continue with direct
      reclaim and another compaction attempt.
      
      Another problem is that try_to_compact_pages() did not act upon the
      reported contention (either need_resched() or lock contention) immediately
      and would proceed with another zone from the zonelist.  When
      need_resched() is true, that means initializing another zone compaction,
      only to check need_resched() again in isolate_migratepages() and abort.
      For zone lock contention, the unintended consequence is that the lock
      contended status reported back to the allocator is determined from the last
      zone where compaction was attempted, which is rather arbitrary.
      
      This patch fixes the problem in the following way:
      - async compaction of a zone aborting due to need_resched() or fatal signal
        pending means that further zones should not be tried. We report
        COMPACT_CONTENDED_SCHED to the allocator.
      - aborting zone compaction due to lock contention means we can still try
        another zone, since it has a different set of locks. We report back
        COMPACT_CONTENDED_LOCK only if compaction was aborted due to lock
        contention in *all* zones where it was attempted.
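
      The tri-state replacing the bool looks roughly like this (the _SCHED and
      _LOCK names are from this patch; the NONE value is an assumed default):

      	/* sketch of the contention reason propagated back to the allocator */
      	enum compact_contended {
      		COMPACT_CONTENDED_NONE = 0,	/* no contention detected */
      		COMPACT_CONTENDED_SCHED,	/* need_resched() or fatal signal */
      		COMPACT_CONTENDED_LOCK,		/* zone lock or lru_lock contention */
      	};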
      
      As a result of these fixes, khugepaged will proceed with second sync
      compaction as intended, when the preceding async compaction aborted due to
      need_resched().  Page fault compactions aborting due to need_resched()
      will spare some cycles previously wasted by initializing another zone
      compaction only to abort again.  Lock contention will be reported only
      when compaction in all zones aborted due to lock contention, and therefore
      it's not a good idea to try again after reclaim.
      
      In stress-highalloc from mmtests configured to use __GFP_NO_KSWAPD, this
      has improved number of THP collapse allocations by 10%, which shows
      positive effect on khugepaged.  The benchmark's success rates are
      unchanged as it is not recognized as khugepaged.  Numbers of compact_stall
      and compact_fail events have however decreased by 20%, with
      compact_success still a bit improved, which is good.  With benchmark
      configured not to use __GFP_NO_KSWAPD, there is 6% improvement in THP
      collapse allocations, and only slight improvement in stalls and failures.
      
      [akpm@linux-foundation.org: fix warnings]
      Reported-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f9efdef
    • Vlastimil Babka's avatar
      mm, compaction: reduce zone checking frequency in the migration scanner · 7d49d886
      Vlastimil Babka authored
      The unification of the migrate and free scanner families of functions has
      highlighted a difference in how the scanners ensure they only isolate
      pages of the intended zone.  This is important for taking zone lock or lru
      lock of the correct zone.  Due to nodes overlapping, it is however
      possible to encounter a different zone within the range of the zone being
      compacted.
      
      The free scanner, since its inception by commit 748446bb ("mm:
      compaction: memory compaction core"), has been checking the zone of the
      first valid page in a pageblock, and skipping the whole pageblock if the
      zone does not match.
      
      This checking was completely missing from the migration scanner at first,
      and later added by commit dc908600 ("mm: compaction: check for
      overlapping nodes during isolation for migration") in a reaction to a bug
      report.  But the zone comparison in the migration scanner is done once per
      scanned page, which is more defensive and thus more costly than a
      check per pageblock.
      
      This patch unifies the checking done in both scanners to once per
      pageblock, through a new pageblock_pfn_to_page() function, which also
      includes pfn_valid() checks.  It is more defensive than the current free
      scanner checks, as it checks both the first and last page of the
      pageblock, but less defensive than the migration scanner's per-page checks.
      It assumes that node overlapping may result (on some architecture) in a
      boundary between two nodes falling into the middle of a pageblock, but
      that there cannot be a node0 node1 node0 interleaving within a single
      pageblock.
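
      A rough sketch of the new helper described above:

      	/* sketch: return the first page of the range only if both ends are
      	 * valid and belong to the zone being compacted, otherwise NULL so
      	 * the caller skips the whole pageblock */
      	static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
      			unsigned long end_pfn, struct zone *zone)
      	{
      		struct page *start_page, *end_page;

      		if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn - 1))
      			return NULL;

      		start_page = pfn_to_page(start_pfn);
      		end_page = pfn_to_page(end_pfn - 1);

      		if (page_zone(start_page) != zone ||
      		    page_zone(start_page) != page_zone(end_page))
      			return NULL;

      		return start_page;
      	}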
      
      The result is more code being shared and a bit less per-page CPU cost in
      the migration scanner.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d49d886
    • Vlastimil Babka's avatar
      mm, compaction: move pageblock checks up from isolate_migratepages_range() · edc2ca61
      Vlastimil Babka authored
      isolate_migratepages_range() is the main function of the compaction
      scanner, called either on a single pageblock by isolate_migratepages()
      during regular compaction, or on an arbitrary range by CMA's
      __alloc_contig_migrate_range().  It currently performs two pageblock-wide
      compaction suitability checks, and because of the CMA callpath, it tracks
      whether it crossed a pageblock boundary in order to repeat those checks.
      
      However, closer inspection shows that those checks are always true for CMA:
      - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
      - migrate_async_suitable() check is skipped because CMA uses sync compaction
      
      We can therefore move the compaction-specific checks to
      isolate_migratepages() and simplify isolate_migratepages_range().
      Furthermore, we can mimic the freepage scanner family of functions, which
      has isolate_freepages_block() function called both by compaction from
      isolate_freepages() and by CMA from isolate_freepages_range(), where each
      use-case adds own specific glue code.  This allows further code
      simplification.
      
      Thus, we rename isolate_migratepages_range() to
      isolate_migratepages_block() and limit its functionality to a single
      pageblock (or its subset).  For CMA, a new different
      isolate_migratepages_range() is created as a CMA-specific wrapper for the
      _block() function.  The checks specific to compaction are moved to
      isolate_migratepages().  As part of the unification of these two families
      of functions, we remove the redundant zone parameter where applicable,
      since zone pointer is already passed in cc->zone.
      
      Furthermore, going back to compact_zone() and compact_finished() when
      pageblock is found unsuitable (now by isolate_migratepages()) is wasteful
      - the checks are meant to skip pageblocks quickly.  The patch therefore
      also introduces a simple loop into isolate_migratepages() so that it does
      not return immediately on failed pageblock checks, but keeps going until
      isolate_migratepages_range() gets called once.  Similarly to
      isolate_freepages(), the function periodically checks whether it needs to
      reschedule or abort async compaction.
      
      [iamjoonsoo.kim@lge.com: fix isolated page counting bug in compaction]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      edc2ca61
    • Vlastimil Babka's avatar
      mm, compaction: do not recheck suitable_migration_target under lock · f8224aa5
      Vlastimil Babka authored
      isolate_freepages_block() rechecks if the pageblock is suitable to be a
      target for migration after it has taken the zone->lock.  However, the
      check has been optimized to occur only once per pageblock, and
      compact_checklock_irqsave() might be dropping and reacquiring lock, which
      means somebody else might have changed the pageblock's migratetype
      meanwhile.
      
      Furthermore, nothing prevents the migratetype from changing right after
      isolate_freepages_block() has finished isolating.  Given how imperfect
      this is, it's simpler to just rely on the check done in
      isolate_freepages() without lock, and not pretend that the recheck under
      lock guarantees anything.  It is just a heuristic after all.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f8224aa5
    • Vlastimil Babka's avatar
      mm, compaction: do not count compact_stall if all zones skipped compaction · 98dd3b48
      Vlastimil Babka authored
      The compact_stall vmstat counter counts the number of allocations stalled
      by direct compaction.  It does not count when all attempted zones had
      deferred compaction, but it does count when all zones skipped compaction.
      The skipping is decided based on very early check of
      compaction_suitable(), based on watermarks and memory fragmentation.
      Therefore it makes sense not to count skipped compactions as stalls.
      Moreover, compact_success or compact_fail is also already not being
      counted when compaction was skipped, so this patch changes the
      compact_stall counting to match the other two.
      
      Additionally, restructure __alloc_pages_direct_compact() code for better
      readability.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      98dd3b48
    • Vlastimil Babka's avatar
      mm, compaction: defer each zone individually instead of preferred zone · 53853e2d
      Vlastimil Babka authored
      When direct sync compaction is often unsuccessful, it may become deferred
      for some time to avoid further useless attempts, both sync and async.
      Successful high-order allocations un-defer compaction, while further
      unsuccessful compaction attempts prolong the compaction deferred period.
      
      Currently the checking and setting deferred status is performed only on
      the preferred zone of the allocation that invoked direct compaction.  But
      compaction itself is attempted on all eligible zones in the zonelist, so
      the behavior is suboptimal and may lead both to scenarios where 1)
      compaction is attempted uselessly, or 2) where it's not attempted despite
      good chances of succeeding, as shown on the examples below:
      
      1) A direct compaction with Normal preferred zone failed and set
         deferred compaction for the Normal zone.  Another unrelated direct
         compaction with DMA32 as preferred zone will attempt to compact DMA32
         zone even though the first compaction attempt also included DMA32 zone.
      
         In another scenario, compaction with Normal preferred zone failed to
         compact Normal zone, but succeeded in the DMA32 zone, so it will not
         defer compaction.  In the next attempt, it will try Normal zone which
         will fail again, instead of skipping Normal zone and trying DMA32
         directly.
      
      2) Kswapd will balance DMA32 zone and reset defer status based on
         watermarks looking good.  A direct compaction with preferred Normal
         zone will skip compaction of all zones including DMA32 because Normal
         was still deferred.  The allocation might have succeeded in DMA32, but
         won't.
      
      This patch makes compaction deferring work on individual zone basis
      instead of preferred zone.  For each zone, it checks compaction_deferred()
      to decide if the zone should be skipped.  If watermarks fail after
      compacting the zone, defer_compaction() is called.  The zone where
      watermarks passed can still be deferred when the allocation attempt is
      unsuccessful.  When allocation is successful, compaction_defer_reset() is
      called for the zone containing the allocated page.  This approach should
      approximate calling defer_compaction() only on zones where compaction was
      attempted and did not yield an allocated page.  There might be corner cases,
      but that is inevitable as long as the decision to stop compacting does not
      guarantee that a page will be allocated.
      
      Due to a new COMPACT_DEFERRED return value, some functions relying
      implicitly on COMPACT_SKIPPED = 0 had to be updated, with comments made
      more accurate.  The did_some_progress output parameter of
      __alloc_pages_direct_compact() is removed completely, as the caller
      actually does not use it after compaction sets it - it is only considered
      when direct reclaim sets it.
      
      During testing on a two-node machine with a single very small Normal zone
      on node 1, this patch has improved success rates in stress-highalloc
      mmtests benchmark.  The success rates here were previously made worse by commit
      3a025760 ("mm: page_alloc: spill to remote nodes before waking
      kswapd") as kswapd was no longer resetting often enough the deferred
      compaction for the Normal zone, and DMA32 zones on both nodes were thus
      not considered for compaction.  On different machine, success rates were
      improved with __GFP_NO_KSWAPD allocations.
      
      [akpm@linux-foundation.org: fix CONFIG_COMPACTION=n build]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      53853e2d
    • Vlastimil Babka's avatar
      mm, THP: don't hold mmap_sem in khugepaged when allocating THP · 8b164568
      Vlastimil Babka authored
      When allocating huge page for collapsing, khugepaged currently holds
      mmap_sem for reading on the mm where collapsing occurs.  Afterwards the
      read lock is dropped before write lock is taken on the same mmap_sem.
      
      Holding mmap_sem during the whole huge page allocation is therefore
      useless; the vma needs to be rechecked after taking the write lock anyway.
      Furthermore, huge page allocation might involve a rather long sync
      compaction, and thus block any mmap_sem writers and, e.g., affect workloads
      that perform frequent m(un)map or mprotect operations.
      
      This patch simply releases the read lock before allocating a huge page.
      It also deletes an outdated comment that assumed vma must be stable, as it
      was using alloc_hugepage_vma().  This is no longer true since commit
      9f1b868a ("mm: thp: khugepaged: add policy for finding target node").
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b164568
    • Akinobu Mita's avatar
      block_dev: implement readpages() to optimize sequential read · 447f05bb
      Akinobu Mita authored
      Sequential read from a block device is expected to be as fast as, or faster
      than, reading the file on a filesystem.  But this is not the case, due to
      the lack of an effective readpages() in the address space operations for
      block devices.
      
      This implements the readpages() operation for block devices by using
      mpage_readpages(), which can create multipage BIOs instead of a BIO per
      page and reduces system CPU time consumption.
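
      The new address_space operation is essentially a thin wrapper (a sketch,
      assuming the existing blkdev_get_block() helper in fs/block_dev.c):

      	/* sketch: let the mpage code build multipage BIOs for block devices */
      	static int blkdev_readpages(struct file *file,
      			struct address_space *mapping, struct list_head *pages,
      			unsigned nr_pages)
      	{
      		return mpage_readpages(mapping, pages, nr_pages, blkdev_get_block);
      	}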
      
      Install 1GB of RAM disk storage:
      
      	# modprobe scsi_debug dev_size_mb=1024 delay=0
      
      Sequential read from file on a filesystem:
      
      	# mkfs.ext4 /dev/$DEV
      	# mount /dev/$DEV /mnt
      	# fio --name=t --size=512m --rw=read --filename=/mnt/file
      	...
      	  read : io=524288KB, bw=2133.4MB/s, iops=546133, runt=   240msec
      
      Sequential read from a block device:
      	# fio --name=t --size=512m --rw=read --filename=/dev/$DEV
      	...
      (Without this commit)
      	  read : io=524288KB, bw=1700.2MB/s, iops=435455, runt=   301msec
      
      (With this commit)
      	  read : io=524288KB, bw=2160.4MB/s, iops=553046, runt=   237msec
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      447f05bb
    • Akinobu Mita's avatar
      vfs: guard end of device for mpage interface · 4db96b71
      Akinobu Mita authored
      Add guard_bio_eod() check for mpage code in order to allow us to do IO
      even on the odd last sectors of a device, even if the block size is some
      multiple of the physical sector size.
      
      Using mpage_readpages() for block device requires this guard check.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4db96b71
    • Akinobu Mita's avatar
      vfs: make guard_bh_eod() more generic · 59d43914
      Akinobu Mita authored
      This patchset implements the readpages() operation for block devices by
      using mpage_readpages(), which can create multipage BIOs instead of a BIO
      per page and so reduces system CPU time consumption.
      
      This patch (of 3):
      
      guard_bh_eod() is used in submit_bh() to allow us to do IO even on the odd
      last sectors of a device, even if the block size is some multiple of the
      physical sector size.  This patch makes guard_bh_eod() more generic and
      renames it guard_bio_eod() so that it can be used without a struct
      buffer_head argument.

      The reason for this change is that using mpage_readpages() for a block
      device requires adding this guard check to the mpage code.
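      Conceptually the change just drops the buffer_head from the helper's
      signature so that callers which only have a BIO can use it; roughly (the
      exact prototypes are from memory and may differ):

      	/* Before: tied to the buffer_head submission path in fs/buffer.c. */
      	static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh);

      	/* After: usable by any BIO submitter, e.g. the mpage code. */
      	void guard_bio_eod(int rw, struct bio *bio);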
      Signed-off-by: default avatarAkinobu Mita <akinobu.mita@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      59d43914
    • Vlastimil Babka's avatar
      mm: page_alloc: determine migratetype only once · 21bb9bd1
      Vlastimil Babka authored
      The check for ALLOC_CMA in __alloc_pages_nodemask() derives the
      migratetype from gfp_mask on each retry pass, although the migratetype
      variable already holds the determined value and it does not change.  Use
      the variable and perform the check only once.  Also convert the
      #ifdef CONFIG_CMA block to IS_ENABLED(CONFIG_CMA).
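      The resulting pattern is roughly the following (a simplified sketch of the
      allocator logic; the gfp-to-migratetype helper name is hedged, as it was
      being renamed around the same release):

      	/* migratetype is derived from gfp_mask once, before the retry loop */
      	int migratetype = gfpflags_to_migratetype(gfp_mask);

      	/* ... later, no #ifdef and no re-derivation on each retry: */
      	if (IS_ENABLED(CONFIG_CMA) && migratetype == MIGRATE_MOVABLE)
      		alloc_flags |= ALLOC_CMA;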
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      21bb9bd1
    • Marek Szyprowski's avatar
      ARM: mm: don't limit default CMA region only to low memory · 95b0e655
      Marek Szyprowski authored
      DMA-mapping supports CMA regions placed either in low or high memory, so
      there is no longer any need to limit the default CMA region to low memory.
      The real limit is still defined by the architecture-specific DMA limit.
      Signed-off-by: default avatarMarek Szyprowski <m.szyprowski@samsung.com>
      Reported-by: default avatarRussell King - ARM Linux <linux@arm.linux.org.uk>
      Acked-by: default avatarMichal Nazarewicz <mina86@mina86.com>
      Cc: Daniel Drake <drake@endlessm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      95b0e655
    • Marek Szyprowski's avatar
      mm: cma: adjust address limit to avoid hitting low/high memory boundary · f7426b98
      Marek Szyprowski authored
      Russell King recently noticed that limiting the default CMA region to low
      memory on the ARM architecture causes serious memory management issues on
      machines with a lot of memory (most of which is available as high memory).
      More information can be found in the following thread:
      http://thread.gmane.org/gmane.linux.ports.arm.kernel/348441/

      These two patches remove this limit, letting the kernel put the default
      CMA region into high memory when possible (i.e. when there is enough high
      memory available and the architecture-specific DMA limit allows it).

      This should solve strange OOM issues on systems with lots of RAM (i.e.
      >1GiB) and a large (>256M) CMA area.
      
      This patch (of 2):
      
      Automatically allocated regions should not cross the low/high memory
      boundary, because such regions cannot later be correctly initialized, as
      they would span two memory zones.  This patch adds a check for this case
      and simple code for moving the region to low memory when the automatically
      selected address would not fit completely into high memory.
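      A simplified, hedged sketch of the added check in cma_declare_contiguous()
      (the real code also handles a caller-fixed base address):

      	phys_addr_t highmem_start = __pa(high_memory);

      	if (!fixed && limit > highmem_start && limit - size < highmem_start) {
      		/*
      		 * A top-down allocation near 'limit' could straddle the
      		 * low/high boundary and span two zones; pull the limit down
      		 * so the region lands entirely in low memory instead.
      		 */
      		limit = highmem_start;
      	}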
      Signed-off-by: default avatarMarek Szyprowski <m.szyprowski@samsung.com>
      Acked-by: default avatarMichal Nazarewicz <mina86@mina86.com>
      Cc: Daniel Drake <drake@endlessm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f7426b98
    • Laura Abbott's avatar
      arm64: add atomic pool for non-coherent and CMA allocations · d4932f9e
      Laura Abbott authored
      Neither CMA nor noncoherent allocations support atomic allocations.
      Add a dedicated atomic pool to support this.
      Reviewed-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarLaura Abbott <lauraa@codeaurora.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: David Riley <davidriley@chromium.org>
      Cc: Olof Johansson <olof@lixom.net>
      Cc: Ritesh Harjain <ritesh.harjani@gmail.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Thierry Reding <thierry.reding@gmail.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d4932f9e
    • Laura Abbott's avatar
      arm: use genalloc for the atomic pool · 36d0fd21
      Laura Abbott authored
      ARM currently uses a bitmap for tracking atomic allocations.  genalloc
      already handles this type of memory pool allocation so switch to using
      that instead.
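      The genalloc API maps naturally onto what the bitmap was doing; a hedged
      sketch of pool setup and an atomic allocation (pool_vaddr, pool_paddr and
      pool_size are placeholders for the pre-reserved, pre-remapped atomic pool
      backing memory):

      	#include <linux/genalloc.h>

      	static struct gen_pool *atomic_pool;

      	static int __init atomic_pool_init(void)
      	{
      		/* Page-sized allocation granularity, no NUMA node preference. */
      		atomic_pool = gen_pool_create(PAGE_SHIFT, -1);
      		if (!atomic_pool)
      			return -ENOMEM;

      		return gen_pool_add_virt(atomic_pool, (unsigned long)pool_vaddr,
      					 pool_paddr, pool_size, -1);
      	}

      	/* Usable from atomic context, replacing the old bitmap search. */
      	static void *atomic_pool_alloc(size_t size, dma_addr_t *handle)
      	{
      		unsigned long vaddr = gen_pool_alloc(atomic_pool, size);

      		if (!vaddr)
      			return NULL;

      		*handle = gen_pool_virt_to_phys(atomic_pool, vaddr);
      		return (void *)vaddr;
      	}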
      Signed-off-by: default avatarLaura Abbott <lauraa@codeaurora.org>
      Reviewed-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: David Riley <davidriley@chromium.org>
      Cc: Olof Johansson <olof@lixom.net>
      Cc: Ritesh Harjain <ritesh.harjani@gmail.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Thierry Reding <thierry.reding@gmail.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      36d0fd21
    • Laura Abbott's avatar
      common: dma-mapping: introduce common remapping functions · 513510dd
      Laura Abbott authored
      For architectures without coherent DMA, memory for DMA may need to be
      remapped with coherent attributes.  Factor out the remapping code from arm
      and put it in a common location to reduce code duplication.

      As part of this, the arm APIs are migrated away from ioremap_page_range to
      the common APIs, which use map_vm_area for remapping.  This should be an
      equivalent change, and using map_vm_area is more correct, as
      ioremap_page_range is intended to bring I/O addresses into the CPU address
      space rather than regular kernel-managed memory.
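      As a hedged sketch, the common helpers look roughly like this (exact
      prototypes and header placement are from memory and may differ):

      	/* Remap a physically contiguous allocation with the given attributes. */
      	void *dma_common_contiguous_remap(struct page *page, size_t size,
      					  unsigned long vm_flags, pgprot_t prot,
      					  const void *caller);

      	/* Remap an array of (possibly discontiguous) pages via map_vm_area(). */
      	void *dma_common_pages_remap(struct page **pages, size_t size,
      				     unsigned long vm_flags, pgprot_t prot,
      				     const void *caller);

      	void dma_common_free_remap(void *cpu_addr, size_t size, unsigned long vm_flags);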
      Signed-off-by: default avatarLaura Abbott <lauraa@codeaurora.org>
      Reviewed-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: David Riley <davidriley@chromium.org>
      Cc: Olof Johansson <olof@lixom.net>
      Cc: Ritesh Harjain <ritesh.harjani@gmail.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Thierry Reding <thierry.reding@gmail.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Mitchel Humpherys <mitchelh@codeaurora.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      513510dd
    • Laura Abbott's avatar
      lib/genalloc.c: add genpool range check function · 9efb3a42
      Laura Abbott authored
      After allocating an address from a particular genpool, there is no good
      way to verify whether that address actually belongs to a genpool.
      Introduce addr_in_gen_pool(), which returns whether an address plus size
      falls completely within the genpool range.
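      This lets callers such as the DMA atomic-pool code decide whether an
      address should be freed back to the pool; a small hedged usage sketch:

      	#include <linux/genalloc.h>

      	/* Free 'cpu_addr' back to the pool only if it really came from it. */
      	static bool atomic_pool_free(struct gen_pool *pool, void *cpu_addr, size_t size)
      	{
      		if (!addr_in_gen_pool(pool, (unsigned long)cpu_addr, size))
      			return false;

      		gen_pool_free(pool, (unsigned long)cpu_addr, size);
      		return true;
      	}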
      Signed-off-by: default avatarLaura Abbott <lauraa@codeaurora.org>
      Acked-by: default avatarWill Deacon <will.deacon@arm.com>
      Reviewed-by: default avatarOlof Johansson <olof@lixom.net>
      Reviewed-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: David Riley <davidriley@chromium.org>
      Cc: Ritesh Harjain <ritesh.harjani@gmail.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Thierry Reding <thierry.reding@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9efb3a42
    • Laura Abbott's avatar
      lib/genalloc.c: add power aligned algorithm · 505e3be6
      Laura Abbott authored
      One of the more common algorithms used for allocation is to align the
      start address of the allocation to the order of the size requested.  Add
      this as an algorithm option for genalloc.
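      Usage is a matter of selecting the algorithm on the pool; a hedged sketch:

      	#include <linux/genalloc.h>
      	#include <linux/sizes.h>

      	static unsigned long pool_alloc_order_aligned(struct gen_pool *pool, size_t size)
      	{
      		/*
      		 * Align each allocation to the order of its size, e.g. a 16KB
      		 * request comes back 16KB-aligned.
      		 */
      		gen_pool_set_algo(pool, gen_pool_first_fit_order_align, NULL);

      		return gen_pool_alloc(pool, size);
      	}

      	/* e.g.: addr = pool_alloc_order_aligned(pool, SZ_16K); */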
      Signed-off-by: default avatarLaura Abbott <lauraa@codeaurora.org>
      Acked-by: default avatarWill Deacon <will.deacon@arm.com>
      Acked-by: default avatarOlof Johansson <olof@lixom.net>
      Reviewed-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: David Riley <davidriley@chromium.org>
      Cc: Ritesh Harjain <ritesh.harjani@gmail.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Thierry Reding <thierry.reding@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      505e3be6
    • Mel Gorman's avatar
      mm: remove misleading ARCH_USES_NUMA_PROT_NONE · 6a33979d
      Mel Gorman authored
      ARCH_USES_NUMA_PROT_NONE was defined for architectures that implemented
      _PAGE_NUMA using _PROT_NONE.  This saved using an additional PTE bit and
      relied on the fact that PROT_NONE vmas were skipped by the NUMA hinting
      fault scanner.  This was found to be conceptually confusing with a lot of
      implicit assumptions and it was asked that an alternative be found.
      
      Commit c46a7c81 ("x86: define _PAGE_NUMA by reusing software bits on the
      PMD and PTE levels") redefined _PAGE_NUMA on x86 to be one of the swap PTE
      bits and shrank the maximum possible swap size, but it did not go far
      enough.  There are no architectures that reuse _PROT_NONE as _PROT_NUMA,
      but the relics still exist.
      
      This patch removes ARCH_USES_NUMA_PROT_NONE and removes some unnecessary
      duplication between powerpc and the generic implementation by mapping the
      types the core NUMA helpers expect (originally defined for x86) onto their
      ppc64 equivalents.  This necessitated creating a PTE bit mask that
      identifies the bits distinguishing present from NUMA pte entries, but it
      is expected this will only differ between arches based on _PAGE_PROTNONE.
      The naming of the generic helpers was taken from x86 originally, but ppc64
      has types that are equivalent for the purposes of the helpers, so they are
      mapped instead of duplicating code.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6a33979d
    • Zhang Zhen's avatar
      memory-hotplug: add sysfs valid_zones attribute · ed2f2400
      Zhang Zhen authored
      Currently memory-hotplug has two limits:
      
      1. If the memory block is in ZONE_NORMAL, you can change it to
         ZONE_MOVABLE, but this memory block must be adjacent to ZONE_MOVABLE.
      
      2. If the memory block is in ZONE_MOVABLE, you can change it to
         ZONE_NORMAL, but this memory block must be adjacent to ZONE_NORMAL.
      
      With this patch, we can easily see which zone a memory block can be
      onlined to, without needing to know the above two limits.
      
      Updated the related Documentation.
      
      [akpm@linux-foundation.org: use conventional comment layout]
      [akpm@linux-foundation.org: fix build with CONFIG_MEMORY_HOTREMOVE=n]
      [akpm@linux-foundation.org: remove unused local zone_prev]
      Signed-off-by: default avatarZhang Zhen <zhenzhang.zhang@huawei.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Wang Nan <wangnan0@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ed2f2400