1. 10 Oct, 2014 40 commits
    • Junxiao Bi's avatar
      mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set · 934f3072
      Junxiao Bi authored
      commit 21caf2fc ("mm: teach mm by current context info to not do I/O
      during memory allocation") introduces PF_MEMALLOC_NOIO flag to avoid doing
      I/O inside memory allocation, __GFP_IO is cleared when this flag is set,
      but __GFP_FS implies __GFP_IO, it should also be cleared.  Or it may still
      run into I/O, like in superblock shrinker.  And this will make the kernel
      run into the deadlock case described in that commit.
      
      See Dave Chinner's comment about io in superblock shrinker:
      
      Filesystem shrinkers do indeed perform IO from the superblock shrinker and
      have for years.  Even clean inodes can require IO before they can be freed
      - e.g.  on an orphan list, need truncation of post-eof blocks, need to
      wait for ordered operations to complete before it can be freed, etc.
      
      IOWs, Ext4, btrfs and XFS all can issue and/or block on arbitrary amounts
      of IO in the superblock shrinker context.  XFS, in particular, has been
      doing transactions and IO from the VFS inode cache shrinker since it was
      first introduced....
      
      Fix this by clearing __GFP_FS in memalloc_noio_flags(), this function has
      masked all the gfp_mask that will be passed into fs for the processes
      setting PF_MEMALLOC_NOIO in the direct reclaim path.
      
      v1 thread at: https://lkml.org/lkml/2014/9/3/32Signed-off-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: joyce.xue <xuejiufei@huawei.com>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      934f3072
    • Xiubo Li's avatar
      mm/compaction.c: fix warning of 'flags' may be used uninitialized · b8b2d825
      Xiubo Li authored
      C      mm/compaction.o
      mm/compaction.c: In function isolate_freepages_block:
      mm/compaction.c:364:37: warning: flags may be used uninitialized in this function [-Wmaybe-uninitialized]
             && compact_unlock_should_abort(&cc->zone->lock, flags,
                                           ^
      Signed-off-by: default avatarXiubo Li <Li.Xiubo@freescale.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b8b2d825
    • Andrew Morton's avatar
      mm/mmap.c: clean up CONFIG_DEBUG_VM_RB checks · ff26f70f
      Andrew Morton authored
      - be consistent in printing the test which failed
      
      - one message was actually wrong (a<b != b>a)
      
      - don't print second bogus warning if browse_rb() failed
      
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ff26f70f
    • Johannes Weiner's avatar
      mm: clean up zone flags · 57054651
      Johannes Weiner authored
      Page reclaim tests zone_is_reclaim_dirty(), but the site that actually
      sets this state does zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY), sending the
      reader through layers indirection just to track down a simple bit.
      
      Remove all zone flag wrappers and just use bitops against zone->flags
      directly.  It's just as readable and the lines are barely any longer.
      
      Also rename ZONE_TAIL_LRU_DIRTY to ZONE_DIRTY to match ZONE_WRITEBACK, and
      remove the zone_flags_t typedef.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      57054651
    • Mark Rustad's avatar
      mm/page-writeback.c: use min3/max3 macros to avoid shadow warnings · 7c809968
      Mark Rustad authored
      Nested calls to min/max functions result in shadow warnings in W=2 builds.
       Avoid the warning by using the min3 and max3 macros to get the min/max of
      3 values instead of nested calls.
      Signed-off-by: default avatarMark Rustad <mark.d.rustad@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7c809968
    • Weijie Yang's avatar
      mm: page_alloc: avoid wakeup kswapd on the unintended node · 7ade3c99
      Weijie Yang authored
      When entering the page_alloc slowpath, we wakeup kswapd on every pgdat
      according to the zonelist and high_zoneidx.  However, this doesn't take
      nodemask into account, and could prematurely wakeup kswapd on some
      unintended nodes.
      
      This patch uses for_each_zone_zonelist_nodemask() instead of
      for_each_zone_zonelist() in wake_all_kswapds() to avoid the above
      situation.
      Signed-off-by: default avatarWeijie Yang <weijie.yang@samsung.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7ade3c99
    • Sasha Levin's avatar
      mm: convert a few VM_BUG_ON callers to VM_BUG_ON_VMA · 81d1b09c
      Sasha Levin authored
      Trivially convert a few VM_BUG_ON calls to VM_BUG_ON_VMA to extract
      more information when they trigger.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      81d1b09c
    • Sasha Levin's avatar
      mm: introduce VM_BUG_ON_VMA · fa3759cc
      Sasha Levin authored
      Very similar to VM_BUG_ON_PAGE but dumps VMA information instead.
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fa3759cc
    • Sasha Levin's avatar
      mm: introduce dump_vma · 0bf55139
      Sasha Levin authored
      Introduce a helper to dump information about a VMA, this also makes
      dump_page_flags more generic and re-uses that so the output looks very
      similar to dump_page:
      
      [   61.903437] vma ffff88070f88be00 start 00007fff25970000 end 00007fff25992000
      [   61.903437] next ffff88070facd600 prev ffff88070face400 mm ffff88070fade000
      [   61.903437] prot 8000000000000025 anon_vma ffff88070fa1e200 vm_ops           (null)
      [   61.903437] pgoff 7ffffffdd file           (null) private_data           (null)
      [   61.909129] flags: 0x100173(read|write|mayread|maywrite|mayexec|growsdown|account)
      
      [akpm@linux-foundation.org: make dump_vma() require CONFIG_DEBUG_VM]
      [swarren@nvidia.com: fix dump_vma() compilation]
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarStephen Warren <swarren@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0bf55139
    • Rob Jones's avatar
      mm/slab.c: use __seq_open_private() instead of seq_open() · b208ce32
      Rob Jones authored
      Using __seq_open_private() removes boilerplate code from slabstats_open()
      
      The resultant code is shorter and easier to follow.
      
      This patch does not change any functionality.
      Signed-off-by: default avatarRob Jones <rob.jones@codethink.co.uk>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b208ce32
    • Rob Jones's avatar
      mm/vmalloc.c: use seq_open_private() instead of seq_open() · 703394c1
      Rob Jones authored
      Using seq_open_private() removes boilerplate code from vmalloc_open().
      
      The resultant code is shorter and easier to follow.
      
      However, please note that seq_open_private() call kzalloc() rather than
      kmalloc() which may affect timing due to the memory initialisation
      overhead.
      Signed-off-by: default avatarRob Jones <rob.jones@codethink.co.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      703394c1
    • Andrew Morton's avatar
      include/linux/migrate.h: remove migrate_page #define · 1c93923c
      Andrew Morton authored
      This is designed to avoid a few ifdefs in .c files but it's obnoxious
      because it can cause unsuspecting "migrate_page" symbols to get turned into
      "NULL".
      
      Just nuke it and use the ifdefs.
      
      Cc: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1c93923c
    • Oleg Nesterov's avatar
      mempolicy: unexport get_vma_policy() and remove its "task" arg · dd6eecb9
      Oleg Nesterov authored
      - get_vma_policy(task) is not safe if task != current, remove this
        argument.
      
      - get_vma_policy() no longer has callers outside of mempolicy.c,
        make it static.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dd6eecb9
    • Oleg Nesterov's avatar
      mempolicy: kill do_set_mempolicy()->down_write(&mm->mmap_sem) · 2c7c3a7d
      Oleg Nesterov authored
      Remove down_write(&mm->mmap_sem) in do_set_mempolicy(). This logic
      was never correct and it is no longer needed, see the previous patch.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2c7c3a7d
    • Oleg Nesterov's avatar
      mempolicy: fix show_numa_map() vs exec() + do_set_mempolicy() race · 498f2371
      Oleg Nesterov authored
      9e781440 "hold task->mempolicy while numa_maps scans." fixed the
      race with the exiting task but this is not enough.
      
      The current code assumes that get_vma_policy(task) should either see
      task->mempolicy == NULL or it should be equal to ->task_mempolicy saved
      by hold_task_mempolicy(), so we can never race with __mpol_put(). But
      this can only work if we can't race with do_set_mempolicy(), and thus
      we can't race with another do_set_mempolicy() or do_exit() after that.
      
      However, do_set_mempolicy()->down_write(mmap_sem) can not prevent this
      race. This task can exec, change it's ->mm, and call do_set_mempolicy()
      after that; in this case they take 2 different locks.
      
      Change hold_task_mempolicy() to use get_task_policy(), it never returns
      NULL, and change show_numa_map() to use __get_vma_policy() or fall back
      to proc_priv->task_mempolicy.
      
      Note: this is the minimal fix, we will cleanup this code later. I think
      hold_task_mempolicy() and release_task_mempolicy() should die, we can
      move this logic into show_numa_map(). Or we can move get_task_policy()
      outside of ->mmap_sem and !CONFIG_NUMA code at least.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      498f2371
    • Oleg Nesterov's avatar
      mempolicy: introduce __get_vma_policy(), export get_task_policy() · 74d2c3a0
      Oleg Nesterov authored
      Extract the code which looks for vma's policy from get_vma_policy()
      into the new helper, __get_vma_policy(). Export get_task_policy().
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      74d2c3a0
    • Oleg Nesterov's avatar
      mempolicy: remove the "task" arg of vma_policy_mof() and simplify it · 6b6482bb
      Oleg Nesterov authored
      1. vma_policy_mof(task) is simply not safe unless task == current,
         it can race with do_exit()->mpol_put(). Remove this arg and update
         its single caller.
      
      2. vma can not be NULL, remove this check and simplify the code.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6b6482bb
    • Oleg Nesterov's avatar
      mempolicy: sanitize the usage of get_task_policy() · 8d90274b
      Oleg Nesterov authored
      Cleanup + preparation. Every user of get_task_policy() calls it
      unconditionally, even if it is not going to use the result.
      
      get_task_policy() is cheap but still this does not look clean, plus
      the code looks simpler if get_task_policy() is called only when this
      is really needed.
      
      Note: I hope this is correct, but it is not clear why vma_policy_mof()
      doesn't fall back to get_task_policy() if ->get_policy() returns NULL.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8d90274b
    • Oleg Nesterov's avatar
      mempolicy: change get_task_policy() to return default_policy rather than NULL · f15ca78e
      Oleg Nesterov authored
      Every caller of get_task_policy() falls back to default_policy if it
      returns NULL. Change get_task_policy() to do this.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f15ca78e
    • Oleg Nesterov's avatar
      mempolicy: change alloc_pages_vma() to use mpol_cond_put() · 2386740d
      Oleg Nesterov authored
      Trivial cleanup. alloc_pages_vma() can use mpol_cond_put().
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2386740d
    • Johannes Weiner's avatar
      mm: remove noisy remainder of the scan_unevictable interface · 1f13ae39
      Johannes Weiner authored
      The deprecation warnings for the scan_unevictable interface triggers by
      scripts doing `sysctl -a | grep something else'.  This is annoying and not
      helpful.
      
      The interface has been defunct since 264e56d8 ("mm: disable user
      interface to manually rescue unevictable pages"), which was in 2011, and
      there haven't been any reports of usecases for it, only reports that the
      deprecation warnings are annying.  It's unlikely that anybody is using
      this interface specifically at this point, so remove it.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1f13ae39
    • Cyrill Gorcunov's avatar
      prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation · f606b77f
      Cyrill Gorcunov authored
      During development of c/r we've noticed that in case if we need to support
      user namespaces we face a problem with capabilities in prctl(PR_SET_MM,
      ...) call, in particular once new user namespace is created
      capable(CAP_SYS_RESOURCE) no longer passes.
      
      A approach is to eliminate CAP_SYS_RESOURCE check but pass all new values
      in one bundle, which would allow the kernel to make more intensive test
      for sanity of values and same time allow us to support checkpoint/restore
      of user namespaces.
      
      Thus a new command PR_SET_MM_MAP introduced. It takes a pointer of
      prctl_mm_map structure which carries all the members to be updated.
      
      	prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size)
      
      	struct prctl_mm_map {
      		__u64	start_code;
      		__u64	end_code;
      		__u64	start_data;
      		__u64	end_data;
      		__u64	start_brk;
      		__u64	brk;
      		__u64	start_stack;
      		__u64	arg_start;
      		__u64	arg_end;
      		__u64	env_start;
      		__u64	env_end;
      		__u64	*auxv;
      		__u32	auxv_size;
      		__u32	exe_fd;
      	};
      
      All members except @exe_fd correspond ones of struct mm_struct.  To figure
      out which available values these members may take here are meanings of the
      members.
      
       - start_code, end_code: represent bounds of executable code area
       - start_data, end_data: represent bounds of data area
       - start_brk, brk: used to calculate bounds for brk() syscall
       - start_stack: used when accounting space needed for command
         line arguments, environment and shmat() syscall
       - arg_start, arg_end, env_start, env_end: represent memory area
         supplied for command line arguments and environment variables
       - auxv, auxv_size: carries auxiliary vector, Elf format specifics
       - exe_fd: file descriptor number for executable link (/proc/self/exe)
      
      Thus we apply the following requirements to the values
      
      1) Any member except @auxv, @auxv_size, @exe_fd is rather an address
         in user space thus it must be laying inside [mmap_min_addr, mmap_max_addr)
         interval.
      
      2) While @[start|end]_code and @[start|end]_data may point to an nonexisting
         VMAs (say a program maps own new .text and .data segments during execution)
         the rest of members should belong to VMA which must exist.
      
      3) Addresses must be ordered, ie @start_ member must not be greater or
         equal to appropriate @end_ member.
      
      4) As in regular Elf loading procedure we require that @start_brk and
         @brk be greater than @end_data.
      
      5) If RLIMIT_DATA rlimit is set to non-infinity new values should not
         exceed existing limit. Same applies to RLIMIT_STACK.
      
      6) Auxiliary vector size must not exceed existing one (which is
         predefined as AT_VECTOR_SIZE and depends on architecture).
      
      7) File descriptor passed in @exe_file should be pointing
         to executable file (because we use existing prctl_set_mm_exe_file_locked
         helper it ensures that the file we are going to use as exe link has all
         required permission granted).
      
      Now about where these members are involved inside kernel code:
      
       - @start_code and @end_code are used in /proc/$pid/[stat|statm] output;
      
       - @start_data and @end_data are used in /proc/$pid/[stat|statm] output,
         also they are considered if there enough space for brk() syscall
         result if RLIMIT_DATA is set;
      
       - @start_brk shown in /proc/$pid/stat output and accounted in brk()
         syscall if RLIMIT_DATA is set; also this member is tested to
         find a symbolic name of mmap event for perf system (we choose
         if event is generated for "heap" area); one more aplication is
         selinux -- we test if a process has PROCESS__EXECHEAP permission
         if trying to make heap area being executable with mprotect() syscall;
      
       - @brk is a current value for brk() syscall which lays inside heap
         area, it's shown in /proc/$pid/stat. When syscall brk() succesfully
         provides new memory area to a user space upon brk() completion the
         mm::brk is updated to carry new value;
      
         Both @start_brk and @brk are actively used in /proc/$pid/maps
         and /proc/$pid/smaps output to find a symbolic name "heap" for
         VMA being scanned;
      
       - @start_stack is printed out in /proc/$pid/stat and used to
         find a symbolic name "stack" for task and threads in
         /proc/$pid/maps and /proc/$pid/smaps output, and as the same
         as with @start_brk -- perf system uses it for event naming.
         Also kernel treat this member as a start address of where
         to map vDSO pages and to check if there is enough space
         for shmat() syscall;
      
       - @arg_start, @arg_end, @env_start and @env_end are printed out
         in /proc/$pid/stat. Another access to the data these members
         represent is to read /proc/$pid/environ or /proc/$pid/cmdline.
         Any attempt to read these areas kernel tests with access_process_vm
         helper so a user must have enough rights for this action;
      
       - @auxv and @auxv_size may be read from /proc/$pid/auxv. Strictly
         speaking kernel doesn't care much about which exactly data is
         sitting there because it is solely for userspace;
      
       - @exe_fd is referred from /proc/$pid/exe and when generating
         coredump. We uses prctl_set_mm_exe_file_locked helper to update
         this member, so exe-file link modification remains one-shot
         action.
      
      Still note that updating exe-file link now doesn't require sys-resource
      capability anymore, after all there is no much profit in preventing setup
      own file link (there are a number of ways to execute own code -- ptrace,
      ld-preload, so that the only reliable way to find which exactly code is
      executed is to inspect running program memory).  Still we require the
      caller to be at least user-namespace root user.
      
      I believe the old interface should be deprecated and ripped off in a
      couple of kernel releases if no one against.
      
      To test if new interface is implemented in the kernel one can pass
      PR_SET_MM_MAP_SIZE opcode and the kernel returns the size of currently
      supported struct prctl_mm_map.
      
      [akpm@linux-foundation.org: fix 80-col wordwrap in macro definitions]
      Signed-off-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: default avatarAndrew Vagin <avagin@openvz.org>
      Tested-by: default avatarAndrew Vagin <avagin@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Acked-by: default avatarSerge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Julien Tinnes <jln@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f606b77f
    • Cyrill Gorcunov's avatar
      prctl: PR_SET_MM -- factor out mmap_sem when updating mm::exe_file · 71fe97e1
      Cyrill Gorcunov authored
      Instead of taking mm->mmap_sem inside prctl_set_mm_exe_file() move it out
      and rename the helper to prctl_set_mm_exe_file_locked().  This will allow
      to reuse this function in a next patch.
      Signed-off-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Acked-by: default avatarSerge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Julien Tinnes <jln@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      71fe97e1
    • Cyrill Gorcunov's avatar
      mm: use may_adjust_brk helper · 8764b338
      Cyrill Gorcunov authored
      Signed-off-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Acked-by: default avatarSerge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Julien Tinnes <jln@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8764b338
    • Cyrill Gorcunov's avatar
      mm: introduce check_data_rlimit helper · 9c599024
      Cyrill Gorcunov authored
      To eliminate code duplication lets introduce check_data_rlimit helper
      which we will use in brk() and prctl() syscalls.
      Signed-off-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Acked-by: default avatarSerge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Julien Tinnes <jln@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9c599024
    • David Rientjes's avatar
      mm, compaction: pass gfp mask to compact_control · 6d7ce559
      David Rientjes authored
      struct compact_control currently converts the gfp mask to a migratetype,
      but we need the entire gfp mask in a follow-up patch.
      
      Pass the entire gfp mask as part of struct compact_control.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6d7ce559
    • David Rientjes's avatar
      mm: rename allocflags_to_migratetype for clarity · 43e7a34d
      David Rientjes authored
      The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like
      ALLOC_CPUSET) that have separate semantics.
      
      The function allocflags_to_migratetype() actually takes gfp flags, not
      alloc flags, and returns a migratetype.  Rename it to
      gfpflags_to_migratetype().
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      43e7a34d
    • Vlastimil Babka's avatar
      mm, compaction: skip buddy pages by their order in the migrate scanner · 99c0fd5e
      Vlastimil Babka authored
      The migration scanner skips PageBuddy pages, but does not consider their
      order as checking page_order() is generally unsafe without holding the
      zone->lock, and acquiring the lock just for the check wouldn't be a good
      tradeoff.
      
      Still, this could avoid some iterations over the rest of the buddy page,
      and if we are careful, the race window between PageBuddy() check and
      page_order() is small, and the worst thing that can happen is that we skip
      too much and miss some isolation candidates.  This is not that bad, as
      compaction can already fail for many other reasons like parallel
      allocations, and those have much larger race window.
      
      This patch therefore makes the migration scanner obtain the buddy page
      order and use it to skip the whole buddy page, if the order appears to be
      in the valid range.
      
      It's important that the page_order() is read only once, so that the value
      used in the checks and in the pfn calculation is the same.  But in theory
      the compiler can replace the local variable by multiple inlines of
      page_order().  Therefore, the patch introduces page_order_unsafe() that
      uses ACCESS_ONCE to prevent this.
      
      Testing with stress-highalloc from mmtests shows a 15% reduction in number
      of pages scanned by migration scanner.  The reduction is >60% with
      __GFP_NO_KSWAPD allocations, along with success rates better by few
      percent.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      99c0fd5e
    • Vlastimil Babka's avatar
      mm, compaction: remember position within pageblock in free pages scanner · e14c720e
      Vlastimil Babka authored
      Unlike the migration scanner, the free scanner remembers the beginning of
      the last scanned pageblock in cc->free_pfn.  It might be therefore
      rescanning pages uselessly when called several times during single
      compaction.  This might have been useful when pages were returned to the
      buddy allocator after a failed migration, but this is no longer the case.
      
      This patch changes the meaning of cc->free_pfn so that if it points to a
      middle of a pageblock, that pageblock is scanned only from cc->free_pfn to
      the end.  isolate_freepages_block() will record the pfn of the last page
      it looked at, which is then used to update cc->free_pfn.
      
      In the mmtests stress-highalloc benchmark, this has resulted in lowering
      the ratio between pages scanned by both scanners, from 2.5 free pages per
      migrate page, to 2.25 free pages per migrate page, without affecting
      success rates.
      
      With __GFP_NO_KSWAPD allocations, this appears to result in a worse ratio
      (2.1 instead of 1.8), but page migration successes increased by 10%, so
      this could mean that more useful work can be done until need_resched()
      aborts this kind of compaction.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e14c720e
    • Vlastimil Babka's avatar
      mm, compaction: skip rechecks when lock was already held · 69b7189f
      Vlastimil Babka authored
      Compaction scanners try to lock zone locks as late as possible by checking
      many page or pageblock properties opportunistically without lock and
      skipping them if not unsuitable.  For pages that pass the initial checks,
      some properties have to be checked again safely under lock.  However, if
      the lock was already held from a previous iteration in the initial checks,
      the rechecks are unnecessary.
      
      This patch therefore skips the rechecks when the lock was already held.
      This is now possible to do, since we don't (potentially) drop and
      reacquire the lock between the initial checks and the safe rechecks
      anymore.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      69b7189f
    • Vlastimil Babka's avatar
      mm, compaction: periodically drop lock and restore IRQs in scanners · 8b44d279
      Vlastimil Babka authored
      Compaction scanners regularly check for lock contention and need_resched()
      through the compact_checklock_irqsave() function.  However, if there is no
      contention, the lock can be held and IRQ disabled for potentially long
      time.
      
      This has been addressed by commit b2eef8c0 ("mm: compaction: minimise
      the time IRQs are disabled while isolating pages for migration") for the
      migration scanner.  However, the refactoring done by commit 2a1402aa
      ("mm: compaction: acquire the zone->lru_lock as late as possible") has
      changed the conditions so that the lock is dropped only when there's
      contention on the lock or need_resched() is true.  Also, need_resched() is
      checked only when the lock is already held.  The comment "give a chance to
      irqs before checking need_resched" is therefore misleading, as IRQs remain
      disabled when the check is done.
      
      This patch restores the behavior intended by commit b2eef8c0 and also
      tries to better balance and make more deterministic the time spent by
      checking for contention vs the time the scanners might run between the
      checks.  It also avoids situations where checking has not been done often
      enough before.  The result should be avoiding both too frequent and too
      infrequent contention checking, and especially the potentially
      long-running scans with IRQs disabled and no checking of need_resched() or
      for fatal signal pending, which can happen when many consecutive pages or
      pageblocks fail the preliminary tests and do not reach the later call site
      to compact_checklock_irqsave(), as explained below.
      
      Before the patch:
      
      In the migration scanner, compact_checklock_irqsave() was called each
      loop, if reached.  If not reached, some lower-frequency checking could
      still be done if the lock was already held, but this would not result in
      aborting contended async compaction until reaching
      compact_checklock_irqsave() or end of pageblock.  In the free scanner, it
      was similar but completely without the periodical checking, so lock can be
      potentially held until reaching the end of pageblock.
      
      After the patch, in both scanners:
      
      The periodical check is done as the first thing in the loop on each
      SWAP_CLUSTER_MAX aligned pfn, using the new compact_unlock_should_abort()
      function, which always unlocks the lock (if locked) and aborts async
      compaction if scheduling is needed.  It also aborts any type of compaction
      when a fatal signal is pending.
      
      The compact_checklock_irqsave() function is replaced with a slightly
      different compact_trylock_irqsave().  The biggest difference is that the
      function is not called at all if the lock is already held.  The periodical
      need_resched() checking is left solely to compact_unlock_should_abort().
      The lock contention avoidance for async compaction is achieved by the
      periodical unlock by compact_unlock_should_abort() and by using trylock in
      compact_trylock_irqsave() and aborting when trylock fails.  Sync
      compaction does not use trylock.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8b44d279
    • Vlastimil Babka's avatar
      mm, compaction: khugepaged should not give up due to need_resched() · 1f9efdef
      Vlastimil Babka authored
      Async compaction aborts when it detects zone lock contention or
      need_resched() is true.  David Rientjes has reported that in practice,
      most direct async compactions for THP allocation abort due to
      need_resched().  This means that a second direct compaction is never
      attempted, which might be OK for a page fault, but khugepaged is intended
      to attempt a sync compaction in such case and in these cases it won't.
      
      This patch replaces "bool contended" in compact_control with an int that
      distinguishes between aborting due to need_resched() and aborting due to
      lock contention.  This allows propagating the abort through all compaction
      functions as before, but passing the abort reason up to
      __alloc_pages_slowpath() which decides when to continue with direct
      reclaim and another compaction attempt.
      
      Another problem is that try_to_compact_pages() did not act upon the
      reported contention (both need_resched() or lock contention) immediately
      and would proceed with another zone from the zonelist.  When
      need_resched() is true, that means initializing another zone compaction,
      only to check again need_resched() in isolate_migratepages() and aborting.
       For zone lock contention, the unintended consequence is that the lock
      contended status reported back to the allocator is detrmined from the last
      zone where compaction was attempted, which is rather arbitrary.
      
      This patch fixes the problem in the following way:
      - async compaction of a zone aborting due to need_resched() or fatal signal
        pending means that further zones should not be tried. We report
        COMPACT_CONTENDED_SCHED to the allocator.
      - aborting zone compaction due to lock contention means we can still try
        another zone, since it has different set of locks. We report back
        COMPACT_CONTENDED_LOCK only if *all* zones where compaction was attempted,
        it was aborted due to lock contention.
      
      As a result of these fixes, khugepaged will proceed with second sync
      compaction as intended, when the preceding async compaction aborted due to
      need_resched().  Page fault compactions aborting due to need_resched()
      will spare some cycles previously wasted by initializing another zone
      compaction only to abort again.  Lock contention will be reported only
      when compaction in all zones aborted due to lock contention, and therefore
      it's not a good idea to try again after reclaim.
      
      In stress-highalloc from mmtests configured to use __GFP_NO_KSWAPD, this
      has improved number of THP collapse allocations by 10%, which shows
      positive effect on khugepaged.  The benchmark's success rates are
      unchanged as it is not recognized as khugepaged.  Numbers of compact_stall
      and compact_fail events have however decreased by 20%, with
      compact_success still a bit improved, which is good.  With benchmark
      configured not to use __GFP_NO_KSWAPD, there is 6% improvement in THP
      collapse allocations, and only slight improvement in stalls and failures.
      
      [akpm@linux-foundation.org: fix warnings]
      Reported-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1f9efdef
    • Vlastimil Babka's avatar
      mm, compaction: reduce zone checking frequency in the migration scanner · 7d49d886
      Vlastimil Babka authored
      The unification of the migrate and free scanner families of function has
      highlighted a difference in how the scanners ensure they only isolate
      pages of the intended zone.  This is important for taking zone lock or lru
      lock of the correct zone.  Due to nodes overlapping, it is however
      possible to encounter a different zone within the range of the zone being
      compacted.
      
      The free scanner, since its inception by commit 748446bb ("mm:
      compaction: memory compaction core"), has been checking the zone of the
      first valid page in a pageblock, and skipping the whole pageblock if the
      zone does not match.
      
      This checking was completely missing from the migration scanner at first,
      and later added by commit dc908600 ("mm: compaction: check for
      overlapping nodes during isolation for migration") in a reaction to a bug
      report.  But the zone comparison in migration scanner is done once per a
      single scanned page, which is more defensive and thus more costly than a
      check per pageblock.
      
      This patch unifies the checking done in both scanners to once per
      pageblock, through a new pageblock_pfn_to_page() function, which also
      includes pfn_valid() checks.  It is more defensive than the current free
      scanner checks, as it checks both the first and last page of the
      pageblock, but less defensive by the migration scanner per-page checks.
      It assumes that node overlapping may result (on some architecture) in a
      boundary between two nodes falling into the middle of a pageblock, but
      that there cannot be a node0 node1 node0 interleaving within a single
      pageblock.
      
      The result is more code being shared and a bit less per-page CPU cost in
      the migration scanner.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7d49d886
    • Vlastimil Babka's avatar
      mm, compaction: move pageblock checks up from isolate_migratepages_range() · edc2ca61
      Vlastimil Babka authored
      isolate_migratepages_range() is the main function of the compaction
      scanner, called either on a single pageblock by isolate_migratepages()
      during regular compaction, or on an arbitrary range by CMA's
      __alloc_contig_migrate_range().  It currently perfoms two pageblock-wide
      compaction suitability checks, and because of the CMA callpath, it tracks
      if it crossed a pageblock boundary in order to repeat those checks.
      
      However, closer inspection shows that those checks are always true for CMA:
      - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
      - migrate_async_suitable() check is skipped because CMA uses sync compaction
      
      We can therefore move the compaction-specific checks to
      isolate_migratepages() and simplify isolate_migratepages_range().
      Furthermore, we can mimic the freepage scanner family of functions, which
      has isolate_freepages_block() function called both by compaction from
      isolate_freepages() and by CMA from isolate_freepages_range(), where each
      use-case adds own specific glue code.  This allows further code
      simplification.
      
      Thus, we rename isolate_migratepages_range() to
      isolate_migratepages_block() and limit its functionality to a single
      pageblock (or its subset).  For CMA, a new different
      isolate_migratepages_range() is created as a CMA-specific wrapper for the
      _block() function.  The checks specific to compaction are moved to
      isolate_migratepages().  As part of the unification of these two families
      of functions, we remove the redundant zone parameter where applicable,
      since zone pointer is already passed in cc->zone.
      
      Furthermore, going back to compact_zone() and compact_finished() when
      pageblock is found unsuitable (now by isolate_migratepages()) is wasteful
      - the checks are meant to skip pageblocks quickly.  The patch therefore
      also introduces a simple loop into isolate_migratepages() so that it does
      not return immediately on failed pageblock checks, but keeps going until
      isolate_migratepages_range() gets called once.  Similarily to
      isolate_freepages(), the function periodically checks if it needs to
      reschedule or abort async compaction.
      
      [iamjoonsoo.kim@lge.com: fix isolated page counting bug in compaction]
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      edc2ca61
    • Vlastimil Babka's avatar
      mm, compaction: do not recheck suitable_migration_target under lock · f8224aa5
      Vlastimil Babka authored
      isolate_freepages_block() rechecks if the pageblock is suitable to be a
      target for migration after it has taken the zone->lock.  However, the
      check has been optimized to occur only once per pageblock, and
      compact_checklock_irqsave() might be dropping and reacquiring lock, which
      means somebody else might have changed the pageblock's migratetype
      meanwhile.
      
      Furthermore, nothing prevents the migratetype to change right after
      isolate_freepages_block() has finished isolating.  Given how imperfect
      this is, it's simpler to just rely on the check done in
      isolate_freepages() without lock, and not pretend that the recheck under
      lock guarantees anything.  It is just a heuristic after all.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f8224aa5
    • Vlastimil Babka's avatar
      mm, compaction: do not count compact_stall if all zones skipped compaction · 98dd3b48
      Vlastimil Babka authored
      The compact_stall vmstat counter counts the number of allocations stalled
      by direct compaction.  It does not count when all attempted zones had
      deferred compaction, but it does count when all zones skipped compaction.
      The skipping is decided based on very early check of
      compaction_suitable(), based on watermarks and memory fragmentation.
      Therefore it makes sense not to count skipped compactions as stalls.
      Moreover, compact_success or compact_fail is also already not being
      counted when compaction was skipped, so this patch changes the
      compact_stall counting to match the other two.
      
      Additionally, restructure __alloc_pages_direct_compact() code for better
      readability.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      98dd3b48
    • Vlastimil Babka's avatar
      mm, compaction: defer each zone individually instead of preferred zone · 53853e2d
      Vlastimil Babka authored
      When direct sync compaction is often unsuccessful, it may become deferred
      for some time to avoid further useless attempts, both sync and async.
      Successful high-order allocations un-defer compaction, while further
      unsuccessful compaction attempts prolong the compaction deferred period.
      
      Currently the checking and setting deferred status is performed only on
      the preferred zone of the allocation that invoked direct compaction.  But
      compaction itself is attempted on all eligible zones in the zonelist, so
      the behavior is suboptimal and may lead both to scenarios where 1)
      compaction is attempted uselessly, or 2) where it's not attempted despite
      good chances of succeeding, as shown on the examples below:
      
      1) A direct compaction with Normal preferred zone failed and set
         deferred compaction for the Normal zone.  Another unrelated direct
         compaction with DMA32 as preferred zone will attempt to compact DMA32
         zone even though the first compaction attempt also included DMA32 zone.
      
         In another scenario, compaction with Normal preferred zone failed to
         compact Normal zone, but succeeded in the DMA32 zone, so it will not
         defer compaction.  In the next attempt, it will try Normal zone which
         will fail again, instead of skipping Normal zone and trying DMA32
         directly.
      
      2) Kswapd will balance DMA32 zone and reset defer status based on
         watermarks looking good.  A direct compaction with preferred Normal
         zone will skip compaction of all zones including DMA32 because Normal
         was still deferred.  The allocation might have succeeded in DMA32, but
         won't.
      
      This patch makes compaction deferring work on individual zone basis
      instead of preferred zone.  For each zone, it checks compaction_deferred()
      to decide if the zone should be skipped.  If watermarks fail after
      compacting the zone, defer_compaction() is called.  The zone where
      watermarks passed can still be deferred when the allocation attempt is
      unsuccessful.  When allocation is successful, compaction_defer_reset() is
      called for the zone containing the allocated page.  This approach should
      approximate calling defer_compaction() only on zones where compaction was
      attempted and did not yield allocated page.  There might be corner cases
      but that is inevitable as long as the decision to stop compacting dues not
      guarantee that a page will be allocated.
      
      Due to a new COMPACT_DEFERRED return value, some functions relying
      implicitly on COMPACT_SKIPPED = 0 had to be updated, with comments made
      more accurate.  The did_some_progress output parameter of
      __alloc_pages_direct_compact() is removed completely, as the caller
      actually does not use it after compaction sets it - it is only considered
      when direct reclaim sets it.
      
      During testing on a two-node machine with a single very small Normal zone
      on node 1, this patch has improved success rates in stress-highalloc
      mmtests benchmark.  The success here were previously made worse by commit
      3a025760 ("mm: page_alloc: spill to remote nodes before waking
      kswapd") as kswapd was no longer resetting often enough the deferred
      compaction for the Normal zone, and DMA32 zones on both nodes were thus
      not considered for compaction.  On different machine, success rates were
      improved with __GFP_NO_KSWAPD allocations.
      
      [akpm@linux-foundation.org: fix CONFIG_COMPACTION=n build]
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Reviewed-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      53853e2d
    • Vlastimil Babka's avatar
      mm, THP: don't hold mmap_sem in khugepaged when allocating THP · 8b164568
      Vlastimil Babka authored
      When allocating huge page for collapsing, khugepaged currently holds
      mmap_sem for reading on the mm where collapsing occurs.  Afterwards the
      read lock is dropped before write lock is taken on the same mmap_sem.
      
      Holding mmap_sem during whole huge page allocation is therefore useless,
      the vma needs to be rechecked after taking the write lock anyway.
      Furthemore, huge page allocation might involve a rather long sync
      compaction, and thus block any mmap_sem writers and i.e.  affect workloads
      that perform frequent m(un)map or mprotect oterations.
      
      This patch simply releases the read lock before allocating a huge page.
      It also deletes an outdated comment that assumed vma must be stable, as it
      was using alloc_hugepage_vma().  This is no longer true since commit
      9f1b868a ("mm: thp: khugepaged: add policy for finding target node").
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8b164568
    • Akinobu Mita's avatar
      block_dev: implement readpages() to optimize sequential read · 447f05bb
      Akinobu Mita authored
      Sequential read from a block device is expected to be equal or faster than
      from the file on a filesystem.  But it is not correct due to the lack of
      effective readpages() in the address space operations for block device.
      
      This implements readpages() operation for block device by using
      mpage_readpages() which can create multipage BIOs instead of BIOs for each
      page and reduce system CPU time consumption.
      
      Install 1GB of RAM disk storage:
      
      	# modprobe scsi_debug dev_size_mb=1024 delay=0
      
      Sequential read from file on a filesystem:
      
      	# mkfs.ext4 /dev/$DEV
      	# mount /dev/$DEV /mnt
      	# fio --name=t --size=512m --rw=read --filename=/mnt/file
      	...
      	  read : io=524288KB, bw=2133.4MB/s, iops=546133, runt=   240msec
      
      Sequential read from a block device:
      	# fio --name=t --size=512m --rw=read --filename=/dev/$DEV
      	...
      (Without this commit)
      	  read : io=524288KB, bw=1700.2MB/s, iops=435455, runt=   301msec
      
      (With this commit)
      	  read : io=524288KB, bw=2160.4MB/s, iops=553046, runt=   237msec
      Signed-off-by: default avatarAkinobu Mita <akinobu.mita@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      447f05bb
    • Akinobu Mita's avatar
      vfs: guard end of device for mpage interface · 4db96b71
      Akinobu Mita authored
      Add guard_bio_eod() check for mpage code in order to allow us to do IO
      even on the odd last sectors of a device, even if the block size is some
      multiple of the physical sector size.
      
      Using mpage_readpages() for block device requires this guard check.
      Signed-off-by: default avatarAkinobu Mita <akinobu.mita@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4db96b71