1. 15 Jan, 2016 40 commits
    • Johannes Weiner's avatar
      net: tcp_memcontrol: properly detect ancestor socket pressure · 8c2c2358
      Johannes Weiner authored
      When charging socket memory, the code currently checks only the local
      page counter for excess to determine whether the memcg is under socket
      pressure.  But even if the local counter is fine, one of the ancestors
      could have breached its limit, which should also force this child to
      enter socket pressure.  This currently doesn't happen.
      
      Fix this by using page_counter_try_charge() first.  If that fails, it
      means that either the local counter or one of the ancestors are in
      excess of their limit, and the child should enter socket pressure.
      
      Fixes: 3e32cb2e ("mm: memcontrol: lockless page counters")
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Reviewed-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8c2c2358
    • Johannes Weiner's avatar
      mm: memcontrol: export root_mem_cgroup · 7d828602
      Johannes Weiner authored
      A later patch will need this symbol in files other than memcontrol.c, so
      export it now and replace mem_cgroup_root_css at the same time.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Reviewed-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7d828602
    • Geliang Tang's avatar
      mm/ksm.c: use list_for_each_entry_safe · 03640418
      Geliang Tang authored
      Use list_for_each_entry_safe() instead of list_for_each_safe() to
      simplify the code.
      Signed-off-by: default avatarGeliang Tang <geliangtang@163.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      03640418
    • Geliang Tang's avatar
      mm/readahead.c, mm/vmscan.c: use lru_to_page instead of list_to_page · c8ad6302
      Geliang Tang authored
      list_to_page() in readahead.c is the same as lru_to_page() in vmscan.c.
      So I move lru_to_page to internal.h and drop list_to_page().
      Signed-off-by: default avatarGeliang Tang <geliangtang@163.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c8ad6302
    • Joonsoo Kim's avatar
      mm/compaction.c: __compact_pgdat() code cleanuup · 75469345
      Joonsoo Kim authored
      This patch uses is_via_compact_memory() to distinguish compaction from
      sysfs or sysctl.  And, this patch also reduces indentation on
      compaction_defer_reset() by filtering these cases first before checking
      watermark.
      
      There is no functional change.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarYaowei Bai <baiyaowei@cmss.chinamobile.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      75469345
    • Geliang Tang's avatar
      mm/swapfile.c: use list_{next,first}_entry · a8ae4991
      Geliang Tang authored
      To make the intention clearer, use list_{next,first}_entry instead of
      list_entry().
      Signed-off-by: default avatarGeliang Tang <geliangtang@163.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a8ae4991
    • Alexander Kuleshov's avatar
      mm/memblock: introduce for_each_memblock_type() · 8c9c1701
      Alexander Kuleshov authored
      We already have the for_each_memblock() macro in <linux/memblock.h>
      which provides ability to iterate over memblock regions of a known type.
      The for_each_memblock() macro allows us to pass the pointer to the
      struct memblock_type, instead we need to pass name of the type.
      
      This patch introduces a new macro for_each_memblock_type() which allows
      us iterate over memblock regions with the given type when the type is
      unknown.
      Signed-off-by: default avatarAlexander Kuleshov <kuleshovmail@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8c9c1701
    • Alexander Kuleshov's avatar
      mm/memblock: remove rgnbase and rgnsize variables · f14516fb
      Alexander Kuleshov authored
      Remove rgnbase and rgnsize variables from memblock_overlaps_region().
      We use these variables only for passing to the memblock_addrs_overlap()
      function and that's all.  Let's remove them.
      Signed-off-by: default avatarAlexander Kuleshov <kuleshovmail@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f14516fb
    • Michal Hocko's avatar
      mm, oom: give __GFP_NOFAIL allocations access to memory reserves · 5020e285
      Michal Hocko authored
      __GFP_NOFAIL is a big hammer used to ensure that the allocation request
      can never fail.  This is a strong requirement and as such it also
      deserves a special treatment when the system is OOM.  The primary
      problem here is that the allocation request might have come with some
      locks held and the oom victim might be blocked on the same locks.  This
      is basically an OOM deadlock situation.
      
      This patch tries to reduce the risk of such a deadlocks by giving
      __GFP_NOFAIL allocations a special treatment and let them dive into
      memory reserves after oom killer invocation.  This should help them to
      make a progress and release resources they are holding.  The OOM victim
      should compensate for the reserves consumption.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Suggested-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5020e285
    • Geliang Tang's avatar
      mm/page_alloc.c: use list_for_each_entry in mark_free_pages() · 86760a2c
      Geliang Tang authored
      Use list_for_each_entry instead of list_for_each + list_entry to
      simplify the code.
      Signed-off-by: default avatarGeliang Tang <geliangtang@163.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      86760a2c
    • Geliang Tang's avatar
      mm/page_alloc.c: use list_{first,last}_entry instead of list_entry · a16601c5
      Geliang Tang authored
      To make the intention clearer, use list_{first,last}_entry instead of
      list_entry.
      Signed-off-by: default avatarGeliang Tang <geliangtang@163.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a16601c5
    • Mel Gorman's avatar
      mm/page_alloc.c: remove unnecessary parameter from __rmqueue · 6ac0206b
      Mel Gorman authored
      Commit 0aaa29a5 ("mm, page_alloc: reserve pageblocks for high-order
      atomic allocations on demand") added an unnecessary and unused parameter
      to __rmqueue.  It was a parameter that was used in an earlier version of
      the patch and then left behind.  This patch cleans it up.
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6ac0206b
    • Seth Jennings's avatar
      drivers/base/memory.c: rename remove_memory_block() to remove_memory_section() · cc292b0b
      Seth Jennings authored
      The function removes a section, not a block.  Rename to reflect actual
      functionality.
      Signed-off-by: default avatarSeth Jennings <sjennings@variantweb.net>
      Cc: Andrew Banman <abanman@sgi.com>
      Cc: Daniel J Blueman <daniel@numascale.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Greg KH <greg@kroah.com>
      Cc: Russ Anderson <rja@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cc292b0b
    • Seth Jennings's avatar
      drivers/base/memory.c: clean up section counting · 56c6b5d3
      Seth Jennings authored
      Right now, section_count is calculated in add_memory_block().  However,
      init_memory_block() increments section_count as well, which, at first,
      seems like it would lead to an off-by-one error.  There is no harm done
      because add_memory_block() immediately overwrites the
      mem->section_count, but it is messy.
      
      This commit moves the increment out of the common init_memory_block()
      (called by both add_memory_block() and register_new_memory()) and adds
      it to register_new_memory().
      Signed-off-by: default avatarSeth Jennings <sjennings@variantweb.net>
      Cc: Andrew Banman <abanman@sgi.com>
      Cc: Daniel J Blueman <daniel@numascale.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Greg KH <greg@kroah.com>
      Cc: Russ Anderson <rja@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      56c6b5d3
    • Johannes Weiner's avatar
      proc: meminfo: estimate available memory more conservatively · 84ad5802
      Johannes Weiner authored
      The MemAvailable item in /proc/meminfo is to give users a hint of how
      much memory is allocatable without causing swapping, so it excludes the
      zones' low watermarks as unavailable to userspace.
      
      However, for a userspace allocation, kswapd will actually reclaim until
      the free pages hit a combination of the high watermark and the page
      allocator's lowmem protection that keeps a certain amount of DMA and
      DMA32 memory from userspace as well.
      
      Subtract the full amount we know to be unavailable to userspace from the
      number of free pages when calculating MemAvailable.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      84ad5802
    • Johannes Weiner's avatar
      mm: page_alloc: generalize the dirty balance reserve · a8d01437
      Johannes Weiner authored
      The dirty balance reserve that dirty throttling has to consider is
      merely memory not available to userspace allocations.  There is nothing
      writeback-specific about it.  Generalize the name so that it's reusable
      outside of that context.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a8d01437
    • Michal Hocko's avatar
      mm: allow GFP_{FS,IO} for page_cache_read page cache allocation · c20cd45e
      Michal Hocko authored
      page_cache_read has been historically using page_cache_alloc_cold to
      allocate a new page.  This means that mapping_gfp_mask is used as the
      base for the gfp_mask.  Many filesystems are setting this mask to
      GFP_NOFS to prevent from fs recursion issues.  page_cache_read is called
      from the vm_operations_struct::fault() context during the page fault.
      This context doesn't need the reclaim protection normally.
      
      ceph and ocfs2 which call filemap_fault from their fault handlers seem
      to be OK because they are not taking any fs lock before invoking generic
      implementation.  xfs which takes XFS_MMAPLOCK_SHARED is safe from the
      reclaim recursion POV because this lock serializes truncate and punch
      hole with the page faults and it doesn't get involved in the reclaim.
      
      There is simply no reason to deliberately use a weaker allocation
      context when a __GFP_FS | __GFP_IO can be used.  The GFP_NOFS protection
      might be even harmful.  There is a push to fail GFP_NOFS allocations
      rather than loop within allocator indefinitely with a very limited
      reclaim ability.  Once we start failing those requests the OOM killer
      might be triggered prematurely because the page cache allocation failure
      is propagated up the page fault path and end up in
      pagefault_out_of_memory.
      
      We cannot play with mapping_gfp_mask directly because that would be racy
      wrt.  parallel page faults and it might interfere with other users who
      really rely on NOFS semantic from the stored gfp_mask.  The mask is also
      inode proper so it would even be a layering violation.  What we can do
      instead is to push the gfp_mask into struct vm_fault and allow fs layer
      to overwrite it should the callback need to be called with a different
      allocation context.
      
      Initialize the default to (mapping_gfp_mask | __GFP_FS | __GFP_IO)
      because this should be safe from the page fault path normally.  Why do
      we care about mapping_gfp_mask at all then? Because this doesn't hold
      only reclaim protection flags but it also might contain zone and
      movability restrictions (GFP_DMA32, __GFP_MOVABLE and others) so we have
      to respect those.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: default avatarJan Kara <jack@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c20cd45e
    • Yaowei Bai's avatar
      mm/compaction: improve comment for compact_memory tunable knob handler · fec4eb2c
      Yaowei Bai authored
      sysctl_compaction_handler() is the handler function for compact_memory
      tunable knob under /proc/sys/vm, add the missing knob name to make this
      more accurate in comment.
      
      No functional change.
      Signed-off-by: default avatarYaowei Bai <baiyaowei@cmss.chinamobile.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Nazarewicz <mina86@mina86.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fec4eb2c
    • Daniel Cashman's avatar
      x86: mm: support ARCH_MMAP_RND_BITS · 9e08f57d
      Daniel Cashman authored
      x86: arch_mmap_rnd() uses hard-coded values, 8 for 32-bit and 28 for
      64-bit, to generate the random offset for the mmap base address.  This
      value represents a compromise between increased ASLR effectiveness and
      avoiding address-space fragmentation.  Replace it with a Kconfig option,
      which is sensibly bounded, so that platform developers may choose where
      to place this compromise.  Keep default values as new minimums.
      Signed-off-by: default avatarDaniel Cashman <dcashman@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mark Salyzyn <salyzyn@android.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Nick Kralevich <nnk@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9e08f57d
    • Daniel Cashman's avatar
      arm64: mm: support ARCH_MMAP_RND_BITS · 8f0d3aa9
      Daniel Cashman authored
      arm64: arch_mmap_rnd() uses STACK_RND_MASK to generate the random offset
      for the mmap base address.  This value represents a compromise between
      increased ASLR effectiveness and avoiding address-space fragmentation.
      Replace it with a Kconfig option, which is sensibly bounded, so that
      platform developers may choose where to place this compromise.  Keep
      default values as new minimums.
      Signed-off-by: default avatarDaniel Cashman <dcashman@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mark Salyzyn <salyzyn@android.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Nick Kralevich <nnk@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8f0d3aa9
    • Daniel Cashman's avatar
      arm: mm: support ARCH_MMAP_RND_BITS · e0c25d95
      Daniel Cashman authored
      arm: arch_mmap_rnd() uses a hard-code value of 8 to generate the random
      offset for the mmap base address.  This value represents a compromise
      between increased ASLR effectiveness and avoiding address-space
      fragmentation.  Replace it with a Kconfig option, which is sensibly
      bounded, so that platform developers may choose where to place this
      compromise.  Keep 8 as the minimum acceptable value.
      
      [arnd@arndb.de: ARM: avoid ARCH_MMAP_RND_BITS for NOMMU]
      Signed-off-by: default avatarDaniel Cashman <dcashman@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mark Salyzyn <salyzyn@android.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Nick Kralevich <nnk@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e0c25d95
    • Daniel Cashman's avatar
      mm: mmap: add new /proc tunable for mmap_base ASLR · d07e2259
      Daniel Cashman authored
      Address Space Layout Randomization (ASLR) provides a barrier to
      exploitation of user-space processes in the presence of security
      vulnerabilities by making it more difficult to find desired code/data
      which could help an attack.  This is done by adding a random offset to
      the location of regions in the process address space, with a greater
      range of potential offset values corresponding to better protection/a
      larger search-space for brute force, but also to greater potential for
      fragmentation.
      
      The offset added to the mmap_base address, which provides the basis for
      the majority of the mappings for a process, is set once on process exec
      in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
      which reflect, hopefully, the best compromise for all systems.  The
      trade-off between increased entropy in the offset value generation and
      the corresponding increased variability in address space fragmentation
      is not absolute, however, and some platforms may tolerate higher amounts
      of entropy.  This patch introduces both new Kconfig values and a sysctl
      interface which may be used to change the amount of entropy used for
      offset generation on a system.
      
      The direct motivation for this change was in response to the
      libstagefright vulnerabilities that affected Android, specifically to
      information provided by Google's project zero at:
      
        http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html
      
      The attack presented therein, by Google's project zero, specifically
      targeted the limited randomness used to generate the offset added to the
      mmap_base address in order to craft a brute-force-based attack.
      Concretely, the attack was against the mediaserver process, which was
      limited to respawning every 5 seconds, on an arm device.  The hard-coded
      8 bits used resulted in an average expected success rate of defeating
      the mmap ASLR after just over 10 minutes (128 tries at 5 seconds a
      piece).  With this patch, and an accompanying increase in the entropy
      value to 16 bits, the same attack would take an average expected time of
      over 45 hours (32768 tries), which makes it both less feasible and more
      likely to be noticed.
      
      The introduced Kconfig and sysctl options are limited by per-arch
      minimum and maximum values, the minimum of which was chosen to match the
      current hard-coded value and the maximum of which was chosen so as to
      give the greatest flexibility without generating an invalid mmap_base
      address, generally a 3-4 bits less than the number of bits in the
      user-space accessible virtual address space.
      
      When decided whether or not to change the default value, a system
      developer should consider that mmap_base address could be placed
      anywhere up to 2^(value) bits away from the non-randomized location,
      which would introduce variable-sized areas above and below the mmap_base
      address such that the maximum vm_area_struct size may be reduced,
      preventing very large allocations.
      
      This patch (of 4):
      
      ASLR only uses as few as 8 bits to generate the random offset for the
      mmap base address on 32 bit architectures.  This value was chosen to
      prevent a poorly chosen value from dividing the address space in such a
      way as to prevent large allocations.  This may not be an issue on all
      platforms.  Allow the specification of a minimum number of bits so that
      platforms desiring greater ASLR protection may determine where to place
      the trade-off.
      Signed-off-by: default avatarDaniel Cashman <dcashman@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mark Salyzyn <salyzyn@android.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Nick Kralevich <nnk@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d07e2259
    • Piotr Kwapulinski's avatar
      mm/mmap.c: remove incorrect MAP_FIXED flag comparison from mmap_region · bc36f701
      Piotr Kwapulinski authored
      The following flag comparison in mmap_region makes no sense:
      
          if (!(vm_flags & MAP_FIXED))
              return -ENOMEM;
      
      The condition is always false and thus the above "return -ENOMEM" is
      never executed.  The vm_flags must not be compared with MAP_FIXED flag.
      The vm_flags may only be compared with VM_* flags.  MAP_FIXED has the
      same value as VM_MAYREAD.
      
      Hitting the rlimit is a slow path and find_vma_intersection should
      realize that there is no overlapping VMA for !MAP_FIXED case pretty
      quickly.
      Signed-off-by: default avatarPiotr Kwapulinski <kwapulinski.piotr@gmail.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bc36f701
    • Michal Hocko's avatar
      mm, vmscan: consider isolated pages in zone_reclaimable_pages · 9f6c399d
      Michal Hocko authored
      zone_reclaimable_pages counts how many pages are reclaimable in the
      given zone.  This currently includes all pages on file lrus and anon
      lrus if there is an available swap storage.  We do not consider
      NR_ISOLATED_{ANON,FILE} counters though which is not correct because
      these counters reflect temporarily isolated pages which are still
      reclaimable because they either get back to their LRU or get freed
      either by the page reclaim or page migration.
      
      The number of these pages might be sufficiently high to confuse users of
      zone_reclaimable_pages (e.g.  mbind can migrate large ranges of memory
      at once).
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Suggested-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9f6c399d
    • Andrew Morton's avatar
      fs/block_dev.c:bdev_write_page(): use blk_queue_enter(..., GFP_NOIO) · b832861c
      Andrew Morton authored
      bdev_write_page() is used by swapout and by writepage where we cannot
      use __GFP_FS or __GFP_IO.  So it is misleading to mention GFP_KERNEL
      here.
      
      blk_queue_enter() only actually looks at __GFP_DIRECT_RECLAIM, so no
      bugs were harmed in the making of this patch.
      
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <axboe@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b832861c
    • Vladimir Davydov's avatar
      memcg: do not allow to disable tcp accounting after limit is set · 9ee11ba4
      Vladimir Davydov authored
      There are two bits defined for cg_proto->flags - MEMCG_SOCK_ACTIVATED
      and MEMCG_SOCK_ACTIVE - both are set in tcp_update_limit, but the former
      is never cleared while the latter can be cleared by unsetting the limit.
      This allows to disable tcp socket accounting for new sockets after it
      was enabled by writing -1 to memory.kmem.tcp.limit_in_bytes while still
      guaranteeing that memcg_socket_limit_enabled static key will be
      decremented on memcg destruction.
      
      This functionality looks dubious, because it is not clear what a use
      case would be.  By enabling tcp accounting a user accepts the price.  If
      they then find the performance degradation unacceptable, they can always
      restart their workload with tcp accounting disabled.  It does not seem
      there is any need to flip it while the workload is running.
      
      Besides, it contradicts to how kmem accounting API works: writing
      whatever to memory.kmem.limit_in_bytes enables kmem accounting for the
      cgroup in question, after which it cannot be disabled.  Therefore one
      might expect that writing -1 to memory.kmem.tcp.limit_in_bytes just
      enables socket accounting w/o limiting it, which might be useful by
      itself, but it isn't true.
      
      Since this API peculiarity is not documented anywhere, I propose to drop
      it.  This will allow to simplify the code by dropping cg_proto->flags.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9ee11ba4
    • Vladimir Davydov's avatar
      vmscan: do not force-scan file lru if its absolute size is small · 316bda0e
      Vladimir Davydov authored
      We assume there is enough inactive page cache if the size of inactive
      file lru is greater than the size of active file lru, in which case we
      force-scan file lru ignoring anonymous pages.  While this logic works
      fine when there are plenty of page cache pages, it fails if the size of
      file lru is small (several MB): in this case (lru_size >> prio) will be
      0 for normal scan priorities, as a result, if inactive file lru happens
      to be larger than active file lru, anonymous pages of a cgroup will
      never get evicted unless the system experiences severe memory pressure,
      even if there are gigabytes of unused anonymous memory there, which is
      unfair in respect to other cgroups, whose workloads might be page cache
      oriented.
      
      This patch attempts to fix this by elaborating the "enough inactive page
      cache" check: it makes it not only check that inactive lru size > active
      lru size, but also that we will scan something from the cgroup at the
      current scan priority.  If these conditions do not hold, we proceed to
      SCAN_FRACT as usual.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      316bda0e
    • David Rientjes's avatar
      mm, vmalloc: remove VM_VPAGES · 244d63ee
      David Rientjes authored
      VM_VPAGES is unnecessary, it's easier to check is_vmalloc_addr() when
      reading /proc/vmallocinfo.
      
      [akpm@linux-foundation.org: remove VM_VPAGES reference via kvfree()]
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      244d63ee
    • Geliang Tang's avatar
      mm, thp: use list_first_entry_or_null() · 14669347
      Geliang Tang authored
      Simplify the code with list_first_entry_or_null().
      Signed-off-by: default avatarGeliang Tang <geliangtang@163.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      14669347
    • Jerome Marchand's avatar
      mm, procfs: breakdown RSS for anon, shmem and file in /proc/pid/status · 8cee852e
      Jerome Marchand authored
      There are several shortcomings with the accounting of shared memory
      (SysV shm, shared anonymous mapping, mapping of a tmpfs file).  The
      values in /proc/<pid>/status and <...>/statm don't allow to distinguish
      between shmem memory and a shared mapping to a regular file, even though
      theirs implication on memory usage are quite different: during reclaim,
      file mapping can be dropped or written back on disk, while shmem needs a
      place in swap.
      
      Also, to distinguish the memory occupied by anonymous and file mappings,
      one has to read the /proc/pid/statm file, which has a field for the file
      mappings (again, including shmem) and total memory occupied by these
      mappings (i.e.  equivalent to VmRSS in the <...>/status file.  Getting
      the value for anonymous mappings only is thus not exactly user-friendly
      (the statm file is intended to be rather efficiently machine-readable).
      
      To address both of these shortcomings, this patch adds a breakdown of
      VmRSS in /proc/<pid>/status via new fields RssAnon, RssFile and
      RssShmem, making use of the previous preparatory patch.  These fields
      tell the user the memory occupied by private anonymous pages, mapped
      regular files and shmem, respectively.  Other existing fields in /status
      and /statm files are left without change.  The /statm file can be
      extended in the future, if there's a need for that.
      
      Example (part of) /proc/pid/status output including the new Rss* fields:
      
      VmPeak:  2001008 kB
      VmSize:  2001004 kB
      VmLck:         0 kB
      VmPin:         0 kB
      VmHWM:      5108 kB
      VmRSS:      5108 kB
      RssAnon:              92 kB
      RssFile:            1324 kB
      RssShmem:           3692 kB
      VmData:      192 kB
      VmStk:       136 kB
      VmExe:         4 kB
      VmLib:      1784 kB
      VmPTE:      3928 kB
      VmPMD:        20 kB
      VmSwap:        0 kB
      HugetlbPages:          0 kB
      
      [vbabka@suse.cz: forward-porting, tweak changelog]
      Signed-off-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8cee852e
    • Jerome Marchand's avatar
      mm, shmem: add internal shmem resident memory accounting · eca56ff9
      Jerome Marchand authored
      Currently looking at /proc/<pid>/status or statm, there is no way to
      distinguish shmem pages from pages mapped to a regular file (shmem pages
      are mapped to /dev/zero), even though their implication in actual memory
      use is quite different.
      
      The internal accounting currently counts shmem pages together with
      regular files.  As a preparation to extend the userspace interfaces,
      this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for
      shmem pages separately from MM_FILEPAGES.  The next patch will expose it
      to userspace - this patch doesn't change the exported values yet, by
      adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was
      used before.  The only user-visible change after this patch is the OOM
      killer message that separates the reported "shmem-rss" from "file-rss".
      
      [vbabka@suse.cz: forward-porting, tweak changelog]
      Signed-off-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      eca56ff9
    • Vlastimil Babka's avatar
      mm, proc: reduce cost of /proc/pid/smaps for unpopulated shmem mappings · 48131e03
      Vlastimil Babka authored
      Following the previous patch, further reduction of /proc/pid/smaps cost
      is possible for private writable shmem mappings with unpopulated areas
      where the page walk invokes the .pte_hole function.  We can use radix
      tree iterator for each such area instead of calling find_get_entry() in
      a loop.  This is possible at the extra maintenance cost of introducing
      another shmem function shmem_partial_swap_usage().
      
      To demonstrate the diference, I have measured this on a process that
      creates a private writable 2GB mapping of a partially swapped out
      /dev/shm/file (which cannot employ the optimizations from the prvious
      patch) and doesn't populate it at all.  I time how long does it take to
      cat /proc/pid/smaps of this process 100 times.
      
      Before this patch:
      
      real    0m3.831s
      user    0m0.180s
      sys     0m3.212s
      
      After this patch:
      
      real    0m1.176s
      user    0m0.180s
      sys     0m0.684s
      
      The time is similar to the case where a radix tree iterator is employed
      on the whole mapping.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      48131e03
    • Vlastimil Babka's avatar
      mm, proc: reduce cost of /proc/pid/smaps for shmem mappings · 6a15a370
      Vlastimil Babka authored
      The previous patch has improved swap accounting for shmem mapping, which
      however made /proc/pid/smaps more expensive for shmem mappings, as we
      consult the radix tree for each pte_none entry, so the overal complexity
      is O(n*log(n)).
      
      We can reduce this significantly for mappings that cannot contain COWed
      pages, because then we can either use the statistics tha shmem object
      itself tracks (if the mapping contains the whole object, or the swap
      usage of the whole object is zero), or use the radix tree iterator,
      which is much more effective than repeated find_get_entry() calls.
      
      This patch therefore introduces a function shmem_swap_usage(vma) and
      makes /proc/pid/smaps use it when possible.  Only for writable private
      mappings of shmem objects (i.e.  tmpfs files) with the shmem object
      itself (partially) swapped outwe have to resort to the find_get_entry()
      approach.
      
      Hopefully such mappings are relatively uncommon.
      
      To demonstrate the diference, I have measured this on a process that
      creates a 2GB mapping and dirties single pages with a stride of 2MB, and
      time how long does it take to cat /proc/pid/smaps of this process 100
      times.
      
      Private writable mapping of a /dev/shm/file (the most complex case):
      
      real    0m3.831s
      user    0m0.180s
      sys     0m3.212s
      
      Shared mapping of an almost full mapping of a partially swapped /dev/shm/file
      (which needs to employ the radix tree iterator).
      
      real    0m1.351s
      user    0m0.096s
      sys     0m0.768s
      
      Same, but with /dev/shm/file not swapped (so no radix tree walk needed)
      
      real    0m0.935s
      user    0m0.128s
      sys     0m0.344s
      
      Private anonymous mapping:
      
      real    0m0.949s
      user    0m0.116s
      sys     0m0.348s
      
      The cost is now much closer to the private anonymous mapping case, unless
      the shmem mapping is private and writable.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6a15a370
    • Vlastimil Babka's avatar
      mm, proc: account for shmem swap in /proc/pid/smaps · c261e7d9
      Vlastimil Babka authored
      Currently, /proc/pid/smaps will always show "Swap: 0 kB" for
      shmem-backed mappings, even if the mapped portion does contain pages
      that were swapped out.  This is because unlike private anonymous
      mappings, shmem does not change pte to swap entry, but pte_none when
      swapping the page out.  In the smaps page walk, such page thus looks
      like it was never faulted in.
      
      This patch changes smaps_pte_entry() to determine the swap status for
      such pte_none entries for shmem mappings, similarly to how
      mincore_page() does it.  Swapped out shmem pages are thus accounted for.
      For private mappings of tmpfs files that COWed some of the pages, swaped
      out status of the original shmem pages is naturally ignored.  If some of
      the private copies was also swapped out, they are accounted via their
      page table swap entries, so the resulting reported swap usage is then a
      sum of both swapped out private copies, and swapped out shmem pages that
      were not COWed.  No double accounting can thus happen.
      
      The accounting is arguably still not as precise as for private anonymous
      mappings, since now we will count also pages that the process in
      question never accessed, but another process populated them and then let
      them become swapped out.  I believe it is still less confusing and
      subtle than not showing any swap usage by shmem mappings at all.
      Swapped out counter might of interest of users who would like to prevent
      from future swapins during performance critical operation and pre-fault
      them at their convenience.  Especially for larger swapped out regions
      the cost of swapin is much higher than a fresh page allocation.  So a
      differentiation between pte_none vs.  swapped out is important for those
      usecases.
      
      One downside of this patch is that it makes /proc/pid/smaps more
      expensive for shmem mappings, as we consult the radix tree for each
      pte_none entry, so the overal complexity is O(n*log(n)).  I have
      measured this on a process that creates a 2GB mapping and dirties single
      pages with a stride of 2MB, and time how long does it take to cat
      /proc/pid/smaps of this process 100 times.
      
      Private anonymous mapping:
      
      real    0m0.949s
      user    0m0.116s
      sys     0m0.348s
      
      Mapping of a /dev/shm/file:
      
      real    0m3.831s
      user    0m0.180s
      sys     0m3.212s
      
      The difference is rather substantial, so the next patch will reduce the
      cost for shared or read-only mappings.
      
      In a less controlled experiment, I've gathered pids of processes on my
      desktop that have either '/dev/shm/*' or 'SYSV*' in smaps.  This
      included the Chrome browser and some KDE processes.  Again, I've run cat
      /proc/pid/smaps on each 100 times.
      
      Before this patch:
      
      real    0m9.050s
      user    0m0.518s
      sys     0m8.066s
      
      After this patch:
      
      real    0m9.221s
      user    0m0.541s
      sys     0m8.187s
      
      This suggests low impact on average systems.
      
      Note that this patch doesn't attempt to adjust the SwapPss field for
      shmem mappings, which would need extra work to determine who else could
      have the pages mapped.  Thus the value stays zero except for COWed
      swapped out pages in a shmem mapping, which are accounted as usual.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c261e7d9
    • Vlastimil Babka's avatar
      mm, documentation: clarify /proc/pid/status VmSwap limitations for shmem · bf9683d6
      Vlastimil Babka authored
      This series is based on Jerome Marchand's [1] so let me quote the first
      paragraph from there:
      
      There are several shortcomings with the accounting of shared memory
      (sysV shm, shared anonymous mapping, mapping to a tmpfs file).  The
      values in /proc/<pid>/status and statm don't allow to distinguish
      between shmem memory and a shared mapping to a regular file, even though
      their implications on memory usage are quite different: at reclaim, file
      mapping can be dropped or written back on disk while shmem needs a place
      in swap.  As for shmem pages that are swapped-out or in swap cache, they
      aren't accounted at all.
      
      The original motivation for myself is that a customer found (IMHO
      rightfully) confusing that e.g.  top output for process swap usage is
      unreliable with respect to swapped out shmem pages, which are not
      accounted for.
      
      The fundamental difference between private anonymous and shmem pages is
      that the latter has PTE's converted to pte_none, and not swapents.  As
      such, they are not accounted to the number of swapents visible e.g.  in
      /proc/pid/status VmSwap row.  It might be theoretically possible to use
      swapents when swapping out shmem (without extra cost, as one has to
      change all mappers anyway), and on swap in only convert the swapent for
      the faulting process, leaving swapents in other processes until they
      also fault (so again no extra cost).  But I don't know how many
      assumptions this would break, and it would be too disruptive change for
      a relatively small benefit.
      
      Instead, my approach is to document the limitation of VmSwap, and
      provide means to determine the swap usage for shmem areas for those who
      are interested and willing to pay the price, using /proc/pid/smaps.
      Because outside of ipcs, I don't think it's possible to currently to
      determine the usage at all.  The previous patchset [1] did introduce new
      shmem-specific fields into smaps output, and functions to determine the
      values.  I take a simpler approach, noting that smaps output already has
      a "Swap: X kB" line, where currently X == 0 always for shmem areas.  I
      think we can just consider this a bug and provide the proper value by
      consulting the radix tree, as e.g.  mincore_page() does.  In the patch
      changelog I explain why this is also not perfect (and cannot be without
      swapents), but still arguably much better than showing a 0.
      
      The last two patches are adapted from Jerome's patchset and provide a
      VmRSS breakdown to RssAnon, RssFile and RssShm in /proc/pid/status.
      Hugh noted that this is a welcome addition, and I agree that it might
      help e.g.  debugging process memory usage at albeit non-zero, but still
      rather low cost of extra per-mm counter and some page flag checks.
      
      [1] http://lwn.net/Articles/611966/
      
      This patch (of 6):
      
      The documentation for /proc/pid/status does not mention that the value
      of VmSwap counts only swapped out anonymous private pages, and not
      swapped out pages of the underlying shmem objects (for shmem mappings).
      This is not obvious, so document this limitation.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bf9683d6
    • Yaowei Bai's avatar
      mm/mmzone.c: memmap_valid_within() can be boolean · 5b80287a
      Yaowei Bai authored
      Make memmap_valid_within return bool due to this particular function
      only using either one or zero as its return value.
      
      No functional change.
      Signed-off-by: default avatarYaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5b80287a
    • Geliang Tang's avatar
      mm/vmalloc.c: use list_{next,first}_entry · 6219c2a2
      Geliang Tang authored
      To make the intention clearer, use list_{next,first}_entry instead of
      list_entry.
      Signed-off-by: default avatarGeliang Tang <geliangtang@163.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6219c2a2
    • Michal Hocko's avatar
      mm/page_alloc.c: do not loop over ALLOC_NO_WATERMARKS without triggering reclaim · 33d53103
      Michal Hocko authored
      __alloc_pages_slowpath is looping over ALLOC_NO_WATERMARKS requests if
      __GFP_NOFAIL is requested.  This is fragile because we are basically
      relying on somebody else to make the reclaim (be it the direct reclaim
      or OOM killer) for us.  The caller might be holding resources (e.g.
      locks) which block other other reclaimers from making any progress for
      example.  Remove the retry loop and rely on __alloc_pages_slowpath to
      invoke all allowed reclaim steps and retry logic.
      
      We have to be careful about __GFP_NOFAIL allocations from the
      PF_MEMALLOC context even though this is a very bad idea to begin with
      because no progress can be gurateed at all.  We shouldn't break the
      __GFP_NOFAIL semantic here though.  It could be argued that this is
      essentially GFP_NOWAIT context which we do not support but PF_MEMALLOC
      is much harder to check for existing users because they might happen
      deep down the code path performed much later after setting the flag so
      we cannot really rule out there is no kernel path triggering this
      combination.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      33d53103
    • Michal Hocko's avatar
      mm/page_alloc.c: get rid of __alloc_pages_high_priority() · fde82aaa
      Michal Hocko authored
      __alloc_pages_high_priority doesn't do anything special other than it
      calls get_page_from_freelist and loops around GFP_NOFAIL allocation
      until it succeeds.  It would be better if the first part was done in
      __alloc_pages_slowpath where we modify the zonelist because this would
      be easier to read and understand.  Opencoding the function into its only
      caller allows to simplify it a bit as well.
      
      This patch doesn't introduce any functional changes.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fde82aaa
    • Yaowei Bai's avatar
      mm/zonelist: enumerate zonelists array index · c00eb15a
      Yaowei Bai authored
      Hardcoding index to zonelists array in gfp_zonelist() is not a good
      idea, let's enumerate it to improve readability.
      
      No functional change.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
      [n-horiguchi@ah.jp.nec.com: fix warning in comparing enumerator]
      Signed-off-by: default avatarYaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c00eb15a