1. 04 Jun, 2014 40 commits
    • slab: get_online_mems for kmem_cache_{create,destroy,shrink} · 03afc0e2
      Vladimir Davydov authored
      When we create a sl[au]b cache, we allocate kmem_cache_node structures
       for each online NUMA node.  To handle nodes being taken online/offline,
       we register a memory hotplug notifier and allocate or free the
       kmem_cache_node corresponding to the node that changes its state, for
       each kmem cache.
      
       To synchronize between the two paths we hold the slab_mutex during both
       the cache creation/destruction path and while tuning per-node parts of
       kmem caches in the memory hotplug handler, but that's not quite right,
       because it does not guarantee that a newly created cache will have all
       kmem_cache_nodes initialized if it races with memory hotplug.  For
       instance, in the case of slub:
      
          CPU0                            CPU1
          ----                            ----
          kmem_cache_create:              online_pages:
           __kmem_cache_create:            slab_memory_callback:
                                            slab_mem_going_online_callback:
                                             lock slab_mutex
                                             for each slab_caches list entry
                                                 allocate kmem_cache node
                                             unlock slab_mutex
            lock slab_mutex
            init_kmem_cache_nodes:
             for_each_node_state(node, N_NORMAL_MEMORY)
                 allocate kmem_cache node
            add kmem_cache to slab_caches list
            unlock slab_mutex
                                          online_pages (continued):
                                           node_states_set_node
      
      As a result we'll get a kmem cache with not all kmem_cache_nodes
      allocated.
      
      To avoid issues like that we should hold get/put_online_mems() during
      the whole kmem cache creation/destruction/shrink paths, just like we
      deal with cpu hotplug.  This patch does the trick.
      
       Note that after this patch is applied, there is no longer any need to
       take the slab_mutex for kmem_cache_shrink, so it is removed from there.
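
       Conceptually, the resulting locking order in these entry points looks
       roughly like the sketch below (illustrative only, not the literal hunk
       from this patch):

        int kmem_cache_shrink(struct kmem_cache *cachep)
        {
                int ret;

                get_online_cpus();      /* stable cpu online mask */
                get_online_mems();      /* stable node online mask: no kmem_cache_node comes or goes */
                ret = __kmem_cache_shrink(cachep);      /* no slab_mutex needed here any more */
                put_online_mems();
                put_online_cpus();
                return ret;
        }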
       Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03afc0e2
    • mem-hotplug: implement get/put_online_mems · bfc8c901
      Vladimir Davydov authored
      kmem_cache_{create,destroy,shrink} need to get a stable value of
      cpu/node online mask, because they init/destroy/access per-cpu/node
      kmem_cache parts, which can be allocated or destroyed on cpu/mem
      hotplug.  To protect against cpu hotplug, these functions use
      {get,put}_online_cpus.  However, they do nothing to synchronize with
      memory hotplug - taking the slab_mutex does not eliminate the
      possibility of race as described in patch 2.
      
      What we need there is something like get_online_cpus, but for memory.
       We already have lock_memory_hotplug, which serves the purpose, but
       it's a bit of a hammer right now, because it's backed by a mutex.  As a
       result, it imposes some limitations on locking order, which are not
       desirable, and it can't be used just like get_online_cpus.  That's why in
      patch 1 I substitute it with get/put_online_mems, which work exactly
      like get/put_online_cpus except they block not cpu, but memory hotplug.
      
       [ v1 can be found at https://lkml.org/lkml/2014/4/6/68.  I NAK'ed it
         myself, because it used an rw semaphore for get/put_online_mems,
         making them deadlock prone.  ]
      
      This patch (of 2):
      
      {un}lock_memory_hotplug, which is used to synchronize against memory
      hotplug, is currently backed by a mutex, which makes it a bit of a
      hammer - threads that only want to get a stable value of online nodes
      mask won't be able to proceed concurrently.  Also, it imposes some
      strong locking ordering rules on it, which narrows down the set of its
      usage scenarios.
      
       This patch introduces get/put_online_mems, which are the same as
       get/put_online_cpus, but for memory hotplug, i.e.  executing code
       inside a get/put_online_mems section will guarantee a stable value of
       online nodes, present pages, etc.
      
      lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
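
       For illustration, a reader section is expected to look like the sketch
       below; setup_per_node_state() is a made-up placeholder, and (if memory
       serves) the hotplug writers wrap their work in
       mem_hotplug_begin()/mem_hotplug_done():

        /* reader: online nodes and present pages are stable in here */
        get_online_mems();
        for_each_online_node(nid)
                setup_per_node_state(nid);      /* hypothetical per-node work */
        put_online_mems();

        /* writer (memory hotplug path), sketched: */
        mem_hotplug_begin();
        /* ... online/offline the memory, update node states ... */
        mem_hotplug_done();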
       Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bfc8c901
    • memcg: un-export __memcg_kmem_get_cache · e8d9df3a
      Vladimir Davydov authored
       It is only used in slab and should not be used anywhere else, so there
       is no need to export it.
       Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e8d9df3a
    • mm: page_alloc: do not cache reclaim distances · 5f7a75ac
      Mel Gorman authored
      pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed
      by zone_reclaim due to its distance.  As it is expected that
      zone_reclaim_mode will be rarely enabled it is unreasonable for all
      machines to take a penalty.  Fortunately, the zone_reclaim_mode() path
      is already slow and it is the path that takes the hit.
       Signed-off-by: Mel Gorman <mgorman@suse.de>
       Acked-by: Johannes Weiner <hannes@cmpxchg.org>
       Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
       Acked-by: Michal Hocko <mhocko@suse.cz>
       Reviewed-by: Christoph Lameter <cl@linux.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f7a75ac
    • mm: disable zone_reclaim_mode by default · 4f9b16a6
      Mel Gorman authored
       When it was introduced, zone_reclaim_mode made sense as NUMA distances
       were punishing and workloads were generally partitioned to fit into a
       NUMA node.  NUMA machines are now common but few of the workloads are
       NUMA-aware and it's routine to see major performance degradation due to
       zone_reclaim_mode being enabled, yet relatively few users can identify
       the problem.

       Those that require zone_reclaim_mode are likely to be able to detect
       when it needs to be enabled and tune appropriately, so let's have a
       sensible default for the bulk of users.
      
      This patch (of 2):
      
      zone_reclaim_mode causes processes to prefer reclaiming memory from
      local node instead of spilling over to other nodes.  This made sense
      initially when NUMA machines were almost exclusively HPC and the
      workload was partitioned into nodes.  The NUMA penalties were
      sufficiently high to justify reclaiming the memory.  On current machines
      and workloads it is often the case that zone_reclaim_mode destroys
      performance but not all users know how to detect this.  Favour the
      common case and disable it by default.  Users that are sophisticated
      enough to know they need zone_reclaim_mode will detect it.
       Signed-off-by: Mel Gorman <mgorman@suse.de>
       Acked-by: Johannes Weiner <hannes@cmpxchg.org>
       Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
       Acked-by: Michal Hocko <mhocko@suse.cz>
       Reviewed-by: Christoph Lameter <cl@linux.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f9b16a6
    • hugetlb: add support for gigantic page allocation at runtime · 944d9fec
      Luiz Capitulino authored
       HugeTLB is limited to allocating hugepages whose size is less than
       MAX_ORDER order.  This is so because HugeTLB allocates hugepages via
       the buddy allocator.  Gigantic pages (that is, pages whose size is
       greater than MAX_ORDER order) have to be allocated at boottime.
      
      However, boottime allocation has at least two serious problems.  First,
      it doesn't support NUMA and second, gigantic pages allocated at boottime
      can't be freed.
      
      This commit solves both issues by adding support for allocating gigantic
      pages during runtime.  It works just like regular sized hugepages,
      meaning that the interface in sysfs is the same, it supports NUMA, and
      gigantic pages can be freed.
      
      For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G
      gigantic pages on node 1, one can do:
      
       # echo 2 > \
         /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
      
      And to free them all:
      
       # echo 0 > \
         /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
      
      The one problem with gigantic page allocation at runtime is that it
      can't be serviced by the buddy allocator.  To overcome that problem,
      this commit scans all zones from a node looking for a large enough
      contiguous region.  When one is found, it's allocated by using CMA, that
      is, we call alloc_contig_range() to do the actual allocation.  For
      example, on x86_64 we scan all zones looking for a 1GB contiguous
      region.  When one is found, it's allocated by alloc_contig_range().
      
       One expected issue with that approach is that such gigantic contiguous
       regions tend to vanish as runtime goes by.  The best way to avoid this
       for now is to make gigantic page allocations very early during system
       boot, say from an init script.  Other possible optimizations include
       using compaction, which is supported by CMA but is not explicitly used
       by this commit.
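
       A rough sketch of that scan (pfn_range_is_allocatable() is a
       hypothetical helper; the real code differs in detail):

        for_each_zone_zonelist(zone, z, zonelist, gfp_zone(GFP_HIGHUSER_MOVABLE)) {
                unsigned long pfn;

                for (pfn = zone->zone_start_pfn;
                     pfn + nr_pages <= zone_end_pfn(zone); pfn += nr_pages) {
                        if (!pfn_range_is_allocatable(pfn, nr_pages))   /* hypothetical check */
                                continue;
                        /* migrate/free the range and take it, CMA-style */
                        if (!alloc_contig_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE))
                                return pfn_to_page(pfn);
                }
        }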
      
      It's also important to note the following:
      
       1. Gigantic pages allocated at boottime by the hugepages= command-line
          option can be freed at runtime just fine
      
        2. This commit adds support for gigantic pages only to x86_64. The
           reason is that I don't have access to nor experience with other archs.
           The code is arch independent though, so it should be simple to add
           support to different archs
      
       3. I didn't add support for hugepage overcommit, that is allocating
          a gigantic page on demand when
         /proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't
         think it's reasonable to do the hard and long work required for
         allocating a gigantic page at fault time. But it should be simple
         to add this if wanted
      
      [akpm@linux-foundation.org: coding-style fixes]
       Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
       Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
       Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
       Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
       Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      944d9fec
    • hugetlb: move helpers up in the file · 1cac6f2c
      Luiz Capitulino authored
       The next commit will add new code that calls the
       for_each_node_mask_to_alloc() macro.  Move it, its buddy
       for_each_node_mask_to_free() and their dependencies up in the file so
       the new code can use them.  This is just code movement, no logic change.
       Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
       Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
       Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
       Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
       Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
       Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
       Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1cac6f2c
    • hugetlb: update_and_free_page(): don't clear PG_reserved bit · a7407a27
      Luiz Capitulino authored
       HugeTLB pages never get the PG_reserved bit set, so don't clear it.

       However, note that if the bit gets mistakenly set, free_pages_check()
       will catch it.
       Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
       Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
       Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
       Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a7407a27
    • hugetlb: add hstate_is_gigantic() · bae7f4ae
      Luiz Capitulino authored
       Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
       Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
       Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
       Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
       Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
       Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
       Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bae7f4ae
    • hugetlb: prep_compound_gigantic_page(): drop __init marker · 2906dd52
      Luiz Capitulino authored
      The HugeTLB subsystem uses the buddy allocator to allocate hugepages
      during runtime.  This means that hugepages allocation during runtime is
      limited to MAX_ORDER order.  For archs supporting gigantic pages (that
      is, page sizes greater than MAX_ORDER), this in turn means that those
      pages can't be allocated at runtime.
      
      HugeTLB supports gigantic page allocation during boottime, via the boot
      allocator.  To this end the kernel provides the command-line options
      hugepagesz= and hugepages=, which can be used to instruct the kernel to
      allocate N gigantic pages during boot.
      
      For example, x86_64 supports 2M and 1G hugepages, but only 2M hugepages
      can be allocated and freed at runtime.  If one wants to allocate 1G
      gigantic pages, this has to be done at boot via the hugepagesz= and
      hugepages= command-line options.
      
      Now, gigantic page allocation at boottime has two serious problems:
      
       1. Boottime allocation is not NUMA aware. On a NUMA machine the kernel
          evenly distributes boottime allocated hugepages among nodes.
      
          For example, suppose you have a four-node NUMA machine and want
          to allocate four 1G gigantic pages at boottime. The kernel will
          allocate one gigantic page per node.
      
           On the other hand, we do have users who want to be able to specify
           which NUMA node gigantic pages should be allocated from, so that
           they can place virtual machines on a specific NUMA node.
      
       2. Gigantic pages allocated at boottime can't be freed
      
      At this point it's important to observe that regular hugepages allocated
       at runtime don't have those problems.  This is so because the HugeTLB
       interface for runtime allocation in sysfs supports NUMA and runtime
       allocated pages can be freed just fine via the buddy allocator.
      
      This series adds support for allocating gigantic pages at runtime.  It
      does so by allocating gigantic pages via CMA instead of the buddy
      allocator.  Releasing gigantic pages is also supported via CMA.  As this
       series builds on top of the existing HugeTLB interface, gigantic page
       allocation and freeing work just like they do for regular sized
       hugepages.  This also means that NUMA support just works.
      
      For example, to allocate two 1G gigantic pages on node 1, one can do:
      
       # echo 2 > \
         /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
      
      And, to release all gigantic pages on the same node:
      
       # echo 0 > \
         /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
      
      Please, refer to patch 5/5 for full technical details.
      
      Finally, please note that this series is a follow up for a previous series
      that tried to extend the command-line options set to be NUMA aware:
      
       http://marc.info/?l=linux-mm&m=139593335312191&w=2
      
      During the discussion of that series it was agreed that having runtime
      allocation support for gigantic pages was a better solution.
      
      This patch (of 5):
      
      This function is going to be used by non-init code in a future
      commit.
       Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
       Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
       Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
       Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2906dd52
    • mm/mmap.c: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO · 14bd5b45
      Duan Jiong authored
      Fix a coccinelle error regarding usage of IS_ERR and PTR_ERR instead of
      PTR_ERR_OR_ZERO.
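
       The transformation is presumably the usual one (the variable name here
       is made up):

        /* before */
        if (IS_ERR(ptr))
                return PTR_ERR(ptr);
        return 0;

        /* after */
        return PTR_ERR_OR_ZERO(ptr);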
       Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
       Acked-by: Rik van Riel <riel@redhat.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      14bd5b45
    • slab: document kmalloc_order · cea371f4
      Vladimir Davydov authored
       Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cea371f4
    • memory-hotplug: update documentation to hide information about SECTIONS and remove end_phys_index · 56a3c655
      Li Zhong authored
      Seems we all agree that information about SECTION, e.g. section size,
      sections per memory block should be kept as kernel internals, and not
      exposed to userspace.
      
       This patch updates Documentation/memory-hotplug.txt to refer to memory
       blocks instead of memory sections where appropriate and adds a
       paragraph to explain that memory blocks are made of memory sections.
       The documentation update is mostly provided by Nathan.

       Also, end_phys_index in the code is actually not the end section id
       but the end memory block id, which should always be the same as
       phys_index, so it is removed here.
       Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
       Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56a3c655
    • mm: pass VM_BUG_ON() reason to dump_page() · e4f67422
      Dave Hansen authored
       I recently added a patch to let folks pass a "reason" string to
       dump_page() which gets dumped out along with the page's data.  This
       essentially saves the bug-reader a trip into the source to figure out
       why we BUG_ON()'d.
      
      The new VM_BUG_ON_PAGE() passes in NULL for "reason".  It seems like we
      might as well pass the BUG_ON() condition if we have it.  This will
      bloat kernels a bit with ~160 new strings, but this is all under a
      debugging option anyway.
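
       The macro presumably ends up roughly like this (hence the stringify.h
       include noted below):

        #define VM_BUG_ON_PAGE(cond, page)                                      \
                do {                                                            \
                        if (unlikely(cond)) {                                   \
                                dump_page(page, "VM_BUG_ON_PAGE(" __stringify(cond)")"); \
                                BUG();                                          \
                        }                                                       \
                } while (0)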
      
      	page:ffffea0008560280 count:1 mapcount:0 mapping:(null) index:0x0
      	page flags: 0xbfffc0000000001(locked)
      	page dumped because: VM_BUG_ON_PAGE(PageLocked(page))
      	------------[ cut here ]------------
      	kernel BUG at /home/davehans/linux.git/mm/filemap.c:464!
      	invalid opcode: 0000 [#1] SMP
      	CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.14.0+ #251
      	Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      	...
      
      [akpm@linux-foundation.org: include stringify.h]
       Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
       Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
       Acked-by: Davidlohr Bueso <davidlohr@hp.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4f67422
    • mm: memcontrol: remove hierarchy restrictions for swappiness and oom_control · 3dae7fec
      Johannes Weiner authored
      Per-memcg swappiness and oom killing can currently not be tweaked on a
      memcg that is part of a hierarchy, but not the root of that hierarchy.
      Users have complained that they can't configure this when they turned on
      hierarchy mode.  In fact, with hierarchy mode becoming the default, this
      restriction disables the tunables entirely.
      
      But there is no good reason for this restriction.  The settings for
      swappiness and OOM killing are taken from whatever memcg whose limit
      triggered reclaim and OOM invocation, regardless of its position in the
      hierarchy tree.
      
      Allow setting swappiness on any group.  The knob on the root memcg
      already reads the global VM swappiness, make it writable as well.
      
      Allow disabling the OOM killer on any non-root memcg.
       Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3dae7fec
    • mm/mempool: warn about __GFP_ZERO usage · 8bf8fcb0
      Sebastian Ott authored
      Memory obtained via mempool_alloc is not always zeroed even when
      called with __GFP_ZERO. Add a note and VM_BUG_ON statement to make
      that clear.
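
       The check itself is presumably a one-liner near the top of
       mempool_alloc(), something like:

        VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);   /* pool elements are recycled, not re-zeroed */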
      
      [akpm@linux-foundation.org: use VM_WARN_ON_ONCE]
       Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8bf8fcb0
    • include/linux/mmdebug.h: add VM_WARN_ON() and VM_WARN_ON_ONCE() · 02a8efed
      Andrew Morton authored
      WARN_ON() and WARN_ON_ONCE(), dependent on CONFIG_DEBUG_VM
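
       Probably along these lines (the !CONFIG_DEBUG_VM stubs may differ;
       BUILD_BUG_ON_INVALID keeps the condition compile-checked):

        #ifdef CONFIG_DEBUG_VM
        #define VM_WARN_ON(cond)        WARN_ON(cond)
        #define VM_WARN_ON_ONCE(cond)   WARN_ON_ONCE(cond)
        #else
        #define VM_WARN_ON(cond)        BUILD_BUG_ON_INVALID(cond)
        #define VM_WARN_ON_ONCE(cond)   BUILD_BUG_ON_INVALID(cond)
        #endif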
      
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      02a8efed
    • mm/huge_memory.c: complete conversion to pr_foo() · ae3a8c1c
      Andrew Morton authored
      It was using a mix of pr_foo() and printk(KERN_ERR ...).
      
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae3a8c1c
    • thp: consolidate assert checks in __split_huge_page() · ff9e43eb
      Kirill A. Shutemov authored
      It doesn't make sense to have two assert checks for each invariant: one
      for printing and one for BUG().
      
       Let's trigger BUG() if we print an error message.
       Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ff9e43eb
    • arch/x86/kernel/pci-dma.c: fix dma_generic_alloc_coherent() when CONFIG_DMA_CMA is enabled · 38f7ea5a
      Akinobu Mita authored
       dma_generic_alloc_coherent() first attempts to allocate with
       dma_alloc_from_contiguous() if CONFIG_DMA_CMA is enabled.  But the
       memory region allocated by it may not fit within the device's DMA
       mask.  This change makes it fall back to the usual alloc_pages_node()
       allocation for such cases.
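
       Sketch of the resulting logic (dma_mask handling simplified):

        page = dma_alloc_from_contiguous(dev, count, get_order(size));
        if (page && page_to_phys(page) + size > dma_mask) {
                /* CMA handed back memory the device cannot address: give it up */
                dma_release_from_contiguous(dev, page, count);
                page = NULL;
        }
        if (!page)
                page = alloc_pages_node(dev_to_node(dev), flag, get_order(size));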
       Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      38f7ea5a
    • cma: add placement specifier for "cma=" kernel parameter · 5ea3b1b2
      Akinobu Mita authored
      Currently, "cma=" kernel parameter is used to specify the size of CMA,
      but we can't specify where it is located.  We want to locate CMA below
      4GB for devices only supporting 32-bit addressing on 64-bit systems
      without iommu.
      
      This enables to specify the placement of CMA by extending "cma=" kernel
      parameter.
      
      Examples:
       1. locate 64MB CMA below 4GB by "cma=64M@0-4G"
        2. locate 64MB CMA exactly at 512MB by "cma=64M@512M"
      
       Note that the DMA contiguous memory allocator on x86 assumes that
       page_address() works for the pages to allocate.  So this change limits
       the end address of the contiguous memory area to max_pfn_mapped, via
       the argument of dma_contiguous_reserve(), to prevent it from being
       located in a highmem area.
       Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ea3b1b2
    • memblock: introduce memblock_alloc_range() · 2bfc2862
      Akinobu Mita authored
       This introduces memblock_alloc_range(), which allocates a memblock
       region from the specified range of physical addresses.  I would like
       to use this function to specify the location of CMA.
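
       Assumed signature and usage (the allocation is confined to [start,
       end)):

        phys_addr_t addr;

        addr = memblock_alloc_range(size, align, start, end);
        if (!addr)
                return -ENOMEM;         /* nothing suitable inside the requested range */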
       Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2bfc2862
    • intel-iommu: integrate DMA CMA · 36746436
      Akinobu Mita authored
      This adds support for the DMA Contiguous Memory Allocator for
      intel-iommu.  This change enables dma_alloc_coherent() to allocate big
      contiguous memory.
      
       It is achieved in the same way as nommu_dma_ops currently does, i.e.
       it tries to allocate memory with dma_alloc_from_contiguous() and uses
       alloc_pages() as a fallback.
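
       Sketch of that allocation order inside the intel-iommu .alloc
       callback:

        int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
        struct page *page;

        page = dma_alloc_from_contiguous(dev, count, get_order(size));
        if (!page)
                page = alloc_pages(flags, get_order(size));     /* fall back to the buddy allocator */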
       Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      36746436
    • x86: enable DMA CMA with swiotlb · 9c5a3621
      Akinobu Mita authored
       The DMA Contiguous Memory Allocator support on x86 is disabled when
       the swiotlb config option is enabled.  So DMA CMA is always disabled
       on x86_64 because swiotlb is always enabled there.  This attempts to
       support DMA CMA with the swiotlb config option enabled.
      
      The contiguous memory allocator on x86 is integrated in the function
      dma_generic_alloc_coherent() which is .alloc callback in nommu_dma_ops
      for dma_alloc_coherent().
      
      x86_swiotlb_alloc_coherent() which is .alloc callback in swiotlb_dma_ops
      tries to allocate with dma_generic_alloc_coherent() firstly and then
      swiotlb_alloc_coherent() is called as a fallback.
      
       The main part of supporting DMA CMA with swiotlb is changing
       x86_swiotlb_free_coherent(), the .free callback in swiotlb_dma_ops for
       dma_free_coherent(), so that it can distinguish memory allocated by
       dma_generic_alloc_coherent() from memory allocated by
       swiotlb_alloc_coherent() and release it with
       dma_generic_free_coherent(), which can handle contiguous memory.  This
       change requires making is_swiotlb_buffer() a global function.
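
       The free path then presumably looks roughly like this (details such as
       dma_to_phys() are from memory, not from the patch):

        void x86_swiotlb_free_coherent(struct device *dev, size_t size,
                                       void *vaddr, dma_addr_t dma_addr,
                                       struct dma_attrs *attrs)
        {
                if (is_swiotlb_buffer(dma_to_phys(dev, dma_addr)))
                        swiotlb_free_coherent(dev, size, vaddr, dma_addr);
                else
                        dma_generic_free_coherent(dev, size, vaddr, dma_addr, attrs);
        }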
      
      This also needs to change .free callback in the dma_map_ops for amd_gart
      and sta2x11, because these dma_ops are also using
      dma_generic_alloc_coherent().
       Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
       Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
       Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c5a3621
    • x86: make dma_alloc_coherent() return zeroed memory if CMA is enabled · d92ef66c
      Akinobu Mita authored
      This patchset enhances the DMA Contiguous Memory Allocator on x86.
      
       Currently DMA CMA is only supported with the pci-nommu dma_map_ops and
       furthermore it can't be enabled on x86_64.  But I would like to
       allocate big contiguous memory with dma_alloc_coherent() and hand it
       to a device that requires it, regardless of which dma mapping
       implementation is actually used in the system.

       So this makes it work with the swiotlb and intel-iommu dma_map_ops,
       too.  And this also extends the "cma=" kernel parameter to specify a
       placement constraint via the physical address range of memory
       allocations.  For example, "cma=64M@0-4G" makes CMA allocate memory
       below 4GB, which is required for devices only supporting 32-bit
       addressing on 64-bit systems without an iommu.
      
      This patch (of 5):
      
      Calling dma_alloc_coherent() with __GFP_ZERO must return zeroed memory.
      
       But when the contiguous memory allocator (CMA) is enabled on x86 and
       the memory region is allocated by dma_alloc_from_contiguous(), it
       doesn't return zeroed memory, because dma_generic_alloc_coherent()
       forgets to fill the memory region with zeros if it was allocated by
       dma_alloc_from_contiguous().
      
      Most implementations of dma_alloc_coherent() return zeroed memory
      regardless of whether __GFP_ZERO is specified.  So this fixes it by
      unconditionally zeroing the allocated memory region.
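
       The fix is presumably a single unconditional line after a successful
       allocation in dma_generic_alloc_coherent():

        memset(page_address(page), 0, size);    /* honour __GFP_ZERO and match other implementations */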
      
       Alternatively, we could fix dma_alloc_from_contiguous() to return
       zeroed-out memory and remove the memset() from all callers of it.  But
       we can't simply remove the memset on arm because __dma_clear_buffer()
       is used there to ensure cache flushing and it is used in many places.
       Of course we could do a redundant memset in
       dma_alloc_from_contiguous(), but I think this patch has less impact
       for fixing this problem.
       Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d92ef66c
    • mm,vmacache: optimize overflow system-wide flushing · 6b4ebc3a
      Davidlohr Bueso authored
      For single threaded workloads, we can avoid flushing and iterating through
      the entire list of tasks, making the whole function a lot faster,
      requiring only a single atomic read for the mm_users.
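
       The fast path is presumably just an early exit at the top of
       vmacache_flush_all(), something like:

        /* single user of this mm: no other task can hold stale cache entries for it */
        if (atomic_read(&mm->mm_users) == 1)
                return;
        /* otherwise fall through and invalidate every task's cache, as before */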
       Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
       Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b4ebc3a
    • mm,vmacache: add debug data · 4f115147
      Davidlohr Bueso authored
      Introduce a CONFIG_DEBUG_VM_VMACACHE option to enable counting the cache
      hit rate -- exported in /proc/vmstat.
      
       Any updates to the caching scheme need this kind of data, thus it can
       save some work re-implementing the counting all the time.
       Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f115147
    • mm: only force scan in reclaim when none of the LRUs are big enough. · 6f04f48d
      Suleiman Souhlal authored
      Prior to this change, we would decide whether to force scan a LRU during
      reclaim if that LRU itself was too small for the current priority.
      However, this can lead to the file LRU getting force scanned even if
      there are a lot of anonymous pages we can reclaim, leading to hot file
      pages getting needlessly reclaimed.
      
      To address this, we instead only force scan when none of the reclaimable
      LRUs are big enough.
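
       In pseudo-C, with hypothetical helper names (lru_size(),
       swap_available), the intended condition is:

        bool some_lru_big_enough = false;
        enum lru_list lru;

        for_each_evictable_lru(lru) {
                if (!is_file_lru(lru) && !swap_available)
                        continue;       /* anon is not reclaimable without swap */
                if (lru_size(lruvec, lru) >> sc->priority)
                        some_lru_big_enough = true;
        }
        force_scan = !some_lru_big_enough;      /* only force scan when *nothing* is big enough */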
      
      Gives huge improvements with zswap.  For example, when doing -j20 kernel
      build in a 500MB container with zswap enabled, runtime (in seconds) is
      greatly reduced:
      
      x without this change
      + with this change
          N           Min           Max        Median           Avg        Stddev
      x   5       700.997       790.076       763.928        754.05      39.59493
      +   5       141.634       197.899       155.706         161.9     21.270224
      Difference at 95.0% confidence
              -592.15 +/- 46.3521
              -78.5293% +/- 6.14709%
              (Student's t, pooled s = 31.7819)
      
      Should also give some improvements in regular (non-zswap) swap cases.
      
      Yes, hughd found significant speedup using regular swap, with several
      memcgs under pressure; and it should also be effective in the non-memcg
      case, whenever one or another zone LRU is forced too small.
       Signed-off-by: Suleiman Souhlal <suleiman@google.com>
       Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
       Acked-by: Rik van Riel <riel@redhat.com>
       Acked-by: Rafael Aquini <aquini@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Luigi Semenzato <semenzato@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f04f48d
    • mm: softdirty: clear VM_SOFTDIRTY flag inside clear_refs_write() instead of clear_soft_dirty() · c86c97ff
      Cyrill Gorcunov authored
       clear_refs_write() is called earlier than clear_soft_dirty() and it is
       more natural to clear VM_SOFTDIRTY (which belongs to the VMA entry,
       not the PTEs) that early, instead of clearing it way deeper inside the
       call chain.
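
       The VMA-level part then presumably lives in the clear_refs_write()
       loop over the task's VMAs:

        if (type == CLEAR_REFS_SOFT_DIRTY)
                vma->vm_flags &= ~VM_SOFTDIRTY; /* per-VMA bit; PTE bits are still handled deeper down */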
       Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c86c97ff
    • mm: softdirty: don't forget to save file map softdiry bit on unmap · b43790ee
      Cyrill Gorcunov authored
       pte_file_mksoft_dirty operates on an argument passed by value and
       returns the modified result, thus we need to assign @ptfile here,
       otherwise it is a no-op which may lead to loss of the softdirty bit.
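
       The fix presumably boils down to assigning the return value, e.g.:

        /* before (no-op): the soft-dirtied pte is computed and thrown away */
        if (pte_soft_dirty(ptent))
                pte_file_mksoft_dirty(ptfile);

        /* after: keep the result */
        if (pte_soft_dirty(ptent))
                ptfile = pte_file_mksoft_dirty(ptfile);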
       Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hugh Dickins <hughd@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b43790ee
    • mm: softdirty: make freshly remapped file pages being softdirty unconditionally · 0bf07331
      Cyrill Gorcunov authored
      Hugh reported:
      
       | I noticed your soft_dirty work in install_file_pte(): which looked
       | good at first, until I realized that it's propagating the soft_dirty
       | of a pte it's about to zap completely, to the unrelated entry it's
       | about to insert in its place.  Which seems very odd to me.
      
       Indeed this code ends up being a no-op in effect --
       pte_file_mksoft_dirty() operates on a pte_t argument and returns a new
       pte_t which was never used afterwards.  After looking more I think
       what we need is to soft-dirtify all newly remapped file pages because
       they should look like a new mapping to the memory tracker.
       Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
       Reported-by: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0bf07331
    • mm: get rid of __GFP_KMEMCG · 52383431
      Vladimir Davydov authored
      Currently to allocate a page that should be charged to kmemcg (e.g.
      threadinfo), we pass __GFP_KMEMCG flag to the page allocator.  The page
      allocated is then to be freed by free_memcg_kmem_pages.  Apart from
      looking asymmetrical, this also requires intrusion to the general
      allocation path.  So let's introduce separate functions that will
      alloc/free pages charged to kmemcg.
      
       The new functions are called alloc_kmem_pages and free_kmem_pages.
       They should be used when the caller actually would like to use
       kmalloc, but has to fall back to the page allocator because the
       allocation is large.  They only differ from alloc_pages and free_pages
       in that besides allocating or freeing pages they also charge them to
       the kmem resource counter of the current memory cgroup.
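
       Assumed usage of the new helpers (signatures from memory):

        struct page *page;

        page = alloc_kmem_pages(GFP_KERNEL, order);     /* also charges current memcg's kmem counter */
        if (page) {
                void *ptr = page_address(page);
                /* ... use the buffer ... */
                free_kmem_pages((unsigned long)ptr, order);     /* uncharges and frees */
        }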
      
      [sfr@canb.auug.org.au: export kmalloc_order() to modules]
       Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
       Acked-by: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
       Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
       Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52383431
    • sl[au]b: charge slabs to kmemcg explicitly · 5dfb4175
      Vladimir Davydov authored
       We have only a few places where we actually want to charge kmem, so
       instead of intruding into the general page allocation path with
       __GFP_KMEMCG it's better to explicitly charge kmem there.  All kmem
       charges will be easier to follow that way.
      
      This is a step towards removing __GFP_KMEMCG.  It removes __GFP_KMEMCG
       from memcg caches' allocflags.  Instead it makes the slab allocation
       path call memcg_charge_kmem directly, getting the memcg to charge from
       the cache's memcg params.
      
      This also eliminates any possibility of misaccounting an allocation
      going from one memcg's cache to another memcg, because now we always
      charge slabs against the memcg the cache belongs to.  That's why this
      patch removes the big comment to memcg_kmem_get_cache.
       Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
       Acked-by: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
       Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5dfb4175
    • mm: slub: fix ALLOC_SLOWPATH stat · 8eae1492
      Dave Hansen authored
      There used to be only one path out of __slab_alloc(), and ALLOC_SLOWPATH
      got bumped in that exit path.  Now there are two, and a bunch of gotos.
      ALLOC_SLOWPATH can now get set more than once during a single call to
      __slab_alloc() which is pretty bogus.  Here's the sequence:
      
      1. Enter __slab_alloc(), fall through all the way to the
         stat(s, ALLOC_SLOWPATH);
      2. hit 'if (!freelist)', and bump DEACTIVATE_BYPASS, jump to
         new_slab (goto #1)
      3. Hit 'if (c->partial)', bump CPU_PARTIAL_ALLOC, goto redo
         (goto #2)
      4. Fall through in the same path we did before all the way to
         stat(s, ALLOC_SLOWPATH)
      5. bump ALLOC_REFILL stat, then return
      
      Doing this is obviously bogus.  It keeps us from being able to
      accurately compare ALLOC_SLOWPATH vs.  ALLOC_FASTPATH.  It also means
      that the total number of allocs always exceeds the total number of
      frees.
      
      This patch moves stat(s, ALLOC_SLOWPATH) to be called from the same
      place that __slab_alloc() is.  This makes it much less likely that
      ALLOC_SLOWPATH will get botched again in the spaghetti-code inside
      __slab_alloc().
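
       So the call site in slab_alloc_node() presumably ends up shaped like
       this:

        if (unlikely(!object || !node_match(page, node))) {
                object = __slab_alloc(s, gfpflags, node, addr, c);
                stat(s, ALLOC_SLOWPATH);        /* bumped exactly once per slow-path entry */
        } else {
                /* lockless fastpath cmpxchg happens here */
                stat(s, ALLOC_FASTPATH);
        }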
       Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
       Acked-by: Christoph Lameter <cl@linux.com>
       Acked-by: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8eae1492
    • mm, slab: suppress out of memory warning unless debug is enabled · 9a02d699
      David Rientjes authored
      When the slab or slub allocators cannot allocate additional slab pages,
      they emit diagnostic information to the kernel log such as current
      number of slabs, number of objects, active objects, etc.  This is always
      coupled with a page allocation failure warning since it is controlled by
      !__GFP_NOWARN.
      
       Suppress this out of memory warning if the allocator is configured
       without debug support.  The page allocation failure warning will
       indicate it is a failed slab allocation, the order, and the gfp mask,
       so this is only useful to diagnose allocator issues.
      
      Since CONFIG_SLUB_DEBUG is already enabled by default for the slub
      allocator, there is no functional change with this patch.  If debug is
      disabled, however, the warnings are now suppressed.
       Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
       Acked-by: Christoph Lameter <cl@linux.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a02d699
    • mm/slub.c: convert vnsprintf-static to va_format · ecc42fbe
      Fabian Frederick authored
       Inspired by Joe Perches' suggestion in the ntfs logging clean-up.
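
       The usual %pV idiom replaces formatting into a static buffer; roughly:

        struct va_format vaf;
        va_list args;

        va_start(args, fmt);
        vaf.fmt = fmt;
        vaf.va = &args;
        pr_err("BUG: %pV\n", &vaf);     /* %pV prints the wrapped format/args in one go */
        va_end(args);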
       Signed-off-by: Fabian Frederick <fabf@skynet.be>
       Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Pekka Enberg <penberg@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ecc42fbe
    • mm/slub.c: convert printk to pr_foo() · f9f58285
      Fabian Frederick authored
       All printk(KERN_foo ...) calls converted to pr_foo().

       Default printk converted to pr_warn().

       Coalesce format fragments.
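
       Generic example of the conversion pattern (not a literal hunk from
       slub.c):

        printk(KERN_ERR "SLUB: something went wrong\n");        /* before */
        pr_err("SLUB: something went wrong\n");                 /* after  */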
       Signed-off-by: Fabian Frederick <fabf@skynet.be>
       Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Pekka Enberg <penberg@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f9f58285
    • x86, mm: probe memory block size for generic x86 64bit · 982792c7
      Yinghai Lu authored
       On a system with 2TiB of RAM, current x86_64 has 128M as the section
       size, and one memory_block only includes one section.  So there will
       be 16400 entries under /sys/devices/system/memory/.

       Current code tries to use the block id to find the block pointer in
       /sys for any section, and reuse that block pointer.  That lookup will
       take some time even after commit 7c243c71 ("mm: speedup in
       __early_pfn_to_nid") that skips the search in that case during boot.

       So the solution could be to increase the block size just like the SGI
       UV system did (hard coded to 2g).

       This patch is trying to probe the block size to make it match the MMIO
       remap size.  For example, Intel Nehalem and later systems will have
       memory range [0, TOML), [4g, TOMH].  If the memory hole is 2g and
       total is 128g, TOM will be 2g, and TOM2 will be 130g.

       We could use 2g as the block size instead of the default 128M.  That
       will reduce the number of entries in /sys/devices/system/memory/.

       On a 6TiB system this will reduce boot time by 35 seconds.
       Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      982792c7
    • x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels · c46a7c81
      Mel Gorman authored
      _PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting
      faults on x86.  Care is taken such that _PAGE_NUMA is used only in
      situations where the VMA flags distinguish between NUMA hinting faults
       and prot_none faults.  This decision was x86-specific and conceptually
       it is difficult, requiring special casing to distinguish between
       PROTNONE and NUMA ptes based on context.
      
      Fundamentally, we only need the _PAGE_NUMA bit to tell the difference
      between an entry that is really unmapped and a page that is protected
      for NUMA hinting faults as if the PTE is not present then a fault will
      be trapped.
      
      Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset.
      This patch shrinks the maximum possible swap size and uses the bit to
      uniquely distinguish between NUMA hinting ptes and swap ptes.
       Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Steven Noonan <steven@uplinklabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c46a7c81
    • x86: require x86-64 for automatic NUMA balancing · 4468dd76
      Mel Gorman authored
      32-bit support for NUMA is an oddity on its own but with automatic NUMA
      balancing on top there is a reasonable risk that the CPUPID information
       cannot be stored in the page flags.  This patch removes support for
       automatic NUMA balancing on 32-bit x86.
       Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Steven Noonan <steven@uplinklabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4468dd76