- 04 Jun, 2014 40 commits
-
-
Vladimir Davydov authored
When we create a sl[au]b cache, we allocate kmem_cache_node structures for each online NUMA node. To handle nodes taken online/offline, we register memory hotplug notifier and allocate/free kmem_cache_node corresponding to the node that changes its state for each kmem cache. To synchronize between the two paths we hold the slab_mutex during both the cache creationg/destruction path and while tuning per-node parts of kmem caches in memory hotplug handler, but that's not quite right, because it does not guarantee that a newly created cache will have all kmem_cache_nodes initialized in case it races with memory hotplug. For instance, in case of slub: CPU0 CPU1 ---- ---- kmem_cache_create: online_pages: __kmem_cache_create: slab_memory_callback: slab_mem_going_online_callback: lock slab_mutex for each slab_caches list entry allocate kmem_cache node unlock slab_mutex lock slab_mutex init_kmem_cache_nodes: for_each_node_state(node, N_NORMAL_MEMORY) allocate kmem_cache node add kmem_cache to slab_caches list unlock slab_mutex online_pages (continued): node_states_set_node As a result we'll get a kmem cache with not all kmem_cache_nodes allocated. To avoid issues like that we should hold get/put_online_mems() during the whole kmem cache creation/destruction/shrink paths, just like we deal with cpu hotplug. This patch does the trick. Note, that after it's applied, there is no need in taking the slab_mutex for kmem_cache_shrink any more, so it is removed from there. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vladimir Davydov authored
kmem_cache_{create,destroy,shrink} need to get a stable value of cpu/node online mask, because they init/destroy/access per-cpu/node kmem_cache parts, which can be allocated or destroyed on cpu/mem hotplug. To protect against cpu hotplug, these functions use {get,put}_online_cpus. However, they do nothing to synchronize with memory hotplug - taking the slab_mutex does not eliminate the possibility of race as described in patch 2. What we need there is something like get_online_cpus, but for memory. We already have lock_memory_hotplug, which serves for the purpose, but it's a bit of a hammer right now, because it's backed by a mutex. As a result, it imposes some limitations to locking order, which are not desirable, and can't be used just like get_online_cpus. That's why in patch 1 I substitute it with get/put_online_mems, which work exactly like get/put_online_cpus except they block not cpu, but memory hotplug. [ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by myself, because it used an rw semaphore for get/put_online_mems, making them dead lock prune. ] This patch (of 2): {un}lock_memory_hotplug, which is used to synchronize against memory hotplug, is currently backed by a mutex, which makes it a bit of a hammer - threads that only want to get a stable value of online nodes mask won't be able to proceed concurrently. Also, it imposes some strong locking ordering rules on it, which narrows down the set of its usage scenarios. This patch introduces get/put_online_mems, which are the same as get/put_online_cpus, but for memory hotplug, i.e. executing a code inside a get/put_online_mems section will guarantee a stable value of online nodes, present pages, etc. lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vladimir Davydov authored
It is only used in slab and should not be used anywhere else so there is no need in exporting it. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mel Gorman authored
pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed by zone_reclaim due to its distance. As it is expected that zone_reclaim_mode will be rarely enabled it is unreasonable for all machines to take a penalty. Fortunately, the zone_reclaim_mode() path is already slow and it is the path that takes the hit. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mel Gorman authored
When it was introduced, zone_reclaim_mode made sense as NUMA distances punished and workloads were generally partitioned to fit into a NUMA node. NUMA machines are now common but few of the workloads are NUMA-aware and it's routine to see major performance degradation due to zone_reclaim_mode being enabled but relatively few can identify the problem. Those that require zone_reclaim_mode are likely to be able to detect when it needs to be enabled and tune appropriately so lets have a sensible default for the bulk of users. This patch (of 2): zone_reclaim_mode causes processes to prefer reclaiming memory from local node instead of spilling over to other nodes. This made sense initially when NUMA machines were almost exclusively HPC and the workload was partitioned into nodes. The NUMA penalties were sufficiently high to justify reclaiming the memory. On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default. Users that are sophisticated enough to know they need zone_reclaim_mode will detect it. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Luiz Capitulino authored
HugeTLB is limited to allocating hugepages whose size are less than MAX_ORDER order. This is so because HugeTLB allocates hugepages via the buddy allocator. Gigantic pages (that is, pages whose size is greater than MAX_ORDER order) have to be allocated at boottime. However, boottime allocation has at least two serious problems. First, it doesn't support NUMA and second, gigantic pages allocated at boottime can't be freed. This commit solves both issues by adding support for allocating gigantic pages during runtime. It works just like regular sized hugepages, meaning that the interface in sysfs is the same, it supports NUMA, and gigantic pages can be freed. For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G gigantic pages on node 1, one can do: # echo 2 > \ /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages And to free them all: # echo 0 > \ /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages The one problem with gigantic page allocation at runtime is that it can't be serviced by the buddy allocator. To overcome that problem, this commit scans all zones from a node looking for a large enough contiguous region. When one is found, it's allocated by using CMA, that is, we call alloc_contig_range() to do the actual allocation. For example, on x86_64 we scan all zones looking for a 1GB contiguous region. When one is found, it's allocated by alloc_contig_range(). One expected issue with that approach is that such gigantic contiguous regions tend to vanish as runtime goes by. The best way to avoid this for now is to make gigantic page allocations very early during system boot, say from a init script. Other possible optimization include using compaction, which is supported by CMA but is not explicitly used by this commit. It's also important to note the following: 1. Gigantic pages allocated at boottime by the hugepages= command-line option can be freed at runtime just fine 2. This commit adds support for gigantic pages only to x86_64. The reason is that I don't have access to nor experience with other archs. The code is arch indepedent though, so it should be simple to add support to different archs 3. I didn't add support for hugepage overcommit, that is allocating a gigantic page on demand when /proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't think it's reasonable to do the hard and long work required for allocating a gigantic page at fault time. But it should be simple to add this if wanted [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Luiz Capitulino authored
Next commit will add new code which will want to call for_each_node_mask_to_alloc() macro. Move it, its buddy for_each_node_mask_to_free() and their dependencies up in the file so the new code can use them. This is just code movement, no logic change. Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Luiz Capitulino authored
Hugepages pages never get the PG_reserved bit set, so don't clear it. However, note that if the bit gets mistakenly set free_pages_check() will catch it. Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Luiz Capitulino authored
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Luiz Capitulino authored
The HugeTLB subsystem uses the buddy allocator to allocate hugepages during runtime. This means that hugepages allocation during runtime is limited to MAX_ORDER order. For archs supporting gigantic pages (that is, page sizes greater than MAX_ORDER), this in turn means that those pages can't be allocated at runtime. HugeTLB supports gigantic page allocation during boottime, via the boot allocator. To this end the kernel provides the command-line options hugepagesz= and hugepages=, which can be used to instruct the kernel to allocate N gigantic pages during boot. For example, x86_64 supports 2M and 1G hugepages, but only 2M hugepages can be allocated and freed at runtime. If one wants to allocate 1G gigantic pages, this has to be done at boot via the hugepagesz= and hugepages= command-line options. Now, gigantic page allocation at boottime has two serious problems: 1. Boottime allocation is not NUMA aware. On a NUMA machine the kernel evenly distributes boottime allocated hugepages among nodes. For example, suppose you have a four-node NUMA machine and want to allocate four 1G gigantic pages at boottime. The kernel will allocate one gigantic page per node. On the other hand, we do have users who want to be able to specify which NUMA node gigantic pages should allocated from. So that they can place virtual machines on a specific NUMA node. 2. Gigantic pages allocated at boottime can't be freed At this point it's important to observe that regular hugepages allocated at runtime don't have those problems. This is so because HugeTLB interface for runtime allocation in sysfs supports NUMA and runtime allocated pages can be freed just fine via the buddy allocator. This series adds support for allocating gigantic pages at runtime. It does so by allocating gigantic pages via CMA instead of the buddy allocator. Releasing gigantic pages is also supported via CMA. As this series builds on top of the existing HugeTLB interface, it makes gigantic page allocation and releasing just like regular sized hugepages. This also means that NUMA support just works. For example, to allocate two 1G gigantic pages on node 1, one can do: # echo 2 > \ /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages And, to release all gigantic pages on the same node: # echo 0 > \ /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages Please, refer to patch 5/5 for full technical details. Finally, please note that this series is a follow up for a previous series that tried to extend the command-line options set to be NUMA aware: http://marc.info/?l=linux-mm&m=139593335312191&w=2 During the discussion of that series it was agreed that having runtime allocation support for gigantic pages was a better solution. This patch (of 5): This function is going to be used by non-init code in a future commit. Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: David Rientjes <rientjes@google.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Duan Jiong authored
Fix a coccinelle error regarding usage of IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vladimir Davydov authored
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Li Zhong authored
Seems we all agree that information about SECTION, e.g. section size, sections per memory block should be kept as kernel internals, and not exposed to userspace. This patch updates Documentation/memory-hotplug.txt to refer to memory blocks instead of memory sections where appropriate and added a paragraph to explain that memory blocks are made of memory sections. The documentation update is mostly provided by Nathan. Also, as end_phys_index in code is actually not the end section id, but the end memory block id, which should always be the same as phys_index. So it is removed here. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Dave Hansen authored
I recently added a patch to let folks pass a "reason" string dump_page() which gets dumped out along with the page's data. This essentially saves the bug-reader a trip in to the source to figure out why we BUG_ON()'d. The new VM_BUG_ON_PAGE() passes in NULL for "reason". It seems like we might as well pass the BUG_ON() condition if we have it. This will bloat kernels a bit with ~160 new strings, but this is all under a debugging option anyway. page:ffffea0008560280 count:1 mapcount:0 mapping:(null) index:0x0 page flags: 0xbfffc0000000001(locked) page dumped because: VM_BUG_ON_PAGE(PageLocked(page)) ------------[ cut here ]------------ kernel BUG at /home/davehans/linux.git/mm/filemap.c:464! invalid opcode: 0000 [#1] SMP CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.14.0+ #251 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 ... [akpm@linux-foundation.org: include stringify.h] Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Johannes Weiner authored
Per-memcg swappiness and oom killing can currently not be tweaked on a memcg that is part of a hierarchy, but not the root of that hierarchy. Users have complained that they can't configure this when they turned on hierarchy mode. In fact, with hierarchy mode becoming the default, this restriction disables the tunables entirely. But there is no good reason for this restriction. The settings for swappiness and OOM killing are taken from whatever memcg whose limit triggered reclaim and OOM invocation, regardless of its position in the hierarchy tree. Allow setting swappiness on any group. The knob on the root memcg already reads the global VM swappiness, make it writable as well. Allow disabling the OOM killer on any non-root memcg. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Sebastian Ott authored
Memory obtained via mempool_alloc is not always zeroed even when called with __GFP_ZERO. Add a note and VM_BUG_ON statement to make that clear. [akpm@linux-foundation.org: use VM_WARN_ON_ONCE] Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Andrew Morton authored
WARN_ON() and WARN_ON_ONCE(), dependent on CONFIG_DEBUG_VM Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Andrew Morton authored
It was using a mix of pr_foo() and printk(KERN_ERR ...). Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Kirill A. Shutemov authored
It doesn't make sense to have two assert checks for each invariant: one for printing and one for BUG(). Let's trigger BUG() if we print error message. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Akinobu Mita authored
dma_generic_alloc_coherent() firstly attempts to allocate by dma_alloc_from_contiguous() if CONFIG_DMA_CMA is enabled. But the memory region allocated by it may not fit within the device's DMA mask. This change makes it fall back to usual alloc_pages_node() allocation for such cases. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Don Dutile <ddutile@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Akinobu Mita authored
Currently, "cma=" kernel parameter is used to specify the size of CMA, but we can't specify where it is located. We want to locate CMA below 4GB for devices only supporting 32-bit addressing on 64-bit systems without iommu. This enables to specify the placement of CMA by extending "cma=" kernel parameter. Examples: 1. locate 64MB CMA below 4GB by "cma=64M@0-4G" 2. locate 64MB CMA exact at 512MB by "cma=64M@512M" Note that the DMA contiguous memory allocator on x86 assumes that page_address() works for the pages to allocate. So this change requires to limit end address of contiguous memory area upto max_pfn_mapped to prevent from locating it on highmem area by the argument of dma_contiguous_reserve(). Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Don Dutile <ddutile@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Akinobu Mita authored
This introduces memblock_alloc_range() which allocates memblock from the specified range of physical address. I would like to use this function to specify the location of CMA. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Don Dutile <ddutile@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Akinobu Mita authored
This adds support for the DMA Contiguous Memory Allocator for intel-iommu. This change enables dma_alloc_coherent() to allocate big contiguous memory. It is achieved in the same way as nommu_dma_ops currently does, i.e. trying to allocate memory by dma_alloc_from_contiguous() and alloc_pages() is used as a fallback. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Don Dutile <ddutile@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Akinobu Mita authored
The DMA Contiguous Memory Allocator support on x86 is disabled when swiotlb config option is enabled. So DMA CMA is always disabled on x86_64 because swiotlb is always enabled. This attempts to support for DMA CMA with enabling swiotlb config option. The contiguous memory allocator on x86 is integrated in the function dma_generic_alloc_coherent() which is .alloc callback in nommu_dma_ops for dma_alloc_coherent(). x86_swiotlb_alloc_coherent() which is .alloc callback in swiotlb_dma_ops tries to allocate with dma_generic_alloc_coherent() firstly and then swiotlb_alloc_coherent() is called as a fallback. The main part of supporting DMA CMA with swiotlb is that changing x86_swiotlb_free_coherent() which is .free callback in swiotlb_dma_ops for dma_free_coherent() so that it can distinguish memory allocated by dma_generic_alloc_coherent() from one allocated by swiotlb_alloc_coherent() and release it with dma_generic_free_coherent() which can handle contiguous memory. This change requires making is_swiotlb_buffer() global function. This also needs to change .free callback in the dma_map_ops for amd_gart and sta2x11, because these dma_ops are also using dma_generic_alloc_coherent(). Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Don Dutile <ddutile@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Akinobu Mita authored
This patchset enhances the DMA Contiguous Memory Allocator on x86. Currently the DMA CMA is only supported with pci-nommu dma_map_ops and furthermore it can't be enabled on x86_64. But I would like to allocate big contiguous memory with dma_alloc_coherent() and tell it to the device that requires it, regardless of which dma mapping implementation is actually used in the system. So this makes it work with swiotlb and intel-iommu dma_map_ops, too. And this also extends "cma=" kernel parameter to specify placement constraint by the physical address range of memory allocations. For example, CMA allocates memory below 4GB by "cma=64M@0-4G", it is required for the devices only supporting 32-bit addressing on 64-bit systems without iommu. This patch (of 5): Calling dma_alloc_coherent() with __GFP_ZERO must return zeroed memory. But when the contiguous memory allocator (CMA) is enabled on x86 and the memory region is allocated by dma_alloc_from_contiguous(), it doesn't return zeroed memory. Because dma_generic_alloc_coherent() forgot to fill the memory region with zero if it was allocated by dma_alloc_from_contiguous() Most implementations of dma_alloc_coherent() return zeroed memory regardless of whether __GFP_ZERO is specified. So this fixes it by unconditionally zeroing the allocated memory region. Alternatively, we could fix dma_alloc_from_contiguous() to return zeroed out memory and remove memset() from all caller of it. But we can't simply remove the memset on arm because __dma_clear_buffer() is used there for ensuring cache flushing and it is used in many places. Of course we can do redundant memset in dma_alloc_from_contiguous(), but I think this patch is less impact for fixing this problem. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Don Dutile <ddutile@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Davidlohr Bueso authored
For single threaded workloads, we can avoid flushing and iterating through the entire list of tasks, making the whole function a lot faster, requiring only a single atomic read for the mm_users. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Davidlohr Bueso authored
Introduce a CONFIG_DEBUG_VM_VMACACHE option to enable counting the cache hit rate -- exported in /proc/vmstat. Any updates to the caching scheme needs this kind of data, thus it can save some work re-implementing the counting all the time. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Suleiman Souhlal authored
Prior to this change, we would decide whether to force scan a LRU during reclaim if that LRU itself was too small for the current priority. However, this can lead to the file LRU getting force scanned even if there are a lot of anonymous pages we can reclaim, leading to hot file pages getting needlessly reclaimed. To address this, we instead only force scan when none of the reclaimable LRUs are big enough. Gives huge improvements with zswap. For example, when doing -j20 kernel build in a 500MB container with zswap enabled, runtime (in seconds) is greatly reduced: x without this change + with this change N Min Max Median Avg Stddev x 5 700.997 790.076 763.928 754.05 39.59493 + 5 141.634 197.899 155.706 161.9 21.270224 Difference at 95.0% confidence -592.15 +/- 46.3521 -78.5293% +/- 6.14709% (Student's t, pooled s = 31.7819) Should also give some improvements in regular (non-zswap) swap cases. Yes, hughd found significant speedup using regular swap, with several memcgs under pressure; and it should also be effective in the non-memcg case, whenever one or another zone LRU is forced too small. Signed-off-by: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Bob Liu <bob.liu@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Luigi Semenzato <semenzato@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Cyrill Gorcunov authored
clear_refs_write() is called earlier than clear_soft_dirty() and it is more natural to clear VM_SOFTDIRTY (which belongs to VMA entry but not PTEs) that early instead of clearing it a way deeper inside call chain. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Cyrill Gorcunov authored
pte_file_mksoft_dirty operates with argument passed by a value and returns modified result thus we need to assign @ptfile here, otherwise itis a no-op which may lead to loss of the softdirty bit. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Cyrill Gorcunov authored
Hugh reported: | I noticed your soft_dirty work in install_file_pte(): which looked | good at first, until I realized that it's propagating the soft_dirty | of a pte it's about to zap completely, to the unrelated entry it's | about to insert in its place. Which seems very odd to me. Indeed this code ends up being nop in result -- pte_file_mksoft_dirty() operates with pte_t argument and returns new pte_t which were never used after. After looking more I think what we need is to soft-dirtify all newely remapped file pages because it should look like a new mapping for memory tracker. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Reported-by: Hugh Dickins <hughd@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vladimir Davydov authored
Currently to allocate a page that should be charged to kmemcg (e.g. threadinfo), we pass __GFP_KMEMCG flag to the page allocator. The page allocated is then to be freed by free_memcg_kmem_pages. Apart from looking asymmetrical, this also requires intrusion to the general allocation path. So let's introduce separate functions that will alloc/free pages charged to kmemcg. The new functions are called alloc_kmem_pages and free_kmem_pages. They should be used when the caller actually would like to use kmalloc, but has to fall back to the page allocator for the allocation is large. They only differ from alloc_pages and free_pages in that besides allocating or freeing pages they also charge them to the kmem resource counter of the current memory cgroup. [sfr@canb.auug.org.au: export kmalloc_order() to modules] Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vladimir Davydov authored
We have only a few places where we actually want to charge kmem so instead of intruding into the general page allocation path with __GFP_KMEMCG it's better to explictly charge kmem there. All kmem charges will be easier to follow that way. This is a step towards removing __GFP_KMEMCG. It removes __GFP_KMEMCG from memcg caches' allocflags. Instead it makes slab allocation path call memcg_charge_kmem directly getting memcg to charge from the cache's memcg params. This also eliminates any possibility of misaccounting an allocation going from one memcg's cache to another memcg, because now we always charge slabs against the memcg the cache belongs to. That's why this patch removes the big comment to memcg_kmem_get_cache. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Dave Hansen authored
There used to be only one path out of __slab_alloc(), and ALLOC_SLOWPATH got bumped in that exit path. Now there are two, and a bunch of gotos. ALLOC_SLOWPATH can now get set more than once during a single call to __slab_alloc() which is pretty bogus. Here's the sequence: 1. Enter __slab_alloc(), fall through all the way to the stat(s, ALLOC_SLOWPATH); 2. hit 'if (!freelist)', and bump DEACTIVATE_BYPASS, jump to new_slab (goto #1) 3. Hit 'if (c->partial)', bump CPU_PARTIAL_ALLOC, goto redo (goto #2) 4. Fall through in the same path we did before all the way to stat(s, ALLOC_SLOWPATH) 5. bump ALLOC_REFILL stat, then return Doing this is obviously bogus. It keeps us from being able to accurately compare ALLOC_SLOWPATH vs. ALLOC_FASTPATH. It also means that the total number of allocs always exceeds the total number of frees. This patch moves stat(s, ALLOC_SLOWPATH) to be called from the same place that __slab_alloc() is. This makes it much less likely that ALLOC_SLOWPATH will get botched again in the spaghetti-code inside __slab_alloc(). Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
David Rientjes authored
When the slab or slub allocators cannot allocate additional slab pages, they emit diagnostic information to the kernel log such as current number of slabs, number of objects, active objects, etc. This is always coupled with a page allocation failure warning since it is controlled by !__GFP_NOWARN. Suppress this out of memory warning if the allocator is configured without debug supported. The page allocation failure warning will indicate it is a failed slab allocation, the order, and the gfp mask, so this is only useful to diagnose allocator issues. Since CONFIG_SLUB_DEBUG is already enabled by default for the slub allocator, there is no functional change with this patch. If debug is disabled, however, the warnings are now suppressed. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Fabian Frederick authored
Inspired by Joe Perches suggestion in ntfs logging clean-up. Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Christoph Lameter <cl@linux.com> Cc: Joe Perches <joe@perches.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Fabian Frederick authored
All printk(KERN_foo converted to pr_foo() Default printk converted to pr_warn() Coalesce format fragments Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Christoph Lameter <cl@linux.com> Cc: Joe Perches <joe@perches.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Yinghai Lu authored
On system with 2TiB ram, current x86_64 have 128M as section size, and one memory_block only include one section. So will have 16400 entries under /sys/devices/system/memory/. Current code try to use block id to find block pointer in /sys for any section, and reuse that block pointer. that finding will take some time even after commit 7c243c71 ("mm: speedup in __early_pfn_to_nid") that will skip the search in that case during booting up. So solution could be increase block size just like SGI UV system did. (harded code to 2g). This patch is trying to probe the block size to make it match mmio remap size. for example, Intel Nehalem later system will have memory range [0, TOML), [4g, TOMH]. If the memory hole is 2g and total is 128g, TOM will be 2g, and TOM2 will be 130g. We could use 2g as block size instead of default 128M. That will reduce number of entries in /sys/devices/system/memory/ On system 6TiB system will reduce boot time by 35 seconds. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mel Gorman authored
_PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting faults on x86. Care is taken such that _PAGE_NUMA is used only in situations where the VMA flags distinguish between NUMA hinting faults and prot_none faults. This decision was x86-specific and conceptually it is difficult requiring special casing to distinguish between PROTNONE and NUMA ptes based on context. Fundamentally, we only need the _PAGE_NUMA bit to tell the difference between an entry that is really unmapped and a page that is protected for NUMA hinting faults as if the PTE is not present then a fault will be trapped. Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset. This patch shrinks the maximum possible swap size and uses the bit to uniquely distinguish between NUMA hinting ptes and swap ptes. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Anvin <hpa@zytor.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steven Noonan <steven@uplinklabs.net> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mel Gorman authored
32-bit support for NUMA is an oddity on its own but with automatic NUMA balancing on top there is a reasonable risk that the CPUPID information cannot be stored in the page flags. This patch removes support for automatic NUMA support on 32-bit x86. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Anvin <hpa@zytor.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steven Noonan <steven@uplinklabs.net> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-