- 25 Feb, 2017 40 commits
-
-
Alexey Dobriyan authored
After staring at this code for a while I've figured using small 2-entry array describing ARGV and ENVP is the way to address code duplication critique. Link: http://lkml.kernel.org/r/20170105185724.GA12027@avx2Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Geliang Tang authored
To make the code clearer, use rb_entry() instead of container_of() to deal with rbtree. Link: http://lkml.kernel.org/r/4fd1f82818665705ce75c5156a060ae7caa8e0a9.1482160150.git.geliangtang@gmail.comSigned-off-by: Geliang Tang <geliangtang@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Juergen Gross <jgross@suse.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Davidlohr Bueso authored
Given that the arch does not add its own implementations, simply use the asm-generic/current.h (generic-y) header instead of duplicating code. Link: http://lkml.kernel.org/r/1485992878-4780-2-git-send-email-dave@stgolabs.netSigned-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Sudip Mukherjee authored
The build of frv defconfig gives warning: arch/frv/mb93090-mb00/pci-frv.c:176:5: warning: ignoring return value of 'pci_assign_resource', declared with attribute warn_unused_result Just print an error message to silence the warning. We can not do much here on error. Link: http://lkml.kernel.org/r/1484256471-5379-1-git-send-email-sudipm.mukherjee@gmail.comSigned-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Greg Thelen authored
Make a kasan test which uses a SLAB_ACCOUNT slab cache. If the test is run within a non default memcg, then it uncovers the bug fixed by "kasan: drain quarantine of memcg slab objects"[1]. If run without fix [1] it shows "Slab cache still has objects", and the kmem_cache structure is leaked. Here's an unpatched kernel test: $ dmesg -c > /dev/null $ mkdir /sys/fs/cgroup/memory/test $ echo $$ > /sys/fs/cgroup/memory/test/tasks $ modprobe test_kasan 2> /dev/null $ dmesg | grep -B1 still [ 123.456789] kasan test: memcg_accounted_kmem_cache allocate memcg accounted object [ 124.456789] kmem_cache_destroy test_cache: Slab cache still has objects Kernels with fix [1] don't have the "Slab cache still has objects" warning or the underlying leak. The new test runs and passes in the default (root) memcg, though in the root memcg it won't uncover the problem fixed by [1]. Link: http://lkml.kernel.org/r/1482257462-36948-2-git-send-email-gthelen@google.comSigned-off-by: Greg Thelen <gthelen@google.com> Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Greg Thelen authored
Per memcg slab accounting and kasan have a problem with kmem_cache destruction. - kmem_cache_create() allocates a kmem_cache, which is used for allocations from processes running in root (top) memcg. - Processes running in non root memcg and allocating with either __GFP_ACCOUNT or from a SLAB_ACCOUNT cache use a per memcg kmem_cache. - Kasan catches use-after-free by having kfree() and kmem_cache_free() defer freeing of objects. Objects are placed in a quarantine. - kmem_cache_destroy() destroys root and non root kmem_caches. It takes care to drain the quarantine of objects from the root memcg's kmem_cache, but ignores objects associated with non root memcg. This causes leaks because quarantined per memcg objects refer to per memcg kmem cache being destroyed. To see the problem: 1) create a slab cache with kmem_cache_create(,,,SLAB_ACCOUNT,) 2) from non root memcg, allocate and free a few objects from cache 3) dispose of the cache with kmem_cache_destroy() kmem_cache_destroy() will trigger a "Slab cache still has objects" warning indicating that the per memcg kmem_cache structure was leaked. Fix the leak by draining kasan quarantined objects allocated from non root memcg. Racing memcg deletion is tricky, but handled. kmem_cache_destroy() => shutdown_memcg_caches() => __shutdown_memcg_cache() => shutdown_cache() flushes per memcg quarantined objects, even if that memcg has been rmdir'd and gone through memcg_deactivate_kmem_caches(). This leak only affects destroyed SLAB_ACCOUNT kmem caches when kasan is enabled. So I don't think it's worth patching stable kernels. Link: http://lkml.kernel.org/r/1482257462-36948-1-git-send-email-gthelen@google.comSigned-off-by: Greg Thelen <gthelen@google.com> Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Nathan Fontenot authored
Commit 31bc3858 ("add automatic onlining policy for the newly added memory") provides the capability to have added memory automatically onlined during add, but this appears to be slightly broken. The current implementation uses walk_memory_range() to call online_memory_block, which uses memory_block_change_state() to online the memory. Instead, we should be calling device_online() for the memory block in online_memory_block(). This would online the memory (the memory bus online routine memory_subsys_online() called from device_online calls memory_block_change_state()) and properly update the device struct offline flag. As a result of the current implementation, attempting to remove a memory block after adding it using auto online fails. This is because doing a remove, for instance echo offline > /sys/devices/system/memory/memoryXXX/state uses device_offline() which checks the dev->offline flag. Link: http://lkml.kernel.org/r/20170222220744.8119.19687.stgit@ltcalpine2-lp14.aus.stglabs.ibm.comSigned-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michael Roth <mdroth@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Minchan Kim authored
With rw_page, page_endio is used for completing IO on a page and it propagates write error to the address space if the IO fails. The problem is it accesses page->mapping directly which might be okay for file-backed pages but it shouldn't for anonymous page. Otherwise, it can corrupt one of field from anon_vma under us and system goes panic randomly. swap_writepage bdev_writepage ops->rw_page I encountered the BUG during developing new zram feature and it was really hard to figure it out because it made random crash, somtime mmap_sem lockdep, sometime other places where places never related to zram/zsmalloc, and not reproducible with some configuration. When I consider how that bug is subtle and people do fast-swap test with brd, it's worth to add stable mark, I think. Fixes: dd6bd0d9 ("swap: use bdev_read_page() / bdev_write_page()") Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Aneesh Kumar K.V authored
We are using the wrong flag value in task_numa_falt function. This can result in us doing wrong numa fault statistics update, because we update num_pages_migrate and numa_fault_locality etc based on the flag argument passed. Fixes: bae473a4 ("mm: introduce fault_env") Link: http://lkml.kernel.org/r/1487498395-9544-1-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Aneesh Kumar K.V authored
Do the prot_none/FOLL_NUMA check after we are sure this is a THP pte. Archs can implement prot_none such that it can return true for regular pmd entries. Link: http://lkml.kernel.org/r/1487498326-8734-1-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Miles Chen authored
cleanup rest of dma_addr_t and phys_addr_t type casting in mm use %pad for dma_addr_t use %pa for phys_addr_t Link: http://lkml.kernel.org/r/1486618489-13912-1-git-send-email-miles.chen@mediatek.comSigned-off-by: Miles Chen <miles.chen@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Yisheng Xie authored
The class index and fullness group are not encoded in (first)page->mapping any more, after commit 3783689a ("zsmalloc: introduce zspage structure"). Instead, they are store in struct zspage. Just delete this unneeded comment. Link: http://lkml.kernel.org/r/1486620822-36826-1-git-send-email-xieyisheng1@huawei.comSigned-off-by: Yisheng Xie <xieyisheng1@huawei.com> Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wei Yang authored
arch_zone_lowest/highest_possible_pfn[] is set to 0 and [ZONE_MOVABLE] is skipped in the loop. No need to reset them to 0 again. This patch just removes the redundant code. Link: http://lkml.kernel.org/r/20170209141731.60208-1-richard.weiyang@gmail.comSigned-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Yisheng Xie authored
We had used page->lru to link the component pages (except the first page) of a zspage, and used INIT_LIST_HEAD(&page->lru) to init it. Therefore, to get the last page's next page, which is NULL, we had to use page flag PG_Private_2 to identify it. But now, we use page->freelist to link all of the pages in zspage and init the page->freelist as NULL for last page, so no need to use PG_Private_2 anymore. This remove redundant SetPagePrivate2 in create_page_chain and ClearPagePrivate2 in reset_page(). Save a few cycles for migration of zsmalloc page :) Link: http://lkml.kernel.org/r/1487076509-49270-1-git-send-email-xieyisheng1@huawei.comSigned-off-by: Yisheng Xie <xieyisheng1@huawei.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vinayak Menon authored
At the end of a window period, if the reclaimed pages is greater than scanned, an unsigned underflow can result in a huge pressure value and thus a critical event. Reclaimed pages is found to go higher than scanned because of the addition of reclaimed slab pages to reclaimed in shrink_node without a corresponding increment to scanned pages. Minchan Kim mentioned that this can also happen in the case of a THP page where the scanned is 1 and reclaimed could be 512. Link: http://lkml.kernel.org/r/1486641577-11685-1-git-send-email-vinmenon@codeaurora.orgSigned-off-by: Vinayak Menon <vinmenon@codeaurora.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Shiraz Hashim <shashim@codeaurora.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Hugh Dickins authored
Remove the prototypes for shmem_mapping() and shmem_zero_setup() from linux/mm.h, since they are already provided in linux/shmem_fs.h. But shmem_fs.h must then provide the inline stub for shmem_mapping() when CONFIG_SHMEM is not set, and a few more cfiles now need to #include it. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1702081658250.1549@eggly.anvilsSigned-off-by: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Gavin Shan authored
When @node_reclaim_node isn't 0, the page allocator tries to reclaim pages if the amount of free memory in the zones are below the low watermark. On Power platform, none of NUMA nodes are scanned for page reclaim because no nodes match the condition in zone_allows_reclaim(). On Power platform, RECLAIM_DISTANCE is set to 10 which is the distance of Node-A to Node-A. So the preferred node even won't be scanned for page reclaim. __alloc_pages_nodemask() get_page_from_freelist() zone_allows_reclaim() Anton proposed the test code as below: # cat alloc.c : int main(int argc, char *argv[]) { void *p; unsigned long size; unsigned long start, end; start = time(NULL); size = strtoul(argv[1], NULL, 0); printf("To allocate %ldGB memory\n", size); size <<= 30; p = malloc(size); assert(p); memset(p, 0, size); end = time(NULL); printf("Used time: %ld seconds\n", end - start); sleep(3600); return 0; } The system I use for testing has two NUMA nodes. Both have 128GB memory. In below scnario, the page caches on node#0 should be reclaimed when it encounters pressure to accommodate request of allocation. # echo 2 > /proc/sys/vm/zone_reclaim_mode; \ sync; \ echo 3 > /proc/sys/vm/drop_caches; \ # taskset -c 0 cat file.32G > /dev/null; \ grep FilePages /sys/devices/system/node/node0/meminfo Node 0 FilePages: 33619712 kB # taskset -c 0 ./alloc 128 # grep FilePages /sys/devices/system/node/node0/meminfo Node 0 FilePages: 33619840 kB # grep MemFree /sys/devices/system/node/node0/meminfo Node 0 MemFree: 186816 kB With the patch applied, the pagecache on node-0 is reclaimed when its free memory is running out. It's the expected behaviour. # echo 2 > /proc/sys/vm/zone_reclaim_mode; \ sync; \ echo 3 > /proc/sys/vm/drop_caches # taskset -c 0 cat file.32G > /dev/null; \ grep FilePages /sys/devices/system/node/node0/meminfo Node 0 FilePages: 33605568 kB # taskset -c 0 ./alloc 128 # grep FilePages /sys/devices/system/node/node0/meminfo Node 0 FilePages: 1379520 kB # grep MemFree /sys/devices/system/node/node0/meminfo Node 0 MemFree: 317120 kB Fixes: 5f7a75ac ("mm: page_alloc: do not cache reclaim distances") Link: http://lkml.kernel.org/r/1486532455-29613-1-git-send-email-gwshan@linux.vnet.ibm.comSigned-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Anton Blanchard <anton@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: <stable@vger.kernel.org> [3.16+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
zhong jiang authored
When mainline introduced commit a96dfddb ("base/memory, hotplug: fix a kernel oops in show_valid_zones()"), it obtained the valid start and end pfn from the given pfn range. The valid start pfn can fix the actual issue, but it introduced another issue. The valid end pfn will may exceed the given end_pfn. Although the incorrect overflow will not result in actual problem at present, but I think it need to be fixed. [toshi.kani@hpe.com: remove assumption that end_pfn is aligned by MAX_ORDER_NR_PAGES] Fixes: a96dfddb ("base/memory, hotplug: fix a kernel oops in show_valid_zones()") Link: http://lkml.kernel.org/r/1486467299-22648-1-git-send-email-zhongjiang@huawei.comSigned-off-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
zhouxianrong authored
The idea is that without doing more calculations we extend zero pages to same element pages for zram. zero page is special case of same element page with zero element. 1. the test is done under android 7.0 2. startup too many applications circularly 3. sample the zero pages, same pages (none-zero element) and total pages in function page_zero_filled the result is listed as below: ZERO SAME TOTAL 36214 17842 598196 ZERO/TOTAL SAME/TOTAL (ZERO+SAME)/TOTAL ZERO/SAME AVERAGE 0.060631909 0.024990816 0.085622726 2.663825038 STDEV 0.00674612 0.005887625 0.009707034 2.115881328 MAX 0.069698422 0.030046087 0.094975336 7.56043956 MIN 0.03959586 0.007332205 0.056055193 1.928985507 from the above data, the benefit is about 2.5% and up to 3% of total swapout pages. The defect of the patch is that when we recovery a page from non-zero element the operations are low efficient for partial read. This patch extends zero_page to same_page so if there is any user to have monitored zero_pages, he will be surprised if the number is increased but it's not harmful, I believe. [minchan@kernel.org: do not free same element pages in zram_meta_free] Link: http://lkml.kernel.org/r/20170207065741.GA2567@bbox Link: http://lkml.kernel.org/r/1483692145-75357-1-git-send-email-zhouxianrong@huawei.com Link: http://lkml.kernel.org/r/1486307804-27903-1-git-send-email-minchan@kernel.orgSigned-off-by: zhouxianrong <zhouxianrong@huawei.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Steven Rostedt (VMware) authored
The likely/unlikely profiler noticed that the unlikely statement in wb_domain_writeout_inc() is constantly wrong. This is due to the "not" (!) being outside the unlikely statement. It is likely that dom->period_time will be set, but unlikely that it wont be. Move the not into the unlikely statement. Link: http://lkml.kernel.org/r/20170206120035.3c2e2b91@gandalf.local.homeSigned-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Aneesh Kumar K.V authored
With this our protnone becomes a present pte with READ/WRITE/EXEC bit cleared. By default we also set _PAGE_PRIVILEGED on such pte. This is now used to help us identify a protnone pte that as saved write bit. For such pte, we will clear the _PAGE_PRIVILEGED bit. The pte still remain non-accessible from both user and kernel. [aneesh.kumar@linux.vnet.ibm.com: v3] Link: http://lkml.kernel.org/r/1487498625-10891-4-git-send-email-aneesh.kumar@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1487050314-3892-3-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Michael Neuling <mikey@neuling.org> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <michaele@au1.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Aneesh Kumar K.V authored
Without this KSM will consider the page write protected, but a numa fault can later mark the page writable. This can result in memory corruption. Link: http://lkml.kernel.org/r/1487498625-10891-3-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Aneesh Kumar K.V authored
Patch series "Numabalancing preserve write fix", v2. This patch series address an issue w.r.t THP migration and autonuma preserve write feature. migrate_misplaced_transhuge_page() cannot deal with concurrent modification of the page. It does a page copy without following the migration pte sequence. IIUC, this was done to keep the migration simpler and at the time of implemenation we didn't had THP page cache which would have required a more elaborate migration scheme. That means thp autonuma migration expect the protnone with saved write to be done such that both kernel and user cannot update the page content. This patch series enables archs like ppc64 to do that. We are good with the hash translation mode with the current code, because we never create a hardware page table entry for a protnone pte. This patch (of 2): Autonuma preserves the write permission across numa fault to avoid taking a writefault after a numa fault (Commit: b191f9b1 " mm: numa: preserve PTE write permissions across a NUMA hinting fault"). Architecture can implement protnone in different ways and some may choose to implement that by clearing Read/ Write/Exec bit of pte. Setting the write bit on such pte can result in wrong behaviour. Fix this up by allowing arch to override how to save the write bit on a protnone pte. [aneesh.kumar@linux.vnet.ibm.com: don't mark pte saved write in case of dirty_accountable] Link: http://lkml.kernel.org/r/1487942884-16517-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com [aneesh.kumar@linux.vnet.ibm.com: v3] Link: http://lkml.kernel.org/r/1487498625-10891-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1487050314-3892-2-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Michael Neuling <mikey@neuling.org> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <michaele@au1.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Aneesh Kumar K.V authored
Architectures like ppc64, use privilege access bit to mark pte non accessible. This implies that kernel can do a copy_to_user to an address marked for numa fault. This also implies that there can be a parallel hardware update for the pte. set_pte_at cannot be used in such scenarios. Hence switch the pte update to use ptep_get_and_clear and set_pte_at combination. [akpm@linux-foundation.org: remove unwanted ppc change, per Aneesh] Link: http://lkml.kernel.org/r/1486400776-28114-1-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Steven Rostedt (VMware) authored
Running my likely/unlikely profiler, I discovered that the test in shmem_write_begin() that tests for info->seals as unlikely, is always incorrect. This is because shmem_get_inode() sets info->seals to have F_SEAL_SEAL set by default, and it is unlikely to be cleared when shmem_write_begin() is called. Thus, the if statement is very likely. But as the if statement block only cares about F_SEAL_WRITE and F_SEAL_GROW, change the test to only test those two bits. Link: http://lkml.kernel.org/r/20170203105656.7aec6237@gandalf.local.homeSigned-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Hugh Dickins <hughd@google.com> Cc: David Herrmann <dh.herrmann@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mel Gorman authored
Hillf Danton pointed out that since commit 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of nodes") that PGDAT_WRITEBACK is no longer cleared. It was not noticed as triggering it requires pages under writeback to cycle twice through the LRU and before kswapd gets stalled. Historically, such issues tended to occur on small machines writing heavily to slow storage such as a USB stick. Once kswapd stalls, direct reclaim stalls may be higher but due to the fact that memory pressure is required, it would not be very noticable. Michal Hocko suggested removing the flag entirely but the conservative fix is to restore the intended PGDAT_WRITEBACK behaviour and clear the flag when a suitable zone is balanced. Fixes: 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of nodes") Link: http://lkml.kernel.org/r/20170203203222.gq7hk66yc36lpgtb@suse.deSigned-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Ross Zwisler authored
The fault wrappers drm_vm_fault(), drm_vm_shm_fault(), drm_vm_dma_fault() and drm_vm_sg_fault() used to provide extra logic beyond what was in the "drm_do_*" versions of these functions, but as of commit ca0b07d9 ("drm: convert drm from nopage to fault") they are just unnecessary wrappers that do nothing. Remove them, and rename the the drm_do_* fault handlers to remove the "do_" since they no longer have corresponding wrappers. Link: http://lkml.kernel.org/r/1486155698-25717-1-git-send-email-ross.zwisler@linux.intel.comSigned-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Sean Paul <seanpaul@chromium.org> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Tobin C Harding authored
Fix whitespace issues, extraneous braces. Link: http://lkml.kernel.org/r/1485992240-10986-5-git-send-email-me@tobin.ccSigned-off-by: Tobin C Harding <me@tobin.cc> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Tobin C Harding authored
Patch fixes sparse warning: Using plain integer as NULL pointer. Replaces assignment of 0 to pointer with NULL assignment. Link: http://lkml.kernel.org/r/1485992240-10986-2-git-send-email-me@tobin.ccSigned-off-by: Tobin C Harding <me@tobin.cc> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Masanari Iida authored
Link: http://lkml.kernel.org/r/20170202011942.1609-1-standby24x7@gmail.comSigned-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Michal Hocko authored
__vmalloc_area_node() allocates pages to cover the requested vmalloc size. This can be a lot of memory. If the current task is killed by the OOM killer, and thus has an unlimited access to memory reserves, it can consume all the memory theoretically. Fix this by checking for fatal_signal_pending and back off early. Link: http://lkml.kernel.org/r/20170201092706.9966-4-mhocko@kernel.orgSigned-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jaewon Kim authored
There are many reasons of CMA allocation failure such as EBUSY, ENOMEM, EINTR. But we did not know error reason so far. This patch prints the error value. Additionally if CONFIG_CMA_DEBUG is enabled, this patch shows bitmap status to know available pages. Actually CMA internally tries on all available regions because some regions can be failed because of EBUSY. Bitmap status is useful to know in detail on both ENONEM and EBUSY; ENOMEM: not tried at all because of no available region it could be too small total region or could be fragmentation issue EBUSY: tried some region but all failed This is an ENOMEM example with this patch. [2: Binder:714_1: 744] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12 If CONFIG_CMA_DEBUG is enabled, avabile pages also will be shown as concatenated size@position format. So 4@572 means that there are 4 available pages at 572 position starting from 0 position. [2: Binder:714_1: 744] cma: number of available pages: 4@572+7@585+7@601+8@632+38@730+166@1114+127@1921=> 357 free of 2048 total pages Link: http://lkml.kernel.org/r/1485909785-3952-1-git-send-email-jaewon31.kim@samsung.comSigned-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
David Rientjes authored
If madvise(2) advice will result in the underlying vma being split and the number of areas mapped by the process will exceed /proc/sys/vm/max_map_count as a result, return ENOMEM instead of EAGAIN. EAGAIN is returned by madvise(2) when a kernel resource, such as slab, is temporarily unavailable. It indicates that userspace should retry the advice in the near future. This is important for advice such as MADV_DONTNEED which is often used by malloc implementations to free memory back to the system: we really do want to free memory back when madvise(2) returns EAGAIN because slab allocations (for vmas, anon_vmas, or mempolicies) cannot be allocated. Encountering /proc/sys/vm/max_map_count is not a temporary failure, however, so return ENOMEM to indicate this is a more serious issue. A followup patch to the man page will specify this behavior. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701241431120.42507@chino.kir.corp.google.comSigned-off-by: David Rientjes <rientjes@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Lucas Stach authored
The callers of the DMA alloc functions already provide the proper context GFP flags. Make sure to pass them through to the CMA allocator, to make the CMA compaction context aware. Link: http://lkml.kernel.org/r/20170127172328.18574-3-l.stach@pengutronix.deSigned-off-by: Lucas Stach <l.stach@pengutronix.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alexander Graf <agraf@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Lucas Stach authored
Most users of this interface just want to use it with the default GFP_KERNEL flags, but for cases where DMA memory is allocated it may be called from a different context. No functional change yet, just passing through the flag to the underlying alloc_contig_range function. Link: http://lkml.kernel.org/r/20170127172328.18574-2-l.stach@pengutronix.deSigned-off-by: Lucas Stach <l.stach@pengutronix.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alexander Graf <agraf@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Lucas Stach authored
Currently alloc_contig_range assumes that the compaction should be done with the default GFP_KERNEL flags. This is probably right for all current uses of this interface, but may change as CMA is used in more use-cases (including being the default DMA memory allocator on some platforms). Change the function prototype, to allow for passing through the GFP mask set by upper layers. Also respect global restrictions by applying memalloc_noio_flags to the passed in flags. Link: http://lkml.kernel.org/r/20170127172328.18574-1-l.stach@pengutronix.deSigned-off-by: Lucas Stach <l.stach@pengutronix.de> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alexander Graf <agraf@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mike Rapoport authored
Add documentation about new userfaultfd features and events Link: http://lkml.kernel.org/r/1487716431-5551-1-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mike Rapoport authored
In the non-cooperative userfaultfd case, the process exit may race with outstanding mcopy_atomic called by the uffd monitor. Returning -ENOSPC instead of -EINVAL when mm is already gone will allow uffd monitor to distinguish this case from other error conditions. Link: http://lkml.kernel.org/r/1485542673-24387-6-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mike Rapoport authored
The memory mapping of a process may change between #PF event and the call to mcopy_atomic that comes to resolve the page fault. In such case, there will be no VMA covering the range passed to mcopy_atomic or the VMA will not have userfaultfd context. To allow uffd monitor to distinguish those case from other errors, let's return -ENOENT instead of -EINVAL. Note, that despite availability of UFFD_EVENT_UNMAP there still might be race between the processing of UFFD_EVENT_UNMAP and outstanding mcopy_atomic in case of non-cooperative uffd usage. [rppt@linux.vnet.ibm.com: update cases returning -ENOENT] Link: http://lkml.kernel.org/r/20170207150249.GA6709@rapoport-lnx [aarcange@redhat.com: merge fix] [akpm@linux-foundation.org: fix the merge fix] Link: http://lkml.kernel.org/r/1485542673-24387-5-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mike Rapoport authored
Allow userfaultfd monitor track termination of the processes that have memory backed by the uffd. [rppt@linux.vnet.ibm.com: add comment] Link: http://lkml.kernel.org/r/20170202135448.GB19804@rapoport-lnxLink: http://lkml.kernel.org/r/1485542673-24387-4-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-