- 08 Sep, 2015 40 commits
-
-
Michal Hocko authored
Most of the exported functions in this header are not marked extern so change the rest to follow the same style. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Michal Hocko authored
The only user is cgwb_bdi_init and that one depends on CONFIG_CGROUP_WRITEBACK which in turn depends on CONFIG_MEMCG so it doesn't make much sense to definte an empty stub for !CONFIG_MEMCG. Moreover ERR_PTR(-EINVAL) is ugly and would lead to runtime crashes if used in unguarded code paths. Better fail during compilation. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Michal Hocko authored
mem_cgroup structure is defined in mm/memcontrol.c currently which means that the code outside of this file has to use external API even for trivial access stuff. This patch exports mm_struct with its dependencies and makes some of the exported functions inlines. This even helps to reduce the code size a bit (make defconfig + CONFIG_MEMCG=y) text data bss dec hex filename 12355346 1823792 1089536 15268674 e8fb42 vmlinux.before 12354970 1823792 1089536 15268298 e8f9ca vmlinux.after This is not much (370B) but better than nothing. We also save a function call in some hot paths like callers of mem_cgroup_count_vm_event which is used for accounting. The patch doesn't introduce any functional changes. [vdavykov@parallels.com: inline memcg_kmem_is_active] [vdavykov@parallels.com: do not expose type outside of CONFIG_MEMCG] [akpm@linux-foundation.org: memcontrol.h needs eventfd.h for eventfd_ctx] [akpm@linux-foundation.org: export mem_cgroup_from_task() to modules] Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Michal Hocko authored
"memcg: export struct mem_cgroup" will add includes into linux/memcontrol.h which lead to further header dependency issues as reported by Guenter Roeck: In file included from include/linux/highmem.h:7:0, from include/linux/bio.h:23, from include/linux/writeback.h:192, from include/linux/memcontrol.h:30, from include/linux/swap.h:8, from ./arch/sparc/include/asm/pgtable_32.h:17, from ./arch/sparc/include/asm/pgtable.h:6, from arch/sparc/kernel/traps_32.c:23: include/linux/mm.h: In function 'is_vmalloc_addr': include/linux/mm.h:371:17: error: 'VMALLOC_START' undeclared (first use in this function) include/linux/mm.h:371:17: note: each undeclared identifier is reported only once for each function it appears in include/linux/mm.h:371:41: error: 'VMALLOC_END' undeclared (first use in this function) include/linux/mm.h: In function 'maybe_mkwrite': include/linux/mm.h:556:3: error: implicit declaration of function 'pte_mkwrite' The issue is that pgtable_32.h depends on swap.h to get swap_entry_t but that goes all the way down to linux/mm.h which wants to have VMALLOC_* which is defined later in pgtable_32.h, though. swap_entry_t is defined in include/mm_types.h so it should be sufficient to include this header without more dependencies. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Sergey Senozhatsky authored
dma_pool_destroy() does not tolerate a NULL dma_pool pointer argument and performs a NULL-pointer dereference. This requires additional attention and effort from developers/reviewers and forces all dma_pool_destroy() callers to do a NULL check if (pool) dma_pool_destroy(pool); Or, otherwise, be invalid dma_pool_destroy() users. Tweak dma_pool_destroy() and NULL-check the pointer there. Proposed by Andrew Morton. Link: https://lkml.org/lkml/2015/6/8/583Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Julia Lawall <julia.lawall@lip6.fr> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Sergey Senozhatsky authored
mempool_destroy() does not tolerate a NULL mempool_t pointer argument and performs a NULL-pointer dereference. This requires additional attention and effort from developers/reviewers and forces all mempool_destroy() callers to do a NULL check if (pool) mempool_destroy(pool); Or, otherwise, be invalid mempool_destroy() users. Tweak mempool_destroy() and NULL-check the pointer there. Proposed by Andrew Morton. Link: https://lkml.org/lkml/2015/6/8/583Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Julia Lawall <julia.lawall@lip6.fr> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Sergey Senozhatsky authored
kmem_cache_destroy() does not tolerate a NULL kmem_cache pointer argument and performs a NULL-pointer dereference. This requires additional attention and effort from developers/reviewers and forces all kmem_cache_destroy() callers (200+ as of 4.1) to do a NULL check if (cache) kmem_cache_destroy(cache); Or, otherwise, be invalid kmem_cache_destroy() users. Tweak kmem_cache_destroy() and NULL-check the pointer there. Proposed by Andrew Morton. Link: https://lkml.org/lkml/2015/6/8/583Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Julia Lawall <julia.lawall@lip6.fr> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
David Rientjes authored
The "killed" variable in out_of_memory() can be removed since the call to oom_kill_process() where we should block to allow the process time to exit is obvious. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
David Rientjes authored
Describe the purpose of struct oom_control and what each member does. Also make gfp_mask and order const since they are never manipulated or passed to functions that discard the qualifier. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
David Rientjes authored
Sysrq+f is used to kill a process either for debug or when the VM is otherwise unresponsive. It is not intended to trigger a panic when no process may be killed. Avoid panicking the system for sysrq+f when no processes are killed. Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Michal Hocko <mhocko@suse.cz> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
David Rientjes authored
The force_kill member of struct oom_control isn't needed if an order of -1 is used instead. This is the same as order == -1 in struct compact_control which requires full memory compaction. This patch introduces no functional change. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
David Rientjes authored
There are essential elements to an oom context that are passed around to multiple functions. Organize these elements into a new struct, struct oom_control, that specifies the context for an oom condition. This patch introduces no functional change. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Nicholas Krause authored
This makes set_recommended_min_free_kbytes() have a return type of void as it cannot fail. Signed-off-by: Nicholas Krause <xerofoify@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
David Rientjes authored
Explicitly state that __GFP_NORETRY will attempt direct reclaim and memory compaction before returning NULL and that the oom killer is not called in the current implementation of the page allocator. [akpm@linux-foundation.org: s/has/have/] Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Dave Hansen authored
=== Short summary ==== iov_iter_fault_in_readable() works around a really rare case and we can avoid the deadlock it addresses in another way: disable page faults and work around copy failures by faulting after the copy in a slow path instead of before in a hot one. I have a little microbenchmark that does repeated, small writes to tmpfs. This patch speeds that micro up by 6.2%. === Long version === When doing a sys_write() we have a source buffer in userspace and then a target file page. If both of those are the same physical page, there is a potential deadlock that we avoid. It would happen something like this: 1. We start the write to the file 2. Allocate page cache page and set it !Uptodate 3. Touch the userspace buffer to copy in the user data 4. Page fault (since source of the write not yet mapped) 5. Page fault code tries to lock the page and deadlocks (more details on this below) To avoid this, we prefault the page to guarantee that this fault does not occur. But, this prefault comes at a cost. It is one of the most expensive things that we do in a hot write() path (especially if we compare it to the read path). It is working around a pretty rare case. To fix this, it's pretty simple. We move the "prefault" code to run after we attempt the copy. We explicitly disable page faults _during_ the copy, detect the copy failure, then execute the "prefault" ouside of where the page lock needs to be held. iov_iter_copy_from_user_atomic() actually already has an implicit pagefault_disable() inside of it (at least on x86), but we add an explicit one. I don't think we can depend on every kmap_atomic() implementation to pagefault_disable() for eternity. =================================================== The stack trace when this happens looks like this: wait_on_page_bit_killable+0xc0/0xd0 __lock_page_or_retry+0x84/0xa0 filemap_fault+0x1ed/0x3d0 __do_fault+0x41/0xc0 handle_mm_fault+0x9bb/0x1210 __do_page_fault+0x17f/0x3d0 do_page_fault+0xc/0x10 page_fault+0x22/0x30 generic_perform_write+0xca/0x1a0 __generic_file_write_iter+0x190/0x1f0 ext4_file_write_iter+0xe9/0x460 __vfs_write+0xaa/0xe0 vfs_write+0xa6/0x1a0 SyS_write+0x46/0xa0 entry_SYSCALL_64_fastpath+0x12/0x6a 0xffffffffffffffff (Note, this does *NOT* happen in practice today because the kmap_atomic() does a pagefault_disable(). The trace above was obtained by taking out the pagefault_disable().) You can trigger the deadlock with this little code snippet: fd = open("foo", O_RDWR); fdmap = mmap(NULL, len, PROT_WRITE|PROT_READ, MAP_SHARED, fd, 0); write(fd, &fdmap[0], 1); Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jens Axboe <axboe@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Paul Cassella <cassella@cray.com> Cc: Greg Thelen <gthelen@google.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Minchan Kim authored
We want to know per-process workingset size for smart memory management on userland and we use swap(ex, zram) heavily to maximize memory efficiency so workingset includes swap as well as RSS. On such system, if there are lots of shared anonymous pages, it's really hard to figure out exactly how many each process consumes memory(ie, rss + wap) if the system has lots of shared anonymous memory(e.g, android). This patch introduces SwapPss field on /proc/<pid>/smaps so we can get more exact workingset size per process. Bongkyu tested it. Result is below. 1. 50M used swap SwapTotal: 461976 kB SwapFree: 411192 kB $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}'; 48236 $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}'; 141184 2. 240M used swap SwapTotal: 461976 kB SwapFree: 216808 kB $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}'; 230315 $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}'; 1387744 [akpm@linux-foundation.org: simplify kunmap_atomic() call] Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Bongkyu Kim <bongkyu.kim@lge.com> Tested-by: Bongkyu Kim <bongkyu.kim@lge.com> Cc: Hugh Dickins <hughd@google.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vladimir Murzin authored
memtest does not require these headers to be included. Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com> Cc: Leon Romanovsky <leon@leon.nu> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vladimir Murzin authored
- prefer pr_info(... to printk(KERN_INFO ... - use %pa for phys_addr_t - use cpu_to_be64 while printing pattern in reserve_bad_mem() Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com> Cc: Leon Romanovsky <leon@leon.nu> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vladimir Murzin authored
Since simple_strtoul is obsolete and memtest_pattern is type of int, use kstrtouint instead. Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com> Cc: Leon Romanovsky <leon@leon.nu> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Konstantin Khlebnikov authored
Notes about recent changes. [akpm@linux-foundation.org: various tweaks] Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Mark Williamson <mwilliamson@undo-software.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Konstantin Khlebnikov authored
This patch sets bit 56 in pagemap if this page is mapped only once. It allows to detect exclusively used pages without exposing PFN: present file exclusive state 0 0 0 non-present 1 1 0 file page mapped somewhere else 1 1 1 file page mapped only here 1 0 0 anon non-CoWed page (shared with parent/child) 1 0 1 anon CoWed page (or never forked) CoWed pages in (MAP_FILE | MAP_PRIVATE) areas are anon in this context. MMap-exclusive bit doesn't reflect potential page-sharing via swapcache: page could be mapped once but has several swap-ptes which point to it. Application could detect that by swap bit in pagemap entry and touch that pte via /proc/pid/mem to get real information. See http://lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com Requested by Mark Williamson. [akpm@linux-foundation.org: fix spello] Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Mark Williamson <mwilliamson@undo-software.com> Tested-by: Mark Williamson <mwilliamson@undo-software.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Konstantin Khlebnikov authored
This patch makes pagemap readable for normal users and hides physical addresses from them. For some use-cases PFN isn't required at all. See http://lkml.kernel.org/r/1425935472-17949-1-git-send-email-kirill@shutemov.name Fixes: ab676b7d ("pagemap: do not leak physical addresses to non-privileged userspace") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Mark Williamson <mwilliamson@undo-software.com> Tested-by: Mark Williamson <mwilliamson@undo-software.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Konstantin Khlebnikov authored
This patch moves pmd dissection out of reporting loop: huge pages are reported as bunch of normal pages with contiguous PFNs. Add missing "FILE" bit in hugetlb vmas. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Mark Williamson <mwilliamson@undo-software.com> Tested-by: Mark Williamson <mwilliamson@undo-software.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Konstantin Khlebnikov authored
This patch removes page-shift bits (scheduled to remove since 3.11) and completes migration to the new bit layout. Also it cleans messy macro. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mark Williamson <mwilliamson@undo-software.com> Tested-by: Mark Williamson <mwilliamson@undo-software.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Konstantin Khlebnikov authored
This patchset makes pagemap useable again in the safe way (after row hammer bug it was made CAP_SYS_ADMIN-only). This patchset restores access for non-privileged users but hides PFNs from them. Also it adds bit 'map-exclusive' which is set if page is mapped only here: it helps in estimation of working set without exposing pfns and allows to distinguish CoWed and non-CoWed private anonymous pages. Second patch removes page-shift bits and completes migration to the new pagemap format: flags soft-dirty and mmap-exclusive are available only in the new format. This patch (of 5): This patch moves permission checks from pagemap_read() into pagemap_open(). Pointer to mm is saved in file->private_data. This reference pins only mm_struct itself. /proc/*/mem, maps, smaps already work in the same way. See http://lkml.kernel.org/r/CA+55aFyKpWrt_Ajzh1rzp_GcwZ4=6Y=kOv8hBz172CFJp6L8Tg@mail.gmail.comSigned-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Mark Williamson <mwilliamson@undo-software.com> Tested-by: Mark Williamson <mwilliamson@undo-software.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vineet Gupta authored
It has no callers. Signed-off-by: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wei Yang authored
Each memblock_region has flags to indicates the type of this range. For the overlap case, memblock_add_range() inserts the lower part and leave the upper part as indicated in the overlapped region. If the flags of the new range differs from the overlapped region, the information recorded is not correct. This patch adds a WARN_ON when the flags of the new range differs from the overlapped region. Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wei Yang authored
Commit febd5949 ("mm/memory hotplug: init the zone's size when calculating node totalpages") refines the function free_area_init_core(). After doing so, these two parameters are not used anymore. This patch removes these two parameters. Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Cc: Gu Zheng <guz.fnst@cn.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wei Yang authored
nr_node_ids records the highest possible node id, which is calculated by scanning the bitmap node_states[N_POSSIBLE]. Current implementation scan the bitmap from the beginning, which will scan the whole bitmap. This patch reverses the order by scanning from the end with find_last_bit(). Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Kirill A. Shutemov authored
__dax_fault() takes i_mmap_lock for write. Let's pair it with write unlock on do_cow_fault() side. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Kirill A. Shutemov authored
DAX is not so special: we need i_mmap_lock to protect mapping->i_mmap. __dax_pmd_fault() uses unmap_mapping_range() shoot out zero page from all mappings. We need to drop i_mmap_lock there to avoid lock deadlock. Re-aquiring the lock should be fine since we check i_size after the point. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Matthew Wilcox authored
I was basically open-coding it (thanks to copying code from do_fault() which probably also needs to be fixed). Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Matthew Wilcox authored
If the first access to a huge page was a store, there would be no existing zero pmd in this process's page tables. There could be a zero pmd in another process's page tables, if it had done a load. We can detect this case by noticing that the buffer_head returned from the filesystem is New, and ensure that other processes mapping this huge page have their page tables flushed. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Kirill A. Shutemov authored
This is another place where DAX assumed that pgtable_t was a pointer. Open code the important parts of set_huge_zero_page() in DAX and make set_huge_zero_page() static again. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Kirill A. Shutemov authored
The original DAX code assumed that pgtable_t was a pointer, which isn't true on all architectures. Restructure the code to not rely on that assumption. [willy@linux.intel.com: further fixes integrated into this patch] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Kirill A. Shutemov authored
The DAX code neglected to put the refcount on the huge zero page. Also we must notify on splits. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Matthew Wilcox authored
If two threads write-fault on the same hole at the same time, the winner of the race will return to userspace and complete their store, only to have the loser overwrite their store with zeroes. Fix this for now by taking the i_mmap_sem for write instead of read, and do so outside the call to get_block(). Now the loser of the race will see the block has already been zeroed, and will not zero it again. This severely limits our scalability. I have ideas for improving it, but those can wait for a later patch. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Matthew Wilcox authored
Jan Kara pointed out that in the case where we are writing to a hole, we can end up with a lock inversion between the page lock and the journal lock. We can avoid this by starting the transaction in ext4 before calling into DAX. The journal lock nests inside the superblock pagefault lock, so we have to duplicate that code from dax_fault, like XFS does. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Matthew Wilcox authored
DAX wants different semantics from any currently-existing ext4 get_block callback. Unlike ext4_get_block_write(), it needs to honour the 'create' flag, and unlike ext4_get_block(), it needs to be able to return unwritten extents. So introduce a new ext4_get_block_dax() which has those semantics. We could also change ext4_get_block_write() to honour the 'create' flag, but that might have consequences on other users that I do not currently understand. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Matthew Wilcox authored
Jan Kara pointed out I should be more explicit here about the perils of racing against truncate. The comment is mostly the same as for the PTE case. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-