- 15 Dec, 2020 40 commits
-
-
Kalesh Singh authored
Android needs to move large memory regions for garbage collection. The GC requires moving physical pages of multi-gigabyte heap using mremap. During this move, the application threads have to be paused for correctness. It is critical to keep this pause as short as possible to avoid jitters during user interaction. Optimize mremap for >= 1GB-sized regions by moving at the PUD/PGD level if the source and destination addresses are PUD-aligned. For CONFIG_PGTABLE_LEVELS == 3, moving at the PUD level in effect moves PGD entries, since the PUD entry is “folded back” onto the PGD entry. Add HAVE_MOVE_PUD so that architectures where moving at the PUD level isn't supported/tested can turn this off by not selecting the config. Link: https://lkml.kernel.org/r/20201014005320.2233162-4-kaleshsingh@google.comSigned-off-by: Kalesh Singh <kaleshsingh@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Geffon <bgeffon@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Gavin Shan <gshan@redhat.com> Cc: Hassan Naveed <hnaveed@wavecomp.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jia He <justin.he@arm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kees Cook <keescook@chromium.org> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Sandipan Das <sandipan@linux.ibm.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Kalesh Singh authored
Patch series "Speed up mremap on large regions", v4. mremap time can be optimized by moving entries at the PMD/PUD level if the source and destination addresses are PMD/PUD-aligned and PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and x86. Other architectures where this type of move is supported and known to be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD and HAVE_MOVE_PUD. Observed Performance Improvements for remapping a PUD-aligned 1GB-sized region on x86 and arm64: - HAVE_MOVE_PMD is already enabled on x86 : N/A - Enabling HAVE_MOVE_PUD on x86 : ~13x speed up - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD give a total of ~150x speed up on arm64. This patch (of 4): Test mremap on regions of various sizes and alignments and validate data after remapping. Also provide total time for remapping the region which is useful for performance comparison of the mremap optimizations that move pages at the PMD/PUD levels if HAVE_MOVE_PMD and/or HAVE_MOVE_PUD are enabled. Link: https://lkml.kernel.org/r/20201014005320.2233162-1-kaleshsingh@google.com Link: https://lkml.kernel.org/r/20201014005320.2233162-2-kaleshsingh@google.comSigned-off-by: Kalesh Singh <kaleshsingh@google.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Kees Cook <keescook@chromium.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Hassan Naveed <hnaveed@wavecomp.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Jia He <justin.he@arm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Sandipan Das <sandipan@linux.ibm.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: SeongJae Park <sjpark@amazon.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Dan Williams authored
Cleanup fill_list() to keep all the pgmap manipulations in a single location of the function. Update the exit unwind path accordingly. Link: http://lore.kernel.org/r/6186fa28-d123-12db-6171-a75cb6e615a5@oracle.com Link: https://lkml.kernel.org/r/160272253442.3136502.16683842453317773487.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Shakeel Butt authored
For many workloads, pagetable consumption is significant and it makes sense to expose it in the memory.stat for the memory cgroups. However at the moment, the pagetables are accounted per-zone. Converting them to per-node and using the right interface will correctly account for the memory cgroups as well. [akpm@linux-foundation.org: export __mod_lruvec_page_state to modules for arch/mips/kvm/] Link: https://lkml.kernel.org/r/20201130212541.2781790-3-shakeelb@google.comSigned-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Shakeel Butt authored
Patch series "memcg: add pagetable comsumption to memory.stat", v2. Many workloads consumes significant amount of memory in pagetables. One specific use-case is the user space network driver which mmaps the application memory to provide zero copy transfer. This driver can consume a large amount memory in page tables. This patch series exposes the pagetable comsumption for each memory cgroup. This patch (of 2): This does not change any functionality and only move the functions which update the lruvec stats to vmstat.h from memcontrol.h. The main reason for this patch is to be able to use these functions in the page table contructor function which is defined in mm.h and we can not include the memcontrol.h in that file. Also this is a better place for this interface in general. The lruvec abstraction, while invented for memcg, isn't specific to memcg at all. Link: https://lkml.kernel.org/r/20201130212541.2781790-2-shakeelb@google.comSigned-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Alex Shi authored
Swapcache readahead pages are charged before being used, so it is unlikely that they will be migrated before charging. Remove the incorrect comment. Link: https://lkml.kernel.org/r/1605864930-49405-1-git-send-email-alex.shi@linux.alibaba.comSigned-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Kaixu Xia authored
Fix the following coccinelle warnings: mm/memcontrol.c:7341:2-22: WARNING: Assignment of 0/1 to bool variable mm/memcontrol.c:7343:2-22: WARNING: Assignment of 0/1 to bool variable Link: https://lkml.kernel.org/r/1604737495-6418-1-git-send-email-kaixuxia@tencent.comSigned-off-by: Kaixu Xia <kaixuxia@tencent.com> Reported-by: Tosk Robot <tencent_os_robot@tencent.com> Acked-by: Souptick Joarder <jrdr.linux@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Muchun Song authored
The *_lruvec_slab_state is also suitable for pages allocated from buddy, not just for the slab objects. But the function name seems to tell us that only slab object is applicable. So we can rename the keyword of slab to kmem. Link: https://lkml.kernel.org/r/20201117085249.24319-1-songmuchun@bytedance.comSigned-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Lukas Bulwahn authored
Commit 2ef1bf118c40 ("mm: memcg: deprecate the non-hierarchical mode") removed the only use of memcg_has_children() in mem_cgroup_hierarchy_write() as part of the feature deprecation. Hence, since then, make CC=clang W=1 warns: mm/memcontrol.c:3421:20: warning: unused function 'memcg_has_children' [-Wunused-function] Simply remove this obsolete unused function. Link: https://lkml.kernel.org/r/20201116055043.20886-1-lukas.bulwahn@gmail.comSigned-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Hui Su authored
Use page_counter_read() in page_counter_set_max(). Link: https://lkml.kernel.org/r/20201113141048.GA178922@rlkSigned-off-by: Hui Su <sh_def@163.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Roman Gushchin authored
With the deprecation of the non-hierarchical mode of the memory controller there are no more examples of broken hierarchies left. Let's remove the cgroup core code which was supposed to print warnings about creating of broken hierarchies. Link: https://lkml.kernel.org/r/20201110220800.929549-4-guro@fb.comSigned-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Roman Gushchin authored
Update cgroup v1 docs after the deprecation of the non-hierarchical mode of the memory controller. Link: https://lkml.kernel.org/r/20201110220800.929549-3-guro@fb.comSigned-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Roman Gushchin authored
Patch series "mm: memcg: deprecate cgroup v1 non-hierarchical mode", v1. The non-hierarchical cgroup v1 mode is a legacy of early days of the memory controller and doesn't bring any value today. However, it complicates the code and creates many edge cases all over the memory controller code. It's a good time to deprecate it completely. This patchset removes the internal logic, adjusts the user interface and updates the documentation. The alt patch removes some bits of the cgroup core code, which become obsolete. Michal Hocko said: "All that we know today is that we have a warning in place to complain loudly when somebody relies on use_hierarchy=0 with a deeper hierarchy. For all those years we have seen _zero_ reports that would describe a sensible usecase. Moreover we (SUSE) have backported this warning into old distribution kernels (since 3.0 based kernels) to extend the coverage and didn't hear even for users who adopt new kernels only very slowly. The only report we have seen so far was a LTP test suite which doesn't really reflect any real life usecase" This patch (of 3): The non-hierarchical cgroup v1 mode is a legacy of early days of the memory controller and doesn't bring any value today. However, it complicates the code and creates many edge cases all over the memory controller code. It's a good time to deprecate it completely. Functionally this patch enabled is by default for all cgroups and forbids switching it off. Nothing changes if cgroup v2 is used: hierarchical mode was enforced from scratch. To protect the ABI memory.use_hierarchy interface is preserved with a limited functionality: reading always returns "1", writing of "1" passes silently, writing of any other value fails with -EINVAL and a warning to dmesg (on the first occasion). Link: https://lkml.kernel.org/r/20201110220800.929549-1-guro@fb.com Link: https://lkml.kernel.org/r/20201110220800.929549-2-guro@fb.comSigned-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Roman Gushchin authored
This patch fixes/removes some obsolete comments in the code related to the kernel memory accounting: - kmem_cache->memcg_params.memcg_caches has been removed by commit 9855609b ("mm: memcg/slab: use a single set of kmem_caches for all accounted allocations") - memcg->kmemcg_id is not used as a gate for kmem accounting since commit 0b8f73e1 ("mm: memcontrol: clean up alloc, online, offline, free functions") Link: https://lkml.kernel.org/r/20201110184615.311974-1-guro@fb.comSigned-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Alex Shi authored
The page->mem_cgroup member is replaced by memcg_data, and add a helper page_memcg() for it. Need to update comments to avoid confusing. Link: https://lkml.kernel.org/r/1491c150-1cc0-6062-08ea-9c891548a3bc@linux.alibaba.comSigned-off-by: Alex Shi <alex.shi@linux.alibaba.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Shakeel Butt authored
Since commit 369ea824 ("mm/rmap: update to new mmu_notifier semantic v2"), the code to check the secondary MMU's page table access bit is broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the secondary MMU's page table before the check. More specifically for those secondary MMUs which unmap the memory in mmu_notifier_invalidate_range_start() like kvm. However memory reclaim is the only user of !(TTU_IGNORE_ACCESS) or the absence of TTU_IGNORE_ACCESS and it explicitly performs the page table access check before trying to unmap the page. So, at worst the reclaim will miss accesses in a very short window if we remove page table access check in unmapping code. There is an unintented consequence of !(TTU_IGNORE_ACCESS) for the memcg reclaim. From memcg reclaim the page_referenced() only account the accesses from the processes which are in the same memcg of the target page but the unmapping code is considering accesses from all the processes, so, decreasing the effectiveness of memcg reclaim. The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping code. Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com Fixes: 369ea824 ("mm/rmap: update to new mmu_notifier semantic v2") Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Muchun Song authored
The rcu_read_lock/unlock only can guarantee that the memcg will not be freed, but it cannot guarantee the success of css_get to memcg. If the whole process of a cgroup offlining is completed between reading a objcg->memcg pointer and bumping the css reference on another CPU, and there are exactly 0 external references to this memory cgroup (how we get to the obj_cgroup_charge() then?), css_get() can change the ref counter from 0 back to 1. Link: https://lkml.kernel.org/r/20201028035013.99711-2-songmuchun@bytedance.com Fixes: bf4f0599 ("mm: memcg/slab: obj_cgroup API") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Chris Down <chris@chrisdown.name> Cc: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Muchun Song authored
Consider the following memcg hierarchy. root / \ A B If we failed to get the reference on objcg of memcg A, the get_obj_cgroup_from_current can return the wrong objcg for the root memcg. Link: https://lkml.kernel.org/r/20201029164429.58703-1-songmuchun@bytedance.com Fixes: bf4f0599 ("mm: memcg/slab: obj_cgroup API") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Chris Down <chris@chrisdown.name> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Eugene Syromiatnikov <esyr@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Adrian Reber <areber@redhat.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Miaohe Lin authored
The mz->usage_in_excess >= mz_node->usage_in_excess check is exactly the else case of mz->usage_in_excess < mz_node->usage_in_excess. So we could replace else if (mz->usage_in_excess >= mz_node->usage_in_excess) with else equally. Also drop the comment which doesn't really explain much. Link: https://lkml.kernel.org/r/20201012131607.10656-1-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Muchun Song authored
Since commit 991e7673 ("mm: memcontrol: account kernel stack per node") there is no user of the mod_memcg_obj_state(). So just remove it. Also rework type of the idx parameter of the mod_objcg_state() from int to enum node_stat_item. Link: https://lkml.kernel.org/r/20201013153504.92602-1-songmuchun@bytedance.comSigned-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christopher Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Chris Down <chris@chrisdown.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Johannes Weiner authored
As huge page usage in the page cache and for shmem files proliferates in our production environment, the performance monitoring team has asked for per-cgroup stats on those pages. We already track and export anon_thp per cgroup. We already track file THP and shmem THP per node, so making them per-cgroup is only a matter of switching from node to lruvec counters. All callsites are in places where the pages are charged and locked, so page->memcg is stable. [hannes@cmpxchg.org: add documentation] Link: https://lkml.kernel.org/r/20201026174029.GC548555@cmpxchg.org Link: https://lkml.kernel.org/r/20201022151844.489337-1-hannes@cmpxchg.orgSigned-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Randy Dunlap authored
Fix a typo, punctuation, use uppercase for CPUs, and limit tmpfs to keeping only its files in virtual memory (phrasing). Link: https://lkml.kernel.org/r/20201202010934.18566-1-rdunlap@infradead.orgSigned-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Hugh Dickins <hughd@google.com> Cc: Chris Down <chris@chrisdown.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Hui Su authored
shmem_mapping() isn't worth an out-of-line call from any callsite. So make it inline by - make shmem_aops global - export shmem_aops - inline the shmem_mapping() and replace the direct call 'shmem_aops' with shmem_mapping() in shmem.c. Link: https://lkml.kernel.org/r/20201115165207.GA265355@rlkSigned-off-by: Hui Su <sh_def@163.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jeff Layton authored
With the merge of commit 2e169296 ("ceph: have ceph_writepages_start call pagevec_lookup_range_tag"), nothing calls this anymore. Link: https://lkml.kernel.org/r/20201021193926.101474-1-jlayton@kernel.orgSigned-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Miaohe Lin authored
We could use helper memset to fill the swap_map with SWAP_HAS_CACHE instead of a direct loop here to simplify the code. Also we can remove the local variable i and map this way. Link: https://lkml.kernel.org/r/20200921122224.7139-1-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Miaohe Lin authored
When the code went to the out label, it must have p == NULL. So what out label really does is redundant if check and return err. We should Remove this unnecessary out label because it does not handle resource free and so on. Link: https://lkml.kernel.org/r/20201009130337.29698-1-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Miaohe Lin authored
swap_ra_info() may leave ra_info untouched in non_swap_entry() case as page table lock is not held. In this case, we have ra_info.nr_pte == 0 and it is meaningless to continue with swap cache readahead. Skip such ops by init ra_info.win = 1. [akpm@linux-foundation.org: clean up struct init] Link: https://lkml.kernel.org/r/20201009133059.58407-1-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Miaohe Lin authored
Commit 570a335b ("swap_info: swap count continuations") introduced the func add_swap_count_continuation() but forgot to use the helper function swap_count() introduced by commit 355cfa73 ("mm: modify swap_map and add SWAP_HAS_CACHE flag"). Link: https://lkml.kernel.org/r/20201009134306.18033-1-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Ralph Campbell authored
release_pages() is an optimized, inlined version of __put_pages() except that zone device struct pages that are not page_is_devmap_managed() (i.e., memory_type MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_PCI_P2PDMA), fall through to the code that could return the zone device page to the page allocator instead of adjusting the pgmap reference count. Clearly these type of pages are not having the reference count decremented to zero via release_pages() or page allocation problems would be seen. Just to be safe, handle the 1 to zero case in release_pages() like __put_page() does. Link: https://lkml.kernel.org/r/20201021194733.11530-1-rcampbell@nvidia.comSigned-off-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jason Gunthorpe authored
These functions accomplish the same thing but have different implementations. unpin_user_page() has a bug where it calls mod_node_page_state() after calling put_page() which creates a risk that the page could have been hot-uplugged from the system. Fix this by using put_compound_head() as the only implementation. __unpin_devmap_managed_user_page() and related can be deleted as well in favour of the simpler, but slower, version in put_compound_head() that has an extra atomic page_ref_sub, but always calls put_page() which internally contains the special devmap code. Move put_compound_head() to be directly after try_grab_compound_head() so people can find it in future. Link: https://lkml.kernel.org/r/0-v1-6730d4ee0d32+40e6-gup_combine_put_jgg@nvidia.com Fixes: 1970dc6f ("mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> CC: Joao Martins <joao.m.martins@oracle.com> CC: Jonathan Corbet <corbet@lwn.net> CC: Dan Williams <dan.j.williams@intel.com> CC: Dave Chinner <david@fromorbit.com> CC: Christoph Hellwig <hch@infradead.org> CC: Jane Chu <jane.chu@oracle.com> CC: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> CC: Michal Hocko <mhocko@suse.com> CC: Mike Kravetz <mike.kravetz@oracle.com> CC: Shuah Khan <shuah@kernel.org> CC: Muchun Song <songmuchun@bytedance.com> CC: Vlastimil Babka <vbabka@suse.cz> CC: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jason Gunthorpe authored
Long ago there wasn't a FOLL_LONGTERM flag so this DAX check was done by post-processing the VMA list. These days it is trivial to just check each VMA to see if it is DAX before processing it inside __get_user_pages() and return failure if a DAX VMA is encountered with FOLL_LONGTERM. Removing the allocation of the VMA list is a significant speed up for many call sites. Add an IS_ENABLED to vma_is_fsdax so that code generation is unchanged when DAX is compiled out. Remove the dummy version of __gup_longterm_locked() as !CONFIG_CMA already makes memalloc_nocma_save(), check_and_migrate_cma_pages(), and memalloc_nocma_restore() into a NOP. Link: https://lkml.kernel.org/r/0-v1-5551df3ed12e+b8-gup_dax_speedup_jgg@nvidia.comSigned-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jason Gunthorpe authored
Since commit 70e806e4 ("mm: Do early cow for pinned pages during fork() for ptes") pages under a FOLL_PIN will not be write protected during COW for fork. This means that pages returned from pin_user_pages(FOLL_WRITE) should not become write protected while the pin is active. However, there is a small race where get_user_pages_fast(FOLL_PIN) can establish a FOLL_PIN at the same time copy_present_page() is write protecting it: CPU 0 CPU 1 get_user_pages_fast() internal_get_user_pages_fast() copy_page_range() pte_alloc_map_lock() copy_present_page() atomic_read(has_pinned) == 0 page_maybe_dma_pinned() == false atomic_set(has_pinned, 1); gup_pgd_range() gup_pte_range() pte_t pte = gup_get_pte(ptep) pte_access_permitted(pte) try_grab_compound_head() pte = pte_wrprotect(pte) set_pte_at(); pte_unmap_unlock() // GUP now returns with a write protected page The first attempt to resolve this by using the write protect caused problems (and was missing a barrrier), see commit f3c64eda ("mm: avoid early COW write protect games during fork()") Instead wrap copy_p4d_range() with the write side of a seqcount and check the read side around gup_pgd_range(). If there is a collision then get_user_pages_fast() fails and falls back to slow GUP. Slow GUP is safe against this race because copy_page_range() is only called while holding the exclusive side of the mmap_lock on the src mm_struct. [akpm@linux-foundation.org: coding style fixes] Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com Link: https://lkml.kernel.org/r/2-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com Fixes: f3c64eda ("mm: avoid early COW write protect games during fork()") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Peter Xu <peterx@redhat.com> Acked-by: "Ahmed S. Darwish" <a.darwish@linutronix.de> [seqcount_t parts] Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Leon Romanovsky <leonro@nvidia.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jason Gunthorpe authored
Patch series "Add a seqcount between gup_fast and copy_page_range()", v4. As discussed and suggested by Linus use a seqcount to close the small race between gup_fast and copy_page_range(). Ahmed confirms that raw_write_seqcount_begin() is the correct API to use in this case and it doesn't trigger any lockdeps. I was able to test it using two threads, one forking and the other using ibv_reg_mr() to trigger GUP fast. Modifying copy_page_range() to sleep made the window large enough to reliably hit to test the logic. This patch (of 2): The next patch in this series makes the lockless flow a little more complex, so move the entire block into a new function and remove a level of indention. Tidy a bit of cruft: - addr is always the same as start, so use start - Use the modern check_add_overflow() for computing end = start + len - nr_pinned/pages << PAGE_SHIFT needs the LHS to be unsigned long to avoid shift overflow, make the variables unsigned long to avoid coding casts in both places. nr_pinned was missing its cast - The handling of ret and nr_pinned can be streamlined a bit No functional change. Link: https://lkml.kernel.org/r/0-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com Link: https://lkml.kernel.org/r/1-v4-908497cf359a+4782-gup_fork_jgg@nvidia.comSigned-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Barry Song authored
Without DEBUG_FS, all the code in gup_benchmark becomes meaningless. For sure kernel provides debugfs stub while DEBUG_FS is disabled, but the point here is that GUP_TEST can do nothing without DEBUG_FS. [song.bao.hua@hisilicon.com: add comment as a prompt to users as commented by John and Randy] Link: https://lkml.kernel.org/r/20201108083732.15336-1-song.bao.hua@hisilicon.com Link: https://lkml.kernel.org/r/20201104100552.20156-1-song.bao.hua@hisilicon.comSigned-off-by: Barry Song <song.bao.hua@hisilicon.com> Suggested-by: John Garry <john.garry@huawei.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Barry Song authored
gup_test_init() is only called during initialization, mark it as __init to save some memory. Link: https://lkml.kernel.org/r/20201103081016.16532-1-song.bao.hua@hisilicon.comSigned-off-by: Barry Song <song.bao.hua@hisilicon.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
John Hubbard authored
Each invocation of userfaultfd for "anon" and "shmem" was taking about 6.5 sec to run, contributing to an overall run time of about 22 sec for run_vmtests.sh. Reduce the size and bounce input values to the userfaultfd invocation within run_vmtests.sh, enough to get each invocation down to about 1.0 sec. This should still provide a reasonable smoke test, while staying within a nominal time budget of around 1 second or so per test. And this brings the overall running time of run_vmtests.sh down to 11 second. Link: https://lkml.kernel.org/r/20201026064021.3545418-10-jhubbard@nvidia.comSigned-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
John Hubbard authored
HMM selftests are incredibly useful, but they are only effective if people actually build and run them. All the other tests in selftests/vm can be built with very standard, always-available libraries: libpthread, librt. The hmm-tests.c program, on the other hand, requires something that is (much) less readily available: libhugetlbfs. And so the build will typically fail for many developers. A simple attempt to install libhugetlbfs will also run into complications on some common distros these days: Fedora and Arch Linux (yes, Arch AUR has it, but that's fragile, as always with AUR). The library is not maintained actively enough at the moment, for distros to deal with it. I had to build it from source, for Fedora, and that didn't go too smoothly either. It turns out that, out of 21 tests in hmm-tests.c, only 2 actually require functionality from libhugetlbfs. Therefore, if libhugetlbfs is missing, simply ifdef those two tests out and allow the developer to at least have the other 19 tests, if they don't want to pause to work through the above issues. Also issue a warning, so that it's clear that there is an imperfection in the build. In order to do that, a tiny shell script (check_config.sh) runs a quick compile (not link, that's too prone to false failures with library paths), and basically, if the compiler doesn't find hugetlbfs.h in its standard locations, then the script concludes that libhugetlbfs is not available. The output is in two files, one for inclusion in hmm-test.c (local_config.h), and one for inclusion in the Makefile (local_config.mk). Link: https://lkml.kernel.org/r/20201026064021.3545418-9-jhubbard@nvidia.comSigned-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
John Hubbard authored
Run benchmarks on the _fast variants of gup and pup, as originally intended. Run the new gup_test sub-test: dump pages. In addition to exercising the dump_page() call, it also demonstrates the various options you can use to specify which pages to dump, and how. Link: https://lkml.kernel.org/r/20201026064021.3545418-8-jhubbard@nvidia.comSigned-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
John Hubbard authored
For quite a while, I was doing a quick hack to gup_test.c (previously, gup_benchmark.c) whenever I wanted to try out my changes to dump_page(). This makes that hack unnecessary, and instead allows anyone to easily get the same coverage from a user space program. That saves a lot of time because you don't have to change the kernel, in order to test different pages and options. The new sub-test takes advantage of the existing gup_test infrastructure, which already provides a simple user space program, some allocated user space pages, an ioctl call, pinning of those pages (via either get_user_pages or pin_user_pages) and a corresponding kernel-side test invocation. There's not much more required, mainly just a couple of inputs from the user. In fact, the new test re-uses the existing command line options in order to get various helpful combinations (THP or normal, _fast or slow gup, gup vs. pup, and more). New command line options are: which pages to dump, and what type of "get/pin" to use. In order to figure out which pages to dump, the logic is: * If the user doesn't specify anything, the page 0 (the first page in the address range that the program sets up for testing) is dumped. * Or, the user can type up to 8 page indices anywhere on the command line. If you type more than 8, then it uses the first 8 and ignores the remaining items. For example: ./gup_test -ct -F 1 0 19 0x1000 Meaning: -c: dump pages sub-test -t: use THP pages -F 1: use pin_user_pages() instead of get_user_pages() 0 19 0x1000: dump pages 0, 19, and 4096 Link: https://lkml.kernel.org/r/20201026064021.3545418-7-jhubbard@nvidia.comSigned-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
John Hubbard authored
Therefore, some minor cleanup and improvements are in order: 1. Rename the other items appropriately. 2. Stop reporting timing information on the non-benchmark items. It's still being recorded and is available, but there's no point in cluttering up the report with data that no one reasonably needs to check. 3. Don't do iterations, for non-benchmark items. 4. Print out a shorter, more appropriate report for the non-benchmark tests. 5. Add the command that was run, to the report. This really helps, as there are quite a lot of options now. 6. Use a larger integer type for cmd, now that it's being compared Otherwise it doesn't work, because in this case cmd is about 3 billion, which is the perfect size for problems with signed vs unsigned int. Link: https://lkml.kernel.org/r/20201026064021.3545418-6-jhubbard@nvidia.comSigned-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-