- 29 Apr, 2022 40 commits
-
-
Yosry Ahmed authored
Currently, cg_read()/cg_write() returns 0 on success and -1 on failure. Modify them to return the -errno on failure. Link: https://lkml.kernel.org/r/20220425190040.2475377-3-yosryahmed@google.comSigned-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Chen Wandun <chenwandun@huawei.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: "Michal Koutn" <mkoutny@suse.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Wei Xu <weixugc@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Shakeel Butt authored
This patch series adds a memory.reclaim proactive reclaim interface. The rationale behind the interface and how it works are in the first patch. This patch (of 4): Introduce a memcg interface to trigger memory reclaim on a memory cgroup. Use case: Proactive Reclaim --------------------------- A userspace proactive reclaimer can continuously probe the memcg to reclaim a small amount of memory. This gives more accurate and up-to-date workingset estimation as the LRUs are continuously sorted and can potentially provide more deterministic memory overcommit behavior. The memory overcommit controller can provide more proactive response to the changing behavior of the running applications instead of being reactive. A userspace reclaimer's purpose in this case is not a complete replacement for kswapd or direct reclaim, it is to proactively identify memory savings opportunities and reclaim some amount of cold pages set by the policy to free up the memory for more demanding jobs or scheduling new jobs. A user space proactive reclaimer is used in Google data centers. Additionally, Meta's TMO paper recently referenced a very similar interface used for user space proactive reclaim: https://dl.acm.org/doi/pdf/10.1145/3503222.3507731 Benefits of a user space reclaimer: ----------------------------------- 1) More flexible on who should be charged for the cpu of the memory reclaim. For proactive reclaim, it makes more sense to be centralized. 2) More flexible on dedicating the resources (like cpu). The memory overcommit controller can balance the cost between the cpu usage and the memory reclaimed. 3) Provides a way to the applications to keep their LRUs sorted, so, under memory pressure better reclaim candidates are selected. This also gives more accurate and uptodate notion of working set for an application. Why memory.high is not enough? ------------------------------ - memory.high can be used to trigger reclaim in a memcg and can potentially be used for proactive reclaim. However there is a big downside in using memory.high. It can potentially introduce high reclaim stalls in the target application as the allocations from the processes or the threads of the application can hit the temporary memory.high limit. - Userspace proactive reclaimers usually use feedback loops to decide how much memory to proactively reclaim from a workload. The metrics used for this are usually either refaults or PSI, and these metrics will become messy if the application gets throttled by hitting the high limit. - memory.high is a stateful interface, if the userspace proactive reclaimer crashes for any reason while triggering reclaim it can leave the application in a bad state. - If a workload is rapidly expanding, setting memory.high to proactively reclaim memory can result in actually reclaiming more memory than intended. The benefits of such interface and shortcomings of existing interface were further discussed in this RFC thread: https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/ Interface: ---------- Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to trigger reclaim in the target memory cgroup. The interface is introduced as a nested-keyed file to allow for future optional arguments to be easily added to configure the behavior of reclaim. Possible Extensions: -------------------- - This interface can be extended with an additional parameter or flags to allow specifying one or more types of memory to reclaim from (e.g. file, anon, ..). - The interface can also be extended with a node mask to reclaim from specific nodes. This has use cases for reclaim-based demotion in memory tiering systens. - A similar per-node interface can also be added to support proactive reclaim and reclaim-based demotion in systems without memcg. - Add a timeout parameter to make it easier for user space to call the interface without worrying about being blocked for an undefined amount of time. For now, let's keep things simple by adding the basic functionality. [yosryahmed@google.com: worked on versions v2 onwards, refreshed to current master, updated commit message based on recent discussions and use cases] Link: https://lkml.kernel.org/r/20220425190040.2475377-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20220425190040.2475377-2-yosryahmed@google.comSigned-off-by: Shakeel Butt <shakeelb@google.com> Co-developed-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Wei Xu <weixugc@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Thelen <gthelen@google.com> Cc: Chen Wandun <chenwandun@huawei.com> Cc: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: "Michal Koutn" <mkoutny@suse.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Brian Geffon authored
Today it's only possible to write back as a page, idle, or huge. A user might want to writeback pages which are huge and idle first as these idle pages do not require decompression and make a good first pass for writeback. Idle writeback specifically has the advantage that a refault is unlikely given that the page has been swapped for some amount of time without being refaulted. Huge writeback has the advantage that you're guaranteed to get the maximum benefit from a single page writeback, that is, you're reclaiming one full page of memory. Pages which are compressed in zram being written back result in some benefit which is always less than a page size because of the fact that it was compressed. The primary use of this is for minimizing refaults in situations where the device has to be sensitive to storage endurance. On ChromeOS we have devices with slow eMMC and repeated writes and refaults can negatively affect performance and endurance. Link: https://lkml.kernel.org/r/20220322215821.1196994-1-bgeffon@google.comSigned-off-by: Brian Geffon <bgeffon@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Chen Wandun authored
There is no need to update last_pgdat for each zone, only update last_pgdat when iterating the first zone of a node. Link: https://lkml.kernel.org/r/20220322115635.2708989-1-chenwandun@huawei.comSigned-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
Fix sparse warning: mm/kasan/shadow.c:496:15: warning: restricted kasan_vmalloc_flags_t degrades to integer Link: https://lkml.kernel.org/r/52d8fccdd3a48d4bdfd0ff522553bac2a13f1579.1649351254.git.andreyknvl@google.comSigned-off-by: Andrey Konovalov <andreyknvl@google.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Zqiang authored
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0 preempt_count: 1, expected: 0 ........... CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.1-rt16-yocto-preempt-rt #22 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x60/0x8c dump_stack+0x10/0x12 __might_resched.cold+0x13b/0x173 rt_spin_lock+0x5b/0xf0 ___cache_free+0xa5/0x180 qlist_free_all+0x7a/0x160 per_cpu_remove_cache+0x5f/0x70 smp_call_function_many_cond+0x4c4/0x4f0 on_each_cpu_cond_mask+0x49/0xc0 kasan_quarantine_remove_cache+0x54/0xf0 kasan_cache_shrink+0x9/0x10 kmem_cache_shrink+0x13/0x20 acpi_os_purge_cache+0xe/0x20 acpi_purge_cached_objects+0x21/0x6d acpi_initialize_objects+0x15/0x3b acpi_init+0x130/0x5ba do_one_initcall+0xe5/0x5b0 kernel_init_freeable+0x34f/0x3ad kernel_init+0x1e/0x140 ret_from_fork+0x22/0x30 When the kmem_cache_shrink() was called, the IPI was triggered, the ___cache_free() is called in IPI interrupt context, the local-lock or spin-lock will be acquired. On PREEMPT_RT kernel, these locks are replaced with sleepbale rt-spinlock, so the above problem is triggered. Fix it by moving the qlist_free_allfrom() from IPI interrupt context to task context when PREEMPT_RT is enabled. [akpm@linux-foundation.org: reduce ifdeffery] Link: https://lkml.kernel.org/r/20220401134649.2222485-1-qiang1.zhang@intel.comSigned-off-by: Zqiang <qiang1.zhang@intel.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Baolin Wang authored
Missed calling flush_cache_range() before removing the sharing PMD entrires, otherwise data consistence issue may be occurred on some architectures whose caches are strict and require a virtual>physical translation to exist for a virtual address. Thus add it. Now no architectures enabling PMD sharing will be affected, since they do not have a VIVT cache. That means this issue can not be happened in practice so far. Link: https://lkml.kernel.org/r/47441086affcabb6ecbe403173e9283b0d904b38.1650956489.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/419b0e777c9e6d1454dcd906e0f5b752a736d335.1650781755.git.baolin.wang@linux.alibaba.com Fixes: 6dfeaff9 ("hugetlb/userfaultfd: unshare all pmds for hugetlbfs when register wp") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
xu xin authored
Clean up the vma->vm_ops usage. Use vma_is_anonymous instead of vma->vm_ops to make it more understandable. Link: https://lkml.kernel.org/r/20220424071642.3234971-1-xu.xin16@zte.com.cnSigned-off-by: xu xin <xu.xin16@zte.com.cn> Reviewed-by: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peng Liu authored
Use more generic functions to deal with issues related to online nodes. The changes will make the code simplified. Link: https://lkml.kernel.org/r/20220429030218.644635-1-liupeng256@huawei.comSigned-off-by: Peng Liu <liupeng256@huawei.com> Suggested-by: Davidlohr Bueso <dave@stgolabs.net> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peng Liu authored
When __setup() return '0', using invalid option values causes the entire kernel boot option string to be reported as Unknown. Hugetlb calls __setup() and will return '0' when set invalid parameter string. The following phenomenon is observed: cmdline: hugepagesz=1Y hugepages=1 dmesg: HugeTLB: unsupported hugepagesz=1Y HugeTLB: hugepages=1 does not follow a valid hugepagesz, ignoring Unknown kernel command line parameters "hugepagesz=1Y hugepages=1" Since hugetlb will print warning/error information before return for invalid parameter string, just use return '1' to avoid print again. Link: https://lkml.kernel.org/r/20220413032915.251254-4-liupeng256@huawei.comSigned-off-by: Peng Liu <liupeng256@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liu Yuntao <liuyuntao10@huawei.com> Cc: Zhenguo Yao <yaozhenguo1@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peng Liu authored
Hugepages can be specified to pernode since "hugetlbfs: extend the definition of hugepages parameter to support node allocation", but the following problem is observed. Confusing behavior is observed when both 1G and 2M hugepage is set after "numa=off". cmdline hugepage settings: hugepagesz=1G hugepages=0:3,1:3 hugepagesz=2M hugepages=0:1024,1:1024 results: HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages HugeTLB registered 2.00 MiB page size, pre-allocated 1024 pages Furthermore, confusing behavior can be also observed when an invalid node behind a valid node. To fix this, never allocate any typical hugepage when an invalid parameter is received. Link: https://lkml.kernel.org/r/20220413032915.251254-3-liupeng256@huawei.com Fixes: b5389086 ("hugetlbfs: extend the definition of hugepages parameter to support node allocation") Signed-off-by: Peng Liu <liupeng256@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liu Yuntao <liuyuntao10@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Zhenguo Yao <yaozhenguo1@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peng Liu authored
Patch series "hugetlb: Fix some incorrect behavior", v3. This series fix three bugs of hugetlb: 1) Invalid use of nr_online_nodes; 2) Inconsistency between 1G hugepage and 2M hugepage; 3) Useless information in dmesg. This patch (of 4): Certain systems are designed to have sparse/discontiguous nodes. In this case, nr_online_nodes can not be used to walk through numa node. Also, a valid node may be greater than nr_online_nodes. However, in hugetlb, it is assumed that nodes are contiguous. For sparse/discontiguous nodes, the current code may treat a valid node as invalid, and will fail to allocate all hugepages on a valid node that "nid >= nr_online_nodes". As David suggested: if (tmp >= nr_online_nodes) goto invalid; Just imagine node 0 and node 2 are online, and node 1 is offline. Assuming that "node < 2" is valid is wrong. Recheck all the places that use nr_online_nodes, and repair them one by one. [liupeng256@huawei.com: v4] Link: https://lkml.kernel.org/r/20220416103526.3287348-1-liupeng256@huawei.com Link: https://lkml.kernel.org/r/20220413032915.251254-1-liupeng256@huawei.com Link: https://lkml.kernel.org/r/20220413032915.251254-2-liupeng256@huawei.com Fixes: 4178158e ("hugetlbfs: fix issue of preallocation of gigantic pages can't work") Fixes: b5389086 ("hugetlbfs: extend the definition of hugepages parameter to support node allocation") Fixes: e79ce983 ("hugetlbfs: fix a truncation issue in hugepages parameter") Fixes: f9317f77 ("hugetlb: clean up potential spectre issue warnings") Signed-off-by: Peng Liu <liupeng256@huawei.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Zhenguo Yao <yaozhenguo1@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Liu Yuntao <liuyuntao10@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christophe JAILLET authored
__add_memory_block() calls both put_device() and device_unregister() when storing the memory block into the xarray. This is incorrect because xarray doesn't take an additional reference and device_unregister() already calls put_device(). Triggering the issue looks really unlikely and its only effect should be to log a spurious warning about a ref counted issue. Link: https://lkml.kernel.org/r/d44c63d78affe844f020dc02ad6af29abc448fc4.1650611702.git.christophe.jaillet@wanadoo.fr Fixes: 4fb6eabf ("drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Scott Cheloha <cheloha@linux.vnet.ibm.com> Cc: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
It's not guaranteed that highest will be above the min_pfn. If highest is below the min_pfn, migrate_pfn and free_pfn can meet prematurely and lead to some useless work. Make sure highest is above min_pfn to avoid making a futile effort. Link: https://lkml.kernel.org/r/20220418141253.24298-13-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Charan Teja Kalla <charante@codeaurora.org> Cc: David Hildenbrand <david@redhat.com> Cc: Pintu Kumar <pintu@codeaurora.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
Since commit efe771c7 ("mm, compaction: always finish scanning of a full pageblock"), compaction will always finish scanning a pageblock. And migrate_pfn is assured to align with pageblock_nr_pages when we reach here. So we will always return COMPACT_SUCCESS if a suitable fallback is found due to the below IS_ALIGNED check of migrate_pfn. Simplify the code to make this clear and improve the readability. No functional change intended. Link: https://lkml.kernel.org/r/20220418141253.24298-12-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Charan Teja Kalla <charante@codeaurora.org> Cc: David Hildenbrand <david@redhat.com> Cc: Pintu Kumar <pintu@codeaurora.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
When compact_result indicates that the allocation should now succeed, i.e. compact_result = COMPACT_SUCCESS, compaction_zonelist_suitable should return false because there is no need to do compaction now. Link: https://lkml.kernel.org/r/20220418141253.24298-11-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Charan Teja Kalla <charante@codeaurora.org> Cc: David Hildenbrand <david@redhat.com> Cc: Pintu Kumar <pintu@codeaurora.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
It's possible that kcompactd_run could fail to run kcompactd for a hot added node and leave pgdat->kcompactd as NULL. So pgdat->kcompactd should be checked here to avoid possible NULL pointer dereference. Link: https://lkml.kernel.org/r/20220418141253.24298-10-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Charan Teja Kalla <charante@codeaurora.org> Cc: David Hildenbrand <david@redhat.com> Cc: Pintu Kumar <pintu@codeaurora.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
Since commit 282722b0 ("mm, compaction: restrict async compaction to pageblocks of same migratetype"), async direct compaction is restricted to scan the pageblocks of same migratetype. Correct the comment accordingly. Link: https://lkml.kernel.org/r/20220418141253.24298-9-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Charan Teja Kalla <charante@codeaurora.org> Cc: David Hildenbrand <david@redhat.com> Cc: Pintu Kumar <pintu@codeaurora.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
Use helper compound_nr to make use of compound_nr when CONFIG_64BIT and simplify the code a bit. Link: https://lkml.kernel.org/r/20220418141253.24298-8-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Charan Teja Kalla <charante@codeaurora.org> Cc: David Hildenbrand <david@redhat.com> Cc: Pintu Kumar <pintu@codeaurora.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
Always use COMPACT_CLUSTER_MAX here as we're doing the compaction. Minor improvements in readability. Link: https://lkml.kernel.org/r/20220418141253.24298-7-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Charan Teja Kalla <charante@codeaurora.org> Cc: David Hildenbrand <david@redhat.com> Cc: Pintu Kumar <pintu@codeaurora.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
checked_pageblock is already removed and suitable_migration_target is not rechecked under the zone lock since commit f8224aa5 ("mm, compaction: do not recheck suitable_migration_target under lock"). Correct the comment accordingly. Link: https://lkml.kernel.org/r/20220418141253.24298-6-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Charan Teja Kalla <charante@codeaurora.org> Cc: David Hildenbrand <david@redhat.com> Cc: Pintu Kumar <pintu@codeaurora.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
Since commit cf66f070 ("mm, compaction: do not consider a need to reschedule as contention"), async compaction won't abort when scheduling is needed. Correct the relevant comment accordingly. Link: https://lkml.kernel.org/r/20220418141253.24298-5-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Charan Teja Kalla <charante@codeaurora.org> Cc: David Hildenbrand <david@redhat.com> Cc: Pintu Kumar <pintu@codeaurora.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
isolate_start_pfn is unused when cc->nr_freepages ! = 0. Otherwise cc->free_pfn will overwrite it unconditionally. So we should remove this unneeded and somewhat misleading assignment. Link: https://lkml.kernel.org/r/20220418141253.24298-4-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Charan Teja Kalla <charante@codeaurora.org> Cc: David Hildenbrand <david@redhat.com> Cc: Pintu Kumar <pintu@codeaurora.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
pfn is unused in this do while loop. Remove the unneeded pfn update. Link: https://lkml.kernel.org/r/20220418141253.24298-3-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Charan Teja Kalla <charante@codeaurora.org> Cc: David Hildenbrand <david@redhat.com> Cc: Pintu Kumar <pintu@codeaurora.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
Patch series "A few cleanup and fixup patches for compaction". This series contains a few patches to clean up some obsolete comment, remove unneeded return value and so on. Also we fix the possible NULL pointer dereference. More details can be found in the respective changelogs. This patch (of 12): The return value of kcompactd_run() is unused now. Clean it up. Link: https://lkml.kernel.org/r/20220418141253.24298-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220418141253.24298-2-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc; Mel Gorman <mgorman@techsingularity.net> Cc: David Hildenbrand <david@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Pintu Kumar <pintu@codeaurora.org> Cc: Charan Teja Kalla <charante@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Yang Yang authored
Users may use ksm by calling madvise(, , MADV_MERGEABLE) when they want to save memory, it's a tradeoff by suffering delay on ksm cow. Users can get to know how much memory ksm saved by reading /sys/kernel/mm/ksm/pages_sharing, but they don't know what's the costs of ksm cow, and this is important of some delay sensitive tasks. So add ksm cow events to help users evaluate whether or how to use ksm. Also update Documentation/admin-guide/mm/ksm.rst with new added events. Link: https://lkml.kernel.org/r/20220331035616.2390805-1-yang.yang29@zte.com.cnSigned-off-by: Yang Yang <yang.yang29@zte.com.cn> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: xu xin <xu.xin16@zte.com.cn> Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Saravanan D <saravanand@fb.com> Cc: Minchan Kim <minchan@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
xu xin authored
Some applications or containers want to use KSM by calling madvise() to advise areas of address space to be MERGEABLE. But they may not know which applications are more likely to cause real merges in the deployment. If this patch is applied, it helps them know their corresponding number of merged pages, and then optimize their app code. As current KSM only counts the number of KSM merging pages(e.g. ksm_pages_sharing and ksm_pages_shared) of the whole system, we cannot see the more fine-grained KSM merging, for the upper application optimization, the merging area cannot be set easily according to the KSM page merging probability of each process. Therefore, it is necessary to add extra statistical means so that the upper level users can know the detailed KSM merging information of each process. We add a new proc file named as ksm_merging_pages under /proc/<pid>/ to indicate the involved ksm merging pages of this process. [akpm@linux-foundation.org: fix comment typo, remove BUG_ON()s] Link: https://lkml.kernel.org/r/20220325082318.2352853-1-xu.xin16@zte.com.cnSigned-off-by: xu xin <xu.xin16@zte.com.cn> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Reported-by: Zeal Robot <zealci@zte.com.cn> Cc: Kees Cook <keescook@chromium.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Ohhoon Kwon <ohoono.kwon@samsung.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Stephen Brennan <stephen.s.brennan@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Feng Tang <feng.tang@intel.com> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn> Cc: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
The stub for non_swap_entry() may not help much, because MAX_SWAPFILES has already contained all the information to decide whether a swap entry is real swap entry or pesudo ones (migrations, ...). There can be some performance influences on non_swap_entry() with below conditions all met: !CONFIG_MIGRATION && !CONFIG_MEMORY_FAILURE && !CONFIG_DEVICE_PRIVATE But that's definitely not the major config most machines will use, at the meantime it's already in a slow path of swap entry (being parsed from a swap pte), so IMHO it shouldn't be a major issue. Also according to the analysis from Alistair, somehow the stub didn't do the job right [1]. To make the code cleaner, let's drop the stub. [1] https://lore.kernel.org/lkml/8735ihbw6g.fsf@nvdebian.thelocal/ Note: the uffd-wp shmem & hugetlbfs series will need this patch to make sure swap entries work as expected with below config as spotted by Alistair: !CONFIG_MIGRATION && !CONFIG_MEMORY_FAILURE && !CONFIG_DEVICE_PRIVATE && CONFIG_PTE_MARKER (PS: this config should mostly never gonna happen, though, afaict..) Link: https://lkml.kernel.org/r/20220413191147.66645-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Joao Martins authored
Currently memmap_init_zone_device() ends up initializing 32768 pages when it only needs to initialize 128 given tail page reuse. That number is worse with 1GB compound pages, 262144 instead of 128. Update memmap_init_zone_device() to skip redundant initialization, detailed below. When a pgmap @vmemmap_shift is set, all pages are mapped at a given huge page alignment and use compound pages to describe them as opposed to a struct per 4K. With @vmemmap_shift > 0 and when struct pages are stored in ram (!altmap) most tail pages are reused. Consequently, the amount of unique struct pages is a lot smaller than the total amount of struct pages being mapped. The altmap path is left alone since it does not support memory savings based on compound pages devmap. Link: https://lkml.kernel.org/r/20220420155310.9712-6-joao.m.martins@oracle.comSigned-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Joao Martins authored
A compound devmap is a dev_pagemap with @vmemmap_shift > 0 and it means that pages are mapped at a given huge page alignment and utilize uses compound pages as opposed to order-0 pages. Take advantage of the fact that most tail pages look the same (except the first two) to minimize struct page overhead. Allocate a separate page for the vmemmap area which contains the head page and separate for the next 64 pages. The rest of the subsections then reuse this tail vmemmap page to initialize the rest of the tail pages. Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and when initializing compound devmap with big enough @vmemmap_shift (e.g. 1G PUD) it may cross multiple sections. The vmemmap code needs to consult @pgmap so that multiple sections that all map the same tail data can refer back to the first copy of that data for a given gigantic page. On compound devmaps with 2M align, this mechanism lets 6 pages be saved out of the 8 necessary PFNs necessary to set the subsection's 512 struct pages being mapped. On a 1G compound devmap it saves 4094 pages. Altmap isn't supported yet, given various restrictions in altmap pfn allocator, thus fallback to the already in use vmemmap_populate(). It is worth noting that altmap for devmap mappings was there to relieve the pressure of inordinate amounts of memmap space to map terabytes of pmem. With compound pages the motivation for altmaps for pmem gets reduced. Link: https://lkml.kernel.org/r/20220420155310.9712-5-joao.m.martins@oracle.comSigned-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Joao Martins authored
In preparation for device-dax for using hugetlbfs compound page tail deduplication technique, move the comment block explanation into a common place in Documentation/vm. Link: https://lkml.kernel.org/r/20220420155310.9712-4-joao.m.martins@oracle.comSigned-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Suggested-by: Dan Williams <dan.j.williams@intel.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Joao Martins authored
In preparation for describing a memmap with compound pages, move the actual pte population logic into a separate function vmemmap_populate_address() and have a new helper vmemmap_populate_range() walk through all base pages it needs to populate. While doing that, change the helper to use a pte_t* as return value, rather than an hardcoded errno of 0 or -ENOMEM. Link: https://lkml.kernel.org/r/20220420155310.9712-3-joao.m.martins@oracle.comSigned-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Joao Martins authored
Patch series "sparse-vmemmap: memory savings for compound devmaps (device-dax)", v9. This series minimizes 'struct page' overhead by pursuing a similar approach as Muchun Song series "Free some vmemmap pages of hugetlb page" (now merged since v5.14), but applied to devmap with @vmemmap_shift (device-dax). The vmemmap dedpulication original idea (already used in HugeTLB) is to reuse/deduplicate tail page vmemmap areas, particular the area which only describes tail pages. So a vmemmap page describes 64 struct pages, and the first page for a given ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second vmemmap page would contain only tail pages, and that's what gets reused across the rest of the subsection/section. The bigger the page size, the bigger the savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages). This is done for PMEM /specifically only/ on device-dax configured namespaces, not fsdax. In other words, a devmap with a @vmemmap_shift. In terms of savings, per 1Tb of memory, the struct page cost would go down with compound devmap: * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory) * with 1G pages we lose 40MB instead of 16G (0.0014% instead of 1.5% of total memory) The series is mostly summed up by patch 4, and to summarize what the series does: Patches 1 - 3: Minor cleanups in preparation for patch 4. Move the very nice docs of hugetlb_vmemmap.c into a Documentation/vm/ entry. Patch 4: Patch 4 is the one that takes care of the struct page savings (also referred to here as tail-page/vmemmap deduplication). Much like Muchun series, we reuse the second PTE tail page vmemmap areas across a given @vmemmap_shift On important difference though, is that contrary to the hugetlbfs series, there's no vmemmap for the area because we are late-populating it as opposed to remapping a system-ram range. IOW no freeing of pages of already initialized vmemmap like the case for hugetlbfs, which greatly simplifies the logic (besides not being arch-specific). altmap case unchanged and still goes via the vmemmap_populate(). Also adjust the newly added docs to the device-dax case. [Note that device-dax is still a little behind HugeTLB in terms of savings. I have an additional simple patch that reuses the head vmemmap page too, as a follow-up. That will double the savings and namespaces initialization.] Patch 5: Initialize fewer struct pages depending on the page size with DRAM backed struct pages -- because fewer pages are unique and most tail pages (with bigger vmemmap_shift). NVDIMM namespace bootstrap improves from ~268-358 ms to ~80-110/<1ms on 128G NVDIMMs with 2M and 1G respectivally. And struct page needed capacity will be 3.8x / 1071x smaller for 2M and 1G respectivelly. Tested on x86 with 1.5Tb of pmem (including pinning, and RDMA registration/deregistration scalability with 2M MRs) This patch (of 5): In support of using compound pages for devmap mappings, plumb the pgmap down to the vmemmap_populate implementation. Note that while altmap is retrievable from pgmap the memory hotplug code passes altmap without pgmap[*], so both need to be independently plumbed. So in addition to @altmap, pass @pgmap to sparse section populate functions namely: sparse_add_section section_activate populate_section_memmap __populate_section_memmap Passing @pgmap allows __populate_section_memmap() to both fetch the vmemmap_shift in which memmap metadata is created for and also to let sparse-vmemmap fetch pgmap ranges to co-relate to a given section and pick whether to just reuse tail pages from past onlined sections. While at it, fix the kdoc for @altmap for sparse_add_section(). [*] https://lore.kernel.org/linux-mm/20210319092635.6214-1-osalvador@suse.de/ Link: https://lkml.kernel.org/r/20220420155310.9712-1-joao.m.martins@oracle.com Link: https://lkml.kernel.org/r/20220420155310.9712-2-joao.m.martins@oracle.comSigned-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jane Chu <jane.chu@oracle.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Muchun Song authored
The word of "free" is not expressive enough to express the feature of optimizing vmemmap pages associated with each HugeTLB, rename this keywork to "optimize". In this patch , cheanup configs to make code more expressive. Link: https://lkml.kernel.org/r/20220404074652.68024-4-songmuchun@bytedance.comSigned-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Muchun Song authored
The word of "free" is not expressive enough to express the feature of optimizing vmemmap pages associated with each HugeTLB, rename this keywork to "optimize". In this patch , cheanup the static key and hugetlb_free_vmemmap_enabled() to make code more expressive. Link: https://lkml.kernel.org/r/20220404074652.68024-3-songmuchun@bytedance.comSigned-off-by: Muchun Song <songmuchun@bytedance.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Muchun Song authored
Patch series "cleanup hugetlb_vmemmap". The word of "free" is not expressive enough to express the feature of optimizing vmemmap pages associated with each HugeTLB, rename this keywork to "optimize" is more clear. In this series, cheanup related codes to make it more clear and expressive. This is suggested by David. This patch (of 3): The word of "free" is not expressive enough to express the feature of optimizing vmemmap pages associated with each HugeTLB, rename this keywork to "optimize". And some function names are prefixed with "huge_page" instead of "hugetlb", it is easily to be confused with THP. In this patch, cheanup related functions to make code more clear and expressive. Link: https://lkml.kernel.org/r/20220404074652.68024-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20220404074652.68024-2-songmuchun@bytedance.comSigned-off-by: Muchun Song <songmuchun@bytedance.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Ma Wupeng authored
Previous 0x100000 is used to check the 4G limit in find_zone_movable_pfns_for_nodes(). This is right in x86 because the page size can only be 4K. But 16K and 64K are available in arm64. So replace it with PHYS_PFN(SZ_4G). Link: https://lkml.kernel.org/r/20220414101314.1250667-8-mawupeng1@huawei.comSigned-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
When old_len == new_len, do_munmap will return -EINVAL due to len == 0. This errno will be simply ignored because of old_len != new_len check. So it is unnecessary to call do_munmap when old_len == new_len because nothing is actually done. Link: https://lkml.kernel.org/r/20220401081023.37080-1-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
Use helper mlock_future_check() to check whether it's safe to resize the locked_vm to simplify the code. Minor readability improvement. Link: https://lkml.kernel.org/r/20220322112004.27380-1-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Anshuman Khandual authored
There are no platforms left which use arch_vm_get_page_prot(). Just drop generic arch_vm_get_page_prot(). Link: https://lkml.kernel.org/r/20220414062125.609297-8-anshuman.khandual@arm.com Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-