- 12 Sep, 2022 40 commits
-
-
ye xingchen authored
Return the value cgwb_bdi_init() directly instead of storing it in another redundant variable. Link: https://lkml.kernel.org/r/20220826071906.252419-1-ye.xingchen@zte.com.cnSigned-off-by: ye xingchen <ye.xingchen@zte.com.cn> Reported-by: Zeal Robot <zealci@zte.com.cn> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Li Zhe authored
In commit 2f1ee091 ("Revert "mm: use early_pfn_to_nid in page_ext_init""), we call page_ext_init() after page_alloc_init_late() to avoid some panic problem. It seems that we cannot track early page allocations in current kernel even if page structure has been initialized early. This patch introduces a new boot parameter 'early_page_ext' to resolve this problem. If we pass it to the kernel, page_ext_init() will be moved up and the feature 'deferred initialization of struct pages' will be disabled to initialize the page allocator early and prevent the panic problem above. It can help us to catch early page allocations. This is useful especially when we find that the free memory value is not the same right after different kernel booting. [akpm@linux-foundation.org: fix section issue by removing __meminitdata] Link: https://lkml.kernel.org/r/20220825102714.669-1-lizhe.67@bytedance.comSigned-off-by: Li Zhe <lizhe.67@bytedance.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Mark-PK Tsai <mark-pk.tsai@mediatek.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Gerald Schaefer authored
When pud-sized hugepages were introduced for s390, the generic version of follow_huge_pud() was using pte_page() instead of pud_page(). This would be wrong for s390, see also commit 97534127 ("mm/hugetlb: use pmd_page() in follow_huge_pmd()"). Therefore, and probably because not all archs were supporting pud_page() at that time, a private version of follow_huge_pud() was added for s390, correctly using pud_page(). Since commit 3a194f3f ("mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry"), the generic version of follow_huge_pud() is now also using pud_page(), and in general behaves similar to follow_huge_pmd(). Therefore we can now switch to the generic version and get rid of the s390-specific follow_huge_pud(). Link: https://lkml.kernel.org/r/20220818135717.609eef8a@thinkpadSigned-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Haiyue Wang <haiyue.wang@intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Shakeel Butt authored
For several years, MEMCG_CHARGE_BATCH was kept at 32 but with bigger machines and the network intensive workloads requiring througput in Gbps, 32 is too small and makes the memcg charging path a bottleneck. For now, increase it to 64 for easy acceptance to 6.0. We will need to revisit this in future for ever increasing demand of higher performance. Please note that the memcg charge path drain the per-cpu memcg charge stock, so there should not be any oom behavior change. Though it does have impact on rstat flushing and high limit reclaim backoff. To evaluate the impact of this optimization, on a 72 CPUs machine, we ran the following workload in a three level of cgroup hierarchy. $ netserver -6 # 36 instances of netperf with following params $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K Results (average throughput of netperf): Without (6.0-rc1) 10482.7 Mbps With patch 17064.7 Mbps (62.7% improvement) With the patch, the throughput improved by 62.7%. Link: https://lkml.kernel.org/r/20220825000506.239406-4-shakeelb@google.comSigned-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: kernel test robot <oliver.sang@intel.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Feng Tang <feng.tang@intel.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Michal Koutný" <mkoutny@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Shakeel Butt authored
With memcg v2 enabled, memcg->memory.usage is a very hot member for the workloads doing memcg charging on multiple CPUs concurrently. Particularly the network intensive workloads. In addition, there is a false cache sharing between memory.usage and memory.high on the charge path. This patch moves the usage into a separate cacheline and move all the read most fields into separate cacheline. To evaluate the impact of this optimization, on a 72 CPUs machine, we ran the following workload in a three level of cgroup hierarchy. $ netserver -6 # 36 instances of netperf with following params $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K Results (average throughput of netperf): Without (6.0-rc1) 10482.7 Mbps With patch 12413.7 Mbps (18.4% improvement) With the patch, the throughput improved by 18.4%. One side-effect of this patch is the increase in the size of struct mem_cgroup. For example with this patch on 64 bit build, the size of struct mem_cgroup increased from 4032 bytes to 4416 bytes. However for the performance improvement, this additional size is worth it. In addition there are opportunities to reduce the size of struct mem_cgroup like deprecation of kmem and tcpmem page counters and better packing. Link: https://lkml.kernel.org/r/20220825000506.239406-3-shakeelb@google.comSigned-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: kernel test robot <oliver.sang@intel.com> Reviewed-by: Feng Tang <feng.tang@intel.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Michal Koutný" <mkoutny@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Shakeel Butt authored
Patch series "memcg: optimize charge codepath", v2. Recently Linux networking stack has moved from a very old per socket pre-charge caching to per-cpu caching to avoid pre-charge fragmentation and unwarranted OOMs. One impact of this change is that for network traffic workloads, memcg charging codepath can become a bottleneck. The kernel test robot has also reported this regression[1]. This patch series tries to improve the memcg charging for such workloads. This patch series implement three optimizations: (A) Reduce atomic ops in page counter update path. (B) Change layout of struct page_counter to eliminate false sharing between usage and high. (C) Increase the memcg charge batch to 64. To evaluate the impact of these optimizations, on a 72 CPUs machine, we ran the following workload in root memcg and then compared with scenario where the workload is run in a three level of cgroup hierarchy with top level having min and low setup appropriately. $ netserver -6 # 36 instances of netperf with following params $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K Results (average throughput of netperf): 1. root memcg 21694.8 Mbps 2. 6.0-rc1 10482.7 Mbps (-51.6%) 3. 6.0-rc1 + (A) 14542.5 Mbps (-32.9%) 4. 6.0-rc1 + (B) 12413.7 Mbps (-42.7%) 5. 6.0-rc1 + (C) 17063.7 Mbps (-21.3%) 6. 6.0-rc1 + (A+B+C) 20120.3 Mbps (-7.2%) With all three optimizations, the memcg overhead of this workload has been reduced from 51.6% to just 7.2%. [1] https://lore.kernel.org/linux-mm/20220619150456.GB34471@xsang-OptiPlex-9020/ This patch (of 3): For cgroups using low or min protections, the function propagate_protected_usage() was doing an atomic xchg() operation irrespectively. We can optimize out this atomic operation for one specific scenario where the workload is using the protection (i.e. min > 0) and the usage is above the protection (i.e. usage > min). This scenario is actually very common where the users want a part of their workload to be protected against the external reclaim. Though this optimization does introduce a race when the usage is around the protection and concurrent charges and uncharged trip it over or under the protection. In such cases, we might see lower effective protection but the subsequent charge/uncharge will correct it. To evaluate the impact of this optimization, on a 72 CPUs machine, we ran the following workload in a three level of cgroup hierarchy with top level having min and low setup appropriately to see if this optimization is effective for the mentioned case. $ netserver -6 # 36 instances of netperf with following params $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K Results (average throughput of netperf): Without (6.0-rc1) 10482.7 Mbps With patch 14542.5 Mbps (38.7% improvement) With the patch, the throughput improved by 38.7% Link: https://lkml.kernel.org/r/20220825000506.239406-1-shakeelb@google.com Link: https://lkml.kernel.org/r/20220825000506.239406-2-shakeelb@google.comSigned-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: kernel test robot <oliver.sang@intel.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Feng Tang <feng.tang@intel.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Michal Koutný" <mkoutny@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oliver Sang <oliver.sang@intel.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
David Heidelberg authored
zswap has been with us since 2013, and it's widely used in many products. Link: https://lkml.kernel.org/r/20220823152033.66682-1-david@ixit.czSigned-off-by: David Heidelberg <david@ixit.cz> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Sergey Senozhatsky authored
We do all reset operations under write lock, so we don't need to save ->disksize and ->comp to stack variables. Another thing is that ->comp is freed during zram reset, but comp pointer is not NULL-ed, so zram keeps the freed pointer value. Link: https://lkml.kernel.org/r/20220824035100.971816-1-senozhatsky@chromium.orgSigned-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Alistair Popple authored
When pinning pages with FOLL_LONGTERM check_and_migrate_movable_pages() is called to migrate pages out of zones which should not contain any longterm pinned pages. When migration succeeds all pages will have been unpinned so pinning needs to be retried. Migration can also fail, in which case the pages will also have been unpinned but the operation should not be retried. If all pages are in the correct zone nothing will be unpinned and no retry is required. The logic in check_and_migrate_movable_pages() tracks unnecessary state and the return codes for each case are difficult to follow. Refactor the code to clean this up. No behaviour change is intended. [akpm@linux-foundation.org: fix unused var warning] Link: https://lkml.kernel.org/r/19583d1df07fdcb99cfa05c265588a3fa58d1902.1661317396.git-series.apopple@nvidia.comSigned-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shigeru Yoshida <syoshida@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Alistair Popple authored
gup_flags is passed to check_and_migrate_movable_pages() so that it can call either put_page() or unpin_user_page() to drop the page reference. However check_and_migrate_movable_pages() is only called for FOLL_LONGTERM, which implies FOLL_PIN so there is no need to pass gup_flags. Link: https://lkml.kernel.org/r/d611c65a9008ff55887307df457c6c2220ad6163.1661317396.git-series.apopple@nvidia.comSigned-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shigeru Yoshida <syoshida@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Bui Quang Minh authored
In page_counter_set_max, we want to make sure the new limit is not below the concurrently-changing counter value. We read the counter and check that the limit is not below the counter before the swap. After the swap, we read the counter again and retry in case the counter is incremented as this may violate the requirement. Even though the page_counter_try_charge can see the old limit, it is guaranteed that the counter is not above the old limit after the increment. So in case the new limit is not below the old limit, the counter is guaranteed to be not above the new limit too. We can skip the retry in this case to optimize a little bit. Link: https://lkml.kernel.org/r/20220821154055.109635-1-minhquangbui99@gmail.comSigned-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Rolf Eike Beer authored
Link: https://lkml.kernel.org/r/8991525.CDJkKcVGEf@devpool047Signed-off-by: Rolf Eike Beer <eb@emlix.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Rolf Eike Beer authored
Empty PTEs are passed to the pte_entry callback, not to pte_hole. Link: https://lkml.kernel.org/r/3695521.kQq0lBPeGt@devpool047Signed-off-by: Rolf Eike Beer <eb@emlix.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Yang Shi authored
Workingset refault stats are important and useful metrics to measure how well reclaimer and swapping work and how healthy the services are, but they are just available for cgroup v2. There are still plenty users with cgroup v1, export the stats for cgroup v1. Link: https://lkml.kernel.org/r/20220816185801.651091-1-shy828301@gmail.comSigned-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kassey Li authored
It is too slow to dump all the pages, in some usage we just want to dump a given start pfn, for example: a CMA range or a single page. To speed up and save time, this change allows specifying of a start pfn by adding llseek for page_owner. Link: https://lkml.kernel.org/r/20220818022425.31056-1-quic_yingangl@quicinc.comSigned-off-by: Kassey Li <quic_yingangl@quicinc.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Baolin Wang authored
pmd_huge() is usually used to indicate a pmd level hugetlb. However a pmd mapped huge page can only be THP in damon_mkold_pmd_entry() or damon_young_pmd_entry(), so replace pmd_huge() with pmd_trans_huge() in this case to make the code more readable according to the discussion [1]. [1] https://lore.kernel.org/all/098c1480-416d-bca9-cedb-ca495df69b64@linux.alibaba.com/ Link: https://lkml.kernel.org/r/a9e010ca5d299e18d740c7c52290ecb6a014dde6.1660805030.git.baolin.wang@linux.alibaba.comSigned-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Baolin Wang authored
pmd_huge() is used to validate if the pmd entry is mapped by a huge page, also including the case of non-present (migration or hwpoisoned) pmd entry on arm64 or x86 architectures. This means that pmd_pfn() can not get the correct pfn number for a non-present pmd entry, which will cause damon_get_page() to get an incorrect page struct (also may be NULL by pfn_to_online_page()), making the access statistics incorrect. This means that the DAMON may make incorrect decision according to the incorrect statistics, for example, DAMON may can not reclaim cold page in time due to this cold page was regarded as accessed mistakenly if DAMOS_PAGEOUT operation is specified. Moreover it does not make sense that we still waste time to get the page of the non-present entry. Just treat it as not-accessed and skip it, which maintains consistency with non-present pte level entries. So add pmd entry present validation to fix the above issues. Link: https://lkml.kernel.org/r/58b1d1f5fbda7db49ca886d9ef6783e3dcbbbc98.1660805030.git.baolin.wang@linux.alibaba.com Fixes: 3f49584b ("mm/damon: implement primitives for the virtual memory address spaces") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Yin Fengwei authored
If there is private data attached to a THP, the refcount of THP will be increased and will prevent the THP from being split. Attempt to release any private data attached to the THP before attempting the split to increase the chance of splitting successfully. There was a memory failure issue hit during HW error injection testing with 5.18 kernel + xfs as rootfs. The test was killed and a system reboot was required to re-run the test. The issue was tracked down to a THP split failure caused by the memory failure not being handled. The page dump showed: [ 1785.433075] page:0000000025f9530b refcount:18 mapcount:0 mapping:000000008162eea7 index:0xa10 pfn:0x2f0200 [ 1785.443954] head:0000000025f9530b order:4 compound_mapcount:0 compound_pincount:0 [ 1785.452408] memcg:ff4247f2d28e9000 [ 1785.456304] aops:xfs_address_space_operations ino:8555182 dentry name:"baseos-filenames.solvx" [ 1785.466612] flags: 0x1000000000012036(referenced|uptodate|lru|active|private|head|node=0|zone=2) [ 1785.476514] raw: 1000000000012036 ffb9460f8bc07c08 ffb9460f8bc08408 ff4247f22e6299f8 [ 1785.485268] raw: 0000000000000a10 ff4247f194ade900 00000012ffffffff ff4247f2d28e9000 It was like the error was injected to a large folio for xfs with private data attached. With private data released before splitting the THP, the test case could be run successfully many times without rebooting the system. Link: https://lkml.kernel.org/r/20220810064907.582899-1-fengwei.yin@intel.comCo-developed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Qi Zheng authored
When the pgtable is NULL in the set_huge_zero_page(), we should not increment the count of PTE page table pages by calling mm_inc_nr_ptes(). Otherwise we may receive the following warning when the mm exits: BUG: non-zero pgtables_bytes on freeing mm Now we can't observe the above warning since only do_huge_pmd_anonymous_page() invokes set_huge_zero_page() and the pgtable can not be NULL. Therefore, instead of moving mm_inc_nr_ptes() to the non-NULL branch of pgtable, it is better to remove the redundant pgtable check directly. Link: https://lkml.kernel.org/r/20220818082748.40021-1-zhengqi.arch@bytedance.comSigned-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kefeng Wang authored
Squash the __soft_offline_page() into soft_offline_in_use_page() and kill __soft_offline_page(). [wangkefeng.wang@huawei.com: update hpage when try_to_split_thp_page() succeeds] Link: https://lkml.kernel.org/r/20220830104654.28234-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20220819033402.156519-2-wangkefeng.wang@huawei.comSigned-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kefeng Wang authored
Open-code the page_handle_poison() into soft_offline_page() and kill unneeded soft_offline_free_page(). Link: https://lkml.kernel.org/r/20220819033402.156519-1-wangkefeng.wang@huawei.comSigned-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Muchun Song authored
We can choose to copy three contiguous tail pages' content to the first three pages instead of copying one by one to simplify the code and reduce code size from 229 bytes to 63 bytes. The BUILD_BUG_ON() aims to avoid out-of-bounds accesses. Link: https://lkml.kernel.org/r/20220819035532.6189-1-songmuchun@bytedance.comSigned-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
For reserved pages, HWPoison flag will be set without increasing the page refcnt. So we shouldn't even try to unpoison these pages and thus decrease the page refcnt unexpectly. Add a PageReserved() check to filter this case out and remove the below unneeded zero page (zero page is reserved) check. Link: https://lkml.kernel.org/r/20220818130016.45313-7-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
If try_to_unmap() fails, the hwpoisoned page still resides in the address space of some processes. We should kill these processes or the hwpoisoned page might be consumed later. collect_procs() is always called to collect relevant processes now so they can be killed later if unmap fails. [linmiaohe@huawei.com: v2] Link: https://lkml.kernel.org/r/20220823032346.4260-6-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220818130016.45313-6-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
After kill_procs(), tk will be freed without being removed from the to_kill list. In the next iteration, the freed list entry in the to_kill list will be accessed, thus leading to use-after-free issue. Adding list_del() in kill_procs() to fix the issue. Link: https://lkml.kernel.org/r/20220823032346.4260-5-linmiaohe@huawei.com Fixes: c36e2024 ("mm: introduce mf_dax_kill_procs() for fsdax case") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
When hwpoison_filter() refuses to soft offline a page, the page refcnt incremented previously by MF_COUNT_INCREASED would have been consumed via get_hwpoison_page() if ret <= 0. So the put_ref_page() here will put the extra one. Remove it to fix the issue. Link: https://lkml.kernel.org/r/20220818130016.45313-4-linmiaohe@huawei.com Fixes: 9113eaf3 ("mm/memory-failure.c: add hwpoison_filter for soft offline") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
When free_raw_hwp_pages() fails its work, the refcnt of the hugetlb page would have been incremented if ret > 0. Using put_page() to fix refcnt leaking in this case. Link: https://lkml.kernel.org/r/20220818130016.45313-3-linmiaohe@huawei.com Fixes: debb6b9c3fdd ("mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
Patch series "A few fixup patches for memory-failure", v2. This series contains a few fixup patches to fix incorrect update of page refcnt, fix possible use-after-free issue and so on. More details can be found in the respective changelogs. This patch (of 6): When hwpoison_filter() refuses to hwpoison a hugetlb page, the refcnt of the page would have been incremented if res == 1. Using put_page() to fix the refcnt leaking in this case. Link: https://lkml.kernel.org/r/20220823032346.4260-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220818130016.45313-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220818130016.45313-2-linmiaohe@huawei.com Fixes: 405ce051 ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Charan Teja Kalla authored
The below is one path where race between page_ext and offline of the respective memory blocks will cause use-after-free on the access of page_ext structure. process1 process2 --------- --------- a)doing /proc/page_owner doing memory offline through offline_pages. b) PageBuddy check is failed thus proceed to get the page_owner information through page_ext access. page_ext = lookup_page_ext(page); migrate_pages(); ................. Since all pages are successfully migrated as part of the offline operation,send MEM_OFFLINE notification where for page_ext it calls: offline_page_ext()--> __free_page_ext()--> free_page_ext()--> vfree(ms->page_ext) mem_section->page_ext = NULL c) Check for the PAGE_EXT flags in the page_ext->flags access results into the use-after-free (leading to the translation faults). As mentioned above, there is really no synchronization between page_ext access and its freeing in the memory_offline. The memory offline steps(roughly) on a memory block is as below: 1) Isolate all the pages 2) while(1) try free the pages to buddy.(->free_list[MIGRATE_ISOLATE]) 3) delete the pages from this buddy list. 4) Then free page_ext.(Note: The struct page is still alive as it is freed only during hot remove of the memory which frees the memmap, which steps the user might not perform). This design leads to the state where struct page is alive but the struct page_ext is freed, where the later is ideally part of the former which just representing the page_flags (check [3] for why this design is chosen). The abovementioned race is just one example __but the problem persists in the other paths too involving page_ext->flags access(eg: page_is_idle())__. Fix all the paths where offline races with page_ext access by maintaining synchronization with rcu lock and is achieved in 3 steps: 1) Invalidate all the page_ext's of the sections of a memory block by storing a flag in the LSB of mem_section->page_ext. 2) Wait until all the existing readers to finish working with the ->page_ext's with synchronize_rcu(). Any parallel process that starts after this call will not get page_ext, through lookup_page_ext(), for the block parallel offline operation is being performed. 3) Now safely free all sections ->page_ext's of the block on which offline operation is being performed. Note: If synchronize_rcu() takes time then optimizations can be done in this path through call_rcu()[2]. Thanks to David Hildenbrand for his views/suggestions on the initial discussion[1] and Pavan kondeti for various inputs on this patch. [1] https://lore.kernel.org/linux-mm/59edde13-4167-8550-86f0-11fc67882107@quicinc.com/ [2] https://lore.kernel.org/all/a26ce299-aed1-b8ad-711e-a49e82bdd180@quicinc.com/T/#u [3] https://lore.kernel.org/all/6fa6b7aa-731e-891c-3efb-a03d6a700efa@redhat.com/ [quic_charante@quicinc.com: rename label `loop' to `ext_put_continue' per David] Link: https://lkml.kernel.org/r/1661496993-11473-1-git-send-email-quic_charante@quicinc.com Link: https://lkml.kernel.org/r/1660830600-9068-1-git-send-email-quic_charante@quicinc.comSigned-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Fernand Sieber <sieberf@amazon.com> Cc: Minchan Kim <minchan@google.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pavan Kondeti <quic_pkondeti@quicinc.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox authored
If the pages being mapped are in HIGHMEM, page_address() returns NULL. This probably wasn't noticed before because there aren't currently any architectures with HAVE_ARCH_HUGE_VMALLOC and HIGHMEM, but it's simpler to call page_to_phys() and futureproofs us against such configurations existing. Link: https://lkml.kernel.org/r/Yv6qHc6e+m7TMWhi@casper.infradead.org Fixes: 121e6f32 ("mm/vmalloc: hugepage vmalloc mappings") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
xupanda authored
Fix a spelling mistake in comment. Link: https://lkml.kernel.org/r/20220815065102.74347-1-xu.panda@zte.com.cnReported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: xupanda <xu.panda@zte.com.cn> Signed-off-by: CGEL ZTE <cgel.zte@gmail.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kefeng Wang authored
find_min_pfn_with_active_regions() is only called from free_area_init(). Open-code the PHYS_PFN(memblock_start_of_DRAM()) into free_area_init(), and kill find_min_pfn_with_active_regions(). Link: https://lkml.kernel.org/r/20220815111017.39341-1-wangkefeng.wang@huawei.comSigned-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Zi Yan authored
This Kconfig option is used by individual arch to set its desired MAX_ORDER. Rename it to reflect its actual use. Link: https://lkml.kernel.org/r/20220815143959.1511278-1-zi.yan@sent.comAcked-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: Guo Ren <guoren@kernel.org> [csky] Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Huacai Chen <chenhuacai@kernel.org> [LoongArch] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Taichi Sugaya <sugaya.taichi@socionext.com> Cc: Neil Armstrong <narmstrong@baylibre.com> Cc: Qin Jian <qinjian@cqplus1.com> Cc: Guo Ren <guoren@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Zankel <chris@zankel.net> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Alexey Romanov authored
Emails are not documentation. [akpm@linux-foundation.org: fix whitespace damage, repair doc reference] [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20220815144825.39001-1-avromanov@sberdevices.ruSigned-off-by: Alexey Romanov <avromanov@sberdevices.ru> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
If the pagetables are shared, we shouldn't copy or take references. Since src could have unshared and dst shares with another vma, huge_pte_none() is thus used to determine whether dst_pte is shared. But this check isn't reliable. A shared pte could have pte none in pagetable in fact. The page count of ptep page should be checked here in order to reliably determine whether pte is shared. [lukas.bulwahn@gmail.com: remove unused local variable dst_entry in copy_hugetlb_page_range()] Link: https://lkml.kernel.org/r/20220822082525.26071-1-lukas.bulwahn@gmail.com Link: https://lkml.kernel.org/r/20220816130553.31406-7-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
The sysfs group per_node_hstate_attr_group and hstate_demote_attr_group when h->demote_order != 0 are created in hugetlb_register_node(). But these sysfs groups are not removed when unregister the node, thus sysfs group is leaked. Using sysfs_remove_group() to fix this issue. Link: https://lkml.kernel.org/r/20220816130553.31406-6-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Fengwei Yin <fengwei.yin@intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
The memory barrier smp_wmb() is needed to make sure that preceding stores to the page contents become visible before the below set_pte_at() write. Link: https://lkml.kernel.org/r/20220816130553.31406-5-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
When huge_add_to_page_cache() fails, the page is freed directly without calling restore_reserve_on_error() to restore reserve for newly allocated pages not in page cache. Fix this by calling restore_reserve_on_error() when huge_add_to_page_cache fails. [linmiaohe@huawei.com: remove err == -EEXIST check and retry logic] Link: https://lkml.kernel.org/r/20220823030209.57434-4-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220816130553.31406-4-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
If sysfs_create_group() fails with hstate_attr_group, hstate_kobjs[hi] will be set to NULL. Then it will be passed to sysfs_create_group() if h->demote_order != 0 thus triggering WARN_ON(!kobj) check. Fix this by making sure hstate_kobjs[hi] != NULL when calling sysfs_create_group. Link: https://lkml.kernel.org/r/20220816130553.31406-3-linmiaohe@huawei.com Fixes: 79dfc695 ("hugetlb: add demote hugetlb page sysfs interfaces") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
Patch series "A few fixup patches for hugetlb". This series contains a few fixup patches to fix incorrect update of max_huge_pages, fix WARN_ON(!kobj) in sysfs_create_group() and so on. More details can be found in the respective changelogs. This patch (of 6): There should be pages_per_huge_page(h) / pages_per_huge_page(target_hstate) pages incremented for target_hstate->max_huge_pages when page is demoted. Update max_huge_pages accordingly for consistency. Link: https://lkml.kernel.org/r/20220816130553.31406-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220816130553.31406-2-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-