- 04 Jul, 2024 40 commits
-
-
Donet Tom authored
Commit 1b151e24 ("block: Remove special-casing of compound pages") caused a change in behaviour when releasing the pages if the buffer does not start at the beginning of the page. This was because the calculation of the number of pages to release was incorrect. This was fixed by commit 38b43539 ("block: Fix page refcounts for unaligned buffers in __bio_release_pages()"). We pin the user buffer during direct I/O writes. If this buffer is a hugepage, bio_release_page() will unpin it and decrement all references and pin counts at ->bi_end_io. However, if any references to the hugepage remain post-I/O, the hugepage will not be freed upon unmap, leading to a memory leak. This patch verifies that a hugepage, used as a user buffer for DIO operations, is correctly freed upon unmapping, regardless of whether the offsets are aligned or unaligned w.r.t page boundary. Test Result Fail Scenario (Without the fix) -------------------------------------------------------- []# ./hugetlb_dio TAP version 13 1..4 No. Free pages before allocation : 7 No. Free pages after munmap : 7 ok 1 : Huge pages freed successfully ! No. Free pages before allocation : 7 No. Free pages after munmap : 7 ok 2 : Huge pages freed successfully ! No. Free pages before allocation : 7 No. Free pages after munmap : 7 ok 3 : Huge pages freed successfully ! No. Free pages before allocation : 7 No. Free pages after munmap : 6 not ok 4 : Huge pages not freed! Totals: pass:3 fail:1 xfail:0 xpass:0 skip:0 error:0 Test Result PASS Scenario (With the fix) --------------------------------------------------------- []#./hugetlb_dio TAP version 13 1..4 No. Free pages before allocation : 7 No. Free pages after munmap : 7 ok 1 : Huge pages freed successfully ! No. Free pages before allocation : 7 No. Free pages after munmap : 7 ok 2 : Huge pages freed successfully ! No. Free pages before allocation : 7 No. Free pages after munmap : 7 ok 3 : Huge pages freed successfully ! No. Free pages before allocation : 7 No. Free pages after munmap : 7 ok 4 : Huge pages freed successfully ! Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0 [donettom@linux.ibm.com: address review comments from Muhammad] Link: https://lkml.kernel.org/r/20240604132801.23377-1-donettom@linux.ibm.com [donettom@linux.ibm.com: add this test to run_vmtests.sh] Link: https://lkml.kernel.org/r/20240607182000.6494-1-donettom@linux.ibm.com Link: https://lkml.kernel.org/r/20240523063905.3173-1-donettom@linux.ibm.com Fixes: 38b43539 ("block: Fix page refcounts for unaligned buffers in __bio_release_pages()") Signed-off-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Co-developed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tony Battersby <tonyb@cybernetics.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jeff Johnson authored
Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in mm/zsmalloc.o Link: https://lkml.kernel.org/r/20240513-mm-md-v1-4-8c20e7d26842@quicinc.comSigned-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jeff Johnson authored
Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in mm/kfence/kfence_test.o Link: https://lkml.kernel.org/r/20240513-mm-md-v1-3-8c20e7d26842@quicinc.comSigned-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jeff Johnson authored
Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in mm/dmapool_test.o Link: https://lkml.kernel.org/r/20240513-mm-md-v1-2-8c20e7d26842@quicinc.comSigned-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jeff Johnson authored
Patch series "mm: add missing MODULE_DESCRIPTION() macros". This fixes the instances of "WARNING: modpost: missing MODULE_DESCRIPTION()" that I'm seeing in mm/. This patch (of 4): Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in mm/hwpoison-inject.o Link: https://lkml.kernel.org/r/20240513-mm-md-v1-0-8c20e7d26842@quicinc.com Link: https://lkml.kernel.org/r/20240513-mm-md-v1-1-8c20e7d26842@quicinc.comSigned-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Eric Chanudet authored
x86_64 is already using the node's cpu as maximum threads. Make that the default for all archs setting DEFERRED_STRUCT_PAGE_INIT. This returns to the behavior prior making the function arch-specific with commit ecd09650 ("mm: make deferred init's max threads arch-specific"). Setting DEFERRED_STRUCT_PAGE_INIT and testing on a few arm64 platforms shows faster deferred_init_memmap completions: | | x13s | SA8775p-ride | Ampere R137-P31 | Ampere HR330 | | | Metal, 32GB | VM, 36GB | VM, 58GB | Metal, 128GB | | | 8cpus | 8cpus | 8cpus | 32cpus | |---------|-------------|--------------|-----------------|--------------| | threads | ms (%) | ms (%) | ms (%) | ms (%) | |---------|-------------|--------------|-----------------|--------------| | 1 | 108 (0%) | 72 (0%) | 224 (0%) | 324 (0%) | | cpus | 24 (-77%) | 36 (-50%) | 40 (-82%) | 56 (-82%) | Michael Ellerman reported: : On a machine here (1TB, 40 cores, 4KB pages) the existing code gives: : : [ 0.500124] node 2 deferred pages initialised in 210ms : [ 0.515790] node 3 deferred pages initialised in 230ms : [ 0.516061] node 0 deferred pages initialised in 230ms : [ 0.516522] node 7 deferred pages initialised in 230ms : [ 0.516672] node 4 deferred pages initialised in 230ms : [ 0.516798] node 6 deferred pages initialised in 230ms : [ 0.517051] node 5 deferred pages initialised in 230ms : [ 0.523887] node 1 deferred pages initialised in 240ms : : vs with the patch: : : [ 0.379613] node 0 deferred pages initialised in 90ms : [ 0.380388] node 1 deferred pages initialised in 90ms : [ 0.380540] node 4 deferred pages initialised in 100ms : [ 0.390239] node 6 deferred pages initialised in 100ms : [ 0.390249] node 2 deferred pages initialised in 100ms : [ 0.390786] node 3 deferred pages initialised in 110ms : [ 0.396721] node 5 deferred pages initialised in 110ms : [ 0.397095] node 7 deferred pages initialised in 110ms : : Which is a nice speedup. [echanude@redhat.com: v3] Link: https://lkml.kernel.org/r/20240528185455.643227-4-echanude@redhat.com Link: https://lkml.kernel.org/r/20240522203758.626932-4-echanude@redhat.comSigned-off-by: Eric Chanudet <echanude@redhat.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Mateusz Guzik authored
Execs of dynamically linked binaries at 20-ish cores are bottlenecked on the i_mmap_rwsem semaphore, while the biggest singular contributor is free_pgd_range inducing the lock acquire back-to-back for all consecutive mappings of a given file. Tracing the count of said acquires while building the kernel shows: [1, 2) 799579 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [2, 3) 0 | | [3, 4) 3009 | | [4, 5) 3009 | | [5, 6) 326442 |@@@@@@@@@@@@@@@@@@@@@ | So in particular there were 326442 opportunities to coalesce 5 acquires into 1. Doing so increases execs per second by 4% (~50k to ~52k) when running the benchmark linked below. The lock remains the main bottleneck, I have not looked at other spots yet. Bench can be found here: http://apollo.backplane.com/DFlyMisc/doexec.c $ cc -O2 -o shared-doexec doexec.c $ ./shared-doexec $(nproc) Note this particular test makes sure binaries are separate, but the loader is shared. Stats collected on the patched kernel (+ "noinline") with: bpftrace -e 'kprobe:unlink_file_vma_batch_process { @ = lhist(((struct unlink_vma_file_batch *)arg0)->count, 0, 8, 1); }' Link: https://lkml.kernel.org/r/20240521234321.359501-1-mjguzik@gmail.comSigned-off-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jane Chu authored
While handling hwpoison in a THP page, it is possible that try_to_split_thp_page() fails. For example, when the THP page has been RDMA pinned. At this point, the kernel cannot isolate the poisoned THP page, all it could do is to send a SIGBUS to the user process with meaningful payload to give user-level recovery a chance. Link: https://lkml.kernel.org/r/20240524215306.2705454-6-jane.chu@oracle.comSigned-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <oalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jane Chu authored
Move hwpoison_filter() higher up as there is no need to spend a lot cycles only to find out later that the page is supposed to be skipped from hwpoison handling. Link: https://lkml.kernel.org/r/20240524215306.2705454-5-jane.chu@oracle.comSigned-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <oalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jane Chu authored
Added two explicit MF_MSG messages describing failure in get_hwpoison_page. Attemped to document the definition of various action names, and made a few adjustment to the action_result() calls. Link: https://lkml.kernel.org/r/20240524215306.2705454-4-jane.chu@oracle.comSigned-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <oalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jane Chu authored
The soft hwpoison injector via madvise(MADV_HWPOISON) operates in a synchrous way in a sense, the injector is also a process under test, and should it have the poisoned page mapped in its address space, it should get killed as much as in a real UE situation. Doing so align with what the madvise(2) man page says: " "This operation may result in the calling process receiving a SIGBUS and the page being unmapped." Link: https://lkml.kernel.org/r/20240524215306.2705454-3-jane.chu@oracle.comSigned-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <oalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jane Chu authored
Patch series "Enhance soft hwpoison handling and injection", v4. This series is aimed at the following enhancements: - Let one hwpoison injector, that is, madvise(MADV_HWPOISON) to behave more like as if a real UE occurred. Because the other two injectors such as hwpoison-inject and the 'einj' on x86 can't, and it seems to me we need a better simulation to real UE scenario. - For years, if the kernel is unable to unmap a hwpoisoned page, it send a SIGKILL instead of SIGBUS to prevent user process from potentially accessing the page again. But in doing so, the user process also lose important information: vaddr, for recovery. Fortunately, the kernel already has code to kill process re-accessing a hwpoisoned page, so remove the '!unmap_success' check. - Right now, if a thp page under GUP longterm pin is hwpoisoned, and kernel cannot split the thp page, memory-failure simply ignores the UE and returns. That's not ideal, it could deliver a SIGBUS with useful information for userspace recovery. This patch (of 5): For years when it comes down to kill a process due to hwpoison, a SIGBUS is delivered only if unmap has been successful. Otherwise, a SIGKILL is delivered. And the reason for that is to prevent the involved process from accessing the hwpoisoned page again. Since then a lot has changed, a hwpoisoned page is marked and upon being re-accessed, the memory-failure handler invokes kill_accessing_process() to kill the process immediately. So let's take out the '!unmap_success' factor and try to deliver SIGBUS if possible. Link: https://lkml.kernel.org/r/20240524215306.2705454-1-jane.chu@oracle.com Link: https://lkml.kernel.org/r/20240524215306.2705454-2-jane.chu@oracle.comSigned-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <oalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Bang Li authored
Let us simplify the code by update_mmu_tlb_range(). Link: https://lkml.kernel.org/r/20240522061204.117421-4-libang.li@antgroup.comSigned-off-by: Bang Li <libang.li@antgroup.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Bang Li authored
Let's make update_mmu_tlb() simply a generic wrapper around update_mmu_tlb_range(). Only the latter can now be overridden by the architecture. We can now remove __HAVE_ARCH_UPDATE_MMU_TLB as well. Link: https://lkml.kernel.org/r/20240522061204.117421-3-libang.li@antgroup.comSigned-off-by: Bang Li <libang.li@antgroup.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Lance Yang <ioworker0@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Bang Li authored
Patch series "Add update_mmu_tlb_range() to simplify code", v4. This series of commits mainly adds the update_mmu_tlb_range() to batch update tlb in an address range and implement update_mmu_tlb() using update_mmu_tlb_range(). After commit 19eaf449 ("mm: thp: support allocation of anonymous multi-size THP"), We may need to batch update tlb of a certain address range by calling update_mmu_tlb() in a loop. Using the update_mmu_tlb_range(), we can simplify the code and possibly reduce the execution of some unnecessary code in some architectures. This patch (of 3): Add update_mmu_tlb_range(), we can batch update tlb of an address range. Link: https://lkml.kernel.org/r/20240522061204.117421-1-libang.li@antgroup.com Link: https://lkml.kernel.org/r/20240522061204.117421-2-libang.li@antgroup.comSigned-off-by: Bang Li <libang.li@antgroup.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Lance Yang <ioworker0@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Dev Jain authored
Post FEAT_LPA2, the Aarch64 Linux kernel extends higher address support to 4K and 16K translation granules. To support testing this out, we need to do away with static initialization of page size, while still maintaining the nice array of testcases; this can be achieved by initializing and populating the array as a stack variable, and filling in the page size and hugepage size at runtime. Link: https://lkml.kernel.org/r/20240522070435.773918-3-dev.jain@arm.comSigned-off-by: Dev Jain <dev.jain@arm.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Dev Jain authored
Patch series "Restructure va_high_addr_switch". The va_high_addr_switch memory selftest tests out some corner cases related to allocation and page/hugepage faulting around the switch boundary. Currently, the page size and hugepage size have been statically defined. Post FEAT_LPA2, the Aarch64 Linux kernel adds support for 4k and 16k translation granules on higher addresses; we restructure the test to support the same. In addition, we avoid invocation of the binary twice, in the shell script, to reduce test noise. This patch (of 2): When invoking the binary with "--run-hugetlb" flag, the testcases involving the base page are anyways going to be run. Therefore, remove duplication by invoking the binary only once. Link: https://lkml.kernel.org/r/20240522070435.773918-1-dev.jain@arm.com Link: https://lkml.kernel.org/r/20240522070435.773918-2-dev.jain@arm.comSigned-off-by: Dev Jain <dev.jain@arm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
David Hildenbrand authored
Using insert_page() we might have previously ended up passing the zeropage into rmap code. Make sure that won't happen again. Note that we won't check the huge zeropage for now, which might still end up in RMAP code. Link: https://lkml.kernel.org/r/20240522125713.775114-4-david@redhat.comSigned-off-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
David Hildenbrand authored
For now we only get the (small) zeropage mapped to user space in four cases (excluding VM_PFNMAP mappings, such as /proc/vmstat): (1) Read page faults in anonymous VMAs (MAP_PRIVATE|MAP_ANON): do_anonymous_page() will not refcount it and map it pte_mkspecial() (2) UFFDIO_ZEROPAGE on anonymous VMA or COW mapping of shmem (MAP_PRIVATE). mfill_atomic_pte_zeropage() will not refcount it and map it pte_mkspecial(). (3) KSM in mergeable VMA (anonymous VMA or COW mapping). cmp_and_merge_page() will not refcount it and map it pte_mkspecial(). (4) FSDAX as an optimization for holes. vmf_insert_mixed()->__vm_insert_mixed() might end up calling insert_page() without CONFIG_ARCH_HAS_PTE_SPECIAL, refcounting the zeropage and not mapping it pte_mkspecial(). With CONFIG_ARCH_HAS_PTE_SPECIAL, we'll call insert_pfn() where we will not refcount it and map it pte_mkspecial(). In case (4), we might not have VM_MIXEDMAP set: while fs/fuse/dax.c sets VM_MIXEDMAP, we removed it for ext4 fsdax in commit e1fb4a08 ("dax: remove VM_MIXEDMAP for fsdax and device dax") and for XFS in commit e1fb4a08 ("dax: remove VM_MIXEDMAP for fsdax and device dax"). Without CONFIG_ARCH_HAS_PTE_SPECIAL and with VM_MIXEDMAP, vm_normal_page() would currently return the zeropage. We'll refcount the zeropage when mapping and when unmapping. Without CONFIG_ARCH_HAS_PTE_SPECIAL and without VM_MIXEDMAP, vm_normal_page() would currently refuse to return the zeropage. So we'd refcount it when mapping but not when unmapping it ... do we have fsdax without CONFIG_ARCH_HAS_PTE_SPECIAL in practice? Hard to tell. Independent of that, we should never refcount the zeropage when we might be holding that reference for a long time, because even without an accounting imbalance we might overflow the refcount. As there is interest in using the zeropage also in other VM_MIXEDMAP mappings, let's add clean support for that in the cases where it makes sense: (A) Never refcount the zeropage when mapping it: In insert_page(), special-case the zeropage, do not refcount it, and use pte_mkspecial(). Don't involve insert_pfn(), adjusting insert_page() looks cleaner than branching off to insert_pfn(). (B) Never refcount the zeropage when unmapping it: In vm_normal_page(), also don't return the zeropage in a VM_MIXEDMAP mapping without CONFIG_ARCH_HAS_PTE_SPECIAL. Add a VM_WARN_ON_ONCE() sanity check if we'd ever return the zeropage, which could happen if someone forgets to set pte_mkspecial() when mapping the zeropage. Document that. (C) Allow the zeropage only where reasonable s390x never wants the zeropage in some processes running legacy KVM guests that make use of storage keys. So disallow that. Further, using the zeropage in COW mappings is unproblematic (just what we do for other COW mappings), because FAULT_FLAG_UNSHARE can just unshare it and GUP with FOLL_LONGTERM would work as expected. Similarly, mappings that can never have writable PTEs (implying no write faults) are also not problematic, because nothing could end up mapping the PTE writable by mistake later. But in case we could have writable PTEs, we'll only allow the zeropage in FSDAX VMAs, that are incompatible with GUP and are blocked there completely. We'll always require the zeropage to be mapped with pte_special(). GUP-fast will reject the zeropage that way, but GUP-slow will allow it. (Note that GUP does not refcount the zeropage with FOLL_PIN, because there were issues with overflowing the refcount in the past). Add sanity checks to can_change_pte_writable() and wp_page_reuse(), to catch early during testing if we'd ever find a zeropage unexpectedly in code that wants to upgrade write permissions. Convert the BUG_ON in vm_mixed_ok() to an ordinary check and simply fail with VM_FAULT_SIGBUS, like we do for other sanity checks. Drop the stale comment regarding reserved pages from insert_page(). Note that: * we won't mess with VM_PFNMAP mappings for now. remap_pfn_range() and vmf_insert_pfn() would allow the zeropage in some cases and not refcount it. * vmf_insert_pfn*() will reject the zeropage in VM_MIXEDMAP mappings and we'll leave that alone for now. People can simply use one of the other interfaces. * we won't bother with the huge zeropage for now. It's never PTE-mapped and also GUP does not special-case it yet. Link: https://lkml.kernel.org/r/20240522125713.775114-3-david@redhat.comSigned-off-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
David Hildenbrand authored
Patch series "mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()", v2. There is interest in mapping zeropages via vm_insert_pages() [1] into MAP_SHARED mappings. For now, we only get zeropages in MAP_SHARED mappings via vmf_insert_mixed() from FSDAX code, and I think it's a bit shaky in some cases because we refcount the zeropage when mapping it but not necessarily always when unmapping it ... and we should actually never refcount it. It's all a bit tricky, especially how zeropages in MAP_SHARED mappings interact with GUP (FOLL_LONGTERM), mprotect(), write-faults and s390x forbidding the shared zeropage (rewrite [2] s now upstream). This series tries to take the careful approach of only allowing the zeropage where it is likely safe to use (which should cover the existing FSDAX use case and [1]), preventing that it could accidentally get mapped writable during a write fault, mprotect() etc, and preventing issues with FOLL_LONGTERM in the future with other users. Tested with a patch from Vincent that uses the zeropage in context of [1]. [1] https://lkml.kernel.org/r/20240430111354.637356-1-vdonnefort@google.com [2] https://lkml.kernel.org/r/20240411161441.910170-1-david@redhat.com This patch (of 3): We'll now also cover the case where insert_page() is called from __vm_insert_mixed(), which sounds like the right thing to do. Link: https://lkml.kernel.org/r/20240522125713.775114-2-david@redhat.comSigned-off-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Muhammad Usama Anjum authored
Check return value and return error/skip the tests. Link: https://lkml.kernel.org/r/20240520185248.1801945-1-usama.anjum@collabora.com Fixes: 46fd75d4 ("selftests: mm: add pagemap ioctl tests") Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Sidhartha Kumar authored
All users have been converted to use the folio version of these macros, we can safely remove the page based interface. Link: https://lkml.kernel.org/r/20240520224407.110062-1-sidhartha.kumar@oracle.comSigned-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kairui Song authored
Currently we use one swap_address_space for every 64M chunk to reduce lock contention, this is like having a set of smaller swap files inside one swap device. But when doing swap cache look up or insert, we are still using the offset of the whole large swap device. This is OK for correctness, as the offset (key) is unique. But Xarray is specially optimized for small indexes, it creates the radix tree levels lazily to be just enough to fit the largest key stored in one Xarray. So we are wasting tree nodes unnecessarily. For 64M chunk it should only take at most 3 levels to contain everything. But if we are using the offset from the whole swap device, the offset (key) value will be way beyond 64M, and so will the tree level. Optimize this by using a new helper swap_cache_index to get a swap entry's unique offset in its own 64M swap_address_space. I see a ~1% performance gain in benchmark and actual workload with high memory pressure. Test with `time memhog 128G` inside a 8G memcg using 128G swap (ramdisk with SWP_SYNCHRONOUS_IO dropped, tested 3 times, results are stable. The test result is similar but the improvement is smaller if SWP_SYNCHRONOUS_IO is enabled, as swap out path can never skip swap cache): Before: 6.07user 250.74system 4:17.26elapsed 99%CPU (0avgtext+0avgdata 8373376maxresident)k 0inputs+0outputs (55major+33555018minor)pagefaults 0swaps After (1.8% faster): 6.08user 246.09system 4:12.58elapsed 99%CPU (0avgtext+0avgdata 8373248maxresident)k 0inputs+0outputs (54major+33555027minor)pagefaults 0swaps Similar result with MySQL and sysbench using swap: Before: 94055.61 qps After (0.8% faster): 94834.91 qps Radix tree slab usage is also very slightly lower. Link: https://lkml.kernel.org/r/20240521175854.96038-12-ryncsn@gmail.comSigned-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kairui Song authored
There are two helpers for retrieving the index within address space for mixed usage of swap cache and page cache: - page_index - folio_index This commit drops page_index, as we have eliminated all users, and converts folio_index's helper __page_file_index to use folio to avoid the page conversion. Link: https://lkml.kernel.org/r/20240521175854.96038-11-ryncsn@gmail.comSigned-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kairui Song authored
These two helpers were useful for mixed usage of swap cache and page cache, which help retrieve the corresponding file or swap device offset of a page or folio. They were introduced in commit f981c595 ("mm: methods for teaching filesystems about PG_swapcache pages") and used in commit d56b4ddf ("nfs: teach the NFS client how to treat PG_swapcache pages"), suppose to be used with direct_IO for swap over fs. But after commit e1209d3a ("mm: introduce ->swap_rw and use it for reads from SWP_FS_OPS swap-space"), swap with direct_IO is no more, and swap cache mapping is never exposed to fs. Now we have dropped all users of page_file_offset and folio_file_pos, so they can be deleted. Link: https://lkml.kernel.org/r/20240521175854.96038-10-ryncsn@gmail.comSigned-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kairui Song authored
folio_file_pos and page_file_offset are for mixed usage of swap cache and page cache, it can't be page cache here, so introduce a new helper to get the swap offset in swap device directly. Need to include swapops.h in mm/swap.h to ensure swp_offset is always defined before use. Link: https://lkml.kernel.org/r/20240521175854.96038-9-ryncsn@gmail.comSigned-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kairui Song authored
folio_file_pos is only needed for mixed usage of page cache and swap cache, for pure page cache usage, the caller can just use folio_pos instead. After commit e1209d3a ("mm: introduce ->swap_rw and use it for reads from SWP_FS_OPS swap-space"), swap cache should never be exposed to nfs. So remove the usage of folio_file_pos in following NFS functions / helpers: - nfs_vm_page_mkwrite It's only used by nfs_file_vm_ops.page_mkwrite - trace event helper: nfs_folio_event - trace event helper: nfs_folio_event_done These two are used through DEFINE_NFS_FOLIO_EVENT and DEFINE_NFS_FOLIO_EVENT_DONE, which defined following events: - trace_nfs_aop_readpage{_done}: only called by nfs_read_folio - trace_nfs_writeback_folio: only called by nfs_wb_folio - trace_nfs_invalidate_folio: only called by nfs_invalidate_folio - trace_nfs_launder_folio_done: only called by nfs_launder_folio None of them could possibly be used on swap cache folio, nfs_read_folio only called by: .write_begin -> nfs_read_folio .read_folio nfs_wb_folio only called by nfs mapping: .release_folio -> nfs_wb_folio .launder_folio -> nfs_wb_folio .write_begin -> nfs_read_folio -> nfs_wb_folio .read_folio -> nfs_wb_folio .write_end -> nfs_update_folio -> nfs_writepage_setup -> nfs_setup_write_request -> nfs_try_to_update_request -> nfs_wb_folio .page_mkwrite -> nfs_update_folio -> nfs_writepage_setup -> nfs_setup_write_request -> nfs_try_to_update_request -> nfs_wb_folio .write_begin -> nfs_flush_incompatible -> nfs_wb_folio .page_mkwrite -> nfs_vm_page_mkwrite -> nfs_flush_incompatible -> nfs_wb_folio nfs_invalidate_folio is only called by .invalidate_folio. nfs_launder_folio is only called by .launder_folio - nfs_grow_file - nfs_update_folio nfs_grow_file is only called by nfs_update_folio, and all possible callers of them are: .write_end -> nfs_update_folio .page_mkwrite -> nfs_update_folio - nfs_wb_folio_cancel .invalidate_folio -> nfs_wb_folio_cancel Also, seeing from the swap side, swap_rw is now the only interface calling into fs, the offset info is always in iocb.ki_pos now. So we can remove all these folio_file_pos call safely. Link: https://lkml.kernel.org/r/20240521175854.96038-8-ryncsn@gmail.comSigned-off-by: Kairui Song <kasong@tencent.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kairui Song authored
folio_file_pos is only needed for mixed usage of page cache and swap cache, for pure page cache usage, the caller can just use folio_pos instead. It can't be a swap cache page here. Swap mapping may only call into fs through swap_rw and that is not supported for netfs. So just drop it and use folio_pos instead. Link: https://lkml.kernel.org/r/20240521175854.96038-7-ryncsn@gmail.comSigned-off-by: Kairui Song <kasong@tencent.com> Cc: David Howells <dhowells@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kairui Song authored
folio_file_pos is only needed for mixed usage of page cache and swap cache, for pure page cache usage, the caller can just use folio_pos instead. It can't be a swap cache page here. Swap mapping may only call into fs through swap_rw and that is not supported for afs. So just drop it and use folio_pos instead. Link: https://lkml.kernel.org/r/20240521175854.96038-6-ryncsn@gmail.comSigned-off-by: Kairui Song <kasong@tencent.com> Cc: David Howells <dhowells@redhat.com> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kairui Song authored
This function is no longer used after commit 4fa7a717 ("NFS: Fix up nfs_vm_page_mkwrite() for folios"), all users have been converted to use folio instead, just delete it to remove usage of page_index. Link: https://lkml.kernel.org/r/20240521175854.96038-5-ryncsn@gmail.comSigned-off-by: Kairui Song <kasong@tencent.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kairui Song authored
page_index is needed for mixed usage of page cache and swap cache, for pure page cache usage, the caller can just use page->index instead. It can't be a swap cache page here, so just drop it. Link: https://lkml.kernel.org/r/20240521175854.96038-4-ryncsn@gmail.comSigned-off-by: Kairui Song <kasong@tencent.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kairui Song authored
Patch series "mm/swap: clean up and optimize swap cache index", v6. Currently we use one swap_address_space for every 64M chunk to reduce lock contention, this is like having a set of smaller files inside a swap device. But when doing swap cache look up or insert, we are still using the offset of the whole large swap device. This is OK for correctness, as the offset (key) is unique. But Xarray is specially optimized for small indexes, it creates the redix tree levels lazily to be just enough to fit the largest key stored in one Xarray. So we are wasting tree nodes unnecessarily. For 64M chunk it should only take at most 3 level to contain everything. But if we are using the offset from the whole swap device, the offset (key) value will be way beyond 64M, and so will the tree level. Optimize this by reduce the swap cache search space into 64M scope. Test with `time memhog 128G` inside a 8G memcg using 128G swap (ramdisk with SWP_SYNCHRONOUS_IO dropped, tested 3 times, results are stable. The test result is similar but the improvement is smaller if SWP_SYNCHRONOUS_IO is enabled, as swap out path can never skip swap cache): Before: 6.07user 250.74system 4:17.26elapsed 99%CPU (0avgtext+0avgdata 8373376maxresident)k 0inputs+0outputs (55major+33555018minor)pagefaults 0swaps After (+1.8% faster): 6.08user 246.09system 4:12.58elapsed 99%CPU (0avgtext+0avgdata 8373248maxresident)k 0inputs+0outputs (54major+33555027minor)pagefaults 0swaps Similar result with MySQL and sysbench using swap: Before: 94055.61 qps After (+0.8% faster): 94834.91 qps There is alse a very slight drop of radix tree node slab usage: Before: 303952K After: 302224K For this series: There are multiple places that expect mixed type of pages (page cache or swap cache), eg. migration, huge memory split; There are four helpers for that: - page_index - page_file_offset - folio_index - folio_file_pos To keep the code clean and compatible, this series first cleaned up usage of them. page_file_offset and folio_file_pos are historical helpes that can be simply dropped after clean up. And page_index can be all converted to folio_index or folio->index. Then introduce two new helpers swap_cache_index and swap_dev_pos for swap. Replace swp_offset with swap_cache_index when used to retrieve folio from swap cache, and use swap_dev_pos when needed to retrieve the device position of a swap entry. This way, swap_cache_index can return the optimized value with no compatibility issue. The result is better performance and reduced LOC. Idealy, in the future, we may want to reduce SWAP_ADDRESS_SPACE_SHIFT from 14 to 12: Default Xarray chunk offset is 6, so we have 3 level trees instead of 2 level trees just for 2 extra bits. But swap cache is based on address_space struct, with 4 times more metadata sparsely distributed in memory it waste more cacheline, the performance gain from this series is almost canceled according to my test. So first, just have a cleaner seperation of offsets and smaller search space. This patch (of 10): page_index is only for mixed usage of page cache and swap cache, for pure page cache usage, the caller can just use page->index instead. It can't be a swap cache page here (being part of buffer head), so just drop it. And while we are at it, optimize the code by retrieving the offset of the buffer head within the folio directly using bh_offset, and get rid of the loop and usage of page helpers. Link: https://lkml.kernel.org/r/20240521175854.96038-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20240521175854.96038-3-ryncsn@gmail.comSuggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kemeng Shi authored
Factor out balance_wb_limits to remove repeated code [shikemeng@huaweicloud.com: add comment] Link: https://lkml.kernel.org/r/20240606033547.344376-1-shikemeng@huaweicloud.com [akpm@linux-foundation.org: s/fileds/fields/ in comment] Link: https://lkml.kernel.org/r/20240514125254.142203-9-shikemeng@huaweicloud.comSigned-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kemeng Shi authored
Factor out wb_dirty_exceeded to remove repeated code Link: https://lkml.kernel.org/r/20240514125254.142203-8-shikemeng@huaweicloud.comSigned-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kemeng Shi authored
Factor out balance_domain_limits to remove repeated code. Link: https://lkml.kernel.org/r/20240514125254.142203-7-shikemeng@huaweicloud.comSigned-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kemeng Shi authored
Factor out wb_dirty_freerun to remove more repeated freerun code. Link: https://lkml.kernel.org/r/20240514125254.142203-6-shikemeng@huaweicloud.comSigned-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kemeng Shi authored
Factor out code of freerun into new helper functions domain_poll_intv and domain_dirty_freerun to remove repeated code. Link: https://lkml.kernel.org/r/20240514125254.142203-5-shikemeng@huaweicloud.comSigned-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kemeng Shi authored
Factor out domain_over_bg_thresh from wb_over_bg_thresh to remove repeated code. Link: https://lkml.kernel.org/r/20240514125254.142203-4-shikemeng@huaweicloud.comSigned-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kemeng Shi authored
Add general function domain_dirty_avail to calculate dirty and avail for either dirty limit or background writeback in either global domain or wb domain. Link: https://lkml.kernel.org/r/20240514125254.142203-3-shikemeng@huaweicloud.comSigned-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kemeng Shi authored
Patch series "Add helper functions to remove repeated code and improve readability of cgroup writeback", v2. This series adds a lot of helpers to remove repeated code between domain and wb; dirty limit and dirty background; global domain and wb domain. The helpers also improve readability. More details can be found in the respective patches. A simple domain hierarchy is tested: global domain (> 20G) | cgroup domain1(10G) | wb1 | fio Test steps: /* make it easy to observe */ echo 300000 > /proc/sys/vm/dirty_expire_centisecs echo 3000 > /proc/sys/vm/dirty_writeback_centisecs /* create cgroup domain */ cd /sys/fs/cgroup echo "+memory +io" > cgroup.subtree_control mkdir group1 cd group1 echo 10G > memory.high echo 10G > memory.max echo $$ > cgroup.procs mkfs.ext4 -F /dev/vdb mount /dev/vdb /bdi1/ /* run fio to generate dirty pages */ fio -name test -filename=/bdi1/file -size=xxx -ioengine=libaio -bs=4K \ -iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0 When fio size is 1G, the wb is in freerun state and dirty pages are only written back when dirty inode is expired after 30 seconds. When fio size is 2G, the dirty pages keep being written back and bandwidth of fio is limited. This patch (of 8): Similar to wb_dirty_limits which calculates dirty and thresh of wb, wb_bg_dirty_limits calculates background dirty and background thresh of wb. With wb_bg_dirty_limits, we could remove repeated code in wb_over_bg_thresh. Link: https://lkml.kernel.org/r/20240514125254.142203-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20240514125254.142203-2-shikemeng@huaweicloud.comSigned-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-