- 04 Jul, 2024 40 commits
-
-
John Hubbard authored
Patch series "cleanups, fixes, and progress towards avoiding "make headers"", v3. Eventually, once the build succeeds on a sufficiently old distro, the idea is to delete $(KHDR_INCLUDES) from the selftests/mm build, and then after that, from selftests/lib.mk and all of the other selftest builds. For now, this series merely achieves a clean build of selftests/mm on a not-so-old distro: Ubuntu 23.04. In other words, after this series is applied, it is possible to delete $(KHDR_INCLUDES) from selftests/mm/Makefile and the build will still succeed. 1. Add tools/uapi/asm/unistd_[32|x32|64].h files, which include definitions of __NR_mseal, and include them (indirectly) from the files that use __NR_mseal. The new files are copied from ./usr/include/asm, which is how we have agreed to do this sort of thing, see [1]. 2. Add fs.h, similarly created: it was copied directly from a snapshot of ./usr/include/linux/fs.h after running "make headers". 3. Add a few selected prctl.h values that the ksm and mdwe tests require. 4. Factor out some common code from mseal_test.c and seal_elf.c, into a new mseal_helpers.h file. 5. Remove local __NR_* definitions and checks. [1] commit e076eaca ("selftests: break the dependency upon local header files") This patch (of 6): The selftests/mm build isn't exactly "broken", according to the current documentation, which still claims that one must run "make headers", before building the kselftests. However, according to the new plan to get rid of that requirement [1], they are future-broken: attempting to build selftests/mm *without* first running "make headers" will fail due to not finding __NR_mseal. Therefore, include asm-generic/unistd.h, which has all of the system call numbers that are needed, abstracted across the various CPU arches. Some explanation in support of this "asm-generic" approach: For most user space programs, the header file inclusion behaves as per this microblaze example, which comes from David Hildenbrand (thanks!): arch/microblaze/include/asm/unistd.h -> #include <uapi/asm/unistd.h> arch/microblaze/include/uapi/asm/unistd.h -> #include <asm/unistd_32.h> -> Generated during "make headers" usr/include/asm/unistd_32.h is generated via arch/microblaze/kernel/syscalls/Makefile with the syshdr command. So we never end up including asm-generic/unistd.h directly on microblaze... [2] However, those programs are installed on a single computer that has a single set of asm and kernel headers installed. In contrast, the kselftests are quite special, because they must provide a set of user space programs that: a) Mostly avoid using the installed (distro) system header files. b) Build (and run) on all supported CPU architectures c) Occasionally use symbols that have so new that they have not yet been included in the distro's header files. Doing (a) creates a new problem: how to get a set of cross-platform headers that works in all cases. Fortunately, asm-generic headers solve that one. Which is why we need to use them here--at least, for particularly difficult headers such as unistd.h. The reason this hasn't really come up yet, is that until now, the kselftests requirement (which I'm trying to eventually remove) was that "make headers" must first be run. That allowed the selftests to get a snapshot of sufficiently new header files that looked just like (and conflict with) the installed system headers. And as an aside, this is also an improvement over past practices of simply open-coding in a single (not per-arch) definition of a new symbol, directly into the selftest code. [1] commit e076eaca ("selftests: break the dependency upon local header files") [2] https://lore.kernel.org/all/0b152bea-ccb6-403e-9c57-08ed5e828135@redhat.com/ Link: https://lkml.kernel.org/r/20240618022422.804305-1-jhubbard@nvidia.com Link: https://lkml.kernel.org/r/20240618022422.804305-2-jhubbard@nvidia.com Fixes: 4926c7a5 ("selftest mm/mseal memory sealing") Signed-off-by: John Hubbard <jhubbard@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rich Felker <dalias@libc.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Yosry Ahmed authored
Commit [1] introduced IO polling support duding swapin to reduce swap read latency for block devices that can be polled. However later commit [2] removed polling support. Commit [3] removed the remnants of polling support from read_swap_cache_async() and __read_swap_cache_async(). However, it left behind some remnants in swap_read_folio(), the 'synchronous' argument. swap_read_folio() reads the folio synchronously if synchronous=true or if SWP_SYNCHRONOUS_IO is set in swap_info_struct. The only caller that passes synchronous=true is in do_swap_page() in the SWP_SYNCHRONOUS_IO case. Hence, the argument is redundant, it is only set to true when the swap read would have been synchronous anyway. Remove it. [1] Commit 23955622 ("swap: add block io poll in swapin path") [2] Commit 9650b453 ("block: ignore RWF_HIPRI hint for sync dio") [3] Commit b243dcbf ("swap: remove remnants of polling from read_swap_cache_async") Link: https://lkml.kernel.org/r/20240607045515.1836558-1-yosryahmed@google.comSigned-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
David Hildenbrand authored
It looks rather weird that totalhigh_pages() returns an "unsigned long" but nr_free_highpages() returns an "unsigned int". Let's return an "unsigned long" from nr_free_highpages() to be consistent. While at it, use a plain "0" instead of a "0UL" in the !CONFIG_HIGHMEM totalhigh_pages() implementation, to make these look alike as well. Link: https://lkml.kernel.org/r/20240607083711.62833-3-david@redhat.comSigned-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
David Hildenbrand authored
Patch series "mm/highmem: don't track highmem pages manually". Let's remove highmem special-casing from adjust_managed_page_count(), to result in less confusion why memblock manually adjusts totalram_pages, and __free_pages_core() only adjusts the zone's managed pages -- what about the highmem pages that adjust_managed_page_count() updates? Now, we only maintain totalram_pages and a zone's managed pages independent of highmem support. We can derive the number of highmem pages simply by looking at the relevant zone's managed pages. I don't think there is any particular fast path that needs a maximum-efficient totalhigh_pages() implementation. Note that highmem memory is currently initialized using free_highmem_page()->free_reserved_page(), not __free_pages_core(). In the future we might want to also use __free_pages_core() to initialize highmem memory, to make that less special, and consider moving totalram_pages updates into __free_pages_core() [1], so we can just use adjust_managed_page_count() in there as well. Booting a simple kernel in QEMU reveals no highmem accounting change: Before: Memory: 3095448K/3145208K available (14802K kernel code, 2073K rwdata, 5000K rodata, 740K init, 556K bss, 49760K reserved, 0K cma-reserved, 2244488K highmem) After: Memory: 3095276K/3145208K available (14802K kernel code, 2073K rwdata, 5000K rodata, 740K init, 556K bss, 49932K reserved, 0K cma-reserved, 2244488K highmem) [1] https://lkml.kernel.org/r/20240601133402.2675-1-richard.weiyang@gmail.com This patch (of 2): Can we get rid of the highmem ifdef in adjust_managed_page_count()? Likely yes: we don't have that many totalhigh_pages() users, and they all don't seem to be very performance critical. So let's implement totalhigh_pages() like nr_free_highpages(), collecting information from all zones. This is now similar to what we do in si_meminfo_node() to collect the per-node highmem page count. In the common case (single node, 3-4 zones), we really shouldn't care. We could optimize a bit further (only walk ZONE_HIGHMEM and ZONE_MOVABLE if required), but there doesn't seem a real need for that. [david@redhat.com: fix build bot complaint] Link: https://lkml.kernel.org/r/b57e5bc4-eb72-40e3-add4-57dfa6e03df6@redhat.com Link: https://lkml.kernel.org/r/20240607083711.62833-1-david@redhat.com Link: https://lkml.kernel.org/r/20240607083711.62833-2-david@redhat.comSigned-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
David Hildenbrand authored
That example was added in 2008. In 2015, we restricted access to the PFNs in the pagemap to CAP_SYS_ADMIN, making that approach quite less usable. It's 2024 now, and using that racy and low-lewel mechanism to calculate the USS should not be considered a good example anymore. /proc/$pid/smaps and /proc/$pid/smaps_rollup can do a much better job without any of that low-level handling. Let's just drop that example. Link: https://lkml.kernel.org/r/20240607122357.115423-7-david@redhat.comSigned-off-by: David Hildenbrand <david@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Oscar Salvador <osalvador@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
David Hildenbrand authored
... and rename it to folio_precise_page_mapcount(). fs/proc is the last remaining user, and that should stay that way. While at it, cleanup kpagecount_read() a bit: there are still some legacy leftovers -- when the interface was introduced it returned the page refcount, but was changed briefly afterwards to return the page mapcount. Further, some simple folio conversion. Once we stop using the per-page mapcounts of large folios, all folio_precise_page_mapcount() users will have to implement an alternative way to achieve what they are trying to achieve, possibly in a less precise way. [dan.carpenter@linaro.org: fix uninitialized variable in pagemap_pmd_range()] Link: https://lkml.kernel.org/r/9d6eaba7-92f8-4a70-8765-38a519680a87@moroto.mountain Link: https://lkml.kernel.org/r/20240607122357.115423-6-david@redhat.comSigned-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Oscar Salvador <osalvador@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
David Hildenbrand authored
We currently rely on mapcount information for pages referenced by non-present entries to calculate the USS (shared vs. private) and the PSS. However, relying on mapcounts for non-present entries doesn't make any sense. We have to treat such entries as "maybe shared, but no idea how often", implying that they will *not* get accounted towards the USS, and will get fully accounted to the PSS (no idea how often shared). There is one exception: device exclusive entries essentially behave like present entries (e.g., mapcount incremented). In smaps_pmd_entry(), use is_pfn_swap_entry() instead of is_migration_entry(), which should not make a real difference but makes the code look more similar to the PTE variant. While at it, adjust the comments in smaps_account(). Link: https://lkml.kernel.org/r/20240607122357.115423-5-david@redhat.comSigned-off-by: David Hildenbrand <david@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Oscar Salvador <osalvador@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
David Hildenbrand authored
We added PM_MMAP_EXCLUSIVE in 2015 via commit 77bb499b ("pagemap: add mmap-exclusive bit for marking pages mapped only here"), when THPs could not be partially mapped and page_mapcount() returned something that was true for all pages of the THP. In 2016, we added support for partially mapping THPs via commit 53f9263b ("mm: rework mapcount accounting to enable 4k mapping of THPs") but missed to determine PM_MMAP_EXCLUSIVE as well per page. Checking page_mapcount() on the head page does not tell the whole story. We should check each individual page. In a future without per-page mapcounts it will be different, but we'll change that to be consistent with PTE-mapped THPs once we deal with that. Link: https://lkml.kernel.org/r/20240607122357.115423-4-david@redhat.com Fixes: 53f9263b ("mm: rework mapcount accounting to enable 4k mapping of THPs") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
David Hildenbrand authored
Relying on the mapcount for non-present PTEs that reference pages doesn't make any sense: they are not accounted in the mapcount, so page_mapcount() == 1 won't return the result we actually want to know. While we don't check the mapcount for migration entries already, we could end up checking it for swap, hwpoison, device exclusive, ... entries, which we really shouldn't. There is one exception: device private entries, which we consider fake-present (e.g., incremented the mapcount). But we won't care about that for now for PM_MMAP_EXCLUSIVE, because indicating PM_SWAP for them although they are fake-present already sounds suspiciously wrong. Let's never indicate PM_MMAP_EXCLUSIVE without PM_PRESENT. Link: https://lkml.kernel.org/r/20240607122357.115423-3-david@redhat.comSigned-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
David Hildenbrand authored
Patch series "fs/proc: move page_mapcount() to fs/proc/internal.h". With all other page_mapcount() users in the tree gone, move page_mapcount() to fs/proc/internal.h, rename it and extend the documentation to prevent future (ab)use. ... of course, I find some issues while working on that code that I sort first ;) We'll now only end up calling page_mapcount() [now folio_precise_page_mapcount()] on pages mapped via present page table entries. Except for /proc/kpagecount, that still does questionable things, but we'll leave that legacy interface as is for now. Did a quick sanity check. Likely we would want some better selfestest for /proc/$/pagemap + smaps. I'll see if I can find some time to write some more. This patch (of 6): Looks like we never taught pagemap_pmd_range() about the existence of PMD-mapped file THPs. Seems to date back to the times when we first added support for non-anon THPs in the form of shmem THP. Link: https://lkml.kernel.org/r/20240607122357.115423-1-david@redhat.com Link: https://lkml.kernel.org/r/20240607122357.115423-2-david@redhat.comSigned-off-by: David Hildenbrand <david@redhat.com> Fixes: 800d8c63 ("shmem: add huge pages support") Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jeff Johnson authored
make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_hmm.o Add the missing invocation of the MODULE_DESCRIPTION() macro. Link: https://lkml.kernel.org/r/20240531-lib-md-test_hmm-v1-1-e4aa17daa57b@quicinc.comSigned-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jeff Johnson authored
make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_maple_tree.o Add the missing invocation of the MODULE_DESCRIPTION() macro. Link: https://lkml.kernel.org/r/20240531-md-lib-test_maple_tree-v1-1-7b1b485aeec3@quicinc.comSigned-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jeff Johnson authored
make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_ubsan.o Add the missing invocation of the MODULE_DESCRIPTION() macro. Link: https://lkml.kernel.org/r/20240531-md-lib-test_ubsan-v1-1-c2a80d258842@quicinc.comSigned-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jeff Johnson authored
make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_xarray.o Add the missing invocation of the MODULE_DESCRIPTION() macro. Link: https://lkml.kernel.org/r/20240531-md-lib-test_xarray-v1-1-42fd6833bdd4@quicinc.comSigned-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Barry Song authored
After swapping out, we perform a swap-in operation. If we first read and then write, we encounter a major fault in do_swap_page for reading, along with additional minor faults in do_wp_page for writing. However, the latter appears to be unnecessary and inefficient. Instead, we can directly reuse in do_swap_page and completely eliminate the need for do_wp_page. This patch achieves that optimization specifically for exclusive folios. The following microbenchmark demonstrates the significant reduction in minor faults. #define DATA_SIZE (2UL * 1024 * 1024) #define PAGE_SIZE (4UL * 1024) static void *read_write_data(char *addr) { char tmp; for (int i = 0; i < DATA_SIZE; i += PAGE_SIZE) { tmp = *(volatile char *)(addr + i); *(volatile char *)(addr + i) = tmp; } } int main(int argc, char **argv) { struct rusage ru; char *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); memset(addr, 0x11, DATA_SIZE); do { long old_ru_minflt, old_ru_majflt; long new_ru_minflt, new_ru_majflt; madvise(addr, DATA_SIZE, MADV_PAGEOUT); getrusage(RUSAGE_SELF, &ru); old_ru_minflt = ru.ru_minflt; old_ru_majflt = ru.ru_majflt; read_write_data(addr); getrusage(RUSAGE_SELF, &ru); new_ru_minflt = ru.ru_minflt; new_ru_majflt = ru.ru_majflt; printf("minor faults:%ld major faults:%ld\n", new_ru_minflt - old_ru_minflt, new_ru_majflt - old_ru_majflt); } while(0); return 0; } w/o patch, / # ~/a.out minor faults:512 major faults:512 w/ patch, / # ~/a.out minor faults:0 major faults:512 Minor faults decrease to 0! Link: https://lkml.kernel.org/r/20240602004502.26895-1-21cnbao@gmail.comSigned-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jonathan Cameron authored
The call for memblock_phys_free() in try_remove_memory() does not balance any call to memblock_alloc() (or memblock_reserve() for that matter). There are no memblock_reserve() calls in mm/memory_hotplug.c, no memblock allocations possible after mm_core_init(), and even if memblock_add_node() called from add_memory_resource() would need to allocate memory, that memory would ba allocated from slab. The patch f9126ab9 ("memory-hotplug: fix wrong edge when hot add a new node") that introduced that call to memblock_free() does not provide adequate description why that was required and tinkering with memblock in the context of memory hotplug on x86 seems bogus because x86 never kept memblock after boot anyway. Drop memblock_phys_free() call in try_remove_memory(). [rppt@kernel.org: rewrite the commit message] Link: https://lkml.kernel.org/r/20240605082049.973242-1-rppt@kernel.orgSigned-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jeff Johnson authored
make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in samples/kmemleak/kmemleak-test.o Add the missing invocation of the MODULE_DESCRIPTION() macro. Link: https://lkml.kernel.org/r/20240601-md-samples-kmemleak-v1-1-47186be7f0a8@quicinc.comSigned-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Baolin Wang authored
Add mTHP counters for anonymous shmem. [baolin.wang@linux.alibaba.com: update Documentation/admin-guide/mm/transhuge.rst] Link: https://lkml.kernel.org/r/d86e2e7f-4141-432b-b2ba-c6691f36ef0b@linux.alibaba.com Link: https://lkml.kernel.org/r/4fd9e467d49ae4a747e428bcd821c7d13125ae67.1718090413.git.baolin.wang@linux.alibaba.comSigned-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Baolin Wang authored
Although the top-level hugepage allocation can be turned off, anonymous shmem can still use mTHP by configuring the sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled'. Therefore, add alignment for mTHP size to provide a suitable alignment address in shmem_get_unmapped_area(). Link: https://lkml.kernel.org/r/0c549b57cf7db07503af692d8546ecfad0fcce52.1718090413.git.baolin.wang@linux.alibaba.comSigned-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Lance Yang <ioworker0@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Baolin Wang authored
Commit 19eaf449 adds multi-size THP (mTHP) for anonymous pages, that can allow THP to be configured through the sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'. However, the anonymous shmem will ignore the anonymous mTHP rule configured through the sysfs interface, and can only use the PMD-mapped THP, that is not reasonable. Users expect to apply the mTHP rule for all anonymous pages, including the anonymous shmem, in order to enjoy the benefits of mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc. In addition, the mTHP interfaces can be extended to support all shmem/tmpfs scenarios in the future, especially for the shmem mmap() case. The primary strategy is similar to supporting anonymous mTHP. Introduce a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled', which can have almost the same values as the top-level '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new additional "inherit" option and dropping the testing options 'force' and 'deny'. By default all sizes will be set to "never" except PMD size, which is set to "inherit". This ensures backward compatibility with the anonymous shmem enabled of the top level, meanwhile also allows independent control of anonymous shmem enabled for each mTHP. Link: https://lkml.kernel.org/r/65796c1e72e51e15f3410195b5c2d5b6c160d411.1718090413.git.baolin.wang@linux.alibaba.comSigned-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Baolin Wang authored
To support the use of mTHP with anonymous shmem, add a new sysfs interface 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/' directory for each mTHP to control whether shmem is enabled for that mTHP, with a value similar to the top level 'shmem_enabled', which can be set to: "always", "inherit (to inherit the top level setting)", "within_size", "advise", "never". An 'inherit' option is added to ensure compatibility with these global settings, and the options 'force' and 'deny' are dropped, which are rather testing artifacts from the old ages. By default, PMD-sized hugepages have enabled="inherit" and all other hugepage sizes have enabled="never" for '/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/shmem_enabled'. In addition, if top level value is 'force', then only PMD-sized hugepages have enabled="inherit", otherwise configuration will be failed and vice versa. That means now we will avoid using non-PMD sized THP to override the global huge allocation. [baolin.wang@linux.alibaba.com: fix transhuge.rst indentation] Link: https://lkml.kernel.org/r/b189d815-998b-4dfd-ba89-218ff51313f8@linux.alibaba.com [akpm@linux-foundation.org: reflow transhuge.rst addition to 80 cols] [baolin.wang@linux.alibaba.com: move huge_shmem_orders_lock under CONFIG_SYSFS] Link: https://lkml.kernel.org/r/eb34da66-7f12-44f3-a39e-2bcc90c33354@linux.alibaba.com [akpm@linux-foundation.org: huge_memory.c needs mm_types.h] Link: https://lkml.kernel.org/r/ffddfa8b3cb4266ff963099ab78cfd7184c57ac7.1718090413.git.baolin.wang@linux.alibaba.comSigned-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Baolin Wang authored
In order to extend support for mTHP, add THP validation for PMD-mapped THP related statistics to avoid statistical confusion. Link: https://lkml.kernel.org/r/c4b04cbd51e6951cc2436a87be8eaa4a1516faec.1718090413.git.baolin.wang@linux.alibaba.comSigned-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <v-songbaohua@oppo.com> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Baolin Wang authored
Patch series "add mTHP support for anonymous shmem", v5. Anonymous pages have already been supported for multi-size (mTHP) allocation through commit 19eaf449, that can allow THP to be configured through the sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'. However, the anonymous shmem will ignore the anonymous mTHP rule configured through the sysfs interface, and can only use the PMD-mapped THP, that is not reasonable. Many implement anonymous page sharing through mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios, therefore, users expect to apply an unified mTHP strategy for anonymous pages, also including the anonymous shared pages, in order to enjoy the benefits of mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc. As discussed in the bi-weekly MM meeting[1], the mTHP controls should control all of shmem, not only anonymous shmem, but support will be added iteratively. Therefore, this patch set starts with support for anonymous shmem. The primary strategy is similar to supporting anonymous mTHP. Introduce a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled', which can have almost the same values as the top-level '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new additional "inherit" option and dropping the testing options 'force' and 'deny'. By default all sizes will be set to "never" except PMD size, which is set to "inherit". This ensures backward compatibility with the anonymous shmem enabled of the top level, meanwhile also allows independent control of anonymous shmem enabled for each mTHP. Use the page fault latency tool to measure the performance of 1G anonymous shmem with 32 threads on my machine environment with: ARM64 Architecture, 32 cores, 125G memory: base: mm-unstable user-time sys_time faults_per_sec_per_cpu faults_per_sec 0.04s 3.10s 83516.416 2669684.890 mm-unstable + patchset, anon shmem mTHP disabled user-time sys_time faults_per_sec_per_cpu faults_per_sec 0.02s 3.14s 82936.359 2630746.027 mm-unstable + patchset, anon shmem 64K mTHP enabled user-time sys_time faults_per_sec_per_cpu faults_per_sec 0.08s 0.31s 678630.231 17082522.495 From the data above, it is observed that the patchset has a minimal impact when mTHP is not enabled (some fluctuations observed during testing). When enabling 64K mTHP, there is a significant improvement of the page fault latency. [1] https://lore.kernel.org/all/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com/ This patch (of 6): Add large folio mapping establishment support for finish_fault() as a preparation, to support multi-size THP allocation of anonymous shmem pages in the following patches. Keep the same behavior (per-page fault) for non-anon shmem to avoid inflating the RSS unintentionally, and we can discuss what size of mapping to build when extending mTHP to control non-anon shmem in the future. [baolin.wang@linux.alibaba.com: avoid going beyond the PMD pagetable size] Link: https://lkml.kernel.org/r/b0e6a8b1-a32c-459e-ae67-fde5d28773e6@linux.alibaba.com [baolin.wang@linux.alibaba.com: use 'PTRS_PER_PTE' instead of 'PTRS_PER_PTE - 1'] Link: https://lkml.kernel.org/r/e1f5767a-2c9b-4e37-afe6-1de26fe54e41@linux.alibaba.com Link: https://lkml.kernel.org/r/cover.1718090413.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/3a190892355989d42f59cf9f2f98b94694b0d24d.1718090413.git.baolin.wang@linux.alibaba.comSigned-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Nobody checks the error flag any more, so setting it accomplishes nothing. Remove the obsolete parts of this comment; it hasn't been true since errseq_t was used to track writeback errors in 2017. Link: https://lkml.kernel.org/r/20240531032938.2712870-1-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Huang Ying authored
Replace PFNs with addresses in readahead window calculation. This simplified the logic and reduce the code line number. No functionality change is expected. Link: https://lkml.kernel.org/r/20240531081230.310128-4-ying.huang@intel.comSigned-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Huang Ying authored
When VMA based swap readahead is introduced in commit ec560175 ("mm, swap: VMA based swap readahead"), "struct vma_swap_readahead" is defined to describe the readahead window. Because we wanted to save the PTE entries in the struct at that time. But after commit 4f8fcf4c ("mm/swap: swap_vma_readahead() do the pte_offset_map()"), we no longer save PTE entries in the struct. The size of the struct becomes so small, that it's better to use the fields of the struct directly. This can simplify the code to improve the code readability. The line number of source code reduces too. No functionality change is expected in this patch. Link: https://lkml.kernel.org/r/20240531081230.310128-3-ying.huang@intel.comSigned-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Huang Ying authored
Patch series "mm,swap: cleanup VMA based swap readahead window calculation". When VMA based swap readahead is introduced in commit ec560175 ("mm, swap: VMA based swap readahead"), "struct vma_swap_readahead" is defined to describe the readahead window. Because we wanted to save the PTE entries in the struct at that time. But after commit 4f8fcf4c ("mm/swap: swap_vma_readahead() do the pte_offset_map()"), we no longer save PTE entries in the struct. The size of the struct becomes so small, that it's better to use the fields of the struct directly. This can simplify the code to improve the code readability. The line number of source code reduces too. A theoretical underflow issue and some related code cleanup is done in the series too. This patch (of 3): In swap readahead window calculation, if the fault PFN is smaller than the readahead window size, underflow may occurs. This is only possible in theory, because the start of the virtual address space will not be used for anonymous pages in practice. Even if underflow occurs, there will be no functional bugs. In the worst cases, some swap entries may be swapped in incorrectly and some pages may be allocate on the wrong nodes. Anyway, we still needs to fix the issue via some underflow checking. Link: https://lkml.kernel.org/r/20240531081230.310128-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20240531081230.310128-2-ying.huang@intel.com Fixes: ec560175 ("mm, swap: VMA based swap readahead") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jiapeng Chong authored
Use existing swap() function rather than duplicating its implementation. ./mm/userfaultfd.c:1006:13-14: WARNING opportunity for swap() Link: https://lkml.kernel.org/r/20240531091643.67778-1-jiapeng.chong@linux.alibaba.comSigned-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9266Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Dev Jain authored
Consistently name the return variable with an _nr suffix, whenever calling pfn_to_section_nr(), to avoid confusion with a (struct mem_section *). Link: https://lkml.kernel.org/r/20240531124144.240399-1-dev.jain@arm.comSigned-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Oscar Salvador authored
In free_pagetable() we use the non-atomic version for clearing the PageReserved bit from the page. free_pagetable() will either call free_reserved_page() or put_page_bootmem(), which will eventually end up calling free_reserved_page(), and in there we already clear the PageReserved flag. Link: https://lkml.kernel.org/r/20240527044523.29207-1-osalvador@suse.deSigned-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
pxx_huge() has been removed in recent commit 9636f055 ("mm/treewide: remove pXd_huge()"), however there are still three comments referencing the API that got overlooked. Remove them. Link: https://lkml.kernel.org/r/20240527154855.528816-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Brian Johannesmeyer authored
Add a regression test to ensure that kmsan_unpoison_memory() works the same as an unpoisoning operation added by the instrumentation. The test has two subtests: one that checks the instrumentation, and one that checks kmsan_unpoison_memory(). Each subtest initializes the first byte of a 4-byte buffer, then checks that the other 3 bytes are uninitialized. [glider@google.com: change description, remove comment about failing test case] Link: https://lkml.kernel.org/r/20240528104807.738758-2-glider@google.comSigned-off-by: Brian Johannesmeyer <bjohannesmeyer@gmail.com> Link: https://lore.kernel.org/lkml/20240524232804.1984355-1-bjohannesmeyer@gmail.com/T/Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Uros Bizjak authored
Use __this_cpu_try_cmpxchg() instead of __this_cpu_cmpxchg (*ptr, old, new) == old in preload_this_cpu_lock(). x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg. The generated code improves from: 4bb6: 48 85 f6 test %rsi,%rsi 4bb9: 0f 84 10 fa ff ff je 45cf <...> 4bbf: 4c 89 e8 mov %r13,%rax 4bc2: 65 48 0f b1 35 00 00 cmpxchg %rsi,%gs:0x0(%rip) 4bc9: 00 00 4bcb: 48 85 c0 test %rax,%rax 4bce: 0f 84 fb f9 ff ff je 45cf <...> to: 4bb6: 48 85 f6 test %rsi,%rsi 4bb9: 0f 84 10 fa ff ff je 45cf <...> 4bbf: 4c 89 e8 mov %r13,%rax 4bc2: 65 48 0f b1 35 00 00 cmpxchg %rsi,%gs:0x0(%rip) 4bc9: 00 00 4bcb: 0f 84 fe f9 ff ff je 45cf <...> No functional change intended. Link: https://lkml.kernel.org/r/20240528144345.5980-2-ubizjak@gmail.comSigned-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Uros Bizjak authored
Add __this_cpu_try_cmpxchg() version of the percpu op. Link: https://lkml.kernel.org/r/20240528144345.5980-1-ubizjak@gmail.comSigned-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Shakeel Butt authored
Kernel test robot reported [1] performance regression for will-it-scale test suite's page_fault2 test case for the commit 70a64b79 ("memcg: dynamically allocate lruvec_stats"). After inspection it seems like the commit has unintentionally introduced false cache sharing. After the commit the fields of mem_cgroup_per_node which get read on the performance critical path share the cacheline with the fields which get updated often on LRU page allocations or deallocations. This has caused contention on that cacheline and the workloads which manipulates a lot of LRU pages are regressed as reported by the test report. The solution is to rearrange the fields of mem_cgroup_per_node such that the false sharing is eliminated. Let's move all the read only pointers at the start of the struct, followed by memcg-v1 only fields and at the end fields which get updated often. Experiment setup: Ran fallocate1, fallocate2, page_fault1, page_fault2 and page_fault3 from the will-it-scale test suite inside a three level memcg with /tmp mounted as tmpfs on two different machines, one a single numa node and the other one, two node machine. $ ./[testcase]_processes -t $NR_CPUS -s 50 Results for single node, 52 CPU machine: Testcase base with-patch fallocate1 1031081 1431291 (38.80 %) fallocate2 1029993 1421421 (38.00 %) page_fault1 2269440 3405788 (50.07 %) page_fault2 2375799 3572868 (50.30 %) page_fault3 28641143 28673950 ( 0.11 %) Results for dual node, 80 CPU machine: Testcase base with-patch fallocate1 2976288 3641185 (22.33 %) fallocate2 2979366 3638181 (22.11 %) page_fault1 6221790 7748245 (24.53 %) page_fault2 6482854 7847698 (21.05 %) page_fault3 28804324 28991870 ( 0.65 %) Link: https://lkml.kernel.org/r/20240528164050.2625718-1-shakeel.butt@linux.dev Fixes: 70a64b79 ("memcg: dynamically allocate lruvec_stats") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reported-by: kernel test robot <oliver.sang@intel.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Feng Tang <feng.tang@intel.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Sidhartha Kumar authored
By using a folio in scan_movable_pages() we convert the last user of the page-based hugetlb information macro functions to the folio version. After this conversion, we can safely remove the page-based definitions from include/linux/hugetlb.h. Link: https://lkml.kernel.org/r/20240530171427.242018-1-sidhartha.kumar@oracle.comSigned-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Chuanhua Han authored
When a large folio is found in the swapcache, the current implementation requires calling do_swap_page() nr_pages times, resulting in nr_pages page faults. This patch opts to map the entire large folio at once to minimize page faults. Additionally, redundant checks and early exits for ARM64 MTE restoring are removed. Link: https://lkml.kernel.org/r/20240529082824.150954-7-21cnbao@gmail.comSigned-off-by: Chuanhua Han <hanchuanhua@oppo.com> Co-developed-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Gao Xiang <xiang@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Len Brown <len.brown@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Chuanhua Han authored
The function should_try_to_free_swap() operates under the assumption that swap-in always occurs at the normal page granularity, i.e., folio_nr_pages() = 1. However, in reality, for large folios, add_to_swap_cache() will invoke folio_ref_add(folio, nr). To accommodate large folio swap-in, this patch eliminates this assumption. Link: https://lkml.kernel.org/r/20240529082824.150954-6-21cnbao@gmail.comSigned-off-by: Chuanhua Han <hanchuanhua@oppo.com> Co-developed-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Gao Xiang <xiang@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Len Brown <len.brown@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Barry Song authored
Should do_swap_page() have the capability to directly map a large folio, metadata restoration becomes necessary for a specified number of pages denoted as nr. It's important to highlight that metadata restoration is solely required by the SPARC platform, which, however, does not enable THP_SWAP. Consequently, in the present kernel configuration, there exists no practical scenario where users necessitate the restoration of nr metadata. Platforms implementing THP_SWAP might invoke this function with nr values exceeding 1, subsequent to do_swap_page() successfully mapping an entire large folio. Nonetheless, their arch_do_swap_page_nr() functions remain empty. Link: https://lkml.kernel.org/r/20240529082824.150954-5-21cnbao@gmail.comSigned-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Len Brown <len.brown@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Barry Song authored
There could arise a necessity to obtain the first pte_t from a swap pte_t located in the middle. For instance, this may occur within the context of do_swap_page(), where a page fault can potentially occur in any PTE of a large folio. To address this, the following patch introduces pte_move_swp_offset(), a function capable of bidirectional movement by a specified delta argument. Consequently, pte_next_swp_offset() will directly invoke it with delta = 1. Link: https://lkml.kernel.org/r/20240529082824.150954-4-21cnbao@gmail.comSigned-off-by: Barry Song <v-songbaohua@oppo.com> Suggested-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Gao Xiang <xiang@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Len Brown <len.brown@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-