- 06 Apr, 2023 40 commits
-
-
Mike Rapoport (IBM) authored
The bulk of memory management initialization code is spread all over mm/page_alloc.c and makes navigating through page allocator functionality difficult. Move most of the functions marked __init and __meminit to mm/mm_init.c to make it better localized and allow some more spare room before mm/page_alloc.c reaches 10k lines. No functional changes. Link: https://lkml.kernel.org/r/20230321170513.2401534-4-rppt@kernel.orgSigned-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Mike Rapoport (IBM) authored
Instead of duplicating long static_branch_enabled(&check_pages_enabled) wrap it in a helper function is_check_pages_enabled() Link: https://lkml.kernel.org/r/20230321170513.2401534-3-rppt@kernel.orgSigned-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Mike Rapoport (IBM) authored
Patch series "mm: move core MM initialization to mm/mm_init.c", v2. This set moves most of the core MM initialization to mm/mm_init.c. This largely includes free_area_init() and its helpers, functions used at boot time, mm_init() from init/main.c and some of the functions it calls. Aside from gaining some more space before mm/page_alloc.c hits 10k lines, this makes mm/page_alloc.c to be mostly about buddy allocator and moves the init code out of the way, which IMO improves maintainability. Besides, this allows to move a couple of declarations out of include/linux and make them private to mm/. And as an added bonus there a slight decrease in vmlinux size. For tinyconfig and defconfig on x86 I've got tinyconfig: text data bss dec hex filename 853206 289376 1200128 2342710 23bf36 a/vmlinux 853198 289344 1200128 2342670 23bf0e b/vmlinux defconfig: text data bss dec hex filename 26152959 9730634 2170884 38054477 244aa4d a/vmlinux 26152945 9730602 2170884 38054431 244aa1f b/vmlinux This patch (of 14): Comment about fixrange_init() says that its called from pgtable_init() while the actual caller is pagetabe_init(). Update comment to match the code. Link: https://lkml.kernel.org/r/20230321170513.2401534-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20230321170513.2401534-2-rppt@kernel.orgSigned-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Philippe Mathieu-Daud <philmd@linaro.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Lorenzo Stoakes authored
I have recently been involved in both reviewing and submitting patches to the vmalloc code in mm and would be willing and happy to help out with review going forward if it would be helpful! Link: https://lkml.kernel.org/r/55f663af6100c84a71a0065ac0ed22463aa340de.1679421959.git.lstoakes@gmail.comSigned-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Mike Rapoport (IBM) authored
The get_page_from_free_area() helper is only used in mm/page_alloc.c so move it there to reduce noise in include/linux/mmzone.h Link: https://lkml.kernel.org/r/20230319114214.2133332-1-rppt@kernel.orgSigned-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Lorenzo Stoakes authored
All use of this value is now at page granularity, so specify the variable as such too. This simplifies the logic. We maintain the debugfs entry to ensure that there are no user-visible changes. Link: https://lkml.kernel.org/r/4995bad07fe9baa51c786fa0d81819dddfb57654.1679089214.git.lstoakes@gmail.comSigned-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Lorenzo Stoakes authored
Patch series "Refactor do_fault_around()" Refactor do_fault_around() to avoid bitwise tricks and rather difficult to follow logic. Additionally, prefer fault_around_pages to fault_around_bytes as the operations are performed at a base page granularity. This patch (of 2): The existing logic is confusing and fails to abstract a number of bitwise tricks. Use ALIGN_DOWN() to perform alignment, pte_index() to obtain a PTE index and represent the address range using PTE offsets, which naturally make it clear that the operation is intended to occur within only a single PTE and prevent spanning of more than one page table. We rely on the fact that fault_around_bytes will always be page-aligned, at least one page in size, a power of two and that it will not exceed PAGE_SIZE * PTRS_PER_PTE in size (i.e. the address space mapped by a PTE). These are all guaranteed by fault_around_bytes_set(). Link: https://lkml.kernel.org/r/cover.1679089214.git.lstoakes@gmail.com Link: https://lkml.kernel.org/r/d125db1c3665a63b80cea29d56407825482e2262.1679089214.git.lstoakes@gmail.comSigned-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Baolin Wang authored
When trying to isolate a migratable pageblock, it can contain several normal pages or several hugetlb pages (e.g. CONT-PTE 64K hugetlb on arm64) in a pageblock. That means we may hold the lru lock of a normal page to continue to isolate the next hugetlb page by isolate_or_dissolve_huge_page() in the same migratable pageblock. However in the isolate_or_dissolve_huge_page(), it may allocate a new hugetlb page and dissolve the old one by alloc_and_dissolve_hugetlb_folio() if the hugetlb's refcount is zero. That means we can still enter the direct compaction path to allocate a new hugetlb page under the current lru lock, which may cause possible deadlock. To avoid this possible deadlock, we should release the lru lock when trying to isolate a hugetbl page. Moreover it does not make sense to take the lru lock to isolate a hugetlb, which is not in the lru list. Link: https://lkml.kernel.org/r/7ab3bffebe59fb419234a68dec1e4572a2518563.1678962352.git.baolin.wang@linux.alibaba.com Fixes: 369fa227 ("mm: make alloc_contig_range handle free hugetlb pages") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Oscar Salvador <osalvador@suse.de> Cc: William Lam <william.lam@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Baolin Wang authored
commit b717d6b9 ("mm: compaction: include compound page count for scanning in pageblock isolation") added compound page statistics for scanning in pageblock isolation, to make sure the number of scanned pages is always larger than the number of isolated pages when isolating mirgratable or free pageblock. However, when failing to isolate the pages when scanning the migratable or free pageblocks, the isolation failure path did not consider the scanning statistics of the compound pages, which result in showing the incorrect number of scanned pages in tracepoints or in vmstats which will confuse people about the page scanning pressure in memory compaction. Thus we should take into account the number of scanning pages when failing to isolate the compound pages to make the statistics accurate. Link: https://lkml.kernel.org/r/73d6250a90707649cc010731aedc27f946d722ed.1678962352.git.baolin.wang@linux.alibaba.comSigned-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: William Lam <william.lam@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Vlastimil Babka authored
This effectively reverts d014cd7c ("mm, mremap: fix mremap() expanding for vma's with vm_ops->close()"). After the recent changes, vma_merge() is able to handle the expansion properly even when the vma being expanded has a vm_ops->close operation, so we don't need to special case it anymore. Link: https://lkml.kernel.org/r/20230309111258.24079-11-vbabka@suse.czSigned-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Vlastimil Babka authored
Since pre-git times, is_mergeable_vma() returns false for a vma with vm_ops->close, so that no owner assumptions are violated in case the vma is removed as part of the merge. This check is currently very conservative and can prevent merging even situations where vma can't be removed, such as simple expansion of previous vma, as evidenced by commit d014cd7c ("mm, mremap: fix mremap() expanding for vma's with vm_ops->close()") In order to allow more merging when appropriate and simplify the code that was made more complex by commit d014cd7c, start distinguishing cases where the vma can be really removed, and allow merging with vm_ops->close otherwise. As a first step, add a may_remove_vma parameter to is_mergeable_vma(). can_vma_merge_before() sets it to true, because when called from vma_merge(), a removal of the vma is possible. In can_vma_merge_after(), pass the parameter as false, because no removal can occur in each of its callers: - vma_merge() calls it on the 'prev' vma, which is never removed - mmap_region() and do_brk_flags() call it to determine if it can expand a vma, which is not removed As a result, vma's with vm_ops->close may now merge with compatible ranges in more situations than previously. We can also revert commit d014cd7c as the next step to simplify mremap code again. [vbabka@suse.cz: adjust comment as suggested by Lorenzo] Link: https://lkml.kernel.org/r/74f2ea6c-f1a9-6dd7-260c-25e660f42379@suse.cz Link: https://lkml.kernel.org/r/20230309111258.24079-10-vbabka@suse.czSigned-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Vlastimil Babka authored
The comments already mention returning 'true' so make the code match them. Link: https://lkml.kernel.org/r/20230309111258.24079-9-vbabka@suse.czSigned-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Vlastimil Babka authored
The variable 'adj_next' holds the value by which we adjust vm_start of a vma in variable 'adjust', that's either 'next' or 'mid', so the current name is inaccurate. Rename it to 'adj_start'. Link: https://lkml.kernel.org/r/20230309111258.24079-8-vbabka@suse.czSigned-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Vlastimil Babka authored
There are several places where we test if 'mid' is really the area NNNN in the diagram and the tests have two variants and are non-obvious to follow. Instead, set 'mid' to NULL up-front if it's not the NNNN area, and simplify the tests. Also update the description in comment accordingly. [vbabka@suse.cz: adjust/add comments as suggested by Lorenzo] Link: https://lkml.kernel.org/r/def43190-53f7-a607-d1b0-b657565f4288@suse.cz Link: https://lkml.kernel.org/r/20230309111258.24079-7-vbabka@suse.czSigned-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Vlastimil Babka authored
It is more intuitive to go from prev to mid and then next. No functional change. Link: https://lkml.kernel.org/r/20230309111258.24079-6-vbabka@suse.czSigned-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Vlastimil Babka authored
Almost all cases now use the 'next' pointer for the vma following the merged area, and the cases diagram shows it as XXXX. Case 4 is different as it uses 'mid' and NNNN, so change it for consistency. No functional change. Link: https://lkml.kernel.org/r/20230309111258.24079-5-vbabka@suse.czSigned-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Vlastimil Babka authored
Case 1 is now shown in the comment as next vma being merged with prev, so use 'next' instead of 'mid'. In case 1 they both point to the same vma. As a consequence, in case 6, the dup_anon_vma() is now tried first on 'next' and then on 'mid', before it was the opposite order. This is not a functional change, as those two vma's cannnot have a different anon_vma, as that would have prevented the merging in the first place. Link: https://lkml.kernel.org/r/20230309111258.24079-4-vbabka@suse.czSigned-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Vlastimil Babka authored
In case 3 we we use 'next' for everything but vma_pgoff. So use 'next' for that as well, instead of 'mid', for consistency. Then in case 8 we have to use 'mid' explicitly, which should also make the intent more obvious. Adjust the diagram for cases 1-3 in the comment to match the code - we are using 'next' for case 3 so mark the range with XXXX instead of NNNN. For case 2 that's a no-op as the code doesn't touch 'next' or 'mid'. For case 1 it's now wrong but that will be fixed next. No functional change. Link: https://lkml.kernel.org/r/20230309111258.24079-3-vbabka@suse.czSigned-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Vlastimil Babka authored
Patch series "cleanup vma_merge() and improve mergeability tests". My initial goal here was to try making the check for vm_ops->close in is_mergeable_vma() only be applied for vma's that would be truly removed as part of the merge (see Patch 9). This would then allow reverting the quick fix d014cd7c ("mm, mremap: fix mremap() expanding for vma's with vm_ops->close()"). This was successful enough to allow the revert (Patch 10). Checks using can_vma_merge_before() are still pessimistic about possible vma removal, and making them precise would probably complicate the vma_merge() code too much. Liam's 6.3-rc1 simplification of vma_merge() and removal of __vma_adjust() was very much helpful in understanding the vma_merge() implementation and especially when vma removals can happen, which is now very obvious. While studing the code, I've found ways to make it hopefully even more easy to follow, so that's the patches 1-8. That made me also notice a bug that's now already fixed in 6.3-rc1. This patch (of 10): In the merging preparation part of vma_merge(), some vma pointer variables are assigned for later execution of the merge, but also read from in the block itself. The code is easier follow and check against the cases diagram in the comment if the code reads only from the "primary" vma variables prev, mid, next instead. No functional change. Link: https://lkml.kernel.org/r/20230309111258.24079-1-vbabka@suse.cz Link: https://lkml.kernel.org/r/20230309111258.24079-2-vbabka@suse.czSigned-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>] Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Axel Rasmussen authored
UFFDIO_COPY already has UFFDIO_COPY_MODE_WP, so when installing a new PTE to resolve a missing fault, one can install a write-protected one. This is useful when using UFFDIO_REGISTER_MODE_{MISSING,WP} in combination. This was motivated by testing HugeTLB HGM [1], and in particular its interaction with userfaultfd features. Existing userfaultfd code supports using WP and MINOR modes together (i.e. you can register an area with both enabled), but without this CONTINUE flag the combination is in practice unusable. So, add an analogous UFFDIO_CONTINUE_MODE_WP, which does the same thing as UFFDIO_COPY_MODE_WP, but for *minor* faults. Update the selftest to do some very basic exercising of the new flag. Update Documentation/ to describe how these flags are used (neither the COPY nor the new CONTINUE versions of this mode flag were described there before). [1]: https://patchwork.kernel.org/project/linux-mm/cover/20230218002819.1486479-1-jthoughton@google.com/ Link: https://lkml.kernel.org/r/20230314221250.682452-5-axelrasmussen@google.comSigned-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Axel Rasmussen authored
Many userfaultfd ioctl functions take both a 'mode' and a 'wp_copy' argument. In future commits we plan to plumb the flags through to more places, so we'd be proliferating the very long argument list even further. Let's take the time to simplify the argument list. Combine the two arguments into one - and generalize, so when we add more flags in the future, it doesn't imply more function arguments. Since the modes (copy, zeropage, continue) are mutually exclusive, store them as an integer value (0, 1, 2) in the low bits. Place combine-able flag bits in the high bits. This is quite similar to an earlier patch proposed by Nadav Amit ("userfaultfd: introduce uffd_flags" [1]). The main difference is that patch only handled flags, whereas this patch *also* combines the "mode" argument into the same type to shorten the argument list. [1]: https://lore.kernel.org/all/20220619233449.181323-2-namit@vmware.com/ Link: https://lkml.kernel.org/r/20230314221250.682452-4-axelrasmussen@google.comSigned-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: James Houghton <jthoughton@google.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Axel Rasmussen authored
Quite a few userfaultfd functions took both mm and vma pointers as arguments. Since the mm is trivially accessible via vma->vm_mm, there's no reason to pass both; it just needlessly extends the already long argument list. Get rid of the mm pointer, where possible, to shorten the argument list. Link: https://lkml.kernel.org/r/20230314221250.682452-3-axelrasmussen@google.comSigned-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Axel Rasmussen authored
Patch series "mm: userfaultfd: refactor and add UFFDIO_CONTINUE_MODE_WP", v5. - Commits 1-3 refactor userfaultfd ioctl code without behavior changes, with the main goal of improving consistency and reducing the number of function args. - Commit 4 adds UFFDIO_CONTINUE_MODE_WP. This patch (of 4): The basic problem is, over time we've added new userfaultfd ioctls, and we've refactored the code so functions which used to handle only one case are now re-used to deal with several cases. While this happened, we didn't bother to rename the functions. Similarly, as we added new functions, we cargo-culted pieces of the now-inconsistent naming scheme, so those functions too ended up with names that don't make a lot of sense. A key point here is, "copy" in most userfaultfd code refers specifically to UFFDIO_COPY, where we allocate a new page and copy its contents from userspace. There are many functions with "copy" in the name that don't actually do this (at least in some cases). So, rename things into a consistent scheme. The high level idea is that the call stack for userfaultfd ioctls becomes: userfaultfd_ioctl -> userfaultfd_(particular ioctl) -> mfill_atomic_(particular kind of fill operation) -> mfill_atomic /* loops over pages in range */ -> mfill_atomic_pte /* deals with single pages */ -> mfill_atomic_pte_(particular kind of fill operation) -> mfill_atomic_install_pte There are of course some special cases (shmem, hugetlb), but this is the general structure which all function names now adhere to. Link: https://lkml.kernel.org/r/20230314221250.682452-1-axelrasmussen@google.com Link: https://lkml.kernel.org/r/20230314221250.682452-2-axelrasmussen@google.comSigned-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Mike Rapoport (IBM) authored
MIPS defines insane ranges for ARCH_FORCE_MAX_ORDER allowing MAX_ORDER up to 63, which implies maximal contiguous allocation size of 2^63 pages. Drop bogus definitions of ranges for ARCH_FORCE_MAX_ORDER and leave it a simple integer with sensible defaults. Users that *really* need to change the value of ARCH_FORCE_MAX_ORDER will be able to do so but they won't be mislead by the bogus ranges. Link: https://lkml.kernel.org/r/20230322081520.2516226-1-rppt@kernel.orgSigned-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Mike Rapoport (IBM) authored
LoongArch defines insane ranges for ARCH_FORCE_MAX_ORDER allowing MAX_ORDER up to 63, which implies maximal contiguous allocation size of 2^63 pages. Drop bogus definitions of ranges for ARCH_FORCE_MAX_ORDER and leave it a simple integer with sensible defaults. Users that *really* need to change the value of ARCH_FORCE_MAX_ORDER will be able to do so but they won't be mislead by the bogus ranges. Link: https://lkml.kernel.org/r/20230322081727.2516291-1-rppt@kernel.orgSigned-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kirill A. Shutemov authored
MAX_ORDER currently defined as number of orders page allocator supports: user can ask buddy allocator for page order between 0 and MAX_ORDER-1. This definition is counter-intuitive and lead to number of bugs all over the kernel. Change the definition of MAX_ORDER to be inclusive: the range of orders user can ask from buddy allocator is 0..MAX_ORDER now. [kirill@shutemov.name: fix min() warning] Link: https://lkml.kernel.org/r/20230315153800.32wib3n5rickolvh@box [akpm@linux-foundation.org: fix another min_t warning] [kirill@shutemov.name: fixups per Zi Yan] Link: https://lkml.kernel.org/r/20230316232144.b7ic4cif4kjiabws@box.shutemov.name [akpm@linux-foundation.org: fix underlining in docs] Link: https://lore.kernel.org/oe-kbuild-all/202303191025.VRCTk6mP-lkp@intel.com/ Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.comSigned-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kirill A. Shutemov authored
MAX_ORDER is not inclusive: the maximum allocation order buddy allocator can deliver is MAX_ORDER-1. Fix MAX_ORDER usage in __iommu_dma_alloc_pages(). Also use GENMASK() instead of hard to read "(2U << order) - 1" magic. Link: https://lkml.kernel.org/r/20230315113133.11326-10-kirill.shutemov@linux.intel.comSigned-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Acked-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kirill A. Shutemov authored
MAX_ORDER is not inclusive: the maximum allocation order buddy allocator can deliver is MAX_ORDER-1. Fix MAX_ORDER usage in calculate_order(). Link: https://lkml.kernel.org/r/20230315113133.11326-9-kirill.shutemov@linux.intel.comSigned-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kirill A. Shutemov authored
MAX_ORDER is not inclusive: the maximum allocation order buddy allocator can deliver is MAX_ORDER-1. Fix MAX_ORDER usage in page_reporting_register(). Link: https://lkml.kernel.org/r/20230315113133.11326-8-kirill.shutemov@linux.intel.comSigned-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kirill A. Shutemov authored
MAX_ORDER is not inclusive: the maximum allocation order buddy allocator can deliver is MAX_ORDER-1. Fix MAX_ORDER usage in rb_alloc_aux_page(). Link: https://lkml.kernel.org/r/20230315113133.11326-7-kirill.shutemov@linux.intel.comSigned-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kirill A. Shutemov authored
MAX_ORDER is not inclusive: the maximum allocation order buddy allocator can deliver is MAX_ORDER-1. Fix MAX_ORDER usage in genwqe driver. Link: https://lkml.kernel.org/r/20230315113133.11326-6-kirill.shutemov@linux.intel.comSigned-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Frank Haverkamp <haver@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kirill A. Shutemov authored
MAX_ORDER is not inclusive: the maximum allocation order buddy allocator can deliver is MAX_ORDER-1. Fix MAX_ORDER usage in i915_gem_object_get_pages_internal(). Link: https://lkml.kernel.org/r/20230315113133.11326-5-kirill.shutemov@linux.intel.comSigned-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kirill A. Shutemov authored
MAX_ORDER is not inclusive: the maximum allocation order buddy allocator can deliver is MAX_ORDER-1. Fix MAX_ORDER usage in floppy code. Also allocation buffer exactly PAGE_SIZE << MAX_ORDER bytes is okay. Fix MAX_LEN check. Link: https://lkml.kernel.org/r/20230315113133.11326-4-kirill.shutemov@linux.intel.comSigned-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Denis Efremov <efremov@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kirill A. Shutemov authored
MAX_ORDER is not inclusive: the maximum allocation order buddy allocator can deliver is MAX_ORDER-1. Fix MAX_ORDER usage in linux_main(). Link: https://lkml.kernel.org/r/20230315113133.11326-3-kirill.shutemov@linux.intel.comSigned-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kirill A. Shutemov authored
Patch series "Fix confusion around MAX_ORDER". MAX_ORDER currently defined as number of orders page allocator supports: user can ask buddy allocator for page order between 0 and MAX_ORDER-1. This definition is counter-intuitive and lead to number of bugs all over the kernel. Fix the bugs and then change the definition of MAX_ORDER to be inclusive: the range of orders user can ask from buddy allocator is 0..MAX_ORDER now. This patch (of 10): MAX_ORDER is not inclusive: the maximum allocation order buddy allocator can deliver is MAX_ORDER-1. Fix MAX_ORDER usage in tsb_grow(). Link: https://lkml.kernel.org/r/20230315113133.11326-1-kirill.shutemov@linux.intel.com Link: https://lkml.kernel.org/r/20230315113133.11326-2-kirill.shutemov@linux.intel.comSigned-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: David Miller <davem@davemloft.net> Cc: David Hildenbrand <david@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
Enable it by default on the stress test, and add some smoke tests for the pte markers on anonymous. Link: https://lkml.kernel.org/r/20230309223711.823547-3-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Paul Gofman <pgofman@codeweavers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
Patch series "mm/uffd: Add feature bit UFFD_FEATURE_WP_UNPOPULATED", v4. The new feature bit makes anonymous memory acts the same as file memory on userfaultfd-wp in that it'll also wr-protect none ptes. It can be useful in two cases: (1) Uffd-wp app that needs to wr-protect none ptes like QEMU snapshot, so pre-fault can be replaced by enabling this flag and speed up protections (2) It helps to implement async uffd-wp mode that Muhammad is working on [1] It's debatable whether this is the most ideal solution because with the new feature bit set, wr-protect none pte needs to pre-populate the pgtables to the last level (PAGE_SIZE). But it seems fine so far to service either purpose above, so we can leave optimizations for later. The series brings pte markers to anonymous memory too. There's some change in the common mm code path in the 1st patch, great to have some eye looking at it, but hopefully they're still relatively straightforward. This patch (of 2): This is a new feature that controls how uffd-wp handles none ptes. When it's set, the kernel will handle anonymous memory the same way as file memory, by allowing the user to wr-protect unpopulated ptes. File memories handles none ptes consistently by allowing wr-protecting of none ptes because of the unawareness of page cache being exist or not. For anonymous it was not as persistent because we used to assume that we don't need protections on none ptes or known zero pages. One use case of such a feature bit was VM live snapshot, where if without wr-protecting empty ptes the snapshot can contain random rubbish in the holes of the anonymous memory, which can cause misbehave of the guest when the guest OS assumes the pages should be all zeros. QEMU worked it around by pre-populate the section with reads to fill in zero page entries before starting the whole snapshot process [1]. Recently there's another need raised on using userfaultfd wr-protect for detecting dirty pages (to replace soft-dirty in some cases) [2]. In that case if without being able to wr-protect none ptes by default, the dirty info can get lost, since we cannot treat every none pte to be dirty (the current design is identify a page dirty based on uffd-wp bit being cleared). In general, we want to be able to wr-protect empty ptes too even for anonymous. This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make uffd-wp handling on none ptes being consistent no matter what the memory type is underneath. It doesn't have any impact on file memories so far because we already have pte markers taking care of that. So it only affects anonymous. The feature bit is by default off, so the old behavior will be maintained. Sometimes it may be wanted because the wr-protect of none ptes will contain overheads not only during UFFDIO_WRITEPROTECT (by applying pte markers to anonymous), but also on creating the pgtables to store the pte markers. So there's potentially less chance of using thp on the first fault for a none pmd or larger than a pmd. The major implementation part is teaching the whole kernel to understand pte markers even for anonymously mapped ranges, meanwhile allowing the UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the new feature bit is set. Note that even if the patch subject starts with mm/uffd, there're a few small refactors to major mm path of handling anonymous page faults. But they should be straightforward. With WP_UNPOPUATED, application like QEMU can avoid pre-read faults all the memory before wr-protect during taking a live snapshot. Quotting from Muhammad's test result here [3] based on a simple program [4]: (1) With huge page disabled echo madvise > /sys/kernel/mm/transparent_hugepage/enabled ./uffd_wp_perf Test DEFAULT: 4 Test PRE-READ: 1111453 (pre-fault 1101011) Test MADVISE: 278276 (pre-fault 266378) Test WP-UNPOPULATE: 11712 (2) With Huge page enabled echo always > /sys/kernel/mm/transparent_hugepage/enabled ./uffd_wp_perf Test DEFAULT: 4 Test PRE-READ: 22521 (pre-fault 22348) Test MADVISE: 4909 (pre-fault 4743) Test WP-UNPOPULATE: 14448 There'll be a great perf boost for no-thp case, while for thp enabled with extreme case of all-thp-zero WP_UNPOPULATED can be slower than MADVISE, but that's low possibility in reality, also the overhead was not reduced but postponed until a follow up write on any huge zero thp, so potentially it is faster by making the follow up writes slower. [1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/ [2] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/ [3] https://lore.kernel.org/all/d0eb0a13-16dc-1ac1-653a-78b7273781e3@collabora.com/ [4] https://github.com/xzpeter/clibs/blob/master/uffd-test/uffd-wp-perf.c [peterx@redhat.com: comment changes, oneliner fix to khugepaged] Link: https://lkml.kernel.org/r/ZB2/8jPhD3fpx5U8@x1n Link: https://lkml.kernel.org/r/20230309223711.823547-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20230309223711.823547-2-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Paul Gofman <pgofman@codeweavers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
KASAN suppresses reports for bad accesses done by the KASAN reporting code. The reporting code might access poisoned memory for reporting purposes. Software KASAN modes do this by suppressing reports during reporting via current->kasan_depth, the same way they suppress reports during accesses to poisoned slab metadata. Hardware Tag-Based KASAN does not use current->kasan_depth, and instead resets pointer tags for accesses to poisoned memory done by the reporting code. Despite that, a recursive report can still happen: 1. On hardware with faulty MTE support. This was observed by Weizhao Ouyang on a faulty hardware that caused memory tags to randomly change from time to time. 2. Theoretically, due to a previous MTE-undetected memory corruption. A recursive report can happen via: 1. Accessing a pointer with a non-reset tag in the reporting code, e.g. slab->slab_cache, which is what Weizhao Ouyang observed. 2. Theoretically, via external non-annotated routines, e.g. stackdepot. To resolve this issue, resetting tags for all of the pointers in the reporting code and all the used external routines would be impractical. Instead, disable tag checking done by the CPU for the duration of KASAN reporting for Hardware Tag-Based KASAN. Without this fix, Hardware Tag-Based KASAN reporting code might deadlock. [andreyknvl@google.com: disable preemption instead of migration, fix comment typo] Link: https://lkml.kernel.org/r/d14417c8bc5eea7589e99381203432f15c0f9138.1680114854.git.andreyknvl@google.com Link: https://lkml.kernel.org/r/59f433e00f7fa985e8bf9f7caf78574db16b67ab.1678491668.git.andreyknvl@google.com Fixes: 2e903b91 ("kasan, arm64: implement HW_TAGS runtime") Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reported-by: Weizhao Ouyang <ouyangweizhao@zeku.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
Add two new tagging-related routines arch_suppress_tag_checks_start/stop that suppress MTE tag checking via the TCO register. These rouines are used in the next patch. [andreyknvl@google.com: drop __ from mte_disable/enable_tco names] Link: https://lkml.kernel.org/r/7ad5e5a9db79e3aba08d8f43aca24350b04080f6.1680114854.git.andreyknvl@google.com Link: https://lkml.kernel.org/r/75a362551c3c54b70ae59a3492cabb51c105fa6b.1678491668.git.andreyknvl@google.comSigned-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Weizhao Ouyang <ouyangweizhao@zeku.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Vincenzo Frascino authored
The TCO related routines are used in uaccess methods and load_unaligned_zeropad() but are unrelated to both even if the naming suggest otherwise. Improve the readability of the code moving the away from uaccess.h and pre-pending them with "mte". [andreyknvl@google.com: drop __ from mte_disable/enable_tco names] Link: https://lkml.kernel.org/r/74d26337b2360733956114069e96ff11c296a944.1680114854.git.andreyknvl@google.com Link: https://lkml.kernel.org/r/a48e7adce1248c0f9603a457776d59daa0ef734b.1678491668.git.andreyknvl@google.comSigned-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Weizhao Ouyang <ouyangweizhao@zeku.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-