- 13 May, 2022 40 commits
-
-
Peter Xu authored
Hook up hugetlbfs_fault() with the capability to handle userfaultfd-wp faults. We do this slightly earlier than hugetlb_cow() so that we can avoid taking some extra locks that we definitely don't need. Link: https://lkml.kernel.org/r/20220405014901.14590-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
They will be used in the follow up patches to either check/set/clear uffd-wp bit of a huge pte. So far it reuses all the small pte helpers. Archs can overwrite these versions when necessary (with __HAVE_ARCH_HUGE_PTE_UFFD_WP* macros) in the future. Link: https://lkml.kernel.org/r/20220405014858.14531-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
Normally we skip copy page when fork() for VM_SHARED shmem, but we can't skip it anymore if uffd-wp is enabled on dst vma. This should only happen when the src uffd has UFFD_FEATURE_EVENT_FORK enabled on uffd-wp shmem vma, so that VM_UFFD_WP will be propagated onto dst vma too, then we should copy the pgtables with uffd-wp bit and pte markers, because these information will be lost otherwise. Since the condition checks will become even more complicated for deciding "whether a vma needs to copy the pgtable during fork()", introduce a helper vma_needs_copy() for it, so everything will be clearer. Link: https://lkml.kernel.org/r/20220405014855.14468-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
We don't have "huge" version of pte markers, instead when necessary we split the thp. However split the thp is not enough, because file-backed thp is handled totally differently comparing to anonymous thps: rather than doing a real split, the thp pmd will simply got cleared in __split_huge_pmd_locked(). That is not enough if e.g. when there is a thp covers range [0, 2M) but we want to wr-protect small page resides in [4K, 8K) range, because after __split_huge_pmd() returns, there will be a none pmd, and change_pmd_range() will just skip it right after the split. Here we leverage the previously introduced change_pmd_prepare() macro so that we'll populate the pmd with a pgtable page after the pmd split (in which process the pmd will be cleared for cases like shmem). Then change_pte_range() will do all the rest for us by installing the uffd-wp pte marker at any none pte that we'd like to wr-protect. Link: https://lkml.kernel.org/r/20220405014852.14413-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
File-backed memory differs from anonymous memory in that even if the pte is missing, the data could still resides either in the file or in page/swap cache. So when wr-protect a pte, we need to consider none ptes too. We do that by installing the uffd-wp pte markers when necessary. So when there's a future write to the pte, the fault handler will go the special path to first fault-in the page as read-only, then report to userfaultfd server with the wr-protect message. On the other hand, when unprotecting a page, it's also possible that the pte got unmapped but replaced by the special uffd-wp marker. Then we'll need to be able to recover from a uffd-wp pte marker into a none pte, so that the next access to the page will fault in correctly as usual when accessed the next time. Special care needs to be taken throughout the change_protection_range() process. Since now we allow user to wr-protect a none pte, we need to be able to pre-populate the page table entries if we see (!anonymous && MM_CP_UFFD_WP) requests, otherwise change_protection_range() will always skip when the pgtable entry does not exist. For example, the pgtable can be missing for a whole chunk of 2M pmd, but the page cache can exist for the 2M range. When we want to wr-protect one 4K page within the 2M pmd range, we need to pre-populate the pgtable and install the pte marker showing that we want to get a message and block the thread when the page cache of that 4K page is written. Without pre-populating the pmd, change_protection() will simply skip that whole pmd. Note that this patch only covers the small pages (pte level) but not covering any of the transparent huge pages yet. That will be done later, and this patch will be a preparation for it too. Link: https://lkml.kernel.org/r/20220405014850.14352-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
File-backed memory is prone to being unmapped at any time. It means all information in the pte will be dropped, including the uffd-wp flag. To persist the uffd-wp flag, we'll use the pte markers. This patch teaches the zap code to understand uffd-wp and know when to keep or drop the uffd-wp bit. Add a new flag ZAP_FLAG_DROP_MARKER and set it in zap_details when we don't want to persist such an information, for example, when destroying the whole vma, or punching a hole in a shmem file. For the rest cases we should never drop the uffd-wp bit, or the wr-protect information will get lost. The new ZAP_FLAG_DROP_MARKER needs to be put into mm.h rather than memory.c because it'll be further referenced in hugetlb files later. Link: https://lkml.kernel.org/r/20220405014847.14295-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
File-backed memories are prone to unmap/swap so the ptes are always unstable, because they can be easily faulted back later using the page cache. This could lead to uffd-wp getting lost when unmapping or swapping out such memory. One example is shmem. PTE markers are needed to store those information. This patch prepares it by handling uffd-wp pte markers first it is applied elsewhere, so that the page fault handler can recognize uffd-wp pte markers. The handling of uffd-wp pte markers is similar to missing fault, it's just that we'll handle this "missing fault" when we see the pte markers, meanwhile we need to make sure the marker information is kept during processing the fault. This is a slow path of uffd-wp handling, because zapping of wr-protected shmem ptes should be rare. So far it should only trigger in two conditions: (1) When trying to punch holes in shmem_fallocate(), there is an optimization to zap the pgtables before evicting the page. (2) When swapping out shmem pages. Because of this, the page fault handling is simplifed too by not sending the wr-protect message in the 1st page fault, instead the page will be installed read-only, so the uffd-wp message will be generated in the next fault, which will trigger the do_wp_page() path of general uffd-wp handling. Disable fault-around for all uffd-wp registered ranges for extra safety just like uffd-minor fault, and clean the code up. Link: https://lkml.kernel.org/r/20220405014844.14239-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
Pass wp_copy into shmem_mfill_atomic_pte() through the stack, then apply the UFFD_WP bit properly when the UFFDIO_COPY on shmem is with UFFDIO_COPY_MODE_WP. wp_copy lands mfill_atomic_install_pte() finally. Note: we must do pte_wrprotect() if !writable in mfill_atomic_install_pte(), as mk_pte() could return a writable pte (e.g., when VM_SHARED on a shmem file). Link: https://lkml.kernel.org/r/20220405014841.14185-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
This patch introduces the 1st user of pte marker: the uffd-wp marker. When the pte marker is installed with the uffd-wp bit set, it means this pte was wr-protected by uffd. We will use this special pte to arm the ptes that got either unmapped or swapped out for a file-backed region that was previously wr-protected. This special pte could trigger a page fault just like swap entries. This idea is greatly inspired by Hugh and Andrea in the discussion, which is referenced in the links below. Some helpers are introduced to detect whether a swap pte is uffd wr-protected. After the pte marker introduced, one swap pte can be wr-protected in two forms: either it is a normal swap pte and it has _PAGE_SWP_UFFD_WP set, or it's a pte marker that has PTE_MARKER_UFFD_WP set. [peterx@redhat.com: fixup] Link: https://lkml.kernel.org/r/YkzKiM8tI4+qOfXF@xz-m1.local Link: https://lore.kernel.org/lkml/20201126222359.8120-1-peterx@redhat.com/ Link: https://lore.kernel.org/lkml/20201130230603.46187-1-peterx@redhat.com/ Link: https://lkml.kernel.org/r/20220405014838.14131-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Suggested-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
This patch allows do_fault() to trigger on !pte_none() cases too. This prepares for the pte markers to be handled by do_fault() just like none pte. To achieve this, instead of unconditionally check against pte_none() in finish_fault(), we may hit the case that the orig_pte was some pte marker so what we want to do is to replace the pte marker with some valid pte entry. Then if orig_pte was set we'd want to check the current *pte (under pgtable lock) against orig_pte rather than none pte. Right now there's no solid way to safely reference orig_pte because when pmd is not allocated handle_pte_fault() will not initialize orig_pte, so it's not safe to reference it. There's another solution proposed before this patch to do pte_clear() for vmf->orig_pte for pmd==NULL case, however it turns out it'll break arm32 because arm32 could have assumption that pte_t* pointer will always reside on a real ram32 pgtable, not any kernel stack variable. To solve this, we add a new flag FAULT_FLAG_ORIG_PTE_VALID, and it'll be set along with orig_pte when there is valid orig_pte, or it'll be cleared when orig_pte was not initialized. It'll be updated every time we call handle_pte_fault(), so e.g. if a page fault retry happened it'll be properly updated along with orig_pte. [1] https://lore.kernel.org/lkml/710c48c9-406d-e4c5-a394-10501b951316@samsung.com/ [akpm@linux-foundation.org: coding-style cleanups] [peterx@redhat.com: fix crash reported by Marek] Link: https://lkml.kernel.org/r/Ylb9rXJyPm8/ao8f@xz-m1.local Link: https://lkml.kernel.org/r/20220405014836.14077-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
This patch still does not use pte marker in any way, however it teaches the core mm about the pte marker idea. For example, handle_pte_marker() is introduced that will parse and handle all the pte marker faults. Many of the places are more about commenting it up - so that we know there's the possibility of pte marker showing up, and why we don't need special code for the cases. [peterx@redhat.com: userfaultfd.c needs swapops.h] Link: https://lkml.kernel.org/r/YmRlVj3cdizYJsr0@xz-m1.local Link: https://lkml.kernel.org/r/20220405014833.14015-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
Patch series "userfaultfd-wp: Support shmem and hugetlbfs", v8. Overview ======== Userfaultfd-wp anonymous support was merged two years ago. There're quite a few applications that started to leverage this capability either to take snapshots for user-app memory, or use it for full user controled swapping. This series tries to complete the feature for uffd-wp so as to cover all the RAM-based memory types. So far uffd-wp is the only missing piece of the rest features (uffd-missing & uffd-minor mode). One major reason to do so is that anonymous pages are sometimes not satisfying the need of applications, and there're growing users of either shmem and hugetlbfs for either sharing purpose (e.g., sharing guest mem between hypervisor process and device emulation process, shmem local live migration for upgrades), or for performance on tlb hits. All these mean that if a uffd-wp app wants to switch to any of the memory types, it'll stop working. I think it's worthwhile to have the kernel to cover all these aspects. This series chose to protect pages in pte level not page level. One major reason is safety. I have no idea how we could make it safe if any of the uffd-privileged app can wr-protect a page that any other application can use. It means this app can block any process potentially for any time it wants. The other reason is that it aligns very well with not only the anonymous uffd-wp solution, but also uffd as a whole. For example, userfaultfd is implemented fundamentally based on VMAs. We set flags to VMAs showing the status of uffd tracking. For another per-page based protection solution, it'll be crossing the fundation line on VMA-based, and it could simply be too far away already from what's called userfaultfd. PTE markers =========== The patchset is based on the idea called PTE markers. It was discussed in one of the mm alignment sessions, proposed starting from v6, and this is the 2nd version of it using PTE marker idea. PTE marker is a new type of swap entry that is ony applicable to file backed memories like shmem and hugetlbfs. It's used to persist some pte-level information even if the original present ptes in pgtable are zapped. Logically pte markers can store more than uffd-wp information, but so far only one bit is used for uffd-wp purpose. When the pte marker is installed with uffd-wp bit set, it means this pte is wr-protected by uffd. It solves the problem on e.g. file-backed memory mapped ptes got zapped due to any reason (e.g. thp split, or swapped out), we can still keep the wr-protect information in the ptes. Then when the page fault triggers again, we'll know this pte is wr-protected so we can treat the pte the same as a normal uffd wr-protected pte. The extra information is encoded into the swap entry, or swp_offset to be explicit, with the swp_type being PTE_MARKER. So far uffd-wp only uses one bit out of the swap entry, the rest bits of swp_offset are still reserved for other purposes. There're two configs to enable/disable PTE markers: CONFIG_PTE_MARKER CONFIG_PTE_MARKER_UFFD_WP We can set !PTE_MARKER to completely disable all the PTE markers, along with uffd-wp support. I made two config so we can also enable PTE marker but disable uffd-wp file-backed for other purposes. At the end of current series, I'll enable CONFIG_PTE_MARKER by default, but that patch is standalone and if anyone worries about having it by default, we can also consider turn it off by dropping that oneliner patch. So far I don't see a huge risk of doing so, so I kept that patch. In most cases, PTE markers should be treated as none ptes. It is because that unlike most of the other swap entry types, there's no PFN or block offset information encoded into PTE markers but some extra well-defined bits showing the status of the pte. These bits should only be used as extra data when servicing an upcoming page fault, and then we behave as if it's a none pte. I did spend a lot of time observing all the pte_none() users this time. It is indeed a challenge because there're a lot, and I hope I didn't miss a single of them when we should take care of pte markers. Luckily, I don't think it'll need to be considered in many cases, for example: boot code, arch code (especially non-x86), kernel-only page handlings (e.g. CPA), or device driver codes when we're tackling with pure PFN mappings. I introduced pte_none_mostly() in this series when we need to handle pte markers the same as none pte, the "mostly" is the other way to write "either none pte or a pte marker". I didn't replace pte_none() to cover pte markers for below reasons: - Very rare case of pte_none() callers will handle pte markers. E.g., all the kernel pages do not require knowledge of pte markers. So we don't pollute the major use cases. - Unconditionally change pte_none() semantics could confuse people, because pte_none() existed for so long a time. - Unconditionally change pte_none() semantics could make pte_none() slower even if in many cases pte markers do not exist. - There're cases where we'd like to handle pte markers differntly from pte_none(), so a full replace is also impossible. E.g. khugepaged should still treat pte markers as normal swap ptes rather than none ptes, because pte markers will always need a fault-in to merge the marker with a valid pte. Or the smap code will need to parse PTE markers not none ptes. Patch Layout ============ Introducing PTE marker and uffd-wp bit in PTE marker: mm: Introduce PTE_MARKER swap entry mm: Teach core mm about pte markers mm: Check against orig_pte for finish_fault() mm/uffd: PTE_MARKER_UFFD_WP Adding support for shmem uffd-wp: mm/shmem: Take care of UFFDIO_COPY_MODE_WP mm/shmem: Handle uffd-wp special pte in page fault handler mm/shmem: Persist uffd-wp bit across zapping for file-backed mm/shmem: Allow uffd wr-protect none pte for file-backed mem mm/shmem: Allows file-back mem to be uffd wr-protected on thps mm/shmem: Handle uffd-wp during fork() Adding support for hugetlbfs uffd-wp: mm/hugetlb: Introduce huge pte version of uffd-wp helpers mm/hugetlb: Hook page faults for uffd write protection mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP mm/hugetlb: Handle UFFDIO_WRITEPROTECT mm/hugetlb: Handle pte markers in page faults mm/hugetlb: Allow uffd wr-protect none ptes mm/hugetlb: Only drop uffd-wp special pte if required mm/hugetlb: Handle uffd-wp during fork() Misc handling on the rest mm for uffd-wp file-backed: mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs Enabling of uffd-wp on file-backed memory: mm/uffd: Enable write protection for shmem & hugetlbfs mm: Enable PTE markers by default selftests/uffd: Enable uffd-wp for shmem/hugetlbfs Tests ===== - Compile test on x86_64 and aarch64 on different configs - Kernel selftests - uffd-test [0] - Umapsort [1,2] test for shmem/hugetlb, with swap on/off [0] https://github.com/xzpeter/clibs/tree/master/uffd-test [1] https://github.com/xzpeter/umap-apps/tree/peter [2] https://github.com/xzpeter/umap/tree/peter-shmem-hugetlbfs This patch (of 23): Introduces a new swap entry type called PTE_MARKER. It can be installed for any pte that maps a file-backed memory when the pte is temporarily zapped, so as to maintain per-pte information. The information that kept in the pte is called a "marker". Here we define the marker as "unsigned long" just to match pgoff_t, however it will only work if it still fits in swp_offset(), which is e.g. currently 58 bits on x86_64. A new config CONFIG_PTE_MARKER is introduced too; it's by default off. A bunch of helpers are defined altogether to service the rest of the pte marker code. [peterx@redhat.com: fixup] Link: https://lkml.kernel.org/r/Yk2rdB7SXZf+2BDF@xz-m1.local Link: https://lkml.kernel.org/r/20220405014646.13522-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20220405014646.13522-2-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Wonhyuk Yang authored
To spread dirty pages, nodes are checked whether they have reached the dirty limit using the expensive node_dirty_ok(). To reduce the frequency of calling node_dirty_ok(), the last node that hit the dirty limit can be cached. Instead of caching the node, caching both the node and its node_dirty_ok() status can reduce the number of calle to node_dirty_ok(). [akpm@linux-foundation.org: rename last_pgdat_dirty_limit to last_pgdat_dirty_ok] Link: https://lkml.kernel.org/r/20220430011032.64071-1-vvghjk1234@gmail.comSigned-off-by: Wonhyuk Yang <vvghjk1234@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Donghyeok Kim <dthex5d@gmail.com> Cc: JaeSang Yoo <jsyoo5b@gmail.com> Cc: Jiyoup Kim <lakroforce@gmail.com> Cc: Ohhoon Kwon <ohkwon1043@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
This commit documents the new DAMON_RECLAIM parameter, 'commit_inputs' in its usage document. Link: https://lkml.kernel.org/r/20220429160606.127307-15-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
DAMON_RECLAIM reads the user input parameters only when it starts. To allow more efficient online tuning, this commit implements a new input parameter called 'commit_inputs'. Writing true to the parameter makes DAMON_RECLAIM reads the input parameters again. Link: https://lkml.kernel.org/r/20220429160606.127307-14-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
This commit documents the newly added 'state' sysfs file input keyword, 'commit', which allows online tuning of DAMON contexts. Link: https://lkml.kernel.org/r/20220429160606.127307-13-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
Currently, DAMON sysfs interface doesn't provide a way for adjusting DAMON input parameters while it is turned on. Therefore, users who want to reconfigure DAMON need to stop DAMON and restart. This means all the monitoring results that accumulated so far, which could be useful, should be flushed. This would be inefficient for many cases. For an example, let's suppose a sysadmin was running a DAMON-based Operation Scheme to find memory regions not accessed for more than 5 mins and page out the regions. If it turns out the 5 mins threshold was too long and therefore the sysadmin wants to reduce it to 4 mins, the sysadmin should turn off DAMON, restart it, and wait for at least 4 more minutes so that DAMON can find the cold memory regions, even though DAMON was knowing there are regions that not accessed for 4 mins at the time of shutdown. This commit makes DAMON sysfs interface to support online DAMON input parameters updates by adding a new input keyword for the 'state' DAMON sysfs file, 'commit'. Writing the keyword to the 'state' file while the corresponding kdamond is running makes the kdamond to read the sysfs file values again and update the DAMON context. Link: https://lkml.kernel.org/r/20220429160606.127307-12-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
Only '->kdamond' and '->kdamond_stop' are protected by 'kdamond_lock' of 'struct damon_ctx'. All other DAMON context internal data items are recommended to be accessed in DAMON callbacks, or under some additional synchronizations. But, DAMON sysfs is accessing the schemes stat under 'kdamond_lock'. It makes no big issue as the read values are not used anywhere inside kernel, but would better to be fixed. This commit moves the reads to DAMON callback context, as supposed to be used for the purpose. Link: https://lkml.kernel.org/r/20220429160606.127307-11-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
DAMON sysfs 'state' file handling code is using string literals in both 'state_show()' and 'state_store()'. This makes the code error prone and inflexible for future extensions. To improve the situation, this commit defines possible input strings and 'enum' for identifying each input keyword only once, and refactors the code to reuse those. Link: https://lkml.kernel.org/r/20220429160606.127307-10-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
'damon_set_regions()' is general enough so that it can also be used for only creating regions. This commit makes DAMON sysfs interface to reuse the function rather keeping two implementations for a same purpose. Link: https://lkml.kernel.org/r/20220429160606.127307-9-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
This commit separates DAMON sysfs interface's monitoring context targets setup code to a new function for better readability. Link: https://lkml.kernel.org/r/20220429160606.127307-8-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
Having multiple targets for physical address space monitoring makes no sense. This commit prohibits such a ridiculous DAMON context setup my making the DAMON context build function to check and return an error for the case. Link: https://lkml.kernel.org/r/20220429160606.127307-7-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
'damon_va_apply_three_regions()' is just a wrapper of its general version, 'damon_set_regions()'. This commit replaces the wrapper calls to directly call the general version. Link: https://lkml.kernel.org/r/20220429160606.127307-6-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
This commit moves 'damon_set_regions()' from vaddr to core, as it is aimed to be used by not only 'vaddr' but also other parts of DAMON. Link: https://lkml.kernel.org/r/20220429160606.127307-5-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
'damon_va_apply_three_regions()' is for adjusting address ranges to fit in three discontiguous ranges. The function can be generalized for arbitrary number of discontiguous ranges and reused for future usage, such as arbitrary online regions update. For such future usage, this commit introduces a generalized version of the function called 'damon_set_regions()'. Link: https://lkml.kernel.org/r/20220429160606.127307-4-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
When 'after_sampling()' or 'after_aggregation()' DAMON callbacks return an error, kdamond continues the remaining loop once. It makes no much sense to run the remaining part while something wrong already happened. The context might be corrupted or having invalid data. This commit therefore makes kdamond skips the remaining works and immediately finish in the cases. Link: https://lkml.kernel.org/r/20220429160606.127307-3-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
Patch series "mm/damon: Support online tuning". Effects of DAMON and DAMON-based Operation Schemes highly depends on the configurations. Wrong configurations could even result in unexpected efficiency degradations. For finding a best configuration, repeating incremental configuration changes and results measurements, in other words, online tuning, could be helpful. Nevertheless, DAMON kernel API supports only restrictive online tuning. Worse yet, the sysfs-based DAMON user interface doesn't support online tuning at all. DAMON_RECLAIM also doesn't support online tuning. This patchset makes the DAMON kernel API, DAMON sysfs interface, and DAMON_RECLAIM supports online tuning. Sequence of patches ------------------- First two patches enhance DAMON online tuning for kernel API users. Specifically, patch 1 let kernel API users to be able to do DAMON online tuning without a restriction, and patch 2 makes error handling easier. Following seven patches (patches 3-9) refactor code for better readability and easier reuse of code fragments that will be useful for online tuning support. Patch 10 introduces DAMON callback based user request handling structure for DAMON sysfs interface, and patch 11 enables DAMON online tuning via DAMON sysfs interface. Documentation patch (patch 12) for usage of it follows. Patch 13 enables online tuning of DAMON_RECLAIM and finally patch 14 documents the DAMON_RECLAIM online tuning usage. This patch (of 14): For updating input parameters for running DAMON contexts, DAMON kernel API users can use the contexts' callbacks, as it is the safe place for context internal data accesses. When the context has DAMON-based operation schemes and all schemes are deactivated due to their watermarks, however, DAMON does nothing but only watermarks checks. As a result, no callbacks will be called back, and therefore the kernel API users cannot update the input parameters including monitoring attributes, DAMON-based operation schemes, and watermarks. To let users easily update such DAMON input parameters in such a case, this commit adds a new callback, 'after_wmarks_check()'. It will be called after each watermarks check. Users can do the online input parameters update in the callback even under the schemes deactivated case. Link: https://lkml.kernel.org/r/20220429160606.127307-2-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Niels Dossche authored
Add a regression test that validates that mremap fails for vma's that don't exist. Link: https://lkml.kernel.org/r/20220427224439.23828-3-dossche.niels@gmail.comSigned-off-by: Niels Dossche <dossche.niels@gmail.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Adrian Huang authored
Fix spelling/grammar mistakes in comments. Link: https://lkml.kernel.org/r/20220428061522.666-1-adrianhuang0701@gmail.comSigned-off-by: Adrian Huang <ahuang12@lenovo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Hongchen Zhang authored
A pmd migration entry should first be a swap pmd,so use is_swap_pmd(pmd) instead of !pmd_present(pmd). On the other hand, some architecture (MIPS for example) may misjudge a pmd_none entry as a pmd migration entry. Link: https://lkml.kernel.org/r/1651131333-6386-1-git-send-email-zhanghongchen@loongson.cnSigned-off-by: Hongchen Zhang <zhanghongchen@loongson.cn> Acked-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Florian Rommel authored
Commit c1e8d7c6 ("mmap locking API: convert mmap_sem comments") missed replacing some references of mmap_sem by mmap_lock due to misspelling (mm_sem instead of mmap_sem). Link: https://lkml.kernel.org/r/20220503113333.214124-1-mail@florommel.deSigned-off-by: Florian Rommel <mail@florommel.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Collingbourne authored
When CONFIG_KASAN_HW_TAGS is enabled we currently increase the minimum slab alignment to 16. This happens even if MTE is not supported in hardware or disabled via kasan=off, which creates an unnecessary memory overhead in those cases. Eliminate this overhead by making the minimum slab alignment a runtime property and only aligning to 16 if KASAN is enabled at runtime. On a DragonBoard 845c (non-MTE hardware) with a kernel built with CONFIG_KASAN_HW_TAGS, waiting for quiescence after a full Android boot I see the following Slab measurements in /proc/meminfo (median of 3 reboots): Before: 169020 kB After: 167304 kB [akpm@linux-foundation.org: make slab alignment type `unsigned int' to avoid casting] Link: https://linux-review.googlesource.com/id/I752e725179b43b144153f4b6f584ceb646473ead Link: https://lkml.kernel.org/r/20220427195820.1716975-2-pcc@google.comSigned-off-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Collingbourne authored
An inclusion of cache.h in printk.h was added in 2014 in commit c28aa1f0 ("printk/cache: mark printk_once test variable __read_mostly") in order to bring in the definition of __read_mostly. The usage of __read_mostly was later removed in commit 3ec25826 ("printk: Tie printk_once / printk_deferred_once into .data.once for reset") which made the inclusion of cache.h unnecessary, so remove it. We have a small amount of code that depended on the inclusion of cache.h from printk.h; fix that code to include the appropriate header. This fixes a circular inclusion on arm64 (linux/printk.h -> linux/cache.h -> asm/cache.h -> linux/kasan-enabled.h -> linux/static_key.h -> linux/jump_label.h -> linux/bug.h -> asm/bug.h -> linux/printk.h) that would otherwise be introduced by the next patch. Build tested using {allyesconfig,defconfig} x {arm64,x86_64}. Link: https://linux-review.googlesource.com/id/I8fd51f72c9ef1f2d6afd3b2cbc875aa4792c1fba Link: https://lkml.kernel.org/r/20220427195820.1716975-1-pcc@google.comSigned-off-by: Peter Collingbourne <pcc@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Baolin Wang authored
Now we will use flush_cache_page() to flush cache for anonymous hugetlb pages when unmapping or migrating a hugetlb page mapping, but the flush_cache_page() only handles a PAGE_SIZE range on some architectures (like arm32, arc and so on), which will cause potential cache issues. Thus change to use flush_cache_range() to cover the whole size of a hugetlb page. Link: https://lkml.kernel.org/r/dc903b378d1e2d26bbbe85409ab9d009631f175c.1651056365.git.baolin.wang@linux.alibaba.comSigned-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Baolin Wang authored
The cache level flush will always be first when changing an existing virtual–>physical mapping to a new value, since this allows us to properly handle systems whose caches are strict and require a virtual–>physical translation to exist for a virtual address. So we should move the cache flushing before huge_pmd_unshare(). As Muchun pointed out[1], now the architectures whose supporting hugetlb PMD sharing have no cache flush issues in practice. But I think we should still follow the cache/TLB flushing rules when changing a valid virtual address mapping in case of potential issues in future. [1] https://lore.kernel.org/all/YmT%2F%2FhuUbFX+KHcy@FVFYT0MHHV2J.usts.net/ Link: https://lkml.kernel.org/r/4f7ae6dfdc838ab71e1655188b657c032ff1f28f.1651056365.git.baolin.wang@linux.alibaba.comSigned-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Baolin Wang authored
This patchset fixes some cache flushing issues if PMD sharing is possible for hugetlb pages, which were found by code inspection. Meanwhile Mike found the flush_cache_page() can not cover the whole size of a hugetlb page on some architectures [1], so I added a new patch 3 to fix this issue, since I found only try_to_unmap_one() and try_to_migrate_one() need to fix after some investigation. [1] https://lore.kernel.org/linux-mm/064da3bb-5b4b-7332-a722-c5a541128705@oracle.com/ This patch (of 3): When moving hugetlb page tables, the cache flushing is called in move_page_tables() without considering the shared PMDs, which may be cause cache issues on some architectures. Thus we should move the hugetlb cache flushing into move_hugetlb_page_tables() with considering the shared PMDs ranges, calculated by adjust_range_if_pmd_sharing_possible(). Meanwhile also expanding the TLBs flushing range in case of shared PMDs. Note this is discovered via code inspection, and did not meet a real problem in practice so far. Link: https://lkml.kernel.org/r/cover.1651056365.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/0443c8cf20db554d3ff4b439b30e0ff26c0181dd.1651056365.git.baolin.wang@linux.alibaba.com Fixes: 550a7d60 ("mm, hugepages: add mremap() support for hugepage backed vma") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mina Almasry <almasrymina@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
liusongtang authored
pgprot.pgprot is non-portable code. It should be replaced by portable macro pgprot_val. Link: https://lkml.kernel.org/r/20220426071302.220646-1-liusongtang@huawei.comSigned-off-by: liusongtang <liusongtang@huawei.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
This commit documents the user space support of the newly added monitoring operations set for fixed virtual address ranges monitoring, namely 'fvaddr', on the ABI and usage documents for DAMON. Link: https://lkml.kernel.org/r/20220426231750.48822-4-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
This commit makes DAMON sysfs interface to support the fixed virtual address ranges monitoring. After this commit, writing 'fvaddr' to the 'operations' DAMON sysfs file makes DAMON uses the monitoring operations set for fixed virtual address ranges, so that users can monitor accesses to only interested virtual address ranges. [sj@kernel.org: fix pid leak under fvaddr ops use case] Link: https://lkml.kernel.org/r/20220503220531.45913-1-sj@kernel.org Link: https://lkml.kernel.org/r/20220426231750.48822-3-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
Patch series "support fixed virtual address ranges monitoring". The monitoring operations set for virtual address spaces automatically updates the monitoring target regions to cover entire mappings of the virtual address spaces as much as possible. Some users could have more information about their programs than kernel and therefore have interest in not entire regions but only specific regions. For such cases, the automatic monitoring target regions updates are only unnecessary overhead or distractions. This patchset adds supports for the use case on DAMON's kernel API (DAMON_OPS_FVADDR) and sysfs interface ('fvaddr' keyword for 'operations' sysfs file). This patch (of 3): The monitoring operations set for virtual address spaces automatically updates the monitoring target regions to cover entire mappings of the virtual address spaces as much as possible. Some users could have more information about their programs than kernel and therefore have interest in not entire regions but only specific regions. For such cases, the automatic monitoring target regions updates are only unnecessary overheads or distractions. For such cases, DAMON's API users can simply set the '->init()' and '->update()' of the DAMON context's '->ops' NULL, and set the target monitoring regions when creating the context. But, that would be a dirty hack. Worse yet, the hack is unavailable for DAMON user space interface users. To support the use case in a clean way that can easily exported to the user space, this commit adds another monitoring operations set called 'fvaddr', which is same to 'vaddr' but does not automatically update the monitoring regions. Instead, it will only respect the virtual address regions which have explicitly passed at the initial context creation. Note that this commit leave sysfs interface not supporting the feature yet. The support will be made in a following commit. Link: https://lkml.kernel.org/r/20220426231750.48822-1-sj@kernel.org Link: https://lkml.kernel.org/r/20220426231750.48822-2-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-