1. 10 May, 2022 10 commits
    • David Hildenbrand's avatar
      mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from unmap_page() · 50053941
      David Hildenbrand authored
      We can already theoretically fail to unmap (still having page_mapped()) in
      case arch_unmap_one() fails, which can happen on sparc.  Failures to unmap
      are handled gracefully, just as if there are other references on the
      target page: freezing the refcount in split_huge_page_to_list() will fail
      if still mapped and we'll simply remap.
      
      In commit 504e070d ("mm: thp: replace DEBUG_VM BUG with VM_WARN when
      unmap fails for split") we already converted to VM_WARN_ON_ONCE_PAGE,
      let's get rid of it completely now.
      
      This is a preparation for making try_to_migrate() fail on anonymous pages
      with GUP pins, which will make this VM_WARN_ON_ONCE_PAGE trigger more
      frequently.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-11-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reported-by: default avatarYang Shi <shy828301@gmail.com>
      Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      50053941
    • David Hildenbrand's avatar
      mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon() page exclusively · 6c54dc6c
      David Hildenbrand authored
      We want to mark anonymous pages exclusive, and when using
      page_move_anon_rmap() we know that we are the exclusive user, as properly
      documented.  This is a preparation for marking anonymous pages exclusive
      in page_move_anon_rmap().
      
      In both instances, we're holding page lock and are sure that we're the
      exclusive owner (page_count() == 1).  hugetlb already properly uses
      page_move_anon_rmap() in the write fault handler.
      
      Note that in case of a PTE-mapped THP, we'll only end up calling this
      function if the whole THP is only referenced by the single PTE mapping a
      single subpage (page_count() == 1); consequently, it's fine to modify the
      compound page mapping inside page_move_anon_rmap().
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-10-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      6c54dc6c
    • David Hildenbrand's avatar
      mm/rmap: drop "compound" parameter from page_add_new_anon_rmap() · 40f2bbf7
      David Hildenbrand authored
      New anonymous pages are always mapped natively: only THP/khugepaged code
      maps a new compound anonymous page and passes "true".  Otherwise, we're
      just dealing with simple, non-compound pages.
      
      Let's give the interface clearer semantics and document these.  Remove the
      PageTransCompound() sanity check from page_add_new_anon_rmap().
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-9-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      40f2bbf7
    • David Hildenbrand's avatar
      mm/rmap: pass rmap flags to hugepage_add_anon_rmap() · 28c5209d
      David Hildenbrand authored
      Let's prepare for passing RMAP_EXCLUSIVE, similarly as we do for
      page_add_anon_rmap() now.  RMAP_COMPOUND is implicit for hugetlb pages and
      ignored.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-8-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      28c5209d
    • David Hildenbrand's avatar
      mm/rmap: remove do_page_add_anon_rmap() · f1e2db12
      David Hildenbrand authored
      ... and instead convert page_add_anon_rmap() to accept flags.
      
      Passing flags instead of bools is usually nicer either way, and we want to
      more often also pass RMAP_EXCLUSIVE in follow up patches when detecting
      that an anonymous page is exclusive: for example, when restoring an
      anonymous page from a writable migration entry.
      
      This is a preparation for marking an anonymous page inside
      page_add_anon_rmap() as exclusive when RMAP_EXCLUSIVE is passed.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-7-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f1e2db12
    • David Hildenbrand's avatar
      mm/rmap: convert RMAP flags to a proper distinct rmap_t type · 14f9135d
      David Hildenbrand authored
      We want to pass the flags to more than one anon rmap function, getting rid
      of special "do_page_add_anon_rmap()".  So let's pass around a distinct
      __bitwise type and refine documentation.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-6-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      14f9135d
    • David Hildenbrand's avatar
      mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap() · fb3d824d
      David Hildenbrand authored
      ...  and move the special check for pinned pages into
      page_try_dup_anon_rmap() to prepare for tracking exclusive anonymous pages
      via a new pageflag, clearing it only after making sure that there are no
      GUP pins on the anonymous page.
      
      We really only care about pins on anonymous pages, because they are prone
      to getting replaced in the COW handler once mapped R/O.  For !anon pages
      in cow-mappings (!VM_SHARED && VM_MAYWRITE) we shouldn't really care about
      that, at least not that I could come up with an example.
      
      Let's drop the is_cow_mapping() check from page_needs_cow_for_dma(), as we
      know we're dealing with anonymous pages.  Also, drop the handling of
      pinned pages from copy_huge_pud() and add a comment if ever supporting
      anonymous pages on the PUD level.
      
      This is a preparation for tracking exclusivity of anonymous pages in the
      rmap code, and disallowing marking a page shared (-> failing to duplicate)
      if there are GUP pins on a page.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-5-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      fb3d824d
    • David Hildenbrand's avatar
      mm/memory: slightly simplify copy_present_pte() · b51ad4f8
      David Hildenbrand authored
      Let's move the pinning check into the caller, to simplify return code
      logic and prepare for further changes: relocating the
      page_needs_cow_for_dma() into rmap handling code.
      
      While at it, remove the unused pte parameter and simplify the comments a
      bit.
      
      No functional change intended.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-4-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      b51ad4f8
    • David Hildenbrand's avatar
      mm/hugetlb: take src_mm->write_protect_seq in copy_hugetlb_page_range() · 623a1ddf
      David Hildenbrand authored
      Let's do it just like copy_page_range(), taking the seqlock and making
      sure the mmap_lock is held in write mode.
      
      This allows for add a VM_BUG_ON to page_needs_cow_for_dma() and properly
      synchronizes concurrent fork() with GUP-fast of hugetlb pages, which will
      be relevant for further changes.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-3-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      623a1ddf
    • David Hildenbrand's avatar
      mm/rmap: fix missing swap_free() in try_to_unmap() after arch_unmap_one() failed · 322842ea
      David Hildenbrand authored
      Patch series "mm: COW fixes part 2: reliable GUP pins of anonymous pages", v4.
      
      This series is the result of the discussion on the previous approach [2]. 
      More information on the general COW issues can be found there.  It is
      based on latest linus/master (post v5.17, with relevant core-MM changes
      for v5.18-rc1).
      
      This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken
      on an anonymous page and COW logic fails to detect exclusivity of the page
      to then replacing the anonymous page by a copy in the page table: The GUP
      pin lost synchronicity with the pages mapped into the page tables.
      
      This issue, including other related COW issues, has been summarized in [3]
      under 3):
      "
        3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN)
      
        page_maybe_dma_pinned() is used to check if a page may be pinned for
        DMA (using FOLL_PIN instead of FOLL_GET).  While false positives are
        tolerable, false negatives are problematic: pages that are pinned for
        DMA must not be added to the swapcache.  If it happens, the (now pinned)
        page could be faulted back from the swapcache into page tables
        read-only.  Future write-access would detect the pinning and COW the
        page, losing synchronicity.  For the interested reader, this is nicely
        documented in feb889fb ("mm: don't put pinned pages into the swap
        cache").
      
        Peter reports [8] that page_maybe_dma_pinned() as used is racy in some
        cases and can result in a violation of the documented semantics: giving
        false negatives because of the race.
      
        There are cases where we call it without properly taking a per-process
        sequence lock, turning the usage of page_maybe_dma_pinned() racy.  While
        one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to
        handle, there is especially one rmap case (shrink_page_list) that's hard
        to fix: in the rmap world, we're not limited to a single process.
      
        The shrink_page_list() issue is really subtle.  If we race with
        someone pinning a page, we can trigger the same issue as in the FOLL_GET
        case.  See the detail section at the end of this mail on a discussion
        how bad this can bite us with VFIO or other FOLL_PIN user.
      
        It's harder to reproduce, but I managed to modify the O_DIRECT
        reproducer to use io_uring fixed buffers [15] instead, which ends up
        using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can
        similarly trigger a loss of synchronicity and consequently a memory
        corruption.
      
        Again, the root issue is that a write-fault on a page that has
        additional references results in a COW and thereby a loss of
        synchronicity and consequently a memory corruption if two parties
        believe they are referencing the same page.
      "
      
      This series makes GUP pins (R/O and R/W) on anonymous pages fully
      reliable, especially also taking care of concurrent pinning via GUP-fast,
      for example, also fully fixing an issue reported regarding NUMA balancing
      [4] recently.  While doing that, it further reduces "unnecessary COWs",
      especially when we don't fork()/KSM and don't swapout, and fixes the COW
      security for hugetlb for FOLL_PIN.
      
      In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped
      anonymous page is exclusive.  Exclusive anonymous pages that are mapped
      R/O can directly be mapped R/W by the COW logic in the write fault
      handler.  Exclusive anonymous pages that want to be shared (fork(), KSM)
      first have to be marked shared -- which will fail if there are GUP pins on
      the page.  GUP is only allowed to take a pin on anonymous pages that are
      exclusive.  The PT lock is the primary mechanism to synchronize
      modifications of PG_anon_exclusive.  We synchronize against GUP-fast
      either via the src_mm->write_protect_seq (during fork()) or via
      clear/invalidate+flush of the relevant page table entry.
      
      Special care has to be taken about swap, migration, and THPs (whereby a
      PMD-mapping can be converted to a PTE mapping and we have to track
      information for subpages).  Besides these, we let the rmap code handle
      most magic.  For reliable R/O pins of anonymous pages, we need
      FAULT_FLAG_UNSHARE logic as part of our previous approach [2], however,
      it's now 100% mapcount free and I further simplified it a bit.
      
        #1 is a fix
        #3-#10 are mostly rmap preparations for PG_anon_exclusive handling
        #11 introduces PG_anon_exclusive
        #12 uses PG_anon_exclusive and make R/W pins of anonymous pages
         reliable
        #13 is a preparation for reliable R/O pins
        #14 and #15 is reused/modified GUP-triggered unsharing for R/O GUP pins
         make R/O pins of anonymous pages reliable
        #16 adds sanity check when (un)pinning anonymous pages
      
      [1] https://lkml.kernel.org/r/20220131162940.210846-1-david@redhat.com
      [2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
      [3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
      [4] https://bugzilla.kernel.org/show_bug.cgi?id=215616
      
      
      This patch (of 17):
      
      In case arch_unmap_one() fails, we already did a swap_duplicate().  let's
      undo that properly via swap_free().
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20220428083441.37290-2-david@redhat.com
      Fixes: ca827d55 ("mm, swap: Add infrastructure for saving page metadata on swap")
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarKhalid Aziz <khalid.aziz@oracle.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      322842ea
  2. 29 Apr, 2022 30 commits