1. 10 May, 2022 1 commit
    • mm/rmap: fix missing swap_free() in try_to_unmap() after arch_unmap_one() failed · 322842ea
      David Hildenbrand authored
      Patch series "mm: COW fixes part 2: reliable GUP pins of anonymous pages", v4.
      
      This series is the result of the discussion on the previous approach [2]. 
      More information on the general COW issues can be found there.  It is
      based on latest linus/master (post v5.17, with relevant core-MM changes
      for v5.18-rc1).
      
      This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken
      on an anonymous page and the COW logic fails to detect exclusivity of the
      page, replacing the anonymous page by a copy in the page table: the GUP
      pin loses synchronicity with the pages mapped into the page tables.
      
      This issue, including other related COW issues, has been summarized in [3]
      under 3):
      "
        3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN)
      
        page_maybe_dma_pinned() is used to check if a page may be pinned for
        DMA (using FOLL_PIN instead of FOLL_GET).  While false positives are
        tolerable, false negatives are problematic: pages that are pinned for
        DMA must not be added to the swapcache.  If it happens, the (now pinned)
        page could be faulted back from the swapcache into page tables
        read-only.  Future write-access would detect the pinning and COW the
        page, losing synchronicity.  For the interested reader, this is nicely
        documented in feb889fb ("mm: don't put pinned pages into the swap
        cache").
      
        Peter reports [8] that page_maybe_dma_pinned() as used is racy in some
        cases and can result in a violation of the documented semantics: giving
        false negatives because of the race.
      
        There are cases where we call it without properly taking a per-process
        sequence lock, turning the usage of page_maybe_dma_pinned() racy.  While
        one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to
        handle, there is especially one rmap case (shrink_page_list) that's hard
        to fix: in the rmap world, we're not limited to a single process.
      
        The shrink_page_list() issue is really subtle.  If we race with
        someone pinning a page, we can trigger the same issue as in the FOLL_GET
        case.  See the detail section at the end of this mail on a discussion
        how bad this can bite us with VFIO or other FOLL_PIN user.
      
        It's harder to reproduce, but I managed to modify the O_DIRECT
        reproducer to use io_uring fixed buffers [15] instead, which ends up
        using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can
        similarly trigger a loss of synchronicity and consequently a memory
        corruption.
      
        Again, the root issue is that a write-fault on a page that has
        additional references results in a COW and thereby a loss of
        synchronicity and consequently a memory corruption if two parties
        believe they are referencing the same page.
      "
      
      This series makes GUP pins (R/O and R/W) on anonymous pages fully
      reliable, including under concurrent pinning via GUP-fast; for example,
      it fully fixes an issue recently reported regarding NUMA balancing [4].
      While doing that, it further reduces "unnecessary COWs", especially when
      we don't fork()/KSM and don't swapout, and fixes the COW security for
      hugetlb for FOLL_PIN.
      
      In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped
      anonymous page is exclusive.  Exclusive anonymous pages that are mapped
      R/O can directly be mapped R/W by the COW logic in the write fault
      handler.  Exclusive anonymous pages that want to be shared (fork(), KSM)
      first have to be marked shared -- which will fail if there are GUP pins on
      the page.  GUP is only allowed to take a pin on anonymous pages that are
      exclusive.  The PT lock is the primary mechanism to synchronize
      modifications of PG_anon_exclusive.  We synchronize against GUP-fast
      either via the src_mm->write_protect_seq (during fork()) or via
      clear/invalidate+flush of the relevant page table entry.
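
      The rules summarized above can be sketched as a small userspace toy
      model.  This is not kernel code: the struct and every toy_* name are
      hypothetical, and page table locking, GUP-fast synchronization, swap,
      migration and THP handling are deliberately left out.  It only shows how
      the exclusive flag gates pinning, sharing and page reuse on write faults.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model, NOT kernel code. All names are hypothetical. */
struct toy_anon_page {
	bool anon_exclusive;	/* models PG_anon_exclusive */
	int pins;		/* models outstanding GUP pins */
};

/* GUP is only allowed to take a pin on an exclusive anonymous page. */
static bool toy_try_pin(struct toy_anon_page *page)
{
	if (!page->anon_exclusive)
		return false;	/* assumption: caller has to unshare first */
	page->pins++;
	return true;
}

static void toy_unpin(struct toy_anon_page *page)
{
	assert(page->pins > 0);
	page->pins--;
}

/* fork()/KSM: marking the page shared fails while GUP pins exist. */
static bool toy_try_share(struct toy_anon_page *page)
{
	if (page->pins > 0)
		return false;	/* assumption: caller falls back, e.g. copies */
	page->anon_exclusive = false;
	return true;
}

/*
 * Write fault on a R/O mapping: an exclusive page can be mapped R/W
 * directly; otherwise the COW logic has to copy the page.
 */
static bool toy_write_fault_reuses_page(const struct toy_anon_page *page)
{
	return page->anon_exclusive;
}
```

      In this model a pinned page can never become shared, so a later write
      fault always reuses the pinned page instead of replacing it by a copy.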
      
      Special care has to be taken about swap, migration, and THPs (whereby a
      PMD-mapping can be converted to a PTE mapping and we have to track
      information for subpages).  Besides these, we let the rmap code handle
      most magic.  For reliable R/O pins of anonymous pages, we need the
      FAULT_FLAG_UNSHARE logic that was part of our previous approach [2];
      however, it's now 100% mapcount-free and I simplified it a bit further.
      
        #1 is a fix
        #3-#10 are mostly rmap preparations for PG_anon_exclusive handling
        #11 introduces PG_anon_exclusive
        #12 uses PG_anon_exclusive and makes R/W pins of anonymous pages
         reliable
        #13 is a preparation for reliable R/O pins
        #14 and #15 reuse/modify GUP-triggered unsharing for R/O GUP pins to
         make R/O pins of anonymous pages reliable
        #16 adds a sanity check when (un)pinning anonymous pages
      
      [1] https://lkml.kernel.org/r/20220131162940.210846-1-david@redhat.com
      [2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
      [3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
      [4] https://bugzilla.kernel.org/show_bug.cgi?id=215616
      
      
      This patch (of 17):
      
      In case arch_unmap_one() fails, we already did a swap_duplicate().  Let's
      undo that properly via swap_free().
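
      As an illustration only, the error-path pattern can be modeled in
      userspace as follows.  The toy_* functions and the refcount struct are
      hypothetical stand-ins for swap_duplicate(), swap_free() and
      arch_unmap_one(); this is a sketch of the idea, not the actual
      try_to_unmap() code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model with hypothetical names, not the kernel swap API. */
struct toy_swap_entry {
	int refcount;	/* models the swap entry's reference count */
};

static void toy_swap_duplicate(struct toy_swap_entry *entry)
{
	entry->refcount++;
}

static void toy_swap_free(struct toy_swap_entry *entry)
{
	assert(entry->refcount > 0);
	entry->refcount--;
}

/* Stand-in for arch_unmap_one(): negative return value means failure. */
static int toy_arch_unmap_one(bool fail)
{
	return fail ? -1 : 0;
}

/* Returns true if the page was unmapped, false if unmapping was aborted. */
static bool toy_unmap_one(struct toy_swap_entry *entry, bool arch_fails)
{
	toy_swap_duplicate(entry);	/* reference taken before the arch hook */

	if (toy_arch_unmap_one(arch_fails) < 0) {
		/* Undo the duplicate; without this the reference leaks. */
		toy_swap_free(entry);
		return false;
	}
	return true;
}
```

      With the undo in place, a failed attempt leaves the reference count
      unchanged, while a successful one keeps the extra reference.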
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20220428083441.37290-2-david@redhat.com
      Fixes: ca827d55 ("mm, swap: Add infrastructure for saving page metadata on swap")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 29 Apr, 2022 39 commits