1. 18 Aug, 2023 40 commits
    • mm: delete mmap_write_trylock() and vma_try_start_write() · cf95e337
      Hugh Dickins authored
      mmap_write_trylock() and vma_try_start_write() were added just for
      khugepaged, but now it has no use for them: delete.
      
      Link: https://lkml.kernel.org/r/4e6db3d-e8e-73fb-1f2a-8de2dab2a87c@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps() · d50791c2
      Hugh Dickins authored
      Now that retract_page_tables() can retract page tables reliably, without
      depending on trylocks, delete all the apparatus for khugepaged to try
      again later: khugepaged_collapse_pte_mapped_thps() etc; and free up the
      per-mm memory which was set aside for that in the khugepaged_mm_slot.
      
      But one part of that is worth keeping: when hpage_collapse_scan_file()
      found SCAN_PTE_MAPPED_HUGEPAGE, that address was noted in the mm_slot to
      be tried for retraction later - catching, for example, page tables where a
      reversible mprotect() of a portion had required splitting the pmd, but now
      it can be recollapsed.  Call collapse_pte_mapped_thp() directly in this
      case (why was it deferred before?  I assume an issue with needing
      mmap_lock for write, but now it's only needed for read).
      
      [hughd@google.com: fix mmap_locked handling]
        Link: https://lkml.kernel.org/r/bfc6cab2-497f-32bf-dd5-98dc1987e4a9@google.com
      Link: https://lkml.kernel.org/r/a5dce57-6dfa-5559-4698-e817eb2f993@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() · 1043173e
      Hugh Dickins authored
      Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().  It
      does need mmap_read_lock(), but it does not need mmap_write_lock(), nor
      vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing paths are
      relying on pte_offset_map_lock() and pmd_lock(), so use those.
      
      Follow the pattern in retract_page_tables(); and using pte_free_defer()
      removes most of the need for tlb_remove_table_sync_one() here; but call
      pmdp_get_lockless_sync() to use it in the PAE case.
      
      First check the VMA, in case page tables are being torn down: from JannH. 
      Confirm the preliminary find_pmd_or_thp_or_none() once page lock has been
      acquired and the page looks suitable: from then on its state is stable.
      
      However, collapse_pte_mapped_thp() was doing something others don't:
      freeing a page table still containing "valid" entries.  i_mmap lock did
      stop a racing truncate from double-freeing those pages, but we prefer
      collapse_pte_mapped_thp() to clear the entries as usual.  Their TLB flush
      can wait until the pmdp_collapse_flush() which follows, but the
      mmu_notifier_invalidate_range_start() has to be done earlier.
      
      Do the "step 1" checking loop without mmu_notifier: it wouldn't be good
      for khugepaged to keep on repeatedly invalidating a range which is then
      found unsuitable e.g.  contains COWs.  "step 2", which does the clearing,
      must then be more careful (after dropping ptl to do mmu_notifier), with
      abort prepared to correct the accounting like "step 3".  But with those
      entries now cleared, "step 4" (after dropping ptl to do pmd_lock) is kept
      safe by the huge page lock, which stops new PTEs from being faulted in.
      
      [hughd@google.com: don't set mmap_locked = true in madvise_collapse()]
        Link: https://lkml.kernel.org/r/d3d9ff14-ef8-8f84-e160-bfa1f5794275@google.com
      [hughd@google.com: use ptep_clear() instead of pte_clear()]
        Link: https://lkml.kernel.org/r/e0197433-8a47-6a65-534d-eda26eeb78b0@google.com
      Link: https://lkml.kernel.org/r/b53be6a4-7715-51f9-aad-f1347dcb7c4@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: retract_page_tables() without mmap or vma lock · 1d65b771
      Hugh Dickins authored
      Simplify shmem and file THP collapse's retract_page_tables(), and relax
      its locking: to improve its success rate and to lessen impact on others.
      
      Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of
      target_mm, leave that part of the work to madvise_collapse() calling
      collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s result
      code to arrange for that.  That spares retract_page_tables() four
      arguments; and since it will be successful in retracting all of the page
      tables expected of it, no need to track and return a result code itself.
      
      It needs i_mmap_lock_read(mapping) for traversing the vma interval tree,
      but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk()
      allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for
      THPs.  retract_page_tables() just needs to use those same spinlocks to
      exclude it briefly, while transitioning pmd from page table to none: so
      restore its use of pmd_lock() inside of which pte lock is nested.
      
      Users of pte_offset_map_lock() etc all now allow for them to fail: so
      retract_page_tables() now has no use for mmap_write_trylock() or
      vma_try_start_write().  In common with rmap and page_vma_mapped_walk(), it
      does not even need the mmap_read_lock().
      
      But those users do expect the page table to remain a good page table,
      until they unlock and rcu_read_unlock(): so the page table cannot be freed
      immediately, but rather by the recently added pte_free_defer().
      
      Use the (usually a no-op) pmdp_get_lockless_sync() to send an interrupt
      when PAE, and pmdp_collapse_flush() did not already do so: to make sure
      that the start,pmdp_get_lockless(),end sequence in __pte_offset_map()
      cannot pick up a pmd entry with mismatched pmd_low and pmd_high.
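
      In outline, the per-page-table step described above looks roughly like the
      sketch below (hypothetical code, not the exact patch: the mmu_notifier
      calls and mm_dec_nr_ptes() accounting are omitted, and the function name
      retract_one_page_table() is made up for illustration).

      static void retract_one_page_table(struct vm_area_struct *vma,
                                         pmd_t *pmd, unsigned long addr)
      {
              struct mm_struct *mm = vma->vm_mm;
              spinlock_t *pml, *ptl;
              pmd_t pgt_pmd;

              pml = pmd_lock(mm, pmd);
              ptl = pte_lockptr(mm, pmd);
              if (ptl != pml)
                      spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);

              pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);  /* pmd is now none */
              pmdp_get_lockless_sync();       /* kick PAE-style lockless pmd readers */

              if (ptl != pml)
                      spin_unlock(ptl);
              spin_unlock(pml);

              /* lockless pte_offset_map() users may still hold a pointer */
              pte_free_defer(mm, pmd_pgtable(pgt_pmd));
      }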
      
      retract_page_tables() can be enhanced to replace_page_tables(), which
      inserts the final huge pmd without mmap lock: going through an invalid
      state instead of pmd_none() followed by fault.  But that enhancement does
      raise some more questions: leave it until a later release.
      
      Link: https://lkml.kernel.org/r/f88970d9-d347-9762-ae6d-da978e8a4df@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/pgtable: add pte_free_defer() for pgtable as page · 13cf577e
      Hugh Dickins authored
      Add the generic pte_free_defer(), to call pte_free() via call_rcu(). 
      pte_free_defer() will be called inside khugepaged's retract_page_tables()
      loop, where allocating extra memory cannot be relied upon.  This version
      suits all those architectures which use an unfragmented page for one page
      table (none of whose pte_free()s use the mm arg which was passed to it).
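
      As a rough sketch of that shape (not necessarily the exact code, and
      assuming an architecture whose pgtable_t is a struct page pointer and
      whose pte_free() ignores its mm argument):

      static void pte_free_now(struct rcu_head *head)
      {
              struct page *page = container_of(head, struct page, rcu_head);

              /* mm is not used by pte_free() on the architectures covered here */
              pte_free(NULL, (pgtable_t)page);
      }

      void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
      {
              struct page *page = (struct page *)pgtable;

              call_rcu(&page->rcu_head, pte_free_now);
      }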
      
      Link: https://lkml.kernel.org/r/78e921b0-b681-a1b0-dc20-44c9efa4ef3c@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • s390: add pte_free_defer() for pgtables sharing page · 8211dad6
      Hugh Dickins authored
      Add s390-specific pte_free_defer(), to free table page via call_rcu(). 
      pte_free_defer() will be called inside khugepaged's retract_page_tables()
      loop, where allocating extra memory cannot be relied upon.  This precedes
      the generic version to avoid build breakage from incompatible pgtable_t.
      
      This version is more complicated than others: because s390 fits two 2K
      page tables into one 4K page (so page->rcu_head must be shared between
      both halves), and already uses page->lru (which page->rcu_head overlays)
      to list any free halves; with clever management by page->_refcount bits.
      
      Build upon the existing management, adjusted to follow a new rule: that a
      page is never on the free list if pte_free_defer() was used on either half
      (marked by PageActive).  And for simplicity, delay calling RCU until both
      halves are freed.
      
      Not adding back unallocated fragments to the list in pte_free_defer() can
      result in wasting some amount of memory for pagetables, depending on how
      long the allocated fragment will stay in use.  In practice, this effect is
      expected to be insignificant, and not justify a far more complex approach,
      which might allow adding the fragments back later in __tlb_remove_table(),
      where we might not have a stable mm any more.
      
      [hughd@google.com: Claudio finds warning on mm_has_pgste() more useful than on mm_alloc_pgste()]
        Link: https://lkml.kernel.org/r/3bc095ba-a180-ce3b-82b1-2bfc64612f3@google.com
      Link: https://lkml.kernel.org/r/94eccf5f-264c-8abe-4567-e77f4b4e14a@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Tested-by: Alexander Gordeev <agordeev@linux.ibm.com>
      Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • sparc: add pte_free_defer() for pte_t *pgtable_t · ad1ac8d9
      Hugh Dickins authored
      Add sparc-specific pte_free_defer(), to call pte_free() via call_rcu(). 
      pte_free_defer() will be called inside khugepaged's retract_page_tables()
      loop, where allocating extra memory cannot be relied upon.  This precedes
      the generic version to avoid build breakage from incompatible pgtable_t.
      
      sparc32 supports pagetables sharing a page, but does not support THP;
      sparc64 supports THP, but does not support pagetables sharing a page.  So
      the sparc-specific pte_free_defer() is as simple as the generic one,
      except for converting between pte_t *pgtable_t and struct page *.
      
      Link: https://lkml.kernel.org/r/dc4f318d-a66a-5622-dc44-9018ea814b37@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • powerpc: add pte_free_defer() for pgtables sharing page · 32cc0b7c
      Hugh Dickins authored
      Add powerpc-specific pte_free_defer(), to free table page via call_rcu(). 
      pte_free_defer() will be called inside khugepaged's retract_page_tables()
      loop, where allocating extra memory cannot be relied upon.  This precedes
      the generic version to avoid build breakage from incompatible pgtable_t.
      
      This is awkward because the struct page contains only one rcu_head, but
      that page may be shared between PTE_FRAG_NR pagetables, each wanting to
      use the rcu_head at the same time.  But powerpc never reuses a fragment
      once it has been freed: so mark the page Active in pte_free_defer(),
      before calling pte_fragment_free() directly; and there call_rcu() to
      pte_free_now() when last fragment is freed and the page is PageActive.
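
      Very roughly (a hypothetical sketch following the description above, not
      the exact patch), the defer side is just the marking, with the existing
      fragment-freeing path issuing call_rcu() once the last fragment of a
      PageActive page has been dropped:

      void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
      {
              struct page *page = virt_to_page(pgtable);

              /* never reuse any fragment of this page: it must go through RCU */
              SetPageActive(page);
              pte_fragment_free((unsigned long *)pgtable, 0);
      }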
      
      Link: https://lkml.kernel.org/r/6e3ca5f1-334d-4b14-b92d-fc8e99914fcb@google.com
      Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • powerpc: assert_pte_locked() use pte_offset_map_nolock() · 3d140215
      Hugh Dickins authored
      Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
      in assert_pte_locked().  BUG if pte_offset_map_nolock() fails.
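
      In outline (a sketch of the pattern, not the exact diff, given the mm, pmd
      and addr already derived earlier in the function), the tail of
      assert_pte_locked() then becomes something like:

              pte_t *pte;
              spinlock_t *ptl;

              pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
              BUG_ON(!pte);           /* the page table unexpectedly disappeared */
              assert_spin_locked(ptl);
              pte_unmap(pte);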
      
      This mod might cause new crashes: which either expose my ignorance, or
      indicate issues to be fixed, or limit the usage of assert_pte_locked().
      
      [hughd@google.com: assert_pte_locked() still needs the pmd_none() check]
        Link: https://lkml.kernel.org/r/c73d1543-532c-3da2-8cf2-a95363a14116@google.com
      Link: https://lkml.kernel.org/r/e8d56c95-c132-a82e-5f5f-7bb1b738b057@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm: adjust_pte() use pte_offset_map_nolock() · de2e4626
      Hugh Dickins authored
      Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
      in adjust_pte(): because it gives the not-locked ptl for precisely that
      pte, which the caller can then safely lock; whereas pte_lockptr() is not
      so tightly coupled, because it dereferences the pmd pointer again.
      
      Link: https://lkml.kernel.org/r/4d5258bd-ffa0-018-253a-25f2c9b783f7@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/pgtable: add PAE safety to __pte_offset_map() · 146b42e0
      Hugh Dickins authored
      There is a faint risk that __pte_offset_map(), on a 32-bit architecture
      with a 64-bit pmd_t e.g.  x86-32 with CONFIG_X86_PAE=y, would succeed on a
      pmdval assembled from a pmd_low and a pmd_high which never belonged
      together: their combination not pointing to a page table at all, perhaps
      not even a valid pfn.  pmdp_get_lockless() is not enough to prevent that.
      
      Guard against that (on such configs) by local_irq_save() blocking TLB
      flush between present updates, as linux/pgtable.h suggests.  It's only
      needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
      __pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
      lock, would just send it back to __pte_offset_map() again.
      
      Complement this pmdp_get_lockless_start() and pmdp_get_lockless_end(),
      used only locally in __pte_offset_map(), with a pmdp_get_lockless_sync()
      synonym for tlb_remove_table_sync_one(): to send the necessary interrupt
      at the right moment on those configs which do not already send it.
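
      The guarded read in __pte_offset_map() then has roughly this shape (a
      sketch; on configs which do not need the guard, the start/end helpers
      collapse to no-ops):

              pmd_t pmdval;

              /* block the TLB-flush IPI so pmd_low/pmd_high cannot be torn apart */
              pmdp_get_lockless_start();
              pmdval = pmdp_get_lockless(pmd);
              pmdp_get_lockless_end();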
      
      CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86. 
      It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that Will
      Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm.  It is
      not enabled by arc, but its pmd_t is 32-bit even when pte_t 64-bit.
      
      Limit the IRQ disablement to CONFIG_HIGHPTE?  Perhaps, but would need a
      little more work, to retry if pmd_low good for page table, but pmd_high
      non-zero from THP (and that might be making x86-specific assumptions).
      
      Link: https://lkml.kernel.org/r/3adcd8f-9191-2df1-d7ea-c4877698aad@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s · a349d72f
      Hugh Dickins authored
      Patch series "mm: free retracted page table by RCU", v3.
      
      Some mmap_lock avoidance i.e.  latency reduction.  Initially just for the
      case of collapsing shmem or file pages to THPs: the usefulness of
      MADV_COLLAPSE on shmem is being limited by that mmap_write_lock it
      currently requires.
      
      Likely to be relied upon later in other contexts e.g.  freeing of empty
      page tables (but that's not work I'm doing).  mmap_write_lock avoidance
      when collapsing to anon THPs?  Perhaps, but again that's not work I've
      done: a quick attempt was not as easy as the shmem/file case.
      
      These changes (though of course not these exact patches) have been in
      Google's data centre kernel for three years now: we do rely upon them.
      
      
      This patch (of 13):
      
      Before putting them to use (several commits later), add rcu_read_lock() to
      pte_offset_map(), and rcu_read_unlock() to pte_unmap().  Make this a
      separate commit, since it risks exposing imbalances: prior commits have
      fixed all the known imbalances, but we may find some have been missed.
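
      Schematically (a sketch only, assuming a CONFIG_HIGHPTE-style kmap of the
      pte page; the real helpers carry more detail and error handling):

      pte_t *pte_offset_map(pmd_t *pmd, unsigned long addr)
      {
              rcu_read_lock();        /* the pte table may now be freed only via RCU */
              return (pte_t *)kmap_local_page(pmd_page(*pmd)) + pte_index(addr);
      }

      void pte_unmap(pte_t *pte)
      {
              kunmap_local((void *)pte);
              rcu_read_unlock();      /* balances the rcu_read_lock() above */
      }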
      
      Link: https://lkml.kernel.org/r/7cd843a9-aa80-14f-5eb2-33427363c20@google.com
      Link: https://lkml.kernel.org/r/d3b01da5-2a6-833c-6681-67a3e024a16f@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: drop mas_first_entry() · 6783bd4b
      Peng Zhang authored
      The internal function mas_first_entry() is no longer used, so drop it.
      
      Link: https://lkml.kernel.org/r/20230711035444.526-9-zhangpeng.00@bytedance.com
      Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: replace mas_logical_pivot() with mas_safe_pivot() · 29b2681f
      Peng Zhang authored
      Replace mas_logical_pivot() with mas_safe_pivot() and drop
      mas_logical_pivot() since it won't be used anymore.  We can do this because
      all nodes now have a node limit pivot (unless the node is full).
      
      Link: https://lkml.kernel.org/r/20230711035444.526-8-zhangpeng.00@bytedance.com
      Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: update mt_validate() · a489539e
      Peng Zhang authored
      Instead of using mas_first_entry() to find the leftmost leaf, use a simple
      loop.  Remove an unneeded check for the root node.  To make the error
      message more accurate, check pivots first and then slots, because checking
      slots depends on the node limit pivot to break the loop.
      
      Link: https://lkml.kernel.org/r/20230711035444.526-7-zhangpeng.00@bytedance.com
      Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: make mas_validate_limits() check root node and node limit · 33af39d0
      Peng Zhang authored
      Update mas_validate_limits() to check the root node, to check the node
      limit pivot if there is enough room for it to exist, and to check data_end.
      Remove the check for child existence as it is done in
      mas_validate_child_slot().
      
      Link: https://lkml.kernel.org/r/20230711035444.526-6-zhangpeng.00@bytedance.com
      Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: fix mas_validate_child_slot() to check last missed slot · e93fda5a
      Peng Zhang authored
      Don't break the loop before checking the last slot.  Also check here
      whether non-leaf nodes are missing children.
      
      Link: https://lkml.kernel.org/r/20230711035444.526-5-zhangpeng.00@bytedance.com
      Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: make mas_validate_gaps() to check metadata · f8e5eac8
      Peng Zhang authored
      Make mas_validate_gaps() check whether the offset in the metadata points
      to the largest gap.  By the way, simplify this function.
      
      Add the verification that gaps beyond the node limit are zero.
      
      Link: https://lkml.kernel.org/r/20230711035444.526-4-zhangpeng.00@bytedance.com
      Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: don't use MAPLE_ARANGE64_META_MAX to indicate no gap · d695c30a
      Peng Zhang authored
      Patch series "Improve the validation for maple tree and some cleanup", v2.
      
      
      This patch (of 7):
      
      Do not use a special offset to indicate that there is no gap.  When there
      is no gap, offset can point to any valid slot because its gap is 0.
      
      Link: https://lkml.kernel.org/r/20230711035444.526-1-zhangpeng.00@bytedance.com
      Link: https://lkml.kernel.org/r/20230711035444.526-3-zhangpeng.00@bytedance.com
      Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory: pass folio into do_page_mkwrite() · 86aa6998
      Sidhartha Kumar authored
      Saves one implicit call to compound_head().
      
      I'm not sure if I should change the name of the function to
      do_folio_mkwrite() and update the description comment to reference a folio
      as the vm_op is still called page_mkwrite.
      
      
      Link: https://lkml.kernel.org/r/20230711053544.156617-1-sidhartha.kumar@oracle.com
      Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory-failure: fix race window when trying to get hugetlb folio · d31155b8
      Miaohe Lin authored
      page_folio() is fetched before calling get_hwpoison_hugetlb_folio(),
      without hugetlb_lock being held.  So the hugetlb page could be demoted
      after page_folio() is fetched but before get_hwpoison_hugetlb_folio()
      takes hugetlb_lock.  In that case get_hwpoison_hugetlb_folio() will hold
      an unexpected extra refcount on the hugetlb folio while leaving the
      demoted page un-refcounted.
      
      Link: https://lkml.kernel.org/r/20230711055016.2286677-9-linmiaohe@huawei.com
      Fixes: 25182f05 ("mm,hwpoison: fix race with hugetlb page allocation")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory-failure: fetch compound head after extra page refcnt is held · a363d122
      Miaohe Lin authored
      The page might become a thp or hugetlb page, or be split, after the
      compound head is fetched but before the page refcount is bumped.  So hpage
      might be a tail page, leading to VM_BUG_ON_PAGE(PageTail(page)) in
      PageTransHuge().
      
      Link: https://lkml.kernel.org/r/20230711055016.2286677-8-linmiaohe@huawei.com
      Fixes: 415c64c1 ("mm/memory-failure: split thp earlier in memory error handling")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory-failure: minor cleanup for comments and codestyle · 5885c6a6
      Miaohe Lin authored
      Fix some wrong function names and grammar errors in comments.  Also remove
      an unneeded space after for_each_process.  No functional change intended.
      
      Link: https://lkml.kernel.org/r/20230711055016.2286677-7-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory-failure: remove unneeded header files · e9c36f7a
      Miaohe Lin authored
      Remove some unneeded header files. No functional change intended.
      
      Link: https://lkml.kernel.org/r/20230711055016.2286677-6-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory-failure: use local variable huge to check hugetlb page · 55c7ac45
      Miaohe Lin authored
      Use the local variable huge to check whether the page is a hugetlb page,
      avoiding multiple calls to PageHuge() and so saving CPU cycles.  PageHuge()
      will be stable while the extra page refcount is held.
      
      Link: https://lkml.kernel.org/r/20230711055016.2286677-5-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory-failure: don't account hwpoison_filter() filtered pages · 80ee7cb2
      Miaohe Lin authored
      mf_generic_kill_procs() will return -EOPNOTSUPP when hwpoison_filter()
      has filtered out the dax page.  In that case, action_result() isn't
      expected to be called to update mf_stats.  This results in inaccurate but
      benign memory failure handling statistics.
      
      Link: https://lkml.kernel.org/r/20230711055016.2286677-4-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory-failure: ensure moving HWPoison flag to the raw error pages · 92a025a7
      Miaohe Lin authored
      If hugetlb_vmemmap_optimized is enabled, folio_clear_hugetlb_hwpoison()
      called from try_memory_failure_hugetlb() won't transfer HWPoison flag to
      subpages while folio's HWPoison flag is cleared.  So when trying to free
      this hugetlb page into buddy, folio_clear_hugetlb_hwpoison() is not called
      to move HWPoison flag from head page to the raw error pages even if now
      hugetlb_vmemmap_optimized is cleared.  This results in the HWPoisoned
      page being used again and a raw_hwp_page leak.
      
      Link: https://lkml.kernel.org/r/20230711055016.2286677-3-linmiaohe@huawei.com
      Fixes: ac5fcde0 ("mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory-failure: remove unneeded PageHuge() check · dbe70dbb
      Miaohe Lin authored
      Patch series "A few fixup and cleanup patches for memory-failure", v2.
      
      This series contains a few fixup patches to fix inaccurate mf_stats, fix
      race window when trying to get hugetlb folio and so on.  Also there is
      minor cleanup for comments and codestyle.  More details can be found in
      the respective changelogs.
      
      
      This patch (of 8):
      
      PageHuge() check in me_huge_page() is just for potential problems.  Remove
      it as it's actually dead code and won't catch anything.
      
      Link: https://lkml.kernel.org/r/20230711055016.2286677-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20230711055016.2286677-2-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory_hotplug: document the signal_pending() check in offline_pages() · de7cb03d
      David Hildenbrand authored
      Let's update the documentation to state that any signal is sufficient, and
      add a comment that not checking only for fatal signals is historical
      baggage: changing it now could break existing user space, although that is
      unlikely.
      
      For example, when an app provides a custom SIGALRM handler and triggers
      memory offlining, the timeout cmd would no longer stop memory offlining,
      because SIGALRM would no longer be considered a fatal signal.
      
      Note that using signal_pending() instead of fatal_signal_pending() is
      an anti-pattern, but slowly deprecating that behavior to eventually
      change it in the far future is probably not worth the effort.  If this
      ever becomes relevant for user-space, we might want to rethink.
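
      For reference, the check being documented is of this schematic form (a
      sketch; the surrounding retry loop and error handling in offline_pages()
      are abbreviated here):

              if (signal_pending(current)) {
                      /* any pending signal stops offlining, not only fatal ones */
                      ret = -EINTR;
                      break;
              }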
      
      Link: https://lkml.kernel.org/r/20230711174050.603820-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      de7cb03d
    • Randy Dunlap's avatar
      HWPOISON: offline support: fix spelling in Documentation/ABI/ · d0366880
      Randy Dunlap authored
      Correct spelling problems as identified by codespell.
      
      Link: https://lkml.kernel.org/r/20230710052223.18254-1-rdunlap@infradead.org
      Fixes: facb6011 ("HWPOISON: Add soft page offline support")
      Signed-off-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      d0366880
    • Haifeng Xu's avatar
      mm/mm_init.c: mark check_for_memory() as __init · b894da04
      Haifeng Xu authored
      The only caller of check_for_memory() is free_area_init(), which is
      annotated with __init, so it should be safe to also mark the former as
      __init.
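
      As a minimal sketch of the pattern (illustrative names only, not the
      actual mm_init.c code; include <linux/init.h> for __init): a helper whose
      only caller is an __init function can itself be marked __init, so its text
      is freed after boot.

        /* Illustrative only: helper_init() is called solely from boot code,
         * so both functions can live in .init.text and be discarded later. */
        static void __init helper_init(void)
        {
                /* one-time boot checks go here */
        }

        void __init subsystem_boot_init(void)
        {
                helper_init();  /* only caller, also __init */
        }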
      
      Link: https://lkml.kernel.org/r/20230710093750.1294-1-haifeng.xu@shopee.com
      Signed-off-by: default avatarHaifeng Xu <haifeng.xu@shopee.com>
      Reviewed-by: default avatarMike Rapoport (IBM) <rppt@kernel.org>
      Reviewed-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      b894da04
    • Sergey Senozhatsky's avatar
      zsmalloc: remove obj_tagged() · f9044f17
      Sergey Senozhatsky authored
      obj_tagged() is not needed at this point, because objects can only have
      one tag: OBJ_ALLOCATED_TAG.  We needed obj_tagged() for the zsmalloc LRU
      implementation, which has now been removed.  Simplify zsmalloc code and
      revert to the previous implementation that was in place before the
      zsmalloc LRU series.
      
      Link: https://lkml.kernel.org/r/20230709025817.3842416-1-senozhatsky@chromium.org
      Signed-off-by: default avatarSergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: default avatarNhat Pham <nphamcs@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f9044f17
    • Axel Rasmussen's avatar
      selftests/mm: add uffd unit test for UFFDIO_POISON · 99aa7721
      Axel Rasmussen authored
      The test is pretty basic, and exercises UFFDIO_POISON straightforwardly. 
      We register a region with userfaultfd, in missing fault mode.  For each
      fault, we either UFFDIO_COPY a zeroed page (odd pages) or UFFDIO_POISON
      (even pages).  We do this mix to test "something like a real use case",
      where guest memory would be some mix of poisoned and non-poisoned pages.
      
      We read each page in the region, and assert that the odd pages are zeroed
      as expected, and the even pages yield a SIGBUS as expected.
      
      Why UFFDIO_COPY instead of UFFDIO_ZEROPAGE?  Because hugetlb doesn't
      support UFFDIO_ZEROPAGE, and we don't want to have special case code.
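
      A rough sketch of the per-fault decision described above (this is not the
      selftest code itself; uffd setup, error handling, and variable names such
      as fault_addr, area_start, zero_buf and uffd are simplified assumptions):

        #include <sys/ioctl.h>
        #include <linux/userfaultfd.h>

        unsigned long pgoff = (fault_addr - area_start) / page_size;

        if (pgoff % 2) {
                /* odd page: install a zeroed page via UFFDIO_COPY */
                struct uffdio_copy copy = {
                        .dst = fault_addr & ~(page_size - 1),
                        .src = (unsigned long)zero_buf,
                        .len = page_size,
                };
                ioctl(uffd, UFFDIO_COPY, &copy);
        } else {
                /* even page: mark it poisoned so any later access SIGBUSes */
                struct uffdio_poison poison = {
                        .range = { .start = fault_addr & ~(page_size - 1),
                                   .len = page_size },
                };
                ioctl(uffd, UFFDIO_POISON, &poison);
        }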
      
      Link: https://lkml.kernel.org/r/20230707215540.2324998-9-axelrasmussen@google.com
      Signed-off-by: default avatarAxel Rasmussen <axelrasmussen@google.com>
      Acked-by: default avatarPeter Xu <peterx@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gaosheng Cui <cuigaosheng1@huawei.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: T.J. Alumbaugh <talumbau@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: ZhangPeng <zhangpeng362@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      99aa7721
    • Axel Rasmussen's avatar
      selftests/mm: refactor uffd_poll_thread to allow custom fault handlers · 7cf0f9e8
      Axel Rasmussen authored
      Previously, we had "one fault handler to rule them all", which used
      several branches to deal with all of the scenarios required by all of the
      various tests.
      
      In upcoming patches, I plan to add a new test, which has its own slightly
      different fault handling logic.  Instead of continuing to add cruft to the
      existing fault handler, let's allow tests to define custom ones, separate
      from other tests.
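
      The shape of the refactor is roughly the following (hypothetical names,
      not the selftest's actual structs): the poll thread dispatches to a
      per-test handler if one is supplied, falling back to the shared default.

        /* Hypothetical sketch of a pluggable fault handler; struct uffd_msg
         * comes from <linux/userfaultfd.h>. */
        typedef void (*fault_handler_fn)(struct uffd_msg *msg, void *test_data);

        struct poll_args {
                int uffd;
                fault_handler_fn handle_fault;  /* NULL: use the default handler */
                void *test_data;
        };

        static void dispatch_fault(struct poll_args *args, struct uffd_msg *msg)
        {
                if (args->handle_fault)
                        args->handle_fault(msg, args->test_data);    /* per-test logic */
                else
                        default_fault_handler(msg, args->test_data); /* shared branches */
        }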
      
      Link: https://lkml.kernel.org/r/20230707215540.2324998-8-axelrasmussen@google.com
      Signed-off-by: default avatarAxel Rasmussen <axelrasmussen@google.com>
      Acked-by: default avatarPeter Xu <peterx@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gaosheng Cui <cuigaosheng1@huawei.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: T.J. Alumbaugh <talumbau@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: ZhangPeng <zhangpeng362@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      7cf0f9e8
    • Axel Rasmussen's avatar
      mm: userfaultfd: document and enable new UFFDIO_POISON feature · f442ab50
      Axel Rasmussen authored
      Update the userfaultfd API to advertise this feature as part of feature
      flags and supported ioctls (returned upon registration).
      
      Add basic documentation describing the new feature.
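
      For example, userspace would probe for the new capability along these
      lines (a sketch, assuming the feature bit added by this series is named
      UFFD_FEATURE_POISON; the fallback helper is hypothetical):

        struct uffdio_api api = {
                .api      = UFFD_API,
                .features = UFFD_FEATURE_POISON,
        };

        /* An old kernel either fails the ioctl for the unknown bit or leaves
         * the bit clear in the returned feature mask. */
        if (ioctl(uffd, UFFDIO_API, &api) ||
            !(api.features & UFFD_FEATURE_POISON))
                fall_back_to_non_poison_handling();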
      
      Link: https://lkml.kernel.org/r/20230707215540.2324998-7-axelrasmussen@google.com
      Signed-off-by: default avatarAxel Rasmussen <axelrasmussen@google.com>
      Acked-by: default avatarPeter Xu <peterx@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gaosheng Cui <cuigaosheng1@huawei.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: T.J. Alumbaugh <talumbau@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: ZhangPeng <zhangpeng362@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f442ab50
    • Axel Rasmussen's avatar
      mm: userfaultfd: support UFFDIO_POISON for hugetlbfs · 8a13897f
      Axel Rasmussen authored
      The behavior here is the same as it is for anon/shmem.  This is done
      separately because hugetlb pte marker handling is a bit different.
      
      Link: https://lkml.kernel.org/r/20230707215540.2324998-6-axelrasmussen@google.com
      Signed-off-by: default avatarAxel Rasmussen <axelrasmussen@google.com>
      Acked-by: default avatarPeter Xu <peterx@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gaosheng Cui <cuigaosheng1@huawei.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: T.J. Alumbaugh <talumbau@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: ZhangPeng <zhangpeng362@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      8a13897f
    • Hugh Dickins's avatar
      mm: userfaultfd: add new UFFDIO_POISON ioctl: fix · 597425df
      Hugh Dickins authored
      Smatch has observed that pte_offset_map_lock() is now allowed to fail, and
      that when it fails, ptl should not be unlocked.  Use -EAGAIN here like
      elsewhere.
      
      Link: https://lkml.kernel.org/r/bc7bba61-d34f-ad3a-ccf1-c191585ef851@google.com
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reviewed-by: default avatarAxel Rasmussen <axelrasmussen@google.com>
      Cc: Dan Carpenter <dan.carpenter@linaro.org>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      597425df
    • Axel Rasmussen's avatar
      mm: userfaultfd: add new UFFDIO_POISON ioctl · fc71884a
      Axel Rasmussen authored
      The basic idea here is to "simulate" memory poisoning for VMs.  A VM
      running on some host might encounter a memory error, after which some
      page(s) are poisoned (i.e., future accesses SIGBUS).  They expect that
      once poisoned, pages can never become "un-poisoned".  So, when we live
      migrate the VM, we need to preserve the poisoned status of these pages.
      
      When live migrating, we try to get the guest running on its new host as
      quickly as possible.  So, we start it running before all memory has been
      copied, and before we're certain which pages should be poisoned or not.
      
      So the basic way to use this new feature is:
      
      - On the new host, the guest's memory is registered with userfaultfd, in
        either MISSING or MINOR mode (doesn't really matter for this purpose).
      - On any first access, we get a userfaultfd event. At this point we can
        communicate with the old host to find out if the page was poisoned.
      - If so, we can respond with a UFFDIO_POISON - this places a swap marker
        so any future accesses will SIGBUS. Because the pte is now "present",
        future accesses won't generate more userfaultfd events, they'll just
        SIGBUS directly.
      
      UFFDIO_POISON does not handle unmapping previously-present PTEs.  This
      isn't needed, because during live migration we want to intercept all
      accesses with userfaultfd (not just writes, so WP mode isn't useful for
      this).  So whether minor or missing mode is being used (or both), the PTE
      won't be present in any case, so handling that case isn't needed.
      
      Similarly, UFFDIO_POISON won't replace existing PTE markers.  This might
      be okay to do, but it seems to be safer to just refuse to overwrite any
      existing entry (like a UFFD_WP PTE marker).
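
      Put together, the resolver on the new host would look roughly like this
      (a sketch assuming the uffdio_poison layout introduced here, a range plus
      a mode word; the query to the old host, error handling, and the
      fetch_and_copy_page() helper are elided or hypothetical):

        static int resolve_first_access(int uffd, unsigned long addr,
                                        size_t page_size, bool poisoned_on_source)
        {
                if (poisoned_on_source) {
                        struct uffdio_poison poison = {
                                .range = { .start = addr, .len = page_size },
                                .mode  = 0,
                        };
                        /* Installs a poison marker: the pte becomes "present",
                         * so later accesses SIGBUS directly instead of raising
                         * more userfaultfd events. */
                        return ioctl(uffd, UFFDIO_POISON, &poison);
                }

                /* Otherwise fetch the contents from the old host and install
                 * them with UFFDIO_COPY. */
                return fetch_and_copy_page(uffd, addr, page_size);
        }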
      
      Link: https://lkml.kernel.org/r/20230707215540.2324998-5-axelrasmussen@google.com
      Signed-off-by: default avatarAxel Rasmussen <axelrasmussen@google.com>
      Acked-by: default avatarPeter Xu <peterx@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gaosheng Cui <cuigaosheng1@huawei.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: T.J. Alumbaugh <talumbau@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: ZhangPeng <zhangpeng362@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      fc71884a
    • Axel Rasmussen's avatar
      mm: userfaultfd: extract file size check out into a helper · 435cdb41
      Axel Rasmussen authored
      This code is already duplicated twice, and UFFDIO_POISON will do the same
      check a third time.  So, it's worth extracting into a helper to save
      repetitive lines of code.
      
      Link: https://lkml.kernel.org/r/20230707215540.2324998-4-axelrasmussen@google.com
      Signed-off-by: default avatarAxel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: default avatarPeter Xu <peterx@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gaosheng Cui <cuigaosheng1@huawei.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: T.J. Alumbaugh <talumbau@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: ZhangPeng <zhangpeng362@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      435cdb41
    • Axel Rasmussen's avatar
      mm: userfaultfd: check for start + len overflow in validate_range · 2ef5d724
      Axel Rasmussen authored
      Most userfaultfd ioctls take a `start + len` range as an argument.  We
      have the validate_range helper to check that such ranges are valid. 
      However, some (but not all!) ioctls *also* check that `start + len`
      doesn't wrap around (overflow).
      
      Just check for this in validate_range.  This saves some repetitive code,
      and adds the check to some ioctls which weren't bothering to check for it
      before.
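
      The centralized check is roughly of the following shape (a simplified
      sketch, not the exact kernel code; the wrap-around test at the end is the
      part this patch applies everywhere):

        static int validate_range_sketch(__u64 task_size, __u64 start, __u64 len)
        {
                if ((start | len) & ~PAGE_MASK)
                        return -EINVAL;        /* must be page aligned */
                if (!len || start >= task_size || len > task_size - start)
                        return -EINVAL;        /* empty or outside the address space */
                if (start + len <= start)
                        return -EINVAL;        /* start + len wrapped around */
                return 0;
        }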
      
      [axelrasmussen@google.com: call validate_range() on the src range too]
        Link: https://lkml.kernel.org/r/20230714182932.2608735-1-axelrasmussen@google.com
      [axelrasmussen@google.com: fix src/dst validation]
        Link: https://lkml.kernel.org/r/20230810192128.1855570-1-axelrasmussen@google.com
      Link: https://lkml.kernel.org/r/20230707215540.2324998-3-axelrasmussen@google.com
      Signed-off-by: default avatarAxel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: default avatarPeter Xu <peterx@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gaosheng Cui <cuigaosheng1@huawei.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: T.J. Alumbaugh <talumbau@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: ZhangPeng <zhangpeng362@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      2ef5d724