1. 18 Aug, 2023 40 commits
    • selftests/mm: add tests for HWPOISON hugetlbfs read · ba91e7e5
      Jiaqi Yan authored
      Add tests for the improvement made to read operation on HWPOISON
      hugetlb page with different read granularities. For each chunk size,
      three read scenarios are tested:
      1. Simple regression test on read without HWPOISON.
      2. Sequential read page by page should succeed until it encounters the
         first raw HWPOISON subpage.
      3. After skipping a raw HWPOISON subpage with lseek, read()s always succeed.
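
The sequential-read scenario can be pictured with a small userspace helper. This is only a sketch and not the selftest itself: `read_in_chunks` and its parameters are hypothetical names, and it assumes a poisoned subpage shows up as a failing read() with errno set to EIO.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Read up to `total` bytes from `fd` using `chunk`-sized read() calls.
 * Returns the number of bytes read before EOF or the first error.
 * On error *err holds errno (EIO is what a HWPOISON subpage is expected
 * to produce); otherwise *err is 0.
 */
static size_t read_in_chunks(int fd, size_t chunk, size_t total, int *err)
{
	char *buf = malloc(chunk);
	size_t done = 0;

	*err = 0;
	if (!buf) {
		*err = ENOMEM;
		return 0;
	}
	while (done < total) {
		size_t want = total - done < chunk ? total - done : chunk;
		ssize_t n = read(fd, buf, want);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, try again */
			*err = errno;
			break;
		}
		if (n == 0)
			break;			/* EOF */
		done += (size_t)n;
	}
	free(buf);
	return done;
}
```

For scenario 3, a caller would lseek() past the failing subpage and call the helper again from the new offset.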
      
      Link: https://lkml.kernel.org/r/20230713001833.3778937-5-jiaqiyan@google.com
      Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlbfs: improve read HWPOISON hugepage · 38c1ddbd
      Jiaqi Yan authored
      When a hugepage contains HWPOISON pages, read() fails to read any byte of
      the hugepage and returns -EIO, although many bytes in the HWPOISON
      hugepage are readable.
      
      Improve this by allowing hugetlbfs_read_iter to return as many bytes as
      possible.  For a requested range [offset, offset + len) that contains a
      HWPOISON page, return [offset, first HWPOISON page addr); the next read
      attempt will fail and return -EIO.
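
The range rule above can be modelled in a few lines. This is a sketch of the described semantics, not the kernel implementation: `readable_bytes`, the `poisoned` array, and the 4096-byte `SUBPAGE_SIZE` are assumptions made for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SUBPAGE_SIZE 4096UL	/* assumed base page size */

/*
 * Return how many bytes of [offset, offset + len) can be read before the
 * first poisoned subpage, or -EIO if the very first requested byte lies in
 * a poisoned subpage.  poisoned[i] is true when subpage i is HWPOISON.
 */
static long readable_bytes(const bool *poisoned, size_t nr_subpages,
			   size_t offset, size_t len)
{
	size_t limit = nr_subpages * SUBPAGE_SIZE;
	size_t end = offset + len;
	size_t pos = offset;

	if (end > limit)
		end = limit;		/* clamp to the end of the hugepage */
	while (pos < end) {
		size_t idx = pos / SUBPAGE_SIZE;
		size_t next = (idx + 1) * SUBPAGE_SIZE;

		if (poisoned[idx])
			break;		/* stop at the first poisoned subpage */
		pos = next < end ? next : end;
	}
	if (pos == offset && pos < end)
		return -EIO;		/* nothing readable at this offset */
	return (long)(pos - offset);
}
```

With subpage 2 poisoned, a read of the whole hugepage from offset 0 yields 8192 bytes, and a follow-up read starting at offset 8192 yields -EIO, which matches the two-step behaviour described above.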
      
      Link: https://lkml.kernel.org/r/20230713001833.3778937-4-jiaqiyan@google.com
      Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hwpoison: check if a raw page in a hugetlb folio is raw HWPOISON · b79f8eb4
      Jiaqi Yan authored
      Add the function is_raw_hwpoison_page_in_hugepage to tell whether a raw
      page in a hugetlb folio is HWPOISON.  This function relies on
      RawHwpUnreliable not being set; otherwise the hugepage's raw HWPOISON
      list becomes meaningless.
      
      is_raw_hwpoison_page_in_hugepage holds mf_mutex in order to synchronize
      with folio_set_hugetlb_hwpoison and folio_free_raw_hwp, which iterate,
      insert, or delete entries in raw_hwp_list.  llist itself doesn't ensure
      that insertion and removal are synchronized with the llist_for_each_entry
      used by is_raw_hwpoison_page_in_hugepage (unless the iterated entries are
      already deleted from the list).  Callers can minimize the overhead of lock
      cycles by first checking the HWPOISON flag of the folio.
      
      Export this function so that it can be used immediately in the read
      operation for hugetlbfs.
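
The "check the cheap flag first, then take the mutex to walk the list" pattern can be sketched in userspace with a pthread mutex. None of this is kernel code: `struct raw_entry`, `struct huge_folio`, `mf_lock`, and `subpage_is_poisoned` are hypothetical stand-ins, and treating every subpage as poisoned when the list is unreliable is an assumption chosen here as the conservative answer.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct raw_entry {			/* hypothetical stand-in for a raw_hwp_list entry */
	size_t subpage;
	struct raw_entry *next;
};

struct huge_folio {			/* hypothetical stand-in for a hugetlb folio */
	bool hwpoison;			/* folio-level flag, cheap to test */
	bool raw_unreliable;		/* list is incomplete and cannot be trusted */
	struct raw_entry *raw_list;
};

static pthread_mutex_t mf_lock = PTHREAD_MUTEX_INITIALIZER;

static bool subpage_is_poisoned(struct huge_folio *f, size_t subpage)
{
	bool found = false;

	if (!f->hwpoison)		/* fast path: no lock cycle needed */
		return false;

	pthread_mutex_lock(&mf_lock);	/* excludes writers while we iterate */
	if (f->raw_unreliable) {
		found = true;		/* assumed conservative answer */
	} else {
		for (struct raw_entry *e = f->raw_list; e; e = e->next) {
			if (e->subpage == subpage) {
				found = true;
				break;
			}
		}
	}
	pthread_mutex_unlock(&mf_lock);
	return found;
}
```

The early return is what keeps the common, healthy case free of lock traffic; only folios already flagged as HWPOISON pay for the list walk.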
      
      Link: https://lkml.kernel.org/r/20230713001833.3778937-3-jiaqiyan@google.com
      Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hwpoison: delete all entries before traversal in __folio_free_raw_hwp · 9e130c4b
      Jiaqi Yan authored
      Patch series "Improve hugetlbfs read on HWPOISON hugepages", v4.
      
      Today when hardware memory is corrupted in a hugetlb hugepage, the kernel
      leaves the hugepage in the pagecache [1]; otherwise future mmap or read
      would be subject to silent data corruption.  This is implemented by
      returning -EIO from hugetlbfs_read_iter immediately if the hugepage has
      the HWPOISON flag set.
      
      Since memory_failure already tracks the raw HWPOISON subpages in a
      hugepage, a natural improvement is possible: if userspace only asks for
      healthy subpages in the pagecache, kernel can return these data.
      
      This patchset implements this improvement.  It consists of three parts.
      The 1st commit exports the functionality to tell if a subpage inside a
      hugetlb hugepage is a raw HWPOISON page.  The 2nd commit teaches
      hugetlbfs_read_iter to return as many healthy bytes as possible.  The 3rd
      commit properly tests this new feature.
      
      [1] commit 8625147c ("hugetlbfs: don't delete error page from pagecache")
      
      
      This patch (of 4):
      
      Traversal on llist (e.g.  llist_for_each_safe) is only safe AFTER entries
      are deleted from the llist.  Correct the way __folio_free_raw_hwp deletes
      and frees raw_hwp_page entries in raw_hwp_list: first llist_del_all, then
      kfree within llist_for_each_safe.
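
The "detach everything, then traverse and free" order can be illustrated with a userspace lock-free list built on C11 atomics. This is an analogue written for illustration, not the kernel's llist: `lf_push`, `lf_del_all`, and `lf_free_all` are hypothetical names that only mirror the roles of llist_add, llist_del_all, and the corrected free path.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

struct node {
	int value;
	struct node *next;
};

struct lf_list {
	_Atomic(struct node *) first;
};

/* Lock-free push, similar in spirit to llist_add(). */
static int lf_push(struct lf_list *l, int value)
{
	struct node *n = malloc(sizeof(*n));

	if (!n)
		return -1;
	n->value = value;
	n->next = atomic_load(&l->first);
	while (!atomic_compare_exchange_weak(&l->first, &n->next, n))
		;	/* n->next now holds the current head; retry */
	return 0;
}

/* Detach the whole chain in one atomic step, like llist_del_all(). */
static struct node *lf_del_all(struct lf_list *l)
{
	return atomic_exchange(&l->first, (struct node *)NULL);
}

/* Free every entry: detach first, then walk the now-private chain. */
static size_t lf_free_all(struct lf_list *l)
{
	struct node *n = lf_del_all(l);
	size_t freed = 0;

	while (n) {
		struct node *next = n->next;	/* read before free() */

		free(n);
		n = next;
		freed++;
	}
	return freed;
}
```

Once the chain has been detached, no concurrent pusher can reach its nodes, so the walk needs no further synchronization.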
      
      As of today, concurrent adding, deleting, and traversal on raw_hwp_list
      from hugetlb.c and/or memory-failure.c are fine with each other.  Note
      this is guaranteed partly by the lock-free nature of llist, and partly by
      holding hugetlb_lock and/or mf_mutex.  For example, as llist_del_all is
      lock-free with itself, folio_clear_hugetlb_hwpoison()s from
      __update_and_free_hugetlb_folio and memory_failure won't need explicit
      locking when freeing the raw_hwp_list.  New code that manipulates
      raw_hwp_list must be careful to ensure the concurrency correctness.
      
      Link: https://lkml.kernel.org/r/20230713001833.3778937-1-jiaqiyan@google.com
      Link: https://lkml.kernel.org/r/20230713001833.3778937-2-jiaqiyan@google.com
      Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mmap: move vma operations to mm_struct out of the critical section of file mapping lock · 6852c46c
      Yu Ma authored
      UnixBench/Execl represents a class of workload where bash scripts are
      spawned frequently to do some short jobs.  When running multiple parallel
      tasks, hot osq_lock is observed from do_mmap and exit_mmap.  Both of them
      come from load_elf_binary through the call chain
      "execl->do_execveat_common->bprm_execve->load_elf_binary".
      
      do_mmap calls mmap_region to create a vma node, initialize it, and
      insert it into both the vma maintenance structure in mm_struct and the
      i_mmap tree of the mapped file, then increments map_count to record the
      number of vma nodes in use.  The hot osq_lock protects operations on the
      file's i_mmap tree.  Changes to mm_struct members, such as the vma
      insertion and the map_count update, do not affect the i_mmap tree.  Move
      those operations out of the lock's critical section to reduce the hold
      time on the lock.
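
The idea of shrinking the critical section can be shown with a userspace analogue. This is a sketch under stated assumptions, not the mmap code: `struct shared_index` stands in for the file's i_mmap tree, `struct private_mm` for mm_struct, and the per-mm state is assumed to be protected by a lock the caller already holds.

```c
#include <pthread.h>

struct shared_index {			/* stand-in for the file's i_mmap tree */
	pthread_mutex_t lock;
	int nr_mappings;
};

struct private_mm {			/* stand-in for mm_struct, owned by the caller */
	int map_count;
};

static void map_region(struct private_mm *mm, struct shared_index *idx)
{
	/* Per-mm bookkeeping does not touch the shared index: no shared lock. */
	mm->map_count++;

	/* Only the update of the shared structure stays in the critical section. */
	pthread_mutex_lock(&idx->lock);
	idx->nr_mappings++;
	pthread_mutex_unlock(&idx->lock);
}
```

Every instruction removed from between the lock and unlock calls shortens the time other tasks spend spinning or sleeping on the shared lock.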
      
      With this change, on an Intel Sapphire Rapids 112C/224T platform, based
      on v6.0-rc6, the 160 parallel score improves by 12%.  The patch has no
      obvious performance gain on v6.5-rc1 because of a regression of this
      benchmark introduced by commit f1a79412 ("mm: convert mm's rss stats into
      percpu_counter").  The related discussion and conclusion can be found in
      the mail thread initiated by 0day:
      https://lore.kernel.org/linux-mm/a4aa2e13-7187-600b-c628-7e8fb108def0@intel.com/
      
      Link: https://lkml.kernel.org/r/20230712145739.604215-1-yu.ma@intel.com
      Signed-off-by: Yu Ma <yu.ma@intel.com>
      Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A . Shutemov <kirill@shutemov.name>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Zhu, Lipeng <lipeng.zhu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: remove clear_page_idle() · 73e791d7
      Xueshi Hu authored
      All callers have now been converted to call folio_clear_idle().
      
      Link: https://lkml.kernel.org/r/20230712134959.145373-1-xueshi.hu@smartx.com
      Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Charan Teja Kalla <quic_charante@quicinc.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/pgtable: notes on pte_offset_map[_lock]() · 610d0657
      Hugh Dickins authored
      Add a block of comments on pte_offset_map_lock(), pte_offset_map() and
      pte_offset_map_nolock() to mm/pgtable-generic.c, to help explain them.
      
      Link: https://lkml.kernel.org/r/b791c3b0-25c6-a263-d785-d564344eb644@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: delete mmap_write_trylock() and vma_try_start_write() · cf95e337
      Hugh Dickins authored
      mmap_write_trylock() and vma_try_start_write() were added just for
      khugepaged, but now it has no use for them: delete.
      
      Link: https://lkml.kernel.org/r/4e6db3d-e8e-73fb-1f2a-8de2dab2a87c@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps() · d50791c2
      Hugh Dickins authored
      Now that retract_page_tables() can retract page tables reliably, without
      depending on trylocks, delete all the apparatus for khugepaged to try
      again later: khugepaged_collapse_pte_mapped_thps() etc; and free up the
      per-mm memory which was set aside for that in the khugepaged_mm_slot.
      
      But one part of that is worth keeping: when hpage_collapse_scan_file()
      found SCAN_PTE_MAPPED_HUGEPAGE, that address was noted in the mm_slot to
      be tried for retraction later - catching, for example, page tables where a
      reversible mprotect() of a portion had required splitting the pmd, but now
      it can be recollapsed.  Call collapse_pte_mapped_thp() directly in this
      case (why was it deferred before?  I assume an issue with needing
      mmap_lock for write, but now it's only needed for read).
      
      [hughd@google.com: fix mmap_locked handling]
        Link: https://lkml.kernel.org/r/bfc6cab2-497f-32bf-dd5-98dc1987e4a9@google.com
      Link: https://lkml.kernel.org/r/a5dce57-6dfa-5559-4698-e817eb2f993@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() · 1043173e
      Hugh Dickins authored
      Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().  It
      does need mmap_read_lock(), but it does not need mmap_write_lock(), nor
      vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing paths are
      relying on pte_offset_map_lock() and pmd_lock(), so use those.
      
      Follow the pattern in retract_page_tables(); and using pte_free_defer()
      removes most of the need for tlb_remove_table_sync_one() here; but call
      pmdp_get_lockless_sync() to use it in the PAE case.
      
      First check the VMA, in case page tables are being torn down: from JannH. 
      Confirm the preliminary find_pmd_or_thp_or_none() once page lock has been
      acquired and the page looks suitable: from then on its state is stable.
      
      However, collapse_pte_mapped_thp() was doing something others don't:
      freeing a page table still containing "valid" entries.  i_mmap lock did
      stop a racing truncate from double-freeing those pages, but we prefer
      collapse_pte_mapped_thp() to clear the entries as usual.  Their TLB flush
      can wait until the pmdp_collapse_flush() which follows, but the
      mmu_notifier_invalidate_range_start() has to be done earlier.
      
      Do the "step 1" checking loop without mmu_notifier: it wouldn't be good
      for khugepaged to keep on repeatedly invalidating a range which is then
      found unsuitable e.g.  contains COWs.  "step 2", which does the clearing,
      must then be more careful (after dropping ptl to do mmu_notifier), with
      abort prepared to correct the accounting like "step 3".  But with those
      entries now cleared, "step 4" (after dropping ptl to do pmd_lock) is kept
      safe by the huge page lock, which stops new PTEs from being faulted in.
      
      [hughd@google.com: don't set mmap_locked = true in madvise_collapse()]
        Link: https://lkml.kernel.org/r/d3d9ff14-ef8-8f84-e160-bfa1f5794275@google.com
      [hughd@google.com: use ptep_clear() instead of pte_clear()]
        Link: https://lkml.kernel.org/r/e0197433-8a47-6a65-534d-eda26eeb78b0@google.com
      Link: https://lkml.kernel.org/r/b53be6a4-7715-51f9-aad-f1347dcb7c4@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: retract_page_tables() without mmap or vma lock · 1d65b771
      Hugh Dickins authored
      Simplify shmem and file THP collapse's retract_page_tables(), and relax
      its locking: to improve its success rate and to lessen impact on others.
      
      Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of
      target_mm, leave that part of the work to madvise_collapse() calling
      collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s result
      code to arrange for that.  That spares retract_page_tables() four
      arguments; and since it will be successful in retracting all of the page
      tables expected of it, no need to track and return a result code itself.
      
      It needs i_mmap_lock_read(mapping) for traversing the vma interval tree,
      but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk()
      allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for
      THPs.  retract_page_tables() just needs to use those same spinlocks to
      exclude it briefly, while transitioning pmd from page table to none: so
      restore its use of pmd_lock() inside of which pte lock is nested.
      
      Users of pte_offset_map_lock() etc all now allow for them to fail: so
      retract_page_tables() now has no use for mmap_write_trylock() or
      vma_try_start_write().  In common with rmap and page_vma_mapped_walk(), it
      does not even need the mmap_read_lock().
      
      But those users do expect the page table to remain a good page table,
      until they unlock and rcu_read_unlock(): so the page table cannot be freed
      immediately, but rather by the recently added pte_free_defer().
      
      Use the (usually a no-op) pmdp_get_lockless_sync() to send an interrupt
      when PAE, and pmdp_collapse_flush() did not already do so: to make sure
      that the start,pmdp_get_lockless(),end sequence in __pte_offset_map()
      cannot pick up a pmd entry with mismatched pmd_low and pmd_high.
      
      retract_page_tables() can be enhanced to replace_page_tables(), which
      inserts the final huge pmd without mmap lock: going through an invalid
      state instead of pmd_none() followed by fault.  But that enhancement does
      raise some more questions: leave it until a later release.
      
      Link: https://lkml.kernel.org/r/f88970d9-d347-9762-ae6d-da978e8a4df@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/pgtable: add pte_free_defer() for pgtable as page · 13cf577e
      Hugh Dickins authored
      Add the generic pte_free_defer(), to call pte_free() via call_rcu(). 
      pte_free_defer() will be called inside khugepaged's retract_page_tables()
      loop, where allocating extra memory cannot be relied upon.  This version
      suits all those architectures which use an unfragmented page for one page
      table (none of whose pte_free()s use the mm arg which was passed to it).
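
Why deferring the free needs no extra memory can be seen in a single-threaded model of the callback-head pattern. This is only an illustration and not the kernel's RCU: `defer_call` and `grace_period_end` are hypothetical stand-ins for call_rcu() and the end of a grace period, and `struct page_table` is an invented object with the callback head embedded in it.

```c
#include <stddef.h>

struct defer_head {			/* plays the role of struct rcu_head */
	struct defer_head *next;
	void (*func)(struct defer_head *head);
};

struct page_table {			/* hypothetical page-table object */
	struct defer_head head;		/* embedded: deferring allocates nothing */
	int freed;
};

static struct defer_head *pending;	/* callbacks waiting for the grace period */

/* Queue a callback; a single-threaded stand-in for call_rcu(). */
static void defer_call(struct defer_head *head,
		       void (*func)(struct defer_head *head))
{
	head->func = func;
	head->next = pending;
	pending = head;
}

/* Run the queued callbacks; stands in for the end of a grace period. */
static int grace_period_end(void)
{
	int ran = 0;

	while (pending) {
		struct defer_head *head = pending;

		pending = head->next;	/* unlink before running the callback */
		head->func(head);
		ran++;
	}
	return ran;
}

/* Callback: recover the enclosing object from its embedded head. */
static void table_free_now(struct defer_head *head)
{
	struct page_table *pt = (struct page_table *)
		((char *)head - offsetof(struct page_table, head));

	pt->freed = 1;			/* a real version would release the page here */
}

static void table_free_defer(struct page_table *pt)
{
	defer_call(&pt->head, table_free_now);
}
```

Because the head lives inside the object being freed, queuing the deferred free cannot fail for lack of memory, which matters in a loop where allocation cannot be relied upon.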
      
      Link: https://lkml.kernel.org/r/78e921b0-b681-a1b0-dc20-44c9efa4ef3c@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • s390: add pte_free_defer() for pgtables sharing page · 8211dad6
      Hugh Dickins authored
      Add s390-specific pte_free_defer(), to free table page via call_rcu(). 
      pte_free_defer() will be called inside khugepaged's retract_page_tables()
      loop, where allocating extra memory cannot be relied upon.  This precedes
      the generic version to avoid build breakage from incompatible pgtable_t.
      
      This version is more complicated than others: because s390 fits two 2K
      page tables into one 4K page (so page->rcu_head must be shared between
      both halves), and already uses page->lru (which page->rcu_head overlays)
      to list any free halves; with clever management by page->_refcount bits.
      
      Build upon the existing management, adjusted to follow a new rule: that a
      page is never on the free list if pte_free_defer() was used on either half
      (marked by PageActive).  And for simplicity, delay calling RCU until both
      halves are freed.
      
      Not adding back unallocated fragments to the list in pte_free_defer() can
      result in wasting some amount of memory for pagetables, depending on how
      long the allocated fragment will stay in use.  In practice, this effect is
      expected to be insignificant, and not to justify a far more complex approach,
      which might allow the fragments to be added back later in __tlb_remove_table(),
      where we might not have a stable mm any more.
      
      [hughd@google.com: Claudio finds warning on mm_has_pgste() more useful than on mm_alloc_pgste()]
        Link: https://lkml.kernel.org/r/3bc095ba-a180-ce3b-82b1-2bfc64612f3@google.com
      Link: https://lkml.kernel.org/r/94eccf5f-264c-8abe-4567-e77f4b4e14a@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Tested-by: Alexander Gordeev <agordeev@linux.ibm.com>
      Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8211dad6
    • Hugh Dickins's avatar
      sparc: add pte_free_defer() for pte_t *pgtable_t · ad1ac8d9
      Hugh Dickins authored
      Add sparc-specific pte_free_defer(), to call pte_free() via call_rcu(). 
      pte_free_defer() will be called inside khugepaged's retract_page_tables()
      loop, where allocating extra memory cannot be relied upon.  This precedes
      the generic version to avoid build breakage from incompatible pgtable_t.
      
      sparc32 supports pagetables sharing a page, but does not support THP;
      sparc64 supports THP, but does not support pagetables sharing a page.  So
      the sparc-specific pte_free_defer() is as simple as the generic one,
      except for converting between pte_t *pgtable_t and struct page *.
      
      Link: https://lkml.kernel.org/r/dc4f318d-a66a-5622-dc44-9018ea814b37@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ad1ac8d9
    • Hugh Dickins's avatar
      powerpc: add pte_free_defer() for pgtables sharing page · 32cc0b7c
      Hugh Dickins authored
      Add powerpc-specific pte_free_defer(), to free table page via call_rcu(). 
      pte_free_defer() will be called inside khugepaged's retract_page_tables()
      loop, where allocating extra memory cannot be relied upon.  This precedes
      the generic version to avoid build breakage from incompatible pgtable_t.
      
      This is awkward because the struct page contains only one rcu_head, but
      that page may be shared between PTE_FRAG_NR pagetables, each wanting to
      use the rcu_head at the same time.  But powerpc never reuses a fragment
      once it has been freed: so mark the page Active in pte_free_defer(),
      before calling pte_fragment_free() directly; and there call_rcu() to
      pte_free_now() when last fragment is freed and the page is PageActive.
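
      The rule described above can be sketched as a small userspace model.
      This is only an illustration under stated assumptions: struct
      table_page, fragment_free(), fragment_free_defer() and FRAG_NR are
      hypothetical names, the flags stand in for PageActive and for queuing
      the shared rcu_head, and no real RCU or page allocator is involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are the kernel's. */
#define FRAG_NR 4            /* stand-in for PTE_FRAG_NR */

struct table_page {
	int  frags_in_use;   /* fragments not yet freed */
	bool active;         /* stand-in for PageActive: a deferred free was requested */
	bool rcu_queued;     /* the single shared rcu_head has been queued */
	bool freed_now;      /* page was released immediately */
};

static void table_page_init(struct table_page *p)
{
	p->frags_in_use = FRAG_NR;
	p->active = false;
	p->rcu_queued = false;
	p->freed_now = false;
}

/* Model of freeing one fragment; only the last free releases the page. */
static void fragment_free(struct table_page *p)
{
	assert(p->frags_in_use > 0);
	if (--p->frags_in_use > 0)
		return;
	if (p->active)
		p->rcu_queued = true;   /* would be call_rcu() in the kernel */
	else
		p->freed_now = true;    /* no deferral was requested */
}

/* Model of the deferred variant: mark the page, then free the fragment. */
static void fragment_free_defer(struct table_page *p)
{
	p->active = true;
	fragment_free(p);
}
```

      Because a freed fragment is never handed out again in this model, the
      single rcu_queued slot is used at most once per page, which is the
      property that lets one rcu_head serve all fragments.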
      
      Link: https://lkml.kernel.org/r/6e3ca5f1-334d-4b14-b92d-fc8e99914fcb@google.com
      Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      32cc0b7c
    • Hugh Dickins's avatar
      powerpc: assert_pte_locked() use pte_offset_map_nolock() · 3d140215
      Hugh Dickins authored
      Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
      in assert_pte_locked().  BUG if pte_offset_map_nolock() fails.
      
      This mod might cause new crashes: which either expose my ignorance, or
      indicate issues to be fixed, or limit the usage of assert_pte_locked().
      
      [hughd@google.com: assert_pte_locked() still needs the pmd_none() check]
        Link: https://lkml.kernel.org/r/c73d1543-532c-3da2-8cf2-a95363a14116@google.com
      Link: https://lkml.kernel.org/r/e8d56c95-c132-a82e-5f5f-7bb1b738b057@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3d140215
    • Hugh Dickins's avatar
      arm: adjust_pte() use pte_offset_map_nolock() · de2e4626
      Hugh Dickins authored
      Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
      in adjust_pte(): because it gives the not-locked ptl for precisely that
      pte, which the caller can then safely lock; whereas pte_lockptr() is not
      so tightly coupled, because it dereferences the pmd pointer again.
      
      Link: https://lkml.kernel.org/r/4d5258bd-ffa0-018-253a-25f2c9b783f7@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      de2e4626
    • Hugh Dickins's avatar
      mm/pgtable: add PAE safety to __pte_offset_map() · 146b42e0
      Hugh Dickins authored
      There is a faint risk that __pte_offset_map(), on a 32-bit architecture
      with a 64-bit pmd_t e.g.  x86-32 with CONFIG_X86_PAE=y, would succeed on a
      pmdval assembled from a pmd_low and a pmd_high which never belonged
      together: their combination not pointing to a page table at all, perhaps
      not even a valid pfn.  pmdp_get_lockless() is not enough to prevent that.
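
      The kind of mismatch described above can be shown with a deterministic
      userspace sketch. It is an illustration of a torn read in general, not
      the kernel's code: struct split_entry, entry_store() and
      entry_load_torn() are hypothetical names, and the "concurrent" update
      is injected between the two half reads so that the interleaving is
      repeatable.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 64-bit entry stored as two 32-bit halves. */
struct split_entry {
	uint32_t low;
	uint32_t high;
};

static void entry_store(struct split_entry *e, uint64_t val)
{
	e->low  = (uint32_t)val;
	e->high = (uint32_t)(val >> 32);
}

/*
 * Read the low half, let a "concurrent" update run, then read the high half.
 * The update is passed in explicitly so the interleaving is deterministic.
 */
static uint64_t entry_load_torn(struct split_entry *e, uint64_t racing_val)
{
	uint32_t low = e->low;
	entry_store(e, racing_val);         /* the writer wins the race here */
	uint32_t high = e->high;
	return ((uint64_t)high << 32) | low;
}
```

      With an old value of 0x0000000111111000 and a racing value of
      0x0000000222222000, the reader assembles 0x0000000211111000, which
      matches neither: a low half and a high half that never belonged
      together.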
      
      Guard against that (on such configs) by local_irq_save() blocking TLB
      flush between present updates, as linux/pgtable.h suggests.  It's only
      needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
      __pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
      lock, would just send it back to __pte_offset_map() again.
      
      Complement this pmdp_get_lockless_start() and pmdp_get_lockless_end(),
      used only locally in __pte_offset_map(), with a pmdp_get_lockless_sync()
      synonym for tlb_remove_table_sync_one(): to send the necessary interrupt
      at the right moment on those configs which do not already send it.
      
      CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86. 
      It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that Will
      Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm.  It is
      not enabled by arc, but its pmd_t is 32-bit even when pte_t 64-bit.
      
      Limit the IRQ disablement to CONFIG_HIGHPTE?  Perhaps, but would need a
      little more work, to retry if pmd_low good for page table, but pmd_high
      non-zero from THP (and that might be making x86-specific assumptions).
      
      Link: https://lkml.kernel.org/r/3adcd8f-9191-2df1-d7ea-c4877698aad@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      146b42e0
    • Hugh Dickins's avatar
      mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s · a349d72f
      Hugh Dickins authored
      Patch series "mm: free retracted page table by RCU", v3.
      
      Some mmap_lock avoidance i.e.  latency reduction.  Initially just for the
      case of collapsing shmem or file pages to THPs: the usefulness of
      MADV_COLLAPSE on shmem is being limited by that mmap_write_lock it
      currently requires.
      
      Likely to be relied upon later in other contexts e.g.  freeing of empty
      page tables (but that's not work I'm doing).  mmap_write_lock avoidance
      when collapsing to anon THPs?  Perhaps, but again that's not work I've
      done: a quick attempt was not as easy as the shmem/file case.
      
      These changes (though of course not these exact patches) have been in
      Google's data centre kernel for three years now: we do rely upon them.
      
      
      This patch (of 13):
      
      Before putting them to use (several commits later), add rcu_read_lock() to
      pte_offset_map(), and rcu_read_unlock() to pte_unmap().  Make this a
      separate commit, since it risks exposing imbalances: prior commits have
      fixed all the known imbalances, but we may find some have been missed.
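
      What an imbalance looks like can be sketched in userspace, assuming a
      plain counter in place of the RCU read-side nesting depth. All names
      here (toy_rcu_depth, toy_map(), toy_unmap(), toy_leaky_read()) are
      hypothetical and model only the pairing of a map helper with its
      unmap helper.

```c
#include <assert.h>

/* Hypothetical counter standing in for RCU read-side nesting depth. */
static int toy_rcu_depth;

static int toy_table[4];

/* Model of a map helper that enters a read-side section. */
static int *toy_map(int idx)
{
	toy_rcu_depth++;
	return &toy_table[idx];
}

/* Model of the matching unmap helper that leaves it. */
static void toy_unmap(int *entry)
{
	(void)entry;
	assert(toy_rcu_depth > 0);
	toy_rcu_depth--;
}

/* A caller with an early return that forgets to unmap: an imbalance. */
static int toy_leaky_read(int idx)
{
	int *entry = toy_map(idx);
	if (*entry == 0)
		return -1;      /* bug: toy_unmap() skipped */
	int val = *entry;
	toy_unmap(entry);
	return val;
}
```

      On the normal path the depth returns to zero; on the early-return path
      it stays at one, which is the sort of leftover that pairing the lock
      with the map helper would expose.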
      
      Link: https://lkml.kernel.org/r/7cd843a9-aa80-14f-5eb2-33427363c20@google.com
      Link: https://lkml.kernel.org/r/d3b01da5-2a6-833c-6681-67a3e024a16f@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a349d72f
    • Peng Zhang's avatar
      maple_tree: drop mas_first_entry() · 6783bd4b
      Peng Zhang authored
      The internal function mas_first_entry() is no longer used, so drop it.
      
      Link: https://lkml.kernel.org/r/20230711035444.526-9-zhangpeng.00@bytedance.com
      Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6783bd4b
    • Peng Zhang's avatar
      maple_tree: replace mas_logical_pivot() with mas_safe_pivot() · 29b2681f
      Peng Zhang authored
      Replace mas_logical_pivot() with mas_safe_pivot() and drop
      mas_logical_pivot() since it won't be used anymore.  We can do this because
      every node now has a node limit pivot (unless the node is full).
      
      Link: https://lkml.kernel.org/r/20230711035444.526-8-zhangpeng.00@bytedance.com
      Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      29b2681f
    • Peng Zhang's avatar
      maple_tree: update mt_validate() · a489539e
      Peng Zhang authored
      Use a simple loop to find the leftmost leaf instead of
      mas_first_entry().  Remove an unneeded check for the root node.  To make the
      error message more accurate, check pivots first and then slots, because
      checking slots depends on the node limit pivot to break the loop.
      
      Link: https://lkml.kernel.org/r/20230711035444.526-7-zhangpeng.00@bytedance.com
      Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a489539e
    • Peng Zhang's avatar
      maple_tree: make mas_validate_limits() check root node and node limit · 33af39d0
      Peng Zhang authored
      Update mas_validate_limits() to check root node, check node limit pivot if
      there is enough room for it to exist and check data_end.  Remove the check
      for child existence as it is done in mas_validate_child_slot().
      
      Link: https://lkml.kernel.org/r/20230711035444.526-6-zhangpeng.00@bytedance.com
      Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      33af39d0
    • Peng Zhang's avatar
      maple_tree: fix mas_validate_child_slot() to check last missed slot · e93fda5a
      Peng Zhang authored
      Don't break the loop before checking the last slot.  Also here check if
      non-leaf nodes are missing children.
      
      Link: https://lkml.kernel.org/r/20230711035444.526-5-zhangpeng.00@bytedance.com
      Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e93fda5a
    • Peng Zhang's avatar
      maple_tree: make mas_validate_gaps() to check metadata · f8e5eac8
      Peng Zhang authored
      Make mas_validate_gaps() check whether the offset in the metadata points
      to the largest gap.  By the way, simplify this function.
      
      Add the verification that gaps beyond the node limit are zero.
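
      The two checks named above can be sketched on a toy node. This is a
      minimal illustration, assuming a flat array of gaps, and it is not the
      maple tree's real node layout: struct toy_node, toy_validate_gaps(),
      SLOTS, data_end and meta_gap are hypothetical names chosen for the
      example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SLOTS 8   /* hypothetical node width, not the maple tree's */

/* Hypothetical node: only the fields the two checks need. */
struct toy_node {
	unsigned long gaps[SLOTS];
	size_t data_end;      /* index of the last slot in use */
	size_t meta_gap;      /* recorded offset of the largest gap */
};

/* Returns true when both invariants from the commit message hold. */
static bool toy_validate_gaps(const struct toy_node *n)
{
	unsigned long max_gap = 0;
	size_t i;

	if (n->data_end >= SLOTS || n->meta_gap > n->data_end)
		return false;

	for (i = 0; i <= n->data_end; i++)
		if (n->gaps[i] > max_gap)
			max_gap = n->gaps[i];

	/* The recorded offset must name a slot holding the largest gap. */
	if (n->gaps[n->meta_gap] != max_gap)
		return false;

	/* Slots past the end of the data must report no gap. */
	for (i = n->data_end + 1; i < SLOTS; i++)
		if (n->gaps[i] != 0)
			return false;

	return true;
}
```

      When every gap in use is zero, any in-range offset passes the first
      check in this sketch, since the slot it names also holds the largest
      (zero) gap.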
      
      Link: https://lkml.kernel.org/r/20230711035444.526-4-zhangpeng.00@bytedance.com
      Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f8e5eac8
    • Peng Zhang's avatar
      maple_tree: don't use MAPLE_ARANGE64_META_MAX to indicate no gap · d695c30a
      Peng Zhang authored
      Patch series "Improve the validation for maple tree and some cleanup", v2.
      
      
      This patch (of 7):
      
      Do not use a special offset to indicate that there is no gap.  When there
      is no gap, offset can point to any valid slots because its gap is 0.
      
      Link: https://lkml.kernel.org/r/20230711035444.526-1-zhangpeng.00@bytedance.com
      Link: https://lkml.kernel.org/r/20230711035444.526-3-zhangpeng.00@bytedance.com
      Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d695c30a
    • Sidhartha Kumar's avatar
      mm/memory: pass folio into do_page_mkwrite() · 86aa6998
      Sidhartha Kumar authored
      Saves one implicit call to compound_head().
      
      I'm not sure if I should change the name of the function to
      do_folio_mkwrite() and update the description comment to reference a folio
      as the vm_op is still called page_mkwrite.
      
      
      Link: https://lkml.kernel.org/r/20230711053544.156617-1-sidhartha.kumar@oracle.com
      Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      86aa6998
    • Miaohe Lin's avatar
      mm: memory-failure: fix race window when trying to get hugetlb folio · d31155b8
      Miaohe Lin authored
      page_folio() is fetched before calling get_hwpoison_hugetlb_folio(),
      without hugetlb_lock being held.  So the hugetlb page could be demoted
      after page_folio() is fetched but before get_hwpoison_hugetlb_folio()
      takes hugetlb_lock.  get_hwpoison_hugetlb_folio() would then hold an
      unexpected extra refcnt on the hugetlb folio while leaving the demoted
      page un-refcnted.
      
      Link: https://lkml.kernel.org/r/20230711055016.2286677-9-linmiaohe@huawei.com
      Fixes: 25182f05 ("mm,hwpoison: fix race with hugetlb page allocation")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d31155b8
    • Miaohe Lin's avatar
      mm: memory-failure: fetch compound head after extra page refcnt is held · a363d122
      Miaohe Lin authored
      The page might become a THP or a huge page, or be split, after the
      compound head is fetched but before the page refcnt is bumped.  So hpage
      might be a tail page, leading to VM_BUG_ON_PAGE(PageTail(page)) in
      PageTransHuge().
      
      Link: https://lkml.kernel.org/r/20230711055016.2286677-8-linmiaohe@huawei.com
      Fixes: 415c64c1 ("mm/memory-failure: split thp earlier in memory error handling")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a363d122
    • Miaohe Lin's avatar
      mm: memory-failure: minor cleanup for comments and codestyle · 5885c6a6
      Miaohe Lin authored
      Fix some wrong function names and grammar errors in comments. Also remove
      the unneeded space after for_each_process. No functional change intended.
      
      Link: https://lkml.kernel.org/r/20230711055016.2286677-7-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5885c6a6
    • Miaohe Lin's avatar
      mm: memory-failure: remove unneeded header files · e9c36f7a
      Miaohe Lin authored
      Remove some unneeded header files. No functional change intended.
      
      Link: https://lkml.kernel.org/r/20230711055016.2286677-6-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e9c36f7a
    • Miaohe Lin's avatar
      mm: memory-failure: use local variable huge to check hugetlb page · 55c7ac45
      Miaohe Lin authored
      Use the local variable huge to check whether the page is a hugetlb page.
      This avoids calling PageHuge() multiple times and saves CPU cycles.
      PageHuge() is stable while the extra page refcnt is held.
      
      Link: https://lkml.kernel.org/r/20230711055016.2286677-5-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      55c7ac45
    • Miaohe Lin's avatar
      mm: memory-failure: don't account hwpoison_filter() filtered pages · 80ee7cb2
      Miaohe Lin authored
      mf_generic_kill_procs() returns -EOPNOTSUPP when hwpoison_filter()
      filters out a dax page.  In that case, action_result() is not expected
      to be called to update mf_stats; calling it anyway results in inaccurate
      but benign memory failure handling statistics.
      
      Link: https://lkml.kernel.org/r/20230711055016.2286677-4-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      80ee7cb2
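The accounting rule in the commit above (do not update the statistics when the handler reports -EOPNOTSUPP for a filtered page) can be sketched as follows. This is a minimal userspace model, not the kernel code: `struct toy_mf_stats`, `toy_action_result()` and `toy_finish_dax_failure()` are hypothetical names, and the two counters only approximate what mf_stats tracks.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical counters standing in for the kernel's mf_stats. */
struct toy_mf_stats {
    unsigned long recovered;
    unsigned long failed;
};

/* Stand-in for action_result(): account one handled memory failure. */
static void toy_action_result(struct toy_mf_stats *stats, int rc)
{
    if (rc)
        stats->failed++;
    else
        stats->recovered++;
}

/*
 * rc is the value a handler such as mf_generic_kill_procs() would return.
 * A filtered page (-EOPNOTSUPP) was never handled, so it is not counted.
 */
static int toy_finish_dax_failure(struct toy_mf_stats *stats, int rc)
{
    if (rc != -EOPNOTSUPP)
        toy_action_result(stats, rc);
    return rc;
}

/* Usage: only the handled outcomes change the counters. */
int toy_demo_accounting(void)
{
    struct toy_mf_stats stats = { 0, 0 };

    toy_finish_dax_failure(&stats, -EOPNOTSUPP);  /* filtered: no change */
    toy_finish_dax_failure(&stats, 0);            /* recovered */
    toy_finish_dax_failure(&stats, -EBUSY);       /* failed */
    assert(stats.recovered == 1);
    return (int)stats.failed;
}
```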
    • Miaohe Lin's avatar
      mm: memory-failure: ensure moving HWPoison flag to the raw error pages · 92a025a7
      Miaohe Lin authored
      If hugetlb_vmemmap_optimized is enabled, folio_clear_hugetlb_hwpoison()
      called from try_memory_failure_hugetlb() won't transfer the HWPoison flag
      to subpages, although the folio's HWPoison flag is cleared.  So when this
      hugetlb page is later freed into buddy, folio_clear_hugetlb_hwpoison() is
      not called to move the HWPoison flag from the head page to the raw error
      pages, even if hugetlb_vmemmap_optimized has been cleared by then.  This
      results in the HWPoisoned page being used again and a raw_hwp_page leak.
      
      Link: https://lkml.kernel.org/r/20230711055016.2286677-3-linmiaohe@huawei.com
      Fixes: ac5fcde0 ("mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      92a025a7
    • Miaohe Lin's avatar
      mm: memory-failure: remove unneeded PageHuge() check · dbe70dbb
      Miaohe Lin authored
      Patch series "A few fixup and cleanup patches for memory-failure", v2.
      
      This series contains a few fixup patches to fix inaccurate mf_stats, fix
      race window when trying to get hugetlb folio and so on.  Also there is
      minor cleanup for comments and codestyle.  More details can be found in
      the respective changelogs.
      
      
      This patch (of 8):
      
      The PageHuge() check in me_huge_page() only guards against potential
      problems.  Remove it, as it is actually dead code and won't catch anything.
      
      Link: https://lkml.kernel.org/r/20230711055016.2286677-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20230711055016.2286677-2-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      dbe70dbb
    • David Hildenbrand's avatar
      mm/memory_hotplug: document the signal_pending() check in offline_pages() · de7cb03d
      David Hildenbrand authored
      Let's update the documentation to say that any signal is sufficient, and
      add a comment that checking for more than just fatal signals is historical
      baggage: changing it now could break existing user space, although that
      is unlikely.
      
      For example, when an app provides a custom SIGALRM handler and triggers
      memory offlining, the timeout cmd would no longer stop memory offlining,
      because SIGALRM would no longer be considered a fatal signal.
      
      Note that using signal_pending() instead of fatal_signal_pending() is
      an anti-pattern, but slowly deprecating that behavior to eventually
      change it in the far future is probably not worth the effort.  If this
      ever becomes relevant for user-space, we might want to rethink.
      
      Link: https://lkml.kernel.org/r/20230711174050.603820-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      de7cb03d
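The SIGALRM scenario in the commit above has a close userspace analogue: a signal with a custom handler is not fatal, yet it still interrupts a blocking call. The sketch below uses nanosleep() as the blocking call rather than memory offlining, so it illustrates the "any pending signal is sufficient" behavior and is not the kernel code path. The `toy_*` names are hypothetical, and the 50 ms timer and 5 s sleep are arbitrary choices.

```c
#define _XOPEN_SOURCE 700  /* for sigaction(), setitimer(), nanosleep() */
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

static volatile sig_atomic_t toy_alarm_seen;

/* Custom handler: with this installed, SIGALRM no longer kills the process. */
static void toy_on_alarm(int sig)
{
    (void)sig;
    toy_alarm_seen = 1;
}

/*
 * Returns 1 if the handled SIGALRM interrupted the sleep with EINTR,
 * 0 if the sleep was not interrupted that way, -1 on a setup error.
 */
int toy_sleep_interrupted_by_alarm(void)
{
    struct sigaction sa;
    struct itimerval timer;
    struct timespec req = { .tv_sec = 5, .tv_nsec = 0 };

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = toy_on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    memset(&timer, 0, sizeof(timer));
    timer.it_value.tv_usec = 50000;  /* fire once after 50 ms */
    toy_alarm_seen = 0;
    if (setitimer(ITIMER_REAL, &timer, NULL) != 0)
        return -1;

    /* The non-fatal signal is enough to abort the blocking call. */
    if (nanosleep(&req, NULL) == -1 && errno == EINTR && toy_alarm_seen)
        return 1;
    return 0;
}
```

Calling `toy_sleep_interrupted_by_alarm()` from a single-threaded program should return well before the 5 s sleep would have finished. A check limited to fatal signals would not have stopped the wait here, which is the compatibility concern the commit message raises.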
    • Randy Dunlap's avatar
      HWPOISON: offline support: fix spelling in Documentation/ABI/ · d0366880
      Randy Dunlap authored
      Correct spelling problems as identified by codespell.
      
      Link: https://lkml.kernel.org/r/20230710052223.18254-1-rdunlap@infradead.org
      Fixes: facb6011 ("HWPOISON: Add soft page offline support")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d0366880
    • Haifeng Xu's avatar
      mm/mm_init.c: mark check_for_memory() as __init · b894da04
      Haifeng Xu authored
      The only caller of check_for_memory() is free_area_init(), which is
      annotated with __init, so it should be safe to also mark the former as
      __init.
      
      Link: https://lkml.kernel.org/r/20230710093750.1294-1-haifeng.xu@shopee.com
      Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b894da04
    • Sergey Senozhatsky's avatar
      zsmalloc: remove obj_tagged() · f9044f17
      Sergey Senozhatsky authored
      obj_tagged() is not needed at this point, because objects can only have
      one tag: OBJ_ALLOCATED_TAG.  We needed obj_tagged() for the zsmalloc LRU
      implementation, which has now been removed.  Simplify zsmalloc code and
      revert to the previous implementation that was in place before the
      zsmalloc LRU series.
      
      Link: https://lkml.kernel.org/r/20230709025817.3842416-1-senozhatsky@chromium.org
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Nhat Pham <nphamcs@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f9044f17
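The simplification in the commit above rests on there being a single tag bit, so "is this object allocated?" becomes one bit test. The following is a minimal userspace sketch of that idea, not the zsmalloc implementation: the `TOY_OBJ_ALLOCATED_TAG` value, the header-word layout and the `toy_*` helpers are assumptions made for illustration, and handles are assumed to have their lowest bit clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Plays the role of OBJ_ALLOCATED_TAG; the value is assumed here. */
#define TOY_OBJ_ALLOCATED_TAG 1UL

/* Store a handle in an object header word with the allocated bit set. */
static unsigned long toy_mark_allocated(unsigned long handle)
{
    return handle | TOY_OBJ_ALLOCATED_TAG;
}

/*
 * With only one tag, no generic "which tag?" helper is needed: test the
 * bit, and recover the handle by clearing it. *handle is written only
 * when the object is allocated.
 */
static bool toy_obj_allocated(unsigned long word, unsigned long *handle)
{
    if (!(word & TOY_OBJ_ALLOCATED_TAG))
        return false;
    if (handle)
        *handle = word & ~TOY_OBJ_ALLOCATED_TAG;
    return true;
}

/* Usage: round-trip a handle through the tagged header word. */
int toy_demo_obj_allocated(void)
{
    unsigned long handle = 0;

    assert(toy_obj_allocated(toy_mark_allocated(0x1000UL), &handle));
    assert(handle == 0x1000UL);
    return toy_obj_allocated(0x2000UL, NULL) ? 1 : 0;  /* untagged word */
}
```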
    • Axel Rasmussen's avatar
      selftests/mm: add uffd unit test for UFFDIO_POISON · 99aa7721
      Axel Rasmussen authored
      The test is pretty basic, and exercises UFFDIO_POISON straightforwardly. 
      We register a region with userfaultfd, in missing fault mode.  For each
      fault, we either UFFDIO_COPY a zeroed page (odd pages) or UFFDIO_POISON
      (even pages).  We do this mix to test "something like a real use case",
      where guest memory would be some mix of poisoned and non-poisoned pages.
      
      We read each page in the region, and assert that the odd pages are zeroed
      as expected, and the even pages yield a SIGBUS as expected.
      
      Why UFFDIO_COPY instead of UFFDIO_ZEROPAGE?  Because hugetlb doesn't
      support UFFDIO_ZEROPAGE, and we don't want to have special case code.
      
      Link: https://lkml.kernel.org/r/20230707215540.2324998-9-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gaosheng Cui <cuigaosheng1@huawei.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: T.J. Alumbaugh <talumbau@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: ZhangPeng <zhangpeng362@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      99aa7721
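The per-fault choice in the UFFDIO_POISON selftest commit above (zeroed copy for odd pages, poison for even pages) can be sketched as a pure function. This is only a model of the decision and of the expected read result: it issues no userfaultfd ioctls, and the `toy_*` names and the enum are hypothetical, not taken from the selftest source.

```c
#include <assert.h>
#include <stdbool.h>

/* What the fault handler would do for a given missing-page fault. */
enum toy_fault_action {
    TOY_ACTION_POISON,     /* stands for resolving the fault with UFFDIO_POISON */
    TOY_ACTION_COPY_ZERO   /* stands for UFFDIO_COPY of a zeroed page */
};

/* Pick the action from the page index of the faulting address in the region. */
static enum toy_fault_action toy_action_for_fault(unsigned long fault_addr,
                                                  unsigned long region_start,
                                                  unsigned long page_size)
{
    unsigned long index = (fault_addr - region_start) / page_size;

    return (index % 2) ? TOY_ACTION_COPY_ZERO : TOY_ACTION_POISON;
}

/* Expected result of reading a page: even pages should raise SIGBUS. */
static bool toy_expect_sigbus(unsigned long page_index)
{
    return page_index % 2 == 0;
}

/* Usage: walk a 4-page region and count how many reads should SIGBUS. */
int toy_demo_count_sigbus(void)
{
    const unsigned long start = 0x10000UL, page_size = 4096UL;
    int sigbus = 0;

    for (unsigned long i = 0; i < 4; i++) {
        enum toy_fault_action act =
            toy_action_for_fault(start + i * page_size, start, page_size);

        /* The action and the expected read result must agree. */
        assert((act == TOY_ACTION_POISON) == toy_expect_sigbus(i));
        if (toy_expect_sigbus(i))
            sigbus++;
    }
    return sigbus;
}
```

Keeping a single code path for both outcomes mirrors the commit's stated reason for using UFFDIO_COPY rather than UFFDIO_ZEROPAGE: no hugetlb special case is needed.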