1. 26 Apr, 2024 40 commits
    • mm: introduce arch_get_unmapped_area_vmflags() · 96114870
      Rick Edgecombe authored
      When memory is being placed, mmap() will take care to respect the guard
      gaps of certain types of memory (VM_SHADOWSTACK, VM_GROWSUP and
      VM_GROWSDOWN).  In order to ensure guard gaps between mappings, mmap()
      needs to consider two things:
      
       1. That the new mapping isn't placed in any existing mapping's guard
          gap.
       2. That the new mapping isn't placed such that any existing mapping
          falls within *its* guard gap.
      
      The longstanding behavior of mmap() is to ensure 1, but not take any care
      around 2.  For example, if there is a PAGE_SIZE free area and an mmap()
      of PAGE_SIZE with a type that has a guard gap (such as a shadow stack)
      is being placed, mmap() may place the new mapping in that PAGE_SIZE free
      area.  The mapping that is supposed to have a guard gap will then have
      no gap to the adjacent VMA.
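
      The two checks can be sketched in userspace C.  This is a toy model
      only: struct toy_vma and both helper names are hypothetical, and it
      assumes every guard gap sits directly below the start of a mapping
      (as for shadow stack), ignoring VM_GROWSUP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy stand-in for a VMA (hypothetical, not the kernel's vm_area_struct):
 * the mapping covers [start, end), and start_gap bytes directly below
 * start must stay unmapped.
 */
struct toy_vma {
	unsigned long start;
	unsigned long end;
	unsigned long start_gap;
};

/* Check 1: the candidate must end before the guard gap of the mapping above. */
static bool keeps_out_of_existing_gap(const struct toy_vma *above,
				      const struct toy_vma *cand)
{
	if (above == NULL)
		return true;
	assert(above->start >= above->start_gap);
	return cand->end <= above->start - above->start_gap;
}

/* Check 2: the mapping below must end before the candidate's own guard gap. */
static bool keeps_own_gap_clear(const struct toy_vma *below,
				const struct toy_vma *cand)
{
	if (below == NULL)
		return true;
	assert(cand->start >= cand->start_gap);
	return below->end <= cand->start - cand->start_gap;
}
```

      For a one-page hole between two mappings and a one-page candidate that
      needs a one-page start gap, check 1 passes while check 2 fails, which
      is the case described above.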
      
      In order to take the start gap into account, the maple tree search needs
      to know the size of the start gap the new mapping will need.  The call chain
      from do_mmap() to the actual maple tree search looks like this:
      
      do_mmap(size, vm_flags, map_flags, ..)
      	mm/mmap.c:get_unmapped_area(size, map_flags, ...)
      		arch_get_unmapped_area(size, map_flags, ...)
      			vm_unmapped_area(struct vm_unmapped_area_info)
      
      One option would be to add another MAP_ flag to mean a one page start gap
      (as is needed for shadow stack), but this consumes a flag unnecessarily.
      Another option could be to simply increase the size passed in do_mmap() by
      the start gap size, and adjust after the fact, but this would interfere
      with the alignment requirements that are passed in struct
      vm_unmapped_area_info and are unknown to mmap.c.  Instead, introduce variants of
      arch_get_unmapped_area/_topdown() that take vm_flags.  In future changes,
      these variants can be used in mmap.c:get_unmapped_area() to allow the
      vm_flags to be passed through to vm_unmapped_area(), while preserving the
      normal arch_get_unmapped_area/_topdown() for the existing callers.
      
      Link: https://lkml.kernel.org/r/20240326021656.202649-4-rick.p.edgecombe@intel.com
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Deepak Gupta <debug@rivosinc.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: switch mm->get_unmapped_area() to a flag · 529ce23a
      Rick Edgecombe authored
      The mm_struct contains a function pointer *get_unmapped_area(), which is
      set to either arch_get_unmapped_area() or arch_get_unmapped_area_topdown()
      during the initialization of the mm.
      
      Since the function pointer only ever points to two functions that are
      named the same across all architectures, a function pointer is not really
      required.  In addition, future changes will want to add versions of the
      functions that take additional arguments.  So to save a pointer's worth of
      bytes in mm_struct, and prevent adding additional function pointers to
      mm_struct in future changes, remove it and keep the information about
      which get_unmapped_area() to use in a flag.
      
      Add the new flag to MMF_INIT_MASK so it doesn't get clobbered on fork by
      mmf_init_flags().  Most MM flags get clobbered on fork.  In the
      pre-existing behavior mm->get_unmapped_area() would get copied to the new
      mm in dup_mm(), so not clobbering the flag preserves the existing behavior
      around inheriting the topdown-ness.
      
      Introduce a helper, mm_get_unmapped_area(), to easily convert code that
      refers to the old function pointer to instead select and call either
      arch_get_unmapped_area() or arch_get_unmapped_area_topdown() based on the
      flag.  Then drop the mm->get_unmapped_area() function pointer.  Leave the
      get_unmapped_area() pointer in struct file_operations alone.  The main
      purpose of this change is to reorganize in preparation for future changes,
      but it also converts the calls of mm->get_unmapped_area() from indirect
      branches into direct ones.
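
      A minimal userspace sketch of the idea follows.  Every name in it is a
      hypothetical stand-in (the toy functions take only a length and return
      fixed bases), so it shows the flag-then-direct-call shape rather than
      the kernel's actual helper.

```c
#include <assert.h>

/* All names below are hypothetical stand-ins, not the kernel's definitions. */
#define TOY_MMF_TOPDOWN  (1UL << 0)
#define TOY_BOTTOM_BASE  0x10000UL
#define TOY_TOP_BASE     0x7f000000UL

struct toy_mm {
	unsigned long flags;	/* one bit replaces a per-mm function pointer */
};

static unsigned long toy_area_bottomup(unsigned long len)
{
	(void)len;
	return TOY_BOTTOM_BASE;
}

static unsigned long toy_area_topdown(unsigned long len)
{
	assert(len <= TOY_TOP_BASE);
	return TOY_TOP_BASE - len;
}

/* Select the variant by flag, then make a direct (not indirect) call. */
static unsigned long toy_mm_get_unmapped_area(const struct toy_mm *mm,
					      unsigned long len)
{
	if (mm->flags & TOY_MMF_TOPDOWN)
		return toy_area_topdown(len);
	return toy_area_bottomup(len);
}
```

      Because both targets are known at compile time, the compiler can emit
      direct calls, which is what avoids the retpoline cost of an indirect
      branch.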
      
      The stress-ng bigheap benchmark calls realloc a lot, which calls through
      get_unmapped_area() in the kernel.  On x86, the change yielded a ~1%
      improvement there on a retpoline config.
      
      In testing a few x86 configs, removing the pointer unfortunately didn't
      result in any actual size reductions in the compiled layout of mm_struct. 
      But depending on compiler or arch alignment requirements, the change could
      shrink the size of mm_struct.
      
      Link: https://lkml.kernel.org/r/20240326021656.202649-3-rick.p.edgecombe@intel.com
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Deepak Gupta <debug@rivosinc.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • proc: refactor pde_get_unmapped_area as prep · 5def1e0f
      Rick Edgecombe authored
      Patch series "Cover a guard gap corner case", v4.
      
      In working on x86’s shadow stack feature, I came across some limitations
      around the kernel’s handling of guard gaps.  AFAICT these limitations
      are not too important for the traditional stack usage of guard gaps, but
      have bigger impact on shadow stack’s usage.  And now in addition to x86,
      we have two other architectures implementing shadow stack like features
      that plan to use guard gaps.  I wanted to see about addressing them, but I
      have not worked on mmap() placement related code before, so would greatly
      appreciate if people could take a look and point me in the right
      direction.
      
      The nature of the limitations of concern is as follows.  In order to ensure
      guard gaps between mappings, mmap() would need to consider two things:

       1. That the new mapping isn’t placed in any existing mapping’s guard
          gap.
       2. That the new mapping isn’t placed such that any existing mapping
          falls within *its* guard gap.

      Currently mmap() never considers (2), and (1) is not considered in some
      situations.
      
      When not passing an address hint, or passing one without
      MAP_FIXED_NOREPLACE, (1) is enforced.  With MAP_FIXED_NOREPLACE, (1) is
      not enforced.  With MAP_FIXED, (1) is not considered, but this seems to be
      expected since MAP_FIXED can already clobber existing mappings.  For
      MAP_FIXED_NOREPLACE I would have guessed it should respect the guard gaps
      of existing mappings, but it is probably a little ambiguous.
      
      In this series I just tried to add enforcement of (2) for the normal (no
      address hint) case and only for the newer shadow stack memory (not
      stacks).  The reason is that with the no-address-hint situation, landing
      next to a guard gap could come up naturally and so be more influenceable by
      attackers such that two shadow stacks could be adjacent without a guard
      gap.  Whereas the address-hint scenarios would require more control -
      being able to call mmap() with specific arguments.  As for why not just
      fix the other corner cases anyway, I thought it might have some greater
      possibility of affecting existing apps.
      
      
      This patch (of 14):
      
      Future changes will perform a treewide change to remove the indirect
      branch that is involved in calling mm->get_unmapped_area().  After doing
      this, the function will no longer be able to be handled as a function
      pointer.  To make the treewide change diff cleaner and easier to review,
      refactor pde_get_unmapped_area() such that mm->get_unmapped_area() is
      called without being stored in a local function pointer.  With this
      refactoring in place, follow-on changes will be able to simply replace the
      call site with a future function that calls it directly.
      
      Link: https://lkml.kernel.org/r/20240326021656.202649-1-rick.p.edgecombe@intel.com
      Link: https://lkml.kernel.org/r/20240326021656.202649-2-rick.p.edgecombe@intel.com
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Deepak Gupta <debug@rivosinc.com>
      Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • userfaultfd: early return in dup_userfaultfd() · afd58439
      ZhangPeng authored
      When vma->vm_userfaultfd_ctx.ctx is NULL, vma->vm_flags should have
      cleared __VM_UFFD_FLAGS. Therefore, there is no need to down_write or
      clear the flag, which will affect fork performance. Fix this by
      returning early if octx is NULL in dup_userfaultfd().
      
      By applying this patch we can get a 1.3% performance improvement for
      lmbench fork_prot. Results are as follows:
                         base      early return
      Process fork+exit: 419.1106  413.4804
      
      Link: https://lkml.kernel.org/r/20240327090835.3232629-1-zhangpeng362@huawei.com
      Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: remove "prot" parameter from move_pte() · 82a616d0
      David Hildenbrand authored
      The "prot" parameter is unused, and using it instead of what's stored in
      that particular PTE would very likely be wrong.  Let's simply remove it.
      
      Link: https://lkml.kernel.org/r/20240327143301.741807-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: optimize CONFIG_PER_VMA_LOCK member placement in vm_area_struct · 3b612c8f
      David Hildenbrand authored
      Currently, we end up wasting some memory in each vm_area_struct. Pahole
      states that:
      	[...]
      	int                        vm_lock_seq;          /*    40     4 */
      
      	/* XXX 4 bytes hole, try to pack */
      
      	struct vma_lock *          vm_lock;              /*    48     8 */
      	bool                       detached;             /*    56     1 */
      
      	/* XXX 7 bytes hole, try to pack */
      	[...]
      
      Let's reduce the holes and memory wastage by moving the bool:
      	[...]
      	bool                       detached;             /*    40     1 */
      
      	/* XXX 3 bytes hole, try to pack */
      
      	int                        vm_lock_seq;          /*    44     4 */
      	struct vma_lock *          vm_lock;              /*    48     8 */
      	[...]
      
      This effectively shrinks vm_area_struct with CONFIG_PER_VMA_LOCK by
      8 bytes.
      
      Likely, we could place "detached" in the lowest bit of vm_lock, but at
      least on 64bit that won't really make a difference, so keep it simple.
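
      The effect of the reordering can be reproduced in isolation.  The two
      structs below are hypothetical three-member excerpts, not the real
      vm_area_struct, and the 8-byte saving assumes a typical 64-bit ABI
      (4-byte int, 8-byte pointer, 1-byte bool).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical excerpt in the old order: a hole after the int and after the bool. */
struct layout_before {
	int vm_lock_seq;
	void *vm_lock;
	bool detached;
};

/* Same members with the bool moved up: bool and int now share one 8-byte slot. */
struct layout_after {
	bool detached;
	int vm_lock_seq;
	void *vm_lock;
};

/* Bytes saved by the reordering on the ABI the code is compiled for. */
static size_t bytes_saved(void)
{
	return sizeof(struct layout_before) - sizeof(struct layout_after);
}
```

      On ABIs where pointers are 4 bytes the two layouts may come out the
      same size, so the saving is specific to 64-bit builds.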
      
      Link: https://lkml.kernel.org/r/20240327143548.744070-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • filemap: remove __set_page_dirty() · 07db63a2
      Matthew Wilcox (Oracle) authored
      All callers have been converted to use folios; remove this wrapper.
      
      Link: https://lkml.kernel.org/r/20240327185447.1076689-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: use rwsem assertion macros for mmap_lock · ba168b52
      Matthew Wilcox (Oracle) authored
      This slightly strengthens our write assertion when lockdep is disabled. 
      It also downgrades us from BUG_ON to WARN_ON, but I think that's an
      improvement.  I don't think dumping the mm_struct was all that valuable;
      the call chain is what's important.
      
      Link: https://lkml.kernel.org/r/20240327190701.1082560-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: allow anon exclusive check over hugetlb tail pages · c0bff412
      Peter Xu authored
      PageAnonExclusive() used to forbid tail pages for hugetlbfs, as that used
      to be called mostly in hugetlb specific paths and the head page was
      guaranteed.
      
      As we move forward towards merging hugetlb paths into generic mm, we may
      start to pass in tail hugetlb pages (with cont-pte/cont-pmd huge
      pages) for such checks.  Allow it to properly fetch the head, in which case
      the anon-exclusiveness of the head will always represent the tail page.
      
      There's already a sign of it when we look at GUP-fast, which already
      contains the hugetlb processing altogether: we used to have a specific
      commit 5805192c ("mm/gup: handle cont-PTE hugetlb pages correctly in
      gup_must_unshare() via GUP-fast") covering that area.  Now with this more
      generic change, that can also go away.
      
      [akpm@linux-foundation.org: simplify PageAnonExclusive(), per Matthew]
        Link: https://lkml.kernel.org/r/Zg3u5Sh9EbbYPhaI@casper.infradead.org
      Link: https://lkml.kernel.org/r/20240403013249.1418299-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: WANG Xuerui <kernel@xen0n.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: handle hugetlb in the generic follow_page_mask code · 9cb28da5
      Peter Xu authored
      Now follow_page() is ready to handle hugetlb pages in whatever form, and
      over all architectures.  Switch to the generic code path.
      
      Time to retire hugetlb_follow_page_mask(), following the previous
      retirement of follow_hugetlb_page() in 48498071.
      
      There may be a slight difference in how the loops run when processing slow
      GUP over a large hugetlb range on archs that support cont_pte/cont_pmd: each
      loop of __get_user_pages() will resolve one pgtable entry with the patch
      applied, rather than relying on the size of the hugetlb hstate, which may
      cover multiple entries in one loop.
      
      A quick performance test on an aarch64 VM on an M1 chip shows a 15%
      degradation over a tight loop of slow gup after the path switched.  That
      shouldn't be a problem because slow-gup should not be a hot path for GUP in
      general: when the page is commonly present, fast-gup will already succeed,
      while when the page is indeed missing and requires a follow-up page fault,
      the slow gup degradation will probably be buried in the fault paths anyway.
      It also explains why slow gup for THP used to be very slow before 57edfcfd
      ("mm/gup: accelerate thp gup even for "pages != NULL"") landed, the latter
      not part of a performance analysis but a side benefit.  If performance
      becomes a concern, we can consider handling CONT_PTE in follow_page().
      
      Before that is justified to be necessary, keep everything clean and simple.
      
      Link: https://lkml.kernel.org/r/20240327152332.950956-14-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: handle hugepd for follow_page() · a12083d7
      Peter Xu authored
      Hugepd is so far only used on PowerPC, on 4K page size kernels where the
      hash MMU is used.  follow_page_mask() used to leverage hugetlb APIs to
      access hugepd entries.  Teach follow_page_mask() itself about hugepd.
      
      With the previous refactors of the fast-gup gup_huge_pd(), most of the code
      can be leveraged.  Some of it is not needed for follow page: for example,
      gup_hugepte() tries to detect pgtable entry changes, which will never happen
      with slow gup (which has the pgtable lock held), but the check is harmless.
      
      Since follow_page() only ever fetches one page, setting the end to "address +
      PAGE_SIZE" should suffice.  We will still do the pgtable walk once for
      each hugetlb page by setting ctx->page_mask properly.
      
      One thing worth mentioning is that the _bad() helpers of some pgtable levels
      will report is_hugepd() entries as TRUE on Power8 hash MMUs.  I think it
      at least applies to PUD on Power8 with 4K pgsize.  It means feeding a
      hugepd entry to pud_bad() will report a false positive.  Let's leave that
      for now because it can be arch-specific, and I am a bit disinclined to
      touch it.  In this patch it's not a problem as long as hugepd is detected
      before any bad pgtable entries.
      
      To allow slow gup like follow_*_page() to access hugepd helpers, the hugepd
      code is moved to the top.  Besides that, the helper record_subpages()
      will now be used by both hugepd and fast-gup.  To avoid "unused function"
      warnings we must provide a "#ifdef" for it, unfortunately.
      
      Link: https://lkml.kernel.org/r/20240327152332.950956-13-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: handle huge pmd for follow_pmd_mask() · 4418c522
      Peter Xu authored
      Replace pmd_trans_huge() with pmd_leaf() to also cover pmd_huge() as long
      as enabled.
      
      FOLL_TOUCH and FOLL_SPLIT_PMD only apply to THP, not yet huge.
      
      Since follow_trans_huge_pmd() can now process hugetlb pages, rename it
      to follow_huge_pmd() to match what it does.  Move it into gup.c so that it
      does not depend on CONFIG_THP.
      
      While at it, move the ctx->page_mask setup into follow_huge_pmd(), and only
      set it when the page is valid.  It was not a bug to set it before even if GUP
      failed (page==NULL), because follow_page_mask() callers always ignore
      page_mask in that case.  But doing so makes the code cleaner.
      
      [peterx@redhat.com: allow follow_pmd_mask() to take hugetlb tail pages]
        Link: https://lkml.kernel.org/r/20240403013249.1418299-3-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20240327152332.950956-12-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: handle huge pud for follow_pud_mask() · 1b167618
      Peter Xu authored
      Teach follow_pud_mask() to handle normal PUD pages like hugetlb.
      
      Rename follow_devmap_pud() to follow_huge_pud() so that it can process
      either huge devmap or hugetlb.  Move it out of TRANSPARENT_HUGEPAGE_PUD
      and huge_memory.c (which relies on CONFIG_THP).  Switch to pud_leaf()
      to detect both cases in the slow gup.
      
      In the new follow_huge_pud(), take care of possible CoR for hugetlb if
      necessary.  touch_pud() needs to be moved out of huge_memory.c to be
      accessible from gup.c even if !THP.
      
      While at it, optimize the non-present check by adding an early
      pud_present() check before taking the pgtable lock, failing the
      follow_page() early if the PUD is not present: that is required by both
      devmap and hugetlb.  Use pud_huge() to also cover the pud_devmap() case.
      
      One more trivial thing to mention is the introduction of "pud_t pud" in
      the code paths along the way, so the code doesn't dereference *pudp
      multiple times.  Not only does that look less straightforward, but if the
      dereference really happened multiple times, it's not clear whether there
      could be a race where different *pudp values are seen while it's being
      modified at the same time.
      
      Set ctx->page_mask properly for a PUD entry.  As a side effect, this
      patch should also be able to optimize devmap GUP on PUD to be able to jump
      over the whole PUD range, but that is not yet verified.  Hugetlb already
      can do so prior to this patch.
      
      Link: https://lkml.kernel.org/r/20240327152332.950956-11-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: cache *pudp in follow_pud_mask() · caf8cab7
      Peter Xu authored
      Introduce "pud_t pud" in the function, so the code won't dereference *pudp
      multiple times.  Not only does that look less straightforward, but if the
      dereference really happened multiple times, it's not clear whether there
      could be a race where different *pudp values are seen while it's being
      modified at the same time.
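
      The pattern can be illustrated with a small userspace sketch.  The type,
      flag bits and function name are hypothetical, and the volatile read
      merely stands in for whatever single-read primitive the real code uses.

```c
#include <assert.h>

/* Hypothetical toy types; the kernel's pud_t and its helpers differ per arch. */
typedef struct { unsigned long val; } toy_pud_t;

#define TOY_PUD_PRESENT (1UL << 0)
#define TOY_PUD_LEAF    (1UL << 1)

enum toy_pud_kind { TOY_NONE, TOY_TABLE, TOY_LEAF };

static enum toy_pud_kind toy_classify_pud(const volatile toy_pud_t *pudp)
{
	/* Read the entry once; every later check uses this local snapshot. */
	toy_pud_t pud = { pudp->val };

	if (!(pud.val & TOY_PUD_PRESENT))
		return TOY_NONE;
	if (pud.val & TOY_PUD_LEAF)
		return TOY_LEAF;
	return TOY_TABLE;
}
```

      Since both checks look at the same snapshot, a concurrent update of the
      entry cannot make them disagree with each other.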
      
      Link: https://lkml.kernel.org/r/20240327152332.950956-10-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Acked-by: James Houghton <jthoughton@google.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: handle hugetlb for no_page_table() · 878b0c45
      Peter Xu authored
      no_page_table() is not yet used for hugetlb code paths.  Make it prepared.
      
      The major difference here is hugetlb will return -EFAULT as long as page
      cache does not exist, even if VM_SHARED.  See hugetlb_follow_page_mask().
      
      Pass "address" into no_page_table() too, as hugetlb will need it.
      
      Link: https://lkml.kernel.org/r/20240327152332.950956-9-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@infradead.org>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      878b0c45
    • Peter Xu's avatar
      mm/gup: refactor record_subpages() to find 1st small page · f3c94c62
      Peter Xu authored
      All the fast-gup functions operate on a tail page, so they always need to
      do page mask calculations before feeding it into record_subpages().
      
      Merge that logic into record_subpages(), so that it will do the nth_page()
      calculation.
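      
      The merged calculation can be sketched in user space as follows.  This is
      a toy model, not the kernel code: struct page is reduced to a frame
      number, nth_page() is plain pointer arithmetic over one contiguous array,
      PAGE_SHIFT is assumed to be 12, and the caller is assumed to pass the
      first page of the huge mapping together with the mapping size sz.

```c
#include <stddef.h>

#define PAGE_SHIFT 12                    /* assumption: 4 KiB base pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Toy stand-in for struct page: a page is identified by its frame number. */
struct page {
	unsigned long pfn;
};

/* Toy nth_page(): all pages live in one contiguous array. */
static struct page *nth_page(struct page *page, unsigned long n)
{
	return page + n;
}

/*
 * Record every small page backing [addr, end) of a huge mapping of size sz
 * whose first page is 'first'.  The offset of addr inside the mapping is
 * computed here, so callers no longer need to do the mask arithmetic.
 */
static int record_subpages(struct page *first, unsigned long sz,
			   unsigned long addr, unsigned long end,
			   struct page **pages)
{
	struct page *start_page;
	int nr;

	start_page = nth_page(first, (addr & (sz - 1)) >> PAGE_SHIFT);
	for (nr = 0; addr != end; nr++, addr += PAGE_SIZE)
		pages[nr] = nth_page(start_page, nr);

	return nr;
}
```

      With a 2 MiB mapping, asking for the range that starts three small pages
      into the mapping yields pointers starting at the fourth entry of the
      array, without the caller computing that offset itself.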
      
      Link: https://lkml.kernel.org/r/20240327152332.950956-8-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f3c94c62
    • Peter Xu's avatar
      mm/gup: drop gup_fast_folio_allowed() in hugepd processing · 607c6319
      Peter Xu authored
      Hugepd format for GUP is only used in PowerPC with hugetlbfs.  There is
      some kernel usage of hugepd (see hugepd_populate_kernel() for PPC_8XX);
      however, those pages are not candidates for GUP.
      
      Commit a6e79df9 ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
      file-backed mappings") added a check to fail gup-fast if there's potential
      risk of violating GUP over writeback file systems.  That should never
      apply to hugepd.  Considering that hugepd is an old format (and even
      software-only), there's no plan to extend hugepd into other file-backed
      memory types that are prone to the same issue.
      
      Drop that check, not only because it'll never be true for hugepd per any
      known plan, but also because it paves the way for reusing the function
      outside fast-gup.
      
      To make sure we still remember this issue in case hugepd is ever
      extended to support non-hugetlbfs memory, add a rich comment above
      gup_huge_pd(), explaining the issue with proper references.
      
      [akpm@linux-foundation.org: fix comment, per David]
      Link: https://lkml.kernel.org/r/20240327152332.950956-7-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      607c6319
    • Peter Xu's avatar
      mm/arch: provide pud_pfn() fallback · 35a76f5c
      Peter Xu authored
      The comment in the code explains the reasons.  We took a different
      approach compared to pmd_pfn() by providing a fallback function.
      
      Another option is to provide some lower-level config options (compared
      to HUGETLB_PAGE or THP) to identify which layer an arch can support for
      such huge mappings.  However, that may be overkill.
      
      [peterx@redhat.com: fix loongson defconfig]
        Link: https://lkml.kernel.org/r/20240403013249.1418299-4-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20240327152332.950956-6-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      35a76f5c
    • Peter Xu's avatar
      mm: introduce vma_pgtable_walk_{begin|end}() · 239e9a90
      Peter Xu authored
      Introduce per-vma begin()/end() helpers for pgtable walks.  This is
      preparation work for merging hugetlb pgtable walkers with generic mm.
      
      The helpers need to be called before and after a pgtable walk, and will
      start to be needed once the pgtable walker code supports hugetlb pages.
      It's a hook point for any type of VMA, but for now only hugetlb uses it
      to stabilize the pgtable pages and keep them from going away (due to
      possible pmd unsharing).
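      
      The bracketing pattern can be sketched as below.  Everything here is a
      hypothetical user-space stand-in: struct toy_vma, its is_hugetlb flag and
      the read_locks counter are invented for illustration, and the assumption
      that the hugetlb case takes a per-VMA lock for read is not stated in this
      message.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a VMA. */
struct toy_vma {
	bool is_hugetlb;	/* models "this VMA maps hugetlb pages" */
	int read_locks;		/* models a per-VMA lock held for read */
};

/* Called before a page table walk: only hugetlb VMAs need protection. */
static void toy_pgtable_walk_begin(struct toy_vma *vma)
{
	if (vma->is_hugetlb)
		vma->read_locks++;
}

/* Called after the walk: drop whatever begin() took. */
static void toy_pgtable_walk_end(struct toy_vma *vma)
{
	if (vma->is_hugetlb)
		vma->read_locks--;
}

/*
 * A walker brackets its traversal with the two helpers.  Returns how many
 * read locks were held while the (elided) walk ran.
 */
static int toy_walk(struct toy_vma *vma)
{
	int held;

	toy_pgtable_walk_begin(vma);
	held = vma->read_locks;	/* the page table walk would happen here */
	toy_pgtable_walk_end(vma);

	return held;
}
```

      For a non-hugetlb VMA both helpers are no-ops, which is what lets generic
      walkers call them unconditionally.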
      
      Link: https://lkml.kernel.org/r/20240327152332.950956-5-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@infradead.org>
      Reviewed-by: Muchun Song <muchun.song@linux.dev>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      239e9a90
    • Peter Xu's avatar
      mm: make HPAGE_PXD_* macros even if !THP · b979db16
      Peter Xu authored
      These macros can be helpful when we plan to merge hugetlb code into
      generic code.  Move them out and define them as long as
      PGTABLE_HAS_HUGE_LEAVES is selected, because there are systems that only
      define HUGETLB_PAGE not THP.
      
      One note here is HPAGE_PMD_SHIFT must be defined even if PMD_SHIFT is not
      defined (e.g. the !CONFIG_MMU case); it (or other forms of it, like
      HPAGE_PMD_NR) is already used in lots of common code without ifdef
      guards.  Use the old trick to let compilation work.
      
      Here we only need to differentiate the HPAGE_PXD_SHIFT definitions.  All
      the remaining macros will be defined based on it.  While at it, move
      HPAGE_PMD_NR / HPAGE_PMD_ORDER over together.
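      
      How the remaining macros follow from the shift can be sketched like this.
      The values PAGE_SHIFT = 12 and PMD_SHIFT = 21 are assumptions (typical for
      4 KiB pages with 2 MiB PMD mappings), and hpage_pmd_align_down() is a
      hypothetical helper added only to exercise the mask.

```c
#define PAGE_SHIFT	12	/* assumption: 4 KiB base pages */
#define PMD_SHIFT	21	/* assumption: 2 MiB PMD-level mappings */

/* Only the shift differs between configurations ... */
#define HPAGE_PMD_SHIFT	PMD_SHIFT

/* ... everything else is derived from it. */
#define HPAGE_PMD_SIZE	(1UL << HPAGE_PMD_SHIFT)
#define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))
#define HPAGE_PMD_ORDER	(HPAGE_PMD_SHIFT - PAGE_SHIFT)
#define HPAGE_PMD_NR	(1UL << HPAGE_PMD_ORDER)

/* Hypothetical helper: round an address down to a PMD-sized boundary. */
static unsigned long hpage_pmd_align_down(unsigned long addr)
{
	return addr & HPAGE_PMD_MASK;
}
```

      Under these assumed shifts a PMD-level huge page spans 512 base pages, so
      only the HPAGE_PXD_SHIFT definitions need a per-configuration variant.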
      
      Link: https://lkml.kernel.org/r/20240327152332.950956-4-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b979db16
    • Peter Xu's avatar
      mm/hugetlb: declare hugetlbfs_pagecache_present() non-static · 24334e78
      Peter Xu authored
      It will be used outside hugetlb.c soon.
      
      Link: https://lkml.kernel.org/r/20240327152332.950956-3-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      24334e78
    • Peter Xu's avatar
      mm/Kconfig: CONFIG_PGTABLE_HAS_HUGE_LEAVES · ac3830c3
      Peter Xu authored
      Patch series "mm/gup: Unify hugetlb, part 2", v4.
      
      The series removes the hugetlb slow gup path after a previous refactor
      work [1], so that slow gup now uses the exact same path to process all
      kinds of memory including hugetlb.
      
      For the long term, we may want to remove most, if not all, call sites of
      huge_pte_offset().  It'll be ideal if that API can be completely dropped
      from arch hugetlb API.  This series is one small step towards merging
      hugetlb specific codes into generic mm paths.  From that POV, this series
      removes one reference to huge_pte_offset() out of many others.
      
      One goal of such a route is that we can reconsider merging hugetlb
      features like High Granularity Mapping (HGM).  It was not accepted in the
      past because it may add lots of hugetlb specific codes and make the mm
      code even harder to maintain.  With a merged codeset, features like HGM
      can hopefully share some code with THP, legacy (PMD+) or modern
      (continuous PTEs).
      
      To make it work, the generic slow gup code will need to at least
      understand hugepd, which is already done in fast-gup.  Because hugepd is
      a software-only solution (no hardware recognizes the hugepd format, so
      it's a purely artificial structure), there's a chance we can merge some
      or all hugepd formats with cont_pte in the future.  That question is
      still unsettled, pending an acknowledgement from the Power side.  As of
      now for this series, I kept the hugepd handling because we may still
      need to do so before getting a clearer picture of the future of hugepd.
      The other reason is simply that we did it already for fast-gup and most
      of the code is still around to be reused.  It'll make more sense to keep
      slow/fast gup behaving the same before a decision is made to remove hugepd.
      
      There's one major difference for slow-gup on cont_pte / cont_pmd handling,
      currently supported on three architectures (aarch64, riscv, ppc).  Before
      the series, slow gup was able to recognize e.g. cont_pte entries with
      the help of huge_pte_offset() when hstate is around.  Now that help is
      gone, but slow gup still works by looking up pgtable entries one by one.
      
      It's not ideal, but hopefully this change will not affect major
      workloads yet.  There's some more information in the commit message of the
      last patch.  If this would be a concern, we can consider teaching slow gup
      to recognize cont pte/pmd entries, and that should recover the lost
      performance.  But I doubt its necessity for now, so I kept it as simple as
      it can be.
      
      Patch layout
      =============
      
      Patch 1-8:    Preparation works, or cleanups in relevant code paths
      Patch 9-11:   Teach slow gup with all kinds of huge entries (pXd, hugepd)
      Patch 12:     Drop hugetlb_follow_page_mask()
      
      More information can be found in the commit messages of each patch.
      
      [1] https://lore.kernel.org/all/20230628215310.73782-1-peterx@redhat.com
      [2] https://lore.kernel.org/r/20240321215047.678172-1-peterx@redhat.com
      
      
      
      
      Introduce a config option that will be selected as long as huge leaves are
      involved in pgtable (thp or hugetlbfs).  It would be useful to mark any
      code with this new config that can process either hugetlb or thp pages in
      any level that is higher than pte level.
      
      Link: https://lkml.kernel.org/r/20240327152332.950956-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20240327152332.950956-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ac3830c3
    • Matthew Wilcox (Oracle)'s avatar
      c93012d8
    • Matthew Wilcox (Oracle)'s avatar
      mm: convert huge_zero_page to huge_zero_folio · 5691753d
      Matthew Wilcox (Oracle) authored
      With all callers of is_huge_zero_page() converted, we can now switch the
      huge_zero_page itself from being a compound page to a folio.
      
      Link: https://lkml.kernel.org/r/20240326202833.523759-6-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5691753d
    • Matthew Wilcox (Oracle)'s avatar
      mm: convert migrate_vma_collect_pmd to use a folio · b002a7b0
      Matthew Wilcox (Oracle) authored
      Convert the pmd directly to a folio and use it.  Turns four calls to
      compound_head() into one.
      
      Link: https://lkml.kernel.org/r/20240326202833.523759-5-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b002a7b0
    • Matthew Wilcox (Oracle)'s avatar
      mm: add pmd_folio() · e06d03d5
      Matthew Wilcox (Oracle) authored
      Convert directly from a pmd to a folio without going through another
      representation first.  For now this is just a slightly shorter way to
      write it, but it might end up being more efficient later.
      
      Link: https://lkml.kernel.org/r/20240326202833.523759-4-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e06d03d5
    • Matthew Wilcox (Oracle)'s avatar
      mm: add is_huge_zero_folio() · 5beaee54
      Matthew Wilcox (Oracle) authored
      This is the folio equivalent of is_huge_zero_page().  It doesn't add any
      efficiency, but it does prevent the caller from passing a tail page and
      getting confused when the predicate returns false.
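      
      The tail-page confusion can be illustrated with a toy model.  The types
      and functions below (toy_page, toy_folio, toy_page_folio() and the two
      predicates) are hypothetical stand-ins, not the kernel definitions; the
      only behavior taken from this series is that the page-based predicate
      returns false for tail pages of the huge zero page.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PAGES 512	/* assumption: pages per huge zero folio */

struct toy_folio;

/* Toy page: every page knows which folio it belongs to. */
struct toy_page {
	struct toy_folio *folio;
};

/* Toy folio: a head page followed by its tail pages. */
struct toy_folio {
	struct toy_page pages[TOY_NR_PAGES];
};

static struct toy_folio toy_huge_zero_folio;

/* Make every page of the toy huge zero folio point back at it. */
static void toy_huge_zero_init(void)
{
	size_t i;

	for (i = 0; i < TOY_NR_PAGES; i++)
		toy_huge_zero_folio.pages[i].folio = &toy_huge_zero_folio;
}

/* Page-based predicate: only matches the head page. */
static bool toy_is_huge_zero_page(const struct toy_page *page)
{
	return page == &toy_huge_zero_folio.pages[0];
}

/* Models converting any page, head or tail, to its folio. */
static struct toy_folio *toy_page_folio(const struct toy_page *page)
{
	return page->folio;
}

/* Folio-based predicate: a tail page cannot be passed by mistake. */
static bool toy_is_huge_zero_folio(const struct toy_folio *folio)
{
	return folio == &toy_huge_zero_folio;
}
```

      A tail page of the toy huge zero folio fails the page-based check but
      passes the folio-based one once converted, because the type forces the
      caller to go through the folio first.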
      
      Link: https://lkml.kernel.org/r/20240326202833.523759-3-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5beaee54
    • Matthew Wilcox (Oracle)'s avatar
      sparc: use is_huge_zero_pmd() · 4d30eac3
      Matthew Wilcox (Oracle) authored
      Patch series "Convert huge_zero_page to huge_zero_folio".
      
      Almost all the callers of is_huge_zero_page() already have a folio.  And
      they should -- is_huge_zero_page() will return false for tail pages, even
      if they're tail pages of the huge zero page.  That's confusing, and one of
      the benefits of the folio conversion is to get rid of this confusion.
      
      
      This patch (of 8):
      
      There's no need to convert to a page, much less a folio.  We can tell from
      the pmd whether it is a huge zero page or not.  Saves 60 bytes of text.
      
      Link: https://lkml.kernel.org/r/20240326202833.523759-1-willy@infradead.org
      Link: https://lkml.kernel.org/r/20240326202833.523759-2-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4d30eac3
    • Chris Li's avatar
      zswap: replace RB tree with xarray · 796c2c23
      Chris Li authored
      A very deep RB tree requires rebalancing at times, which contributes to
      the zswap fault latencies.  An xarray does not need to rebalance its
      tree, so replacing the RB tree with an xarray can bring a small
      performance gain.
      
      One small difference is that xarray insert might fail with ENOMEM, while
      RB tree insert does not allocate additional memory.
      
      The zswap_entry size will shrink a bit due to removing the RB node, which
      has two pointers and a color field.  The xarray stores the pointer in the
      xarray tree rather than in the zswap_entry.  Every entry has one pointer
      from the xarray tree.  Overall, switching to xarray should save some
      memory, if the swap entries are densely packed.
      
      Notice that zswap_rb_search and zswap_rb_insert are often followed by
      zswap_rb_erase.  Use xa_erase and xa_store directly.  That saves one tree
      lookup as well.
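      
      The saved lookup comes from store and erase handing back the previous
      entry, as xa_store() and xa_erase() do.  The sketch below models only that
      calling pattern with a fixed-size slot array; toy_tree, toy_store() and
      toy_erase() are hypothetical and say nothing about how the xarray is
      implemented.

```c
#include <stddef.h>

#define TOY_SLOTS 64	/* assumption: callers pass index < TOY_SLOTS */

struct toy_entry {
	int length;
};

/* Toy index -> entry map standing in for the per-swapfile tree. */
struct toy_tree {
	struct toy_entry *slot[TOY_SLOTS];
};

/*
 * Store an entry and return whatever was there before.  One operation
 * replaces the old "search, erase the duplicate, then insert" sequence.
 */
static struct toy_entry *toy_store(struct toy_tree *tree, unsigned long index,
				   struct toy_entry *entry)
{
	struct toy_entry *old = tree->slot[index];

	tree->slot[index] = entry;
	return old;
}

/* Erase and return the removed entry (NULL if the slot was empty). */
static struct toy_entry *toy_erase(struct toy_tree *tree, unsigned long index)
{
	return toy_store(tree, index, NULL);
}
```

      A caller that gets a non-NULL old entry back from the store knows it has
      replaced a stale duplicate and can free it, with no separate search.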
      
      Remove zswap_invalidate_entry due to no need to call zswap_rb_erase any
      more.  Use zswap_free_entry instead.
      
      The "struct zswap_tree" has been replaced by "struct xarray".  The tree
      spin lock has transferred to the xarray lock.
      
      Run the kernel build testing 5 times for each version, averages:
      (memory.max=2GB, zswap shrinker and writeback enabled, one 50GB swapfile,
      24 HT core, 32 jobs)
      
                 mm-unstable-4aaccadb5c04     xarray v9
      user       3548.902                     3534.375
      sys        522.232                      520.976
      real       202.796                      200.864
      
      [chrisl@kernel.org: restore original comment "erase" to "invalidate"]
        Link: https://lkml.kernel.org/r/20240326-zswap-xarray-v10-1-bf698417c968@kernel.org
      Link: https://lkml.kernel.org/r/20240326-zswap-xarray-v9-1-d2891a65dfc7@kernel.org
      Signed-off-by: Chris Li <chrisl@kernel.org>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Nhat Pham <nphamcs@gmail.com>
      Cc: Barry Song <v-songbaohua@oppo.com>
      Cc: Chengming Zhou <zhouchengming@bytedance.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      796c2c23
    • Baoquan He's avatar
      mm/page_alloc.c: change the array-length to MIGRATE_PCPTYPES · 0aac4566
      Baoquan He authored
      Earlier, in commit 1dd214b8 ("mm: page_alloc: avoid merging
      non-fallbackable pageblocks with others"), migrate types MIGRATE_CMA and
      MIGRATE_ISOLATE were removed from the fallbacks list since they are never
      used.
      
      Later on, in commit aa02d3c1 ("mm/page_alloc: reduce fallbacks to
      (MIGRATE_PCPTYPES - 1)"), the array column size was reduced to
      'MIGRATE_PCPTYPES - 1'.  In fact, the array row size needs to be reduced
      to MIGRATE_PCPTYPES too, since the array only covers MIGRATE_PCPTYPES
      rows.  Even though the current code already handles the cases where the
      migratetype is CMA, HIGHATOMIC or MEMORY_ISOLATION, making the row size
      right is still good to avoid future errors and confusion.
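      
      The resulting array shape can be sketched as follows.  The enum layout
      and the fallback order in each row are assumptions made for illustration
      (all optional migrate types are assumed enabled), and toy_fallback() is a
      hypothetical accessor showing that types without a row must be filtered
      out before indexing.

```c
/* Assumed layout: the first MIGRATE_PCPTYPES types are the fallbackable ones. */
enum toy_migratetype {
	MIGRATE_UNMOVABLE,
	MIGRATE_MOVABLE,
	MIGRATE_RECLAIMABLE,
	MIGRATE_PCPTYPES,	/* number of types with a fallback row */
	MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
	MIGRATE_CMA,
	MIGRATE_ISOLATE,
	MIGRATE_TYPES
};

/* Rows: MIGRATE_PCPTYPES.  Columns: every other fallbackable type. */
static const int fallbacks[MIGRATE_PCPTYPES][MIGRATE_PCPTYPES - 1] = {
	[MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE   },
	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE },
	[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE   },
};

/* Hypothetical accessor: -1 for types or positions that have no entry. */
static int toy_fallback(int migratetype, int i)
{
	if (migratetype < 0 || migratetype >= MIGRATE_PCPTYPES)
		return -1;
	if (i < 0 || i >= MIGRATE_PCPTYPES - 1)
		return -1;
	return fallbacks[migratetype][i];
}
```

      Sizing the first dimension as MIGRATE_PCPTYPES makes it explicit that
      HIGHATOMIC, CMA and ISOLATE have no row to index.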
      
      Link: https://lkml.kernel.org/r/20240326061134.1055295-8-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0aac4566
    • Baoquan He's avatar
      mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone · 96a5c186
      Baoquan He authored
      On one node, a lower zone's ->lowmem_reserve[] shows how much memory is
      reserved in that lower zone to avoid excessive page allocation by the
      relevant higher zone's fallback allocations.
      
      However, currently a lower zone's lowmem_reserve[] element is filled in
      even when the relevant higher zone is empty.  That doesn't make sense
      and can cause confusion.
      
      E.g., node 0 of the system below has zones
      DMA/DMA32/NORMAL/MOVABLE/DEVICE, among which zones MOVABLE and DEVICE are
      the highest and both are empty.  In the protection arrays of zones DMA and
      DMA32, we can see values for zones MOVABLE and DEVICE.
      
      Node 0, zone      DMA
        ......
        pages free     2816
              boost    0
              min      7
              low      10
              high     13
              spanned  4095
              present  3998
              managed  3840
              cma      0
              protection: (0, 1582, 23716, 23716, 23716)
         ......
      Node 0, zone    DMA32
        pages free     403269
              boost    0
              min      753
              low      1158
              high     1563
              spanned  1044480
              present  487039
              managed  405070
              cma      0
              protection: (0, 0, 22134, 22134, 22134)
         ......
      Node 0, zone   Normal
        pages free     5423879
              boost    0
              min      10539
              low      16205
              high     21871
              spanned  5767168
              present  5767168
              managed  5666438
              cma      0
              protection: (0, 0, 0, 0, 0)
         ......
      Node 0, zone  Movable
        pages free     0
              boost    0
              min      32
              low      32
              high     32
              spanned  0
              present  0
              managed  0
              cma      0
              protection: (0, 0, 0, 0, 0)
      Node 0, zone   Device
        pages free     0
              boost    0
              min      0
              low      0
              high     0
              spanned  0
              present  0
              managed  0
              cma      0
              protection: (0, 0, 0, 0, 0)
      
      Here, clear out the element value in lower zone's ->lowmem_reserve[] if the
      relevant higher zone is empty.
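      
      The clearing rule can be sketched in user space as below.  This is a toy
      model under stated assumptions: struct toy_zone and
      toy_setup_lowmem_reserve() are hypothetical, each reserve is assumed to be
      the running sum of managed pages in the higher zones divided by a
      per-zone ratio, and the ratios themselves must be supplied by the caller.

```c
#include <stdbool.h>

#define TOY_NR_ZONES 5	/* DMA, DMA32, Normal, Movable, Device */

/* Hypothetical, simplified zone: managed pages and its protection array. */
struct toy_zone {
	unsigned long managed_pages;
	unsigned long lowmem_reserve[TOY_NR_ZONES];
};

/*
 * For each lower zone i and each higher zone j, reserve
 * (managed pages of zones i+1..j) / ratio[i], but write 0 when the lower
 * zone has no ratio or no pages, or when higher zone j itself is empty.
 */
static void toy_setup_lowmem_reserve(struct toy_zone *zones, const int *ratio)
{
	for (int i = 0; i < TOY_NR_ZONES - 1; i++) {
		struct toy_zone *zone = &zones[i];
		bool clear = !ratio[i] || !zone->managed_pages;
		unsigned long managed = 0;

		for (int j = i + 1; j < TOY_NR_ZONES; j++) {
			bool empty = !zones[j].managed_pages;

			managed += zones[j].managed_pages;
			if (clear || empty)
				zone->lowmem_reserve[j] = 0;
			else
				zone->lowmem_reserve[j] = managed / ratio[i];
		}
	}
}
```

      Fed with the managed page counts from the dump above and assumed ratios
      of 256 for DMA and DMA32, the sketch keeps 1582 and 23716 in the DMA row
      and 22134 in the DMA32 row, while the Movable and Device columns become 0.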
      
      And also replace space with tab in _deferred_grow_zone()
      
      Link: https://lkml.kernel.org/r/20240326061134.1055295-7-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      96a5c186
    • Baoquan He's avatar
      mm/mm_init.c: remove the outdated code comment above deferred_grow_zone() · f55d3471
      Baoquan He authored
      The noinline attribute has been taken off in commit 9420f89d ("mm:
      move most of core MM initialization to mm/mm_init.c").  So remove the
      unneeded code comment above deferred_grow_zone().
      
      And also remove the unneeded bracket in deferred_init_pages().
      
      Link: https://lkml.kernel.org/r/20240326061134.1055295-6-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f55d3471
    • Baoquan He's avatar
      mm/page_alloc.c: remove unneeded codes in !NUMA version of build_zonelists() · bb8ea62d
      Baoquan He authored
      When CONFIG_NUMA=n, MAX_NUMNODES is always 1 because Kconfig item
      NODES_SHIFT depends on NUMA.  So in the !NUMA version of
      build_zonelists(), there is no need to bother with the two for loops
      because code execution will never enter them.
      
      Here, remove that unneeded code in the !NUMA version of build_zonelists().
      
      [bhe@redhat.com: remove unused locals]
        Link: https://lkml.kernel.org/r/ZgQL1WOf9K88nLpQ@MiWiFi-R3L-srv
      Link: https://lkml.kernel.org/r/20240326061134.1055295-5-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bb8ea62d
    • Baoquan He's avatar
      mm: make __absent_pages_in_range() as static · b6dd9459
      Baoquan He authored
      It's only called in mm/mm_init.c now.
      
      Link: https://lkml.kernel.org/r/20240326061134.1055295-4-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b6dd9459
    • Baoquan He's avatar
      mm/init: remove the unnecessary special treatment for memory-less node · c091dd96
      Baoquan He authored
      Because a memory-less node's ->node_present_pages and its zones'
      ->present_pages are all 0, the check before calling node_set_state()
      to set N_MEMORY, N_HIGH_MEMORY, N_NORMAL_MEMORY for the node is enough to
      skip a memory-less node.  The 'continue;' statement inside the
      for_each_node() loop of free_area_init() is gilding the lily.
      
      Here, remove the special handling to make a memory-less node share the
      same code flow as a normal node.
      
      And also rephrase the code comments above the 'continue' statement
      and move them above the line 'if (pgdat->node_present_pages)'.
      
      [bhe@redhat.com: redo code comments, per Mike]
        Link: https://lkml.kernel.org/r/ZhYJAVQRYJSTKZng@MiWiFi-R3L-srv
      Link: https://lkml.kernel.org/r/20240326061134.1055295-3-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c091dd96
    • Baoquan He's avatar
      mm: move array mem_section init code out of memory_present() · 850ed205
      Baoquan He authored
      Patch series "mm/init: minor clean up and improvement".
      
      These are all observed when going through code flow during mm init.
      
      
      This patch (of 7):
      
      When CONFIG_SPARSEMEM_EXTREME is enabled, mem_section needs to be
      initialized to point at a two-dimensional array, and its 1st dimension of
      length NR_SECTION_ROOTS will be dynamically allocated.  Once the
      allocation is done, it's available for all nodes.
      
      So take the 1st dimension of mem_section initialization out of
      memory_present(), and put it into memblocks_present(), which is a more
      appropriate place.
      
      Link: https://lkml.kernel.org/r/20240326061134.1055295-1-bhe@redhat.com
      Link: https://lkml.kernel.org/r/20240326061134.1055295-2-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      850ed205
    • Vlastimil Babka's avatar
      mm, slab: move slab_memcg hooks to mm/memcontrol.c · e6100a45
      Vlastimil Babka authored
      The hooks make multiple calls to functions in mm/memcontrol.c, including
      to the current_obj_cgroup() marked __always_inline.  It might be faster to
      make a single call to the hook in mm/memcontrol.c instead.  The hooks also
      use almost nothing from mm/slub.c.  obj_full_size() can move with
      the hooks and cache_vmstat_idx() to the internal mm/slab.h.
      
      Link: https://lkml.kernel.org/r/20240326-slab-memcg-v3-2-d85d2563287a@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Chengming Zhou <chengming.zhou@linux.dev>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Josh Poimboeuf <jpoimboe@kernel.org>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e6100a45