1. 26 Apr, 2024 40 commits
    • mm: correct page_mapped_in_vma() for large folios · 7e834741
      Matthew Wilcox (Oracle) authored
      Patch series "Unify vma_address and vma_pgoff_address".
      
      The current vma_address() pretends that the ambiguity between head & tail
      page is an advantage.  If you pass a head page to vma_address(), it will
      operate on all pages in the folio, while if you pass a tail page, it will
      operate on a single page.  That's not what any of the callers actually
      want, so first convert all callers to use vma_pgoff_address() and then
      rename vma_pgoff_address() to vma_address().
      
      
      This patch (of 3):
      
      If 'page' is the first page of a large folio then vma_address() will scan
      for any page in the entire folio.  This can lead to page_mapped_in_vma()
      returning true if some of the tail pages are mapped and the head page is
      not.  This could lead to memory failure choosing to kill a task
      unnecessarily.
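The difference between the two lookups can be sketched with a toy userspace model. This is only an illustration under stated assumptions: `struct toy_folio` and both helpers are hypothetical names and are not kernel code; they only contrast "any page of the folio is mapped" with "this particular page is mapped".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a large folio: one "mapped" flag per page. */
struct toy_folio {
    size_t nr_pages;
    const bool *mapped;   /* mapped[i]: is page i mapped in the VMA? */
};

/* Scan the whole folio: true if any page of it is mapped. */
static bool any_page_mapped(const struct toy_folio *f)
{
    for (size_t i = 0; i < f->nr_pages; i++)
        if (f->mapped[i])
            return true;
    return false;
}

/* Look only at the single page that was asked about. */
static bool this_page_mapped(const struct toy_folio *f, size_t idx)
{
    return idx < f->nr_pages && f->mapped[idx];
}
```

With only a tail page mapped, the whole-folio scan reports a mapping for the head page while the per-page check does not, which is the false positive the message describes.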
      
      Link: https://lkml.kernel.org/r/20240328225831.1765286-1-willy@infradead.org
      Link: https://lkml.kernel.org/r/20240328225831.1765286-2-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: huge_memory: add the missing folio_test_pmd_mappable() for THP split statistics · 835c3a25
      Baolin Wang authored
      mTHP can now also be split or added to the deferred list, so add a
      folio_test_pmd_mappable() check for PMD-mapped THP to avoid confusing
      the PMD-mapped THP statistics.
      
      [baolin.wang@linux.alibaba.com: check THP earlier in case folio is split, per Lance]
        Link: https://lkml.kernel.org/r/b99f8cb14bc85fdb6ab43721d1331cb5ebed2581.1713771041.git.baolin.wang@linux.alibaba.com
      Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Lance Yang <ioworker0@gmail.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: support multi-size THP numa balancing · d2136d74
      Baolin Wang authored
      Anonymous page allocation already supports multi-size THP (mTHP), but
      NUMA balancing still prohibits mTHP migration even when the folio is
      exclusively mapped, which is unreasonable.
      
      Allow scanning mTHP:
      Commit 859d4adc ("mm: numa: do not trap faults on shared data section
      pages") skips NUMA migration of shared CoW pages to avoid migrating
      shared data segments. In addition, commit 80d47f5d ("mm: don't try to
      NUMA-migrate COW pages that have other uses") changed to use page_count()
      to avoid migrating GUP pages, which also skips mTHP NUMA scanning.
      Theoretically, we can use folio_maybe_dma_pinned() to detect the GUP
      issue; although there is still a GUP race, that issue seems to have been
      resolved by commit 80d47f5d. Meanwhile, use folio_likely_mapped_shared()
      to skip shared CoW pages, though this is not a precise sharer count. To
      check whether the folio is shared, ideally we would make sure every page
      is mapped to the same process, but doing that seems expensive, and using
      the estimated mapcount seems to work when running the autonuma benchmark.
      
      Allow migrating mTHP:
      As mentioned in the previous thread[1], large folios (including THP) are
      more susceptible to false sharing among threads than 4K base pages,
      leading to pages ping-ponging back and forth during NUMA balancing, which
      is currently not easy to resolve. Therefore, as a start to supporting mTHP
      NUMA balancing, we can follow the PMD-mapped THP strategy: reuse the
      2-stage filter in should_numa_migrate_memory() to check whether the mTHP
      is being heavily contended among threads (by checking the CPU id and pid
      of the last access), which avoids false sharing to some degree. Thus, we
      can restore all PTE maps upon the first hint page fault of a large folio,
      following the PMD-mapped THP strategy. In the future, we can continue to
      optimize the NUMA balancing algorithm to avoid the false sharing issue
      with large folios as much as possible.
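The idea of filtering on the previous accessor can be illustrated with a toy model. This is not the kernel's should_numa_migrate_memory(); it is a simplified sketch that tracks only a node id (the kernel records CPU id and pid), and `struct toy_folio_numa`, `toy_should_migrate()` and the rule that the very first fault never migrates are all assumptions made for illustration.

```c
#include <stdbool.h>

#define NO_NODE (-1)

/* Hypothetical per-folio state: the node that took the previous hint fault. */
struct toy_folio_numa {
    int last_nid;
};

/*
 * Record the faulting node and report whether migrating toward it looks
 * worthwhile: only when the previous hint fault came from the same node.
 */
static bool toy_should_migrate(struct toy_folio_numa *f, int this_nid)
{
    int last = f->last_nid;

    f->last_nid = this_nid;      /* always remember the latest accessor */
    return last == this_nid;     /* second consecutive fault from this node */
}
```

A folio that two nodes touch alternately never passes this filter, so it is not bounced back and forth, while a folio faulted twice in a row from one node does pass.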
      
      Performance data:
      Machine environment: 2 nodes, 128 cores Intel(R) Xeon(R) Platinum
      Base: 2024-03-25 mm-unstable branch
      Enable mTHP to run autonuma-benchmark
      
      mTHP:16K
      Benchmark              Base      Patched
      numa01                 224.70    143.48
      numa01_THREAD_ALLOC    118.05    47.43
      numa02                 13.45     9.29
      numa02_SMT             14.80     7.50

      mTHP:64K
      Benchmark              Base      Patched
      numa01                 216.15    114.40
      numa01_THREAD_ALLOC    115.35    47.41
      numa02                 13.24     9.25
      numa02_SMT             14.67     7.34

      mTHP:128K
      Benchmark              Base      Patched
      numa01                 205.13    144.45
      numa01_THREAD_ALLOC    112.93    41.88
      numa02                 13.16     9.18
      numa02_SMT             14.81     7.49
      
      [1] https://lore.kernel.org/all/20231117100745.fnpijbk4xgmals3k@techsingularity.net/
      
      [baolin.wang@linux.alibaba.com: v3]
        Link: https://lkml.kernel.org/r/c33a5c0b0a0323b1f8ed53772f50501f4b196e25.1712132950.git.baolin.wang@linux.alibaba.com
      Link: https://lkml.kernel.org/r/d28d276d599c26df7f38c9de8446f60e22dd1950.1711683069.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: factor out the numa mapping rebuilding into a new helper · 6b0ed7b3
      Baolin Wang authored
      Patch series "support multi-size THP numa balancing", v2.
      
      This patchset adds support for mTHP NUMA balancing.  As a simple starting
      point, the NUMA balancing algorithm for mTHP follows the THP strategy as
      the basic support.  Please find details in each patch.
      
      
      This patch (of 2):
      
      To support large folio's numa balancing, factor out the numa mapping
      rebuilding into a new helper as a preparation.
      
      Link: https://lkml.kernel.org/r/cover.1712132950.git.baolin.wang@linux.alibaba.com
      Link: https://lkml.kernel.org/r/cover.1711683069.git.baolin.wang@linux.alibaba.com
      Link: https://lkml.kernel.org/r/8bc2586bdd8dbbe6d83c09b77b360ec8fcac3736.1711683069.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: alloc_anon_folio: avoid doing vma_thp_gfp_mask in fallback cases · 68dbcf48
      Barry Song authored
      Fallback rates surpassing 90% have been observed on phones utilizing 64KiB
      CONT-PTE mTHP.  In these scenarios, when one out of every 16 PTEs fails to
      allocate large folios, the remaining 15 PTEs fall back.  Consequently,
      invoking vma_thp_gfp_mask() seems redundant in such cases.  Furthermore,
      abstaining from its use can also contribute to improved code readability.
      
      Link: https://lkml.kernel.org/r/20240329073750.20012-1-21cnbao@gmail.com
      Signed-off-by: Barry Song <v-songbaohua@oppo.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Acked-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Itaru Kitayama <itaru.kitayama@gmail.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zram: add max_pages param to recompression · 34efe1c3
      Sergey Senozhatsky authored
      Introduce "max_pages" param to recompress device attribute which sets an
      upper limit on the number of entries (pages) zram attempts to recompress
      (in this particular recompression call).  S/W recompression can be quite
      expensive so limiting the number of pages recompress touches can be quite
      helpful.
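A caller could build such a request as a string and write it to the device's recompress attribute. The following is a minimal sketch, not taken from the patch: the helper names are hypothetical, the attribute path is left to the caller, and the `type=` key is an assumption based on the zram admin guide rather than on this message, which only names `max_pages`.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format a recompress request limited to max_pages entries.
 * Returns the string length, or -1 if buf is too small.
 */
static int format_recompress_request(char *buf, size_t len,
                                     const char *type,
                                     unsigned long max_pages)
{
    int n = snprintf(buf, len, "type=%s max_pages=%lu", type, max_pages);

    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Write the request to a recompress attribute chosen by the caller. */
static int write_recompress_request(const char *path, const char *req)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fputs(req, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Keeping the formatting separate from the write makes the request easy to check without touching sysfs.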
      
      Link: https://lkml.kernel.org/r/20240329094050.2815699-1-senozhatsky@chromium.org
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: init_mlocked_on_free_v3 · ba42b524
      York Jasper Niebuhr authored
      Implements the "init_mlocked_on_free" boot option. When this boot option
      is enabled, any mlock'ed pages are zeroed on free. If
      the pages are munlock'ed beforehand, no initialization takes place.
      This boot option is meant to combat the performance hit of
      "init_on_free" as reported in commit 6471384a ("mm: security:
      introduce init_on_alloc=1 and init_on_free=1 boot options"). With
      "init_mlocked_on_free=1" only relevant data is freed while everything
      else is left untouched by the kernel. Correspondingly, this patch
      introduces no performance hit for unmapping non-mlock'ed memory. The
      unmapping overhead for purely mlocked memory was measured to be
      approximately 13%. Realistically, most systems mlock only a fraction of
      the total memory so the real-world system overhead should be close to
      zero.
      
      Optimally, userspace programs clear any key material or other
      confidential memory before exit and munlock the corresponding memory
      regions. If a program crashes, userspace key managers fail to do this
      job. Accordingly, no munlock operations are performed, so the data is
      caught and zeroed by the kernel. Should the program not crash, all
      memory will ideally be munlocked so no overhead is caused.
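The well-behaved userspace path described above (clear the secret, then munlock) might look like the sketch below. The helper names `wipe()` and `release_secret()` are hypothetical, and the volatile store loop is just one common way to keep the compiler from dropping the clearing; it is not something this patch prescribes.

```c
#include <stddef.h>
#include <sys/mman.h>

/* Overwrite a buffer through a volatile pointer so the stores are kept. */
static void wipe(void *buf, size_t len)
{
    volatile unsigned char *p = buf;

    while (len--)
        *p++ = 0;
}

/*
 * Teardown for a secret held in mlock'ed memory: clear it first,
 * then drop the lock.  Returns munlock()'s result.
 */
static int release_secret(void *buf, size_t len)
{
    wipe(buf, len);
    return munlock(buf, len);
}
```

A program that always reaches release_secret() leaves nothing for the kernel to zero; the boot option covers the case where it never gets there.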
      
      CONFIG_INIT_MLOCKED_ON_FREE_DEFAULT_ON can be set to enable
      "init_mlocked_on_free" by default.
      
      Link: https://lkml.kernel.org/r/20240329145605.149917-1-yjnworkstation@gmail.com
      Signed-off-by: York Jasper Niebuhr <yjnworkstation@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: York Jasper Niebuhr <yjnworkstation@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftest/mm: ksm_functional_tests: extend test case for ksm fork/exec · 6c47de3b
      Jinjiang Tu authored
      This extends test_prctl_fork() and test_prctl_fork_exec() to make sure
      that deduplication really happens, instead of only testing that the
      MMF_VM_MERGE_ANY flag is set.
      
      [colin.i.king@gmail.com: fix spelling mistake in ksft_test_result_skip message]
        Link: https://lkml.kernel.org/r/20240402081537.1365939-1-colin.i.king@gmail.com
      Link: https://lkml.kernel.org/r/20240328111010.1502191-4-tujinjiang@huawei.com
      Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
      Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Stefan Roesch <shr@devkernel.io>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftest/mm: ksm_functional_tests: refactor mmap_and_merge_range() · 7abaacb8
      Jinjiang Tu authored
      In order to extend test_prctl_fork() and test_prctl_fork_exec() to make
      sure that deduplication really happens, mmap_and_merge_range() needs to be
      refactored.
      
      Firstly, mmap_and_merge_range() will be called without needing to enable
      KSM by madvise or prctl.  So, switch the 'bool use_prctl' parameter to
      enum ksm_merge_mode.
      
      Secondly, mmap_and_merge_range() will be called in the child process in
      the two testcases, where it isn't appropriate to call
      ksft_test_result_{fail, skip}, because the global variables
      ksft_{fail, skip} aren't consistent with the parent process.  Thus,
      convert calls of ksft_test_result_{fail, skip} to ksft_print_msg(),
      return different errors for the two cases, and rename
      mmap_and_merge_range() to __mmap_and_merge_range().  For existing
      callers, introduce a new mmap_and_merge_range() to handle the different
      return values of __mmap_and_merge_range().
      
      Link: https://lkml.kernel.org/r/20240328111010.1502191-3-tujinjiang@huawei.com
      Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Stefan Roesch <shr@devkernel.io>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/ksm: fix ksm exec support for prctl · 3a9e567c
      Jinjiang Tu authored
      Patch series "mm/ksm: fix ksm exec support for prctl", v4.
      
      commit 3c6f33b7 ("mm/ksm: support fork/exec for prctl") inherits
      MMF_VM_MERGE_ANY flag when a task calls execve().  However, it doesn't
      create the mm_slot, so ksmd will not try to scan this task.  The first
      patch fixes the issue.
      
      The second patch refactors to prepare for the third patch.  The third
      patch extends the selftests of ksm to verify that deduplication really
      happens after fork/exec inherits the KSM setting.
      
      
      This patch (of 3):
      
      commit 3c6f33b7 ("mm/ksm: support fork/exec for prctl") inherits
      MMF_VM_MERGE_ANY flag when a task calls execve().  However, it doesn't
      create the mm_slot, so ksmd will not try to scan this task.
      
      To fix it, allocate and add the mm_slot to ksm_mm_head in __bprm_mm_init()
      when the mm has MMF_VM_MERGE_ANY flag.
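From userspace, the per-process KSM setting that is inherited across fork/exec is enabled through prctl(). A minimal sketch, assuming a kernel that provides PR_SET_MEMORY_MERGE; the helper names are hypothetical, and the fallback constants are copied from the Linux uapi header as an assumption for older libc headers.

```c
#include <sys/prctl.h>

/* Fallback values for libc headers that predate these options. */
#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#endif
#ifndef PR_GET_MEMORY_MERGE
#define PR_GET_MEMORY_MERGE 68
#endif

/* Ask the kernel to make this process's memory eligible for KSM. */
static int ksm_enable_for_process(void)
{
    return prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0);
}

/* Returns 1 if enabled, 0 if not, -1 on error (e.g. no kernel support). */
static int ksm_enabled_for_process(void)
{
    return prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0);
}
```

Reading the flag back only shows that the setting is present; whether ksmd actually scans the task is the part this patch addresses on the kernel side.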
      
      Link: https://lkml.kernel.org/r/20240328111010.1502191-1-tujinjiang@huawei.com
      Link: https://lkml.kernel.org/r/20240328111010.1502191-2-tujinjiang@huawei.com
      Fixes: 3c6f33b7 ("mm/ksm: support fork/exec for prctl")
      Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Stefan Roesch <shr@devkernel.io>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/x86: add placement guard gap test for shstk · a9bc15cb
      Rick Edgecombe authored
      The existing shadow stack test for guard gaps just checks that new
      mappings are not placed in an existing mapping's guard gap.  Add one that
      checks that new mappings are not placed such that preexisting mappings are
      in the new mapping's guard gap.
      
      Link: https://lkml.kernel.org/r/20240326021656.202649-15-rick.p.edgecombe@intel.com
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Deepak Gupta <debug@rivosinc.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • x86/mm: care about shadow stack guard gap during placement · c44357c2
      Rick Edgecombe authored
      When memory is being placed, mmap() will take care to respect the guard
      gaps of certain types of memory (VM_SHADOW_STACK, VM_GROWSUP and
      VM_GROWSDOWN).  In order to ensure guard gaps between mappings, mmap()
      needs to consider two things:
      
       1. That the new mapping isn't placed in any existing mapping's guard
          gap.
       2. That the new mapping isn't placed such that any existing mappings
          are in *its* guard gap.
      
      The longstanding behavior of mmap() is to ensure 1, but not take any care
      around 2.  So for example, if there is a PAGE_SIZE free area, and a mmap()
      with a PAGE_SIZE size, and a type that has a guard gap is being placed,
      mmap() may place the shadow stack in the PAGE_SIZE free area.  Then the
      mapping that is supposed to have a guard gap will not have a gap to the
      adjacent VMA.
      
      Now that vm_flags is passed into the arch get_unmapped_area()
      implementations, and vm_unmapped_area() is ready to consider it, give
      VM_SHADOW_STACK mappings guard gap consideration for scenario 2.
      
      Link: https://lkml.kernel.org/r/20240326021656.202649-14-rick.p.edgecombe@intel.com
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Deepak Gupta <debug@rivosinc.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • x86/mm: implement HAVE_ARCH_UNMAPPED_AREA_VMFLAGS · c5ecd8eb
      Rick Edgecombe authored
      When memory is being placed, mmap() will take care to respect the guard
      gaps of certain types of memory (VM_SHADOW_STACK, VM_GROWSUP and
      VM_GROWSDOWN).  In order to ensure guard gaps between mappings, mmap()
      needs to consider two things:
      
       1. That the new mapping isn't placed in any existing mapping's guard
          gap.
       2. That the new mapping isn't placed such that any existing mappings
          are in *its* guard gap.
      
      The longstanding behavior of mmap() is to ensure 1, but not take any care
      around 2.  So for example, if there is a PAGE_SIZE free area, and a mmap()
      with a PAGE_SIZE size, and a type that has a guard gap is being placed,
      mmap() may place the shadow stack in the PAGE_SIZE free area.  Then the
      mapping that is supposed to have a guard gap will not have a gap to the
      adjacent VMA.
      
      Add x86 arch implementations of arch_get_unmapped_area_vmflags/_topdown()
      so future changes can allow the guard gap of the type of VMA being placed
      to be taken into account.  This will be used for shadow stack memory.
      
      Link: https://lkml.kernel.org/r/20240326021656.202649-13-rick.p.edgecombe@intel.com
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Deepak Gupta <debug@rivosinc.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: take placement mappings gap into account · 44bd7ace
      Rick Edgecombe authored
      When memory is being placed, mmap() will take care to respect the guard
      gaps of certain types of memory (VM_SHADOW_STACK, VM_GROWSUP and
      VM_GROWSDOWN).  In order to ensure guard gaps between mappings, mmap()
      needs to consider two things:
      
       1. That the new mapping isn't placed in any existing mapping's guard
          gap.
       2. That the new mapping isn't placed such that any existing mappings
          are in *its* guard gap.
      
      The longstanding behavior of mmap() is to ensure 1, but not take any care
      around 2.  So for example, if there is a PAGE_SIZE free area, and a mmap()
      with a PAGE_SIZE size, and a type that has a guard gap is being placed,
      mmap() may place the shadow stack in the PAGE_SIZE free area.  Then the
      mapping that is supposed to have a guard gap will not have a gap to the
      adjacent VMA.
      
      For MAP_GROWSDOWN/VM_GROWSDOWN and MAP_GROWSUP/VM_GROWSUP this has not
      been a problem in practice because applications place these kinds of
      mappings very early, when there are not many mappings to find a space
      between.  But for shadow stacks, they may be placed throughout the
      lifetime of the application.
      
      Use the start_gap field to find a space that includes the guard gap for
      the new mapping.  Take care to not interfere with the alignment.
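The effect of searching for a hole that also covers the new mapping's own guard gap can be sketched in userspace. This is a simplified bottom-up model, not the kernel's vm_unmapped_area(): `struct toy_vma` and `toy_find_area()` are hypothetical, alignment handling is left out entirely, and a return value of 0 is used to mean "no hole found".

```c
#include <stddef.h>

/* Hypothetical stand-in for an existing mapping: [start, end). */
struct toy_vma {
    unsigned long start;
    unsigned long end;
};

/*
 * Bottom-up search over mappings sorted by address.  Returns the lowest
 * address in [low, high) where `len` bytes fit with `start_gap` free bytes
 * in front of them, or 0 if no such hole exists.
 */
static unsigned long toy_find_area(const struct toy_vma *vmas, size_t n,
                                   unsigned long low, unsigned long high,
                                   unsigned long len, unsigned long start_gap)
{
    unsigned long hole_start = low;

    for (size_t i = 0; i <= n; i++) {
        unsigned long hole_end = (i < n) ? vmas[i].start : high;

        /* The hole must hold the guard gap plus the mapping itself. */
        if (hole_end > hole_start &&
            hole_end - hole_start >= len + start_gap)
            return hole_start + start_gap;
        if (i < n && vmas[i].end > hole_start)
            hole_start = vmas[i].end;
    }
    return 0;
}
```

With a start_gap of zero, a one-page request lands in a one-page hole directly against its neighbour; with a one-page start_gap the same hole is skipped and the mapping is placed one page above the next free region's start.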
      
      Link: https://lkml.kernel.org/r/20240326021656.202649-12-rick.p.edgecombe@intel.com
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Deepak Gupta <debug@rivosinc.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • treewide: use initializer for struct vm_unmapped_area_info · b80fa3cb
      Rick Edgecombe authored
      Future changes will need to add a new member to struct
      vm_unmapped_area_info.  This would cause trouble for any call site that
      doesn't initialize the struct.  Currently every caller sets each member
      manually, so if new ones are added they will be uninitialized and the core
      code parsing the struct will see garbage in the new member.
      
      It could be possible to initialize the new member manually to 0 at each
      call site.  This and a couple other options were discussed.  Having some
      struct vm_unmapped_area_info instances not zero initialized will put those
      sites at risk of feeding garbage into vm_unmapped_area(), if the
      convention is to zero initialize the struct and any new field addition
      misses a call site that initializes each field manually.  So it is useful
      to do things similarly across the kernel.
      
      The consensus (see links) was that, in general, the best way to accomplish
      this, taking into account both code cleanliness and minimizing the chance
      of introducing bugs, was to do C99 static initialization.  As in:
      struct vm_unmapped_area_info info = {};
      
      With this method of initialization, the whole struct will be zero
      initialized, and any statements setting fields to zero will be unneeded.
      The change should not leave cleanup at the call sites.
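The pattern can be tried out in userspace with a cut-down stand-in for the struct. `struct toy_area_info` and `toy_make_info()` are hypothetical names; the member list is an assumption made for illustration, with `start_gap` playing the role of the member added later. The empty-brace initializer is accepted by GCC and Clang as an extension and is standard in C23.

```c
/* Hypothetical cut-down stand-in for struct vm_unmapped_area_info. */
struct toy_area_info {
    unsigned long flags;
    unsigned long length;
    unsigned long low_limit;
    unsigned long high_limit;
    unsigned long align_mask;
    unsigned long align_offset;
    unsigned long start_gap;    /* the "new member" added later */
};

static struct toy_area_info toy_make_info(unsigned long len,
                                          unsigned long low,
                                          unsigned long high)
{
    /* Empty initializer: every member starts out as zero. */
    struct toy_area_info info = {};

    info.length = len;
    info.low_limit = low;
    info.high_limit = high;
    /* No "info.align_mask = 0;" style statements are needed any more. */
    return info;
}
```

Because the declaration zeroes everything, adding another member to the struct later does not require revisiting this call site.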
      
      While iterating through the possible solutions a few archs kindly acked
      other variations that still zero initialized the struct.  These sites have
      been modified in previous changes using the pattern acked by the
      respective arch.
      
      So to reduce the chance of bugs via uninitialized fields, perform a
      tree wide change using the consensus for the best general way to do this
      change.  Use C99 static initializing to zero the struct and remove any
      statements that simply set members to zero.
      
      Link: https://lkml.kernel.org/r/20240326021656.202649-11-rick.p.edgecombe@intel.com
      Link: https://lore.kernel.org/lkml/202402280912.33AEE7A9CF@keescook/#t
      Link: https://lore.kernel.org/lkml/j7bfvig3gew3qruouxrh7z7ehjjafrgkbcmg6tcghhfh3rhmzi@wzlcoecgy5rs/
      Link: https://lore.kernel.org/lkml/ec3e377a-c0a0-4dd3-9cb9-96517e54d17e@csgroup.eu/
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Deepak Gupta <debug@rivosinc.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • powerpc: use initializer for struct vm_unmapped_area_info · 9d8187b9
      Rick Edgecombe authored
      Future changes will need to add a new member to struct
      vm_unmapped_area_info.  This would cause trouble for any call site that
      doesn't initialize the struct.  Currently every caller sets each member
      manually, so if new members are added they will be uninitialized and the
      core code parsing the struct will see garbage in the new member.
      
      It could be possible to initialize the new member manually to 0 at each
      call site.  This and a couple other options were discussed, and a working
      consensus (see links) was that in general the best way to accomplish this
      would be via static initialization with designated member initializers.
      Having some struct vm_unmapped_area_info instances not zero initialized
      will put those sites at risk of feeding garbage into vm_unmapped_area() if
      the convention is to zero initialize the struct and any new member
      addition misses a call site that initializes each member manually.
      
      It could be possible to leave the code mostly untouched, and just change
      the line:
      struct vm_unmapped_area_info info
      to:
      struct vm_unmapped_area_info info = {};
      
      However, that would leave cleanup for the members that are manually set to
      zero, as it would no longer be required.
      
      So to reduce the chance of bugs via uninitialized members, instead
      simply continue the process of initializing the struct this way tree-wide.
      This will zero any unspecified members.  Move the member initializers to
      the struct declaration when they are known at that time.  Leave the
      members out that were manually initialized to zero, as this would be
      redundant for designated initializers.
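
      As a rough illustration of why a designated initializer protects against
      a later struct change, here is a minimal sketch.  struct demo_area_info
      and its start_gap member are hypothetical stand-ins, not the kernel's
      struct vm_unmapped_area_info; the point is only that members left out of
      the initializer are zeroed by the compiler.

```c
#include <assert.h>

/* Hypothetical stand-in for the real struct; member names are illustrative. */
struct demo_area_info {
	unsigned long flags;
	unsigned long length;
	unsigned long low_limit;
	unsigned long high_limit;
	unsigned long align_mask;
	unsigned long start_gap;	/* imagined member added by a later change */
};

/*
 * Designated initializer: only the known members are named.  Every member
 * that is not named (flags, align_mask, start_gap) is zero-initialized, so
 * adding a member later cannot leave garbage at this call site.
 */
struct demo_area_info demo_make_info(unsigned long len, unsigned long low,
				     unsigned long high)
{
	struct demo_area_info info = {
		.length = len,
		.low_limit = low,
		.high_limit = high,
	};

	return info;
}
```

      A call site written this way needs no edit when a member is added,
      whereas one that assigns each member by hand would have to be found and
      updated.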
      
      Link: https://lkml.kernel.org/r/20240326021656.202649-10-rick.p.edgecombe@intel.com
      Link: https://lore.kernel.org/lkml/202402280912.33AEE7A9CF@keescook/#t
      Link: https://lore.kernel.org/lkml/j7bfvig3gew3qruouxrh7z7ehjjafrgkbcmg6tcghhfh3rhmzi@wzlcoecgy5rs/
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Deepak Gupta <debug@rivosinc.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9d8187b9
    • Rick Edgecombe's avatar
      parisc: use initializer for struct vm_unmapped_area_info · 5e145228
      Rick Edgecombe authored
      Future changes will need to add a new member to struct
      vm_unmapped_area_info.  This would cause trouble for any call site that
      doesn't initialize the struct.  Currently every caller sets each member
      manually, so if new members are added they will be uninitialized and the
      core code parsing the struct will see garbage in the new member.
      
      It could be possible to initialize the new member manually to 0 at each
      call site.  This and a couple other options were discussed, and a working
      consensus (see links) was that in general the best way to accomplish this
      would be via static initialization with designated member initializers.
      Having some struct vm_unmapped_area_info instances not zero initialized
      will put those sites at risk of feeding garbage into vm_unmapped_area() if
      the convention is to zero initialize the struct and any new member
      addition misses a call site that initializes each member manually.
      
      It could be possible to leave the code mostly untouched, and just change
      the line:
      struct vm_unmapped_area_info info
      to:
      struct vm_unmapped_area_info info = {};
      
      However, that would leave cleanup for the members that are manually set
      to zero, as it would no longer be required.
      
      So to reduce the chance of bugs via uninitialized members, instead
      simply continue the process of initializing the struct this way tree-wide.
      This will zero any unspecified members.  Move the member initializers to
      the struct declaration when they are known at that time.  Leave the
      members out that were manually initialized to zero, as this would be
      redundant for designated initializers.
      
      Link: https://lkml.kernel.org/r/20240326021656.202649-9-rick.p.edgecombe@intel.com
      Link: https://lore.kernel.org/lkml/202402280912.33AEE7A9CF@keescook/#t
      Link: https://lore.kernel.org/lkml/j7bfvig3gew3qruouxrh7z7ehjjafrgkbcmg6tcghhfh3rhmzi@wzlcoecgy5rs/
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Acked-by: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Deepak Gupta <debug@rivosinc.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5e145228
    • Rick Edgecombe's avatar
      csky: use initializer for struct vm_unmapped_area_info · bf6f3c18
      Rick Edgecombe authored
      Future changes will need to add a new member to struct
      vm_unmapped_area_info.  This would cause trouble for any call site that
      doesn't initialize the struct.  Currently every caller sets each member
      manually, so if new members are added they will be uninitialized and the
      core code parsing the struct will see garbage in the new member.
      
      It could be possible to initialize the new member manually to 0 at each
      call site.  This and a couple other options were discussed, and a working
      consensus (see links) was that in general the best way to accomplish this
      would be via static initialization with designated member initializers.
      Having some struct vm_unmapped_area_info instances not zero initialized
      will put those sites at risk of feeding garbage into vm_unmapped_area() if
      the convention is to zero initialize the struct and any new member
      addition misses a call site that initializes each member manually.
      
      It could be possible to leave the code mostly untouched, and just change
      the line:
      struct vm_unmapped_area_info info
      to:
      struct vm_unmapped_area_info info = {};
      
      However, that would leave cleanup for the members that are manually set to
      zero, as it would no longer be required.
      
      So to reduce the chance of bugs via uninitialized members, instead
      simply continue the process of initializing the struct this way tree-wide.
      This will zero any unspecified members.  Move the member initializers to
      the struct declaration when they are known at that time.  Leave the
      members out that were manually initialized to zero, as this would be
      redundant for designated initializers.
      
      Link: https://lkml.kernel.org/r/20240326021656.202649-8-rick.p.edgecombe@intel.com
      Link: https://lore.kernel.org/lkml/202402280912.33AEE7A9CF@keescook/#t
      Link: https://lore.kernel.org/lkml/j7bfvig3gew3qruouxrh7z7ehjjafrgkbcmg6tcghhfh3rhmzi@wzlcoecgy5rs/
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Reviewed-by: Guo Ren <guoren@kernel.org>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Deepak Gupta <debug@rivosinc.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bf6f3c18
    • Rick Edgecombe's avatar
      thp: add thp_get_unmapped_area_vmflags() · ed48e87c
      Rick Edgecombe authored
      When memory is being placed, mmap() will take care to respect the guard
      gaps of certain types of memory (VM_SHADOWSTACK, VM_GROWSUP and
      VM_GROWSDOWN).  In order to ensure guard gaps between mappings, mmap()
      needs to consider two things:
      
       1. That the new mapping isn't placed in any existing mapping's guard
          gap.
       2. That the new mapping isn't placed such that any existing mapping
          ends up in *its* guard gap.
      
      The longstanding behavior of mmap() is to ensure 1, but not take any care
      around 2.  So for example, if there is a PAGE_SIZE free area, and an
      mmap() of PAGE_SIZE with a type that has a guard gap (such as a shadow
      stack) is being placed, mmap() may place the new mapping in that
      PAGE_SIZE free area.  Then the mapping that is supposed to have a guard
      gap will not have a gap to the adjacent VMA.
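
      The PAGE_SIZE example can be sketched with a toy placement routine.
      This is a simplified model under stated assumptions, not the kernel's
      maple tree search: demo_place() and DEMO_PAGE_SIZE are hypothetical
      names, the search covers a single free range, placement is bottom-up,
      and overflow of len + start_gap is not handled.

```c
#include <assert.h>

#define DEMO_PAGE_SIZE 4096UL

/*
 * Hypothetical, simplified bottom-up placement inside one free range
 * [free_start, free_end).  Returns the chosen start address, or 0 when the
 * request, including its start gap, does not fit.
 */
unsigned long demo_place(unsigned long free_start, unsigned long free_end,
			 unsigned long len, unsigned long start_gap)
{
	if (free_end < free_start)
		return 0;
	if (free_end - free_start < len + start_gap)
		return 0;

	/* Leave start_gap bytes empty below the new mapping. */
	return free_start + start_gap;
}
```

      With start_gap set to 0 a one-page request fits a one-page hole and ends
      up directly against its neighbour; with a one-page start gap the same
      hole is rejected and a two-page hole is needed.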
      
      Add a THP implementation of the vm_flags variant of get_unmapped_area().
      Future changes will call this from mmap.c in the do_mmap() path to allow
      shadow stacks to be placed with consideration taken for the start guard
      gap.  Shadow stack memory is always private and anonymous and so special
      guard gap logic is not needed in a lot of cases, but it can be mapped by
      THP, so needs to be handled.
      
      Link: https://lkml.kernel.org/r/20240326021656.202649-7-rick.p.edgecombe@intel.com
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Deepak Gupta <debug@rivosinc.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ed48e87c
    • Rick Edgecombe's avatar
      mm: use get_unmapped_area_vmflags() · 8a0fe564
      Rick Edgecombe authored
      When memory is being placed, mmap() will take care to respect the guard
      gaps of certain types of memory (VM_SHADOWSTACK, VM_GROWSUP and
      VM_GROWSDOWN).  In order to ensure guard gaps between mappings, mmap()
      needs to consider two things:
      
       1. That the new mapping isn't placed in any existing mapping's guard
          gap.
       2. That the new mapping isn't placed such that any existing mapping
          ends up in *its* guard gap.
      
      The longstanding behavior of mmap() is to ensure 1, but not take any care
      around 2.  So for example, if there is a PAGE_SIZE free area, and an
      mmap() of PAGE_SIZE with a type that has a guard gap (such as a shadow
      stack) is being placed, mmap() may place the new mapping in that
      PAGE_SIZE free area.  Then the mapping that is supposed to have a guard
      gap will not have a gap to the adjacent VMA.
      
      Use mm_get_unmapped_area_vmflags() in do_mmap() so that future changes can
      cause shadow stack mappings to be placed with a guard gap.  Also use the
      THP variant that takes vm_flags, such that THP shadow stack can get the
      same treatment.  Adjust the vm_flags calculation to happen earlier so that
      the vm_flags can be passed into __get_unmapped_area().
      
      Link: https://lkml.kernel.org/r/20240326021656.202649-6-rick.p.edgecombe@intel.com
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Deepak Gupta <debug@rivosinc.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8a0fe564
    • Rick Edgecombe's avatar
      mm: remove export for get_unmapped_area() · 529781b2
      Rick Edgecombe authored
      The mm/mmap.c function get_unmapped_area() is not used from any modules,
      so it doesn't need to be exported.  Remove the export.
      
      Link: https://lkml.kernel.org/r/20240326021656.202649-5-rick.p.edgecombe@intel.com
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Deepak Gupta <debug@rivosinc.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      529781b2
    • Rick Edgecombe's avatar
      mm: introduce arch_get_unmapped_area_vmflags() · 96114870
      Rick Edgecombe authored
      When memory is being placed, mmap() will take care to respect the guard
      gaps of certain types of memory (VM_SHADOWSTACK, VM_GROWSUP and
      VM_GROWSDOWN).  In order to ensure guard gaps between mappings, mmap()
      needs to consider two things:
      
       1. That the new mapping isn't placed in any existing mapping's guard
          gap.
       2. That the new mapping isn't placed such that any existing mapping
          ends up in *its* guard gap.
      
      The longstanding behavior of mmap() is to ensure 1, but not take any care
      around 2.  So for example, if there is a PAGE_SIZE free area, and an
      mmap() of PAGE_SIZE with a type that has a guard gap (such as a shadow
      stack) is being placed, mmap() may place the new mapping in that
      PAGE_SIZE free area.  Then the mapping that is supposed to have a guard
      gap will not have a gap to the adjacent VMA.
      
      In order to take the start gap into account, the maple tree search needs
      to know the size of the start gap the new mapping will need.  The call chain
      from do_mmap() to the actual maple tree search looks like this:
      
      do_mmap(size, vm_flags, map_flags, ..)
      	mm/mmap.c:get_unmapped_area(size, map_flags, ...)
      		arch_get_unmapped_area(size, map_flags, ...)
      			vm_unmapped_area(struct vm_unmapped_area_info)
      
      One option would be to add another MAP_ flag to mean a one page start gap
      (as is for shadow stack), but this consumes a flag unnecessarily.  Another
      option could be to simply increase the size passed in do_mmap() by the
      start gap size, and adjust after the fact, but this will interfere with
      the alignment requirements passed in struct vm_unmapped_area_info, which
      are unknown to mmap.c.  Instead, introduce variants of
      arch_get_unmapped_area/_topdown() that take vm_flags.  In future changes,
      these variants can be used in mmap.c:get_unmapped_area() to allow the
      vm_flags to be passed through to vm_unmapped_area(), while preserving the
      normal arch_get_unmapped_area/_topdown() for the existing callers.
      
      Link: https://lkml.kernel.org/r/20240326021656.202649-4-rick.p.edgecombe@intel.com
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Deepak Gupta <debug@rivosinc.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      96114870
    • Rick Edgecombe's avatar
      mm: switch mm->get_unmapped_area() to a flag · 529ce23a
      Rick Edgecombe authored
      The mm_struct contains a function pointer *get_unmapped_area(), which is
      set to either arch_get_unmapped_area() or arch_get_unmapped_area_topdown()
      during the initialization of the mm.
      
      Since the function pointer only ever points to two functions that are
      named the same across all architectures, a function pointer is not really
      required.  In addition, future changes will want to add versions of the
      functions that take additional arguments.  So to save a pointer's worth of
      bytes in mm_struct, and prevent adding additional function pointers to
      mm_struct in future changes, remove it and keep the information about
      which get_unmapped_area() to use in a flag.
      
      Add the new flag to MMF_INIT_MASK so it doesn't get clobbered on fork by
      mmf_init_flags().  Most MM flags get clobbered on fork.  In the
      pre-existing behavior mm->get_unmapped_area() would get copied to the new
      mm in dup_mm(), so not clobbering the flag preserves the existing behavior
      around inheriting the topdown-ness.
      
      Introduce a helper, mm_get_unmapped_area(), to easily convert code that
      refers to the old function pointer to instead select and call either
      arch_get_unmapped_area() or arch_get_unmapped_area_topdown() based on the
      flag.  Then drop the mm->get_unmapped_area() function pointer.  Leave the
      get_unmapped_area() pointer in struct file_operations alone.  The main
      purpose of this change is to reorganize in preparation for future changes,
      but it also converts the calls of mm->get_unmapped_area() from indirect
      branches into direct ones.
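
      The shape of such a helper can be sketched as follows.  All names here
      (struct demo_mm, DEMO_MMF_TOPDOWN, the demo_* functions and the
      addresses they return) are hypothetical stand-ins for illustration and
      are not the kernel's definitions; the sketch only shows a flag bit
      selecting between two direct calls in place of a stored function
      pointer.

```c
#include <assert.h>

/* Hypothetical flag bit recording which layout the mm uses. */
#define DEMO_MMF_TOPDOWN (1UL << 0)

/* Hypothetical stand-in for mm_struct: a flags word, no function pointer. */
struct demo_mm {
	unsigned long flags;
};

/* Stand-ins for the two per-arch search functions. */
unsigned long demo_area_bottomup(unsigned long len)
{
	(void)len;
	return 0x10000UL;		/* lowest usable address in this toy model */
}

unsigned long demo_area_topdown(unsigned long len)
{
	return 0x70000000UL - len;	/* just below a toy upper limit */
}

/* Helper: test the flag and make a direct call, with no indirect branch. */
unsigned long demo_mm_get_unmapped_area(const struct demo_mm *mm,
					unsigned long len)
{
	if (mm->flags & DEMO_MMF_TOPDOWN)
		return demo_area_topdown(len);
	return demo_area_bottomup(len);
}
```

      Because both targets are known at compile time, the compiler can emit
      direct calls, which is where a retpoline configuration would be expected
      to benefit.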
      
      The stress-ng bigheap benchmark calls realloc a lot, which calls through
      get_unmapped_area() in the kernel.  On x86, the change yielded a ~1%
      improvement there on a retpoline config.
      
      In testing a few x86 configs, removing the pointer unfortunately didn't
      result in any actual size reductions in the compiled layout of mm_struct. 
      But depending on compiler or arch alignment requirements, the change could
      shrink the size of mm_struct.
      
      Link: https://lkml.kernel.org/r/20240326021656.202649-3-rick.p.edgecombe@intel.com
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Deepak Gupta <debug@rivosinc.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      529ce23a
    • Rick Edgecombe's avatar
      proc: refactor pde_get_unmapped_area as prep · 5def1e0f
      Rick Edgecombe authored
      Patch series "Cover a guard gap corner case", v4.
      
      In working on x86’s shadow stack feature, I came across some limitations
      around the kernel’s handling of guard gaps.  AFAICT these limitations
      are not too important for the traditional stack usage of guard gaps, but
      have bigger impact on shadow stack’s usage.  And now in addition to x86,
      we have two other architectures implementing shadow stack like features
      that plan to use guard gaps.  I wanted to see about addressing them, but I
      have not worked on mmap() placement related code before, so would greatly
      appreciate if people could take a look and point me in the right
      direction.
      
      The nature of the limitations of concern is as follows. In order to ensure 
      guard gaps between mappings, mmap() would need to consider two things:
       1. That the new mapping isn’t placed in any existing mapping’s guard
          gap.
       2. That the new mapping isn’t placed such that any existing mapping
          ends up in *its* guard gap.
      Currently mmap never considers (2), and (1) is not considered in some 
      situations.
      
      When not passing an address hint, or passing one without
      MAP_FIXED_NOREPLACE, (1) is enforced.  With MAP_FIXED_NOREPLACE, (1) is
      not enforced.  With MAP_FIXED, (1) is not considered, but this seems to be
      expected since MAP_FIXED can already clobber existing mappings.  For
      MAP_FIXED_NOREPLACE I would have guessed it should respect the guard gaps
      of existing mappings, but it is probably a little ambiguous.
      
      In this series I just tried to add enforcement of (2) for the normal (no
      address hint) case and only for the newer shadow stack memory (not
      stacks).  The reason is that with the no-address-hint situation, landing
      next to a guard gap could come up naturally and so be more influenceable by
      attackers, such that two shadow stacks could be adjacent without a guard
      gap.  The address-hint scenarios, by contrast, would require more control:
      being able to call mmap() with specific arguments.  As for why not just
      fix the other corner cases anyway, I thought it might have some greater
      possibility of affecting existing apps.
      
      
      This patch (of 14):
      
      Future changes will perform a treewide change to remove the indirect
      branch that is involved in calling mm->get_unmapped_area().  After doing
      this, the function will no longer be able to be handled as a function
      pointer.  To make the treewide change diff cleaner and easier to review,
      refactor pde_get_unmapped_area() such that mm->get_unmapped_area() is
      called without being stored in a local function pointer.  With this
      refactoring in place, follow-on changes will be able to simply replace the
      call site with a future function that calls it directly.
      
      Link: https://lkml.kernel.org/r/20240326021656.202649-1-rick.p.edgecombe@intel.com
      Link: https://lkml.kernel.org/r/20240326021656.202649-2-rick.p.edgecombe@intel.com
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Deepak Gupta <debug@rivosinc.com>
      Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5def1e0f
    • ZhangPeng's avatar
      userfaultfd: early return in dup_userfaultfd() · afd58439
      ZhangPeng authored
      When vma->vm_userfaultfd_ctx.ctx is NULL, vma->vm_flags should already
      have __VM_UFFD_FLAGS cleared. Therefore, there is no need to down_write or
      clear the flag; doing so only hurts fork performance. Fix this by
      returning early if octx is NULL in dup_userfaultfd().
      
      By applying this patch we can get a 1.3% performance improvement for
      lmbench fork_prot. Results are as follows:
                         base      early return
      Process fork+exit: 419.1106  413.4804
      
      Link: https://lkml.kernel.org/r/20240327090835.3232629-1-zhangpeng362@huawei.com
      Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      afd58439
    • Kefeng Wang's avatar
    • David Hildenbrand's avatar
      mm: remove "prot" parameter from move_pte() · 82a616d0
      David Hildenbrand authored
      The "prot" parameter is unused, and using it instead of what's stored in
      that particular PTE would very likely be wrong.  Let's simply remove it.
      
      Link: https://lkml.kernel.org/r/20240327143301.741807-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      82a616d0
    • David Hildenbrand's avatar
      mm: optimize CONFIG_PER_VMA_LOCK member placement in vm_area_struct · 3b612c8f
      David Hildenbrand authored
      Currently, we end up wasting some memory in each vm_area_struct. Pahole
      states that:
      	[...]
      	int                        vm_lock_seq;          /*    40     4 */
      
      	/* XXX 4 bytes hole, try to pack */
      
      	struct vma_lock *          vm_lock;              /*    48     8 */
      	bool                       detached;             /*    56     1 */
      
      	/* XXX 7 bytes hole, try to pack */
      	[...]
      
      Let's reduce the holes and memory wastage by moving the bool:
      	[...]
      	bool                       detached;             /*    40     1 */
      
      	/* XXX 3 bytes hole, try to pack */
      
      	int                        vm_lock_seq;          /*    44     4 */
      	struct vma_lock *          vm_lock;              /*    48     8 */
      	[...]
      
      This effectively shrinks the vm_area_struct with CONFIG_PER_VMA_LOCK by
      8 bytes.
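
      The effect can be reproduced outside the kernel with a small sketch.
      struct demo_before and struct demo_after are hypothetical three-member
      excerpts, not the real vm_area_struct, and the 8-byte saving assumes a
      typical 64-bit ABI with 4-byte int and 8-byte, 8-aligned pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Original order: a hole after the int and tail padding after the bool. */
struct demo_before {
	int vm_lock_seq;
	void *vm_lock;
	bool detached;
};

/* Reordered: the bool and the int share the slot in front of the pointer. */
struct demo_after {
	bool detached;
	int vm_lock_seq;
	void *vm_lock;
};

/* Bytes saved by the reordering on the ABI this is compiled for. */
size_t demo_bytes_saved(void)
{
	return sizeof(struct demo_before) - sizeof(struct demo_after);
}
```

      On such an ABI the first layout occupies 24 bytes and the second 16.  On
      a 32-bit ABI both layouts would normally come out the same size.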
      
      Likely, we could place "detached" in the lowest bit of vm_lock, but at
      least on 64bit that won't really make a difference, so keep it simple.
      
      Link: https://lkml.kernel.org/r/20240327143548.744070-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3b612c8f
    • Matthew Wilcox (Oracle)'s avatar
      filemap: remove __set_page_dirty() · 07db63a2
      Matthew Wilcox (Oracle) authored
      All callers have been converted to use folios; remove this wrapper.
      
      Link: https://lkml.kernel.org/r/20240327185447.1076689-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      07db63a2
    • Matthew Wilcox (Oracle)'s avatar
      mm: use rwsem assertion macros for mmap_lock · ba168b52
      Matthew Wilcox (Oracle) authored
      This slightly strengthens our write assertion when lockdep is disabled. 
      It also downgrades us from BUG_ON to WARN_ON, but I think that's an
      improvement.  I don't think dumping the mm_struct was all that valuable;
      the call chain is what's important.
      
      Link: https://lkml.kernel.org/r/20240327190701.1082560-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ba168b52
    • Peter Xu's avatar
      mm: allow anon exclusive check over hugetlb tail pages · c0bff412
      Peter Xu authored
      PageAnonExclusive() used to forbid tail pages for hugetlbfs, as that used
      to be called mostly in hugetlb specific paths and the head page was
      guaranteed.
      
      As we move forward towards merging hugetlb paths into generic mm, we may
      start to pass in tail hugetlb pages (with cont-pte/cont-pmd huge pages)
      for such checks.  Allow it to properly fetch the head, in which case the
      anon-exclusiveness of the head always represents the tail page.
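
The head lookup described above can be pictured with a toy model. This is a sketch under stated assumptions, not the kernel implementation: struct toy_page and both helpers are hypothetical names invented for illustration, and the real check operates on page flags rather than a bool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page descriptor; every name here is hypothetical, not kernel API. */
struct toy_page {
	struct toy_page *head;	/* points to itself for a head page */
	bool is_hugetlb;
	bool anon_exclusive;	/* for hugetlb, only tracked on the head */
};

/* Resolve any page of a compound page to its head page. */
struct toy_page *toy_compound_head(struct toy_page *page)
{
	assert(page != NULL && page->head != NULL);
	return page->head;
}

/* For hugetlb, a tail page takes its answer from the head page. */
bool toy_page_anon_exclusive(struct toy_page *page)
{
	if (page->is_hugetlb)
		page = toy_compound_head(page);
	return page->anon_exclusive;
}
```

In this model a caller may pass either the head or any tail of a hugetlb page and get the same result, which is the property the generic paths need.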
      
      There's already a sign of it in GUP-fast, which already contains the
      hugetlb processing altogether: we used to have a specific commit
      5805192c ("mm/gup: handle cont-PTE hugetlb pages correctly in
      gup_must_unshare() via GUP-fast") covering that area.  Now with this more
      generic change, that can also go away.
      
      [akpm@linux-foundation.org: simplify PageAnonExclusive(), per Matthew]
        Link: https://lkml.kernel.org/r/Zg3u5Sh9EbbYPhaI@casper.infradead.org
      Link: https://lkml.kernel.org/r/20240403013249.1418299-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: WANG Xuerui <kernel@xen0n.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c0bff412
    • Peter Xu's avatar
      mm/gup: handle hugetlb in the generic follow_page_mask code · 9cb28da5
      Peter Xu authored
      Now follow_page() is ready to handle hugetlb pages in whatever form, and
      over all architectures.  Switch to the generic code path.
      
      Time to retire hugetlb_follow_page_mask(), following the previous
      retirement of follow_hugetlb_page() in 48498071.
      
      There may be a slight difference in how the loops run when processing slow
      GUP over a large hugetlb range on archs that support cont_pte/cont_pmd:
      with the patch applied, each loop of __get_user_pages() will resolve one
      pgtable entry rather than relying on the size of the hugetlb hstate, which
      may cover multiple entries in one loop.
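
A rough way to see the difference is to count loop iterations for the two step sizes. The sketch below is plain arithmetic with a hypothetical helper name; the example numbers assume arm64 with 4 KiB base pages, where a cont-PTE hugetlb page is 64 KiB made of 16 PTEs.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of loop iterations needed to cover `range` bytes when every
 * iteration advances by `step` bytes. Both values are assumed to be
 * aligned so that the division is exact.
 */
size_t gup_iterations(size_t range, size_t step)
{
	assert(step != 0 && range % step == 0);
	return range / step;
}
```

Under those assumptions, a 2 MiB range takes 32 iterations when stepping by the 64 KiB hstate size and 512 iterations when stepping one 4 KiB entry at a time.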
      
      A quick performance test on an aarch64 VM on an M1 chip shows a 15%
      degradation over a tight loop of slow gup after the path switched.  That
      shouldn't be a problem because slow-gup should not be a hot path for GUP
      in general: when the page is commonly present, fast-gup will already
      succeed, while when the page is indeed missing and requires a follow-up
      page fault, the slow gup degradation will probably be buried in the fault
      paths anyway.  It also explains why slow gup for THP used to be very slow
      before 57edfcfd ("mm/gup: accelerate thp gup even for "pages != NULL"")
      landed, the latter not being part of a performance analysis but a side
      benefit.  If performance becomes a concern, we can consider handling
      CONT_PTE in follow_page().

      Before that is justified to be necessary, keep everything clean and simple.
      
      Link: https://lkml.kernel.org/r/20240327152332.950956-14-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9cb28da5
    • Peter Xu's avatar
      mm/gup: handle hugepd for follow_page() · a12083d7
      Peter Xu authored
      Hugepd is so far only used on PowerPC, on 4K page size kernels where the
      hash MMU is used.  follow_page_mask() used to leverage hugetlb APIs to
      access hugepd entries.  Teach follow_page_mask() itself about hugepd.
      
      With previous refactors on fast-gup gup_huge_pd(), most of the code can be
      leveraged.  There's something not needed for follow page, for example,
      gup_hugepte() tries to detect pgtable entry change which will never happen
      with slow gup (which has the pgtable lock held), but that's not a problem
      to check.
      
      Since follow_page() only ever fetches one page, setting the end to
      "address + PAGE_SIZE" should suffice.  We will still do the pgtable walk
      once for each hugetlb page by setting ctx->page_mask properly.
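
How a page mask lets the caller walk the page table only once per huge page can be sketched with the stepping arithmetic below. This is a user-space illustration with hypothetical names; it assumes 4 KiB base pages and a mask equal to the number of small pages in the huge mapping minus one, and it mirrors, as far as I recall, the way the slow GUP loop derives its step from the page mask.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12UL	/* assumption: 4 KiB base pages */

/*
 * Number of small pages the caller can consume after one lookup at
 * `addr`: everything from addr up to the end of the huge mapping,
 * capped at the number of pages still requested.
 */
unsigned long toy_page_increm(unsigned long addr, unsigned long page_mask,
			      unsigned long nr_pages)
{
	unsigned long incr;

	assert(nr_pages > 0);
	incr = 1 + (~(addr >> TOY_PAGE_SHIFT) & page_mask);
	return incr > nr_pages ? nr_pages : incr;
}
```

With a mask of 511 (a 2 MiB mapping of 4 KiB pages) and an address three pages into the mapping, the step is 509 pages, so the next lookup starts at the following huge page. With a mask of 0 the step is always one page.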
      
      One thing worth mentioning is that the _bad() helper of some pgtable
      levels will report is_hugepd() entries as TRUE on Power8 hash MMUs.  I
      think it at least applies to PUD on Power8 with 4K pgsize.  It means
      feeding a hugepd entry to pud_bad() will report a false positive.  Let's
      leave that for now because it can be arch-specific, which I am a bit
      reluctant to touch.  In this patch it's not a problem as long as hugepd is
      detected before any bad pgtable entries.
      
      To allow slow gup like follow_*_page() to access hugepd helpers, the
      hugepd code is moved to the top.  Besides that, the helper
      record_subpages() will now be used by either hugepd or fast-gup.  To
      avoid "unused function" warnings we must provide a "#ifdef" for it,
      unfortunately.
      
      Link: https://lkml.kernel.org/r/20240327152332.950956-13-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a12083d7
    • Peter Xu's avatar
      mm/gup: handle huge pmd for follow_pmd_mask() · 4418c522
      Peter Xu authored
      Replace pmd_trans_huge() with pmd_leaf() to also cover pmd_huge() as long
      as enabled.
      
      FOLL_TOUCH and FOLL_SPLIT_PMD only apply to THP, not yet huge.
      
      Since follow_trans_huge_pmd() can now process hugetlb pages, rename it to
      follow_huge_pmd() to match what it does.  Move it into gup.c so that it
      does not depend on CONFIG_THP.

      While at it, move the ctx->page_mask setup into follow_huge_pmd() and only
      set it when the page is valid.  It was not a bug to set it before even if
      GUP failed (page==NULL), because follow_page_mask() callers always ignore
      page_mask if so.  But doing so makes the code cleaner.
      
      [peterx@redhat.com: allow follow_pmd_mask() to take hugetlb tail pages]
        Link: https://lkml.kernel.org/r/20240403013249.1418299-3-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20240327152332.950956-12-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4418c522
    • Peter Xu's avatar
      mm/gup: handle huge pud for follow_pud_mask() · 1b167618
      Peter Xu authored
      Teach follow_pud_mask() to be able to handle normal PUD pages like
      hugetlb.
      
      Rename follow_devmap_pud() to follow_huge_pud() so that it can process
      either huge devmap or hugetlb.  Move it out of TRANSPARENT_HUGEPAGE_PUD
      and huge_memory.c (which relies on CONFIG_THP).  Switch to pud_leaf() to
      detect both cases in the slow gup.

      In the new follow_huge_pud(), take care of possible CoR for hugetlb if
      necessary.  touch_pud() needs to be moved out of huge_memory.c to be
      accessible from gup.c even if !THP.

      While at it, optimize the non-present check by adding an early
      pud_present() check before taking the pgtable lock, failing the
      follow_page() early if the PUD is not present: that is required by both
      devmap and hugetlb.  Use pud_huge() to also cover the pud_devmap() case.

      One more trivial thing to mention is that "pud_t pud" is introduced in
      the code paths along the way, so the code doesn't dereference *pudp
      multiple times.  Not only because that looks less straightforward, but
      also because if the dereference really happened, it's not clear whether
      there can be a race to see different *pudp values when it's being
      modified at the same time.

      Set ctx->page_mask properly for a PUD entry.  As a side effect, this
      patch should also be able to optimize devmap GUP on PUD to be able to jump
      over the whole PUD range, but that is not yet verified.  Hugetlb already
      can do so prior to this patch.
      
      Link: https://lkml.kernel.org/r/20240327152332.950956-11-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1b167618
    • Peter Xu's avatar
      mm/gup: cache *pudp in follow_pud_mask() · caf8cab7
      Peter Xu authored
      Introduce "pud_t pud" in the function, so the code won't dereference *pudp
      multiple times.  Not only because that looks less straightforward, but
      also because if the dereference really happened, it's not clear whether
      there can be a race to see different *pudp values if it's being modified
      at the same time.
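
The idea of reading a shared entry once and testing only the local copy can be sketched in user-space C11. Everything below is hypothetical: the entry bits are invented for illustration and a relaxed atomic load stands in for whatever read primitive the kernel code uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry bits; this is not a real page table format. */
#define TOY_PRESENT	(UINT64_C(1) << 0)
#define TOY_LEAF	(UINT64_C(1) << 1)

/*
 * Read the shared entry exactly once, then run every check against the
 * local snapshot, so both tests are guaranteed to see the same value
 * even if another thread rewrites the entry in between.
 */
bool toy_entry_is_present_leaf(_Atomic uint64_t *entryp)
{
	uint64_t entry;

	assert(entryp != NULL);
	entry = atomic_load_explicit(entryp, memory_order_relaxed);
	if (!(entry & TOY_PRESENT))
		return false;
	return (entry & TOY_LEAF) != 0;
}
```

Had the function tested *entryp twice instead, the present check and the leaf check could each observe a different value under concurrent modification.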
      
      Link: https://lkml.kernel.org/r/20240327152332.950956-10-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Acked-by: James Houghton <jthoughton@google.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      caf8cab7
    • Peter Xu's avatar
      mm/gup: handle hugetlb for no_page_table() · 878b0c45
      Peter Xu authored
      no_page_table() is not yet used for hugetlb code paths.  Make it prepared.
      
      The major difference here is hugetlb will return -EFAULT as long as page
      cache does not exist, even if VM_SHARED.  See hugetlb_follow_page_mask().
      
      Pass "address" into no_page_table() too, as hugetlb will need it.
      
      Link: https://lkml.kernel.org/r/20240327152332.950956-9-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@infradead.org>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      878b0c45
    • Peter Xu's avatar
      mm/gup: refactor record_subpages() to find 1st small page · f3c94c62
      Peter Xu authored
      All the fast-gup functions operate on a tail page and always need to do
      page mask calculations before feeding it into record_subpages().
      
      Merge that logic into record_subpages(), so that it will do the nth_page()
      calculation.
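
A user-space sketch of the merged logic might look as follows. It is an illustration under assumptions, not the kernel function: struct toy_page, the toy_ names and the 4 KiB page size are hypothetical, and plain pointer arithmetic stands in for nth_page().

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT	12UL	/* assumption: 4 KiB base pages */
#define TOY_PAGE_SIZE	(1UL << TOY_PAGE_SHIFT)

struct toy_page { unsigned long unused; };

/*
 * `head` is the first small page of a huge mapping of `sz` bytes (a
 * power of two). Fill pages[] with the small pages backing [addr, end)
 * and return how many were recorded. The caller no longer computes the
 * tail page itself; the offset into the mapping is derived here.
 */
int toy_record_subpages(struct toy_page *head, unsigned long sz,
			unsigned long addr, unsigned long end,
			struct toy_page **pages)
{
	struct toy_page *start;
	int nr;

	assert(sz != 0 && (sz & (sz - 1)) == 0);
	assert(addr <= end && ((addr | end) & (TOY_PAGE_SIZE - 1)) == 0);

	/* Index of the small page that contains addr within the mapping. */
	start = head + ((addr & (sz - 1)) >> TOY_PAGE_SHIFT);
	for (nr = 0; addr != end; nr++, addr += TOY_PAGE_SIZE)
		pages[nr] = start + nr;
	return nr;
}
```

For a 64 KiB mapping at 0x10000, requesting the range from 0x13000 to 0x15000 records two entries, pointing at the fourth and fifth small pages of the mapping.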
      
      Link: https://lkml.kernel.org/r/20240327152332.950956-8-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f3c94c62
    • Peter Xu's avatar
      mm/gup: drop gup_fast_folio_allowed() in hugepd processing · 607c6319
      Peter Xu authored
      The hugepd format for GUP is only used on PowerPC with hugetlbfs.  There
      is some kernel usage of hugepd (refer to hugepd_populate_kernel() for
      PPC_8XX), however those pages are not candidates for GUP.

      Commit a6e79df9 ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
      file-backed mappings") added a check to fail gup-fast if there's potential
      risk of violating GUP over writeback file systems.  That should never
      apply to hugepd.  Considering that hugepd is an old format (and even
      software-only), there's no plan to extend hugepd to other file-backed
      memory types that are prone to the same issue.

      Drop that check, not only because it'll never be true for hugepd per any
      known plan, but also because it paves the way for reusing the function
      outside fast-gup.
      
      To make sure we'll still remember this issue just in case hugepd will be
      extended to support non-hugetlbfs memories, add a rich comment above
      gup_huge_pd(), explaining the issue with proper references.
      
      [akpm@linux-foundation.org: fix comment, per David]
      Link: https://lkml.kernel.org/r/20240327152332.950956-7-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      607c6319
    • Peter Xu's avatar
      mm/arch: provide pud_pfn() fallback · 35a76f5c
      Peter Xu authored
      The comment in the code explains the reasons.  We took a different
      approach compared to pmd_pfn() by providing a fallback function.

      Another option is to provide some lower level config options (compared to
      HUGETLB_PAGE or THP) to identify which layer an arch can support for such
      huge mappings.  However, that may be overkill.
      
      [peterx@redhat.com: fix loongson defconfig]
        Link: https://lkml.kernel.org/r/20240403013249.1418299-4-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20240327152332.950956-6-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Jones <andrew.jones@linux.dev>
      Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      35a76f5c