1. 12 Jul, 2024 33 commits
    • powerpc/mm: fix __find_linux_pte() on 32 bits with PMD leaf entries · 6a9f66c8
      Christophe Leroy authored
      Building on 32 bits with pmd_leaf() not always returning false leads to
      the following error:
      
        CC      arch/powerpc/mm/pgtable.o
      arch/powerpc/mm/pgtable.c: In function '__find_linux_pte':
      arch/powerpc/mm/pgtable.c:506:1: error: function may return address of local variable [-Werror=return-local-addr]
        506 | }
            | ^
      arch/powerpc/mm/pgtable.c:394:15: note: declared here
        394 |         pud_t pud, *pudp;
            |               ^~~
      arch/powerpc/mm/pgtable.c:394:15: note: declared here
      
      This is due to pmd_offset() being a no-op in that case.
      
      So rework it for powerpc/32 so that pXd_offset() are used on real
      pointers and not on on-stack copies.
      
      Beyond fixing the problem, it also has the advantage of simplifying
      __find_linux_pte() including the removal of the stack frame:
      
      After this patch:
      
      	00000018 <__find_linux_pte>:
      	  18:	2c 06 00 00 	cmpwi   r6,0
      	  1c:	41 82 00 0c 	beq     28 <__find_linux_pte+0x10>
      	  20:	39 20 00 00 	li      r9,0
      	  24:	91 26 00 00 	stw     r9,0(r6)
      	  28:	2f 85 00 00 	cmpwi   cr7,r5,0
      	  2c:	41 9e 00 0c 	beq     cr7,38 <__find_linux_pte+0x20>
      	  30:	39 20 00 00 	li      r9,0
      	  34:	99 25 00 00 	stb     r9,0(r5)
      	  38:	54 89 65 3a 	rlwinm  r9,r4,12,20,29
      	  3c:	7c 63 48 2e 	lwzx    r3,r3,r9
      	  40:	2f 83 00 00 	cmpwi   cr7,r3,0
      	  44:	41 9e 00 30 	beq     cr7,74 <__find_linux_pte+0x5c>
      	  48:	54 69 07 3a 	rlwinm  r9,r3,0,28,29
      	  4c:	2f 89 00 0c 	cmpwi   cr7,r9,12
      	  50:	54 63 00 26 	clrrwi  r3,r3,12
      	  54:	54 84 b5 36 	rlwinm  r4,r4,22,20,27
      	  58:	3c 63 c0 00 	addis   r3,r3,-16384
      	  5c:	7c 63 22 14 	add     r3,r3,r4
      	  60:	4c be 00 20 	bnelr+  cr7
      	  64:	4d 82 00 20 	beqlr
      	  68:	39 20 00 17 	li      r9,23
      	  6c:	91 26 00 00 	stw     r9,0(r6)
      	  70:	4e 80 00 20 	blr
      	  74:	38 60 00 00 	li      r3,0
      	  78:	4e 80 00 20 	blr
      
      Before this patch:
      
      	00000018 <__find_linux_pte>:
      	  18:	2c 06 00 00 	cmpwi   r6,0
      	  1c:	94 21 ff e0 	stwu    r1,-32(r1)
      	  20:	41 82 00 0c 	beq     2c <__find_linux_pte+0x14>
      	  24:	39 20 00 00 	li      r9,0
      	  28:	91 26 00 00 	stw     r9,0(r6)
      	  2c:	2f 85 00 00 	cmpwi   cr7,r5,0
      	  30:	41 9e 00 0c 	beq     cr7,3c <__find_linux_pte+0x24>
      	  34:	39 20 00 00 	li      r9,0
      	  38:	99 25 00 00 	stb     r9,0(r5)
      	  3c:	54 89 65 3a 	rlwinm  r9,r4,12,20,29
      	  40:	7c 63 48 2e 	lwzx    r3,r3,r9
      	  44:	54 69 07 3a 	rlwinm  r9,r3,0,28,29
      	  48:	2f 89 00 0c 	cmpwi   cr7,r9,12
      	  4c:	90 61 00 0c 	stw     r3,12(r1)
      	  50:	41 9e 00 4c 	beq     cr7,9c <__find_linux_pte+0x84>
      	  54:	80 61 00 0c 	lwz     r3,12(r1)
      	  58:	54 69 07 3a 	rlwinm  r9,r3,0,28,29
      	  5c:	2f 89 00 0c 	cmpwi   cr7,r9,12
      	  60:	90 61 00 08 	stw     r3,8(r1)
      	  64:	41 9e 00 38 	beq     cr7,9c <__find_linux_pte+0x84>
      	  68:	80 61 00 08 	lwz     r3,8(r1)
      	  6c:	2f 83 00 00 	cmpwi   cr7,r3,0
      	  70:	41 9e 00 54 	beq     cr7,c4 <__find_linux_pte+0xac>
      	  74:	54 69 07 3a 	rlwinm  r9,r3,0,28,29
      	  78:	2f 89 00 0c 	cmpwi   cr7,r9,12
      	  7c:	54 69 00 26 	clrrwi  r9,r3,12
      	  80:	54 8a b5 36 	rlwinm  r10,r4,22,20,27
      	  84:	3c 69 c0 00 	addis   r3,r9,-16384
      	  88:	7c 63 52 14 	add     r3,r3,r10
      	  8c:	54 84 93 be 	srwi    r4,r4,14
      	  90:	41 9e 00 14 	beq     cr7,a4 <__find_linux_pte+0x8c>
      	  94:	38 21 00 20 	addi    r1,r1,32
      	  98:	4e 80 00 20 	blr
      	  9c:	54 69 00 26 	clrrwi  r9,r3,12
      	  a0:	54 84 93 be 	srwi    r4,r4,14
      	  a4:	3c 69 c0 00 	addis   r3,r9,-16384
      	  a8:	54 84 25 36 	rlwinm  r4,r4,4,20,27
      	  ac:	7c 63 22 14 	add     r3,r3,r4
      	  b0:	41 a2 ff e4 	beq     94 <__find_linux_pte+0x7c>
      	  b4:	39 20 00 17 	li      r9,23
      	  b8:	91 26 00 00 	stw     r9,0(r6)
      	  bc:	38 21 00 20 	addi    r1,r1,32
      	  c0:	4e 80 00 20 	blr
      	  c4:	38 60 00 00 	li      r3,0
      	  c8:	38 21 00 20 	addi    r1,r1,32
      	  cc:	4e 80 00 20 	blr
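
The warning described in this commit can be illustrated with a minimal user-space sketch. The types and the helper names `pmd_offset_folded()` and `find_entry()` are hypothetical stand-ins, not the kernel definitions. The point is only that a folded offset helper returns its argument, so it has to be given a pointer into the real table rather than the address of an on-stack copy.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types (folded PMD level). */
typedef struct { unsigned long val; } pud_t;
typedef struct { pud_t pud; } pmd_t;

/* With a folded level, the offset helper only reinterprets its argument. */
static pmd_t *pmd_offset_folded(pud_t *pudp, unsigned long addr)
{
	(void)addr;
	return (pmd_t *)pudp;
}

/*
 * Problematic pattern (not compiled here):
 *
 *	pud_t pud = *pudp;			// on-stack copy
 *	return pmd_offset_folded(&pud, addr);	// address of a local escapes
 *
 * Pattern used below: keep walking with pointers into the real table,
 * so the returned pointer always refers to table storage.
 */
pmd_t *find_entry(pud_t *table, unsigned long index)
{
	pud_t *pudp = &table[index];
	pmd_t *pmdp = pmd_offset_folded(pudp, 0);

	return pmdp->pud.val ? pmdp : NULL;
}
```

In this sketch an empty entry yields NULL and a populated one yields a pointer that stays valid after the function returns.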
      
      Link: https://lkml.kernel.org/r/50a3cfbab5b11890a0da027de5cb011a9d47ba89.1719928057.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • powerpc/mm: remove _PAGE_PSIZE · afc8969f
      Christophe Leroy authored
      _PAGE_PSIZE macro is never used outside the place it is defined and is
      used only on 8xx and e500.
      
      Remove the indirection: drop the macro and use its content directly.
      
      Link: https://lkml.kernel.org/r/c41da3b0ceda7311a50f0391cc4d54302ae15b74.1719928057.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: provide mm_struct and address to huge_ptep_get() · e6c0c032
      Christophe Leroy authored
      On powerpc 8xx huge_ptep_get() will need to know whether the given ptep is
      a PTE entry or a PMD entry.  This cannot be determined from the entry
      itself because its content does not easily reveal which of the two it
      is.
      
      So huge_ptep_get() will need to know either the size of the page or get
      the pmd.
      
      In order to be consistent with huge_ptep_get_and_clear(), give mm and
      address to huge_ptep_get().
      
      Link: https://lkml.kernel.org/r/cc00c70dd384298796a4e1b25d6c4eb306d3af85.1719928057.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: define __pte_leaf_size() to also take a PMD entry · 18d095b2
      Christophe Leroy authored
      On powerpc 8xx, when a page is 8M size, the information is in the PMD
      entry.  So allow architectures to provide __pte_leaf_size() instead of
      pte_leaf_size() and provide the PMD entry to that function.
      
      When __pte_leaf_size() is not defined, define it as pte_leaf_size() so
      that architectures not interested in the PMD arguments are not impacted.
      
      Only define a default pte_leaf_size() when __pte_leaf_size() is not
      defined to make sure nobody adds new calls to pte_leaf_size() in the core.
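
The fallback arrangement described above can be sketched as a pair of nested preprocessor guards. This is a user-space approximation, not the kernel header: the types are stand-ins, the `PAGE_SIZE` value is assumed, and `leaf_size_of()` is a hypothetical wrapper added only so the macro can be exercised.

```c
#define PAGE_SIZE 4096UL	/* assumed value for this sketch */

/* Stand-in types; the kernel definitions are architecture specific. */
typedef struct { unsigned long val; } pte_t;
typedef struct { unsigned long val; } pmd_t;

/*
 * An architecture that needs the PMD entry defines __pte_leaf_size()
 * itself.  Otherwise the two-argument form falls back to the
 * one-argument pte_leaf_size(), whose default is only provided here.
 */
#ifndef __pte_leaf_size
#ifndef pte_leaf_size
#define pte_leaf_size(pte) PAGE_SIZE
#endif
#define __pte_leaf_size(pmd, pte) pte_leaf_size(pte)
#endif

/* Hypothetical caller: always goes through the two-argument form. */
unsigned long leaf_size_of(pmd_t pmd, pte_t pte)
{
	(void)pmd;
	(void)pte;
	return __pte_leaf_size(pmd, pte);
}
```

Because callers only use the two-argument form, an architecture that ignores the PMD argument sees no change in behaviour.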
      
      Link: https://lkml.kernel.org/r/c7c008f0a314bf8029ad7288fdc908db1ec7e449.1719928057.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • powerpc/64e: drop unused TLB miss handlers · 0db46aaa
      Michael Ellerman authored
      There are two possibilities for book3e_htw_mode, PPC_HTW_E6500 or
      PPC_HTW_NONE.
      
      The TLB miss handlers are patched to use, respectively:
        - exc_[data|instruction]_tlb_miss_e6500_book3e
        - exc_[data|instruction]_tlb_miss_bolted_book3e
      
      Which means the default handlers are never used.  Remove those, and use
      the bolted handlers (PPC_HTW_NONE) by default.
      
      Link: https://lkml.kernel.org/r/9a670adc1771fb1871fba93ace5372f7eadc286f.1719928057.git.christophe.leroy@csgroup.eu
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • powerpc/64e: consolidate TLB miss handler patching · 264488bf
      Michael Ellerman authored
      The 64e TLB miss handler patching is done in setup_mmu_htw(), and then
      again immediately afterward in early_init_mmu_global().  Consolidate it
      into a single location.
      
      Link: https://lkml.kernel.org/r/7033b37493fb48a3e5245b59d0a42afb75dabfc1.1719928057.git.christophe.leroy@csgroup.eu
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • powerpc/64e: drop MMU_FTR_TYPE_FSL_E checks in 64-bit code · aca69900
      Michael Ellerman authored
      All 64-bit Book3E have MMU_FTR_TYPE_FSL_E, since A2 was removed, so remove
      checks for it in 64-bit only code.
      
      Link: https://lkml.kernel.org/r/2b0b0bc9752e6cece222e4e2050358da70bb631d.1719928057.git.christophe.leroy@csgroup.eu
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • powerpc/64e: drop E500 ifdefs in 64-bit code · ceb9314f
      Michael Ellerman authored
      All 64-bit Book3E have E500=y, so drop the unneeded ifdefs.
      
      Link: https://lkml.kernel.org/r/7fb88809c88a1b774063eda602a9333079403f83.1719928057.git.christophe.leroy@csgroup.eu
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • powerpc/64e: split out nohash Book3E 64-bit code · a898530e
      Michael Ellerman authored
      A reasonable chunk of nohash/tlb.c is 64-bit only code, split it out into
      a separate file.
      
      Link: https://lkml.kernel.org/r/cb2b118f9d8a86f82d01bfb9ad309d1d304480a1.1719928057.git.christophe.leroy@csgroup.eu
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • powerpc/64e: remove unused IBM HTW code · 88715b6e
      Michael Ellerman authored
      Patch series "Reimplement huge pages without hugepd on powerpc (8xx, e500,
      book3s/64)", v7.
      
      Unlike most architectures, powerpc 8xx HW requires a two-level pagetable
      topology for all page sizes.  So a leaf PMD-contig approach is not
      feasible as such.
      
      Possible sizes on 8xx are 4k, 16k, 512k and 8M.
      
      First level (PGD/PMD) covers 4M per entry.  For 8M pages, two PMD entries
      must point to a single entry level-2 page table.  Until now that was done
      using hugepd.  This series changes it to use standard page tables where
      the entry is replicated 1024 times on each of the two pagetables referred
      to by the two associated PMD entries for that 8M page.
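
The figures in the paragraph above follow from simple arithmetic, sketched here under the assumption of 4k base pages (the case in which one level-2 table holds 1024 entries). The macro and function names are illustrative only and do not come from the kernel.

```c
#define SZ_4K (4UL << 10)
#define SZ_4M (4UL << 20)
#define SZ_8M (8UL << 20)

/* Assumptions for this sketch: 4k base pages, 4M covered per PMD entry. */
#define BASE_PAGE_SIZE	SZ_4K
#define PMD_COVERAGE	SZ_4M

/* Number of PTEs in one level-2 page table. */
unsigned long ptes_per_table(void)
{
	return PMD_COVERAGE / BASE_PAGE_SIZE;
}

/* Number of first-level entries needed to span a huge page. */
unsigned long pmd_entries_for(unsigned long huge_size)
{
	return (huge_size + PMD_COVERAGE - 1) / PMD_COVERAGE;
}

/* Total PTE slots filled with the same entry for one 8M page. */
unsigned long replicated_entries_8m(void)
{
	return pmd_entries_for(SZ_8M) * ptes_per_table();
}
```

Under these assumptions an 8M page spans two PMD entries and the same PTE value is written 1024 times in each of the two tables, 2048 slots in total.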
      
      For e500 and book3s/64 there are fewer constraints because it is not tied
      to the HW assisted tablewalk like on 8xx, so it is easier to use leaf PMDs
      (and PUDs).
      
      On e500 the supported page sizes are 4M, 16M, 64M, 256M and 1G.  All at
      PMD level on e500/32 (mpc85xx) and mix of PMD and PUD for e500/64.  We
      encode page size with 4 available bits in PTE entries.  On e500/32 the PGD
      entry size is increased to 64 bits in order to allow leaf-PMD entries
      because PTEs are 64 bits on e500.
      
      On book3s/64 only the hash-4k mode is concerned.  It supports 16M pages as
      cont-PMD and 16G pages as cont-PUD.  In other modes (radix-4k, radix-64k
      and hash-64k) the sizes match with PMD and PUD sizes so that's just leaf
      entries.  The hash processing makes things a bit more complex.  To ease
      things, __hash_page_huge() is modified to bail out when DIRTY or ACCESSED
      bits are missing, leaving it to mm core to fix it.
      
      
      This patch (of 23):
      
      The nohash HTW_IBM (Hardware Table Walk) code is unused since support for
      A2 was removed in commit fb5a5157 ("powerpc: Remove platforms/ wsp and
      associated pieces") (2014).
      
      The remaining supported CPUs use either no HTW (data_tlb_miss_bolted), or
      the e6500 HTW (data_tlb_miss_e6500).
      
      Link: https://lkml.kernel.org/r/cover.1719928057.git.christophe.leroy@csgroup.eu
      Link: https://lkml.kernel.org/r/820dd1385ecc931f07b0d7a0fa827b1613917ab6.1719928057.git.christophe.leroy@csgroup.eu
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zsmalloc: rename class stat mutators · 791abe1e
      Sergey Senozhatsky authored
      A cosmetic change.
      
      o Rename class_stat_inc() and class_stat_dec() to class_stat_add()
        and class_stat_sub() correspondingly. inc/dec are usually associated
        with +1/-1 modifications, while zsmalloc can modify stats by up
        to ->objs_per_zspage. Use add/sub (follow atomics naming).
      
      o Rename zs_stat_get() to class_stat_read()
        get() is usually associated with ref-counting and is paired with put().
        zs_stat_get() simply reads class stat so rename to reflect it.
        (This also follows atomics naming).
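
The naming rationale in the two items above can be sketched with simplified stand-in structures. This is not the zsmalloc code: the struct layout, enum and parameter lists are assumptions. It only shows why add/sub/read describe the operations better than inc/dec/get.

```c
/* Hypothetical, simplified stand-ins; the real zsmalloc structures differ. */
enum class_stat_type {
	OBJS_ALLOCATED,
	OBJS_INUSE,
	NR_CLASS_STAT_TYPES,
};

struct class_stats {
	unsigned long objs[NR_CLASS_STAT_TYPES];
};

/* "add"/"sub": the delta may be larger than one, e.g. a whole zspage. */
void class_stat_add(struct class_stats *s, enum class_stat_type type,
		    unsigned long cnt)
{
	s->objs[type] += cnt;
}

void class_stat_sub(struct class_stats *s, enum class_stat_type type,
		    unsigned long cnt)
{
	s->objs[type] -= cnt;
}

/* "read": a plain load with no reference counting, so there is no put(). */
unsigned long class_stat_read(const struct class_stats *s,
			      enum class_stat_type type)
{
	return s->objs[type];
}
```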
      
      Link: https://lkml.kernel.org/r/20240701031140.3756345-1-senozhatsky@chromium.org
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add docs for per-order mTHP split counters · 9b89e018
      Lance Yang authored
      This commit introduces documentation for mTHP split counters in
      transhuge.rst.
      
      [ioworker0@gmail.com: improve the doc as suggested by Ryan]
        Link: https://lkml.kernel.org/r/20240704012905.42971-3-ioworker0@gmail.com
      [ioworker0@gmail.com: tweak Documentation/admin-guide/mm/transhuge.rst]
        Link: https://lkml.kernel.org/r/20240707013659.1151-1-ioworker0@gmail.com
      Link: https://lkml.kernel.org/r/20240628130750.73097-3-ioworker0@gmail.com
      Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com>
      Signed-off-by: Lance Yang <ioworker0@gmail.com>
      Reviewed-by: Barry Song <baohua@kernel.org>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Bang Li <libang.li@antgroup.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add per-order mTHP split counters · f216c845
      Lance Yang authored
      Patch series "mm: introduce per-order mTHP split counters", v3.
      
      At present, the split counters in THP statistics no longer include
      PTE-mapped mTHP.  Therefore, we want to introduce per-order mTHP split
      counters to monitor the frequency of mTHP splits.  This will assist
      developers in better analyzing and optimizing system performance.
      
      /sys/kernel/mm/transparent_hugepage/hugepages-<size>/stats
              split
              split_failed
              split_deferred
      
      
      This patch (of 2):
      
      Currently, the split counters in THP statistics no longer include
      PTE-mapped mTHP.  Therefore, we propose introducing per-order mTHP split
      counters to monitor the frequency of mTHP splits.  This will help
      developers better analyze and optimize system performance.
      
      /sys/kernel/mm/transparent_hugepage/hugepages-<size>/stats
              split
              split_failed
              split_deferred
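
A user-space reader for these counters might look like the sketch below. The directory layout comes from the commit message; the helper names `mthp_stat_path()` and `mthp_read_stat()` are hypothetical, and the example assumes that `<size>` is expressed in kB (for instance `hugepages-64kB`) and that each file holds a single unsigned integer.

```c
#include <stdio.h>

#define MTHP_SYSFS "/sys/kernel/mm/transparent_hugepage"

/*
 * Build ".../hugepages-<size>kB/stats/<counter>" into buf.
 * Returns 0 on success, -1 if the path did not fit.
 */
int mthp_stat_path(char *buf, size_t len, unsigned int size_kb,
		   const char *counter)
{
	int n = snprintf(buf, len, MTHP_SYSFS "/hugepages-%ukB/stats/%s",
			 size_kb, counter);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read one counter ("split", "split_failed" or "split_deferred").
 * Returns 0 and stores the value in *val, or -1 if the file is
 * missing or unreadable (e.g. on a kernel without these counters).
 */
int mthp_read_stat(unsigned int size_kb, const char *counter,
		   unsigned long long *val)
{
	char path[256];
	FILE *f;
	int ok;

	if (mthp_stat_path(path, sizeof(path), size_kb, counter))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;

	ok = fscanf(f, "%llu", val) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

Sampling a counter periodically and taking the difference between two reads gives a split rate for that order.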
      
      [ioworker0@gmail.com: make things more readable, per Barry and Baolin]
        Link: https://lkml.kernel.org/r/20240704012905.42971-2-ioworker0@gmail.com
      [ioworker0@gmail.com: use == for `order' test, per David]
        Link: https://lkml.kernel.org/r/20240705113119.82210-1-ioworker0@gmail.com
      Link: https://lkml.kernel.org/r/20240704012905.42971-1-ioworker0@gmail.com
      Link: https://lkml.kernel.org/r/20240704012905.42971-2-ioworker0@gmail.com
      Link: https://lkml.kernel.org/r/20240628130750.73097-1-ioworker0@gmail.com
      Link: https://lkml.kernel.org/r/20240628130750.73097-2-ioworker0@gmail.com
      Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com>
      Signed-off-by: Lance Yang <ioworker0@gmail.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: Barry Song <baohua@kernel.org>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Bang Li <libang.li@antgroup.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/zsmalloc: move record_obj() into obj_malloc() · d468f1b8
      Chengming Zhou authored
      We always call record_obj() to make the handle point to the object after
      obj_malloc(), so simplify the code by moving record_obj() into obj_malloc().  There
      should be no functional change.
      
      Link: https://lkml.kernel.org/r/20240627075959.611783-2-chengming.zhou@linux.dev
      Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/zsmalloc: clarify class per-fullness zspage counts · 538148f9
      Chengming Zhou authored
      We always use insert_zspage() and remove_zspage() to update zspage's
      fullness location, which will account correctly.
      
      But this special async free path uses "splice" instead of remove_zspage(),
      so the per-fullness zspage count for ZS_INUSE_RATIO_0 won't decrease.
      
      Clean things up by decreasing it when iterating over the zspage free list.
      
      This doesn't actually fix anything.  ZS_INUSE_RATIO_0 is just a
      "placeholder" which is never used anywhere.
      
      Link: https://lkml.kernel.org/r/20240627075959.611783-1-chengming.zhou@linux.dev
      Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/proc: add PROCMAP_QUERY ioctl tests · 81510a0e
      Andrii Nakryiko authored
      Extend existing proc-pid-vm.c tests with PROCMAP_QUERY ioctl() API.  Test
      a few successful and negative cases, validating that query filtering and
      exact vs next VMA logic work as expected.
      
      Link: https://lkml.kernel.org/r/20240627170900.1672542-7-andrii@kernel.org
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • tools: sync uapi/linux/fs.h header into tools subdir · 77179b6f
      Andrii Nakryiko authored
      We need this UAPI header in tools/include subdirectory for using it from
      BPF selftests.
      
      Link: https://lkml.kernel.org/r/20240627170900.1672542-6-andrii@kernel.org
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • docs/procfs: call out ioctl()-based PROCMAP_QUERY command existence · c10cb914
      Andrii Nakryiko authored
      Call out PROCMAP_QUERY ioctl() existence in the section describing
      /proc/PID/maps file in documentation.  We refer the user to the UAPI header for
      low-level details of this programmatic interface.
      
      Link: https://lkml.kernel.org/r/20240627170900.1672542-5-andrii@kernel.org
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • fs/procfs: add build ID fetching to PROCMAP_QUERY API · bfc69fd0
      Andrii Nakryiko authored
      The need to get ELF build ID reliably is an important aspect when dealing
      with profiling and stack trace symbolization, and /proc/<pid>/maps textual
      representation doesn't help with this.
      
      To get the backing file's ELF build ID, an application has to first resolve
      the VMA, then use its start/end address range to follow a special
      /proc/<pid>/map_files/<start>-<end> symlink to open the ELF file (this is
      necessary because the backing file might have been removed from the disk or
      already replaced with another binary at the same file path).
      
      Such an approach, beyond just adding complexity of having to do a bunch of
      extra work, has extra security implications.  Because application opens
      underlying ELF file and needs read access to its entire contents (as far
      as kernel is concerned), kernel puts additional capable() checks on
      following /proc/<pid>/map_files/<start>-<end> symlink.  And that makes
      sense in general.
      
      But in the case of build ID, profiler/symbolizer doesn't need the contents
      of ELF file, per se.  It's only build ID that is of interest, and ELF
      build ID itself doesn't provide any sensitive information.
      
      So this patch adds a way to request the backing file's ELF build ID along
      with the rest of the VMA information in the same API.  The user controls
      whether this piece of information is requested by setting the
      build_id_size field either to zero or to the non-zero maximum size of the
      buffer they provide through the build_id_addr field (which encodes a user
      pointer as a __u64 field).  This is a completely optional piece of
      information, and so has no performance implications for use cases that
      don't care about build ID, while improving performance and simplifying
      the setup for those applications that do need it.
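
A caller-side sketch of the request described above follows. It assumes a Linux build whose `<linux/fs.h>` already provides `PROCMAP_QUERY` and `struct procmap_query`; the field names `size` and `query_addr` are taken from that UAPI header as I understand it and should be checked against the installed version. `query_build_id()` is a hypothetical helper, and on systems without the definition it simply reports `-ENOSYS`.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#ifdef __linux__
#include <sys/ioctl.h>
#include <linux/fs.h>
#endif

/*
 * Ask for the VMA covering addr together with its backing file's build ID.
 *
 * maps_fd: file descriptor opened on /proc/<pid>/maps
 * buf/len: caller buffer; *len is its capacity on entry and, on success,
 *          the number of build ID bytes reported back by the kernel
 *
 * Returns 0 on success or a negative errno value.
 */
int query_build_id(int maps_fd, unsigned long addr,
		   unsigned char *buf, unsigned int *len)
{
#ifdef PROCMAP_QUERY
	struct procmap_query q;

	memset(&q, 0, sizeof(q));
	q.size = sizeof(q);		/* struct size, for extensibility */
	q.query_addr = addr;
	q.build_id_size = *len;		/* non-zero: build ID is requested */
	q.build_id_addr = (__u64)(uintptr_t)buf;

	if (ioctl(maps_fd, PROCMAP_QUERY, &q) < 0)
		return -errno;

	*len = q.build_id_size;
	return 0;
#else
	(void)maps_fd;
	(void)addr;
	(void)buf;
	(void)len;
	return -ENOSYS;
#endif
}
```

Leaving `build_id_size` at zero would skip the build ID lookup entirely, which is how callers that do not need it avoid the extra work.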
      
      Kernel already implements build ID fetching, which is used from BPF
      subsystem.  We are reusing this code here, but plan follow-up changes to
      make it work better under more relaxed assumption (compared to what
      existing code assumes) of being called from user process context, in which
      page faults are allowed.  BPF-specific implementation currently bails out
      if necessary part of ELF file is not paged in, all due to extra
      BPF-specific restrictions (like the need to fetch build ID in restrictive
      contexts such as NMI handler).
      
      [andrii@kernel.org: fix integer to pointer cast warning in do_procmap_query()]
        Link: https://lkml.kernel.org/r/20240701174805.1897344-1-andrii@kernel.org
      Link: https://lkml.kernel.org/r/20240627170900.1672542-4-andrii@kernel.org
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps · ed5d583a
      Andrii Nakryiko authored
      /proc/<pid>/maps file is extremely useful in practice for various tasks
      involving figuring out process memory layout, what files are backing any
      given memory range, etc.  One important class of applications that
      absolutely rely on this are profilers/stack symbolizers (perf tool being
      one of them).  Patterns of use differ, but they generally would fall into
      two categories.
      
      In on-demand pattern, a profiler/symbolizer would normally capture stack
      trace containing absolute memory addresses of some functions, and would
      then use /proc/<pid>/maps file to find corresponding backing ELF files
      (normally, only executable VMAs are of interest), file offsets within
      them, and then continue from there to get yet more information (ELF
      symbols, DWARF information) to get human-readable symbolic information. 
      This pattern is used by Meta's fleet-wide profiler, as one example.
      
      In preprocessing pattern, application doesn't know the set of addresses of
      interest, so it has to fetch all relevant VMAs (again, probably only
      executable ones), store or cache them, then proceed with profiling and
      stack trace capture.  Once done, it would do symbolization based on stored
      VMA information.  This can happen at much later point in time.  This
      pattern is used by perf tool, as an example.
      
      In either case, there are both performance and correctness requirements
      involved.  This address to VMA information translation has to be done as
      efficiently as possible, but also not miss any VMA (especially in the case
      of loading/unloading shared libraries).  In practice, correctness can't be
      guaranteed (due to process dying before VMA data can be captured, or
      shared library being unloaded, etc), but any effort to maximize the chance
      of finding the VMA is appreciated.
      
      Unfortunately, for all the /proc/<pid>/maps file universality and
      usefulness, it doesn't fit the above use cases 100%.
      
      First, its main purpose is to emit all VMAs sequentially, but in practice
      captured addresses would fall only into a smaller subset of all process'
      VMAs, mainly containing executable text.  Yet, library would need to parse
      most or all of the contents to find needed VMAs, as there is no way to
      skip VMAs that are of no use.  Efficient library can do the linear pass
      and it is still relatively efficient, but it's definitely an overhead that
      can be avoided, if there was a way to do more targeted querying of the
      relevant VMA information.
      
      Second, it's a text based interface, which makes its programmatic use from
      applications and libraries more cumbersome and inefficient due to the need
      to handle text parsing to get necessary pieces of information.  The
      overhead is actually paid both by kernel, formatting originally binary
      VMA data into text, and then by user space application, parsing it back
      into binary data for further use.
      
      For the on-demand pattern of usage described above, another problem when
      writing a generic stack trace symbolization library is an unfortunate
      performance-vs-correctness tradeoff.  The library either has to cache
      the parsed contents of /proc/<pid>/maps (after initial processing) to
      speed up future requests (e.g., when the application asks to symbolize
      another set of addresses for the same process, captured at some later
      time, which is typical for periodic/continuous profiling), or it has to
      re-read and re-parse the file on every request.  In the former case,
      more memory is used for the cache and there is a risk of getting stale
      data if the application loads or unloads shared libraries, or otherwise
      changes its set of VMAs, e.g., through additional mmap() calls.  In the
      latter case, it's the performance hit that comes from re-opening the
      file and re-parsing its contents all over again.
      
      This patch aims to solve this problem by providing a new API built on top
      of /proc/<pid>/maps.  It's meant to address both the non-selectiveness
      and the text nature of /proc/<pid>/maps: it gives the user more control
      over which VMA(s) are queried, and, being a binary interface, it
      eliminates the overhead of text formatting (on the kernel side) and
      parsing (on the user space side).
      
      It's also designed to be extensible and forward/backward compatible by
      including required struct size field, which user has to provide.  We use
      established copy_struct_from_user() approach to handle extensibility.
      
      User has a choice to pick either getting VMA that covers provided address
      or -ENOENT if none is found (exact, least surprising, case).  Or, with an
      extra query flag (PROCMAP_QUERY_COVERING_OR_NEXT_VMA), they can get either
      VMA that covers the address (if there is one), or the closest next VMA
      (i.e., VMA with the smallest vm_start > addr).  The latter allows more
      efficient use, but, given it could be a surprising behavior, requires an
      explicit opt-in.
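      
      The two lookup modes can be modeled in user space as follows.  This is
      only an illustrative sketch over a sorted array of [start, end) ranges;
      struct vma_range and both helper names are hypothetical and this is not
      the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for a VMA: the half-open range [start, end). */
struct vma_range {
	unsigned long start;
	unsigned long end;
};

/*
 * "Covering or next" mode: return the first range whose end lies above addr.
 * That is either the range covering addr or, failing that, the closest range
 * that starts after addr.  Ranges must be sorted and non-overlapping.
 */
static const struct vma_range *
find_covering_or_next(const struct vma_range *v, size_t n, unsigned long addr)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (v[mid].end <= addr)
			lo = mid + 1;	/* range ends at or below addr: skip it */
		else
			hi = mid;
	}
	return lo < n ? &v[lo] : NULL;
}

/* Exact mode: only a range that actually contains addr is acceptable. */
static const struct vma_range *
find_covering(const struct vma_range *v, size_t n, unsigned long addr)
{
	const struct vma_range *r = find_covering_or_next(v, n, addr);

	/* A NULL result here corresponds to the -ENOENT case described above. */
	return (r && r->start <= addr) ? r : NULL;
}
```

      With ranges [0x1000, 0x2000) and [0x5000, 0x6000), an address of 0x3000
      has no covering range, while the covering-or-next lookup yields the
      second range.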
      
      There is another query flag that is useful for some use cases. 
      PROCMAP_QUERY_FILE_BACKED_VMA instructs this API to only return
      file-backed VMAs.  Combining this with PROCMAP_QUERY_COVERING_OR_NEXT_VMA
      makes it possible to efficiently iterate only file-backed VMAs of the
      process, which is what profilers/symbolizers are normally interested in.
      
      All the above querying flags can be combined with an (also optional) set
      of desired VMA permission flags.  This makes it possible, for example, to
      iterate only the executable subset of VMAs, which is what the
      preprocessing pattern used by the perf tool would benefit from, as the
      assumption is that captured stack traces contain addresses of executable
      code.  This saves time by efficiently skipping non-executable VMAs
      altogether.
      
      All these querying flags (modifiers) are orthogonal and can be combined in
      a semantically meaningful and natural way.
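      
      A minimal user-space sketch of combining these flags is shown below.  It
      assumes UAPI headers that already define PROCMAP_QUERY and struct
      procmap_query in <linux/fs.h>; the field names (query_addr, query_flags,
      vma_start, vma_end, vma_name_addr) and PROCMAP_QUERY_VMA_EXECUTABLE are
      written as I understand the merged header, so verify them against your
      own headers.  The helper names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* PROCMAP_QUERY, struct procmap_query (new enough headers) */

/*
 * Ask for the executable, file-backed VMA covering addr, or the next such
 * VMA.  Returns 0 on success and a negative errno value otherwise
 * (-ENOSYS when the headers used at build time predate PROCMAP_QUERY).
 */
static int query_exec_file_vma(int maps_fd, unsigned long addr,
			       unsigned long long *start,
			       unsigned long long *end,
			       char *name, unsigned int name_sz)
{
#ifdef PROCMAP_QUERY
	struct procmap_query q;

	memset(&q, 0, sizeof(q));
	q.size = sizeof(q);	/* required struct size field */
	q.query_addr = addr;
	q.query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA |
			PROCMAP_QUERY_FILE_BACKED_VMA |
			PROCMAP_QUERY_VMA_EXECUTABLE;
	q.vma_name_addr = (unsigned long)name;
	q.vma_name_size = name ? name_sz : 0;	/* zero: skip name retrieval */

	if (ioctl(maps_fd, PROCMAP_QUERY, &q) < 0)
		return -errno;	/* typically -ENOTTY on kernels without this API */

	*start = q.vma_start;
	*end = q.vma_end;
	return 0;
#else
	(void)maps_fd; (void)addr; (void)start; (void)end;
	(void)name; (void)name_sz;
	return -ENOSYS;
#endif
}

/* Convenience wrapper: query the calling process through /proc/self/maps. */
static int query_self(unsigned long addr, unsigned long long *start,
		      unsigned long long *end)
{
	int fd = open("/proc/self/maps", O_RDONLY);
	int rc;

	if (fd < 0)
		return -errno;
	rc = query_exec_file_vma(fd, addr, start, end, NULL, 0);
	close(fd);
	return rc;
}
```

      To iterate all matching VMAs, a caller would presumably repeat the query
      with the address set to the previously returned end until the call
      fails.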
      
      Basing this ioctl()-based API on top of /proc/<pid>/maps's FD makes sense
      given it's querying the same set of VMA data.  It's also beneficial
      because permission checks for /proc/<pid>/maps are performed once, at
      open time, and the actual read of the text contents of /proc/<pid>/maps
      is done without further permission checks.  We piggyback on this pattern
      with the ioctl()-based API as well, as that's a desired property, both
      for performance reasons and for security and flexibility reasons.
      
      Allowing application to open an FD for /proc/self/maps without any extra
      capabilities, and then passing it to some sort of profiling agent through
      Unix-domain socket, would allow such profiling agent to not require some
      of the capabilities that are otherwise expected when opening
      /proc/<pid>/maps file for *another* process.  This is a desirable property
      for some more restricted setups.
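      
      The FD hand-off described above relies on standard SCM_RIGHTS ancillary
      data.  The following is a generic sketch of that mechanism, not code
      from this patch set; send_fd() and recv_fd() are hypothetical helper
      names, and the socket is assumed to be an already-connected Unix-domain
      socket.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Properly aligned buffer for one SCM_RIGHTS control message carrying one FD. */
union fd_cmsg_buf {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send fd over the connected Unix-domain socket sock; 0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
	char byte = 0;	/* at least one byte of regular data must accompany the FD */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg_buf u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	assert(cmsg != NULL);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one FD from sock; returns the new descriptor or -1 on error. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg_buf u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

      In the scenario above, the application would open /proc/self/maps and
      pass the resulting FD to send_fd(), and the profiling agent would call
      recv_fd() and issue its queries on the received descriptor.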
      
      This new ioctl-based implementation doesn't interfere with seq_file-based
      implementation of /proc/<pid>/maps textual interface, and so could be used
      together or independently without paying any price for that.
      
      Note also, that fetching VMA name (e.g., backing file path, or special
      hard-coded or user-provided names) is optional just like build ID.  If
      user sets vma_name_size to zero, kernel code won't attempt to retrieve it,
      saving resources.
      
      Earlier versions of this patch set were adding per-VMA locking, which is
      why we have a code structure that is ready for abstracting mmap_lock vs
      vm_lock differences (query_vma_setup(), query_vma_teardown(), and
      query_vma_find_by_addr()), but given anon_vma_name() is not yet compatible
      with per-VMA locking, initial implementation sticks to using only
      mmap_lock for now.  It will be easy to add back per-VMA locking once all
      the pieces are ready later on.  Which is why we keep existing code
      structure with setup/teardown/query helper functions.
      
      [andrii@kernel.org: improve PROCMAP_QUERY's compat mode handling]
        Link: https://lkml.kernel.org/r/20240701174805.1897344-2-andrii@kernel.org
      Link: https://lkml.kernel.org/r/20240627170900.1672542-3-andrii@kernel.org
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ed5d583a
    • Andrii Nakryiko's avatar
      fs/procfs: extract logic for getting VMA name constituents · acd4b2ec
      Andrii Nakryiko authored
      Patch series "ioctl()-based API to query VMAs from /proc/<pid>/maps", v6.
      
      Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
      applications to query VMA information more efficiently than reading *all*
      VMAs nonselectively through text-based interface of /proc/<pid>/maps file.
      
      Patch #2 goes into a lot of detail and background on some common patterns
      of using /proc/<pid>/maps in the area of performance profiling and
      subsequent symbolization of captured stack traces.  As mentioned in that
      patch, patterns of VMA querying can differ depending on specific use case,
      but can generally be grouped into two main categories: the need to query a
      small subset of VMAs covering a given batch of addresses, or
      reading/storing/caching all (typically, executable) VMAs upfront for later
      processing.
      
      The new PROCMAP_QUERY ioctl() API added in this patch set was motivated by
      the former pattern of usage.  Earlier revisions had a patch adding a tool
      that faithfully reproduces an efficient VMA matching pass of a symbolizer,
      collecting a subset of covering VMAs for a given set of addresses as
      efficiently as possible.  This tool served both as a testing ground, as
      well as a benchmarking tool.  It implements everything both for currently
      existing text-based /proc/<pid>/maps interface, as well as for newly-added
      PROCMAP_QUERY ioctl().  This revision dropped the tool from the patch set
      and, once the API lands upstream, this tool might be added separately on
      GitHub as an example.
      
      Based on discussion on earlier revisions of this patch set, it turned out
      that this ioctl() API is competitive with highly-optimized text-based
      pre-processing pattern that perf tool is using.  Based on perf discussion,
      this revision adds more flexibility in specifying a subset of VMAs that
      are of interest.  Now it's possible to specify desired permissions of VMAs
      (e.g., request only executable ones) and/or restrict to only a subset of
      VMAs that have file backing.  This further improves the efficiency when
      using this new API thanks to more selective (executable VMAs only)
      querying.
      
      In addition to a custom benchmarking tool, and experimental perf
      integration (available at [0]), Daniel Mueller has since also implemented
      an experimental integration into blazesym (see [1]), a library used for
      stack trace symbolization by our server fleet-wide profiler and another
      on-device profiler agent that runs on weaker ARM devices.  The latter
      ARM-based device profiler is especially sensitive to performance, and so
      we benchmarked and compared text-based /proc/<pid>/maps solution to the
      equivalent one using PROCMAP_QUERY ioctl().
      
      Results are very encouraging, giving us a roughly 5x improvement for the
      end-to-end so-called "address normalization" pass, which is the part of
      the symbolization process that happens locally on the ARM device, before
      the data is sent out for further heavier-weight processing on a more
      powerful remote server.  Note that this is not an artificial
      microbenchmark.  It's a full end-to-end API call being measured with
      real-world data on a real-world device.
      
        TEXT-BASED
        ==========
        Benchmarking main/normalize_process_no_build_ids_uncached_maps
        main/normalize_process_no_build_ids_uncached_maps
      	  time:   [49.777 µs 49.982 µs 50.250 µs]
      
        IOCTL-BASED
        ===========
        Benchmarking main/normalize_process_no_build_ids_uncached_maps
        main/normalize_process_no_build_ids_uncached_maps
      	  time:   [10.328 µs 10.391 µs 10.457 µs]
      	  change: [−79.453% −79.304% −79.166%] (p = 0.00 < 0.02)
      	  Performance has improved.
      
      You can see above that we see the drop from 50µs down to 10µs for
      exactly the same amount of work, with the same data and target process.
      
      With the aforementioned custom tool, we see about a 40x improvement (it
      might vary a bit, depending on a specific captured set of addresses).  And
      even for perf-based benchmark it's on par or slightly ahead when using
      permission-based filtering (fetching only executable VMAs).
      
      Earlier revisions attempted to use per-VMA locking, if kernel was compiled
      with CONFIG_PER_VMA_LOCK=y, but it turned out that anon_vma_name() is not
      yet compatible with per-VMA locking and assumes mmap_lock to be taken,
      which makes the use of per-VMA locking for this API premature.  It was
      agreed ([2]) to continue for now with just mmap_lock, but the code
      structure is such that it should be easy to add per-VMA locking support
      once all the pieces are ready.
      
      One thing that did not change was basing this new API as an ioctl()
      command on /proc/<pid>/maps file.  An ioctl-based API on top of pidfd was
      considered, but has its own downsides.  Implementing ioctl() directly on
      pidfd will cause access permission checks on every single ioctl(), which
      leads to performance concerns and potential spam of capable() audit
      messages.  It also prevents a nice pattern, possible with
      /proc/<pid>/maps, in which application opens /proc/self/maps FD (requiring
      no additional capabilities) and passes this FD to profiling agent for
      querying.  To achieve similar pattern, a new file would have to be created
      from pidfd just for VMA querying, which is considered to be inferior to
      just querying /proc/<pid>/maps FD as proposed in current approach.  These
      aspects were discussed in the hallway track at recent LSF/MM/BPF 2024 and
      sticking to procfs ioctl() was the final agreement we arrived at.
      
        [0] https://github.com/anakryiko/linux/commits/procfs-proc-maps-ioctl-v2/
        [1] https://github.com/libbpf/blazesym/pull/675
        [2] https://lore.kernel.org/bpf/7rm3izyq2vjp5evdjc7c6z4crdd3oerpiknumdnmmemwyiwx7t@hleldw7iozi3/
      
      
      This patch (of 6):
      
      Extract generic logic to fetch relevant pieces of data to describe VMA
      name.  This could be just some string (either special constant or
      user-provided), or a string with some formatted wrapping text (e.g.,
      "[anon_shmem:<something>]"), or, commonly, file path.  seq_file-based
      logic has different methods to handle all three cases, but they are
      currently mixed in with extracting underlying sources of data.
      
      This patch splits this into data fetching and data formatting, so that
      data fetching can be reused later on.
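      
      The shape of such a split can be illustrated with a small user-space
      analogy.  Everything in it (struct toy_vma, struct name_parts,
      fetch_name(), format_name()) is hypothetical and only mirrors the idea
      of separating "what is the name made of" from "how is it printed"; it
      is not the kernel code touched by this patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical classification of where a VMA name comes from. */
enum name_kind { NAME_NONE, NAME_PLAIN, NAME_ANON_SHMEM, NAME_PATH };

/* Toy stand-in for the data sources a real VMA would expose. */
struct toy_vma {
	const char *file_path;		/* backing file path, if any */
	const char *anon_shmem_name;	/* user-provided name, if any */
	const char *special_name;	/* special constant such as "[heap]" */
};

/* Raw constituents of a name, with no formatting applied yet. */
struct name_parts {
	enum name_kind kind;
	const char *text;
};

/* Step 1: data fetching only.  Reusable by any consumer. */
static struct name_parts fetch_name(const struct toy_vma *vma)
{
	struct name_parts p = { NAME_NONE, NULL };

	if (vma->file_path) {
		p.kind = NAME_PATH;
		p.text = vma->file_path;
	} else if (vma->anon_shmem_name) {
		p.kind = NAME_ANON_SHMEM;
		p.text = vma->anon_shmem_name;
	} else if (vma->special_name) {
		p.kind = NAME_PLAIN;
		p.text = vma->special_name;
	}
	return p;
}

/* Step 2: data formatting only, here for a text-style consumer. */
static int format_name(const struct name_parts *p, char *buf, size_t sz)
{
	switch (p->kind) {
	case NAME_ANON_SHMEM:
		return snprintf(buf, sz, "[anon_shmem:%s]", p->text);
	case NAME_PLAIN:
	case NAME_PATH:
		return snprintf(buf, sz, "%s", p->text);
	default:
		return snprintf(buf, sz, "%s", "");
	}
}
```

      A text consumer calls both steps, while a binary consumer could call
      only fetch_name() and copy the pieces out in its own layout.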
      
      There should be no functional changes.
      
      Link: https://lkml.kernel.org/r/20240627170900.1672542-1-andrii@kernel.org
      Link: https://lkml.kernel.org/r/20240627170900.1672542-2-andrii@kernel.org
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      acd4b2ec
    • Vivek Kasireddy's avatar
      selftests/udmabuf: add tests to verify data after page migration · 8d42e2a9
      Vivek Kasireddy authored
      Since the memfd pages associated with a udmabuf may be migrated as part of
      udmabuf create, we need to verify the data coherency after successful
      migration.  The new tests added in this patch try to do just that using 4k
      sized pages and also 2 MB sized huge pages for the memfd.
      
      Successful completion of the tests would mean that there is no disconnect
      between the memfd pages and the ones associated with a udmabuf.  And,
      these tests can also be augmented in the future to test newer udmabuf
      features (such as handling memfd hole punch).
      
      The idea for these tests comes from a patch by Mike Kravetz here:
      https://lists.freedesktop.org/archives/dri-devel/2023-June/410623.html
      
      v1->v2: (suggestions from Shuah)
      - Use ksft_* functions to print and capture results of tests
      - Use appropriate KSFT_* status codes for exit()
      - Add Mike Kravetz's suggested-by tag
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-10-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8d42e2a9
    • Vivek Kasireddy's avatar
      udmabuf: pin the pages using memfd_pin_folios() API · c6a3194c
      Vivek Kasireddy authored
      Using memfd_pin_folios() will ensure that the pages are pinned
      correctly using FOLL_PIN. And, this also ensures that we don't
      accidentally break features such as memory hotunplug as it would
      not allow pinning pages in the movable zone.
      
      Using this new API also simplifies the code as we no longer have
      to deal with extracting individual pages from their mappings or
      handle shmem and hugetlb cases separately.
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-9-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c6a3194c
    • Vivek Kasireddy's avatar
      udmabuf: convert udmabuf driver to use folios · 5e72b2b4
      Vivek Kasireddy authored
      This is mainly a preparatory patch to use memfd_pin_folios() API for
      pinning folios.  Using folios instead of pages makes sense as the udmabuf
      driver needs to handle both shmem and hugetlb cases.  And, using the
      memfd_pin_folios() API makes this easier as we no longer need to
      separately handle shmem vs hugetlb cases in the udmabuf driver.
      
      Note that, the function vmap_udmabuf() still needs a list of pages; so, we
      collect all the head pages into a local array in this case.
      
      Other changes in this patch include the addition of helpers for checking
      the memfd seals and exporting dmabuf.  Moving code from udmabuf_create()
      into these helpers improves readability given that udmabuf_create() is a
      bit long.
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-8-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5e72b2b4
    • Vivek Kasireddy's avatar
      udmabuf: add back support for mapping hugetlb pages · 0c8b91ef
      Vivek Kasireddy authored
      A user or admin can configure a VMM (Qemu) Guest's memory to be backed by
      hugetlb pages for various reasons.  However, a Guest OS would still
      allocate (and pin) buffers that are backed by regular 4k sized pages.  In
      order to map these buffers and create dma-bufs for them on the Host, we
      first need to find the hugetlb pages where the buffer allocations are
      located and then determine the offsets of individual chunks (within those
      pages) and use this information to eventually populate a scatterlist.
      
      Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500 options
      were passed to the Host kernel and Qemu was launched with these
      relevant options: qemu-system-x86_64 -m 4096m....
      -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
      -display gtk,gl=on
      -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
      -machine memory-backend=mem1
      
      Replacing -display gtk,gl=on with -display gtk,gl=off above would
      exercise the mmap handler.
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-7-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com> (v2)
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0c8b91ef
    • Vivek Kasireddy's avatar
      udmabuf: use vmf_insert_pfn and VM_PFNMAP for handling mmap · 7d79cd78
      Vivek Kasireddy authored
      Add VM_PFNMAP to vm_flags in the mmap handler to ensure that the mappings
      would be managed without using struct page.
      
      And, in the vm_fault handler, use vmf_insert_pfn to share the page's pfn
      to userspace instead of directly sharing the page (via struct page *).
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-6-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7d79cd78
    • Arnd Bergmann's avatar
      udmabuf: add CONFIG_MMU dependency · 725553d2
      Arnd Bergmann authored
      There is no !CONFIG_MMU version of vmf_insert_pfn():
      
      arm-linux-gnueabi-ld: drivers/dma-buf/udmabuf.o: in function `udmabuf_vm_fault':
      udmabuf.c:(.text+0xaa): undefined reference to `vmf_insert_pfn'
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-5-vivek.kasireddy@intel.com
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dave Airlie <airlied@redhat.com>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      725553d2
    • Vivek Kasireddy's avatar
      mm/gup: introduce memfd_pin_folios() for pinning memfd folios · 89c1905d
      Vivek Kasireddy authored
      For drivers that would like to longterm-pin the folios associated with a
      memfd, the memfd_pin_folios() API provides an option to not only pin the
      folios via FOLL_PIN but also to check and migrate them if they reside in
      movable zone or CMA block.  This API currently works with memfds but it
      should work with any files that belong to either shmemfs or hugetlbfs. 
      Files belonging to other filesystems are rejected for now.
      
      The folios need to be located first before pinning them via FOLL_PIN.  If
      they are found in the page cache, they can be immediately pinned. 
      Otherwise, they need to be allocated using the filesystem specific APIs
      and then pinned.
      
      [akpm@linux-foundation.org: improve the CONFIG_MMU=n situation, per SeongJae]
      [vivek.kasireddy@intel.com: return -EINVAL if the end offset is greater than the size of memfd]
        Link: https://lkml.kernel.org/r/IA0PR11MB71850525CBC7D541CAB45DF1F8DB2@IA0PR11MB7185.namprd11.prod.outlook.com
      Link: https://lkml.kernel.org/r/20240624063952.1572359-4-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> (v2)
      Reviewed-by: David Hildenbrand <david@redhat.com> (v3)
      Reviewed-by: Christoph Hellwig <hch@lst.de> (v6)
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      89c1905d
    • Vivek Kasireddy's avatar
      mm/gup: introduce check_and_migrate_movable_folios() · 53ba78de
      Vivek Kasireddy authored
      This helper is the folio equivalent of check_and_migrate_movable_pages(). 
      Therefore, all the rules that apply to check_and_migrate_movable_pages()
      also apply to this one as well.  Currently, this helper is only used by
      memfd_pin_folios().
      
      This patch also includes changes to rename and convert the internal
      functions collect_longterm_unpinnable_pages() and
      migrate_longterm_unpinnable_pages() to work on folios.  As a result,
      check_and_migrate_movable_pages() is now a wrapper around
      check_and_migrate_movable_folios().
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-3-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      53ba78de
    • Vivek Kasireddy's avatar
      mm/gup: introduce unpin_folio/unpin_folios helpers · 6cc04054
      Vivek Kasireddy authored
      Patch series "mm/gup: Introduce memfd_pin_folios() for pinning memfd
      folios", v16.
      
      Currently, some drivers (e.g., Udmabuf) that want to longterm-pin the
      pages/folios associated with a memfd, do so by simply taking a reference
      on them.  This is not desirable because the pages/folios may reside in
      Movable zone or CMA block.
      
      Therefore, having drivers use memfd_pin_folios() API ensures that the
      folios are appropriately pinned via FOLL_PIN for longterm DMA.
      
      This patchset also introduces a few helpers and converts the Udmabuf
      driver to use folios and memfd_pin_folios() API to longterm-pin the folios
      for DMA.  Two new Udmabuf selftests are also included to test the driver
      and the new API.
      
      
      This patch (of 9):
      
      These helpers are the folio versions of unpin_user_page/unpin_user_pages. 
      They are currently only useful for unpinning folios pinned by
      memfd_pin_folios() or other associated routines.  However, they could find
      new uses in the future, when more and more folio-only helpers are added to
      GUP.
      
      We should probably sanity check the folio as part of unpin similar to how
      it is done in unpin_user_page/unpin_user_pages but we cannot cleanly do
      that at the moment without also checking the subpage.  Therefore, sanity
      checking needs to be added to these routines once we have a way to
      determine if any given folio is anon-exclusive (via a per folio
      AnonExclusive flag).
      
      Link: https://lkml.kernel.org/r/20240624063952.1572359-1-vivek.kasireddy@intel.com
      Link: https://lkml.kernel.org/r/20240624063952.1572359-2-vivek.kasireddy@intel.com
      Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Dave Airlie <airlied@redhat.com>
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6cc04054
    • Chengming Zhou's avatar
      mm/zswap: use only one pool in zswap · 8edc9c4e
      Chengming Zhou authored
      Zswap uses 32 pools to work around the locking scalability problem in
      zswap backends (mainly zsmalloc nowadays), which brings its own problems
      like memory waste and more memory fragmentation.
      
      Testing results show that we can get nearly the same performance with only
      one pool in zswap after changing zsmalloc to use a per-size_class lock
      instead of the pool spinlock.
      
      Testing kernel build (make bzImage -j32) on tmpfs with memory.max=1GB, and
      zswap shrinker enabled with 10GB swapfile on ext4.
      
                                      real    user    sys
      6.10.0-rc3                      138.18  1241.38 1452.73
      6.10.0-rc3-onepool              149.45  1240.45 1844.69
      6.10.0-rc3-onepool-perclass     138.23  1242.37 1469.71
      
      The same test using zbud shows slightly worse performance, as expected,
      since no locking optimization is done for zbud.  I think that is
      acceptable since zsmalloc has become much more popular than the other
      backends, and we may want to support only zsmalloc in the future.
      
                                      real    user    sys
      6.10.0-rc3-zbud                 138.23  1239.58 1430.09
      6.10.0-rc3-onepool-zbud         139.64  1241.37 1516.59
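
      To put the "sys" columns above in relative terms, the following C sketch
      computes the percentage change of each single-pool configuration against
      its 32-pool baseline.  The numbers are copied from the tables; the helper
      names pct_change(), report() and report_all() are made up for this
      illustration and are not part of the patch.

      ```c
      #include <assert.h>
      #include <stdio.h>

      /* Relative change of `value` against `baseline`, in percent. */
      double pct_change(double baseline, double value)
      {
      	assert(baseline > 0.0);
      	return (value - baseline) / baseline * 100.0;
      }

      /* Print the change in "sys" time of one configuration. */
      void report(const char *name, double baseline, double value)
      {
      	printf("%-32s %+6.2f%%\n", name, pct_change(baseline, value));
      }

      /* "sys" values taken from the two tables in this commit message. */
      void report_all(void)
      {
      	const double zsmalloc_32pools = 1452.73;	/* 6.10.0-rc3 */
      	const double zbud_32pools = 1430.09;		/* 6.10.0-rc3-zbud */

      	report("6.10.0-rc3-onepool", zsmalloc_32pools, 1844.69);
      	report("6.10.0-rc3-onepool-perclass", zsmalloc_32pools, 1469.71);
      	report("6.10.0-rc3-onepool-zbud", zbud_32pools, 1516.59);
      }
      ```

      Calling report_all() from a main() prints one line per configuration.
      With these inputs a single pool costs roughly 27% more sys time on
      zsmalloc with the pool spinlock, about 1% more with the per-size_class
      lock, and about 6% more on zbud.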
      
      [chengming.zhou@linux.dev: fix error handling in zswap_pool_create(), per Dan Carpenter]
        Link: https://lkml.kernel.org/r/20240621-zsmalloc-lock-mm-everything-v2-2-d30e9cd2b793@linux.dev
      [chengming.zhou@linux.dev: fix error handling again in zswap_pool_create(), per Yosry]
        Link: https://lkml.kernel.org/r/20240625-zsmalloc-lock-mm-everything-v3-2-ad941699cb61@linux.dev
      Link: https://lkml.kernel.org/r/20240617-zsmalloc-lock-mm-everything-v1-2-5e5081ea11b3@linux.dev
      Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
      Reviewed-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Cc: Chengming Zhou <zhouchengming@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/zsmalloc: change back to per-size_class lock · 64bd0197
      Chengming Zhou authored
      Patch series "mm/zsmalloc: change back to per-size_class lock, v2".
      
      Commit c0547d0b ("zsmalloc: consolidate zs_pool's migrate_lock and
      size_class's locks") replaced the per-size_class lock with a pool spinlock
      to prepare for reclaim support in zsmalloc.  Reclaim support in zsmalloc
      was later dropped in favor of LRU reclaim in zswap, but the locking change
      was left in place.
      
      The pool spinlock scales worse than the per-size_class lock.  Zswap works
      around this by using 32 pools, which brings its own problems such as
      memory waste and more memory fragmentation.
      
      So this series changes back to the per-size_class lock and uses test data
      from a heavily stressed setup to verify that zswap can use only one pool.
      Note that we only test and care about the zsmalloc backend, which makes
      sense now that zsmalloc has become much more popular than the other
      backends.
      
      Test setup: kernel build (make bzImage -j32) on tmpfs with memory.max=1GB
      and the zswap shrinker enabled, with a 10GB swapfile on ext4.
      
                                      real    user    sys
      6.10.0-rc3                      138.18  1241.38 1452.73
      6.10.0-rc3-onepool              149.45  1240.45 1844.69
      6.10.0-rc3-onepool-perclass     138.23  1242.37 1469.71
      
      The "sys" column shows that per-size_class locking with only one pool in
      zswap performs close to the current 32 pools.
      
      
      This patch (of 2):
      
      This patch is almost a revert of commit c0547d0b ("zsmalloc: consolidate
      zs_pool's migrate_lock and size_class's locks"), which switched to a
      global pool->lock in place of the per-size_class lock and
      pool->migrate_lock in preparation for supporting reclaim in zsmalloc.
      Reclaim in zsmalloc was later dropped in favor of LRU reclaim in zswap.
      
      In theory, the per-size_class lock is more fine-grained than pool->lock,
      since a pool can have many size_classes.  As for the additional
      pool->migrate_lock, only free() and map() need to take it, and only in
      read mode, to get a stable handle from which to look up the zspage.
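
      The locking scheme described above can be sketched in userspace C with
      pthreads.  This is only an analogy under stated assumptions, not the
      zsmalloc code: the struct layout, NR_CLASSES, the function names and the
      objs_in_use counter are all hypothetical, and a plain array index stands
      in for the handle-to-zspage lookup.

      ```c
      #include <assert.h>
      #include <pthread.h>
      #include <stddef.h>

      #define NR_CLASSES 4	/* hypothetical; a real pool has many more */

      struct size_class {
      	pthread_mutex_t lock;	/* protects only this class */
      	long objs_in_use;
      };

      struct pool {
      	pthread_rwlock_t migrate_lock;	/* writers would be migration paths */
      	struct size_class classes[NR_CLASSES];
      };

      void pool_init(struct pool *p)
      {
      	pthread_rwlock_init(&p->migrate_lock, NULL);
      	for (size_t i = 0; i < NR_CLASSES; i++) {
      		pthread_mutex_init(&p->classes[i].lock, NULL);
      		p->classes[i].objs_in_use = 0;
      	}
      }

      /* Allocation takes only its own class lock; other classes stay free. */
      void pool_alloc(struct pool *p, size_t class_idx)
      {
      	struct size_class *c;

      	assert(class_idx < NR_CLASSES);
      	c = &p->classes[class_idx];
      	pthread_mutex_lock(&c->lock);
      	c->objs_in_use++;
      	pthread_mutex_unlock(&c->lock);
      }

      /*
       * Free holds migrate_lock in read mode only while it resolves which
       * class the object belongs to, then continues under the class lock.
       */
      void pool_free(struct pool *p, size_t class_idx)
      {
      	struct size_class *c;

      	assert(class_idx < NR_CLASSES);
      	pthread_rwlock_rdlock(&p->migrate_lock);
      	c = &p->classes[class_idx];	/* stand-in for the handle lookup */
      	pthread_mutex_lock(&c->lock);
      	pthread_rwlock_unlock(&p->migrate_lock);
      	c->objs_in_use--;
      	pthread_mutex_unlock(&c->lock);
      }

      /* Read one class counter under that class's lock. */
      long pool_in_use(struct pool *p, size_t class_idx)
      {
      	long n;

      	assert(class_idx < NR_CLASSES);
      	pthread_mutex_lock(&p->classes[class_idx].lock);
      	n = p->classes[class_idx].objs_in_use;
      	pthread_mutex_unlock(&p->classes[class_idx].lock);
      	return n;
      }
      ```

      Two threads working on different classes contend only on the shared
      read lock, which readers do not block each other on, whereas a single
      pool-wide lock would serialize them.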
      
      Link: https://lkml.kernel.org/r/20240625-zsmalloc-lock-mm-everything-v3-0-ad941699cb61@linux.dev
      Link: https://lkml.kernel.org/r/20240621-zsmalloc-lock-mm-everything-v2-0-d30e9cd2b793@linux.dev
      Link: https://lkml.kernel.org/r/20240617-zsmalloc-lock-mm-everything-v1-0-5e5081ea11b3@linux.dev
      Link: https://lkml.kernel.org/r/20240617-zsmalloc-lock-mm-everything-v1-1-5e5081ea11b3@linux.dev
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Chengming Zhou <chengming.zhou@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hugetlb.c: undo errant change · 998d4e2c
      Andrew Morton authored
      During conflict resolution a line was unintentionally removed by a ksm.c
      patch.
      
      Link: https://lkml.kernel.org/r/85b0d694-d1ac-8e7a-2e50-1edc03eee21a@google.com
      Fixes: ac90c56b ("mm/ksm: refactor out try_to_merge_with_zero_page()")
      Reported-by: Hugh Dickins <hughd@google.com>
      Cc: Aristeu Rozanski <aris@redhat.com>
      Cc: Chengming Zhou <chengming.zhou@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 10 Jul, 2024 7 commits