    arm64/mm: make set_ptes() robust when OAs cross 48-bit boundary · 6e8f5887
    Ryan Roberts authored
    Patch series "mm/memory: optimize fork() with PTE-mapped THP", v3.
    
    Now that the rmap overhaul[1] is upstream that provides a clean interface
    for rmap batching, let's implement PTE batching during fork when
    processing PTE-mapped THPs.
    
    This series is partially based on Ryan's previous work[2] to implement
    cont-pte support on arm64, but it's a complete rewrite based on [1] to
    optimize all architectures independent of any such PTE bits, and to use
    the new rmap batching functions that simplify the code and prepare for
    further rmap accounting changes.
    
    We collect consecutive PTEs that map consecutive pages of the same large
    folio, making sure that the other PTE bits are compatible, and (a) adjust
    the refcount only once per batch, (b) call rmap handling functions only
    once per batch and (c) perform batch PTE setting/updates.
    
    While this series should be beneficial for adding cont-pte support on
    ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
    for large folios with minimal added overhead and further changes[4] that
    build up on top of the total mapcount.
    
    Independent of all that, this series results in a speedup during fork with
    PTE-mapped THP, which is the default with THPs that are smaller than a PMD
    (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).
    
    On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
    of the same size (stddev < 1%) results in the following runtimes for
    fork() (shorter is better):
    
    Folio Size | v6.8-rc1 |      New | Change
    ------------------------------------------
          4KiB | 0.014328 | 0.014035 |   - 2%
         16KiB | 0.014263 | 0.01196  |   -16%
         32KiB | 0.014334 | 0.01094  |   -24%
         64KiB | 0.014046 | 0.010444 |   -26%
        128KiB | 0.014011 | 0.010063 |   -28%
        256KiB | 0.013993 | 0.009938 |   -29%
        512KiB | 0.013983 | 0.00985  |   -30%
       1024KiB | 0.013986 | 0.00982  |   -30%
       2048KiB | 0.014305 | 0.010076 |   -30%
    
    Note that these numbers are even better than the ones from v1 (verified
    over multiple reboots), even though there were only minimal code changes. 
    Well, I removed a pte_mkclean() call for anon folios, maybe that also
    plays a role.
    
    But my experience is that fork() is extremely sensitive to code size,
    inlining, ...  so I suspect that on other architectures we'll rather see
    a change closer to -20% than -30%, and it will be easy to "lose" some of
    that speedup through subtle code changes in the future.
    
    Next up is PTE batching when unmapping.  Only tested on x86-64. 
    Compile-tested on most other architectures.
    
    [1] https://lkml.kernel.org/r/20231220224504.646757-1-david@redhat.com
    [2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.roberts@arm.com
    [3] https://lkml.kernel.org/r/20230809083256.699513-1-david@redhat.com
    [4] https://lkml.kernel.org/r/20231124132626.235350-1-david@redhat.com
    [5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.roberts@arm.com
    
    
    This patch (of 15):
    
    Since the high bits [51:48] of an OA are not stored contiguously in the
    PTE, there is a theoretical bug in set_ptes(), which just adds PAGE_SIZE
    to the pte to get the pte with the next pfn.  This works until the pfn
    crosses the 48-bit boundary, at which point we overflow into the upper
    attributes.
    
    Of course one could argue (and Matthew Wilcox has :) that we will never
    see a folio cross this boundary because we only allow naturally aligned
    power-of-2 allocation, so this would require a half-petabyte folio.  So
    it's only a theoretical bug.  But it's better that the code is robust
    regardless.
    
    I've implemented pte_next_pfn() as part of the fix, which is an opt-in
    core-mm interface.  So that is now available to the core-mm, which will be
    needed shortly to support forthcoming fork()-batching optimizations.
    
    Link: https://lkml.kernel.org/r/20240129124649.189745-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20240125173534.1659317-1-ryan.roberts@arm.com
    Link: https://lkml.kernel.org/r/20240129124649.189745-2-david@redhat.com
    Fixes: 4a169d61 ("arm64: implement the new page table range API")
    Closes: https://lore.kernel.org/linux-mm/fdaeb9a5-d890-499a-92c8-d171df43ad01@arm.com/
    Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Tested-by: Ryan Roberts <ryan.roberts@arm.com>
    Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Albert Ou <aou@eecs.berkeley.edu>
    Cc: Alexander Gordeev <agordeev@linux.ibm.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: David S. Miller <davem@davemloft.net>
    Cc: Dinh Nguyen <dinguyen@kernel.org>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Palmer Dabbelt <palmer@dabbelt.com>
    Cc: Paul Walmsley <paul.walmsley@sifive.com>
    Cc: Russell King (Oracle) <linux@armlinux.org.uk>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>