    mm: batch unlink_file_vma calls in free_pgd_range · 3577dbb1
    Mateusz Guzik authored
    Execs of dynamically linked binaries at 20-ish cores are bottlenecked on
    the i_mmap_rwsem semaphore. The single biggest contributor is
    free_pgd_range, which acquires the lock back-to-back for every
    consecutive mapping of a given file.
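
A minimal user-space sketch of the batching idea, not the kernel code itself: consecutive mappings of the same file are queued, and the per-file lock is taken once per queue instead of once per mapping. All names here (`file_lock_model`, `mapping_model`, `unlink_batch_model`, `BATCH_MAX` and the helper functions) are hypothetical, the lock is modeled as a plain counter, and the batch capacity of 8 is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_MAX 8 /* assumed capacity, chosen for illustration */

/* Hypothetical stand-in for a per-file lock; it only counts acquisitions. */
struct file_lock_model {
    unsigned long acquires;
};

/* Hypothetical stand-in for a mapping; it only records its backing file. */
struct mapping_model {
    struct file_lock_model *file; /* NULL models an anonymous mapping */
};

struct unlink_batch_model {
    int count;
    struct mapping_model *maps[BATCH_MAX];
};

static void batch_init(struct unlink_batch_model *b)
{
    b->count = 0;
}

/* Take the lock once and handle everything queued so far. */
static void batch_process(struct unlink_batch_model *b)
{
    if (b->count == 0)
        return;
    b->maps[0]->file->acquires++; /* one acquire covers the whole batch */
    /* ... each queued mapping would be unlinked here, lock held ... */
    b->count = 0;
}

/* Queue a mapping, flushing first if the file changes or the batch is full. */
static void batch_add(struct unlink_batch_model *b, struct mapping_model *m)
{
    if (m->file == NULL)
        return; /* nothing to unlink for anonymous mappings */
    if (b->count > 0 &&
        (b->maps[0]->file != m->file || b->count == BATCH_MAX))
        batch_process(b);
    assert(b->count < BATCH_MAX);
    b->maps[b->count++] = m;
}

static void batch_final(struct unlink_batch_model *b)
{
    batch_process(b);
}

/* Baseline: one lock acquire per file-backed mapping. */
static void unlink_each(struct mapping_model *maps, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (maps[i].file != NULL)
            maps[i].file->acquires++;
}

/* Batched: one lock acquire per run of mappings sharing a file. */
static void unlink_batched(struct mapping_model *maps, size_t n)
{
    struct unlink_batch_model b;

    batch_init(&b);
    for (size_t i = 0; i < n; i++)
        batch_add(&b, &maps[i]);
    batch_final(&b);
}
```

In this model, five consecutive mappings of one file followed by two of another cost seven acquires through `unlink_each` and two through `unlink_batched`.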
    
    Tracing the count of said acquires while building the kernel shows:
    [1, 2)     799579 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [2, 3)          0 |                                                    |
    [3, 4)       3009 |                                                    |
    [4, 5)       3009 |                                                    |
    [5, 6)     326442 |@@@@@@@@@@@@@@@@@@@@@                               |
    
    So in particular there were 326442 opportunities to coalesce 5 acquires
    into 1.
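
As a rough worked example, the histogram rows can be turned into total acquire counts. This sketch assumes that each row counts batches of the given size, that a batch costs one lock acquire with batching, and that it would cost one acquire per mapping without it. The `bucket` struct and both helpers are hypothetical names, and only the rows shown above are included.

```c
#include <assert.h>
#include <stddef.h>

/* One histogram row: `batches` batches that each held `size` mappings. */
struct bucket {
    unsigned long size;
    unsigned long batches;
};

/* Values copied from the histogram rows shown above. */
static const struct bucket kernel_build_hist[] = {
    { 1, 799579 }, { 2, 0 }, { 3, 3009 }, { 4, 3009 }, { 5, 326442 },
};

/* Acquires if every mapping takes the lock on its own. */
static unsigned long acquires_unbatched(const struct bucket *h, size_t n)
{
    unsigned long total = 0;

    assert(h != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += h[i].size * h[i].batches;
    return total;
}

/* Acquires if every batch takes the lock exactly once. */
static unsigned long acquires_batched(const struct bucket *h, size_t n)
{
    unsigned long total = 0;

    assert(h != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += h[i].batches;
    return total;
}
```

Under these assumptions the rows above come to 2452852 acquires unbatched versus 1132039 batched, so a bit more than half of the acquires on this path go away.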
    
    Doing so increases execs per second by 4% (~50k to ~52k) when running
    the benchmark linked below.
    
    The lock remains the main bottleneck; I have not looked at other spots
    yet.
    
    The benchmark can be found here:
    http://apollo.backplane.com/DFlyMisc/doexec.c
    
    $ cc -O2 -o shared-doexec doexec.c
    $ ./shared-doexec $(nproc)
    
    Note this particular test makes sure binaries are separate, but the
    loader is shared.
    
    Stats collected on the patched kernel (+ "noinline") with:
    bpftrace -e 'kprobe:unlink_file_vma_batch_process
    { @ = lhist(((struct unlink_vma_file_batch *)arg0)->count, 0, 8, 1); }'
    
    Link: https://lkml.kernel.org/r/20240521234321.359501-1-mjguzik@gmail.com
    Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>