-
Kirill Smelkov authored
Benchmark the time it takes for virtmem to handle pagefault with noop loadblk for loadblk both implemented in C and in Python. On my computer it is: name µs/op PagefaultC 269 ± 0% pagefault_py 291 ± 0% Quite a big time in other words. It turned out to be mostly spent in fallocate'ing pages on tmpfs from /dev/shm. Part of the above 269 µs/op is taken by freeing (reclaiming) pages back when benchmarking work size exceed /dev/shm size, and part to allocating. If I limit the work size (via npage in benchmem.c) to be less than whole /dev/shm it starts to be ~ 170 µs/op and with additional tracing it shows as something like this: .. on_pagefault_start 0.954 µs .. vma_on_pagefault_pre 0.954 µs .. ramh_alloc_page_pre 0.954 µs .. ramh_alloc_page 169.992 µs .. vma_on_pagefault 172.853 µs .. vma_on_pagefault_pre 172.853 µs .. vma_on_pagefault 174.046 µs .. on_pagefault_end 174.046 µs .. whole: 171.900 µs so almost all time is spent in ramh_alloc_page which is doing the fallocate: https://lab.nexedi.com/nexedi/wendelin.core/blob/f11386a4/bigfile/ram_shmfs.c#L125 Simple benchmark[1] confirmed it is indeed the case for fallocate(tmpfs) to be relatively slow[2] (and that for recent kernels it regressed somewhat compared to Linux 3.16). Profile flamegraph for that benchmark[3] shows internal loading of shmem_fallocate which for 1 hardware page is not that too slow (e.g. <1µs) but when a request comes for a region internally performs it page by page and so accumulates that ~ 170µs for 2M. I've tried to briefly rerun the benchmark with huge pages activated on /dev/shm via mount /dev/shm -o huge=always,remount as both regular user and as root but it was executing several times slower. Probably something to investigate more later. [1] https://lab.nexedi.com/kirr/misc/blob/4f84a06e/tmpfs/t_fallocate.c [2] https://lab.nexedi.com/kirr/misc/blob/4f84a06e/tmpfs/1.txt [3] https://lab.nexedi.com/kirr/misc/raw/4f84a06e/tmpfs/fallocate-2M-nohuge.svg
3cfc2728