    dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc() · 528b16bf
    Arne Welzel authored
    On systems with many cores using dm-crypt, heavy spinlock contention in
    percpu_counter_compare() can be observed when the page allocation limit
    for a given device is reached or nearly reached. This is due to
    percpu_counter_compare() taking a spinlock to compute an exact
    result on potentially many CPUs at the same time.
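
    For context, __percpu_counter_compare() only trusts the cheap approximate
    counter value while it is far from the compared limit; once the counter
    gets within batch * num_online_cpus() of that limit, it falls back to an
    exact per-CPU sum, which takes the counter's spinlock and walks every CPU.
    A simplified sketch of that fallback, paraphrased from lib/percpu_counter.c
    rather than copied verbatim:

        int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
        {
            s64 count = percpu_counter_read(fbc);    /* cheap, approximate read */

            /* Far away from rhs: the rough count is good enough. */
            if (abs(count - rhs) > (batch * num_online_cpus()))
                return count > rhs ? 1 : -1;

            /*
             * Close to rhs: compute an exact sum, which takes fbc->lock and
             * iterates over all CPUs -- the contended path in the profile below.
             */
            count = percpu_counter_sum(fbc);
            if (count > rhs)
                return 1;
            if (count < rhs)
                return -1;
            return 0;
        }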
    
    Switch to non-exact comparison of allocated and allowed pages by using
    the value returned by percpu_counter_read_positive() to avoid taking
    the percpu_counter spinlock.
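
    The resulting change in crypt_page_alloc() amounts to roughly the following
    before/after of the limit check (a sketch rather than the literal diff;
    cc->n_allocated_pages and dm_crypt_pages_per_client are the existing
    per-client page counter and limit in dm-crypt.c):

        /* Before: exact comparison, may fall back to the spinlocked sum. */
        if (unlikely(percpu_counter_compare(&cc->n_allocated_pages,
                                            dm_crypt_pages_per_client) >= 0) &&
            likely(gfp_mask & __GFP_DIRECT_RECLAIM))
            return NULL;

        /* After: lockless, approximate read of the counter. */
        if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) >=
                     dm_crypt_pages_per_client) &&
            likely(gfp_mask & __GFP_DIRECT_RECLAIM))
            return NULL;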
    
    This may over- or under-estimate the actual number of allocated pages by
    at most (batch - 1) * num_online_cpus().
    
    Currently, batch is bounded by 32. The system on which this issue was
    first observed has 256 CPUs and 512GB of RAM. With a 4k page size, this
    change may over- or under-estimate by up to 31MB. With ~10GB (2% of RAM)
    allowed for dm-crypt allocations, this seems an acceptable error, and is
    certainly preferable to running into the spinlock contention.
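
    As a back-of-the-envelope check of the figures above (a standalone sketch
    using the numbers quoted in this message, not values read from a live
    system):

        #include <stdio.h>

        int main(void)
        {
            long batch = 32;        /* percpu_counter batch bound quoted above */
            long cpus = 256;        /* CPUs on the system described above */
            long page_size = 4096;  /* 4k page size */

            long max_err_pages = (batch - 1) * cpus;         /* 7936 pages */
            long max_err_bytes = max_err_pages * page_size;  /* 31 MiB exactly */

            printf("max estimation error: %ld pages (~%ld MiB)\n",
                   max_err_pages, max_err_bytes >> 20);
            return 0;
        }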
    
    This behavior was reproduced on an EC2 c5.24xlarge instance with 96 CPUs
    and 192GB RAM as follows, but can be provoked on systems with fewer CPUs
    as well.
    
     * Disable swap
     * Tune vm settings to promote regular writeback
         $ echo 50 > /proc/sys/vm/dirty_expire_centisecs
         $ echo 25 > /proc/sys/vm/dirty_writeback_centisecs
         $ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes
    
     * Create 8 dm-crypt devices backed by files on a tmpfs
     * Create and mount an ext4 filesystem on each crypt device
     * Run stress-ng --hdd 8 within one of the above filesystems
    
    Total %system usage collected from sysstat rises to ~35%. Write throughput
    on the underlying loop device is ~2GB/s. perf profiling of an individual
    kworker kcryptd thread shows the following profile, indicating spinlock
    contention in percpu_counter_compare():
    
        99.98%     0.00%  kworker/u193:46  [kernel.kallsyms]  [k] ret_from_fork
          |
          --ret_from_fork
            kthread
            worker_thread
            |
             --99.92%--process_one_work
                |
                |--80.52%--kcryptd_crypt
                |    |
                |    |--62.58%--mempool_alloc
                |    |  |
                |    |   --62.24%--crypt_page_alloc
                |    |     |
                |    |      --61.51%--__percpu_counter_compare
                |    |        |
                |    |         --61.34%--__percpu_counter_sum
                |    |           |
                |    |           |--58.68%--_raw_spin_lock_irqsave
                |    |           |  |
                |    |           |   --58.30%--native_queued_spin_lock_slowpath
                |    |           |
                |    |            --0.69%--cpumask_next
                |    |                |
                |    |                 --0.51%--_find_next_bit
                |    |
                |    |--10.61%--crypt_convert
                |    |          |
                |    |          |--6.05%--xts_crypt
                ...
    
    After applying this patch and running the same test, %system usage drops
    to ~7% and write throughput on the loop device increases to ~2.7GB/s.
    perf report shows mempool_alloc() at ~8% rather than ~62% of the profile,
    and the percpu_counter spinlock is no longer hit.
    
        |--8.15%--mempool_alloc
        |    |
        |    |--3.93%--crypt_page_alloc
        |    |    |
        |    |     --3.75%--__alloc_pages
        |    |         |
        |    |          --3.62%--get_page_from_freelist
        |    |              |
        |    |               --3.22%--rmqueue_bulk
        |    |                   |
        |    |                    --2.59%--_raw_spin_lock
        |    |                      |
        |    |                       --2.57%--native_queued_spin_lock_slowpath
        |    |
        |     --3.05%--_raw_spin_lock_irqsave
        |               |
        |                --2.49%--native_queued_spin_lock_slowpath

    Suggested-by: DJ Gregor <dj@corelight.com>
    Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Arne Welzel <arne.welzel@corelight.com>
    Fixes: 5059353d ("dm crypt: limit the number of allocated pages")
    Cc: stable@vger.kernel.org
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>