    swiotlb: reduce area lock contention for non-primary IO TLB pools · 55c54386
    Petr Tesarik authored
    If multiple areas and multiple IO TLB pools exist, first iterate the
    current CPU specific area in all pools. Then move to the next area index.
    
    This is best illustrated by a diagram:
    
            area 0 |  area 1 | ... | area M |
    pool 0    A         B              C
    pool 1    D         E
    ...
    pool N    F         G              H
    
    Currently, each pool is searched before moving on to the next pool,
    i.e. the search order is A, B ... C, D, E ... F, G ... H. With this patch,
    each area is searched in all pools before moving on to the next area,
    i.e. the search order is A, D ... F, B, E ... G ... C ... H.
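    
    The two search orders can be sketched as two nested loops whose nesting is
    swapped. This is only an illustration, assuming every pool has the same
    number of areas and the search starts at area 0; struct visit and both
    function names are hypothetical and do not appear in swiotlb.c:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one search step: which pool, which area. */
struct visit {
	int pool;
	int area;
};

/* Old order: try every area of one pool before moving to the next pool. */
size_t order_pool_first(int npools, int nareas, struct visit *out)
{
	size_t n = 0;

	assert(npools > 0 && nareas > 0);
	for (int pool = 0; pool < npools; pool++)
		for (int area = 0; area < nareas; area++)
			out[n++] = (struct visit){ .pool = pool, .area = area };
	return n;
}

/* New order: try one area index in every pool, then the next area index. */
size_t order_area_first(int npools, int nareas, struct visit *out)
{
	size_t n = 0;

	assert(npools > 0 && nareas > 0);
	for (int area = 0; area < nareas; area++)
		for (int pool = 0; pool < npools; pool++)
			out[n++] = (struct visit){ .pool = pool, .area = area };
	return n;
}
```

    With two pools of three areas each, the first function visits all three
    areas of pool 0 and then those of pool 1, whereas the second alternates
    between the pools for each area index.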
    
    Note that preemption is not disabled, and raw_smp_processor_id() may not
    return a stable result, but it is called only once to determine the initial
    area index. The search will iterate over all areas eventually, even if the
    current task is preempted.
    
    Next, some pools may have fewer (but not more) areas than default_nareas.
    Skip such pools once the distance from the initial area index reaches
    pool->nareas. This logic ensures that for every pool the search starts in
    the initial CPU's own area and never tries any area twice.
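    
    A minimal user-space sketch of this iteration follows. It is not the code
    from swiotlb.c: struct probe, search_order() and its parameters are
    hypothetical, max_nareas stands in for default_nareas, and start stands in
    for the value read once from raw_smp_processor_id(). It also assumes that
    an area index wraps modulo the pool's area count. The sketch skips a pool
    as soon as the offset from the initial index reaches that pool's area
    count, because any further offset would wrap around to an area that was
    already tried:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one search step: which pool, which area. */
struct probe {
	int pool;
	int area;
};

/*
 * Record the order in which (pool, area) pairs would be searched.
 * nareas[p] is the area count of pool p and never exceeds max_nareas.
 * start is sampled once by the caller, so a later CPU migration cannot
 * change the sequence.
 */
size_t search_order(const int *nareas, int npools, int max_nareas,
		    int start, struct probe *out)
{
	size_t n = 0;

	assert(npools > 0 && max_nareas > 0 && start >= 0);
	for (int offset = 0; offset < max_nareas; offset++) {
		for (int pool = 0; pool < npools; pool++) {
			/* Smaller pool: all of its areas were tried already. */
			if (offset >= nareas[pool])
				continue;
			out[n++] = (struct probe){
				.pool = pool,
				.area = (start + offset) % nareas[pool],
			};
		}
	}
	return n;
}
```

    For two pools with 4 and 2 areas and start 3, the recorded order is
    (pool 0, area 3), (pool 1, area 1), (pool 0, area 0), (pool 1, area 0),
    (pool 0, area 1), (pool 0, area 2): each pool begins in the area that
    corresponds to the starting CPU, and no area is visited twice.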
    
    To verify performance impact, I booted the kernel with a minimum pool
    size ("swiotlb=512,4,force"), so multiple pools get allocated, and I ran
    these benchmarks:
    
    - small: single-threaded I/O of 4 KiB blocks,
    - big: single-threaded I/O of 64 KiB blocks,
    - 4way: 4-way parallel I/O of 4 KiB blocks.
    
    The "var" column in the tables below is the coefficient of variation over 5
    runs of the test, the "diff" column is the relative difference against base
    in read-write I/O bandwidth (MiB/s).
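    
    For reference, the two columns are presumably computed as below. The
    commit message does not give the formulas, so this is an assumption: the
    sample standard deviation (with n - 1 in the denominator, n = 5 runs) and
    the comparison of mean bandwidths are guesses, and the symbols are
    introduced here only for illustration.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
\mathrm{var} = \frac{s}{\bar{x}}, \qquad
\mathrm{diff} = \frac{\bar{x}_{\mathrm{patched}} - \bar{x}_{\mathrm{base}}}{\bar{x}_{\mathrm{base}}}
```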
    
    Tested on an x86 VM against a QEMU virtio SATA driver backed by a RAM-based
    block device on the host:
    
             base    patched
             var     var      diff
    small    0.69%   0.62%    +25.4%
    big      2.14%   2.27%    +25.7%
    4way     2.65%   1.70%    +23.6%
    
    Tested on a Raspberry Pi against a class-10 A1 microSD card:
    
             base    patched
             var     var      diff
    small    0.53%   1.96%    -0.3%
    big      0.02%   0.57%    +0.8%
    4way     6.17%   0.40%    +0.3%
    
    These results confirm that there is a significant performance boost in the
    software IO TLB slot allocation itself. Where performance is dominated by
    actual hardware, there is no measurable change.
    
    Signed-off-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
    Reviewed-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
    Signed-off-by: Christoph Hellwig <hch@lst.de>