    btrfs: scrub: fix grouping of read IO · ae76d8e3
    Qu Wenruo authored
    [REGRESSION]
    There are several regression reports about scrub performance with the
    v6.4 kernel.
    
    On a PCIe 3.0 device, the old v6.3 kernel can scrub at 3GB/s, but
    v6.4 only reaches 1GB/s, an obvious 66% performance drop.
    
    [CAUSE]
    Iostat shows very different behavior between the v6.3 and v6.4 kernels:
    
      Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
      nvme0n1p3  9731.00 3425544.00 17237.00  63.92    2.18   352.02  21.18 100.00
      nvme0n1p3 15578.00  993616.00     5.00   0.03    0.09    63.78   1.32 100.00
    
    The upper one is v6.3 while the lower one is v6.4.
    
    There are several obvious differences:
    
    - Very few read merges
      This turns out to be caused by a behavior change: we no longer do bio
      plug/unplug.
    
    - Very low aqu-sz
      This is due to the submit-and-wait behavior of flush_scrub_stripes(),
      and the extra extent/csum tree searches.
    
    Neither behavior is that obvious on SATA SSDs, as SATA SSDs have NCQ
    to merge the reads, and they can not handle high queue depth well
    anyway.
    
    [FIX]
    For now this patch focuses on the read speed fix. Dev-replace
    speed needs more work.
    
    For the read part, we fix the problems in two directions:
    
    - Re-introduce blk plug/unplug to merge read requests
      This is pretty simple, and the behavior is pretty easy to observe.
    
      This would enlarge the average read request size to 512K.
    
    - Introduce multi-group reads and no longer wait for each group
      Instead of the old behavior, which submits 8 stripes and waits for
      them, here we would enlarge the total number of stripes to 16 * 8,
      which is 8M per device, the same as the old scrub's limit on the
      size of in-flight bios.
    
      Now every time we fill a group (8 stripes), we submit them and
      continue to the next stripes.
    
      Only when all 16 * 8 stripes are filled do we submit the
      remaining ones (the last group) and wait for all groups to finish.
      Then we submit the repair writes and dev-replace writes.
    
      This should enlarge the queue depth.
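
    The multi-group flow described above can be modelled in user space
    with a small counter-only sketch. This is not the kernel code: all
    names are hypothetical, nothing is read from disk, and the 64K
    stripe length is an assumption inferred from 16 * 8 stripes being
    8M per device. It only counts how often a group is submitted and
    how often the caller blocks.

```c
#include <assert.h>
#include <stddef.h>

/* Numbers from the description above; all names are hypothetical. */
#define STRIPES_PER_GROUP 8
#define MAX_GROUPS        16
#define MAX_STRIPES       (STRIPES_PER_GROUP * MAX_GROUPS)
#define STRIPE_LEN        (64 * 1024)   /* assumed: 8M / (16 * 8) */

struct scrub_sim {
    size_t queued;   /* stripes filled in the current batch */
    size_t submits;  /* group submissions issued so far */
    size_t waits;    /* times the caller blocked on in-flight reads */
};

/* Queue one stripe; submit on every full group, wait only on a full batch. */
static void sim_queue_stripe(struct scrub_sim *s)
{
    assert(s->queued < MAX_STRIPES);
    s->queued++;
    if (s->queued % STRIPES_PER_GROUP == 0)
        s->submits++;            /* group filled: submit, do not wait */
    if (s->queued == MAX_STRIPES) {
        s->waits++;              /* all groups filled: wait for them */
        s->queued = 0;           /* repair/dev-replace writes would go here */
    }
}

/* Flush a partially filled batch, e.g. at the end of a scrub range. */
static void sim_flush(struct scrub_sim *s)
{
    if (s->queued == 0)
        return;
    if (s->queued % STRIPES_PER_GROUP != 0)
        s->submits++;            /* submit the partial last group */
    s->waits++;
    s->queued = 0;
}
```

    In this model, 256 queued stripes cause 32 group submissions but
    only 2 waits, whereas the old submit-and-wait pattern would have
    blocked once per group, i.e. 32 times.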
    
    This would greatly improve the merge rate (thus read block size) and
    queue depth:
    
    Before (with regression, and cached extent/csum path):
    
     Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
     nvme0n1p3 20666.00 1318240.00    10.00   0.05    0.08    63.79   1.63 100.00
    
    After (with all patches applied):
    
     Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
     nvme0n1p3  5165.00 2278304.00 30557.00  85.54    0.55   441.10   2.81 100.00
    
    i.e. 1287 to 2224 MB/s.
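
    The throughput figures follow from the iostat columns: rkB/s
    divided by 1024 gives the rate in units of 1024 kB, and rkB/s
    divided by r/s gives the average request size (rareq-sz). A small
    sketch of that arithmetic, with hypothetical helper names:

```c
#include <stdint.h>

/* Hypothetical helper: convert an iostat rkB/s value to units of
 * 1024 kB per second, truncating the fractional part. */
static uint64_t rkb_to_mb(uint64_t rkb_per_sec)
{
    return rkb_per_sec / 1024;
}

/* Hypothetical helper: average read request size in kB, i.e. the
 * rareq-sz column, derived from rkB/s and r/s. */
static double avg_read_req_kb(double rkb_per_sec, double reads_per_sec)
{
    return rkb_per_sec / reads_per_sec;
}
```

    With the numbers above, rkb_to_mb(1318240) is 1287 and
    rkb_to_mb(2278304) is 2224, and the average request size grows
    from about 63.79 kB to about 441.10 kB.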
    
    CC: stable@vger.kernel.org # 6.4+
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>