• Damien Le Moal's avatar
    block: Do not remove zone write plugs still in use · 7b295187
    Damien Le Moal authored
    Large write BIOs that span a zone boundary are split in
    blk_mq_submit_bio() before being passed to blk_zone_plug_bio() for zone
    write plugging. Such split BIO will be chained with one fragment
    targeting one zone and the remainder of the BIO targeting the next
    zone. The two BIOs can be executed in parallel, without a predetermine
    order relative to eachother and their completion may be reversed: the
    remainder first completing and the first fragment then completing. In
    such case, bio_endio() will not immediately execute
    blk_zone_write_plug_bio_endio() for the parent BIO (the remainder of the
    split BIO) as the BIOs are chained. blk_zone_write_plug_bio_endio() for
    the parent BIO will be executed only once the first fragment completes.
    
    In the case of a device with small zones and very large BIOs, uch
    completion pattern can lead to disk_should_remove_zone_wplug() to return
    true for the zone of the parent BIO when the parent BIO request
    completes and blk_zone_write_plug_complete_request() is executed. This
    triggers the removal of the zone write plug from the hash table using
    disk_remove_zone_wplug(). With the zone write plug of the parent BIO
    missing, the call to disk_get_zone_wplug() in
    blk_zone_write_plug_bio_endio() returns NULL and triggers a warning.
    
    This patterns can be recreated fairly easily using a scsi_debug device
    with small zone and btrfs. E.g.
    
    modprobe scsi_debug delay=0 dev_size_mb=1024 sector_size=4096 \
    	zbc=host-managed zone_cap_mb=3 zone_nr_conv=0 zone_size_mb=4
    mkfs.btrfs -f -O zoned /dev/sda
    mount -t btrfs /dev/sda /mnt
    fio --name=wrtest --rw=randwrite --direct=1 --ioengine=libaio \
    	--bs=4k --iodepth=16 --size=1M --directory=/mnt --time_based \
    	--runtime=10
    umount /dev/sda
    
    Will result in the warning:
    
    [   29.035538] WARNING: CPU: 3 PID: 37 at block/blk-zoned.c:1207 blk_zone_write_plug_bio_endio+0xee/0x1e0
    ...
    [   29.058682] Call Trace:
    [   29.059095]  <TASK>
    [   29.059473]  ? __warn+0x80/0x120
    [   29.059983]  ? blk_zone_write_plug_bio_endio+0xee/0x1e0
    [   29.060728]  ? report_bug+0x160/0x190
    [   29.061283]  ? handle_bug+0x36/0x70
    [   29.061830]  ? exc_invalid_op+0x17/0x60
    [   29.062399]  ? asm_exc_invalid_op+0x1a/0x20
    [   29.063025]  ? blk_zone_write_plug_bio_endio+0xee/0x1e0
    [   29.063760]  bio_endio+0xb7/0x150
    [   29.064280]  btrfs_clone_write_end_io+0x2b/0x60 [btrfs]
    [   29.065049]  blk_update_request+0x17c/0x500
    [   29.065666]  scsi_end_request+0x27/0x1a0 [scsi_mod]
    [   29.066356]  scsi_io_completion+0x5b/0x690 [scsi_mod]
    [   29.067077]  blk_complete_reqs+0x3a/0x50
    [   29.067692]  __do_softirq+0xcf/0x2b3
    [   29.068248]  ? sort_range+0x20/0x20
    [   29.068791]  run_ksoftirqd+0x1c/0x30
    [   29.069339]  smpboot_thread_fn+0xcc/0x1b0
    [   29.069936]  kthread+0xcf/0x100
    [   29.070438]  ? kthread_complete_and_exit+0x20/0x20
    [   29.071314]  ret_from_fork+0x31/0x50
    [   29.071873]  ? kthread_complete_and_exit+0x20/0x20
    [   29.072563]  ret_from_fork_asm+0x11/0x20
    [   29.073146]  </TASK>
    
    either when fio executes or when unmount is executed.
    
    Fix this by modifying disk_should_remove_zone_wplug() to check that the
    reference count to a zone write plug is not larger than 2, that is, that
    the only references left on the zone are the caller held reference
    (blk_zone_write_plug_complete_request()) and the initial extra reference
    for the zone write plug taken when it was initialized (and that is
    dropped when the zone write plug is removed from the hash table).
    
    To be consistent with this change, make sure to drop the request or BIO
    held reference to the zone write plug before calling
    disk_zone_wplug_unplug_bio(). All references are also dropped using
    disk_put_zone_wplug() instead of atomic_dec() to ensure that the zone
    write plug is freed if it needs to be.
    
    Comments are also improved to clarify zone write plugs reference
    handling.
    Reported-by: default avatarShin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
    Fixes: dd291d77 ("block: Introduce zone write plugging")
    Signed-off-by: default avatarDamien Le Moal <dlemoal@kernel.org>
    Link: https://lore.kernel.org/r/20240501110907.96950-8-dlemoal@kernel.orgSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
    7b295187
blk-zoned.c 53.8 KB