1. 01 May, 2024 7 commits
    • Damien Le Moal's avatar
      block: Do not remove zone write plugs still in use · 7b295187
      Damien Le Moal authored
      Large write BIOs that span a zone boundary are split in
      blk_mq_submit_bio() before being passed to blk_zone_plug_bio() for zone
      write plugging. Such split BIO will be chained with one fragment
      targeting one zone and the remainder of the BIO targeting the next
      zone. The two BIOs can be executed in parallel, without a predetermine
      order relative to eachother and their completion may be reversed: the
      remainder first completing and the first fragment then completing. In
      such case, bio_endio() will not immediately execute
      blk_zone_write_plug_bio_endio() for the parent BIO (the remainder of the
      split BIO) as the BIOs are chained. blk_zone_write_plug_bio_endio() for
      the parent BIO will be executed only once the first fragment completes.
      
      In the case of a device with small zones and very large BIOs, uch
      completion pattern can lead to disk_should_remove_zone_wplug() to return
      true for the zone of the parent BIO when the parent BIO request
      completes and blk_zone_write_plug_complete_request() is executed. This
      triggers the removal of the zone write plug from the hash table using
      disk_remove_zone_wplug(). With the zone write plug of the parent BIO
      missing, the call to disk_get_zone_wplug() in
      blk_zone_write_plug_bio_endio() returns NULL and triggers a warning.
      
      This patterns can be recreated fairly easily using a scsi_debug device
      with small zone and btrfs. E.g.
      
      modprobe scsi_debug delay=0 dev_size_mb=1024 sector_size=4096 \
      	zbc=host-managed zone_cap_mb=3 zone_nr_conv=0 zone_size_mb=4
      mkfs.btrfs -f -O zoned /dev/sda
      mount -t btrfs /dev/sda /mnt
      fio --name=wrtest --rw=randwrite --direct=1 --ioengine=libaio \
      	--bs=4k --iodepth=16 --size=1M --directory=/mnt --time_based \
      	--runtime=10
      umount /dev/sda
      
      Will result in the warning:
      
      [   29.035538] WARNING: CPU: 3 PID: 37 at block/blk-zoned.c:1207 blk_zone_write_plug_bio_endio+0xee/0x1e0
      ...
      [   29.058682] Call Trace:
      [   29.059095]  <TASK>
      [   29.059473]  ? __warn+0x80/0x120
      [   29.059983]  ? blk_zone_write_plug_bio_endio+0xee/0x1e0
      [   29.060728]  ? report_bug+0x160/0x190
      [   29.061283]  ? handle_bug+0x36/0x70
      [   29.061830]  ? exc_invalid_op+0x17/0x60
      [   29.062399]  ? asm_exc_invalid_op+0x1a/0x20
      [   29.063025]  ? blk_zone_write_plug_bio_endio+0xee/0x1e0
      [   29.063760]  bio_endio+0xb7/0x150
      [   29.064280]  btrfs_clone_write_end_io+0x2b/0x60 [btrfs]
      [   29.065049]  blk_update_request+0x17c/0x500
      [   29.065666]  scsi_end_request+0x27/0x1a0 [scsi_mod]
      [   29.066356]  scsi_io_completion+0x5b/0x690 [scsi_mod]
      [   29.067077]  blk_complete_reqs+0x3a/0x50
      [   29.067692]  __do_softirq+0xcf/0x2b3
      [   29.068248]  ? sort_range+0x20/0x20
      [   29.068791]  run_ksoftirqd+0x1c/0x30
      [   29.069339]  smpboot_thread_fn+0xcc/0x1b0
      [   29.069936]  kthread+0xcf/0x100
      [   29.070438]  ? kthread_complete_and_exit+0x20/0x20
      [   29.071314]  ret_from_fork+0x31/0x50
      [   29.071873]  ? kthread_complete_and_exit+0x20/0x20
      [   29.072563]  ret_from_fork_asm+0x11/0x20
      [   29.073146]  </TASK>
      
      either when fio executes or when unmount is executed.
      
      Fix this by modifying disk_should_remove_zone_wplug() to check that the
      reference count to a zone write plug is not larger than 2, that is, that
      the only references left on the zone are the caller held reference
      (blk_zone_write_plug_complete_request()) and the initial extra reference
      for the zone write plug taken when it was initialized (and that is
      dropped when the zone write plug is removed from the hash table).
      
      To be consistent with this change, make sure to drop the request or BIO
      held reference to the zone write plug before calling
      disk_zone_wplug_unplug_bio(). All references are also dropped using
      disk_put_zone_wplug() instead of atomic_dec() to ensure that the zone
      write plug is freed if it needs to be.
      
      Comments are also improved to clarify zone write plugs reference
      handling.
      Reported-by: default avatarShin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Fixes: dd291d77 ("block: Introduce zone write plugging")
      Signed-off-by: default avatarDamien Le Moal <dlemoal@kernel.org>
      Link: https://lore.kernel.org/r/20240501110907.96950-8-dlemoal@kernel.orgSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7b295187
    • Damien Le Moal's avatar
      block: Unhash a zone write plug only if needed · 79ae35a4
      Damien Le Moal authored
      Fix disk_remove_zone_wplug() to ensure that a zone write plug already
      removed from a disk hash table of zone write plugs is not removed
      again. Do this by checking the BLK_ZONE_WPLUG_UNHASHED flag of the plug
      and calling hlist_del_init_rcu() only if the flag is not set.
      
      Furthermore, since BIO completions can happen at any time, that is,
      decrementing of the zone write plug reference count can happen at any
      time, make sure to use disk_put_zone_wplug() instead of atomic_dec() to
      ensure that the zone write plug is freed when its last reference is
      dropped. In order to do this, disk_remove_zone_wplug() is moved after
      the definition of disk_put_zone_wplug(). disk_should_remove_zone_wplug()
      is moved as well to keep it together with disk_remove_zone_wplug().
      
      To be consistent with this change, add a check in disk_put_zone_wplug()
      to ensure that a zone write plug being freed was already removed from
      the disk hash table.
      
      Fixes: dd291d77 ("block: Introduce zone write plugging")
      Signed-off-by: default avatarDamien Le Moal <dlemoal@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Link: https://lore.kernel.org/r/20240501110907.96950-7-dlemoal@kernel.orgSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      79ae35a4
    • Damien Le Moal's avatar
      block: Hold a reference on zone write plugs to schedule submission · 9e78c38a
      Damien Le Moal authored
      Since a zone write plug BIO work is a field of struct blk_zone_wplug, we
      must ensure that a zone write plug is never freed when its BIO
      submission work is queued or running. Do this by holding a reference on
      the zone write plug when the submission work is scheduled for execution
      with queue_work() and releasing the reference at the end of the
      execution of the work function blk_zone_wplug_bio_work().
      The helper function disk_zone_wplug_schedule_bio_work() is introduced to
      get a reference on a zone write plug and queue its work. This helper is
      used in disk_zone_wplug_unplug_bio() and disk_zone_wplug_handle_error().
      
      Fixes: dd291d77 ("block: Introduce zone write plugging")
      Signed-off-by: default avatarDamien Le Moal <dlemoal@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Link: https://lore.kernel.org/r/20240501110907.96950-6-dlemoal@kernel.orgSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9e78c38a
    • Damien Le Moal's avatar
      block: Fix reference counting for zone write plugs in error state · 19aad274
      Damien Le Moal authored
      When zone is reset or finished, disk_zone_wplug_set_wp_offset() is
      called to update the zone write plug write pointer offset and to clear
      the zone error state (BLK_ZONE_WPLUG_ERROR flag) if it is set.
      However, this processing is missing dropping the reference to the zone
      write plug that was taken in disk_zone_wplug_set_error() when the error
      flag was first set. Furthermore, the error state handling must release
      the zone write plug lock to first execute a report zones command. When
      the report zone races with a reset or finish operation that clears the
      error, we can end up decrementing the zone write plug reference count
      twice: once in disk_zone_wplug_set_wp_offset() for the reset/finish
      operation and one more time in disk_zone_wplugs_work() once
      disk_zone_wplug_handle_error() completes.
      
      Fix this by introducing disk_zone_wplug_clear_error() as the symmetric
      function of disk_zone_wplug_set_error(). disk_zone_wplug_clear_error()
      decrements the zone write plug reference count obtained in
      disk_zone_wplug_set_error() only if the error handling has not started
      yet, that is, only if disk_zone_wplugs_work() has not yet taken the zone
      write plug off the error list. This ensure that either
      disk_zone_wplug_clear_error() or disk_zone_wplugs_work() drop the zone
      write plug reference count.
      
      Fixes: dd291d77 ("block: Introduce zone write plugging")
      Signed-off-by: default avatarDamien Le Moal <dlemoal@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Link: https://lore.kernel.org/r/20240501110907.96950-5-dlemoal@kernel.orgSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      19aad274
    • Damien Le Moal's avatar
      block: Fix zone write plug initialization from blk_revalidate_zone_cb() · 74b7ae5f
      Damien Le Moal authored
      When revalidating the zones of a zoned block device,
      blk_revalidate_zone_cb() must allocate a zone write plug for any
      sequential write required zone that is not empty nor full. However, the
      current code tests the latter case by comparing the zone write pointer
      offset to the zone size instead of the zone capacity. Furthermore,
      disk_get_and_lock_zone_wplug() is called with a sector argument equal to
      the zone start instead of the current zone write pointer position.
      This commit fixes both issues by calling disk_get_and_lock_zone_wplug()
      for a zone that is not empty and with a write pointer offset lower than
      the zone capacity and use the zone capacity sector as the sector
      argument for disk_get_and_lock_zone_wplug().
      
      Fixes: dd291d77 ("block: Introduce zone write plugging")
      Signed-off-by: default avatarDamien Le Moal <dlemoal@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Link: https://lore.kernel.org/r/20240501110907.96950-4-dlemoal@kernel.orgSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      74b7ae5f
    • Damien Le Moal's avatar
      block: Exclude conventional zones when faking max open limit · 6b7593b5
      Damien Le Moal authored
      For a device that has no limits for the maximum number of open and
      active zones, we default to using the number of zones, limited to
      BLK_ZONE_WPLUG_DEFAULT_POOL_SIZE (128), for the maximum number of open
      zones indicated to the user. However, for a device that has conventional
      zones and less zones than BLK_ZONE_WPLUG_DEFAULT_POOL_SIZE, we should
      not account conventional zones and set the limit to the number of
      sequential write required zones. Furthermore, for cases where the limit
      is equal to the number of sequential write required zones, we can
      advertize a limit of 0 to indicate "no limits".
      
      Fix this by moving the zone write plug mempool resizing from
      disk_revalidate_zone_resources() to disk_update_zone_resources() where
      we can safely compute the number of conventional zones and update the
      limits.
      
      Fixes: 843283e9 ("block: Fake max open zones limit when there is no limit")
      Reported-by: default avatarShin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Signed-off-by: default avatarDamien Le Moal <dlemoal@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Link: https://lore.kernel.org/r/20240501110907.96950-3-dlemoal@kernel.orgSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      6b7593b5
    • Damien Le Moal's avatar
      dm: Check that a zoned table leads to a valid mapped device · 44cccb30
      Damien Le Moal authored
      Using targets such as dm-linear, a mapped device can be created to
      contain only conventional zones. Such device should not be treated as
      zoned as it does not contain any mandatory sequential write required
      zone. Since such device can be randomly written, we can modify
      dm_set_zones_restrictions() to set the mapped device zoned queue limit
      to false to expose it as a regular block device. The function
      dm_check_zoned() does this after counting the number of conventional
      zones of the mapped device and comparing it to the total number of zones
      reported. The special dm_check_zoned_cb() report zones callback function
      is used to count conventional zones.
      Signed-off-by: default avatarDamien Le Moal <dlemoal@kernel.org>
      Reviewed-by: default avatarBenjamin Marzinski <bmarzins@redhat.com>
      Link: https://lore.kernel.org/r/20240501110907.96950-2-dlemoal@kernel.orgSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      44cccb30
  2. 26 Apr, 2024 1 commit
  3. 25 Apr, 2024 3 commits
  4. 23 Apr, 2024 1 commit
    • Damien Le Moal's avatar
      block: use a per disk workqueue for zone write plugging · a8f59e5a
      Damien Le Moal authored
      A zone write plug BIO work function blk_zone_wplug_bio_work() calls
      submit_bio_noacct_nocheck() to execute the next unplugged BIO. This
      function may block. So executing zone plugs BIO works using the block
      layer global kblockd workqueue can potentially lead to preformance or
      latency issues as the number of concurrent work for a workqueue is
      limited to WQ_DFL_ACTIVE (256).
      1) For a system with a large number of zoned disks, issuing write
         requests to otherwise unused zones may be delayed wiating for a work
         thread to become available.
      2) Requeue operations which use kblockd but are independent of zone
         write plugging may alsoi end up being delayed.
      
      To avoid these potential performance issues, create a workqueue per
      zoned device to execute zone plugs BIO work. The workqueue max active
      parameter is set to the maximum number of zone write plugs allocated
      with the zone write plug mempool. This limit is equal to the maximum
      number of open zones of the disk and defaults to 128 for disks that do
      not have a limit on the number of open zones.
      
      Fixes: dd291d77 ("block: Introduce zone write plugging")
      Signed-off-by: default avatarDamien Le Moal <dlemoal@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20240420075811.1276893-3-dlemoal@kernel.orgSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a8f59e5a
  5. 19 Apr, 2024 1 commit
  6. 17 Apr, 2024 27 commits