1. 24 Nov, 2022 2 commits
    • block: mq-deadline: Do not break sequential write streams to zoned HDDs · 015d02f4
      Damien Le Moal authored
      mq-deadline ensures in-order dispatching of write requests to zoned
      block devices using a per-zone lock (a bit). This implies that, for any
      purely sequential write workload, the drive is exercised most of the
      time at a maximum queue depth of one.
      
      However, when such a sequential write workload crosses a zone boundary
      (when sequentially writing multiple contiguous zones), zone write
      locking may prevent the last write to one zone from being issued (as the
      previous write is still being executed) while allowing the first write to
      the following zone to be issued (as that zone is not yet being written
      and so is not locked). This results in an out-of-order delivery of the
      sequential write commands to the device every time a zone boundary is
      crossed.
      
      While such behavior does not break the sequential write constraint of
      zoned block devices (and does not generate any write error), some zoned
      hard disks react badly to seeing these out-of-order writes, resulting in
      lower write throughput.
      
      This problem can be addressed by always dispatching the first request
      of a stream of sequential write requests, regardless of the zones
      targeted by these sequential writes. To do so, the function
      deadline_skip_seq_writes() is introduced and used in
      deadline_next_request() to select the next write command to issue if the
      target device is an HDD (blk_queue_nonrot() being false).
      deadline_fifo_request() is modified using the new
      deadline_earlier_request() and deadline_is_seq_write() helpers to ignore
      requests in the fifo list that have a preceding request in lba order
      that is sequential.
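      
      The sequential-write check at the heart of this can be sketched as
      follows (an illustration of the approach rather than the verbatim
      patch; deadline_latter_request() is assumed to be the existing
      mq-deadline helper returning the next request in LBA order, and
      blk_rq_pos()/blk_rq_sectors() are the standard block layer accessors):
      
          /* Sketch: does rq continue the request just before it in LBA order? */
          static bool deadline_is_seq_write(struct deadline_data *dd,
                                            struct request *rq)
          {
                  struct request *prev = deadline_earlier_request(rq);
      
                  /* Sequential if prev ends exactly where rq starts. */
                  return prev &&
                         blk_rq_pos(prev) + blk_rq_sectors(prev) == blk_rq_pos(rq);
          }
      
          /* Sketch: skip over a run of back-to-back writes starting at rq. */
          static struct request *
          deadline_skip_seq_writes(struct deadline_data *dd, struct request *rq)
          {
                  sector_t pos = blk_rq_pos(rq);
      
                  while (rq && blk_rq_pos(rq) == pos) {
                          pos += blk_rq_sectors(rq);
                          rq = deadline_latter_request(rq);
                  }
                  return rq;
          }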
      
      With this fix, a sequential write workload executed with the following
      fio command:
      
      fio  --name=seq-write --filename=/dev/sda --zonemode=zbd --direct=1 \
           --size=68719476736  --ioengine=libaio --iodepth=32 --rw=write \
           --bs=65536
      
      results in an increase from 225 MB/s to 250 MB/s of the write throughput
      of an SMR HDD (11% increase).
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Link: https://lore.kernel.org/r/20221124021208.242541-3-damien.lemoal@opensource.wdc.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: mq-deadline: Fix dd_finish_request() for zoned devices · 2820e5d0
      Damien Le Moal authored
      dd_finish_request() tests if the per prio fifo_list is not empty to
      determine if request dispatching must be restarted for handling blocked
      write requests to zoned devices with a call to
      blk_mq_sched_mark_restart_hctx(). While simple, this implementation has
      2 problems:
      
      1) Only the priority level of the completed request is considered.
         However, writes to a zone may be blocked due to other writes to the
         same zone using a different priority level. While this is unlikely to
         happen in practice, as writing a zone with different I/O priorities
         does not make sense, nothing in the code prevents this from
         happening.
      2) The use of list_empty() is dangerous as dd_finish_request() does not
         take dd->lock and may run concurrently with the insert and dispatch
         code.
      
      Fix these 2 problems by testing the write fifo list of all priority
      levels using the new helper dd_has_write_work(), and by testing each
      fifo list using list_empty_careful().
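      
      A minimal sketch of the dd_has_write_work() idea (the per_prio,
      fifo_list, DD_WRITE and DD_PRIO_MAX names follow mq-deadline's
      existing data layout; treat this as an illustration, not the exact
      patch):
      
          static bool dd_has_write_work(struct blk_mq_hw_ctx *hctx)
          {
                  struct deadline_data *dd = hctx->queue->elevator->elevator_data;
                  enum dd_prio prio;
      
                  /* Check every priority level, not just the completed request's. */
                  for (prio = 0; prio <= DD_PRIO_MAX; prio++)
                          /* Lockless check, safe without holding dd->lock. */
                          if (!list_empty_careful(&dd->per_prio[prio].fifo_list[DD_WRITE]))
                                  return true;
      
                  return false;
          }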
      
      Fixes: c807ab52 ("block/mq-deadline: Add I/O priority support")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Link: https://lore.kernel.org/r/20221124021208.242541-2-damien.lemoal@opensource.wdc.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 23 Nov, 2022 9 commits
  3. 22 Nov, 2022 1 commit
  4. 21 Nov, 2022 3 commits
  5. 16 Nov, 2022 23 commits
  6. 14 Nov, 2022 2 commits
    • Merge branch 'md-next' of... · 5626196a
      Jens Axboe authored
      Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.2/block
      
      Pull MD fixes from Song.
      
      * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        md/raid1: stop mdx_raid1 thread when raid1 array run failed
        md/raid5: use bdev_write_cache instead of open coding it
        md: fix a crash in mempool_free
        md/raid0, raid10: Don't set discard sectors for request queue
        md/bitmap: Fix bitmap chunk size overflow issues
        md: introduce md_ro_state
        md: factor out __md_set_array_info()
        lib/raid6: drop RAID6_USE_EMPTY_ZERO_PAGE
        raid5-cache: use try_cmpxchg in r5l_wake_reclaim
        drivers/md/md-bitmap: check the return value of md_bitmap_get_counter()
    • md/raid1: stop mdx_raid1 thread when raid1 array run failed · b611ad14
      Jiang Li authored
      Running the raid1 array fails when we assemble the array with only the
      inactive disk, but the mdx_raid1 thread is not stopped even though the
      associated resources have been released, resulting in a NULL pointer
      dereference at poweroff.
      
      This causes the following Oops:
          [  287.587787] BUG: kernel NULL pointer dereference, address: 0000000000000070
          [  287.594762] #PF: supervisor read access in kernel mode
          [  287.599912] #PF: error_code(0x0000) - not-present page
          [  287.605061] PGD 0 P4D 0
          [  287.607612] Oops: 0000 [#1] SMP NOPTI
          [  287.611287] CPU: 3 PID: 5265 Comm: md0_raid1 Tainted: G     U            5.10.146 #0
          [  287.619029] Hardware name: xxxxxxx/To be filled by O.E.M, BIOS 5.19 06/16/2022
          [  287.626775] RIP: 0010:md_check_recovery+0x57/0x500 [md_mod]
          [  287.632357] Code: fe 01 00 00 48 83 bb 10 03 00 00 00 74 08 48 89 ......
          [  287.651118] RSP: 0018:ffffc90000433d78 EFLAGS: 00010202
          [  287.656347] RAX: 0000000000000000 RBX: ffff888105986800 RCX: 0000000000000000
          [  287.663491] RDX: ffffc90000433bb0 RSI: 00000000ffffefff RDI: ffff888105986800
          [  287.670634] RBP: ffffc90000433da0 R08: 0000000000000000 R09: c0000000ffffefff
          [  287.677771] R10: 0000000000000001 R11: ffffc90000433ba8 R12: ffff888105986800
          [  287.684907] R13: 0000000000000000 R14: fffffffffffffe00 R15: ffff888100b6b500
          [  287.692052] FS:  0000000000000000(0000) GS:ffff888277f80000(0000) knlGS:0000000000000000
          [  287.700149] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          [  287.705897] CR2: 0000000000000070 CR3: 000000000320a000 CR4: 0000000000350ee0
          [  287.713033] Call Trace:
          [  287.715498]  raid1d+0x6c/0xbbb [raid1]
          [  287.719256]  ? __schedule+0x1ff/0x760
          [  287.722930]  ? schedule+0x3b/0xb0
          [  287.726260]  ? schedule_timeout+0x1ed/0x290
          [  287.730456]  ? __switch_to+0x11f/0x400
          [  287.734219]  md_thread+0xe9/0x140 [md_mod]
          [  287.738328]  ? md_thread+0xe9/0x140 [md_mod]
          [  287.742601]  ? wait_woken+0x80/0x80
          [  287.746097]  ? md_register_thread+0xe0/0xe0 [md_mod]
          [  287.751064]  kthread+0x11a/0x140
          [  287.754300]  ? kthread_park+0x90/0x90
          [  287.757974]  ret_from_fork+0x1f/0x30
      
      In fact, when the raid1 array fails to run, we need to call
      md_unregister_thread() before raid1_free().
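      
      In the raid1_run() failure path, the intended ordering looks roughly
      like this (a sketch, assuming conf->thread is the mdx_raid1 thread
      registered earlier during run):
      
          /* RAID1 needs at least one active disk; otherwise fail the run. */
          if (conf->raid_disks - mddev->degraded < 1) {
                  /* Stop the mdx_raid1 thread before the error path
                   * releases the conf it operates on.
                   */
                  md_unregister_thread(&conf->thread);
                  ret = -EINVAL;
                  goto abort;
          }
      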
      Signed-off-by: Jiang Li <jiang.li@ugreen.com>
      Signed-off-by: Song Liu <song@kernel.org>