1. 29 Nov, 2022 2 commits
    • Merge tag 'nvme-6.2-2022-11-29' of git://git.infradead.org/nvme into for-6.2/block · 8613dec0
      Jens Axboe authored
      Pull NVMe updates from Christoph:
      
      "nvme updates for Linux 6.2
      
       - support some passthrough commands without CAP_SYS_ADMIN
         (Kanchan Joshi)
       - refactor PCIe probing and reset (Christoph Hellwig)
       - various fabrics authentication fixes and improvements (Sagi Grimberg)
       - avoid fallback to sequential scan due to transient issues
         (Uday Shankar)
       - implement support for the DEAC bit in Write Zeroes (Christoph Hellwig)
       - allow overriding the IEEE OUI and firmware revision in configfs for
         nvmet (Aleksandr Miloserdov)
       - force reconnect when number of queue changes in nvmet (Daniel Wagner)
       - minor fixes and improvements (Uros Bizjak, Joel Granados,
         Sagi Grimberg, Christoph Hellwig, Christophe JAILLET)"
      
      * tag 'nvme-6.2-2022-11-29' of git://git.infradead.org/nvme: (45 commits)
        nvmet: expose firmware revision to configfs
        nvmet: expose IEEE OUI to configfs
        nvme: rename the queue quiescing helpers
        nvmet: fix a memory leak in nvmet_auth_set_key
        nvme: return err on nvme_init_non_mdts_limits fail
        nvme: avoid fallback to sequential scan due to transient issues
        nvme-rdma: stop auth work after tearing down queues in error recovery
        nvme-tcp: stop auth work after tearing down queues in error recovery
        nvme-auth: have dhchap_auth_work wait for queues auth to complete
        nvme-auth: remove redundant auth_work flush
        nvme-auth: convert dhchap_auth_list to an array
        nvme-auth: check chap ctrl_key once constructed
        nvme-auth: no need to reset chap contexts on re-authentication
        nvme-auth: remove redundant deallocations
        nvme-auth: clear sensitive info right after authentication completes
        nvme-auth: guarantee dhchap buffers under memory pressure
        nvme-auth: don't keep long lived 4k dhchap buffer
        nvme-auth: remove redundant if statement
        nvme-auth: don't override ctrl keys before validation
        nvme-auth: don't ignore key generation failures when initializing ctrl keys
        ...
    • block: mq-deadline: Rename deadline_is_seq_writes() · 3692fec8
      Damien Le Moal authored
      Rename deadline_is_seq_writes() to deadline_is_seq_write() (remove the
      "s" plural) to more correctly reflect the fact that this function tests
      a single request, not multiple requests.
      
      Fixes: 015d02f4 ("block: mq-deadline: Do not break sequential write streams to zoned HDDs")
      Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Link: https://lore.kernel.org/r/20221126025550.967914-2-damien.lemoal@opensource.wdc.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 25 Nov, 2022 1 commit
    • blk-mq: fix possible memleak when register 'hctx' failed · 4b7a21c5
      Ye Bin authored
      The following issue occurs during fault injection testing:
      unreferenced object 0xffff888132a9f400 (size 512):
        comm "insmod", pid 308021, jiffies 4324277909 (age 509.733s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 08 f4 a9 32 81 88 ff ff  ...........2....
          08 f4 a9 32 81 88 ff ff 00 00 00 00 00 00 00 00  ...2............
        backtrace:
          [<00000000e8952bb4>] kmalloc_node_trace+0x22/0xa0
          [<00000000f9980e0f>] blk_mq_alloc_and_init_hctx+0x3f1/0x7e0
          [<000000002e719efa>] blk_mq_realloc_hw_ctxs+0x1e6/0x230
          [<000000004f1fda40>] blk_mq_init_allocated_queue+0x27e/0x910
          [<00000000287123ec>] __blk_mq_alloc_disk+0x67/0xf0
          [<00000000a2a34657>] 0xffffffffa2ad310f
          [<00000000b173f718>] 0xffffffffa2af824a
          [<0000000095a1dabb>] do_one_initcall+0x87/0x2a0
          [<00000000f32fdf93>] do_init_module+0xdf/0x320
          [<00000000cbe8541e>] load_module+0x3006/0x3390
          [<0000000069ed1bdb>] __do_sys_finit_module+0x113/0x1b0
          [<00000000a1a29ae8>] do_syscall_64+0x35/0x80
          [<000000009cd878b0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      The fault injection context is as follows:
       kobject_add
       blk_mq_register_hctx
       blk_mq_sysfs_register
       blk_register_queue
       device_add_disk
       null_add_dev.part.0 [null_blk]
      
      'blk_mq_register_hctx' may already have added some objects when it fails
      halfway, but it performs no rollback, so the caller does not know which
      objects failed to be added. To solve this issue, roll back the objects
      already added when 'blk_mq_register_hctx' fails halfway.
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20221117022940.873959-1-yebin@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
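The rollback described in this commit can be illustrated with a minimal user-space sketch. It is not the kernel patch: struct hctx_obj, struct ctx_obj, ctx_add(), ctx_del(), hctx_register() and the fail_at fault-injection field are all hypothetical stand-ins for the hctx kobject, its per-CPU ctx kobjects and kobject_add(). It only shows the pattern of undoing the objects already added when registration fails halfway.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one per-CPU context object. */
struct ctx_obj {
    bool registered;
};

/* Hypothetical stand-in for a hardware context owning several ctx objects. */
struct hctx_obj {
    bool registered;
    struct ctx_obj *ctxs;
    size_t nr_ctx;
    size_t fail_at; /* fault injection: index whose add fails (>= nr_ctx: never) */
};

/* Simulated "add" step that can be made to fail at a chosen index. */
static int ctx_add(struct hctx_obj *h, size_t i)
{
    if (i == h->fail_at)
        return -1;
    h->ctxs[i].registered = true;
    return 0;
}

static void ctx_del(struct ctx_obj *c)
{
    c->registered = false;
}

/* Register the parent and all children; on failure undo everything added so far. */
int hctx_register(struct hctx_obj *h)
{
    size_t i, j;
    int ret;

    h->registered = true;
    for (i = 0; i < h->nr_ctx; i++) {
        ret = ctx_add(h, i);
        if (ret)
            goto out;
    }
    return 0;

out:
    /* Only entries [0, i) were added; remove exactly those, then the parent. */
    for (j = 0; j < i; j++)
        ctx_del(&h->ctxs[j]);
    h->registered = false;
    return ret;
}
```

With the rollback in place, a caller that sees an error can assume nothing remains registered and does not need to know how far the registration got.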
  3. 24 Nov, 2022 5 commits
    • block: fix crash in 'blk_mq_elv_switch_none' · 90b0296e
      Ye Bin authored
      Syzbot found the following issue:
      general protection fault, probably for non-canonical address 0xdffffc000000001d: 0000 [#1] PREEMPT SMP KASAN
      KASAN: null-ptr-deref in range [0x00000000000000e8-0x00000000000000ef]
      CPU: 0 PID: 5234 Comm: syz-executor931 Not tainted 6.1.0-rc3-next-20221102-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
      RIP: 0010:__elevator_get block/elevator.h:94 [inline]
      RIP: 0010:blk_mq_elv_switch_none block/blk-mq.c:4593 [inline]
      RIP: 0010:__blk_mq_update_nr_hw_queues block/blk-mq.c:4658 [inline]
      RIP: 0010:blk_mq_update_nr_hw_queues+0x304/0xe40 block/blk-mq.c:4709
      RSP: 0018:ffffc90003cdfc08 EFLAGS: 00010206
      RAX: 0000000000000000 RBX: dffffc0000000000 RCX: 0000000000000000
      RDX: 000000000000001d RSI: 0000000000000002 RDI: 00000000000000e8
      RBP: ffff88801dbd0000 R08: ffff888027c89398 R09: ffffffff8de2e517
      R10: fffffbfff1bc5ca2 R11: 0000000000000000 R12: ffffc90003cdfc70
      R13: ffff88801dbd0008 R14: ffff88801dbd03f8 R15: ffff888027c89380
      FS:  0000555557259300(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000005d84c8 CR3: 000000007a7cb000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       nbd_start_device+0x153/0xc30 drivers/block/nbd.c:1355
       nbd_start_device_ioctl drivers/block/nbd.c:1405 [inline]
       __nbd_ioctl drivers/block/nbd.c:1481 [inline]
       nbd_ioctl+0x5a1/0xbd0 drivers/block/nbd.c:1521
       blkdev_ioctl+0x36e/0x800 block/ioctl.c:614
       vfs_ioctl fs/ioctl.c:51 [inline]
       __do_sys_ioctl fs/ioctl.c:870 [inline]
       __se_sys_ioctl fs/ioctl.c:856 [inline]
       __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:856
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Commit dd6f7f17 moved '__elevator_get(qe->type)' before the assignment
      of 'qe->type', which leads to a wild pointer access.
      To solve this issue, take the reference on 'qe->type' after 'qe->type'
      has been set.
      
      Reported-by: syzbot+746a4eece09f86bc39d7@syzkaller.appspotmail.com
      Fixes: dd6f7f17 ("block: add proper helpers for elevator_type module refcount management")
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20221107033956.3276891-1-yebin@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
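The ordering problem fixed here can be shown with a small user-space sketch. The types struct elv_type, struct queue and struct qe_node and the helpers elv_get() and save_elevator() are hypothetical simplifications, not the kernel's definitions. The point is only that the reference must be taken through the pointer after it has been assigned.

```c
#include <stddef.h>

/* Hypothetical stand-in for an elevator type with a reference count. */
struct elv_type {
    int refs;
};

/* Hypothetical stand-in for a queue that has an elevator attached. */
struct queue {
    struct elv_type *type;
};

/* Hypothetical stand-in for the saved entry used while switching to "none". */
struct qe_node {
    struct elv_type *type;
};

/* Takes a reference; dereferences t, so t must already be valid. */
static void elv_get(struct elv_type *t)
{
    t->refs++;
}

void save_elevator(struct qe_node *qe, const struct queue *q)
{
    /* Assign first ... */
    qe->type = q->type;
    /* ... then take the reference through the now-valid pointer.
     * Calling elv_get(qe->type) before the assignment would dereference
     * whatever qe->type happened to contain (NULL in a zeroed entry). */
    elv_get(qe->type);
}
```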
    • drbd: destroy workqueue when drbd device was freed · 8692814b
      Wang ShaoBo authored
      A submitter workqueue is dynamically allocated by init_submitter(),
      which is called by drbd_create_device(). It should be destroyed when
      the device is no longer needed or is destroyed.
      
      Fixes: 113fef9e ("drbd: prepare to queue write requests on a submit worker")
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Link: https://lore.kernel.org/r/20221124015817.2729789-3-bobo.shaobowang@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • drbd: remove call to memset before free device/resource/connection · 6e7b854e
      Wang ShaoBo authored
      This reverts c2258ffc ("drbd: poison free'd device, resource and
      connection structs"). Adding a memset here for debugging is odd; there
      are methods that show accurately what happened, such as kdump.
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Link: https://lore.kernel.org/r/20221124015817.2729789-2-bobo.shaobowang@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: mq-deadline: Do not break sequential write streams to zoned HDDs · 015d02f4
      Damien Le Moal authored
      mq-deadline ensures in-order dispatching of write requests to zoned
      block devices using a per zone lock (a bit). This implies that for any
      purely sequential write workload, the drive is exercised most of the
      time at a maximum queue depth of one.
      
      However, when such a sequential write workload crosses a zone boundary
      (when sequentially writing multiple contiguous zones), zone write
      locking may prevent the last write to one zone from being issued (as the
      previous write is still being executed) but allow the first write to the
      following zone to be issued (as that zone is not yet being written and
      not locked). This results in an out-of-order delivery of the sequential
      write commands to the device every time a zone boundary is crossed.
      
      While such behavior does not break the sequential write constraint of
      zoned block devices (and does not generate any write error), some zoned
      hard-disks react badly to seeing these out of order writes, resulting in
      lower write throughput.
      
      This problem can be addressed by always dispatching the first request
      of a stream of sequential write requests, regardless of the zones
      targeted by these sequential writes. To do so, the function
      deadline_skip_seq_writes() is introduced and used in
      deadline_next_request() to select the next write command to issue if the
      target device is an HDD (blk_queue_nonrot() being false).
      deadline_fifo_request() is modified using the new
      deadline_earlier_request() and deadline_is_seq_write() helpers to ignore
      requests in the fifo list that have a preceding request in lba order
      that is sequential.
      
      With this fix, a sequential write workload executed with the following
      fio command:
      
      fio  --name=seq-write --filename=/dev/sda --zonemode=zbd --direct=1 \
           --size=68719476736  --ioengine=libaio --iodepth=32 --rw=write \
           --bs=65536
      
      results in an increase from 225 MB/s to 250 MB/s of the write throughput
      of an SMR HDD (11% increase).
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Link: https://lore.kernel.org/r/20221124021208.242541-3-damien.lemoal@opensource.wdc.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
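The sequential-stream test described in this commit can be sketched in user space as follows. This is a simplified illustration over an array of write requests sorted by LBA, not the kernel code: struct wreq and the array-based signatures are assumptions made for the example, and the function names only echo the helpers named in the commit message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified write request: start sector and length in sectors. */
struct wreq {
    uint64_t pos;
    uint32_t sectors;
};

/*
 * True if request i starts exactly where the previous request in LBA order
 * ends, i.e. it is not the first request of a sequential stream.
 * reqs[] must be sorted by pos.
 */
bool is_seq_write(const struct wreq *reqs, size_t i)
{
    if (i == 0)
        return false;
    return reqs[i - 1].pos + reqs[i - 1].sectors == reqs[i].pos;
}

/*
 * Starting at request i (i < n), skip every request that continues the same
 * sequential stream. Returns the index of the first request that does not
 * belong to the stream, or n if the stream runs to the end of the array.
 */
size_t skip_seq_writes(const struct wreq *reqs, size_t n, size_t i)
{
    uint64_t next = reqs[i].pos;

    while (i < n && reqs[i].pos == next) {
        next += reqs[i].sectors;
        i++;
    }
    return i;
}
```

In this sketch a request for which is_seq_write() is true would be passed over when scanning for an expired request, so that the head of its stream is dispatched first, regardless of which zone each write targets.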
    • block: mq-deadline: Fix dd_finish_request() for zoned devices · 2820e5d0
      Damien Le Moal authored
      dd_finish_request() tests if the per prio fifo_list is not empty to
      determine if request dispatching must be restarted for handling blocked
      write requests to zoned devices with a call to
      blk_mq_sched_mark_restart_hctx(). While simple, this implementation has
      2 problems:
      
      1) Only the priority level of the completed request is considered.
         However, writes to a zone may be blocked due to other writes to the
         same zone using a different priority level. While this is unlikely to
         happen in practice, as writing a zone with different IO priorities
         does not make sense, nothing in the code prevents this from
         happening.
      2) The use of list_empty() is dangerous as dd_finish_request() does not
         take dd->lock and may run concurrently with the insert and dispatch
         code.
      
      Fix these 2 problems by testing the write fifo list of all priority
      levels using the new helper dd_has_write_work(), and by testing each
      fifo list using list_empty_careful().
      
      Fixes: c807ab52 ("block/mq-deadline: Add I/O priority support")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Link: https://lore.kernel.org/r/20221124021208.242541-2-damien.lemoal@opensource.wdc.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
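The first problem listed in this commit (checking only the priority level of the completed request) can be illustrated with a user-space sketch. It does not reproduce the kernel code: list_empty_careful() is kernel-only, so the sketch substitutes a C11 atomic counter of pending writes per priority level, and struct sched, NR_PRIO and all function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_PRIO 3 /* assumed number of priority levels for this sketch */

/* Hypothetical scheduler state: pending write count per priority level. */
struct sched {
    atomic_int pending_writes[NR_PRIO];
};

void sched_init(struct sched *s)
{
    for (int p = 0; p < NR_PRIO; p++)
        atomic_init(&s->pending_writes[p], 0);
}

void write_queued(struct sched *s, int prio)
{
    atomic_fetch_add(&s->pending_writes[prio], 1);
}

void write_dispatched(struct sched *s, int prio)
{
    atomic_fetch_sub(&s->pending_writes[prio], 1);
}

/* Old-style check: looks only at the priority of the completed request. */
bool has_write_work_single(struct sched *s, int prio)
{
    return atomic_load(&s->pending_writes[prio]) > 0;
}

/* Fixed-style check: a blocked write at any priority level counts. */
bool has_write_work(struct sched *s)
{
    for (int p = 0; p < NR_PRIO; p++)
        if (atomic_load(&s->pending_writes[p]) > 0)
            return true;
    return false;
}
```

If a write is pending at one priority level and a request completes at another, the single-level check reports no work and dispatching would not be restarted, while the all-levels check does report it.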
  4. 23 Nov, 2022 9 commits
  5. 22 Nov, 2022 1 commit
  6. 21 Nov, 2022 5 commits
  7. 18 Nov, 2022 1 commit
  8. 16 Nov, 2022 16 commits