1. 03 May, 2023 3 commits
    • Adrian Huang's avatar
      nvme-pci: clamp max_hw_sectors based on DMA optimized limitation · 3710e2b0
      Adrian Huang authored
      When running the fio test on a 448-core AMD server with an NVMe disk,
      a soft lockup or a hard lockup call trace is shown:
      
      [soft lockup]
      watchdog: BUG: soft lockup - CPU#126 stuck for 23s! [swapper/126:0]
      RIP: 0010:_raw_spin_unlock_irqrestore+0x21/0x50
      ...
      Call Trace:
       <IRQ>
       fq_flush_timeout+0x7d/0xd0
       ? __pfx_fq_flush_timeout+0x10/0x10
       call_timer_fn+0x2e/0x150
       run_timer_softirq+0x48a/0x560
       ? __pfx_fq_flush_timeout+0x10/0x10
       ? clockevents_program_event+0xaf/0x130
       __do_softirq+0xf1/0x335
       irq_exit_rcu+0x9f/0xd0
       sysvec_apic_timer_interrupt+0xb4/0xd0
       </IRQ>
       <TASK>
       asm_sysvec_apic_timer_interrupt+0x1f/0x30
      ...
      
      Obviously, fq_flush_timeout spends over 20 seconds. Here is the ftrace log:
      
                     |  fq_flush_timeout() {
                     |    fq_ring_free() {
                     |      put_pages_list() {
         0.170 us    |        free_unref_page_list();
         0.810 us    |      }
                     |      free_iova_fast() {
                     |        free_iova() {
       * 85622.66 us |          _raw_spin_lock_irqsave();
         2.860 us    |          remove_iova();
         0.600 us    |          _raw_spin_unlock_irqrestore();
         0.470 us    |          lock_info_report();
         2.420 us    |          free_iova_mem.part.0();
       * 85638.27 us |        }
       * 85638.84 us |      }
                     |      put_pages_list() {
         0.230 us    |        free_unref_page_list();
         0.470 us    |      }
         ...            ...
       $ 31017069 us |  }
      
      Most of the cores are contending for iova_rbtree_lock because of the
      IOVA flush queue mechanism.
      
      [hard lockup]
      NMI watchdog: Watchdog detected hard LOCKUP on cpu 351
      RIP: 0010:native_queued_spin_lock_slowpath+0x2d8/0x330
      
      Call Trace:
       <IRQ>
       _raw_spin_lock_irqsave+0x4f/0x60
       free_iova+0x27/0xd0
       free_iova_fast+0x4d/0x1d0
       fq_ring_free+0x9b/0x150
       iommu_dma_free_iova+0xb4/0x2e0
       __iommu_dma_unmap+0x10b/0x140
       iommu_dma_unmap_sg+0x90/0x110
       dma_unmap_sg_attrs+0x4a/0x50
       nvme_unmap_data+0x5d/0x120 [nvme]
       nvme_pci_complete_batch+0x77/0xc0 [nvme]
       nvme_irq+0x2ee/0x350 [nvme]
       ? __pfx_nvme_pci_complete_batch+0x10/0x10 [nvme]
       __handle_irq_event_percpu+0x53/0x1a0
       handle_irq_event_percpu+0x19/0x60
       handle_irq_event+0x3d/0x60
       handle_edge_irq+0xb3/0x210
       __common_interrupt+0x7f/0x150
       common_interrupt+0xc5/0xf0
       </IRQ>
       <TASK>
       asm_common_interrupt+0x2b/0x40
      ...
      
      ftrace shows fq_ring_free spends over 10 seconds [1]. Again, most of
      the cores are contending for iova_rbtree_lock because of the IOVA
      flush queue mechanism.
      
      [Root Cause]
      The root cause is that max_hw_sectors_kb of the NVMe disk (mdts=10)
      is 4096 KB. Streaming DMA mappings longer than 128 KB cannot benefit
      from the scalable IOVA mechanism introduced by commit 9257b4a2
      ("iommu/iova: introduce per-cpu caching to iova allocation").
      
      To fix the lock contention issue, clamp max_hw_sectors to the
      DMA-optimized mapping limit so that the scalable IOVA mechanism is used.
      
      Note: The issue does not happen with another NVMe disk (mdts = 5
      and max_hw_sectors_kb = 128).
      
      [1] https://gist.github.com/AdrianHuang/bf8ec7338204837631fbdaed25d19cc4
      
      Suggested-by: Keith Busch <kbusch@kernel.org>
      Reported-and-tested-by: Jiwei Sun <sunjw10@lenovo.com>
      Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      3710e2b0
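The arithmetic behind the root cause described in 3710e2b0 can be sketched in plain C. This is a user-space illustration, not the driver code. It assumes a 4 KiB minimum controller page size, so the MDTS limit is 2^mdts pages, and it takes the 128 KB threshold from the commit message. Both helper names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: MDTS is a power-of-two multiplier of the
 * controller's minimum page size, given here in KiB. */
static uint32_t mdts_to_kb(unsigned int mdts, uint32_t min_page_kb)
{
    return min_page_kb << mdts;
}

/* Hypothetical helper: clamp the transfer limit to the largest size
 * that can still use the cached (scalable) IOVA allocation path. */
static uint32_t clamp_max_hw_kb(uint32_t max_hw_kb, uint32_t dma_opt_kb)
{
    return max_hw_kb < dma_opt_kb ? max_hw_kb : dma_opt_kb;
}
```

Under these assumptions, mdts=10 yields 4096 KiB and is clamped to 128 KiB, while mdts=5 already yields 128 KiB, which is consistent with the note that the second disk does not hit the issue.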
    • Hristo Venev's avatar
      nvme-pci: add quirk for missing secondary temperature thresholds · bd375fee
      Hristo Venev authored
      On Kingston KC3000 and Kingston FURY Renegade (both have the same PCI
      IDs) accessing temp3_{min,max} fails with an invalid field error (note
      that there is no problem setting the thresholds for temp1).
      
      This contradicts the NVM Express Base Specification 2.0b, page 292:
      
        The over temperature threshold and under temperature threshold
        features shall be implemented for all implemented temperature sensors
        (i.e., all Temperature Sensor fields that report a non-zero value).
      
      Define NVME_QUIRK_NO_SECONDARY_TEMP_THRESH that disables the thresholds
      for all but the composite temperature and set it for this device.
      Signed-off-by: Hristo Venev <hristo@venev.name>
      Reviewed-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      bd375fee
    • Sagi Grimberg's avatar
      nvme-pci: add NVME_QUIRK_BOGUS_NID for HS-SSD-FUTURE 2048G · 1616d6c3
      Sagi Grimberg authored
      Add a quirk to fix HS-SSD-FUTURE 2048G SSD drives reporting duplicate
      nsids.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=217384
      Reported-by: Andrey God <andreygod83@protonmail.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      1616d6c3
  2. 30 Apr, 2023 1 commit
  3. 28 Apr, 2023 9 commits
  4. 27 Apr, 2023 3 commits
  5. 25 Apr, 2023 2 commits
  6. 24 Apr, 2023 1 commit
  7. 20 Apr, 2023 2 commits
    • Zhong Jinghua's avatar
      nbd: fix incomplete validation of ioctl arg · 55793ea5
      Zhong Jinghua authored
      Testing found a UBSAN warning caused by passing an unvalidated arg to
      nbd_ioctl. The call trace is shown below:
      
      UBSAN: Undefined behaviour in fs/buffer.c:1709:35
      signed integer overflow:
      -9223372036854775808 - 1 cannot be represented in type 'long long int'
      CPU: 3 PID: 2523 Comm: syz-executor.0 Not tainted 4.19.90 #1
      Hardware name: linux,dummy-virt (DT)
      Call trace:
       dump_backtrace+0x0/0x3f0 arch/arm64/kernel/time.c:78
       show_stack+0x28/0x38 arch/arm64/kernel/traps.c:158
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x170/0x1dc lib/dump_stack.c:118
       ubsan_epilogue+0x18/0xb4 lib/ubsan.c:161
       handle_overflow+0x188/0x1dc lib/ubsan.c:192
       __ubsan_handle_sub_overflow+0x34/0x44 lib/ubsan.c:206
       __block_write_full_page+0x94c/0xa20 fs/buffer.c:1709
       block_write_full_page+0x1f0/0x280 fs/buffer.c:2934
       blkdev_writepage+0x34/0x40 fs/block_dev.c:607
       __writepage+0x68/0xe8 mm/page-writeback.c:2305
       write_cache_pages+0x44c/0xc70 mm/page-writeback.c:2240
       generic_writepages+0xdc/0x148 mm/page-writeback.c:2329
       blkdev_writepages+0x2c/0x38 fs/block_dev.c:2114
       do_writepages+0xd4/0x250 mm/page-writeback.c:2344
      
      The warning is triggered because i_size_read(inode) - 1 overflows in
      __block_write_full_page(). inode->i_size is assigned from bytesize in
      __nbd_ioctl() -> nbd_set_size(), so the size of arg needs to be
      limited to prevent such errors.
      
      Moreover, in __nbd_ioctl() -> nbd_add_socket(), arg is cast to int.
      If arg is 0x8000000000000001 (on a 64-bit machine), it becomes 1
      after the cast, which produces unexpected results.
      
      Fix it by adding checks that reject values that are too large.
      Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
      Reviewed-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Link: https://lore.kernel.org/r/20230206145805.2645671-1-zhongjinghua@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      55793ea5
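The two problems described in 55793ea5 come down to a 64-bit value being narrowed or reinterpreted without a range check. The sketch below shows this in user-space C. The function names are hypothetical and the checks are only one plausible shape for such validation, not the actual nbd patch.

```c
#include <limits.h>
#include <stdint.h>

/* What an unchecked narrowing amounts to: only the low 32 bits survive. */
static int narrow_unchecked(uint64_t arg)
{
    return (int)(uint32_t)(arg & 0xffffffffu);
}

/* Hypothetical guard: accept the value only if it fits in an int.
 * Returns 0 on success and -1 when the value is out of range. */
static int checked_to_int(uint64_t arg, int *out)
{
    if (arg > (uint64_t)INT_MAX)
        return -1;
    *out = (int)arg;
    return 0;
}

/* Hypothetical guard for a byte size that is later stored in a signed
 * 64-bit field: reject anything that would read back as negative. */
static int size_is_valid(uint64_t bytesize)
{
    return bytesize <= (uint64_t)INT64_MAX;
}
```

With arg = 0x8000000000000001 the unchecked narrowing yields 1, while the guarded version rejects the value. A size of 0x8000000000000000 corresponds to the -9223372036854775808 seen in the UBSAN report and is rejected by the size check.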
    • Ming Lei's avatar
      ublk: don't return 0 in case of any failure · 7c75661c
      Ming Lei authored
      Commit 2d786e66 ("block: ublk: switch to ioctl command encoding")
      started resetting the local variable 'ret' to zero, so if any failure
      happens while handling the three IO commands, 0 can be returned to the
      ublk server.
      
      Fix it by returning -EINVAL in case of command handling failure.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Fixes: 2d786e66 ("block: ublk: switch to ioctl command encoding")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20230420091104.1092972-1-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7c75661c
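The failure mode fixed by 7c75661c is a common one with goto-based error handling: a return variable that starts at zero reports success on any bail-out path that forgets to set it. A minimal sketch follows. Both handlers are hypothetical and do not reproduce the ublk code.

```c
#include <errno.h>

/* Hypothetical handler showing the bug: 'ret' starts at 0, so the
 * early exit on failure is indistinguishable from success. */
static int handle_cmd_buggy(int cmd_ok)
{
    int ret = 0;

    if (!cmd_ok)
        goto out;   /* failure path, but ret is still 0 */
    ret = 0;        /* success */
out:
    return ret;
}

/* Same flow with a pessimistic default: every path that does not
 * explicitly succeed returns -EINVAL. */
static int handle_cmd_fixed(int cmd_ok)
{
    int ret = -EINVAL;

    if (!cmd_ok)
        goto out;
    ret = 0;        /* success */
out:
    return ret;
}
```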
  8. 19 Apr, 2023 3 commits
    • Ondrej Kozina's avatar
      sed-opal: geometry feature reporting command · 9e05a259
      Ondrej Kozina authored
      Locking range start and locking range length
      attributes may be required to satisfy restrictions
      exposed by OPAL2 geometry feature reporting.
      
      Geometry reporting feature is described in TCG OPAL SSC,
      section 3.1.1.4 (ALIGN, LogicalBlockSize, AlignmentGranularity
      and LowestAlignedLBA).
      
      4.3.5.2.1.1 RangeStart Behavior:
      
      [ StartAlignment = (RangeStart modulo AlignmentGranularity) - LowestAlignedLBA ]
      
      When processing a Set method or CreateRow method on the Locking
      table for a non-Global Range row, if:
      
      a) the AlignmentRequired (ALIGN above) column in the LockingInfo
         table is TRUE;
      b) RangeStart is non-zero; and
      c) StartAlignment is non-zero, then the method SHALL fail and
         return an error status code INVALID_PARAMETER.
      
      4.3.5.2.1.2 RangeLength Behavior:
      
      If RangeStart is zero, then
      	[ LengthAlignment = (RangeLength modulo AlignmentGranularity) - LowestAlignedLBA ]
      
      If RangeStart is non-zero, then
      	[ LengthAlignment = (RangeLength modulo AlignmentGranularity) ]
      
      When processing a Set method or CreateRow method on the Locking
      table for a non-Global Range row, if:
      
      a) the AlignmentRequired (ALIGN above) column in the LockingInfo
         table is TRUE;
      b) RangeLength is non-zero; and
      c) LengthAlignment is non-zero, then the method SHALL fail and
         return an error status code INVALID_PARAMETER
      
      In userspace we rely on the logical block size reported by the
      generic block device (via sysfs or ioctl), but we cannot read
      'AlignmentGranularity' or 'LowestAlignedLBA' anywhere else.
      We need to get those values from the sed-opal interface; otherwise
      we will not be able to report or avoid the locking range setup
      INVALID_PARAMETER errors above.
      Signed-off-by: Ondrej Kozina <okozina@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Tested-by: Milan Broz <gmazyland@gmail.com>
      Link: https://lore.kernel.org/r/20230411090931.9193-2-okozina@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9e05a259
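The RangeStart and RangeLength rules quoted in 9e05a259 can be checked in userspace once the geometry values are known. The sketch below is one way to express them in C. The struct and function names are hypothetical, and treating a zero AlignmentGranularity as "no constraint" is an assumption made only to avoid a division by zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the geometry values named above. */
struct geometry {
    bool     align_required;        /* ALIGN */
    uint64_t alignment_granularity; /* in logical blocks */
    uint64_t lowest_aligned_lba;
};

/* StartAlignment is zero exactly when the remainder of RangeStart
 * equals LowestAlignedLBA. A zero RangeStart is always accepted. */
static bool range_start_ok(const struct geometry *g, uint64_t start)
{
    if (!g->align_required || start == 0 || g->alignment_granularity == 0)
        return true;
    return start % g->alignment_granularity == g->lowest_aligned_lba;
}

/* LengthAlignment subtracts LowestAlignedLBA only when RangeStart is 0. */
static bool range_length_ok(const struct geometry *g, uint64_t start,
                            uint64_t length)
{
    uint64_t rem;

    if (!g->align_required || length == 0 || g->alignment_granularity == 0)
        return true;
    rem = length % g->alignment_granularity;
    return start == 0 ? rem == g->lowest_aligned_lba : rem == 0;
}
```

A caller would run both checks before issuing the locking range setup and report the misalignment itself rather than wait for INVALID_PARAMETER from the drive.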
    • Chaitanya Kulkarni's avatar
      null_blk: Always check queue mode setting from configfs · 63f8793e
      Chaitanya Kulkarni authored
      Make sure to check the device queue mode in null_validate_conf() and
      return an error for NULL_Q_RQ, as we don't allow the legacy I/O path.
      Without this patch we get an Oops when queue mode is set to 1 from
      configfs. The repro steps are as follows:
      
      modprobe null_blk nr_devices=0
      mkdir config/nullb/nullb0
      echo 1 > config/nullb/nullb0/memory_backed
      echo 4096 > config/nullb/nullb0/blocksize
      echo 20480 > config/nullb/nullb0/size
      echo 1 > config/nullb/nullb0/queue_mode
      echo 1 > config/nullb/nullb0/power
      
      Entering kdb (current=0xffff88810acdd080, pid 2372) on processor 42 Oops: (null)
      due to oops @ 0xffffffffc041c329
      CPU: 42 PID: 2372 Comm: sh Tainted: G           O     N 6.3.0-rc5lblk+ #5
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      RIP: 0010:null_add_dev.part.0+0xd9/0x720 [null_blk]
      Code: 01 00 00 85 d2 0f 85 a1 03 00 00 48 83 bb 08 01 00 00 00 0f 85 f7 03 00 00 80 bb 62 01 00 00 00 48 8b 75 20 0f 85 6d 02 00 00 <48> 89 6e 60 48 8b 75 20 bf 06 00 00 00 e8 f5 37 2c c1 48 8b 75 20
      RSP: 0018:ffffc900052cbde0 EFLAGS: 00010246
      RAX: 0000000000000001 RBX: ffff88811084d800 RCX: 0000000000000001
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888100042e00
      RBP: ffff8881053d8200 R08: ffffc900052cbd68 R09: ffff888105db2000
      R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000002
      R13: ffff888104765200 R14: ffff88810eec1748 R15: ffff88810eec1740
      FS:  00007fd445fd1740(0000) GS:ffff8897dfc80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000060 CR3: 0000000166a00000 CR4: 0000000000350ee0
      DR0: ffffffff8437a488 DR1: ffffffff8437a489 DR2: ffffffff8437a48a
      DR3: ffffffff8437a48b DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       nullb_device_power_store+0xd1/0x120 [null_blk]
       configfs_write_iter+0xb4/0x120
       vfs_write+0x2ba/0x3c0
       ksys_write+0x5f/0xe0
       do_syscall_64+0x3b/0x90
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
      RIP: 0033:0x7fd4460c57a7
      Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
      RSP: 002b:00007ffd3792a4a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fd4460c57a7
      RDX: 0000000000000002 RSI: 000055b43c02e4c0 RDI: 0000000000000001
      RBP: 000055b43c02e4c0 R08: 000000000000000a R09: 00007fd44615b4e0
      R10: 00007fd44615b3e0 R11: 0000000000000246 R12: 0000000000000002
      R13: 00007fd446198520 R14: 0000000000000002 R15: 00007fd446198700
       </TASK>
      Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
      Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
      Link: https://lore.kernel.org/r/20230416220339.43845-1-kch@nvidia.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      63f8793e
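The check described in 63f8793e amounts to rejecting one specific queue mode during configuration validation, so the unsupported value never reaches device setup. A minimal sketch follows. The enum and function are hypothetical stand-ins: the value 1 for the legacy request mode comes from the repro steps, and the other two values are assumptions.

```c
#include <errno.h>

/* Hypothetical queue modes. 1 is the value the repro steps write. */
enum demo_queue_mode {
    DEMO_Q_BIO = 0,
    DEMO_Q_RQ  = 1, /* legacy request path, no longer supported */
    DEMO_Q_MQ  = 2,
};

/* Reject the legacy mode up front instead of failing later in setup. */
static int demo_validate_queue_mode(int mode)
{
    if (mode == DEMO_Q_RQ)
        return -EINVAL;
    return 0;
}
```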
    • Ming Lei's avatar
      block: ublk: switch to ioctl command encoding · 2d786e66
      Ming Lei authored
      All ublk commands (control, IO) should have used ioctl command encoding
      from the beginning, because 1) ioctl command encoding defines each code
      uniquely, so the driver can easily detect a wrong command sent from
      userspace; 2) it might help the security subsystem audit uring cmd [1].
      
      Unfortunately we didn't do it that way, and that is a lesson for the
      ublk driver.
      
      So switch to ioctl command encoding now. Commands encoded in the old
      way are still supported, but they become legacy definitions. Any new
      command should use ioctl encoding.
      
      See ublksrv code for switching to ioctl command encoding in [2].
      
      [1] https://lore.kernel.org/io-uring/CAHC9VhSVzujW9LOj5Km80AjU0EfAuukoLrxO6BEfnXeK_s6bAg@mail.gmail.com/
      [2] https://github.com/ming1/ubdsrv/commits/ioctl_cmd_encoding
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Ken Kurematsu <k.kurematsu@nskint.co.jp>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20230418131810.855959-1-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2d786e66
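To illustrate why ioctl encoding in 2d786e66 makes a command code self-describing, the sketch below builds an encoded command with the standard Linux _IOWR macro and decodes it again. It is Linux-only. The struct, the 'u' type character and the 0x20 command number are illustrative assumptions, not the ublk UAPI definitions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical command payload, 16 bytes with no padding. */
struct demo_io_cmd {
    uint16_t q_id;
    uint16_t tag;
    int32_t  result;
    uint64_t addr;
};

/* A bare legacy opcode, and the same opcode wrapped in ioctl encoding
 * (direction, payload size, type character, number). */
#define DEMO_IO_FETCH_LEGACY 0x20
#define DEMO_IO_FETCH _IOWR('u', DEMO_IO_FETCH_LEGACY, struct demo_io_cmd)

/* An encoded command carries the type character; a bare opcode does not. */
static bool demo_is_ioctl_encoded(uint32_t cmd)
{
    return _IOC_TYPE(cmd) == 'u';
}

/* Recover the opcode so old and new encodings can share one handler. */
static uint32_t demo_cmd_nr(uint32_t cmd)
{
    return _IOC_NR(cmd);
}
```

Because the size and type are part of the code, a command meant for a different driver or built against a different struct layout no longer collides with a valid opcode.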
  9. 16 Apr, 2023 5 commits
    • Christoph Hellwig's avatar
      blk-mq: fix the blk_mq_add_to_requeue_list call in blk_kick_flush · 26a42b61
      Christoph Hellwig authored
      Commit b12e5c6c accidentally changed blk_kick_flush to do a head
      insert into the requeue list. Fix this up.
      
      Fixes: b12e5c6c ("blk-mq: pass a flags argument to blk_mq_add_to_requeue_list")
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20230416073553.966161-1-hch@lst.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      26a42b61
    • Colin Ian King's avatar
      block, bfq: Fix division by zero error on zero wsum · e53413f8
      Colin Ian King authored
      When the weighted sum is zero the calculation of limit causes
      a division by zero error. Fix this by continuing to the next level.
      
      This was discovered by running as root:
      
      stress-ng --ioprio 0
      
      Fixes the division-by-zero oops:
      
      [  521.450556] divide error: 0000 [#1] SMP NOPTI
      [  521.450766] CPU: 2 PID: 2684464 Comm: stress-ng-iopri Not tainted 6.2.1-1280.native #1
      [  521.451117] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
      [  521.451627] RIP: 0010:bfqq_request_over_limit+0x207/0x400
      [  521.451875] Code: 01 48 8d 0c c8 74 0b 48 8b 82 98 00 00 00 48 8d 0c c8 8b 85 34 ff ff ff 48 89 ca 41 0f af 41 50 48 d1 ea 48 98 48 01 d0 31 d2 <48> f7 f1 41 39 41 48 89 85 34 ff ff ff 0f 8c 7b 01 00 00 49 8b 44
      [  521.452699] RSP: 0018:ffffb1af84eb3948 EFLAGS: 00010046
      [  521.452938] RAX: 000000000000003c RBX: 0000000000000000 RCX: 0000000000000000
      [  521.453262] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb1af84eb3978
      [  521.453584] RBP: ffffb1af84eb3a30 R08: 0000000000000001 R09: ffff8f88ab8a4ba0
      [  521.453905] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8f88ab8a4b18
      [  521.454224] R13: ffff8f8699093000 R14: 0000000000000001 R15: ffffb1af84eb3970
      [  521.454549] FS:  00005640b6b0b580(0000) GS:ffff8f88b3880000(0000) knlGS:0000000000000000
      [  521.454912] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  521.455170] CR2: 00007ffcbcae4e38 CR3: 00000002e46de001 CR4: 0000000000770ee0
      [  521.455491] PKRU: 55555554
      [  521.455619] Call Trace:
      [  521.455736]  <TASK>
      [  521.455837]  ? bfq_request_merge+0x3a/0xc0
      [  521.456027]  ? elv_merge+0x115/0x140
      [  521.456191]  bfq_limit_depth+0xc8/0x240
      [  521.456366]  __blk_mq_alloc_requests+0x21a/0x2c0
      [  521.456577]  blk_mq_submit_bio+0x23c/0x6c0
      [  521.456766]  __submit_bio+0xb8/0x140
      [  521.457236]  submit_bio_noacct_nocheck+0x212/0x300
      [  521.457748]  submit_bio_noacct+0x1a6/0x580
      [  521.458220]  submit_bio+0x43/0x80
      [  521.458660]  ext4_io_submit+0x23/0x80
      [  521.459116]  ext4_do_writepages+0x40a/0xd00
      [  521.459596]  ext4_writepages+0x65/0x100
      [  521.460050]  do_writepages+0xb7/0x1c0
      [  521.460492]  __filemap_fdatawrite_range+0xa6/0x100
      [  521.460979]  file_write_and_wait_range+0xbf/0x140
      [  521.461452]  ext4_sync_file+0x105/0x340
      [  521.461882]  __x64_sys_fsync+0x67/0x100
      [  521.462305]  ? syscall_exit_to_user_mode+0x2c/0x1c0
      [  521.462768]  do_syscall_64+0x3b/0xc0
      [  521.463165]  entry_SYSCALL_64_after_hwframe+0x5a/0xc4
      [  521.463621] RIP: 0033:0x5640b6c56590
      [  521.464006] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d 71 70 0e 00 00 74 17 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c
      Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
      Link: https://lore.kernel.org/r/20230413133009.1605335-1-colin.i.king@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e53413f8
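The fix in e53413f8 is a guard around a per-level scaling step. The sketch below shows the pattern in isolation: the limit is scaled by weight/wsum at each level, and a level whose weighted sum is zero is skipped instead of divided by. The struct, the function name and the round-to-nearest formula are assumptions for illustration, not the bfq code.

```c
#include <stdint.h>

/* Hypothetical per-level data: this entity's weight and the weighted
 * sum of all entities at the same level. */
struct demo_level {
    uint64_t weight;
    uint64_t wsum;
};

/* Scale 'limit' level by level, rounding to nearest. A zero wsum would
 * be a division by zero, so that level is skipped. */
static uint64_t demo_scale_limit(uint64_t limit,
                                 const struct demo_level *lv, int n)
{
    for (int i = 0; i < n; i++) {
        if (lv[i].wsum == 0)
            continue;
        limit = (limit * lv[i].weight + lv[i].wsum / 2) / lv[i].wsum;
    }
    return limit;
}
```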
    • Akinobu Mita's avatar
      fault-inject: fix build error when FAULT_INJECTION_CONFIGFS=y and CONFIGFS_FS=m · d325c162
      Akinobu Mita authored
      This fixes a build error when CONFIG_FAULT_INJECTION_CONFIGFS=y and
      CONFIG_CONFIGFS_FS=m.
      
      Since the fault-injection library cannot be built as a module, avoid
      building configfs as a module.
      
      Fixes: 4668c7a2 ("fault-inject: allow configuration via configfs")
      Reported-by: kernel test robot <lkp@intel.com>
      Link: https://lore.kernel.org/oe-kbuild-all/202304150025.K0hczLR4-lkp@intel.com/
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d325c162
    • Jens Axboe's avatar
      block: store bdev->bd_disk->fops->submit_bio state in bdev · 9f4107b0
      Jens Axboe authored
      We have a long chain of memory dereferencing just to check whether or
      not this disk has a special submit_bio helper. As that's not necessarily
      the common case, add a bd_has_submit_bio state in the bdev to avoid
      traversing this memory dependency chain if we don't need to.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9f4107b0
    • Jens Axboe's avatar
      block: re-arrange the struct block_device fields for better layout · 3838c406
      Jens Axboe authored
      This moves struct device out-of-line as it's just used at open/close
      time, so we can keep some of the commonly used fields closer together.
      On a standard setup, it also reduces the size from 864 bytes to 848
      bytes. Yes, struct device is a pig...
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3838c406
  10. 14 Apr, 2023 11 commits
    • Jens Axboe's avatar
      Merge branch 'md-next' of... · 310e9c85
      Jens Axboe authored
      Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.4/block
      
      Pull MD updates from Song:
      
      "- md/bitmap: Optimal last page size, by Jon Derrick
       - Various raid10 fixes, by Yu Kuai and Li Nan
       - md: add error_handlers for raid0 and linear, by Mariusz Tkaczyk"
      
      * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        md/raid5: remove unused working_disks variable
        md/raid10: don't call bio_start_io_acct twice for bio which experienced read error
        md/raid10: fix memleak of md thread
        md/raid10: fix memleak for 'conf->bio_split'
        md/raid10: fix leak of 'r10bio->remaining' for recovery
        md/raid10: don't BUG_ON() in raise_barrier()
        md: fix soft lockup in status_resync
        md: add error_handlers for raid0 and linear
        md: Use optimal I/O size for last bitmap page
        md: Fix types in sb writer
        md: Move sb writer loop to its own function
        md/raid10: Fix typo in comment (replacment -> replacement)
        md: make kobj_type structures constant
        md/raid10: fix null-ptr-deref in raid10_sync_request
        md/raid10: fix task hung in raid10d
      310e9c85
    • Jens Axboe's avatar
      Merge tag 'nvme-6.4-2023-04-14' of git://git.infradead.org/nvme into for-6.4/block · d2a1d45c
      Jens Axboe authored
      Pull NVMe updates from Christoph:
      
      "nvme updates for Linux 6.4
      
       - drop redundant pci_enable_pcie_error_reporting (Bjorn Helgaas)
       - validate nvmet module parameters (Chaitanya Kulkarni)
       - fence TCP socket on receive error (Chris Leech)
       - fix async event trace event (Keith Busch)
       - minor cleanups (Chaitanya Kulkarni, zhenwei pi)
       - fix and cleanup nvmet Identify handling (Damien Le Moal,
         Christoph Hellwig)
       - fix double blk_mq_complete_request race in the timeout handler
         (Lei Yin)
       - fix irq locking in nvme-fcloop (Ming Lei)
       - remove queue mapping helper for rdma devices (Sagi Grimberg)"
      
      * tag 'nvme-6.4-2023-04-14' of git://git.infradead.org/nvme:
        nvme-fcloop: fix "inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage"
        blk-mq-rdma: remove queue mapping helper for rdma devices
        nvme-rdma: minor cleanup in nvme_rdma_create_cq()
        nvme: fix double blk_mq_complete_request for timeout request with low probability
        nvme: fix async event trace event
        nvme-apple: return directly instead of else
        nvme-apple: return directly instead of else
        nvmet-tcp: validate idle poll modparam value
        nvmet-tcp: validate so_priority modparam value
        nvme-tcp: fence TCP socket on receive error
        nvmet: remove nvmet_req_cns_error_complete
        nvmet: rename nvmet_execute_identify_cns_cs_ns
        nvmet: fix Identify Identification Descriptor List handling
        nvmet: cleanup nvmet_execute_identify()
        nvmet: fix I/O Command Set specific Identify Controller
        nvmet: fix Identify Active Namespace ID list handling
        nvmet: fix Identify Controller handling
        nvmet: fix Identify Namespace handling
        nvmet: fix error handling in nvmet_execute_identify_cns_cs_ns()
        nvme-pci: drop redundant pci_enable_pcie_error_reporting()
      d2a1d45c
    • Tom Rix's avatar
      md/raid5: remove unused working_disks variable · 7bc43612
      Tom Rix authored
      clang with W=1 reports
      drivers/md/raid5.c:7719:6: error: variable 'working_disks'
        set but not used [-Werror,-Wunused-but-set-variable]
              int working_disks = 0;
                  ^
      This variable is not used so remove it.
      Signed-off-by: Tom Rix <trix@redhat.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230327132324.1769595-1-trix@redhat.com
      7bc43612
    • Yu Kuai's avatar
      md/raid10: don't call bio_start_io_acct twice for bio which experienced read error · 7cddb055
      Yu Kuai authored
      handle_read_error() will resubmit r10_bio via raid10_read_request(), which
      will call bio_start_io_acct() again, while bio_end_io_acct() will only
      be called once.
      
      Fix the problem by not accounting the io again from handle_read_error().
      
      Fixes: 528bc2cf ("md/raid10: enable io accounting")
      Suggested-by: Song Liu <song@kernel.org>
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230314012258.2395894-1-yukuai1@huaweicloud.com
      7cddb055
    • Yu Kuai's avatar
      md/raid10: fix memleak of md thread · f0ddb83d
      Yu Kuai authored
      In raid10_run(), if setup_conf() succeeds and raid10_run() fails before
      setting 'mddev->thread', then in the error path 'conf->thread' is not
      freed.
      
      Fix the problem by setting 'mddev->thread' right after setup_conf().
      
      Fixes: 43a52123 ("md-cluster: choose correct label when clustered layout is not supported")
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230310073855.1337560-7-yukuai1@huaweicloud.com
      f0ddb83d
    • Yu Kuai's avatar
      md/raid10: fix memleak for 'conf->bio_split' · c9ac2acd
      Yu Kuai authored
      In the error path of raid10_run(), 'conf' needs to be freed; however,
      'conf->bio_split' is missed and memory will be leaked.
      
      Since there are 3 places to free 'conf', factor out a helper to fix the
      problem.
      
      Fixes: fc9977dd ("md/raid10: simplify the splitting of requests.")
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230310073855.1337560-6-yukuai1@huaweicloud.com
      c9ac2acd
    • Yu Kuai's avatar
      md/raid10: fix leak of 'r10bio->remaining' for recovery · 26208a7c
      Yu Kuai authored
      raid10_sync_request() will add 'r10bio->remaining' for both rdev and
      replacement rdev. However, if the read io fails, recovery_request_write()
      returns without issuing the write io. In this case, end_sync_request()
      is only called once and 'remaining' is leaked, causing an io hang.
      
      Fix the problem by decreasing 'remaining' according to whether 'bio' and
      'repl_bio' are valid.
      
      Fixes: 24afd80d ("md/raid10: handle recovery of replacement devices.")
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230310073855.1337560-5-yukuai1@huaweicloud.com
      26208a7c
    • Yu Kuai's avatar
      md/raid10: don't BUG_ON() in raise_barrier() · 9fdfe6d4
      Yu Kuai authored
      If raise_barrier() is called the first time in raid10_sync_request(), which
      means the first non-normal io is handled, raise_barrier() should wait for
      all dispatched normal io to be done. This ensures that normal io won't
      starve.
      
      However, BUG_ON() if this is broken is too aggressive. This patch replaces
      BUG_ON() with WARN and falls back to not forcing.
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230310073855.1337560-4-yukuai1@huaweicloud.com
      9fdfe6d4
    • Yu Kuai's avatar
      md: fix soft lockup in status_resync · 6efddf1e
      Yu Kuai authored
      status_resync() will calculate 'curr_resync - recovery_active' to show
      the user a progress bar like the following:
      
      [============>........]  resync = 61.4%
      
      'curr_resync' and 'recovery_active' are updated in md_do_sync(), and
      status_resync() can read them concurrently, hence it's possible that
      'curr_resync - recovery_active' can overflow to a huge number. In this
      case status_resync() will be stuck in the loop printing a large number
      of '=', which ends up in a soft lockup.
      
      Fix the problem by setting 'resync' to MD_RESYNC_ACTIVE in this case,
      so that a resync in progress is reported to the user.
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230310073855.1337560-3-yukuai1@huaweicloud.com
      6efddf1e
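The overflow described in 6efddf1e is ordinary unsigned wrap-around: when two counters are sampled at slightly different moments, the subtrahend can be larger and the difference becomes a huge positive number. The sketch below shows the guard in user-space C. The function name is hypothetical, and returning false stands in for the commit's fallback of reporting MD_RESYNC_ACTIVE.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true and stores the progress when the subtraction is safe.
 * Returns false when recovery_active has raced ahead of curr_resync,
 * so the caller can report "in progress" instead of a bogus value. */
static bool demo_resync_progress(uint64_t curr_resync,
                                 uint64_t recovery_active,
                                 uint64_t *progress)
{
    if (curr_resync < recovery_active)
        return false;
    *progress = curr_resync - recovery_active;
    return true;
}
```

Without the guard, a pair such as curr_resync = 100 and recovery_active = 200 would yield 2^64 - 100, and a loop printing one '=' per unit of progress would effectively never finish.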
    • Mariusz Tkaczyk's avatar
      md: add error_handlers for raid0 and linear · c31fea2f
      Mariusz Tkaczyk authored
      After commit 9631abdb ("md: Set MD_BROKEN for RAID1 and RAID10"),
      MD_BROKEN must be set if the array has failed, because state_store()
      checks it. If it is set then -EBUSY is returned to userspace.
      
      For raid0 and linear, MD_BROKEN is not set by error_handler(). As a result
      mdadm is unable to trigger clean-up actions. It is a regression.
      
      This patch adds appropriate error_handler for raid0 and linear. The
      error handler sets MD_BROKEN for this device.
      Reviewed-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230306130317.3418-1-mariusz.tkaczyk@linux.intel.com
      c31fea2f
    • Jon Derrick's avatar
      md: Use optimal I/O size for last bitmap page · 8745faa9
      Jon Derrick authored
      If the bitmap space has enough room, size the I/O for the last bitmap
      page write to the optimal I/O size for the storage device. The expanded
      write is checked to ensure it won't overrun the data or metadata.
      
      The drive this was tested against has higher latencies when there are
      sub-4k writes due to device-side read-mod-writes of its atomic 4k write
      unit. This change helps increase performance by sizing the last bitmap
      page I/O for the device's preferred write unit, if it is given.
      
      Example Intel/Solidigm P5520
      Raid10, Chunk-size 64M, bitmap-size 57228 bits
      
      $ mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/nvme{0,1,2,3}n1
              --assume-clean --bitmap=internal --bitmap-chunk=64M
      $ fio --name=test --direct=1 --filename=/dev/md0 --rw=randwrite --bs=4k --runtime=60
      
      Without patch:
        write: IOPS=1676, BW=6708KiB/s (6869kB/s)(393MiB/60001msec); 0 zone resets
      
      With patch:
        write: IOPS=15.7k, BW=61.4MiB/s (64.4MB/s)(3683MiB/60001msec); 0 zone resets
      
      Biosnoop:
      Without patch:
      Time        Process        PID     Device      LBA        Size      Lat
      1.410377    md0_raid10     6900    nvme0n1   W 16         4096      0.02
      1.410387    md0_raid10     6900    nvme2n1   W 16         4096      0.02
      1.410374    md0_raid10     6900    nvme3n1   W 16         4096      0.01
      1.410381    md0_raid10     6900    nvme1n1   W 16         4096      0.02
      1.410411    md0_raid10     6900    nvme1n1   W 115346512  4096      0.01
      1.410418    md0_raid10     6900    nvme0n1   W 115346512  4096      0.02
      1.410915    md0_raid10     6900    nvme2n1   W 24         3584      0.43 <--
      1.410935    md0_raid10     6900    nvme3n1   W 24         3584      0.45 <--
      1.411124    md0_raid10     6900    nvme1n1   W 24         3584      0.64 <--
      1.411147    md0_raid10     6900    nvme0n1   W 24         3584      0.66 <--
      1.411176    md0_raid10     6900    nvme3n1   W 2019022184 4096      0.01
      1.411189    md0_raid10     6900    nvme2n1   W 2019022184 4096      0.02
      
      With patch:
      Time        Process        PID     Device      LBA        Size      Lat
      5.747193    md0_raid10     727     nvme0n1   W 16         4096      0.01
      5.747192    md0_raid10     727     nvme1n1   W 16         4096      0.02
      5.747195    md0_raid10     727     nvme3n1   W 16         4096      0.01
      5.747202    md0_raid10     727     nvme2n1   W 16         4096      0.02
      5.747229    md0_raid10     727     nvme3n1   W 1196223704 4096      0.02
      5.747224    md0_raid10     727     nvme0n1   W 1196223704 4096      0.01
      5.747279    md0_raid10     727     nvme0n1   W 24         4096      0.01 <--
      5.747279    md0_raid10     727     nvme1n1   W 24         4096      0.02 <--
      5.747284    md0_raid10     727     nvme3n1   W 24         4096      0.02 <--
      5.747291    md0_raid10     727     nvme2n1   W 24         4096      0.02 <--
      5.747314    md0_raid10     727     nvme2n1   W 2234636712 4096      0.01
      5.747317    md0_raid10     727     nvme1n1   W 2234636712 4096      0.02
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230224183323.638-4-jonathan.derrick@linux.dev
      8745faa9
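The sizing rule described in 8745faa9 can be sketched as a round-up with a fallback. This is a simplified illustration in user-space C, not the md bitmap code: the function names, the single room_bytes bound and the fallback to the logical block size are assumptions.

```c
#include <stdint.h>

/* Round 'len' up to a multiple of 'unit' (unit must be non-zero). */
static uint32_t demo_round_up(uint32_t len, uint32_t unit)
{
    return (len + unit - 1) / unit * unit;
}

/* Hypothetical sizing rule for the last bitmap page: prefer the device's
 * optimal I/O size, and fall back to the logical block size when no
 * optimal size is reported or the larger write would not fit in the
 * space reserved for the bitmap. */
static uint32_t demo_last_page_io_size(uint32_t last_page_bytes,
                                       uint32_t logical_block,
                                       uint32_t optimal_io,
                                       uint32_t room_bytes)
{
    if (optimal_io != 0) {
        uint32_t want = demo_round_up(last_page_bytes, optimal_io);

        if (want <= room_bytes)
            return want;
    }
    return demo_round_up(last_page_bytes, logical_block);
}
```

With a 3584-byte last page, 512-byte logical blocks and a 4096-byte optimal size, the write grows to 4096 bytes when there is room, which corresponds to the 3584 to 4096 change visible in the biosnoop output above.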