1. 25 Jun, 2023 1 commit
  2. 23 Jun, 2023 10 commits
    • bcache: Fix bcache device claiming · 2c555598
      Jan Kara authored
      Commit 2736e8ee ("block: use the holder as indication for exclusive
      opens") changed blkdev_put() to take the exclusive holder of the bdev
      as an argument. However, it overlooked that register_bdev() and
      register_cache() overwrite the bdev->bd_holder field in the block
      device to point to the real owning object, which was not available
      when blkdev_get_by_path() was called. Messing with bdev internals like
      this is a layering violation, and it also causes blkdev_put() to issue
      a warning about mismatching holders.
      
      Fix bcache to reopen the block device with the appropriate holder once
      it is available. This also restores the behavior that multiple bcache
      caches cannot claim the same device, which was broken by commit
      29499ab0 ("bcache: don't pass a stack address to blkdev_get_by_path").
      
      Fixes: 2736e8ee ("block: use the holder as indication for exclusive opens")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
      Acked-by: Coly Li <colyli@suse.de>
      Link: https://lore.kernel.org/r/20230622164658.12861-2-jack@suse.cz
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
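The holder rule described in the commit above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `BlockDevice`, `close_after_overwrite` and `close_after_reopen` are hypothetical names, not kernel or bcache APIs, and the model ignores everything except the "put must match get" check.

```python
class BlockDevice:
    """Toy model of a block device that tracks one exclusive holder."""

    def __init__(self, name):
        self.name = name
        self.holder = None   # object that currently claims the device
        self.warnings = []   # stands in for kernel warning output

    def get(self, holder):
        """Claim the device; fail if a different holder already claims it."""
        if self.holder is not None and self.holder is not holder:
            raise PermissionError(f"{self.name} is busy")
        self.holder = holder

    def put(self, holder):
        """Release the device; record a warning on a holder mismatch."""
        if self.holder is not holder:
            self.warnings.append("holder mismatch")
        self.holder = None


def close_after_overwrite(bdev, temp_holder, real_holder):
    """The problematic pattern: overwrite the holder field behind get()/put()."""
    bdev.get(temp_holder)
    bdev.holder = real_holder   # pokes at internals (layering violation)
    bdev.put(temp_holder)       # put() no longer sees the holder it expects
    return bdev.warnings


def close_after_reopen(bdev, temp_holder, real_holder):
    """The reopen pattern: every get() is paired with a matching put()."""
    bdev.get(temp_holder)
    bdev.put(temp_holder)
    bdev.get(real_holder)       # reopen once the real owner exists
    bdev.put(real_holder)
    return bdev.warnings
```

In this model a second `get()` with a different holder raises `PermissionError`, which mirrors the "multiple caches cannot claim the same device" behavior the commit message says is restored.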
    • bcache: Alloc holder object before async registration · abcc0cbd
      Jan Kara authored
      Allocate the holder object (cache or cached_dev) before offloading the
      rest of the startup to async work. This will allow us to open the
      block device with the proper holder.

      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Coly Li <colyli@suse.de>
      Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
      Link: https://lore.kernel.org/r/20230622164658.12861-1-jack@suse.cz
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge tag 'md-next-20230623' of... · c36591f6
      Jens Axboe authored
      Merge tag 'md-next-20230623' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.5/block-late
      
      Pull MD fixes from Song.
      
      * tag 'md-next-20230623' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        raid10: avoid spin_lock from fastpath from raid10_unplug()
        md: fix 'delete_mutex' deadlock
        md: use mddev->external to select holder in export_rdev()
        md/raid1-10: fix casting from randomized structure in raid1_submit_write()
        md/raid10: fix the condition to call bio_end_io_acct()
    • raid10: avoid spin_lock from fastpath from raid10_unplug() · a8d5fdd4
      Yu Kuai authored
      Commit 0c0be98b ("md/raid10: prevent unnecessary calls to wake_up()
      in fast path") missed one place, for example, with:
      
      	fio -direct=1 -rw=write/randwrite -iodepth=1 ...
      
      Plug and unplug are called for each I/O, so the wake_up() call in
      raid10_unplug() causes lock contention as well.

      Avoid this contention by using wake_up_barrier() instead of wake_up();
      it does not take the spin_lock if the waitqueue is empty.
      
      Fio test script:
      
      [global]
      name=random reads and writes
      ioengine=libaio
      direct=1
      readwrite=randrw
      rwmixread=70
      iodepth=64
      buffered=0
      filename=/dev/md0
      size=1G
      runtime=30
      time_based
      randrepeat=0
      norandommap
      refill_buffers
      ramp_time=10
      bs=4k
      numjobs=400
      group_reporting=1
      [job1]
      
      Test results with ramdisk raid10 (by Ali):

                Before this patch    With this patch
      READ      IOPS=2033k           IOPS=3642k
      WRITE     IOPS=871k            IOPS=1561k
      
      By the way, in this scenario blk_plug_cb() will be allocated and freed
      for each I/O, which seems worth optimizing as well.

      Reported-and-tested-by: Ali Gholami Rudi <aligrudi@gmail.com>
      Closes: https://lore.kernel.org/all/20231606122233@laper.mirepesht/
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230621105728.1268542-1-yukuai1@huaweicloud.com
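The idea behind the raid10 change above (skip the lock when nobody is waiting) can be sketched in a few lines. This is a toy model, not the kernel code: `WaitQueue` and `wake_up_if_waiters` are hypothetical names, and the sketch ignores the memory-ordering concerns that a real lockless "is the queue empty" check has to handle.

```python
import threading


class WaitQueue:
    """Toy wait queue that counts how often its lock is taken."""

    def __init__(self):
        self.lock = threading.Lock()
        self.waiters = []
        self.lock_acquisitions = 0

    def wake_up(self):
        """Always takes the lock, even when nobody is waiting."""
        with self.lock:
            self.lock_acquisitions += 1
            woken = len(self.waiters)
            self.waiters.clear()
        return woken

    def wake_up_if_waiters(self):
        """Fast path: skip the lock entirely when the queue looks empty."""
        if not self.waiters:
            return 0
        return self.wake_up()
```

With an empty queue, calling `wake_up_if_waiters()` once per I/O never touches the lock, while `wake_up()` takes it every time. That per-I/O lock traffic is the contention the commit message describes for the iodepth=1 case.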
    • md: fix 'delete_mutex' deadlock · 4934b640
      Yu Kuai authored
      Commit 3ce94ce5 ("md: fix duplicate filename for rdev") introduced a
      new lock, 'delete_mutex', which triggers a new deadlock:
      
      t1: remove rdev			t2: sysfs writer
      
      rdev_attr_store			rdev_attr_store
       mddev_lock
       state_store
       md_kick_rdev_from_array
        lock delete_mutex
        list_add mddev->deleting
        unlock delete_mutex
       mddev_unlock
      				 mddev_lock
      				 ...
        lock delete_mutex
        kobject_del
        // wait for sysfs writers to be done
      				 mddev_unlock
      				 lock delete_mutex
      				 // wait for delete_mutex, deadlock
      
      'delete_mutex' is used to protect the list 'mddev->deleting'. It turns
      out that this list can be protected by 'reconfig_mutex' directly, so
      'delete_mutex' can be removed.

      Fix this problem by removing the lock and using 'reconfig_mutex' to
      protect the list. mddev_unlock() will move this list to a local list to
      be handled after 'reconfig_mutex' is dropped.
      
      Fixes: 3ce94ce5 ("md: fix duplicate filename for rdev")
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230621142933.1395629-1-yukuai1@huaweicloud.com
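The "move the list to a local list, then handle it after the lock is dropped" pattern from the commit above can be sketched as follows. This is a simplified model under stated assumptions: the `Mddev` class and its methods are hypothetical stand-ins, and appending to `deleted` stands in for the slow step (kobject_del() in the commit message).

```python
import threading


class Mddev:
    """Toy model: one mutex protects the list of devices pending deletion."""

    def __init__(self):
        self.reconfig_mutex = threading.Lock()
        self.deleting = []   # protected by reconfig_mutex
        self.deleted = []    # record of completed deletions

    def lock(self):
        self.reconfig_mutex.acquire()

    def kick_rdev(self, rdev):
        """Queue an rdev for deletion; the caller must hold reconfig_mutex."""
        if not self.reconfig_mutex.locked():
            raise RuntimeError("reconfig_mutex must be held")
        self.deleting.append(rdev)

    def unlock(self):
        """Detach the pending list under the lock, process it after release."""
        pending, self.deleting = self.deleting, []
        self.reconfig_mutex.release()
        for rdev in pending:
            # The slow step runs with no lock held, so a concurrent
            # writer that needs reconfig_mutex is never blocked by it.
            self.deleted.append(rdev)
```

Because the pending entries are detached while the mutex is still held, no second lock is needed to protect the list, and the deletion work never waits on a lock that a sysfs writer might hold.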
    • md: use mddev->external to select holder in export_rdev() · a1d76719
      Song Liu authored
      mdadm test "10ddf-create-fail-rebuild" triggers warnings like the following:
      
      [  215.526357] ------------[ cut here ]------------
      [  215.527243] WARNING: CPU: 18 PID: 1264 at block/bdev.c:617 blkdev_put+0x269/0x350
      [  215.528334] Modules linked in:
      [  215.528806] CPU: 18 PID: 1264 Comm: mdmon Not tainted 6.4.0-rc2+ #768
      [  215.529863] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      [  215.531464] RIP: 0010:blkdev_put+0x269/0x350
      [  215.532167] Code: ff ff 49 8d 7d 10 e8 56 bf b8 ff 4d 8b 65 10 49 8d bc
      24 58 05 00 00 e8 05 be b8 ff 41 83 ac 24 58 05 00 00 01 e9 44 ff ff ff
      <0f> 0b e9 52 fe ff ff 0f 0b e9 6b fe ff ff1
      [  215.534780] RSP: 0018:ffffc900040bfbf0 EFLAGS: 00010283
      [  215.535635] RAX: ffff888174001000 RBX: ffff88810b1c3b00 RCX: ffffffff819a4061
      [  215.536645] RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88810b1c3ba0
      [  215.537657] RBP: ffff88810dbde800 R08: fffffbfff0fca983 R09: fffffbfff0fca983
      [  215.538674] R10: ffffc900040bfbf0 R11: fffffbfff0fca982 R12: ffff88810b1c3b38
      [  215.539687] R13: ffff88810b1c3b10 R14: ffff88810dbdecb8 R15: ffff88810b1c3b00
      [  215.540833] FS:  00007f2aabdff700(0000) GS:ffff888dfb400000(0000) knlGS:0000000000000000
      [  215.541961] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  215.542775] CR2: 00007fa19a85d934 CR3: 000000010c076006 CR4: 0000000000370ee0
      [  215.543814] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  215.544840] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  215.545885] Call Trace:
      [  215.546257]  <TASK>
      [  215.546608]  export_rdev.isra.63+0x71/0xe0
      [  215.547338]  mddev_unlock+0x1b1/0x2d0
      [  215.547898]  array_state_store+0x28d/0x450
      [  215.548519]  md_attr_store+0xd7/0x150
      [  215.549059]  ? __pfx_sysfs_kf_write+0x10/0x10
      [  215.549702]  kernfs_fop_write_iter+0x1b9/0x260
      [  215.550351]  vfs_write+0x491/0x760
      [  215.550863]  ? __pfx_vfs_write+0x10/0x10
      [  215.551445]  ? __fget_files+0x156/0x230
      [  215.552053]  ksys_write+0xc0/0x160
      [  215.552570]  ? __pfx_ksys_write+0x10/0x10
      [  215.553141]  ? ktime_get_coarse_real_ts64+0xec/0x100
      [  215.553878]  do_syscall_64+0x3a/0x90
      [  215.554403]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
      [  215.555125] RIP: 0033:0x7f2aade11847
      [  215.555696] Code: c3 66 90 41 54 49 89 d4 55 48 89 f5 53 89 fb 48 83 ec
      10 e8 1b fd ff ff 4c 89 e2 48 89 ee 89 df 41 89 c0 b8 01 00 00 00 0f 05
      <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 448
      [  215.558398] RSP: 002b:00007f2aabdfeba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
      [  215.559516] RAX: ffffffffffffffda RBX: 0000000000000010 RCX: 00007f2aade11847
      [  215.560515] RDX: 0000000000000005 RSI: 0000000000438b8b RDI: 0000000000000010
      [  215.561512] RBP: 0000000000438b8b R08: 0000000000000000 R09: 00007f2aaecf0060
      [  215.562511] R10: 000000000e3ba40b R11: 0000000000000293 R12: 0000000000000005
      [  215.563647] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000c70750
      [  215.564693]  </TASK>
      [  215.565029] irq event stamp: 15979
      [  215.565584] hardirqs last  enabled at (15991): [<ffffffff811a7432>] __up_console_sem+0x52/0x60
      [  215.566806] hardirqs last disabled at (16000): [<ffffffff811a7417>] __up_console_sem+0x37/0x60
      [  215.568022] softirqs last  enabled at (15716): [<ffffffff8277a2db>] __do_softirq+0x3eb/0x531
      [  215.569239] softirqs last disabled at (15711): [<ffffffff810d8f45>] irq_exit_rcu+0x115/0x160
      [  215.570434] ---[ end trace 0000000000000000 ]---
      
      This means export_rdev() calls blkdev_put() with a different holder than
      the one used by blkdev_get_by_dev(). This is because
      mddev->major_version == -2 is not a good check for external metadata.
      Fix this by using mddev->external instead.
      
      Also, do not clear mddev->external in md_clean(), as the flag might be used
      later in export_rdev().
      
      Fixes: 2736e8ee ("block: use the holder as indication for exclusive opens")
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Song Liu <song@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20230617052405.305871-1-song@kernel.org
    • md/raid1-10: fix casting from randomized structure in raid1_submit_write() · b5a99602
      Yu Kuai authored
      The following build error is triggered when building with clang version
      17.0.0 and W=1 (this can't be reproduced with gcc 13.1.0):
      
      drivers/md/raid1-10.c:117:25: error: casting from randomized structure
      pointer type 'struct block_device *' to 'struct md_rdev *'
           117 |         struct md_rdev *rdev = (struct md_rdev *)bio->bi_bdev;
               |                                ^
      
      Fix this by casting 'bio->bi_bdev' to 'void *', as it used to be.

      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202306142042.fmjfmTF8-lkp@intel.com/
      Fixes: 8295efbe ("md/raid1-10: factor out a helper to submit normal write")
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230616012136.3047071-1-yukuai1@huaweicloud.com
    • md/raid10: fix the condition to call bio_end_io_acct() · 125bfc7c
      Li Nan authored
      /sys/block/[device]/queue/iostats controls whether I/O statistics are
      collected. Writing 0 to it clears QUEUE_FLAG_IO_STAT in queue_flags,
      which disables iostats. If iostats is disabled and later re-enabled,
      I/O issued in between is accounted incorrectly and inflight can drop
      to -1.
      
        //T1 set iostats
        echo 0 > /sys/block/md0/queue/iostats
         clear QUEUE_FLAG_IO_STAT
      
      			//T2 issue io
      			if (QUEUE_FLAG_IO_STAT) -> false
      			 bio_start_io_acct
      			  inflight++
      
        echo 1 > /sys/block/md0/queue/iostats
         set QUEUE_FLAG_IO_STAT
      
      					//T3 io end
      					if (QUEUE_FLAG_IO_STAT) -> true
      					 bio_end_io_acct
      					  inflight--	-> -1
      
      Also, if iostats is enabled when the I/O is issued but disabled when it
      ends, inflight is never decreased.

      Fix it by checking start_time when the I/O ends. If start_time is not 0,
      call bio_end_io_acct().
      
      Fixes: 528bc2cf ("md/raid10: enable io accounting")
      Signed-off-by: Li Nan <linan122@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230609094320.2397604-1-linan666@huaweicloud.com
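The accounting race described in the commit above, and the start_time check that avoids it, can be reproduced with a small model. This is a sketch under stated assumptions: `Queue`, `Bio`, `start_io`, `end_io_buggy`, `end_io_fixed` and `run` are hypothetical names, and `io_stat` stands in for QUEUE_FLAG_IO_STAT. The timestamp passed to `start_io` must be nonzero, since 0 is used to mean "accounting never started".

```python
class Queue:
    def __init__(self):
        self.io_stat = True   # stands in for QUEUE_FLAG_IO_STAT
        self.inflight = 0


class Bio:
    def __init__(self):
        self.start_time = 0   # 0 means accounting was never started


def start_io(q, bio, now):
    """Start accounting only if iostats is enabled at issue time."""
    if q.io_stat:
        bio.start_time = now
        q.inflight += 1


def end_io_buggy(q, bio):
    """Re-checks the flag, which may have changed since start_io()."""
    if q.io_stat:
        q.inflight -= 1


def end_io_fixed(q, bio):
    """End accounting exactly when it was started for this bio."""
    if bio.start_time:
        q.inflight -= 1


def run(end_io, enabled_at_start, enabled_at_end):
    """Issue one I/O with the flag toggled in between; return inflight."""
    q, bio = Queue(), Bio()
    q.io_stat = enabled_at_start
    start_io(q, bio, now=100)
    q.io_stat = enabled_at_end
    end_io(q, bio)
    return q.inflight
```

Calling `run(end_io_buggy, False, True)` leaves inflight at -1 and `run(end_io_buggy, True, False)` leaves it at 1, matching the two failure cases in the commit message. Both cases return 0 with `end_io_fixed`.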
    • scsi/sg: don't grab scsi host module reference · fcaa174a
      Yu Kuai authored
      To prevent the request_queue from being freed before the blktrace
      debugfs entries are cleaned up, commit db59133e ("scsi: sg: fix blktrace
      debugfs entries leakage") uses scsi_device_get(). However,
      scsi_device_get() also grabs a reference to the scsi host module, so
      that module can't be removed.
      
      It's reported that blktests can't unload scsi_debug after block/001:
      
      blktests (master) # ./check block
      block/001 (stress device hotplugging) [failed]
           +++ /root/blktests/results/nodev/block/001.out.bad 2023-06-19
            Running block/001
            Stressing sd
           +modprobe: FATAL: Module scsi_debug is in use.
      
      Fix this problem by grabbing a request_queue reference directly, so that
      the scsi host module can still be unloaded while the request_queue is
      pinned by the sg device.

      Reported-by: Chaitanya Kulkarni <chaitanyak@nvidia.com>
      Link: https://lore.kernel.org/all/1760da91-876d-fc9c-ab51-999a6f66ad50@nvidia.com/
      Fixes: db59133e ("scsi: sg: fix blktrace debugfs entries leakage")
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20230621160111.1433521-1-yukuai1@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • ext4: Fix warning in blkdev_put() · a42fb5a7
      Jan Kara authored
      ext4_blkdev_remove() passes a wrong holder pointer to blkdev_put() which
      triggers a warning there. Fix it.
      
      Fixes: 2736e8ee ("block: use the holder as indication for exclusive opens")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20230622165107.13687-1-jack@suse.cz
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 22 Jun, 2023 2 commits
  4. 21 Jun, 2023 7 commits
  5. 20 Jun, 2023 11 commits
  6. 16 Jun, 2023 7 commits
  7. 15 Jun, 2023 2 commits
    • bcache: fixup btree_cache_wait list damage · f0854489
      Mingzhe Zou authored
      We get a kernel crash about "list_add corruption. next->prev should be
      prev (ffff9c801bc01210), but was ffff9c77b688237c.
      (next=ffffae586d8afe68)."
      
      crash> struct list_head 0xffff9c801bc01210
      struct list_head {
        next = 0xffffae586d8afe68,
        prev = 0xffffae586d8afe68
      }
      crash> struct list_head 0xffff9c77b688237c
      struct list_head {
        next = 0x0,
        prev = 0x0
      }
      crash> struct list_head 0xffffae586d8afe68
      struct list_head struct: invalid kernel virtual address: ffffae586d8afe68  type: "gdb_readmem_callback"
      Cannot access memory at address 0xffffae586d8afe68
      
      [230469.019492] Call Trace:
      [230469.032041]  prepare_to_wait+0x8a/0xb0
      [230469.044363]  ? bch_btree_keys_free+0x6c/0xc0 [escache]
      [230469.056533]  mca_cannibalize_lock+0x72/0x90 [escache]
      [230469.068788]  mca_alloc+0x2ae/0x450 [escache]
      [230469.080790]  bch_btree_node_get+0x136/0x2d0 [escache]
      [230469.092681]  bch_btree_check_thread+0x1e1/0x260 [escache]
      [230469.104382]  ? finish_wait+0x80/0x80
      [230469.115884]  ? bch_btree_check_recurse+0x1a0/0x1a0 [escache]
      [230469.127259]  kthread+0x112/0x130
      [230469.138448]  ? kthread_flush_work_fn+0x10/0x10
      [230469.149477]  ret_from_fork+0x35/0x40
      
      bch_btree_check_thread() and bch_dirty_init_thread() may call
      mca_cannibalize() to cannibalize other cached btree nodes. Only one thread
      can do it at a time, so the op of other threads will be added to the
      btree_cache_wait list.
      
      We must call finish_wait() to remove op from btree_cache_wait before
      freeing its memory. Otherwise, the list will be damaged. We should also
      call bch_cannibalize_unlock() to release the btree_cache_alloc_lock and
      wake up other waiters.
      
      Fixes: 8e710227 ("bcache: make bch_btree_check() to be multithreaded")
      Fixes: b144e45f ("bcache: make bch_sectors_dirty_init() to be multithreaded")
      Cc: stable@vger.kernel.org
      Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
      Signed-off-by: Coly Li <colyli@suse.de>
      Link: https://lore.kernel.org/r/20230615121223.22502-7-colyli@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
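The cleanup rule from the commit above (take the waiter off the list and release the allocation lock before its memory goes away) can be modeled roughly as follows. This is a loose sketch: Python has no manual memory management, so a `freed` flag stands in for freed memory, and `Op`, `BtreeCache`, `worker_exit` and `dangling` are hypothetical names rather than bcache functions.

```python
class Op:
    """Stands in for a per-thread op whose memory is freed on exit."""

    def __init__(self, name):
        self.name = name
        self.freed = False


class BtreeCache:
    def __init__(self):
        self.wait_list = []       # stands in for btree_cache_wait
        self.alloc_owner = None   # stands in for btree_cache_alloc_lock

    def cannibalize_lock(self, op):
        """Only one op may cannibalize; the others queue on the wait list."""
        if self.alloc_owner is None:
            self.alloc_owner = op
            return True
        if op not in self.wait_list:
            self.wait_list.append(op)
        return False

    def cannibalize_unlock(self, op):
        if self.alloc_owner is op:
            self.alloc_owner = None

    def finish_wait(self, op):
        if op in self.wait_list:
            self.wait_list.remove(op)

    def dangling(self):
        """Names of wait-list entries that point at freed ops."""
        return [op.name for op in self.wait_list if op.freed]


def worker_exit(cache, op, cleanup):
    """Thread exit path; with cleanup=False the op is freed while queued."""
    if cleanup:
        cache.finish_wait(op)
        cache.cannibalize_unlock(op)
    op.freed = True
```

Without the cleanup step, `dangling()` reports an entry that still sits on the wait list after its op was freed. In this model that corresponds to the list damage reported in the crash above.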
    • bcache: Fix __bch_btree_node_alloc to make the failure behavior consistent · 80fca8a1
      Zheng Wang authored
      In some specific situations, __bch_btree_node_alloc() may return NULL.
      This may lead to a NULL pointer dereference in callers, for example
      through the call chain:
      btree_split->bch_btree_node_alloc->__bch_btree_node_alloc.
      
      Fix it by initializing the return value in __bch_btree_node_alloc.
      
      Fixes: cafe5635 ("bcache: A block layer cache")
      Cc: stable@vger.kernel.org
      Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Link: https://lore.kernel.org/r/20230615121223.22502-6-colyli@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>