Commits · 28be4fd310d146e9a43d7b1bb55cb7e9f5e06e88 · Kirill Smelkov / linux

06 Mar, 2024 2 commits

md: add a mddev_add_trace_msg helper · 28be4fd3

Christoph Hellwig authored Mar 03, 2024

Add a small wrapper around blk_add_trace_msg that hides some argument
dereferences and the check for a DM-mapped MD device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed--by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-3-hch@lst.de

28be4fd3

md: add a mddev_trace_remap helper · c396b90e

Christoph Hellwig authored Mar 03, 2024

Add a helper to trace bio remapping that hides some argument
dereferences and the check for a DM-mapped MD device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed--by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-2-hch@lst.de

c396b90e

05 Mar, 2024 11 commits

Merge branch 'dmraid-fix-6.9' into md-6.9 · 3a889fdc

Song Liu authored Mar 05, 2024

This is the second half of fixes for dmraid. The first half is available
at [1].

This set contains fixes:
 - reshape can start unexpected, cause data corruption, patch 1,5,6;
 - deadlocks that reshape concurrent with IO, patch 8;
 - a lockdep warning, patch 9;

For all the dmraid related tests in lvm2 suite, there is no new
regressions compared against 6.6 kernels (which is good baseline before
recent regressions).

[1] https://lore.kernel.org/all/CAPhsuW7u1UKHCDOBDhD7DzOVtkGemDz_QnJ4DUq_kSN-Q3G66Q@mail.gmail.com/

* dmraid-fix-6.9:
  dm-raid: fix lockdep waring in "pers->hot_add_disk"
  dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape
  dm-raid: add a new helper prepare_suspend() in md_personality
  md/dm-raid: don't call md_reap_sync_thread() directly
  dm-raid: really frozen sync_thread during suspend
  md: add a new helper reshape_interrupted()
  md: export helper md_is_rdwr()
  md: export helpers to stop sync_thread
  md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resume

3a889fdc

dm-raid: fix lockdep waring in "pers->hot_add_disk" · 95009ae9

Yu Kuai authored Mar 05, 2024

The lockdep assert is added by commit a448af25 ("md/raid10: remove
rcu protection to access rdev from conf") in print_conf(). And I didn't
notice that dm-raid is calling "pers->hot_add_disk" without holding
'reconfig_mutex'.

"pers->hot_add_disk" read and write many fields that is protected by
'reconfig_mutex', and raid_resume() already grab the lock in other
contex. Hence fix this problem by protecting "pers->host_add_disk"
with the lock.

Fixes: 9092c02d ("DM RAID: Add ability to restore transiently failed devices on resume")
Fixes: a448af25 ("md/raid10: remove rcu protection to access rdev from conf")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-10-yukuai1@huaweicloud.com

95009ae9

dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape · 41425f96

Yu Kuai authored Mar 05, 2024

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
   'reconfig_mutex', hence it can be cleared and reshape can continue by
   sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com

41425f96

dm-raid: add a new helper prepare_suspend() in md_personality · 5625ff8b

Yu Kuai authored Mar 05, 2024

There are no functional changes for now, prepare to fix a deadlock for
dm-raid456.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-8-yukuai1@huaweicloud.com

5625ff8b

md/dm-raid: don't call md_reap_sync_thread() directly · cd32b27a

Yu Kuai authored Mar 05, 2024

Currently md_reap_sync_thread() is called from raid_message() directly
without holding 'reconfig_mutex', this is definitely unsafe because
md_reap_sync_thread() can change many fields that is protected by
'reconfig_mutex'.

However, hold 'reconfig_mutex' here is still problematic because this
will cause deadlock, for example, commit 130443d6 ("md: refactor
idle/frozen_sync_thread() to fix deadlock").

Fix this problem by using stop_sync_thread() to unregister sync_thread,
like md/raid did.

Fixes: be83651f ("DM RAID: Add message/status support for changing sync action")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-7-yukuai1@huaweicloud.com

cd32b27a

dm-raid: really frozen sync_thread during suspend · 16c4770c

Yu Kuai authored Mar 05, 2024

1) commit f52f5c71 ("md: fix stopping sync thread") remove
   MD_RECOVERY_FROZEN from __md_stop_writes() and doesn't realize that
   dm-raid relies on __md_stop_writes() to frozen sync_thread
   indirectly. Fix this problem by adding MD_RECOVERY_FROZEN in
   md_stop_writes(), and since stop_sync_thread() is only used for
   dm-raid in this case, also move stop_sync_thread() to
   md_stop_writes().
2) The flag MD_RECOVERY_FROZEN doesn't mean that sync thread is frozen,
   it only prevent new sync_thread to start, and it can't stop the
   running sync thread; In order to frozen sync_thread, after seting the
   flag, stop_sync_thread() should be used.
3) The flag MD_RECOVERY_FROZEN doesn't mean that writes are stopped, use
   it as condition for md_stop_writes() in raid_postsuspend() doesn't
   look correct. Consider that reentrant stop_sync_thread() do nothing,
   always call md_stop_writes() in raid_postsuspend().
4) raid_message can set/clear the flag MD_RECOVERY_FROZEN at anytime,
   and if MD_RECOVERY_FROZEN is cleared while the array is suspended,
   new sync_thread can start unexpected. Fix this by disallow
   raid_message() to change sync_thread status during suspend.

Note that after commit f52f5c71 ("md: fix stopping sync thread"), the
test shell/lvconvert-raid-reshape.sh start to hang in stop_sync_thread(),
and with previous fixes, the test won't hang there anymore, however, the
test will still fail and complain that ext4 is corrupted. And with this
patch, the test won't hang due to stop_sync_thread() or fail due to ext4
is corrupted anymore. However, there is still a deadlock related to
dm-raid456 that will be fixed in following patches.
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Closes: https://lore.kernel.org/all/e5e8afe2-e9a8-49a2-5ab0-958d4065c55e@redhat.com/
Fixes: 1af2048a ("dm raid: fix deadlock caused by premature md_stop_writes()")
Fixes: 9dbd1aa3 ("dm raid: add reshaping support to the target")
Fixes: f52f5c71 ("md: fix stopping sync thread")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-6-yukuai1@huaweicloud.com

16c4770c

md: add a new helper reshape_interrupted() · 503f9d43

Yu Kuai authored Mar 05, 2024

The helper will be used for dm-raid456 later to detect the case that
reshape can't make progress.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-5-yukuai1@huaweicloud.com

503f9d43

md: export helper md_is_rdwr() · 314e9af0

Yu Kuai authored Mar 05, 2024

There are no functional changes for now, prepare to fix a deadlock for
dm-raid456.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-4-yukuai1@huaweicloud.com

314e9af0

md: export helpers to stop sync_thread · 7a2347e2

Yu Kuai authored Mar 05, 2024

Add new helpers:

  void md_idle_sync_thread(struct mddev *mddev);
  void md_frozen_sync_thread(struct mddev *mddev);
  void md_unfrozen_sync_thread(struct mddev *mddev);

The helpers will be used in dm-raid in later patches to fix regressions
and prevent calling md_reap_sync_thread() directly.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-3-yukuai1@huaweicloud.com

7a2347e2

md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resume · 2f03d0c2

Yu Kuai authored Mar 05, 2024

After commit 9dbd1aa3 ("dm raid: add reshaping support to the
target") raid_ctr() will set MD_RECOVERY_FROZEN before md_run() and
expect to keep array frozen until resume. However, md_run() will clear
the flag by setting mddev->recovery to 0.

Before commit 1baae052 ("md: Don't ignore suspended array in
md_check_recovery()"), dm-raid actually relied on suspending to prevent
starting new sync_thread.

Fix this problem by keeping 'MD_RECOVERY_FROZEN' for dm-raid in
md_run().

Fixes: 1baae052 ("md: Don't ignore suspended array in md_check_recovery()")
Fixes: 9dbd1aa3 ("dm raid: add reshaping support to the target")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-2-yukuai1@huaweicloud.com

2f03d0c2

Revert "Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"" · 3445139e

Song Liu authored Jan 25, 2024

This reverts commit bed9e27b.

The original set [1][2] was expected to undo a suboptimal fix in [2], and
replace it with a better fix [1]. However, as reported by Dan Moulding [2]
causes an issue with raid5 with journal device.

Revert [2] for now to close the issue. We will follow up on another issue
reported by Juxiao Bi, as [2] is expected to fix it. We believe this is a
good trade-off, because the latter issue happens less freqently.

In the meanwhile, we will NOT revert [1], as it contains the right logic.

[1] commit d6e035aa ("md: bypass block throttle for superblock update")
[2] commit bed9e27b ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"")
Reported-by: Dan Moulding <dan@danm.net>
Closes: https://lore.kernel.org/linux-raid/20240123005700.9302-1-dan@danm.net/
Fixes: bed9e27b ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"")
Cc: stable@vger.kernel.org # v5.19+
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240125082131.788600-1-song@kernel.org

3445139e

01 Mar, 2024 20 commits

nbd: use the atomic queue limits API in nbd_set_size · 26828324

Christoph Hellwig authored Feb 29, 2024

Use queue_limits_start_update / queue_limits_commit_update to update
all the limits in one go and with proper sanity checking.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240229143846.1047223-4-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

26828324

nbd: freeze the queue for queue limits updates · 242a49e5

Christoph Hellwig authored Feb 29, 2024

nbd currently updates the logical and physical block sizes as well
as the discard_sectors on a live queue. Freeze the queue first to
make sure there are not commands in flight that can see torn or
inconsistent limits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240229143846.1047223-3-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

242a49e5

nbd: don't clear discard_sectors in nbd_config_put · 7ea201f2

Christoph Hellwig authored Feb 29, 2024

nbd_config_put currently clears discard_sectors when unusing a device.
This is pretty odd behavior and different from the sector size
configuration which is simply left in places and then reconfigured when
nbd_set_size is as part of configuring the device. Change nbd_set_size
to clear discard_sectors if discard is not supported so that all the
queue limits changes are handled in one place.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240229143846.1047223-2-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

7ea201f2

pktcdvd: don't set max_hw_sectors on the underlying device · eabf5dfc

Christoph Hellwig authored Feb 29, 2024

pktcdvd sets max_hw_sectors on the queue of the underlying device that
it doesn't own (and doesn't reset it ever) since the driver was merged.
This can create all kinds of problems as the underlying driver doesn't
even know about it changing the limit.

As the state purpose is to not create I/Os larger than a single frame,
and pktcdvd never builds bios larger than that, just set REQ_NOMERGE
on the bios it submits so that largers I/Os never get built.

Note: I don't have packet writing hardware, so this is compile tested
only.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240229144408.1047967-1-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

eabf5dfc

dm: use queue_limits_set · 8e0ef412

Christoph Hellwig authored Feb 28, 2024

Use queue_limits_set which validates the limits and takes care of
updating the readahead settings instead of directly assigning them to
the queue.  For that make sure all limits are actually updated before
the assignment.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20240228225653.947152-4-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

8e0ef412

block: add a queue_limits_stack_bdev helper · c1373f1c

Christoph Hellwig authored Feb 28, 2024

Add a small wrapper around blk_stack_limits that allows passing a bdev
for the bottom device and prints an error in case of misaligned
device. The name fits into the new queue limits API and the intent is
to eventually replace disk_stack_limits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240228225653.947152-3-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

c1373f1c

block: add a queue_limits_set helper · 631d4efb

Christoph Hellwig authored Feb 28, 2024

Add a small wrapper around queue_limits_commit_update for stacking
drivers that don't want to update existing limits, but set an
entirely new set.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240228225653.947152-2-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

631d4efb

Merge tag 'md-6.9-20240301' of... · 86b1e613

Jens Axboe authored Mar 01, 2024

Merge tag 'md-6.9-20240301' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block

Pull MD updates from Song:

"The major changes are:

 1. Refactor raid1 read_balance, by Yu Kuai and Paul Luse.
 2. Clean up and fix for md_ioctl, by Li Nan.
 3. Other small fixes, by Gui-Dong Han and Heming Zhao."

* tag 'md-6.9-20240301' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (22 commits)
  md/raid1: factor out helpers to choose the best rdev from read_balance()
  md/raid1: factor out the code to manage sequential IO
  md/raid1: factor out choose_bb_rdev() from read_balance()
  md/raid1: factor out choose_slow_rdev() from read_balance()
  md/raid1: factor out read_first_rdev() from read_balance()
  md/raid1-10: factor out a new helper raid1_should_read_first()
  md/raid1-10: add a helper raid1_check_read_range()
  md/raid1: fix choose next idle in read_balance()
  md/raid1: record nonrot rdevs while adding/removing rdevs to conf
  md/raid1: factor out helpers to add rdev to conf
  md: add a new helper rdev_has_badblock()
  md/raid5: fix atomicity violation in raid5_cache_count
  md/md-bitmap: fix incorrect usage for sb_index
  md: check mddev->pers before calling md_set_readonly()
  md: clean up openers check in do_md_stop() and md_set_readonly()
  md: sync blockdev before stopping raid or setting readonly
  md: factor out a helper to sync mddev
  md: Don't clear MD_CLOSING when the raid is about to stop
  md: return directly before setting did_set_md_closing
  md: clean up invalid BUG_ON in md_ioctl
  ...

86b1e613

Merge branch 'raid1-read_balance' into md-6.9 · e81faa91

Song Liu authored Feb 29, 2024

From: Yu Kuai <yukuai3@huawei.com>
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>

The original idea is that Paul want to optimize raid1 read
performance([1]), however, we think that the original code for
read_balance() is quite complex, and we don't want to add more
complexity. Hence we decide to refactor read_balance() first, to make
code cleaner and easier for follow up.

Before this patchset, read_balance() has many local variables and many
branches, it want to consider all the scenarios in one iteration. The
idea of this patch is to divide them into 4 different steps:

1) If resync is in progress, find the first usable disk, patch 5;
Otherwise:
2) Loop through all disks and skipping slow disks and disks with bad
blocks, choose the best disk, patch 10. If no disk is found:
3) Look for disks with bad blocks and choose the one with most number of
sectors, patch 8. If no disk is found:
4) Choose first found slow disk with no bad blocks, or slow disk with
most number of sectors, patch 7.

Note that step 3) and step 4) are super code path, and performance
should not be considered.

And after this patchset, we'll continue to optimize read_balance for
step 2), specifically how to choose the best rdev to read.

[1] https://lore.kernel.org/all/20240102125115.129261-1-paul.e.luse@linux.intel.com/

Yu Kuai (11):
  md: add a new helper rdev_has_badblock()
  md/raid1: factor out helpers to add rdev to conf
  md/raid1: record nonrot rdevs while adding/removing rdevs to conf
  md/raid1: fix choose next idle in read_balance()
  md/raid1-10: add a helper raid1_check_read_range()
  md/raid1-10: factor out a new helper raid1_should_read_first()
  md/raid1: factor out read_first_rdev() from read_balance()
  md/raid1: factor out choose_slow_rdev() from read_balance()
  md/raid1: factor out choose_bb_rdev() from read_balance()
  md/raid1: factor out the code to manage sequential IO
  md/raid1: factor out helpers to choose the best rdev from
    read_balance()

e81faa91

md/raid1: factor out helpers to choose the best rdev from read_balance() · 0091c5a2

Yu Kuai authored Feb 29, 2024

The way that best rdev is chosen:

1) If the read is sequential from one rdev:
 - if rdev is rotational, use this rdev;
 - if rdev is non-rotational, use this rdev until total read length
   exceed disk opt io size;

2) If the read is not sequential:
 - if there is idle disk, use it, otherwise:
 - if the array has non-rotational disk, choose the rdev with minimal
   inflight IO;
 - if all the underlaying disks are rotational disk, choose the rdev
   with closest IO;

There are no functional changes, just to make code cleaner and prepare
for following refactor.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-12-yukuai1@huaweicloud.com

0091c5a2

md/raid1: factor out the code to manage sequential IO · ba58f57f

Yu Kuai authored Feb 29, 2024

There is no functional change for now, make read_balance() cleaner and
prepare to fix problems and refactor the handler of sequential IO.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-11-yukuai1@huaweicloud.com

ba58f57f

md/raid1: factor out choose_bb_rdev() from read_balance() · 9f3ced79

Yu Kuai authored Feb 29, 2024

read_balance() is hard to understand because there are too many status
and branches, and it's overlong.

This patch factor out the case to read the rdev with bad blocks from
read_balance(), there are no functional changes.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-10-yukuai1@huaweicloud.com

9f3ced79

md/raid1: factor out choose_slow_rdev() from read_balance() · dfa8ecd1

Yu Kuai authored Feb 29, 2024

read_balance() is hard to understand because there are too many status
and branches, and it's overlong.

This patch factor out the case to read the slow rdev from
read_balance(), there are no functional changes.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-9-yukuai1@huaweicloud.com

dfa8ecd1

md/raid1: factor out read_first_rdev() from read_balance() · 31a73331

Yu Kuai authored Feb 29, 2024

read_balance() is hard to understand because there are too many status
and branches, and it's overlong.

This patch factor out the case to read the first rdev from
read_balance(), there are no functional changes.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-8-yukuai1@huaweicloud.com

31a73331

md/raid1-10: factor out a new helper raid1_should_read_first() · f1092076

Yu Kuai authored Feb 29, 2024

If resync is in progress, read_balance() should find the first usable
disk, otherwise, data could be inconsistent after resync is done. raid1
and raid10 implement the same checking, hence factor out the checking
to make code cleaner.

Noted that raid1 is using 'mddev->recovery_cp', which is updated after
all resync IO is done, while raid10 is using 'conf->next_resync', which
is inaccurate because raid10 update it before submitting resync IO.
Fortunately, raid10 read IO can't concurrent with resync IO, hence there
is no problem. And this patch also switch raid10 to use
'mddev->recovery_cp'.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-7-yukuai1@huaweicloud.com

f1092076

md/raid1-10: add a helper raid1_check_read_range() · f29841ff

Yu Kuai authored Feb 29, 2024

The checking and handler of bad blocks appear many timers during
read_balance() in raid1 and raid10. This helper will be used in later
patches to simplify read_balance() a lot.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-6-yukuai1@huaweicloud.com

f29841ff

md/raid1: fix choose next idle in read_balance() · 257ac239

Yu Kuai authored Feb 29, 2024

Commit 12cee5a8 ("md/raid1: prevent merging too large request") add
the case choose next idle in read_balance():

read_balance:
 for_each_rdev
  if(next_seq_sect == this_sector || dist == 0)
  -> sequential reads
   best_disk = disk;
   if (...)
    choose_next_idle = 1
    continue;

 for_each_rdev
 -> iterate next rdev
  if (pending == 0)
   best_disk = disk;
   -> choose the next idle disk
   break;

  if (choose_next_idle)
   -> keep using this rdev if there are no other idle disk
   contine

However, commit 2e52d449 ("md/raid1: add failfast handling for reads.")
remove the code:

-               /* If device is idle, use it */
-               if (pending == 0) {
-                       best_disk = disk;
-                       break;
-               }

Hence choose next idle will never work now, fix this problem by
following:

1) don't set best_disk in this case, read_balance() will choose the best
   disk after iterating all the disks;
2) add 'pending' so that other idle disk will be chosen;
3) add a new local variable 'sequential_disk' to record the disk, and if
   there is no other idle disk, 'sequential_disk' will be chosen;

Fixes: 2e52d449 ("md/raid1: add failfast handling for reads.")
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-5-yukuai1@huaweicloud.com

257ac239

md/raid1: record nonrot rdevs while adding/removing rdevs to conf · 2c27d09d

Yu Kuai authored Feb 29, 2024

For raid1, each read will iterate all the rdevs from conf and check if
any rdev is non-rotational, then choose rdev with minimal IO inflight
if so, or rdev with closest distance otherwise.

Disk nonrot info can be changed through sysfs entry:

/sys/block/[disk_name]/queue/rotational

However, consider that this should only be used for testing, and user
really shouldn't do this in real life. Record the number of non-rotational
disks in conf, to avoid checking each rdev in IO fast path and simplify
read_balance() a little bit.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-4-yukuai1@huaweicloud.com

2c27d09d

md/raid1: factor out helpers to add rdev to conf · 969d6589

Yu Kuai authored Feb 29, 2024

There are no functional changes, just make code cleaner and prepare to
record disk non-rotational information while adding and removing rdev to
conf
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-3-yukuai1@huaweicloud.com

969d6589

md: add a new helper rdev_has_badblock() · 3a0f007b

Yu Kuai authored Feb 29, 2024

The current api is_badblock() must pass in 'first_bad' and
'bad_sectors', however, many caller just want to know if there are
badblocks or not, and these caller must define two local variable that
will never be used.

Add a new helper rdev_has_badblock() that will only return if there are
badblocks or not, remove unnecessary local variables and replace
is_badblock() with the new helper in many places.

There are no functional changes, and the new helper will also be used
later to refactor read_balance().
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-2-yukuai1@huaweicloud.com

3a0f007b

29 Feb, 2024 2 commits

ublk: add UBLK_CMD_DEL_DEV_ASYNC · 13fe8e68

Ming Lei authored Feb 23, 2024

The current command UBLK_CMD_DEL_DEV won't return until the device is
released, this way looks more reliable, but makes userspace more
difficult to implement, especially about orders: unmap command
buffer(which holds one ublkc reference), ublkc close,
io_uring_file_unregister, ublkb close.

Add UBLK_CMD_DEL_DEV_ASYNC so that device deletion won't wait release,
then userspace needn't worry about the above order. Actually both loop
and nbd is deleted in this async way.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240223075539.89945-3-ming.lei@redhat.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

13fe8e68

ublk: improve getting & putting ublk device · 1221b9e9

Ming Lei authored Feb 23, 2024

Firstly convert get_device() and put_device() into ublk_get_device()
and ublk_put_device().

Secondly annotate ublk_get_device() & ublk_put_device() as noinline
for trace, especially it is often to trigger device deletion hang
when incorrect order is used on ublkc mmap, ublkc close,
io_uring_sqe_unregister_file, ublkb close.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240223075539.89945-2-ming.lei@redhat.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

1221b9e9

28 Feb, 2024 1 commit

blk-mq: don't change nr_hw_queues and nr_maps for kdump kernel · ec30b461

Ming Lei authored Feb 28, 2024

For most of ARCHs, 'nr_cpus=1' is passed for kdump kernel, so
nr_hw_queues for each mapping is supposed to be 1 already.

More importantly, this way may cause trouble for driver, because blk-mq and
driver see different queue mapping since driver should setup hardware
queue setting before calling into allocating blk-mq tagset.

So not overriding nr_hw_queues and nr_maps for kdump kernel.

Cc: Wen Xiong <wenxiong@us.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240228040857.306483-1-ming.lei@redhat.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

ec30b461

27 Feb, 2024 4 commits

md/raid5: fix atomicity violation in raid5_cache_count · dfd2bf43

Gui-Dong Han authored Jan 12, 2024

In raid5_cache_count():
    if (conf->max_nr_stripes < conf->min_nr_stripes)
        return 0;
    return conf->max_nr_stripes - conf->min_nr_stripes;
The current check is ineffective, as the values could change immediately
after being checked.

In raid5_set_cache_size():
    ...
    conf->min_nr_stripes = size;
    ...
    while (size > conf->max_nr_stripes)
        conf->min_nr_stripes = conf->max_nr_stripes;
    ...

Due to intermediate value updates in raid5_set_cache_size(), concurrent
execution of raid5_cache_count() and raid5_set_cache_size() may lead to
inconsistent reads of conf->max_nr_stripes and conf->min_nr_stripes.
The current checks are ineffective as values could change immediately
after being checked, raising the risk of conf->min_nr_stripes exceeding
conf->max_nr_stripes and potentially causing an integer overflow.

This possible bug is found by an experimental static analysis tool
developed by our team. This tool analyzes the locking APIs to extract
function pairs that can be concurrently executed, and then analyzes the
instructions in the paired functions to identify possible concurrency bugs
including data races and atomicity violations. The above possible bug is
reported when our tool analyzes the source code of Linux 6.2.

To resolve this issue, it is suggested to introduce local variables
'min_stripes' and 'max_stripes' in raid5_cache_count() to ensure the
values remain stable throughout the check. Adding locks in
raid5_cache_count() fails to resolve atomicity violations, as
raid5_set_cache_size() may hold intermediate values of
conf->min_nr_stripes while unlocked. With this patch applied, our tool no
longer reports the bug, with the kernel configuration allyesconfig for
x86_64. Due to the lack of associated hardware, we cannot test the patch
in runtime testing, and just verify it according to the code logic.

Fixes: edbe83ab ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: stable@vger.kernel.org
Signed-off-by: Gui-Dong Han <2045gemini@gmail.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240112071017.16313-1-2045gemini@gmail.comSigned-off-by: Song Liu <song@kernel.org>

dfd2bf43

ubd: open the backing files in ubd_add · fb5d1d38

Christoph Hellwig authored Feb 22, 2024

Opening the backing device only when the block device is opened is
a bit weird as no one configures block devices to not use them.
Opend them at add time, close them at remove time and remove the
now superflous opened counter as remove can simply check for
disk_openers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20240222072417.3773131-8-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

fb5d1d38

ubd: remove the queue pointer in struct ubd · f3c17dcc

Christoph Hellwig authored Feb 22, 2024

No need for it now, everything goes through the gendisk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20240222072417.3773131-7-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

f3c17dcc

ubd: move set_disk_ro to ubd_add · 5e4e1ff8

Christoph Hellwig authored Feb 22, 2024

No need to delay this until open time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20240222072417.3773131-6-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

5e4e1ff8