Commits · 87c971de8157b90494490d7c869a21b7f2123305 · Kirill Smelkov / linux

30 Jan, 2023 12 commits

block, bfq: remove unnecessary goto tag in bfq_dispatch_rq_from_bfqq · 87c971de

Kemeng Shi authored Jan 16, 2023

We jump to tag only for returning current rq. Return directly to
remove this tag.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116095153.3810101-8-shikemeng@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

87c971de

block, bfq: remove redundant check in bfq_put_cooperator · 433d4b03

Kemeng Shi authored Jan 16, 2023

We have already avoided a circular list in bfq_setup_merge (see comments
in bfq_setup_merge() for details), so bfq_queue will not appear in it's
new_bfqq list. Just remove this check.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-7-shikemeng@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

433d4b03

block, bfq: remove unnecessary dereference to get async_bfqq · 86f8382e

Kemeng Shi authored Jan 16, 2023

The async_bfqq is assigned with bfqq->bic->bfqq[0], use it directly.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-6-shikemeng@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

86f8382e

block, bfq: use helper macro RQ_BFQQ to get bfqq of request · 8ac2e43c

Kemeng Shi authored Jan 16, 2023

Use helper macro RQ_BFQQ to get bfqq of request.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-5-shikemeng@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

8ac2e43c

block, bfq: initialize bfqq->decrease_time_jif correctly · 1c970450

Kemeng Shi authored Jan 16, 2023

Inject limit is updated or reset when time_is_before_eq_jiffies(
decrease_time_jif + several msecs) or think-time state changes.
decrease_time_jif is initialized to 0 and will be set to current jiffies
when inject limit is updated or reset. If the jiffies is slightly greater
than LONG_MAX, time_is_after_eq_jiffies(0) will keep for a long time, so as
time_is_after_eq_jiffies(decrease_time_jif + several msecs). If the
think-time state never chages, then the injection will not work as expected
for long time.

To be more specific:
Function bfq_update_inject_limit maybe triggered when jiffies pasts
decrease_time_jif + msecs_to_jiffies(10) in bfq_add_request by setting
bfqd->wait_dispatch to true.
Function bfq_reset_inject_limit are called in two conditions:
1. jiffies pasts bfqq->decrease_time_jif + msecs_to_jiffies(1000) in
function bfq_add_request.
2. jiffies pasts bfqq->decrease_time_jif + msecs_to_jiffies(100) or
bfq think-time state change from short to long.

Fix this by initializing bfqq->decrease_time_jif to current jiffies
to trigger service injection soon when service injection conditions
are met.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-4-shikemeng@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

1c970450

block, bfq: remove unsed parameter reason in bfq_bfqq_is_slow · bebeb9e5

Kemeng Shi authored Jan 16, 2023

Parameter reason is never used, just remove it.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-3-shikemeng@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

bebeb9e5

block, bfq: correctly raise inject limit in bfq_choose_bfqq_for_injection · 0c3e09e8

Kemeng Shi authored Jan 16, 2023

Function bfq_choose_bfqq_for_injection may temporarily raise inject limit
to one request if current inject_limit is 0 before search of the source
queue for injection. However the search below will reset inject limit to
bfqd->in_service_queue which is zero for raised inject limit. Then the
temporarily raised inject limit never works as expected.
Assigment limit to bfqd->in_service_queue in search is needed as limit
maybe overwriten to min_t(unsigned int, 1, limit) for condition that
a large in-flight request is on non-rotational devices in found queue.
So we need to reset limit to bfqd->in_service_queue for normal case.

Actually, we have already make sure bfqd->rq_in_driver is < limit before
search, then
 -Limit is >= 1 as bfqd->rq_in_driver is >= 0. Then min_t(unsigned int,
1, limit) is always 1. So we can simply check bfqd->rq_in_driver with
1 instead of result of min_t(unsigned int, 1, limit) for larget request in
non-rotational device case to avoid overwritting limit and the bug is gone.
 -For normal case, we have already check bfqd->rq_in_driver is < limit,
so we can return found bfqq unconditionally to remove unncessary check.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-2-shikemeng@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

0c3e09e8

sbitmap: correct wake_batch recalculation to avoid potential IO hung · b5fcf787

Kemeng Shi authored Jan 17, 2023

Commit 180dccb0 ("blk-mq: fix tag_get wait task can't be awakened")
mentioned that in case of shared tags, there could be just one real
active hctx(queue) because of lazy detection of tag idle. Then driver tag
allocation may wait forever on this real active hctx(queue) if wake_batch
is > hctx_max_depth where hctx_max_depth is available tags depth for the
actve hctx(queue). However, the condition wake_batch > hctx_max_depth is
not strong enough to avoid IO hung as the sbitmap_queue_wake_up will only
wake up one wait queue for each wake_batch even though there is only one
waiter in the woken wait queue. After this, there is only one tag to free
and wake_batch may not be reached anymore. Commit 180dccb0 ("blk-mq:
fix tag_get wait task can't be awakened") methioned that driver tag
allocation may wait forever. Actually, the inactive hctx(queue) will be
truely idle after at most 30 seconds and will call blk_mq_tag_wakeup_all
to wake one waiter per wait queue to break the hung. But IO hung for 30
seconds is also not acceptable. Set batch size to small enough that depth
of the shared hctx(queue) is enough to wake up all of the queues like
sbq_calc_wake_batch do to fix this potential IO hung.

Although hctx_max_depth will be clamped to at least 4 while wake_batch
recalculation does not do the clamp, the wake_batch will be always
recalculated to 1 when hctx_max_depth <= 4.

Fixes: 180dccb0 ("blk-mq: fix tag_get wait task can't be awakened")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-6-shikemeng@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

b5fcf787

sbitmap: add sbitmap_find_bit to remove repeat code in __sbitmap_get/__sbitmap_get_shallow · 678418c6

Kemeng Shi authored Jan 17, 2023

There are three differences between __sbitmap_get and
__sbitmap_get_shallow when searching free bit:
1. __sbitmap_get_shallow limit number of bit to search per word.
__sbitmap_get has no such limit.
2. __sbitmap_get_shallow always searches with wrap set. __sbitmap_get set
wrap according to round_robin.
3. __sbitmap_get_shallow always searches from first bit in first word.
__sbitmap_get searches from first bit when round_robin is not set
otherwise searches from SB_NR_TO_BIT(sb, alloc_hint).

Add helper function sbitmap_find_bit function to do common search while
accept "limit depth per word", "wrap flag" and "first bit to
search" from caller to support the need of both __sbitmap_get and
__sbitmap_get_shallow.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-5-shikemeng@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

678418c6

sbitmap: rewrite sbitmap_find_bit_in_index to reduce repeat code · 08470a98

Kemeng Shi authored Jan 17, 2023

Rewrite sbitmap_find_bit_in_index as following:
1. Rename sbitmap_find_bit_in_index to sbitmap_find_bit_in_word
2. Accept "struct sbitmap_word *" directly instead of accepting
"struct sbitmap *" and "int index" to get "struct sbitmap_word *".
3. Accept depth/shallow_depth and wrap for __sbitmap_get_word from caller
to support need of both __sbitmap_get_shallow and __sbitmap_get.

With helper function sbitmap_find_bit_in_word, we can remove repeat
code in __sbitmap_get_shallow to find bit considring deferred clear.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-4-shikemeng@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

08470a98

sbitmap: remove redundant check in __sbitmap_queue_get_batch · 903e86f3

Kemeng Shi authored Jan 17, 2023

Commit fbb564a5 ("lib/sbitmap: Fix invalid loop in
__sbitmap_queue_get_batch()") mentioned that "Checking free bits when
setting the target bits. Otherwise, it may reuse the busying bits."
This commit add check to make sure all masked bits in word before
cmpxchg is zero. Then the existing check after cmpxchg to check any
zero bit is existing in masked bits in word is redundant.

Actually, old value of word before cmpxchg is stored in val and we
will filter out busy bits in val by "(get_mask & ~val)" after cmpxchg.
So we will not reuse busy bits methioned in commit fbb564a5
("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()"). Revert
new-added check to remove redundant check.

Fixes: fbb564a5 ("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-3-shikemeng@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

903e86f3

sbitmap: remove unnecessary calculation of alloc_hint in __sbitmap_get_shallow · f1591a8b

Kemeng Shi authored Jan 17, 2023

Updates to alloc_hint in the loop in __sbitmap_get_shallow() are mostly
pointless and equivalent to setting alloc_hint to zero (because
SB_NR_TO_BIT() considers only low sb->shift bits from alloc_hint). So
simplify the logic.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-2-shikemeng@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

f1591a8b

29 Jan, 2023 28 commits

blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy() · f1c006f1

Yu Kuai authored Jan 19, 2023

Currently parent pd can be freed before child pd:

t1: remove cgroup C1
blkcg_destroy_blkgs
 blkg_destroy
  list_del_init(&blkg->q_node)
  // remove blkg from queue list
  percpu_ref_kill(&blkg->refcnt)
   blkg_release
    call_rcu

t2: from t1
__blkg_release
 blkg_free
  schedule_work
			t4: deactivate policy
			blkcg_deactivate_policy
			 pd_free_fn
			 // parent of C1 is freed first
t3: from t2
 blkg_free_workfn
  pd_free_fn

If policy(for example, ioc_timer_fn() from iocost) access parent pd from
child pd after pd_offline_fn(), then UAF can be triggered.

Fix the problem by delaying 'list_del_init(&blkg->q_node)' from
blkg_destroy() to blkg_free_workfn(), and using a new disk level mutex to
synchronize blkg_free_workfn() and blkcg_deactivate_policy().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-4-yukuai1@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

f1c006f1

blk-cgroup: support to track if policy is online · dfd6200a

Yu Kuai authored Jan 19, 2023

A new field 'online' is added to blkg_policy_data to fix following
2 problem:

1) In blkcg_activate_policy(), if pd_alloc_fn() with 'GFP_NOWAIT'
   failed, 'queue_lock' will be dropped and pd_alloc_fn() will try again
   without 'GFP_NOWAIT'. In the meantime, remove cgroup can race with
   it, and pd_offline_fn() will be called without pd_init_fn() and
   pd_online_fn(). This way null-ptr-deference can be triggered.

2) In order to synchronize pd_free_fn() from blkg_free_workfn() and
   blkcg_deactivate_policy(), 'list_del_init(&blkg->q_node)' will be
   delayed to blkg_free_workfn(), hence pd_offline_fn() can be called
   first in blkg_destroy(), and then blkcg_deactivate_policy() will
   call it again, we must prevent it.

The new field 'online' will be set after pd_online_fn() and will be
cleared after pd_offline_fn(), in the meantime pd_offline_fn() will only
be called if 'online' is set.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-3-yukuai1@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

dfd6200a

blk-cgroup: dropping parent refcount after pd_free_fn() is done · c7241bab

Yu Kuai authored Jan 19, 2023

Some cgroup policies will access parent pd through child pd even
after pd_offline_fn() is done. If pd_free_fn() for parent is called
before child, then UAF can be triggered. Hence it's better to guarantee
the order of pd_free_fn().

Currently refcount of parent blkg is dropped in __blkg_release(), which
is before pd_free_fn() is called in blkg_free_work_fn() while
blkg_free_work_fn() is called asynchronously.

This patch make sure pd_free_fn() called from removing cgroup is ordered
by delaying dropping parent refcount after calling pd_free_fn() for
child.

BTW, pd_free_fn() will also be called from blkcg_deactivate_policy()
from deleting device, and following patches will guarantee the order.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-2-yukuai1@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

c7241bab

blk-mq: cleanup unused methods: blk_mq_hw_sysfs_store · b3678103

Zhong Jinghua authored Jan 28, 2023

We found that the blk_mq_hw_sysfs_store interface has no place to use.
The object default_hw_ctx_attrs using blk_mq_hw_sysfs_ops only uses
the show method and does not use the store method.

Since this patch:
4a46f05e ("blk-mq: move hctx and ctx counters from sysfs to debugfs")
moved the store method to debugfs, the store method is not used anymore.

So let me do some tiny work to clean up unused code.
Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230128030419.2780298-1-zhongjinghua@huawei.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

b3678103

s390/dcssblk:: don't call bio_split_to_limits · 9607cd36

Christoph Hellwig authored Jan 23, 2023

s390 iterates over the bio using bio_for_each_segment and doesn't need
any bio splitting.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/r/20230123075356.60847-1-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

9607cd36

ps3vram: remove bio splitting · 1bf7a749

Christoph Hellwig authored Jan 23, 2023

ps3vram iterates over the bio one segment, that is page aligned and max
page sized chunk, a time. Because of that there is no point in
calling bio_split_to_limits, or explicitly setting the default limits
that are only used by bio_split_to_limits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Geoff Levand <geoff@infradead.org>
Link: https://lore.kernel.org/r/20230123074718.57951-1-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

1bf7a749

block: treat poll queue enter similarly to timeouts · 33391eec

Jens Axboe authored Jan 20, 2023

We ran into an issue where a production workload would randomly grind to
a halt and not continue until the pending IO had timed out. This turned
out to be a complicated interaction between queue freezing and polled
IO:

1) You have an application that does polled IO. At any point in time,
   there may be polled IO pending.

2) You have a monitoring application that issues a passthrough command,
   which is marked with side effects such that it needs to freeze the
   queue.

3) Passthrough command is started, which calls blk_freeze_queue_start()
   on the device. At this point the queue is marked frozen, and any
   attempt to enter the queue will fail (for non-blocking) or block.

4) Now the driver calls blk_mq_freeze_queue_wait(), which will return
   when the queue is quiesced and pending IO has completed.

5) The pending IO is polled IO, but any attempt to poll IO through the
   normal iocb_bio_iopoll() -> bio_poll() will fail when it gets to
   bio_queue_enter() as the queue is frozen. Rather than poll and
   complete IO, the polling threads will sit in a tight loop attempting
   to poll, but failing to enter the queue to do so.

The end result is that progress for either application will be stalled
until all pending polled IO has timed out. This causes obvious huge
latency issues for the application doing polled IO, but also long delays
for passthrough command.

Fix this by treating queue enter for polled IO just like we do for
timeouts. This allows quick quiesce of the queue as we still poll and
complete this IO, while still disallowing queueing up new IO.
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

33391eec

blk-iocost: change div64_u64 to DIV64_U64_ROUND_UP in ioc_refresh_params() · b3260329

Li Nan authored Jan 17, 2023

vrate_min is calculated by DIV64_U64_ROUND_UP, but vrate_max is calculated
by div64_u64. Vrate_min may be 1 greater than vrate_max if the input
values min and max of cost.qos are equal.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-6-yukuai1@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

b3260329

blk-iocost: fix divide by 0 error in calc_lcoefs() · 984af1e6

Li Nan authored Jan 17, 2023

echo max of u64 to cost.model can cause divide by 0 error.

  # echo 8:0 rbps=18446744073709551615 > /sys/fs/cgroup/io.cost.model

  divide error: 0000 [#1] PREEMPT SMP
  RIP: 0010:calc_lcoefs+0x4c/0xc0
  Call Trace:
   <TASK>
   ioc_refresh_params+0x2b3/0x4f0
   ioc_cost_model_write+0x3cb/0x4c0
   ? _copy_from_iter+0x6d/0x6c0
   ? kernfs_fop_write_iter+0xfc/0x270
   cgroup_file_write+0xa0/0x200
   kernfs_fop_write_iter+0x17d/0x270
   vfs_write+0x414/0x620
   ksys_write+0x73/0x160
   __x64_sys_write+0x1e/0x30
   do_syscall_64+0x35/0x80
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

calc_lcoefs() uses the input value of cost.model in DIV_ROUND_UP_ULL,
overflow would happen if bps plus IOC_PAGE_SIZE is greater than
ULLONG_MAX, it can cause divide by 0 error.

Fix the problem by setting basecost
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-5-yukuai1@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

984af1e6

blk-iocost: read params inside lock in sysfs apis · 35198e32

Yu Kuai authored Jan 17, 2023

Otherwise, user might get abnormal values if params is updated
concurrently.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-4-yukuai1@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

35198e32

blk-iocost: don't allow to configure bio based device · 235a5a83

Yu Kuai authored Jan 17, 2023

iocost is based on rq_qos, which can only work for request based device,
thus it doesn't make sense to configure iocost for bio based device.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-3-yukuai1@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

235a5a83

blk-iocost: check return value of match_u64() · 7b7c5ae4

Yu Kuai authored Jan 17, 2023

This patch fixs that the return value of match_u64() from ioc_qos_write()
is not checked,
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-2-yukuai1@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

7b7c5ae4

blk-iocost: avoid 64-bit division in ioc_timer_fn · 5f2779df

Arnd Bergmann authored Jan 18, 2023

The behavior of 'enum' types has changed in gcc-13, so now the
UNBUSY_THR_PCT constant is interpreted as a 64-bit number because
it is defined as part of the same enum definition as some other
constants that do not fit within a 32-bit integer. This in turn
leads to some inefficient code on 32-bit architectures as well
as a link error:

arm-linux-gnueabi/bin/arm-linux-gnueabi-ld: block/blk-iocost.o: in function `ioc_timer_fn':
blk-iocost.c:(.text+0x68e8): undefined reference to `__aeabi_uldivmod'
arm-linux-gnueabi-ld: blk-iocost.c:(.text+0x6908): undefined reference to `__aeabi_uldivmod'

Split the enum definition to keep the 64-bit timing constants in
a separate enum type from those constants that can clearly fit
within a smaller type.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230118080706.3303186-1-arnd@kernel.orgSigned-off-by: Jens Axboe <axboe@kernel.dk>

5f2779df

block: ublk: fix doc build warning · 464544fb

Ming Lei authored Jan 18, 2023

Fix the following warning:

Documentation/block/ublk.rst:157: WARNING: Enumerated list ends without a blank line; unexpected unindent.
Documentation/block/ublk.rst:171: WARNING: Enumerated list ends without a blank line; unexpected unindent.

Fixes: 56f5160bc1b8 ("ublk_drv: add mechanism for supporting unprivileged ublk device")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230118042318.127900-1-ming.lei@redhat.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

464544fb

block: introduce bdev_zone_no helper · d67ea690

Pankaj Raghav authored Jan 10, 2023

Add a generic bdev_zone_no() helper to calculate zone number for a
given sector in a block device. This helper internally uses disk_zone_no()
to find the zone number.

Use the helper bdev_zone_no() to calculate nr of zones. This lets us
make modifications to the math if needed in one place.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20230110143635.77300-4-p.raghav@samsung.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

d67ea690

block: add a new helper bdev_{is_zone_start, offset_from_zone_start} · e29b2100

Pankaj Raghav authored Jan 10, 2023

Instead of open coding to check for zone start, add a helper to improve
readability and store the logic in one place.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230110143635.77300-3-p.raghav@samsung.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

e29b2100

block: remove superfluous check for request queue in bdev_is_zoned() · fea127b3

Pankaj Raghav authored Jan 10, 2023

Remove the superfluous request queue check in bdev_is_zoned() as
bdev_get_queue() can never return NULL.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20230110143635.77300-2-p.raghav@samsung.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

fea127b3

block: extend bio-cache for non-polled requests · 7e2e355d

Anuj Gupta authored Jan 17, 2023

This patch modifies the present check, so that bio-cache is not limited
to iopoll.
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20230117120638.72254-3-anuj20.g@samsung.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

7e2e355d

nvme: set REQ_ALLOC_CACHE for uring-passthru request · 888545cb

Anuj Gupta authored Jan 17, 2023

This patch sets REQ_ALLOC_CACHE flag for uring-passthru requests.
This is a prep-patch so that normal / IRQ-driven uring-passthru
I/Os can also leverage bio-cache.
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20230117120638.72254-2-anuj20.g@samsung.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

888545cb

ublk_drv: add mechanism for supporting unprivileged ublk device · 4093cb5a

Ming Lei authored Jan 06, 2023

unprivileged ublk device is helpful for container use case, such
as: ublk device created in one unprivileged container can be controlled
and accessed by this container only.

Implement this feature by adding flag of UBLK_F_UNPRIVILEGED_DEV, and if
this flag isn't set, any control command has been run from privileged
user. Otherwise, any control command can be sent from any unprivileged
user, but the user has to be permitted to access the ublk char device
to be controlled.

In case of UBLK_F_UNPRIVILEGED_DEV:

1) for command UBLK_CMD_ADD_DEV, it is always allowed, and user needs
to provide owner's uid/gid in this command, so that udev can set correct
ownership for the created ublk device, since the device owner uid/gid
can be queried via command of UBLK_CMD_GET_DEV_INFO.

2) for other control commands, they can only be run successfully if the
current user is allowed to access the specified ublk char device, for
running the permission check, path of the ublk char device has to be
provided by these commands.

Also add one control of command UBLK_CMD_GET_DEV_INFO2 which always
include the char dev path in payload since userspace may not have
knowledge if this device is created in unprivileged mode.

For applying this mechanism, system administrator needs to take
the following policies:

1) chmod 0666 /dev/ublk-control

2) change ownership of ublkcN & ublkbN
- chown owner_uid:owner_gid /dev/ublkcN
- chown owner_uid:owner_gid /dev/ublkbN

Both can be done via one simple udev rule.

Userspace:

	https://github.com/ming1/ubdsrv/tree/unprivileged-ublk

'ublk add -t $TYPE --un_privileged=1' is for creating one un-privileged
ublk device if the user is un-privileged.

Link: https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-7-ming.lei@redhat.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

4093cb5a

ublk_drv: add module parameter of ublks_max for limiting max allowed ublk dev · 403ebc87

Ming Lei authored Jan 06, 2023

Prepare for supporting unprivileged ublk device by limiting max number
ublk devices added. Otherwise too many ublk devices could be added by
un-trusted user, which can be thought as one DoS.
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-6-ming.lei@redhat.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

403ebc87

ublk_drv: add device parameter UBLK_PARAM_TYPE_DEVT · abb864d3

Ming Lei authored Jan 06, 2023

Userspace side only knows device ID, but the associated path of ublkc* and
ublkb* could be changed by udev, and that depends on userspace's policy, so
add parameter of UBLK_PARAM_TYPE_DEVT for retrieving major/minor of the
ublkc* and ublkb*, then user may figure out major/minor of the ublk disks
he/she owns. With major/minor, it is easy to find the device node path.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-5-ming.lei@redhat.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

abb864d3

ublk_drv: move ublk_get_device_from_id into ublk_ctrl_uring_cmd · bfbcef03

Ming Lei authored Jan 06, 2023

It is annoying for each control command handler to get/put ublk
device and deal with failure.

Control command handler is simplified a lot by moving
ublk_get_device_from_id into ublk_ctrl_uring_cmd().
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-4-ming.lei@redhat.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

bfbcef03

ublk_drv: don't probe partitions if the ubq daemon isn't trusted · 73a166d9

Ming Lei authored Jan 06, 2023

If any ubq daemon is unprivileged, the ublk char device is allowed
for unprivileged user actually, and we can't trust the current user,
so not probe partitions.

Fixes: 71f28f31 ("ublk_drv: add io_uring based userspace block driver")
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-3-ming.lei@redhat.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

73a166d9

ublk_drv: remove nr_aborted_queues from ublk_device · ed878d1c

Ming Lei authored Jan 06, 2023

No one uses 'nr_aborted_queues' any more, so remove it.
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-2-ming.lei@redhat.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

ed878d1c

block: don't allow multiple bios for IOCB_NOWAIT issue · 67d59247

Jens Axboe authored Jan 16, 2023

If we're doing a large IO request which needs to be split into multiple
bios for issue, then we can run into the same situation as the below
marked commit fixes - parts will complete just fine, one or more parts
will fail to allocate a request. This will result in a partially
completed read or write request, where the caller gets EAGAIN even though
parts of the IO completed just fine.

Do the same for large bios as we do for splits - fail a NOWAIT request
with EAGAIN. This isn't technically fixing an issue in the below marked
patch, but for stable purposes, we should have either none of them or
both.

This depends on: 613b1488 ("block: handle bio_split_to_limits() NULL return")

Cc: stable@vger.kernel.org # 5.15+
Fixes: 9cea62b2 ("block: don't allow splitting of a REQ_NOWAIT bio")
Link: https://github.com/axboe/liburing/issues/766Reported-and-tested-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

67d59247

drbd: drbd_insert_interval(): Clarify comment · 2bb34fa6

Andreas Gruenbacher authored Jan 13, 2023

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20230113123538.144276-9-christoph.boehmwalder@linbit.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

2bb34fa6

drbd: interval tree: make removing an "empty" interval a no-op · 2990ca29

Lars Ellenberg authored Jan 13, 2023

Trying to remove an "empty" (just initialized, or "cleared") interval
from the tree, this results in an endless loop.

As we typically protect the tree with a spinlock_irq,
the result is a hung system.

Be nice to error cleanup code paths, ignore removal of empty intervals.
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20230113123538.144276-8-christoph.boehmwalder@linbit.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

2990ca29