- 30 Jan, 2023 12 commits
-
-
Kemeng Shi authored
We jump to the tag only to return the current rq. Return directly and remove the tag.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116095153.3810101-8-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Kemeng Shi authored
We have already avoided a circular list in bfq_setup_merge() (see the comments in bfq_setup_merge() for details), so a bfq_queue will not appear in its own new_bfqq list. Just remove this check.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-7-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Kemeng Shi authored
async_bfqq is assigned from bfqq->bic->bfqq[0]; use it directly.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-6-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Kemeng Shi authored
Use the helper macro RQ_BFQQ to get the bfqq of a request.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-5-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Kemeng Shi authored
The inject limit is updated or reset when time_is_before_eq_jiffies(decrease_time_jif + several msecs) holds or the think-time state changes. decrease_time_jif is initialized to 0 and is set to the current jiffies when the inject limit is updated or reset. If jiffies is slightly greater than LONG_MAX, time_is_after_eq_jiffies(0) stays true for a long time, and so does time_is_after_eq_jiffies(decrease_time_jif + several msecs). If the think-time state never changes, injection will not work as expected for a long time.

To be more specific: bfq_update_inject_limit() may be triggered when jiffies passes decrease_time_jif + msecs_to_jiffies(10) in bfq_add_request() by setting bfqd->wait_dispatch to true. bfq_reset_inject_limit() is called under two conditions:
1. jiffies passes bfqq->decrease_time_jif + msecs_to_jiffies(1000) in bfq_add_request().
2. jiffies passes bfqq->decrease_time_jif + msecs_to_jiffies(100), or the bfq think-time state changes from short to long.

Fix this by initializing bfqq->decrease_time_jif to the current jiffies, so that service injection is triggered soon after the service injection conditions are met.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-4-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Kemeng Shi authored
The parameter reason is never used; just remove it.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-3-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Kemeng Shi authored
bfq_choose_bfqq_for_injection() may temporarily raise the inject limit to one request if the current inject_limit is 0 before searching for the source queue for injection. However, the search below resets the limit to that of bfqd->in_service_queue, which is zero in the raised case. The temporarily raised inject limit therefore never works as expected.

Assigning limit from bfqd->in_service_queue in the search is needed because limit may be overwritten with min_t(unsigned int, 1, limit) when a large in-flight request is on a non-rotational device in the found queue. So we need to reset limit to that of bfqd->in_service_queue for the normal case.

Actually, we have already made sure that bfqd->rq_in_driver is < limit before the search, so limit is >= 1 as bfqd->rq_in_driver is >= 0. Then min_t(unsigned int, 1, limit) is always 1. So we can simply check bfqd->rq_in_driver against 1 instead of the result of min_t(unsigned int, 1, limit) for the large-request, non-rotational device case, which avoids overwriting limit, and the bug is gone.

For the normal case, we have already checked that bfqd->rq_in_driver is < limit, so we can return the found bfqq unconditionally and remove the unnecessary check.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-2-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Kemeng Shi authored
Commit 180dccb0 ("blk-mq: fix tag_get wait task can't be awakened") mentioned that in case of shared tags, there could be just one real active hctx(queue) because of lazy detection of tag idle. Driver tag allocation may then wait forever on this real active hctx(queue) if wake_batch is > hctx_max_depth, where hctx_max_depth is the available tag depth for the active hctx(queue). However, the condition wake_batch > hctx_max_depth is not strong enough to avoid an IO hang, as sbitmap_queue_wake_up() only wakes up one wait queue for each wake_batch even though there is only one waiter in the woken wait queue. After this, there is only one tag to free and wake_batch may not be reached anymore.

Commit 180dccb0 ("blk-mq: fix tag_get wait task can't be awakened") mentioned that driver tag allocation may wait forever. Actually, the inactive hctx(queue) will be truly idle after at most 30 seconds and will call blk_mq_tag_wakeup_all() to wake one waiter per wait queue and break the hang. But an IO hang of 30 seconds is also not acceptable. Set the batch size small enough that the depth of the shared hctx(queue) is enough to wake up all of the queues, as sbq_calc_wake_batch() does, to fix this potential IO hang.

Although hctx_max_depth will be clamped to at least 4 while the wake_batch recalculation does not do the clamp, wake_batch will always be recalculated to 1 when hctx_max_depth <= 4.

Fixes: 180dccb0 ("blk-mq: fix tag_get wait task can't be awakened")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-6-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
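
The batch sizing idea can be sketched in user space as follows. This is a simplified illustration, not the kernel code: calc_wake_batch() is a hypothetical name, and the values 8 for the number of wait queues and for the maximum batch are assumptions meant to mirror SBQ_WAIT_QUEUES and SBQ_WAKE_BATCH.

```c
#define WAIT_QUEUES	8u	/* assumed number of wait queues */
#define MAX_WAKE_BATCH	8u	/* assumed upper bound for the batch */

/*
 * Pick a wake batch small enough that the per-queue share of the tag depth
 * can wake every wait queue: batch * WAIT_QUEUES <= share whenever the
 * share is at least WAIT_QUEUES, and 1 otherwise.
 */
static unsigned int calc_wake_batch(unsigned int total_depth, unsigned int users)
{
	unsigned int share, batch;

	if (users == 0)
		users = 1;			/* avoid dividing by zero */
	share = (total_depth + users - 1) / users;	/* rounded-up share */
	batch = share / WAIT_QUEUES;
	if (batch < 1)
		batch = 1;
	if (batch > MAX_WAKE_BATCH)
		batch = MAX_WAKE_BATCH;
	return batch;
}
```

Under these assumptions a per-queue share of 4 or less always yields a batch of 1, which matches the last paragraph of the message above.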
-
Kemeng Shi authored
There are three differences between __sbitmap_get and __sbitmap_get_shallow when searching for a free bit:
1. __sbitmap_get_shallow limits the number of bits to search per word. __sbitmap_get has no such limit.
2. __sbitmap_get_shallow always searches with wrap set. __sbitmap_get sets wrap according to round_robin.
3. __sbitmap_get_shallow always searches from the first bit in the first word. __sbitmap_get searches from the first bit when round_robin is not set, otherwise it searches from SB_NR_TO_BIT(sb, alloc_hint).

Add a helper function, sbitmap_find_bit(), to do the common search. It accepts the "limit depth per word", the "wrap flag" and the "first bit to search" from the caller to support the needs of both __sbitmap_get and __sbitmap_get_shallow.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-5-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
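
A user-space sketch of such a common search is shown below. It is a simplified, non-atomic illustration and not the kernel implementation: the names, the 8-bit word size and the byte-array bitmap are all assumptions, and depth is assumed to be at most BITS_PER_WORD.

```c
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_WORD 8u	/* illustrative; the kernel uses 1U << sb->shift */

/* Claim the first clear bit in [from, to) of *word; return its number or -1. */
static int claim_in_range(unsigned char *word, unsigned int from, unsigned int to)
{
	for (unsigned int nr = from; nr < to; nr++) {
		if (!(*word & (1u << nr))) {
			*word |= (unsigned char)(1u << nr);
			return (int)nr;
		}
	}
	return -1;
}

/* Search one word, honouring the per-word depth, the start bit and wrap. */
static int find_bit_in_word(unsigned char *word, unsigned int depth,
			    unsigned int hint, bool wrap)
{
	int nr = claim_in_range(word, hint, depth);

	if (nr < 0 && wrap && hint)	/* retry the bits below the start bit */
		nr = claim_in_range(word, 0, hint < depth ? hint : depth);
	return nr;
}

/* Common search loop: visit every word once, starting at word `index`. */
static int find_bit(unsigned char *map, size_t map_nr, unsigned int depth,
		    size_t index, unsigned int hint, bool wrap)
{
	for (size_t i = 0; i < map_nr; i++) {
		int nr = find_bit_in_word(&map[index], depth, hint, wrap);

		if (nr >= 0)
			return nr + (int)(index * BITS_PER_WORD);
		hint = 0;		/* later words are searched from bit 0 */
		if (++index >= map_nr)
			index = 0;
	}
	return -1;
}
```

A full-depth caller passes depth = BITS_PER_WORD, while a shallow caller passes a smaller depth so that only the low bits of each word are considered.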
-
Kemeng Shi authored
Rewrite sbitmap_find_bit_in_index as follows:
1. Rename sbitmap_find_bit_in_index to sbitmap_find_bit_in_word.
2. Accept "struct sbitmap_word *" directly instead of accepting "struct sbitmap *" and "int index" to get "struct sbitmap_word *".
3. Accept depth/shallow_depth and wrap for __sbitmap_get_word from the caller to support the needs of both __sbitmap_get_shallow and __sbitmap_get.

With the helper function sbitmap_find_bit_in_word, we can remove the repeated code in __sbitmap_get_shallow that finds a bit considering deferred clear.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-4-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Kemeng Shi authored
Commit fbb564a5 ("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()") mentioned that "Checking free bits when setting the target bits. Otherwise, it may reuse the busying bits." That commit added a check to make sure all masked bits in the word are zero before the cmpxchg. The existing check after the cmpxchg, which tests whether any masked bit in the word was zero, is then redundant.

Actually, the old value of the word before the cmpxchg is stored in val, and we filter out the busy bits in val with "(get_mask & ~val)" after the cmpxchg. So we will not reuse the busy bits mentioned in commit fbb564a5 ("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()"). Revert the newly added check to remove the redundant check.

Fixes: fbb564a5 ("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-3-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
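
The filtering argument can be illustrated with C11 atomics. This is a user-space sketch under assumptions: grab_bits() is a hypothetical name and atomic_compare_exchange_weak() stands in for the kernel's cmpxchg helper.

```c
#include <stdatomic.h>

/*
 * Try to set every bit of get_mask in *word and return only the bits that
 * were clear beforehand, i.e. the bits this caller really obtained.
 */
static unsigned long grab_bits(_Atomic unsigned long *word, unsigned long get_mask)
{
	unsigned long val = atomic_load(word);

	/* On failure, val is refreshed with the current contents of *word. */
	while (!atomic_compare_exchange_weak(word, &val, val | get_mask))
		;

	/* Bits already set in the old value were busy and are filtered out. */
	return get_mask & ~val;
}
```

If the word holds 0x5 and the caller asks for mask 0xF, only 0xA is returned, so the two busy bits are never handed out a second time.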
-
Kemeng Shi authored
Updates to alloc_hint in the loop in __sbitmap_get_shallow() are mostly pointless and equivalent to setting alloc_hint to zero (because SB_NR_TO_BIT() considers only the low sb->shift bits of alloc_hint). So simplify the logic.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-2-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
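
The masking behaviour this relies on can be shown with a small sketch. nr_to_bit() is a hypothetical user-space stand-in for SB_NR_TO_BIT(), taking the shift directly instead of a struct sbitmap pointer.

```c
/* Keep only the low `shift` bits of bitnr: the bit offset inside one word. */
static unsigned int nr_to_bit(unsigned int shift, unsigned int bitnr)
{
	return bitnr & ((1u << shift) - 1u);
}
```

With shift = 6 (64-bit words), any hint that is a multiple of 64 maps to bit 0, so setting the hint to a word-aligned value is the same as setting it to zero.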
-
- 29 Jan, 2023 28 commits
-
-
Yu Kuai authored
Currently the parent pd can be freed before the child pd:

t1: remove cgroup C1
blkcg_destroy_blkgs
 blkg_destroy
  list_del_init(&blkg->q_node)
  // remove blkg from queue list
  percpu_ref_kill(&blkg->refcnt)
   blkg_release
    call_rcu

t2: from t1
__blkg_release
 blkg_free
  schedule_work

            t4: deactivate policy
            blkcg_deactivate_policy
             pd_free_fn
             // parent of C1 is freed first

t3: from t2
blkg_free_workfn
 pd_free_fn

If a policy (for example, ioc_timer_fn() from iocost) accesses the parent pd from the child pd after pd_offline_fn(), then a UAF can be triggered.

Fix the problem by delaying 'list_del_init(&blkg->q_node)' from blkg_destroy() to blkg_free_workfn(), and by using a new disk-level mutex to synchronize blkg_free_workfn() and blkcg_deactivate_policy().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
A new field 'online' is added to blkg_policy_data to fix the following two problems:

1) In blkcg_activate_policy(), if pd_alloc_fn() with 'GFP_NOWAIT' fails, 'queue_lock' will be dropped and pd_alloc_fn() will be tried again without 'GFP_NOWAIT'. In the meantime, cgroup removal can race with it, and pd_offline_fn() will be called without pd_init_fn() and pd_online_fn(). This way a NULL pointer dereference can be triggered.

2) In order to synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy(), 'list_del_init(&blkg->q_node)' will be delayed to blkg_free_workfn(). Hence pd_offline_fn() can be called first in blkg_destroy(), and then blkcg_deactivate_policy() will call it again; we must prevent that.

The new field 'online' will be set after pd_online_fn() and cleared after pd_offline_fn(). In the meantime, pd_offline_fn() will only be called if 'online' is set.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
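
The guard described above amounts to a small state flag. The sketch below is a user-space illustration only: struct policy_data, pd_online() and pd_offline() are hypothetical names, and the counter merely stands in for the real offline callback.

```c
#include <stdbool.h>

struct policy_data {
	bool online;		/* set once the online step has run */
	int offline_calls;	/* counts how often the offline step ran */
};

/* Mark the policy data online after its online callback has run. */
static void pd_online(struct policy_data *pd)
{
	pd->online = true;
}

/* Run the offline step at most once, and only if the data went online. */
static void pd_offline(struct policy_data *pd)
{
	if (!pd->online)
		return;
	pd->offline_calls++;	/* stands in for the offline callback */
	pd->online = false;
}
```

Calling pd_offline() before pd_online(), or twice in a row, is then harmless, which covers both problems listed in the message.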
-
Yu Kuai authored
Some cgroup policies will access the parent pd through the child pd even after pd_offline_fn() is done. If pd_free_fn() for the parent is called before the child, then a UAF can be triggered. Hence it's better to guarantee the order of pd_free_fn().

Currently the refcount of the parent blkg is dropped in __blkg_release(), which is before pd_free_fn() is called in blkg_free_workfn(), while blkg_free_workfn() is called asynchronously.

This patch makes sure that pd_free_fn() called from cgroup removal is ordered, by delaying the drop of the parent refcount until after pd_free_fn() has been called for the child.

BTW, pd_free_fn() will also be called from blkcg_deactivate_policy() when a device is deleted, and following patches will guarantee the order.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
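
The ordering rule can be sketched with a plain reference count. This is an illustration under assumptions, not the blkcg code: struct node, node_put() and the global counter are hypothetical, and recording a sequence number stands in for pd_free_fn().

```c
#include <stddef.h>

struct node {
	struct node *parent;	/* NULL for the root */
	int refs;		/* own reference plus one per child */
	int freed_seq;		/* 0 until freed, then the order of freeing */
};

static int free_counter;	/* number of nodes freed so far */

/* Drop one reference; the parent reference is dropped only after our free. */
static void node_put(struct node *n)
{
	while (n && --n->refs == 0) {
		n->freed_seq = ++free_counter;	/* stands in for pd_free_fn() */
		n = n->parent;			/* only now release the parent */
	}
}
```

Because the child holds its parent reference until its own data has been freed, the parent can never be freed first, regardless of which node is released first.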
-
Zhong Jinghua authored
We found that the blk_mq_hw_sysfs_store interface is not used anywhere. The object default_hw_ctx_attrs, which uses blk_mq_hw_sysfs_ops, only uses the show method and does not use the store method.

Since commit 4a46f05e ("blk-mq: move hctx and ctx counters from sysfs to debugfs") moved the store method to debugfs, the store method is not used anymore.

So clean up the unused code.

Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230128030419.2780298-1-zhongjinghua@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
s390 iterates over the bio using bio_for_each_segment and doesn't need any bio splitting.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/r/20230123075356.60847-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
ps3vram iterates over the bio one segment at a time, that is, in page-aligned chunks of at most one page. Because of that there is no point in calling bio_split_to_limits, or in explicitly setting the default limits that are only used by bio_split_to_limits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Geoff Levand <geoff@infradead.org>
Link: https://lore.kernel.org/r/20230123074718.57951-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
We ran into an issue where a production workload would randomly grind to a halt and not continue until the pending IO had timed out. This turned out to be a complicated interaction between queue freezing and polled IO:

1) You have an application that does polled IO. At any point in time, there may be polled IO pending.

2) You have a monitoring application that issues a passthrough command, which is marked with side effects such that it needs to freeze the queue.

3) Passthrough command is started, which calls blk_freeze_queue_start() on the device. At this point the queue is marked frozen, and any attempt to enter the queue will fail (for non-blocking) or block.

4) Now the driver calls blk_mq_freeze_queue_wait(), which will return when the queue is quiesced and pending IO has completed.

5) The pending IO is polled IO, but any attempt to poll IO through the normal iocb_bio_iopoll() -> bio_poll() will fail when it gets to bio_queue_enter() as the queue is frozen. Rather than poll and complete IO, the polling threads will sit in a tight loop attempting to poll, but failing to enter the queue to do so.

The end result is that progress for either application will be stalled until all pending polled IO has timed out. This causes obvious huge latency issues for the application doing polled IO, but also long delays for the passthrough command.

Fix this by treating queue enter for polled IO just like we do for timeouts. This allows quick quiesce of the queue as we still poll and complete this IO, while still disallowing queueing up new IO.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Li Nan authored
vrate_min is calculated by DIV64_U64_ROUND_UP, but vrate_max is calculated by div64_u64. vrate_min may be 1 greater than vrate_max if the input values min and max of cost.qos are equal.

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-6-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
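
The mismatch comes down to rounding the same quotient in two directions. The helpers below are hypothetical user-space stand-ins for DIV64_U64_ROUND_UP and div64_u64, and the numbers in the note are made up for illustration.

```c
#include <stdint.h>

/* Divide and round the result up (written so the addition cannot overflow). */
static uint64_t div_round_up(uint64_t n, uint64_t d)
{
	return n / d + (n % d != 0);
}

/* Divide and round the result down, as plain integer division does. */
static uint64_t div_round_down(uint64_t n, uint64_t d)
{
	return n / d;
}
```

For equal inputs that do not divide evenly, for example 25 / 10, the rounded-up "min" is 3 while the rounded-down "max" is 2, so min ends up one greater than max.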
-
Li Nan authored
Echoing the max value of u64 to cost.model can cause a divide-by-zero error.

  # echo 8:0 rbps=18446744073709551615 > /sys/fs/cgroup/io.cost.model

  divide error: 0000 [#1] PREEMPT SMP
  RIP: 0010:calc_lcoefs+0x4c/0xc0
  Call Trace:
   <TASK>
   ioc_refresh_params+0x2b3/0x4f0
   ioc_cost_model_write+0x3cb/0x4c0
   ? _copy_from_iter+0x6d/0x6c0
   ? kernfs_fop_write_iter+0xfc/0x270
   cgroup_file_write+0xa0/0x200
   kernfs_fop_write_iter+0x17d/0x270
   vfs_write+0x414/0x620
   ksys_write+0x73/0x160
   __x64_sys_write+0x1e/0x30
   do_syscall_64+0x35/0x80
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

calc_lcoefs() uses the input value of cost.model in DIV_ROUND_UP_ULL. An overflow happens if bps plus IOC_PAGE_SIZE is greater than ULLONG_MAX, which can cause a divide-by-zero error. Fix the problem by setting basecost

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
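
The overflow itself is easy to reproduce in user space. In the sketch below, div_round_up_naive() is a hypothetical stand-in with the same "add divisor minus one, then divide" shape as DIV_ROUND_UP_ULL, and 4096 is an assumed value for IOC_PAGE_SIZE.

```c
#include <stdint.h>

#define PAGE_SZ 4096ULL	/* assumed stand-in for IOC_PAGE_SIZE */

/* Round-up division whose intermediate sum can wrap around for large n. */
static uint64_t div_round_up_naive(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}
```

For n = UINT64_MAX the sum wraps around to 4094, so the result is 0 instead of a huge page count, and using that result as a divisor triggers the divide error shown in the trace.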
-
Yu Kuai authored
Otherwise, the user might get abnormal values if params is updated concurrently.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
iocost is based on rq_qos, which only works for request-based devices, so it doesn't make sense to configure iocost for a bio-based device.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
This patch fixes the fact that the return value of match_u64() in ioc_qos_write() is not checked.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Arnd Bergmann authored
The behavior of 'enum' types has changed in gcc-13, so now the UNBUSY_THR_PCT constant is interpreted as a 64-bit number because it is defined as part of the same enum definition as some other constants that do not fit within a 32-bit integer. This in turn leads to some inefficient code on 32-bit architectures as well as a link error:

  arm-linux-gnueabi/bin/arm-linux-gnueabi-ld: block/blk-iocost.o: in function `ioc_timer_fn':
  blk-iocost.c:(.text+0x68e8): undefined reference to `__aeabi_uldivmod'
  arm-linux-gnueabi-ld: blk-iocost.c:(.text+0x6908): undefined reference to `__aeabi_uldivmod'

Split the enum definition to keep the 64-bit timing constants in a separate enum type from those constants that can clearly fit within a smaller type.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230118080706.3303186-1-arnd@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
Fix the following warnings:

  Documentation/block/ublk.rst:157: WARNING: Enumerated list ends without a blank line; unexpected unindent.
  Documentation/block/ublk.rst:171: WARNING: Enumerated list ends without a blank line; unexpected unindent.

Fixes: 56f5160bc1b8 ("ublk_drv: add mechanism for supporting unprivileged ublk device")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230118042318.127900-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pankaj Raghav authored
Add a generic bdev_zone_no() helper to calculate the zone number for a given sector in a block device. This helper internally uses disk_zone_no() to find the zone number.

Use the helper bdev_zone_no() to calculate the number of zones. This lets us make modifications to the math, if needed, in one place.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20230110143635.77300-4-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pankaj Raghav authored
Instead of open coding the check for a zone start, add a helper to improve readability and keep the logic in one place.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230110143635.77300-3-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
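
Zone helpers of this kind boil down to simple sector arithmetic. The sketch below is a user-space illustration with hypothetical names; it uses plain division and modulo, whereas the kernel helpers may rely on shifts and masks that assume a power-of-two zone size.

```c
#include <stdbool.h>
#include <stdint.h>

/* Zone number containing `sector`, for zones of zone_sectors sectors each. */
static uint64_t zone_no(uint64_t sector, uint64_t zone_sectors)
{
	return sector / zone_sectors;
}

/* True if `sector` is the first sector of a zone. */
static bool is_zone_start(uint64_t sector, uint64_t zone_sectors)
{
	return sector % zone_sectors == 0;
}
```

Keeping the arithmetic in helpers like these means a later change to the math only has to be made in one place.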
-
Pankaj Raghav authored
Remove the superfluous request queue check in bdev_is_zoned() as bdev_get_queue() can never return NULL.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20230110143635.77300-2-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Anuj Gupta authored
This patch modifies the present check, so that bio-cache is not limited to iopoll.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20230117120638.72254-3-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Anuj Gupta authored
This patch sets the REQ_ALLOC_CACHE flag for uring-passthru requests. This is a prep-patch so that normal / IRQ-driven uring-passthru I/Os can also leverage bio-cache.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20230117120638.72254-2-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
An unprivileged ublk device is helpful for the container use case: a ublk device created in one unprivileged container can be controlled and accessed by this container only.

Implement this feature by adding the flag UBLK_F_UNPRIVILEGED_DEV. If this flag isn't set, any control command must be run by a privileged user. Otherwise, any control command can be sent by any unprivileged user, but the user has to be permitted to access the ublk char device to be controlled.

In case of UBLK_F_UNPRIVILEGED_DEV:

1) The command UBLK_CMD_ADD_DEV is always allowed, and the user needs to provide the owner's uid/gid in this command, so that udev can set the correct ownership for the created ublk device, since the device owner uid/gid can be queried via the command UBLK_CMD_GET_DEV_INFO.

2) Other control commands can only be run successfully if the current user is allowed to access the specified ublk char device. To run the permission check, the path of the ublk char device has to be provided by these commands.

Also add one control command, UBLK_CMD_GET_DEV_INFO2, which always includes the char dev path in the payload, since userspace may not know whether this device was created in unprivileged mode.

To apply this mechanism, the system administrator needs to take the following policies:

1) chmod 0666 /dev/ublk-control

2) change ownership of ublkcN & ublkbN
- chown owner_uid:owner_gid /dev/ublkcN
- chown owner_uid:owner_gid /dev/ublkbN

Both can be done via one simple udev rule.

Userspace: https://github.com/ming1/ubdsrv/tree/unprivileged-ublk

'ublk add -t $TYPE --un_privileged=1' is for creating one un-privileged ublk device if the user is un-privileged.

Link: https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
Prepare for supporting unprivileged ublk devices by limiting the maximum number of ublk devices that can be added. Otherwise too many ublk devices could be added by an untrusted user, which can be considered a DoS.

Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
The userspace side only knows the device ID, but the associated paths of ublkc* and ublkb* could be changed by udev, depending on userspace's policy. So add the parameter UBLK_PARAM_TYPE_DEVT for retrieving the major/minor of ublkc* and ublkb*, so that the user can figure out the major/minor of the ublk disks he/she owns. With the major/minor, it is easy to find the device node path.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
It is annoying for each control command handler to get/put the ublk device and deal with failure. The control command handlers are simplified a lot by moving ublk_get_device_from_id into ublk_ctrl_uring_cmd().

Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
If any ubq daemon is unprivileged, the ublk char device is actually accessible to unprivileged users and we can't trust the current user, so do not probe partitions.

Fixes: 71f28f31 ("ublk_drv: add io_uring based userspace block driver")
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
No one uses 'nr_aborted_queues' any more, so remove it.

Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
If we're doing a large IO request which needs to be split into multiple bios for issue, then we can run into the same situation as the below marked commit fixes: parts will complete just fine, while one or more parts will fail to allocate a request. This will result in a partially completed read or write request, where the caller gets EAGAIN even though parts of the IO completed just fine.

Do the same for large bios as we do for splits: fail a NOWAIT request with EAGAIN. This isn't technically fixing an issue in the below marked patch, but for stable purposes, we should have either none of them or both.

This depends on: 613b1488 ("block: handle bio_split_to_limits() NULL return")

Cc: stable@vger.kernel.org # 5.15+
Fixes: 9cea62b2 ("block: don't allow splitting of a REQ_NOWAIT bio")
Link: https://github.com/axboe/liburing/issues/766
Reported-and-tested-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Andreas Gruenbacher authored
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20230113123538.144276-9-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Lars Ellenberg authored
Trying to remove an "empty" (just initialized, or "cleared") interval from the tree results in an endless loop. As we typically protect the tree with a spinlock_irq, the result is a hung system.

Be nice to error cleanup code paths and ignore removal of empty intervals.

Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20230113123538.144276-8-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-