- 28 Feb, 2022 11 commits
-
-
Sagi Grimberg authored
ida_simple_[get|remove] are wrappers anyways. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
ida_simple_[get|remove] are wrappers anyways. Also, use ida_alloc_min with the ns_ida as namespace enumeration starts with 1. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Chaitanya Kulkarni authored
Allow block device to be configured in the buffered I/O mode by using the file backend. In this way now we can use cache for the block device namespace which shows significant performance improvement. We update the block device ns enable function and return early when buffered_io flag is set. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Chaitanya Kulkarni authored
Instead of calling vfs_getattr() use i_size_read() to read the size of file so we can read the size of not only file type but also block type with one call. This is needed to implement buffered_io support for the NVMeOF block device backend. We also change return type of function nvmet_file_ns_revalidate() from int to void, since this function does not return any meaning value. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Chaitanya Kulkarni authored
Braces are not required for enum value NVME_SC_CONNECT_INVALID_PARAM when used on the switch-case statement, remove the braces. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Chaitanya Kulkarni authored
Remove zeroout memeset call & zeroout local variable cmd at the time of declaration in nvmf_ref_read32() similar to what we have done in nvmf_reg_read64(), nvmf_reg_write32(), nvmf_connect_admin_queue(), and nvmf_connect_io_queue(). Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Chaitanya Kulkarni authored
Loop variable i will never have a negative value, so use unsigned int type instaed of int. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Chaitanya Kulkarni authored
Loop variable i will never have a negative value, so use unsigned int type instaed of int. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Chaitanya Kulkarni authored
In function nvme_execute_rq() we don't use gendisk parameter at all. Remove the unsed parameter and adjust the calls. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Chaitanya Kulkarni authored
It is not a good practice to have a semicolon at the end of the function definition. Remove it from nvme_pr_type(). Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Qinghua Jin authored
subsytem -> subsystem Signed-off-by: Qinghua Jin <qhjin.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 27 Feb, 2022 13 commits
-
-
Chaitanya Kulkarni authored
Remove goto labels and use direct returns as error unwinding code only needs to free t_page variable if we alloc_pages() call fails as having two labels for one kfree() can be avoided easily. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220222152852.26043-3-kch@nvidia.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Chaitanya Kulkarni authored
Only caller of null_alloc_page() is null_insert_page() unconditionally sets only parameter to GFP_NOIO and that is statically hard-coded in null_blk. There is no point in having statically hardcoded function parameter. Remove the unnecessary parameter gfp_flags and adjust the code, so it can retain existing behavior null_alloc_page() with GFP_NOIO. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220222152852.26043-2-kch@nvidia.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Chaitanya Kulkarni authored
Only caller of alloc_cmd() is null_submit_bio() unconditionally sets second parameter to true and that is statically hard-coded in null_blk. There is no point in having statically hardcoded function parameter. Remove the unnecessary parameter can_wait and adjust the code so it can retain existing behavior of waiting when we don't get valid nullb_cmd from __alloc_cmd() in alloc_cmd(). The restructured code avoids multiple return statements, multiple calls to __alloc_cmd() and resulting a fast path call to prepare_to_wait() due to removal of first alloc_cmd() call. Follow the pattern that we have in bio_alloc() to set the structure members in the structure allocation function in alloc_cmd() and pass bio to initialize newly allocated cmd->bio member. Follow the pattern in copy_to_nullb() to use result of one function call (null_cache_active()) to be used as a parameter to another function call (null_insert_page()), use result of alloc_cmd() as a first parameter to the null_handle_cmd() in null_submit_bio() function. This allow us to remove the local variable cmd on stack in null_submit_bio() that is in fast path. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20220216172945.31124-2-kch@nvidia.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Chaitanya Kulkarni authored
Instead of hardcoding queue depth allow user to set the hw queue depth using module parameter. Set default value to 128 to retain the existing behavior. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20220215213310.7264-5-kch@nvidia.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Chaitanya Kulkarni authored
The local variable file is used to pass it to the vfs_fsync(). We can get away with using lo->lo_backing_file instead of storing in a local variable which is not used anywhere else. No functional change in this patch. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20220215213310.7264-4-kch@nvidia.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Chaitanya Kulkarni authored
The local variable q is used to pass it to the blk_queue_discard(). We can get away with using lo->lo_queue instead of storing in a local variable which is not used anywhere else. No functional change in this patch. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20220215213310.7264-3-kch@nvidia.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Chaitanya Kulkarni authored
sprintf does not know the PAGE_SIZE maximum of the temporary buffer used for outputting sysfs content and it's possible to overrun the PAGE_SIZE buffer length. Use a generic sysfs_emit function that knows the size of the temporary buffer and ensures that no overrun is done for offset attribute in loop_attr_[offset|sizelimit|autoclear|partscan|dio]_show() callbacks. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20220215213310.7264-2-kch@nvidia.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Chaitanya Kulkarni authored
The function nullb_device_power_store() returns -ENOMEM when null_add_dev() fails. null_add_dev() can fail with return value other than -ENOMEM such as -EINVAL when Zoned Block Device option is used, see : nullb_device_power_store() null_add_dev() null_init_zoned_dev() return -EINVAL; When trying to load the module having -ENOMEM value returned on the command line creates confusion when pleanty of memory is free on the machine. Instead of hardcoding -ENOMEM return the value of null_add_dev() function. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220215115951.15945-1-kch@nvidia.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Colin Ian King authored
The phrase "has still" should be "still has" to clean up the grammar. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20220208114656.61629-1-colin.i.king@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christophe JAILLET authored
According to lib/idr.c, The IDA handles its own locking. It is safe to call any of the IDA functions without synchronisation in your code. so the 'ida_lock' mutex can just be removed. It is here only to protect some ida_simple_get()/ida_simple_remove() calls. While at it, switch to ida_alloc_XXX()/ida_free() instead to ida_simple_get()/ida_simple_remove(). The latter is deprecated and more verbose. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Acked-by: Jack Wang <jinpu.wang@ionos.com> Link: https://lore.kernel.org/r/7f9eccd8b1fce1bac45ac9b01a78cf72f54c0a61.1644266862.git.christophe.jaillet@wanadoo.frSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Gioh Kim authored
On client side, the device is a network device. There is no reason to set rotational even-if the target device on server is rotational. Signed-off-by: Gioh Kim <gi-oh.kim@ionos.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com> Link: https://lore.kernel.org/r/20220114155855.984144-3-haris.iqbal@ionos.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Gioh Kim authored
This patch fix the "CHECK:BRACES: braces {} should be used on all arms of this statement" warning from checkpatch Signed-off-by: Gioh Kim <gi-oh.kim@ionos.com> Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com> Link: https://lore.kernel.org/r/20220114155855.984144-2-haris.iqbal@ionos.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
As Luis reported, losetup currently doesn't properly create the loop device without this if the device node already exists because old scripts created it manually. So default to y for now and remove the aggressive removal schedule. Reported-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220225181440.1351591-1-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
- 22 Feb, 2022 1 commit
-
-
Zhang Wensheng authored
When the inflight IOs are slow and no new IOs are issued, we expect iostat could manifest the IO hang problem. However after commit 5b18b5a7 ("block: delete part_round_stats and switch to less precise counting"), io_tick and time_in_queue will not be updated until the end of IO, and the avgqu-sz and %util columns of iostat will be zero. Because it has using stat.nsecs accumulation to express time_in_queue which is not suitable to change, and may %util will express the status better when io hang occur. To fix io_ticks, we use update_io_ticks and inflight to update io_ticks when diskstats_show and part_stat_show been called. Fixes: 5b18b5a7 ("block: delete part_round_stats and switch to less precise counting") Signed-off-by: Zhang Wensheng <zhangwensheng5@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220217064247.4041435-1-zhangwensheng5@huawei.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
- 18 Feb, 2022 3 commits
-
-
Yu Kuai authored
Our test report a UAF: [ 2073.019181] ================================================================== [ 2073.019188] BUG: KASAN: use-after-free in __bfq_put_async_bfqq+0xa0/0x168 [ 2073.019191] Write of size 8 at addr ffff8000ccf64128 by task rmmod/72584 [ 2073.019192] [ 2073.019196] CPU: 0 PID: 72584 Comm: rmmod Kdump: loaded Not tainted 4.19.90-yk #5 [ 2073.019198] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [ 2073.019200] Call trace: [ 2073.019203] dump_backtrace+0x0/0x310 [ 2073.019206] show_stack+0x28/0x38 [ 2073.019210] dump_stack+0xec/0x15c [ 2073.019216] print_address_description+0x68/0x2d0 [ 2073.019220] kasan_report+0x238/0x2f0 [ 2073.019224] __asan_store8+0x88/0xb0 [ 2073.019229] __bfq_put_async_bfqq+0xa0/0x168 [ 2073.019233] bfq_put_async_queues+0xbc/0x208 [ 2073.019236] bfq_pd_offline+0x178/0x238 [ 2073.019240] blkcg_deactivate_policy+0x1f0/0x420 [ 2073.019244] bfq_exit_queue+0x128/0x178 [ 2073.019249] blk_mq_exit_sched+0x12c/0x160 [ 2073.019252] elevator_exit+0xc8/0xd0 [ 2073.019256] blk_exit_queue+0x50/0x88 [ 2073.019259] blk_cleanup_queue+0x228/0x3d8 [ 2073.019267] null_del_dev+0xfc/0x1e0 [null_blk] [ 2073.019274] null_exit+0x90/0x114 [null_blk] [ 2073.019278] __arm64_sys_delete_module+0x358/0x5a0 [ 2073.019282] el0_svc_common+0xc8/0x320 [ 2073.019287] el0_svc_handler+0xf8/0x160 [ 2073.019290] el0_svc+0x10/0x218 [ 2073.019291] [ 2073.019294] Allocated by task 14163: [ 2073.019301] kasan_kmalloc+0xe0/0x190 [ 2073.019305] kmem_cache_alloc_node_trace+0x1cc/0x418 [ 2073.019308] bfq_pd_alloc+0x54/0x118 [ 2073.019313] blkcg_activate_policy+0x250/0x460 [ 2073.019317] bfq_create_group_hierarchy+0x38/0x110 [ 2073.019321] bfq_init_queue+0x6d0/0x948 [ 2073.019325] blk_mq_init_sched+0x1d8/0x390 [ 2073.019330] elevator_switch_mq+0x88/0x170 [ 2073.019334] elevator_switch+0x140/0x270 [ 2073.019338] elv_iosched_store+0x1a4/0x2a0 [ 2073.019342] queue_attr_store+0x90/0xe0 [ 2073.019348] sysfs_kf_write+0xa8/0xe8 [ 2073.019351] kernfs_fop_write+0x1f8/0x378 [ 2073.019359] __vfs_write+0xe0/0x360 [ 2073.019363] vfs_write+0xf0/0x270 [ 2073.019367] ksys_write+0xdc/0x1b8 [ 2073.019371] __arm64_sys_write+0x50/0x60 [ 2073.019375] el0_svc_common+0xc8/0x320 [ 2073.019380] el0_svc_handler+0xf8/0x160 [ 2073.019383] el0_svc+0x10/0x218 [ 2073.019385] [ 2073.019387] Freed by task 72584: [ 2073.019391] __kasan_slab_free+0x120/0x228 [ 2073.019394] kasan_slab_free+0x10/0x18 [ 2073.019397] kfree+0x94/0x368 [ 2073.019400] bfqg_put+0x64/0xb0 [ 2073.019404] bfqg_and_blkg_put+0x90/0xb0 [ 2073.019408] bfq_put_queue+0x220/0x228 [ 2073.019413] __bfq_put_async_bfqq+0x98/0x168 [ 2073.019416] bfq_put_async_queues+0xbc/0x208 [ 2073.019420] bfq_pd_offline+0x178/0x238 [ 2073.019424] blkcg_deactivate_policy+0x1f0/0x420 [ 2073.019429] bfq_exit_queue+0x128/0x178 [ 2073.019433] blk_mq_exit_sched+0x12c/0x160 [ 2073.019437] elevator_exit+0xc8/0xd0 [ 2073.019440] blk_exit_queue+0x50/0x88 [ 2073.019443] blk_cleanup_queue+0x228/0x3d8 [ 2073.019451] null_del_dev+0xfc/0x1e0 [null_blk] [ 2073.019459] null_exit+0x90/0x114 [null_blk] [ 2073.019462] __arm64_sys_delete_module+0x358/0x5a0 [ 2073.019467] el0_svc_common+0xc8/0x320 [ 2073.019471] el0_svc_handler+0xf8/0x160 [ 2073.019474] el0_svc+0x10/0x218 [ 2073.019475] [ 2073.019479] The buggy address belongs to the object at ffff8000ccf63f00 which belongs to the cache kmalloc-1024 of size 1024 [ 2073.019484] The buggy address is located 552 bytes inside of 1024-byte region [ffff8000ccf63f00, ffff8000ccf64300) [ 2073.019486] The buggy address belongs to the page: [ 2073.019492] page:ffff7e000333d800 count:1 mapcount:0 mapping:ffff8000c0003a00 index:0x0 compound_mapcount: 0 [ 2073.020123] flags: 0x7ffff0000008100(slab|head) [ 2073.020403] raw: 07ffff0000008100 ffff7e0003334c08 ffff7e00001f5a08 ffff8000c0003a00 [ 2073.020409] raw: 0000000000000000 00000000001c001c 00000001ffffffff 0000000000000000 [ 2073.020411] page dumped because: kasan: bad access detected [ 2073.020412] [ 2073.020414] Memory state around the buggy address: [ 2073.020420] ffff8000ccf64000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2073.020424] ffff8000ccf64080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2073.020428] >ffff8000ccf64100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2073.020430] ^ [ 2073.020434] ffff8000ccf64180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2073.020438] ffff8000ccf64200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2073.020439] ================================================================== The same problem exist in mainline as well. This is because oom_bfqq is moved to a non-root group, thus root_group is freed earlier. Thus fix the problem by don't move oom_bfqq. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220129015924.3958918-4-yukuai3@huawei.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
Moving bfqq to it's parent bfqg is pointless. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20220129015924.3958918-3-yukuai3@huawei.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
Use bfq_group() instead, which do the same thing. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220129015924.3958918-2-yukuai3@huawei.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
- 17 Feb, 2022 12 commits
-
-
Yahu Gao authored
The return value is ioprio * BFQ_WEIGHT_CONVERSION_COEFF or 0. What we want is ioprio or 0. Correct this by changing the calculation. Signed-off-by: Yahu Gao <gaoyahu19@gmail.com> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220107065859.25689-1-gaoyahu19@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
David Jeffery authored
When blk_mq_delay_run_hw_queues sets an hctx to run in the future, it can reset the delay length for an already pending delayed work run_work. This creates a scenario where multiple hctx may have their queues set to run, but if one runs first and finds nothing to do, it can reset the delay of another hctx and stall the other hctx's ability to run requests. To avoid this I/O stall when an hctx's run_work is already pending, leave it untouched to run at its current designated time rather than extending its delay. The work will still run which keeps closed the race calling blk_mq_delay_run_hw_queues is needed for while also avoiding the I/O stall. Signed-off-by: David Jeffery <djeffery@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220131203337.GA17666@redhatSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Implement the ->free_disk method to free the virtio_blk structure only once the last gendisk reference goes away instead of keeping a local refcount. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20220215094514.3828912-6-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Implement the ->free_disk method to free the msb_data structure only once the last gendisk reference goes away instead of keeping a local refcount. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220215094514.3828912-5-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Use set_disk_ro to propagate the read-only state to the block layer instead of checking for it in ->open and leaking a reference in case of a read-only device. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220215094514.3828912-4-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Implement the ->free_disk method to free the msb_data structure only once the last gendisk reference goes away instead of keeping a local refcount. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220215094514.3828912-3-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Add a method to notify the driver that the gendisk is about to be freed. This allows drivers to tie the lifetime of their private data to that of the gendisk and thus deal with device removal races without expensive synchronization and boilerplate code. A new flag is added so that ->free_disk is only called after a successful call to add_disk, which significantly simplifies the error handling path during probing. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220215094514.3828912-2-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
Revert commit 4f1e9630 ("blk-throtl: optimize IOPS throttle for large IO scenarios") since we have another easier way to address this issue and get better iops throttling result. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220216044514.2903784-9-ming.lei@redhat.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
We need to throttle split bio in case of IOPS limit even though the split bio has been marked as BIO_THROTTLED since block layer accounts split bio actually. If only throughput throttle is setup, no need to throttle any more if BIO_THROTTLED is set since we have accounted & considered the whole bio bytes already. Add one flag of THROTL_TG_HAS_IOPS_LIMIT for serving this purpose. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220216044514.2903784-8-ming.lei@redhat.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
Commit 111be883 ("block-throttle: avoid double charge") marks bio as BIO_THROTTLED unconditionally if __blk_throtl_bio() is called on this bio, then this bio won't be called into __blk_throtl_bio() any more. This way is to avoid double charge in case of bio splitting. It is reasonable for read/write throughput limit, but not reasonable for IOPS limit because block layer provides io accounting against split bio. Chunguang Xu has already observed this issue and fixed it in commit 4f1e9630 ("blk-throtl: optimize IOPS throttle for large IO scenarios"). However, that patch only covers bio splitting in __blk_queue_split(), and we have other kind of bio splitting, such as bio_split() & submit_bio_noacct() and other ways. This patch tries to fix the issue in one generic way by always charging the bio for iops limit in blk_throtl_bio(). This way is reasonable: re-submission & fast-cloned bio is charged if it is submitted to same disk/queue, and BIO_THROTTLED will be cleared if bio->bi_bdev is changed. This new approach can get much more smooth/stable iops limit compared with commit 4f1e9630 ("blk-throtl: optimize IOPS throttle for large IO scenarios") since that commit can't throttle current split bios actually. Also this way won't cause new double bio iops charge in blk_throtl_dispatch_work_fn() in which blk_throtl_bio() won't be called any more. Reported-by: Ning Li <lining2020x@163.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Chunguang Xu <brookxu@tencent.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220216044514.2903784-7-ming.lei@redhat.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
Now submit_bio_checks() is only called by submit_bio_noacct(), so merge it into submit_bio_noacct(). Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220216044514.2903784-6-ming.lei@redhat.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
The bio has been checked already before throttling, so no need to check it again before dispatching it from throttle queue. Add a helper of submit_bio_noacct_nocheck() for this purpose. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220216044514.2903784-5-ming.lei@redhat.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-