- 24 Sep, 2020 8 commits
-
-
Christoph Hellwig authored
Just checking SB_I_CGROUPWB for cgroup writeback support is enough. Either the file system allocates its own bdi (e.g. btrfs), in which case it is known to support cgroup writeback, or the bdi comes from the block layer, which always supports cgroup writeback. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Drivers shouldn't really mess with the readahead size, as that is a VM concept. Instead set it based on the optimal I/O size by lifting the algorithm from the md driver when registering the disk. Also set bdi->io_pages there as well by applying the same scheme based on max_sectors. To ensure the limits work well for stacking drivers a new helper is added to update the readahead limits from the block limits, which is also called from disk_stack_limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
The raid5 and raid10 drivers currently update the read-ahead size, but not the optimal I/O size on reshape. To prepare for deriving the read-ahead size from the optimal I/O size make sure it is updated as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Set up a readahead size by default, as very few users have a good reason to change it. This means code, ecryptfs, and orangefs now set up the values while they were previously missing it, while ubifs, mtd and vboxsf manually set it to 0 to avoid readahead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd] Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
aoe forces a larger readahead size, but any reason to do larger I/O is not limited to readahead. Also set the optimal I/O size, and remove the local constants in favor of just using SZ_2G. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Inherit the optimal I/O size setting just like the readahead window, as any reason to do larger I/O does not apply to just readahead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Ever since the switch to blk-mq, a lower device not used for VM writeback will not be marked congested, so the check will never trigger. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
The last user of SB_I_MULTIROOT is disappeared with commit f2aedb71 ("NFS: Add fs_context support.") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 23 Sep, 2020 18 commits
-
-
Christoph Hellwig authored
There are no users outside the core block code left now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Use blkdev_get_by_dev instead of bdget + blkdev_get. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
swap_type_of is used for two entirely different purposes: (1) check what swap type a given device/offset corresponds to (2) find the first available swap device that can be written to Mixing both in a single function creates an unreadable mess. Create two separate functions instead, and switch both to pass a dev_t instead of a struct block_device to further simplify the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Just check the dev_t to help simplifying the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Use blkdev_get_by_dev instead of bdgrab + blkdev_get. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Use blkdev_get_by_dev instead of igrab (aka open coded bdgrab) + blkdev_get. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Use blkdev_get_by_dev instead of bdget_disk + blkdev_get. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Stefan Haberland <sth@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Turn binding into a normal dev_t as the struct block device doesn't buy us anything and use blkdev_open_by_dev to actually open it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Use blkdev_get_by_dev instead of bdgrab + blkdev_get. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Replace bdget + blkdev_get by blkdev_get_by_dev. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Remove code which has been dead since the initial commit. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Use blkdev_get_by_dev instead of bdgrab + blkdev_get. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Use blkdev_get_by_dev instead of open coding it using bdget_disk + blkdev_get, and split the code to read the partition table into a separate helper to make it a little more obvious. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
We can only scan for partitions on the whole disk, so move the flag from struct block_device to struct gendisk. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Mike Snitzer authored
It is possible, albeit more unlikely, for a block device to have a non power-of-2 for chunk_sectors (e.g. 10+2 RAID6 with 128K chunk_sectors, which results in a full-stripe size of 1280K. This causes the RAID6's io_opt to be advertised as 1280K, and a stacked device _could_ then be made to use a blocksize, aka chunk_sectors, that matches non power-of-2 io_opt of underlying RAID6 -- resulting in stacked device's chunk_sectors being a non power-of-2). Update blk_queue_chunk_sectors() and blk_max_size_offset() to accommodate drivers that need a non power-of-2 chunk_sectors. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Mike Snitzer authored
Like 'io_opt', blk_stack_limits() should stack 'chunk_sectors' using lcm_not_zero() rather than min_not_zero() -- otherwise the final 'chunk_sectors' could result in sub-optimal alignment of IO to component devices in the IO stack. Also, if 'chunk_sectors' isn't a multiple of 'physical_block_size' then it is a bug in the driver and the device should be flagged as 'misaligned'. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
bmd is allocated using kmalloc in bio_alloc_map_data, so make sure is_null_mapped is properly initialized to false for the !null_mapped case. Fixes: f3256075 ("block: remove the BIO_NULL_MAPPED flag") Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Julia Lawall authored
sg_init_table zeroes its first argument, so the allocation of that argument doesn't have to. the semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x; @@ x = - kzalloc + kmalloc (...) ... sg_init_table(x,...) // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 15 Sep, 2020 5 commits
-
-
Baolin Wang authored
Do not need check the bps or iops limitation if bps or iops is unlimited. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Baolin Wang authored
The tg_may_dispatch() will call tg_with_in_bps_limit() and tg_with_in_iops_limit() to check if we can dispatch a bio or not, which will calculate bps/iops limitation multiple times. But tg_may_dispatch() is always called under queue lock, which means the bps/iops limitation will not change in tg_may_dispatch(). So we can calculate the bps/iops limitation only once, and pass them to tg_with_in_bps_limit() and tg_with_in_iops_limit() to avoid calculating bps/iops limitation repeatedly. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Baolin Wang authored
The 'throtl_grp_quantum' and 'throtl_quantum' are both read-only variables, thus better to use readable macros instead of static variables, which can also save some spaces for .bss area. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Baolin Wang authored
Use readable READ/WRITE macros instead of magic numbers. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Baolin Wang authored
Fix some comments' typos. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 14 Sep, 2020 1 commit
-
-
Tejun Heo authored
adjust_inuse_and_calc_cost() is responsible for reducing the amount of donated weights dynamically in period as the budget runs low. Because we don't want to do full donation calculation in period, we keep latching up inuse by INUSE_ADJ_STEP_PCT of the active weight of the cgroup until the resulting hweight_inuse is satisfactory. Unfortunately, the adj_step calculation was reading the active weight before acquiring ioc->lock. Because the current thread could have lost race to activate the iocg to another thread before entering this function, it may read the active weight as zero before acquiring ioc->lock. When this happens, the adj_step is calculated as zero and the incremental adjustment loop becomes an infinite one. Fix it by fetching the active weight after acquiring ioc->lock. Fixes: b0853ab4 ("blk-iocost: revamp in-period donation snapbacks") Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 11 Sep, 2020 6 commits
-
-
Tejun Heo authored
Conceptually, root_iocg->hweight_donating must be less than WEIGHT_ONE but all hweight calculations round up and thus it may end up >= WEIGHT_ONE triggering divide-by-zero and other issues. Bound the value to avoid surprises. Fixes: e08d02aa ("blk-iocost: implement Andy's method for donation weight updates") Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Song Liu authored
This enables proper statistics in /proc/diskstats for bcache partitions. Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: Coly Li <colyli@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Song Liu authored
This enables proper statistics in /proc/diskstats for md partitions. Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Song Liu authored
These functions can be used to enable iostat for partitions on devices like md, bcache. Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
NVMe shares tagset between fabric queue and admin queue or between connect_q and NS queue, so hctx_may_queue() can be called to allocate request for these queues. Tags can be reserved in these tagset. Before error recovery, there is often lots of in-flight requests which can't be completed, and new reserved request may be needed in error recovery path. However, hctx_may_queue() can always return false because there is too many in-flight requests which can't be completed during error handling. Finally, nothing can proceed. Fix this issue by always allowing reserved tag allocation in hctx_may_queue(). This is reasonable because reserved tags are supposed to always be available. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: David Milburn <dmilburn@redhat.com> Cc: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Tian Tao authored
scsi/sg.h is included more than once, Remove the one that isn't necessary. Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 10 Sep, 2020 2 commits
-
-
Xianting Tian authored
The test and the explaination of the patch as bellow. Before test we added more debug code in blkg_async_bio_workfn(): int count = 0 if (bios.head && bios.head->bi_next) { need_plug = true; blk_start_plug(&plug); } while ((bio = bio_list_pop(&bios))) { /*io_punt is a sysctl user interface to control the print*/ if(io_punt) { printk("[%s:%d] bio start,size:%llu,%d count=%d plug?%d\n", current->comm, current->pid, bio->bi_iter.bi_sector, (bio->bi_iter.bi_size)>>9, count++, need_plug); } submit_bio(bio); } if (need_plug) blk_finish_plug(&plug); Steps that need to be set to trigger *PUNT* io before testing: mount -t btrfs -o compress=lzo /dev/sda6 /btrfs mount -t cgroup2 nodev /cgroup2 mkdir /cgroup2/cg3 echo "+io" > /cgroup2/cgroup.subtree_control echo "8:0 wbps=1048576000" > /cgroup2/cg3/io.max #1000M/s echo $$ > /cgroup2/cg3/cgroup.procs Then use dd command to test btrfs PUNT io in current shell: dd if=/dev/zero of=/btrfs/file bs=64K count=100000 Test hardware environment as below: [root@localhost btrfs]# lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel With above debug code, test command and test environment, I did the tests under 3 different system loads, which are triggered by stress: 1, Run 64 threads by command "stress -c 64 &" [53615.975974] [kworker/u66:18:1490] bio start,size:45583056,8 count=0 plug?1 [53615.975980] [kworker/u66:18:1490] bio start,size:45583064,8 count=1 plug?1 [53615.975984] [kworker/u66:18:1490] bio start,size:45583072,8 count=2 plug?1 [53615.975987] [kworker/u66:18:1490] bio start,size:45583080,8 count=3 plug?1 [53615.975990] [kworker/u66:18:1490] bio start,size:45583088,8 count=4 plug?1 [53615.975993] [kworker/u66:18:1490] bio start,size:45583096,8 count=5 plug?1 ... ... [53615.977041] [kworker/u66:18:1490] bio start,size:45585480,8 count=303 plug?1 [53615.977044] [kworker/u66:18:1490] bio start,size:45585488,8 count=304 plug?1 [53615.977047] [kworker/u66:18:1490] bio start,size:45585496,8 count=305 plug?1 [53615.977050] [kworker/u66:18:1490] bio start,size:45585504,8 count=306 plug?1 [53615.977053] [kworker/u66:18:1490] bio start,size:45585512,8 count=307 plug?1 [53615.977056] [kworker/u66:18:1490] bio start,size:45585520,8 count=308 plug?1 [53615.977058] [kworker/u66:18:1490] bio start,size:45585528,8 count=309 plug?1 2, Run 32 threads by command "stress -c 32 &" [50586.290521] [kworker/u66:6:32351] bio start,size:45806496,8 count=0 plug?1 [50586.290526] [kworker/u66:6:32351] bio start,size:45806504,8 count=1 plug?1 [50586.290529] [kworker/u66:6:32351] bio start,size:45806512,8 count=2 plug?1 [50586.290531] [kworker/u66:6:32351] bio start,size:45806520,8 count=3 plug?1 [50586.290533] [kworker/u66:6:32351] bio start,size:45806528,8 count=4 plug?1 [50586.290535] [kworker/u66:6:32351] bio start,size:45806536,8 count=5 plug?1 ... ... [50586.299640] [kworker/u66:5:32350] bio start,size:45808576,8 count=252 plug?1 [50586.299643] [kworker/u66:5:32350] bio start,size:45808584,8 count=253 plug?1 [50586.299646] [kworker/u66:5:32350] bio start,size:45808592,8 count=254 plug?1 [50586.299649] [kworker/u66:5:32350] bio start,size:45808600,8 count=255 plug?1 [50586.299652] [kworker/u66:5:32350] bio start,size:45808608,8 count=256 plug?1 [50586.299663] [kworker/u66:5:32350] bio start,size:45808616,8 count=257 plug?1 [50586.299665] [kworker/u66:5:32350] bio start,size:45808624,8 count=258 plug?1 [50586.299668] [kworker/u66:5:32350] bio start,size:45808632,8 count=259 plug?1 3, Don't run thread by stress [50861.355246] [kworker/u66:19:32376] bio start,size:13544504,8 count=0 plug?0 [50861.355288] [kworker/u66:19:32376] bio start,size:13544512,8 count=0 plug?0 [50861.355322] [kworker/u66:19:32376] bio start,size:13544520,8 count=0 plug?0 [50861.355353] [kworker/u66:19:32376] bio start,size:13544528,8 count=0 plug?0 [50861.355392] [kworker/u66:19:32376] bio start,size:13544536,8 count=0 plug?0 [50861.355431] [kworker/u66:19:32376] bio start,size:13544544,8 count=0 plug?0 [50861.355468] [kworker/u66:19:32376] bio start,size:13544552,8 count=0 plug?0 [50861.355499] [kworker/u66:19:32376] bio start,size:13544560,8 count=0 plug?0 [50861.355532] [kworker/u66:19:32376] bio start,size:13544568,8 count=0 plug?0 [50861.355575] [kworker/u66:19:32376] bio start,size:13544576,8 count=0 plug?0 [50861.355618] [kworker/u66:19:32376] bio start,size:13544584,8 count=0 plug?0 [50861.355659] [kworker/u66:19:32376] bio start,size:13544592,8 count=0 plug?0 [50861.355740] [kworker/u66:0:32346] bio start,size:13544600,8 count=0 plug?1 [50861.355748] [kworker/u66:0:32346] bio start,size:13544608,8 count=1 plug?1 [50861.355962] [kworker/u66:2:32347] bio start,size:13544616,8 count=0 plug?0 [50861.356272] [kworker/u66:7:31962] bio start,size:13544624,8 count=0 plug?0 [50861.356446] [kworker/u66:7:31962] bio start,size:13544632,8 count=0 plug?0 [50861.356567] [kworker/u66:7:31962] bio start,size:13544640,8 count=0 plug?0 [50861.356707] [kworker/u66:19:32376] bio start,size:13544648,8 count=0 plug?0 [50861.356748] [kworker/u66:15:32355] bio start,size:13544656,8 count=0 plug?0 [50861.356825] [kworker/u66:17:31970] bio start,size:13544664,8 count=0 plug?0 Analysis of above 3 test results with different system load: >From above test, we can see more and more continuous bios can be plugged with system load increasing. When run "stress -c 64 &", 310 continuous bios are plugged; When run "stress -c 32 &", 260 continuous bios are plugged; When don't run stress, at most only 2 continuous bios are plugged, in most cases, bio_list only contains one single bio. How to explain above phenomenon: We know, in submit_bio(), if the bio is a REQ_CGROUP_PUNT io, it will queue a work to workqueue blkcg_punt_bio_wq. But when the workqueue is scheduled, it depends on the system load. When system load is low, the workqueue will be quickly scheduled, and the bio in bio_list will be quickly processed in blkg_async_bio_workfn(), so there is less chance that the same io submit thread can add multiple continuous bios to bio_list before workqueue is scheduled to run. The analysis aligned with above test "3". When system load is high, there is some delay before the workqueue can be scheduled to run, the higher the system load the greater the delay. So there is more chance that the same io submit thread can add multiple continuous bios to bio_list. Then when the workqueue is scheduled to run, there are more continuous bios in bio_list, which will be processed in blkg_async_bio_workfn(). The analysis aligned with above test "1" and "2". According to test, we can get io performance improved with the patch, especially when system load is higher. Another optimazition is to use the plug only when bio_list contains at least 2 bios. Signed-off-by: Xianting Tian <tian.xianting@h3c.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Remove the now unused check_disk_change helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-