- 25 Jul, 2020 15 commits
-
-
Coly Li authored
We have struct cache_sb_disk for on-disk super block already, it is unnecessary to keep the in-memory super block format exactly mapping to the on-disk struct layout. This patch adds code comments to notice that struct cache_sb is not exactly mapping to cache_sb_disk, and removes the useless member csum and pad[5]. Although struct cache_sb does not belong to uapi, but there are still some on-disk format related macros reference it and it is unncessary to get rid of such dependency now. So struct cache_sb will continue to stay in include/uapi/linux/bache.h for now. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
Setting sb->first_bucket and checking sb->keys indeed are only for cache device, it does not make sense to do them in read_super() for backing device too. This patch moves the related code piece into read_super_common() explicitly for cache device and avoid the confusion. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
The new added super block version BCACHE_SB_VERSION_BDEV_WITH_FEATURES (5) BCACHE_SB_VERSION_CDEV_WITH_FEATURES value (6), is for the feature set bits. Devices have super block version equal to the new version will have three new members for feature set bits in the on-disk super block, __le64 feature_compat; __le64 feature_incompat; __le64 feature_ro_compat; They are used for further new features which may introduce on-disk format change, and avoid unncessary super block version increase. The very basic features handling code skeleton is also initialized in this patch. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
In register_cache_set(), c is pointer to struct cache_set, and ca is pointer to struct cache, if ca->sb.seq > c->sb.seq, it means this registering cache has up to date version and other members, the in- memory version and other members should be updated to the newer value. But current implementation makes a cache set only has a single cache device, so the above assumption works well except for a special case. The execption is when a cache device new created and both ca->sb.seq and c->sb.seq are 0, because the super block is never flushed out yet. In the location for the following if() check, 2156 if (ca->sb.seq > c->sb.seq) { 2157 c->sb.version = ca->sb.version; 2158 memcpy(c->sb.set_uuid, ca->sb.set_uuid, 16); 2159 c->sb.flags = ca->sb.flags; 2160 c->sb.seq = ca->sb.seq; 2161 pr_debug("set version = %llu\n", c->sb.version); 2162 } c->sb.version is not initialized yet and valued 0. When ca->sb.seq is 0, the if() check will fail (because both values are 0), and the cache set version, set_uuid, flags and seq won't be updated. The above problem is hiden for current code, because the bucket size is compatible among different super block version. And the next time when running cache set again, ca->sb.seq will be larger than 0 and cache set super block version will be updated properly. But if the large bucket feature is enabled, sb->bucket_size is the low 16bits of the bucket size. For a power of 2 value, when the actual bucket size exceeds 16bit width, sb->bucket_size will always be 0. Then read_super_common() will fail because the if() check to is_power_of_2(sb->bucket_size) is false. This is how the long time hidden bug is triggered. This patch modifies the if() check to the following way, 2156 if (ca->sb.seq > c->sb.seq || c->sb.seq == 0) { Then cache set's version, set_uuid, flags and seq will always be updated corectly including for a new created cache device. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
In bch_cache_set_alloc() there is a big if() checks combined by 11 items together. When this big if() statement fails, it is difficult to tell exactly which item fails indeed. This patch disassembles this big if() checks into 11 single if() checks, which makes code debug more easier. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
The improperly set bucket or block size will trigger error in read_super_common(). For large bucket size, a more accurate error message for invalid bucket or block size is necessary. This patch disassembles the combined if() checks into multiple single if() check, and provide more accurate error message for each check failure condition. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
Later patches will introduce feature set bits to on-disk super block and increase super block version. Current code in read_super() which reads common part of super block for version BCACHE_SB_VERSION_CDEV and version BCACHE_SB_VERSION_CDEV_WITH_UUID will be shared with the new version. Therefore this patch moves the reusable part into read_super_common(), this preparation patch will make later patches more simplier and only focus on new feature set bits. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
offset_to_stripe() returns the stripe number (in type unsigned int) from an offset (in type uint64_t) by the following calculation, do_div(offset, d->stripe_size); For large capacity backing device (e.g. 18TB) with small stripe size (e.g. 4KB), the result is 4831838208 and exceeds UINT_MAX. The actual returned value which caller receives is 536870912, due to the overflow. Indeed in bcache_device_init(), bcache_device->nr_stripes is limited in range [1, INT_MAX]. Therefore all valid stripe numbers in bcache are in range [0, bcache_dev->nr_stripes - 1]. This patch adds a upper limition check in offset_to_stripe(): the max valid stripe number should be less than bcache_device->nr_stripes. If the calculated stripe number from do_div() is equal to or larger than bcache_device->nr_stripe, -EINVAL will be returned. (Normally nr_stripes is less than INT_MAX, exceeding upper limitation doesn't mean overflow, therefore -EOVERFLOW is not used as error code.) This patch also changes nr_stripes' type of struct bcache_device from 'unsigned int' to 'int', and return value type of offset_to_stripe() from 'unsigned int' to 'int', to match their exact data ranges. All locations where bcache_device->nr_stripes and offset_to_stripe() are referenced also get updated for the above type change. Reported-and-tested-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Link: https://bugzilla.redhat.com/show_bug.cgi?id=1783075Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
For some block devices which large capacity (e.g. 8TB) but small io_opt size (e.g. 8 sectors), in bcache_device_init() the stripes number calcu- lated by, DIV_ROUND_UP_ULL(sectors, d->stripe_size); might be overflow to the unsigned int bcache_device->nr_stripes. This patch uses the uint64_t variable to store DIV_ROUND_UP_ULL() and after the value is checked to be available in unsigned int range, sets it to bache_device->nr_stripes. Then the overflow is avoided. Reported-and-tested-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Link: https://bugzilla.redhat.com/show_bug.cgi?id=1783075Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Gustavo A. R. Silva authored
Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. This code was detected with the help of Coccinelle and, audited and fixed manually. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Gustavo A. R. Silva authored
Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. This code was detected with the help of Coccinelle and, audited and fixed manually. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Xu Wang authored
Remove unneeded variable i in bch_dirty_init_thread(). Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Xu Wang authored
Using for_each_clear_bit() to simplify the code. Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
There are some meta data of bcache are allocated by multiple pages, and they are used as bio bv_page for I/Os to the cache device. for example cache_set->uuids, cache->disk_buckets, journal_write->data, bset_tree->data. For such meta data memory, all the allocated pages should be treated as a single memory block. Then the memory management and underlying I/O code can treat them more clearly. This patch adds __GFP_COMP flag to all the location allocating >0 order pages for the above mentioned meta data. Then their pages are treated as compound pages now. Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jean Delvare authored
registraion -> registration Fixes: 0c8d3fce ("bcache: configure the asynchronous registertion to be experimental") Signed-off-by: Jean Delvare <jdelvare@suse.de> Reviewed-by: Coly Li <colyli@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 24 Jul, 2020 1 commit
-
-
Jens Axboe authored
Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.9/drivers Pull MD fix from Song. * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md/raid5: use do_div() for 64 bit divisions in raid5_sync_request
-
- 23 Jul, 2020 1 commit
-
-
Yufen Yu authored
We get compilation error on 32-bit architectures (e.g. m68k), as: ERROR: modpost: "__udivdi3" [drivers/md/raid456.ko] undefined! Since 'sync_blocks' is defined as u64, use do_div() to fix this error. Fixes: c911c46c ("md/raid456: convert macro STRIPE_* to RAID5_STRIPE_*") Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
- 22 Jul, 2020 8 commits
-
-
Jens Axboe authored
Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.9/drivers Pull MD for 5.9 from Song. * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md/raid10: avoid deadlock on recovery. raid: md_p.h: drop duplicated word in a comment md-cluster: fix rmmod issue when md_cluster convert bitmap to none md-cluster: fix safemode_delay value when converting to clustered bitmap md/raid5: support config stripe_size by sysfs entry md/raid5: set default stripe_size as 4096 md/raid456: convert macro STRIPE_* to RAID5_STRIPE_* raid5: remove the meaningless check in raid5_make_request raid5: put the comment of clear_batch_ready to the right place raid5: call clear_batch_ready before set STRIPE_ACTIVE md: raid10: Fix compilation warning md: raid5: Fix compilation warning md: raid5-cache: Remove set but unused variable md: Fix compilation warning
-
Vitaly Mayatskikh authored
When disk failure happens and the array has a spare drive, resync thread kicks in and starts to refill the spare. However it may get blocked by a retry thread that resubmits failed IO to a mirror and itself can get blocked on a barrier raised by the resync thread. Acked-by: Nigel Croxon <ncroxon@redhat.com> Signed-off-by: Vitaly Mayatskikh <vmayatskikh@digitalocean.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Randy Dunlap authored
Drop the doubled word "the" in a comment. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Song Liu <song@kernel.org> Cc: linux-raid@vger.kernel.org Signed-off-by: Song Liu <songliubraving@fb.com>
-
Zhao Heming authored
update_array_info misses calling module_put when removing cluster bitmap. steps to reproduce: ``` node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda /dev/sdb mdadm: array /dev/md0 started. node1 # lsmod | egrep "dlm|md_|raid1" md_cluster 28672 1 dlm 212992 14 md_cluster configfs 57344 2 dlm raid1 53248 1 md_mod 176128 2 raid1,md_cluster node1 # mdadm -G /dev/md0 -b none node1 # lsmod | egrep "dlm|md_|raid1" md_cluster 28672 1 <== should be zero dlm 212992 9 md_cluster configfs 57344 2 dlm raid1 53248 1 md_mod 176128 2 raid1,md_cluster node1 # mdadm -G /dev/md0 -b clustered node1 # lsmod | egrep "dlm|md_|raid1" md_cluster 28672 2 <== increase dlm 212992 14 md_cluster configfs 57344 2 dlm raid1 53248 1 md_mod 176128 2 raid1,md_cluster node1 # mdadm -G /dev/md0 -b none node1 # mdadm -G /dev/md0 -b clustered node1 # lsmod | egrep "dlm|md_|raid1" md_cluster 28672 3 <== increase dlm 212992 14 md_cluster configfs 57344 2 dlm raid1 53248 1 md_mod 176128 2 raid1,md_cluster ``` Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Zhao Heming authored
When array convert to clustered bitmap, the safe_mode_delay doesn't clean and vice versa. the /sys/block/mdX/md/safe_mode_delay keep original value after changing bitmap type. In safe_delay_store(), the code forbids setting mddev->safemode_delay when array is clustered. So in cluster-md env, the expected safemode_delay value should be 0. Reproducible steps: ``` node1 # mdadm --zero-superblock /dev/sd{b,c,d} node1 # mdadm -C /dev/md0 -b internal -e 1.2 -n 2 -l mirror /dev/sdb /dev/sdc node1 # cat /sys/block/md0/md/safe_mode_delay 0.204 node1 # mdadm -G /dev/md0 -b none node1 # mdadm --grow /dev/md0 --bitmap=clustered node1 # cat /sys/block/md0/md/safe_mode_delay 0.204 <== doesn't change, should ZERO for cluster-md node1 # mdadm --zero-superblock /dev/sd{b,c,d} node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdb /dev/sdc node1 # cat /sys/block/md0/md/safe_mode_delay 0.000 node1 # mdadm -G /dev/md0 -b none node1 # cat /sys/block/md0/md/safe_mode_delay 0.000 <== doesn't change, should default value ``` Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Yufen Yu authored
Adding a new 'stripe_size' sysfs entry to set and show stripe_size. stripe_size should not be bigger than PAGE_SIZE, and it requires to be multiple of 4096. We can adjust stripe_size by writing value into sysfs entry, likely, set stripe_size as 16KB: echo 16384 > /sys/block/md1/md/stripe_size Show current stripe_size value: cat /sys/block/md1/md/stripe_size For PAGE_SIZE is equal to 4096, 'stripe_size' can just be read. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Yufen Yu authored
In RAID5, if issued bio size is bigger than stripe_size, it will be split in the unit of stripe_size and process them one by one. Even for size less then stripe_size, RAID5 also request data from disk at least of stripe_size. Nowdays, stripe_size is equal to the value of PAGE_SIZE. Since filesystem usually issue bio in the unit of 4KB, there is no problem for PAGE_SIZE as 4KB. But, for 64KB PAGE_SIZE, bio from filesystem requests 4KB data while RAID5 issue IO at least stripe_size (64KB) each time. That will waste resource of disk bandwidth and compute xor. To avoding the waste, we want to make stripe_size configurable. This patch just set default stripe_size as 4096. User can also set the value bigger than 4KB for some special requirements, such as we know the issued io size is more than 4KB. To evaluate the new feature, we create raid5 device '/dev/md5' with 4 SSD disk and test it on arm64 machine with 64KB PAGE_SIZE. 1) We format /dev/md5 with mkfs.ext4 and mount ext4 with default configure on /mnt directory. Then, trying to test it by dbench with command: dbench -D /mnt -t 1000 10. Result show as: 'stripe_size = 64KB' Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 9805011 0.021 64.728 Close 7202525 0.001 0.120 Rename 415213 0.051 44.681 Unlink 1980066 0.079 93.147 Deltree 240 1.793 6.516 Mkdir 120 0.004 0.007 Qpathinfo 8887512 0.007 37.114 Qfileinfo 1557262 0.001 0.030 Qfsinfo 1629582 0.012 0.152 Sfileinfo 798756 0.040 57.641 Find 3436004 0.019 57.782 WriteX 4887239 0.021 57.638 ReadX 15370483 0.005 37.818 LockX 31934 0.003 0.022 UnlockX 31933 0.001 0.021 Flush 687205 13.302 530.088 Throughput 307.799 MB/sec 10 clients 10 procs max_latency=530.091 ms ------------------------------------------------------- 'stripe_size = 4KB' Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 11999166 0.021 36.380 Close 8814128 0.001 0.122 Rename 508113 0.051 29.169 Unlink 2423242 0.070 38.141 Deltree 300 1.885 7.155 Mkdir 150 0.004 0.006 Qpathinfo 10875921 0.007 35.485 Qfileinfo 1905837 0.001 0.032 Qfsinfo 1994304 0.012 0.125 Sfileinfo 977450 0.029 26.489 Find 4204952 0.019 9.361 WriteX 5981890 0.019 27.804 ReadX 18809742 0.004 33.491 LockX 39074 0.003 0.025 UnlockX 39074 0.001 0.014 Flush 841022 10.712 458.848 Throughput 376.777 MB/sec 10 clients 10 procs max_latency=458.852 ms ------------------------------------------------------- It show that setting stripe_size as 4KB has higher thoughput, i.e. (376.777 vs 307.799) and has smaller latency than that setting as 64KB. 2) We try to evaluate IO throughput for /dev/md5 by fio with config: [4KB randwrite] direct=1 numjob=2 iodepth=64 ioengine=libaio filename=/dev/md5 bs=4KB rw=randwrite [64KB write] direct=1 numjob=2 iodepth=64 ioengine=libaio filename=/dev/md5 bs=1MB rw=write The result as follow: + + | stripe_size(64KB) | stripe_size(4KB) +----------------------------------------------------+ 4KB randwrite | 15MB/s | 100MB/s +----------------------------------------------------+ 1MB write | 1000MB/s | 700MB/s The result show that when size of io is bigger than 4KB (64KB), 64KB stripe_size has much higher IOPS. But for 4KB randwrite, that means, size of io issued to device are smaller, 4KB stripe_size have better performance. Normally, default value (4096) can get relatively good performance. But if each issued io is bigger than 4096, setting value more than 4096 may get better performance. Here, we just set default stripe_size as 4096, and we will try to support setting different stripe_size by sysfs interface in the following patch. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Yufen Yu authored
Convert macro STRIPE_SIZE, STRIPE_SECTORS and STRIPE_SHIFT to RAID5_STRIPE_SIZE(), RAID5_STRIPE_SECTORS() and RAID5_STRIPE_SHIFT(). This patch is prepare for the following adjustable stripe_size. It will not change any existing functionality. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
- 16 Jul, 2020 7 commits
-
-
Guoqing Jiang authored
We can't guarntee the batched stripe to be set with STRIPE_HANDLE since there are lots of functions could set the flag, such as sync_request, ops_complete_* and end_{read,write}_request etc. Also clear_batch_ready called in handle_stripe ensures the batched list can't continue to be handled by handle_stripe. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Guoqing Jiang authored
To make people understand the function well, let's put the comment to the right place. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Guoqing Jiang authored
We tried to only put the head sh of batch list to handle_list, then the handle_stripe doesn't handle other members in the batch list. However, we still got the calltrace in break_stripe_batch_list. [593764.644269] stripe state: 2003 kernel: [593764.644299] ------------[ cut here ]------------ kernel: [593764.644308] WARNING: CPU: 12 PID: 856 at drivers/md/raid5.c:4625 break_stripe_batch_list+0x203/0x240 [raid456] [...] kernel: [593764.644363] Call Trace: kernel: [593764.644370] handle_stripe+0x907/0x20c0 [raid456] kernel: [593764.644376] ? __wake_up_common_lock+0x89/0xc0 kernel: [593764.644379] handle_active_stripes.isra.57+0x35f/0x570 [raid456] kernel: [593764.644382] ? raid5_wakeup_stripe_thread+0x96/0x1f0 [raid456] kernel: [593764.644385] raid5d+0x480/0x6a0 [raid456] kernel: [593764.644390] ? md_thread+0x11f/0x160 kernel: [593764.644392] md_thread+0x11f/0x160 kernel: [593764.644394] ? wait_woken+0x80/0x80 kernel: [593764.644396] kthread+0xfc/0x130 kernel: [593764.644398] ? find_pers+0x70/0x70 kernel: [593764.644399] ? kthread_create_on_node+0x70/0x70 kernel: [593764.644401] ret_from_fork+0x1f/0x30 As we can see, the stripe was set with STRIPE_ACTIVE and STRIPE_HANDLE, and only handle_stripe could set those flags then return. And since the stipe was already in the batch list, we need to return earlier before set the two flags. And after dig a little about git history especially commit 3664847d ("md/raid5: fix a race condition in stripe batch"), it seems the batched stipe still could be handled by handle_stipe, then handle_stipe needs to return earlier if clear_batch_ready to return true. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Damien Le Moal authored
Remove the if statement around the call to sysfs_link_rdev() in raid10_start_reshape() to avoid the compilation warning: warning: suggest braces around empty body in an ‘if’ statement when compiling with W=1. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Damien Le Moal authored
Remove the if statement around the calls to sysfs_link_rdev() to avoid the compilation warning "suggest braces around empty body in an ‘if’ statement" when compiling with W=1. Also fix function description comments to avoid kdoc format warnings. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Damien Le Moal authored
Remove the variable offset in r5c_tree_index() to avoid a "set but not used" compilation warning when compiling with W=1. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Damien Le Moal authored
Remove the if statement around the calls to sysfs_link_rdev() to avoid the compilation warnings: warning: suggest braces around empty body in an ‘if’ statement when compiling with W=1. For the call to sysfs_create_link() generating the same warning, use the err variable to store the function result, avoiding triggering another warning as the function is declared as 'warn_unused_result'. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
- 15 Jul, 2020 8 commits
-
-
Niklas Cassel authored
Add a new max_active zones definition in the sysfs documentation. This definition will be common for all devices utilizing the zoned block device support in the kernel. Export max_active_zones according to this new definition for NVMe Zoned Namespace devices, ZAC ATA devices (which are treated as SCSI devices by the kernel), and ZBC SCSI devices. Add the new max_active_zones member to struct request_queue, rather than as a queue limit, since this property cannot be split across stacking drivers. For SCSI devices, even though max active zones is not part of the ZBC/ZAC spec, export max_active_zones as 0, signifying "no limit". Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Javier González <javier@javigon.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Niklas Cassel authored
Add a new max_open_zones definition in the sysfs documentation. This definition will be common for all devices utilizing the zoned block device support in the kernel. Export max open zones according to this new definition for NVMe Zoned Namespace devices, ZAC ATA devices (which are treated as SCSI devices by the kernel), and ZBC SCSI devices. Add the new max_open_zones member to struct request_queue, rather than as a queue limit, since this property cannot be split across stacking drivers. Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Javier González <javier@javigon.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.9/drivers Pull MD fixes from Song. * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md-cluster: fix wild pointer of unlock_all_bitmaps() md/raid5-cache: clear MD_SB_CHANGE_PENDING before flushing stripes md: fix deadlock causing by sysfs_notify md: improve io stats accounting md: raid0/linear: fix dereference before null check on pointer mddev
-
Gustavo A. R. Silva authored
Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. Also, remove unnecessary variable _datasize_. This code was detected with the help of Coccinelle and, audited and fixed manually. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Stefan Haberland authored
During initialization of the DASD DIAG driver a request is issued that has a bio structure that resides on the stack. With virtually mapped kernel stacks this bio address might be in virtual storage which is unsuitable for usage with the diag250 call. In this case the device can not be set online using the DIAG discipline and fails with -EOPNOTSUP. In the system journal the following error message is presented: dasd: X.X.XXXX Setting the DASD online with discipline DIAG failed with rc=-95 Fix by allocating the bio structure instead of having it on the stack. Fixes: ce3dc447 ("s390: add support for virtually mapped kernel stacks") Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com> Cc: stable@vger.kernel.org #4.20 Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Zhao Heming authored
reproduction steps: ``` node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda /dev/sdb node2 # mdadm -A /dev/md0 /dev/sda /dev/sdb node1 # mdadm -G /dev/md0 -b none mdadm: failed to remove clustered bitmap. node1 # mdadm -S --scan ^C <==== mdadm hung & kernel crash ``` kernel stack: ``` [ 335.230657] general protection fault: 0000 [#1] SMP NOPTI [...] [ 335.230848] Call Trace: [ 335.230873] ? unlock_all_bitmaps+0x5/0x70 [md_cluster] [ 335.230886] unlock_all_bitmaps+0x3d/0x70 [md_cluster] [ 335.230899] leave+0x10f/0x190 [md_cluster] [ 335.230932] ? md_super_wait+0x93/0xa0 [md_mod] [ 335.230947] ? leave+0x5/0x190 [md_cluster] [ 335.230973] md_cluster_stop+0x1a/0x30 [md_mod] [ 335.230999] md_bitmap_free+0x142/0x150 [md_mod] [ 335.231013] ? _cond_resched+0x15/0x40 [ 335.231025] ? mutex_lock+0xe/0x30 [ 335.231056] __md_stop+0x1c/0xa0 [md_mod] [ 335.231083] do_md_stop+0x160/0x580 [md_mod] [ 335.231119] ? 0xffffffffc05fb078 [ 335.231148] md_ioctl+0xa04/0x1930 [md_mod] [ 335.231165] ? filename_lookup+0xf2/0x190 [ 335.231179] blkdev_ioctl+0x93c/0xa10 [ 335.231205] ? _cond_resched+0x15/0x40 [ 335.231214] ? __check_object_size+0xd4/0x1a0 [ 335.231224] block_ioctl+0x39/0x40 [ 335.231243] do_vfs_ioctl+0xa0/0x680 [ 335.231253] ksys_ioctl+0x70/0x80 [ 335.231261] __x64_sys_ioctl+0x16/0x20 [ 335.231271] do_syscall_64+0x65/0x1f0 [ 335.231278] entry_SYSCALL_64_after_hwframe+0x44/0xa9 ``` Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Song Liu authored
In recovery, if we process too much data, raid5-cache may set MD_SB_CHANGE_PENDING, which causes spinning in handle_stripe(). Fix this issue by clearing the bit before flushing data only stripes. This issue was initially discussed in [1]. [1] https://www.spinics.net/lists/raid/msg64409.htmlSigned-off-by: Song Liu <songliubraving@fb.com>
-
Junxiao Bi authored
The following deadlock was captured. The first process is holding 'kernfs_mutex' and hung by io. The io was staging in 'r1conf.pending_bio_list' of raid1 device, this pending bio list would be flushed by second process 'md127_raid1', but it was hung by 'kernfs_mutex'. Using sysfs_notify_dirent_safe() to replace sysfs_notify() can fix it. There were other sysfs_notify() invoked from io path, removed all of them. PID: 40430 TASK: ffff8ee9c8c65c40 CPU: 29 COMMAND: "probe_file" #0 [ffffb87c4df37260] __schedule at ffffffff9a8678ec #1 [ffffb87c4df372f8] schedule at ffffffff9a867f06 #2 [ffffb87c4df37310] io_schedule at ffffffff9a0c73e6 #3 [ffffb87c4df37328] __dta___xfs_iunpin_wait_3443 at ffffffffc03a4057 [xfs] #4 [ffffb87c4df373a0] xfs_iunpin_wait at ffffffffc03a6c79 [xfs] #5 [ffffb87c4df373b0] __dta_xfs_reclaim_inode_3357 at ffffffffc039a46c [xfs] #6 [ffffb87c4df37400] xfs_reclaim_inodes_ag at ffffffffc039a8b6 [xfs] #7 [ffffb87c4df37590] xfs_reclaim_inodes_nr at ffffffffc039bb33 [xfs] #8 [ffffb87c4df375b0] xfs_fs_free_cached_objects at ffffffffc03af0e9 [xfs] #9 [ffffb87c4df375c0] super_cache_scan at ffffffff9a287ec7 #10 [ffffb87c4df37618] shrink_slab at ffffffff9a1efd93 #11 [ffffb87c4df37700] shrink_node at ffffffff9a1f5968 #12 [ffffb87c4df37788] do_try_to_free_pages at ffffffff9a1f5ea2 #13 [ffffb87c4df377f0] try_to_free_mem_cgroup_pages at ffffffff9a1f6445 #14 [ffffb87c4df37880] try_charge at ffffffff9a26cc5f #15 [ffffb87c4df37920] memcg_kmem_charge_memcg at ffffffff9a270f6a #16 [ffffb87c4df37958] new_slab at ffffffff9a251430 #17 [ffffb87c4df379c0] ___slab_alloc at ffffffff9a251c85 #18 [ffffb87c4df37a80] __slab_alloc at ffffffff9a25635d #19 [ffffb87c4df37ac0] kmem_cache_alloc at ffffffff9a251f89 #20 [ffffb87c4df37b00] alloc_inode at ffffffff9a2a2b10 #21 [ffffb87c4df37b20] iget_locked at ffffffff9a2a4854 #22 [ffffb87c4df37b60] kernfs_get_inode at ffffffff9a311377 #23 [ffffb87c4df37b80] kernfs_iop_lookup at ffffffff9a311e2b #24 [ffffb87c4df37ba8] lookup_slow at ffffffff9a290118 #25 [ffffb87c4df37c10] walk_component at ffffffff9a291e83 #26 [ffffb87c4df37c78] path_lookupat at ffffffff9a293619 #27 [ffffb87c4df37cd8] filename_lookup at ffffffff9a2953af #28 [ffffb87c4df37de8] user_path_at_empty at ffffffff9a295566 #29 [ffffb87c4df37e10] vfs_statx at ffffffff9a289787 #30 [ffffb87c4df37e70] SYSC_newlstat at ffffffff9a289d5d #31 [ffffb87c4df37f18] sys_newlstat at ffffffff9a28a60e #32 [ffffb87c4df37f28] do_syscall_64 at ffffffff9a003949 #33 [ffffb87c4df37f50] entry_SYSCALL_64_after_hwframe at ffffffff9aa001ad RIP: 00007f617a5f2905 RSP: 00007f607334f838 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 00007f6064044b20 RCX: 00007f617a5f2905 RDX: 00007f6064044b20 RSI: 00007f6064044b20 RDI: 00007f6064005890 RBP: 00007f6064044aa0 R8: 0000000000000030 R9: 000000000000011c R10: 0000000000000013 R11: 0000000000000246 R12: 00007f606417e6d0 R13: 00007f6064044aa0 R14: 00007f6064044b10 R15: 00000000ffffffff ORIG_RAX: 0000000000000006 CS: 0033 SS: 002b PID: 927 TASK: ffff8f15ac5dbd80 CPU: 42 COMMAND: "md127_raid1" #0 [ffffb87c4df07b28] __schedule at ffffffff9a8678ec #1 [ffffb87c4df07bc0] schedule at ffffffff9a867f06 #2 [ffffb87c4df07bd8] schedule_preempt_disabled at ffffffff9a86825e #3 [ffffb87c4df07be8] __mutex_lock at ffffffff9a869bcc #4 [ffffb87c4df07ca0] __mutex_lock_slowpath at ffffffff9a86a013 #5 [ffffb87c4df07cb0] mutex_lock at ffffffff9a86a04f #6 [ffffb87c4df07cc8] kernfs_find_and_get_ns at ffffffff9a311d83 #7 [ffffb87c4df07cf0] sysfs_notify at ffffffff9a314b3a #8 [ffffb87c4df07d18] md_update_sb at ffffffff9a688696 #9 [ffffb87c4df07d98] md_update_sb at ffffffff9a6886d5 #10 [ffffb87c4df07da8] md_check_recovery at ffffffff9a68ad9c #11 [ffffb87c4df07dd0] raid1d at ffffffffc01f0375 [raid1] #12 [ffffb87c4df07ea0] md_thread at ffffffff9a680348 #13 [ffffb87c4df07f08] kthread at ffffffff9a0b8005 #14 [ffffb87c4df07f50] ret_from_fork at ffffffff9aa00344 Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-