- 04 Mar, 2024 40 commits
-
-
Naohiro Aota authored
We disable offloading checksum to workqueues and do it synchronously when the checksum algorithm is fast. However, as reported in the links below, RAID0 with multiple devices may suffer from the sync checksum, because "fast checksum" is still not fast enough to catch up with RAID0 writing. We don't have an effective way to determine whether to offload or not, so for now add a sysfs knob so this can be debugged. This is intentionally under CONFIG_BTRFS_DEBUG so it's not exposed to users, as it may be removed again in the future. Introduce fs_devices->offload_csum_mode, so that a btrfs developer can change the behavior by writing to /sys/fs/btrfs/<uuid>/offload_csum. The default is "auto", which is the same as the previous behavior. Or, you can set "on" or "off" (or "y" or "n", whatever kstrtobool() accepts) to always/never offload checksum. More benchmarks need to be collected with this knob to implement proper criteria to enable/disable checksum offloading. Link: https://lore.kernel.org/linux-btrfs/20230731152223.4EFB.409509F4@e16-tech.com/ Link: https://lore.kernel.org/linux-btrfs/p3vo3g7pqn664mhmdhlotu5dzcna6vjtcoc2hb2lsgo2fwct7k@xzaxclba5tae/ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
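The tri-state knob described above could be parsed roughly as follows. This is a userspace sketch only: the enum, `parse_bool()` and `parse_offload_csum()` are hypothetical names, and `parse_bool()` is a simplified stand-in for the kernel's kstrtobool(), not its actual implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical tri-state mirroring the "auto"/"on"/"off" values described above. */
enum offload_csum_mode {
	OFFLOAD_CSUM_AUTO,
	OFFLOAD_CSUM_FORCE_ON,
	OFFLOAD_CSUM_FORCE_OFF,
};

/*
 * Simplified userspace stand-in for kstrtobool() (an assumption, not the
 * kernel code): only the leading characters are examined, so "y", "1" and
 * "on" are true, while "n", "0" and "off" are false.
 */
static int parse_bool(const char *s, bool *res)
{
	assert(s != NULL && res != NULL);
	switch (s[0]) {
	case 'y': case 'Y': case '1':
		*res = true;
		return 0;
	case 'n': case 'N': case '0':
		*res = false;
		return 0;
	case 'o': case 'O':
		if (s[1] == 'n' || s[1] == 'N') {
			*res = true;
			return 0;
		}
		if (s[1] == 'f' || s[1] == 'F') {
			*res = false;
			return 0;
		}
		break;
	}
	return -1;
}

/* Returns 0 and stores the mode, or -1 if the input is not recognized. */
static int parse_offload_csum(const char *buf, enum offload_csum_mode *mode)
{
	bool val;

	if (strncmp(buf, "auto", 4) == 0) {
		*mode = OFFLOAD_CSUM_AUTO;
		return 0;
	}
	if (parse_bool(buf, &val) != 0)
		return -1;
	*mode = val ? OFFLOAD_CSUM_FORCE_ON : OFFLOAD_CSUM_FORCE_OFF;
	return 0;
}
```

Checking "auto" before the boolean parse keeps the default distinguishable from both forced settings.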
-
Qu Wenruo authored
[BUG]
I have got at least two crash reports for RAID6 syndrome generation. Whether it's AVX2 or SSE2, they all seem to have a similar call trace with a corrupted RAX:

  BUG: kernel NULL pointer dereference, address: 0000000000000000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] PREEMPT SMP PTI
  Workqueue: btrfs-rmw rmw_rbio_work [btrfs]
  RIP: 0010:raid6_sse21_gen_syndrome+0x9e/0x130 [raid6_pq]
  RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffffa0ff4cfa3248
  RDX: 0000000000000000 RSI: ffffa0f74cfa3238 RDI: 0000000000000000
  Call Trace:
   <TASK>
   rmw_rbio+0x5c8/0xa80 [btrfs]
   process_one_work+0x1c7/0x3d0
   worker_thread+0x4d/0x380
   kthread+0xf3/0x120
   ret_from_fork+0x2c/0x50
   </TASK>

[CAUSE]
The cause is not known. Recently I also hit this in the AVX512 path, and that was even in a v5.15 backport, which doesn't have any of my RAID56 rework.

Furthermore, according to the registers:

  RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffffa0ff4cfa3248

The RAX register holds the number of stripes (including PQ), and its value (0) is not correct. But the remaining two registers are sane.

- RBX is the sectorsize
  For x86_64 it should always be 4K, which matches the output.

- RCX is the pointers array
  It comes from rbio->finish_pointers, and it looks like a sane kernel address.

[WORKAROUND]
For now, I can only add extra debug ASSERT()s before we call the raid6 gen_syndrome() helper and hope to catch the problem. The debug code requires both CONFIG_BTRFS_DEBUG and CONFIG_BTRFS_ASSERT to be enabled.

My current guess is some use-after-free, but it doesn't make much sense that every report has only a corrupted RAX and seemingly valid pointers.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
At btrfs_free_tree_block(), we are always initializing a delayed reference to drop the given extent buffer, but we only use it if the extent buffer does not belong to a log root tree. So we are doing unnecessary work here and increasing the duration of a critical section, as this is normally called while holding a lock on the parent tree block (if any) and while holding a log transaction open. So initialize the delayed reference only if the extent buffer is not from a log tree, avoiding unnecessary work and making the code also a bit easier to follow. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
During an incremental send, before determining if we need to send a hole (write operations full of zeroes) we will search for the last extent's end offset if we are at the first slot of a leaf and the last processed extent's end offset is smaller than the current extent's start offset. However, we are repeating this search in case we had the last extent's end offset undefined (set to the (u64)-1 value) when we entered maybe_send_hole(), wasting time. So avoid this duplicated search by combining the two conditions that trigger a search for the last extent's end offset into a single if statement. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The validation of vol args v2 name in snapshot and device remove ioctls is not done properly. A terminating NUL is written to the end of the buffer unconditionally, assuming that this would be the last place in case the buffer is used completely. This does not communicate back the actual error (either an invalid or too long path). Factor out all such cases and use a helper to do the verification, simply look for NUL in the buffer. There's no expected practical change, the size of buffer is 4088, this is enough for most paths or names. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
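A check of this kind can be sketched in plain C with memchr(). The helper name and the choice of -ENAMETOOLONG as the error code are assumptions made for illustration. Only the 4088-byte buffer size comes from the text above.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* The text above mentions a 4088-byte name buffer. */
#define NAME_BUF_SIZE 4088

/*
 * Hypothetical helper: succeed only if the buffer contains a NUL terminator
 * within its bounds, instead of overwriting the last byte with NUL and
 * silently truncating. The error code is an assumption.
 */
static int check_name_terminated(const char *name, size_t size)
{
	assert(name != NULL);
	if (memchr(name, '\0', size) == NULL)
		return -ENAMETOOLONG;
	return 0;
}
```

Unlike the unconditional NUL write, this reports an over-long name back to the caller and leaves the buffer untouched.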
-
David Sterba authored
The validation of vol args name in several ioctls is not done properly. A terminating NUL is written to the end of the buffer unconditionally, assuming that this would be the last place in case the buffer is used completely. This does not communicate back the actual error (either an invalid or too long path). Factor out all such cases and use a helper to do the verification, simply look for NUL in the buffer. There's no expected practical change, the size of buffer is 4088, this is enough for most paths or names. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The function btrfs_transaction_in_commit() is no longer used, its last use was removed in commit 11aeb97b ("btrfs: don't arbitrarily slow down delalloc if we're committing"), so just remove it. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Neal Gompa authored
The IS_ENABLED() macro already guarantees the result will be a suitable boolean return value ("1" for enabled, and "0" for disabled). Thus, it seems that the "!!" used right before is unnecessary to force the 0/1 values. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Neal Gompa <neal@gompa.dev> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
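The redundancy can be illustrated with a stand-in macro. `FEATURE_ENABLED()` below is hypothetical and much simpler than the kernel's IS_ENABLED(); it only shares the property that matters here, namely that it already evaluates to exactly 0 or 1.

```c
#include <assert.h>

/* Hypothetical stand-in for a macro that already evaluates to exactly 0 or 1. */
#define FEATURE_ENABLED(x) ((x) != 0 ? 1 : 0)

/* "!!" normalizes any nonzero value to 1, which is a no-op on a 0/1 value. */
static int with_double_negation(int cfg)
{
	return !!FEATURE_ENABLED(cfg);
}

static int without_double_negation(int cfg)
{
	return FEATURE_ENABLED(cfg);
}
```

Both functions return the same value for every input, so the "!!" adds nothing.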
-
David Sterba authored
The purpose of the BUG_ON is not clear. The helper btrfs_grab_root() could return NULL in case args->root is NULL or if there are zero references. Then we check if the root pointer stored in the inode still exists. The whole call chain is for iget: btrfs_iget -> btrfs_iget_path -> btrfs_iget_locked -> iget5_locked -> btrfs_init_locked_inode, which is called from many contexts where the root pointer is used and we can safely assume it has enough references. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Checking extent item size in add_inline_refs() is redundant, we do that already in tree-checker after reading the extent buffer and it won't change under normal circumstances. It was added long ago in 8da6d581 ("Btrfs: added btrfs_find_all_roots()") and does not seem to have a clear purpose. Similar case in extent_from_logical(), added in a542ad1b ("btrfs: added helper functions to iterate backrefs"). Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The BUG_ON is deep in the qgroup code where we can expect that it exists. A NULL pointer would cause a crash. It was added long ago in 550d7a2e ("btrfs: qgroup: Add new qgroup calculation function btrfs_qgroup_account_extents()."). It may have made sense back then, as the quota enable/disable state machine was not as robust as it is nowadays, so we can just delete it. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The only caller do_walk_down() of btrfs_qgroup_trace_subtree() validates the value of level and uses it several times before it's passed as an argument. Same for root_eb that's called 'next' in the caller. Change both BUG_ONs to assertions as this is to assure proper interface use rather than real errors. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
There's only one caller of tree_move_down() that does not pass level 0 so the assertion is better suited here. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Change BUG_ON to proper error handling if building the path buffer fails. The pointers are not printed so we don't accidentally leak kernel addresses. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Change BUG_ON to proper error handling when an unexpected inode number is encountered. As the comment says this should never happen. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Change BUG_ON to proper error handling in the unlikely case of seeing data when the command is started. This is supposed to be reset when the command is finished (send_cmd, send_encoded_extent). Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The may_destroy_subvol() helper looks up a root by a key, allowing an inexact search when key->offset is -1. It's never expected to find such an item, as it would break the allowed range of a root id. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The find_first_extent_item() helper looks up an extent item by a key, allowing an inexact search when key->offset is -1. It's never expected to find such an item, as it would break the allowed range of an extent item offset. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The extent_from_logical() helper looks up an extent item by a key, allowing an inexact search when key->offset is -1. It's never expected to find such an item, as it would break the allowed range of an extent item offset. The same error is already handled in btrfs_backref_iter_start() so add a comment for consistency. Signed-off-by: David Sterba <dsterba@suse.com>
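The pattern shared by these lookups can be sketched as follows. It assumes the usual b-tree search return convention (negative for an error, 0 for an exact match, positive for no exact match). `check_inexact_search()` is a hypothetical helper, not a function from the patches, and returning -EUCLEAN for the unexpected exact match is an assumption.

```c
#include <assert.h>
#include <errno.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* value on most Linux architectures; fallback for other platforms */
#endif

/*
 * Hypothetical helper for a search done with key offset (u64)-1.
 * 'search_ret' is assumed to follow the convention: < 0 is an error,
 * 0 is an exact match, > 0 means no exact match was found.
 */
static int check_inexact_search(int search_ret)
{
	if (search_ret < 0)
		return search_ret;	/* propagate the lookup error */
	if (search_ret == 0)
		return -EUCLEAN;	/* an item at offset -1 should never exist */
	return 0;			/* expected: caller steps back to the previous item */
}
```

Returning an error for the exact-match case lets the caller fail the operation instead of crashing the kernel on what is most likely corrupted metadata.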
-
David Sterba authored
Same comment was added to this type of error, unify that and drop the assertion as we'd find out quickly that something is wrong after returning -EUCLEAN. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The memory allocation error in add_async_extent() is not handled properly, return an error and push the BUG_ON to the caller. Handling it there is not trivial so at least make it visible. Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The "do_list" variable has a rather confusing name, so remove it and directly use btrfs_is_free_space_inode() instead. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The "do_list" variable is only used once, plus its name/meaning is a bit confusing, so remove it and directly use btrfs_is_free_space_inode(). Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When adding or removing an inode to/from the root's delalloc list, instead of using a BUG_ON() to validate list emptiness, use ASSERT() since this is to check logic errors rather than real errors. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The merge and split callbacks for an inode's io tree are supposed to be called while the io tree's spinlock is being held, so that the given extent_state records are stable, not modified or freed while the callbacks are using them. So add lockdep assertions in the callbacks. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When setting and clearing a delalloc range, at btrfs_set_delalloc_extent() and btrfs_clear_delalloc_extent(), we are adding/removing the inode to/from the root's list of delalloc inodes while under the protection of the inode's lock. This however is not needed, we can add and remove the inode to the root's list without holding the inode's lock because here we are under the protection of the io tree's lock, reducing the size of the critical section delimited by the inode's lock. The inode's lock is used in many other places such as when finishing an ordered extent (when calling btrfs_update_inode_bytes() or btrfs_delalloc_release_metadata(), or decreasing the number of outstanding extents) or when reserving space when doing a buffered or direct IO write (calls to functions from delalloc-space.c).

So move the inode add/remove operations to the root's list of delalloc inodes to outside the critical section delimited by the inode's lock. This also allows us to get rid of the BTRFS_INODE_IN_DELALLOC_LIST flag since we can rely on the inode's delalloc bytes counter to determine if the inode is or is not in the list.

The following fio based test, that exercises IO to multiple files in the same subvolume, was used to test:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/nullb0
  MNT=/mnt/nullb0
  MOUNT_OPTIONS="-o ssd"

  mkfs.btrfs -f $DEV &> /dev/null
  mount $MOUNT_OPTIONS $DEV $MNT

  fio --direct=0 --ioengine=sync --thread --directory=$MNT \
      --invalidate=1 --group_reporting=1 \
      --new_group --rw=randwrite --size=50m --numjobs=200 \
      --bs=4k --fsync_on_close=0 --fallocate=none --end_fsync=0 \
      --name=foo --filename_format=FioWorkloads.\$jobnum

  umount $MNT

The test was run on a non-debug kernel (Debian's default kernel config) against a 16G null block device.

Result before this patch:

  WRITE: bw=81.9MiB/s (85.9MB/s), 81.9MiB/s-81.9MiB/s (85.9MB/s-85.9MB/s), io=9.77GiB (10.5GB), run=122136-122136msec

Result after this patch:

  WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=9.77GiB (10.5GB), run=115180-115180msec

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
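To put the two fio results in relative terms, the change can be computed with a small helper. `percent_change()` is a hypothetical name. The inputs are the bandwidth and run time figures quoted above.

```c
#include <assert.h>

/* Relative change from 'before' to 'after', in percent. */
static double percent_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}
```

With the quoted figures, percent_change(81.9, 86.8) gives roughly +6% write bandwidth, and percent_change(122136, 115180) gives roughly -5.7% run time.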
-
Filipe Manana authored
The function btrfs_add_delalloc_inodes() adds a single inode to its root's list of delalloc inodes, so it doesn't make any sense at all for the function's name to be plural. Rename it to the singular form btrfs_add_delalloc_inode() to avoid any confusion. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
This function requires the delalloc lock of the inode's root to be held, so assert it's held. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
There's no need to pass a root argument to __btrfs_del_delalloc_inode() and btrfs_del_delalloc_inode(), we can just pass the inode since the root is always the root associated to that inode. So remove the root argument from these functions. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
There's no need to pass a root argument to btrfs_add_delalloc_inodes(), we can just pass the inode since the root is always the root associated to the inode in the context it's called. So remove it and have the single caller pass only the inode. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Do a cleanup in the rest of the headers: - add forward declarations for types referenced by pointers - add includes when types need them This fixes potential compilation problems if the headers are reordered or the missing includes are not provided indirectly. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Do a cleanup in more headers: - add forward declarations for types referenced by pointers - add includes when types need them This fixes potential compilation problems if the headers are reordered or the missing includes are not provided indirectly. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Do a cleanup in the short headers: - add forward declarations for types referenced by pointers - add includes when types need them This fixes potential compilation problems if the headers are reordered or the missing includes are not provided indirectly. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The fs_info and sectorsize remain the same during the loops, no need to set them on each iteration. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Add a convenience helper to get a fs_info from a VFS inode pointer instead of open coding the chain or using btrfs_sb() that in some cases does one more pointer hop. This is implemented as a macro (still with type checking) so we don't need full definitions of struct btrfs_inode, btrfs_root or btrfs_fs_info. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
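One way to get a type-checked macro of this kind is C11 `_Generic` with a single association. The sketch below uses hypothetical, heavily simplified types and names (`struct wrapped_inode`, `WRAPPED_I()`, `INODE_TO_FS_INFO()`). It only illustrates the technique and is not the actual helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal types; the real structures are much larger. */
struct fs_info { int id; };
struct root { struct fs_info *fs_info; };
struct vfs_inode { int ino; };
struct wrapped_inode {
	struct root *root;
	struct vfs_inode vfs_inode;	/* embedded VFS inode */
};

/* container_of-style conversion from the embedded VFS inode to its wrapper. */
#define WRAPPED_I(ptr) \
	((struct wrapped_inode *)((char *)(ptr) - \
				  offsetof(struct wrapped_inode, vfs_inode)))

/*
 * _Generic has a single association and no default, so passing anything
 * other than a 'struct vfs_inode *' is a compile-time error. Being a macro,
 * it needs the struct definitions only where it is expanded, not where it
 * is defined.
 */
#define INODE_TO_FS_INFO(_inode) \
	(WRAPPED_I(_Generic((_inode), struct vfs_inode *: (_inode)))->root->fs_info)
```

A static inline function would give the same type checking, but would need the full structure definitions visible in the header that declares it.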
-
David Sterba authored
Add convenience helpers to get a fs_info from a page or folio pointer instead of open coding the chain or using btrfs_sb() that in some cases does one more pointer hop. This is implemented as a macro (still with type checking) so we don't need full definitions of struct page, folio, btrfs_root and btrfs_fs_info. The latter can't be static inlines as this would create a loop between ctree.h <-> fs.h, or the headers would have to be restructured. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Add convenience helpers to get a struct btrfs_inode from a page or folio pointer instead of open coding the chain or intermediate BTRFS_I. This is implemented as a macro (still with type checking) so we don't need full definitions of struct page or address_space. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Allocate fs_info and root to have a valid fs_info pointer in case it's dereferenced by a helper outside of tests, like find_lock_delalloc_range(). Signed-off-by: David Sterba <dsterba@suse.com>
-
Lijuan Li authored
__btrfs_add_free_space is only used in free-space-cache.c, so mark it static. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Lijuan Li <lilijuan@iscas.ac.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The recommended pattern for transaction abort after error is to place it right after the error is handled. That way it's easier to locate where it failed and help debugging. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-