- 05 Dec, 2022 40 commits
-
-
David Sterba authored
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The extent_io_tree::private_data was meant to be a preparatory work for the metadata inode rework but that never materialized. Now it's used only for an inode so it's better to change the appropriate type and rename it. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
All callers except one pass NULL, so the parameter can be dropped and the inode::io_tree initialization can be open coded. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The btrfs_writepage_fixup structure is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The btrfs_dio_private structure is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The async bio submit is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
After previous patches the unused parameters can be removed from btree_submit_bio_start and btrfs_submit_bio_start as they don't need to conform to the extent_submit_bio_start_t typedef. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
There's a callback function parameter for btrfs_wq_submit_bio that can be one of: metadata, buffered data, direct io data. The callback abstraction is unnecessary as we have all functions available. Replace the parameter with a command that leads to a direct call in run_one_async_start. The called functions can be then simplified and we can also remove the extent_submit_bio_start_t typedef. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Compression and direct io don't work together so the compression parameter can be dropped after previous patch that changed the call to direct. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
There's a function pointer passed to btrfs_repair_one_sector that will submit the right bio for repair. However there are only two callbacks, for buffered and for direct IO. This can be simplified to a bool-based switch and call either function, indirect calls in this case is an unnecessary abstraction. This allows to remove the submit_bio_hook_t typedef. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
In zoned mode the sequential status of zone can be also tracked in the runtime flags of block group. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
We already have flags in block group to track various status bits, convert needs_free_space as well and reduce size of btrfs_block_group. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
We have a 64bit compatible helper to check if a value is a power of two, use it instead of open coding it. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The copy_page helper may use an optimized version for full page copy (eg. on s390 there's a special instruction for that), there's one more left to convert. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
After the previous patchset which is comprised of the following patches: 01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs() 02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes() 03/17 btrfs: fix ulist leaks in error paths of qgroup self tests 04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests 05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone 06/17 btrfs: send: update comment at find_extent_clone() 07/17 btrfs: send: drop unnecessary backref context field initializations 08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source 09/17 btrfs: send: optimize clone detection to increase extent sharing 10/17 btrfs: use a single argument for extent offset in backref walking functions 11/17 btrfs: use a structure to pass arguments to backref walking functions 12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes() 13/17 btrfs: constify ulist parameter of ulist_next() 14/17 btrfs: send: cache leaf to roots mapping during backref walking 15/17 btrfs: send: skip unnecessary backref iterations 16/17 btrfs: send: avoid double extent tree search when finding clone source 17/17 btrfs: send: skip resolution of our own backref when finding clone source we have now much better performance when doing backref walking in the send code, so we can increase the current limit from 64 to 1024 references. This limit is still a bit conservative because there are still edge cases where backref walking will be too slow and spend a lot of cpu time, some IO reading b+tree nodes/leaves and memory. The goal is to eventually get rid of any limit, but for now bump it as it benefits users with extents shared more than 64 times and up to 1024 times, allowing for more deduplication at the destination without having to run a dedupe tool after a receive. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When doing backref walking to determine a source range to clone from, it is worthless to collect and resolve our own data backref, as we can't obviously use it as a clone source and it represents the range we want to clone into. Collecting the backref implies doing the extra work to resolve it, doing the search for a file extent item in a subvolume tree, etc. Skipping the data backref is valid as long as we only have the send root as the single clone root, otherwise the leaf with the file extent item may be accessible from another clone root due to shared subtrees created by snapshots, and therefore we have to collect the backref and resolve it. So add a callback to the backref walking code to guide it to skip data backrefs. This change is part of a patchset comprised of the following patches: 01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs() 02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes() 03/17 btrfs: fix ulist leaks in error paths of qgroup self tests 04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests 05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone 06/17 btrfs: send: update comment at find_extent_clone() 07/17 btrfs: send: drop unnecessary backref context field initializations 08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source 09/17 btrfs: send: optimize clone detection to increase extent sharing 10/17 btrfs: use a single argument for extent offset in backref walking functions 11/17 btrfs: use a structure to pass arguments to backref walking functions 12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes() 13/17 btrfs: constify ulist parameter of ulist_next() 14/17 btrfs: send: cache leaf to roots mapping during backref walking 15/17 btrfs: send: skip unnecessary backref iterations 16/17 btrfs: send: avoid double extent tree search when finding clone source 17/17 btrfs: send: skip resolution of our own backref when finding clone source The following test was run on non-debug kernel (Debian's default kernel config) before and after applying the patchset: $ cat test-send-many-shared-extents.sh #!/bin/bash DEV=/dev/sdh MNT=/mnt/sdh umount $DEV &> /dev/null mkfs.btrfs -f $DEV mount $DEV $MNT num_files=50000 num_clones_per_file=50 for ((i = 1; i <= $num_files; i++)); do xfs_io -f -c "pwrite 0 64K" $MNT/file_$i > /dev/null echo -ne "\r$i files created..." done echo btrfs subvolume snapshot -r $MNT $MNT/snap1 cloned=0 for ((i = 1; i <= $num_clones_per_file; i++)); do for ((j = 1; j <= $num_files; j++)); do cp --reflink=always $MNT/file_$j $MNT/file_${j}_clone_${i} cloned=$((cloned + 1)) echo -ne "\r$cloned / $((num_files * num_clones_per_file)) clone operations" done done echo btrfs subvolume snapshot -r $MNT $MNT/snap2 # Unmount and mount again to clear all cached metadata (and data). umount $DEV mount $DEV $MNT start=$(date +%s%N) btrfs send $MNT/snap2 > /dev/null end=$(date +%s%N) dur=$(( (end - start) / 1000000000 )) echo -e "\nFull send took $dur seconds" # Unmount and mount again to clear all cached metadata (and data). umount $DEV mount $DEV $MNT start=$(date +%s%N) btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null end=$(date +%s%N) dur=$(( (end - start) / 1000000000 )) echo -e "\nIncremental send took $dur seconds" umount $MNT Before applying the patchset: (...) Full send took 1108 seconds (...) Incremental send took 1135 seconds After applying the whole patchset: (...) Full send took 268 seconds (-75.8%) (...) Incremental send took 316 seconds (-72.2%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
At find_extent_clone() we search twice for the extent item corresponding to the data extent that the current file extent items points to: 1) Once with a call to extent_from_logical(); 2) Once again during backref walking, through iterate_extent_inodes() which eventually leads to find_parent_nodes() where we will search again the extent tree for the same extent item. The extent tree can be huge, so doing this one extra search for every extent we want to send adds up and it's expensive. The first call is there since the send code was introduced and it accomplishes two things: 1) Check that the extent is flagged as a data extent in the extent tree. But it can not be anything else, otherwise we wouldn't have a file extent item in the send root pointing to it. This was probably added to catch bugs in the early days where send was yet too young and the interaction with everything else was far from perfect; 2) Check how many direct references there are on the extent, and if there's too many (more than SEND_MAX_EXTENT_REFS), avoid doing the backred walking as it may take too long and slowdown send. So improve on this by having a callback in the backref walking code that is called when it finds the extent item in the extent tree, and have those checks done in the callback. When the callback returns anything different from 0, it stops the backref walking code. This way we do a single search on the extent tree for the extent item of our data extent. Also, before this change we were only checking the number of references on the data extent against SEND_MAX_EXTENT_REFS, but after starting backref walking we will end up resolving backrefs for extent buffers in the path from a leaf having a file extent item pointing to our data extent, up to roots of trees from which the extent buffer is accessible from, due to shared subtrees resulting from snapshoting. We were therefore allowing for the possibility for send taking too long due to some node in the path from the leaf to a root node being shared too many times. After this change we check for reference counts being greater than SEND_MAX_EXTENT_REFS for both data extents and metadata extents. This change is part of a patchset comprised of the following patches: 01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs() 02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes() 03/17 btrfs: fix ulist leaks in error paths of qgroup self tests 04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests 05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone 06/17 btrfs: send: update comment at find_extent_clone() 07/17 btrfs: send: drop unnecessary backref context field initializations 08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source 09/17 btrfs: send: optimize clone detection to increase extent sharing 10/17 btrfs: use a single argument for extent offset in backref walking functions 11/17 btrfs: use a structure to pass arguments to backref walking functions 12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes() 13/17 btrfs: constify ulist parameter of ulist_next() 14/17 btrfs: send: cache leaf to roots mapping during backref walking 15/17 btrfs: send: skip unnecessary backref iterations 16/17 btrfs: send: avoid double extent tree search when finding clone source 17/17 btrfs: send: skip resolution of our own backref when finding clone source Performance test results are in the changelog of patch 17/17. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When looking for a clone source for an extent, we are iterating over all the backreferences for an extent. This is often a waste of time, because once we find a good clone source we could stop immediately instead of continuing backref walking, which is expensive. Basically what happens currently is this: 1) Call iterate_extent_inodes() to iterate over all the backreferences; 2) It calls btrfs_find_all_leafs() which in turn calls the main function to walk over backrefs and collect them - find_parent_nodes(); 3) Then we collect all the references for our target data extent from the extent tree (and delayed refs if any), add them to the rb trees, resolve all the indirect backreferences and search for all the file extent items in fs trees, building a list of inodes for each one of them (struct extent_inode_elem); 4) Then back at iterate_extent_inodes() we find all the roots associated to each found leaf, and call the callback __iterate_backrefs defined at send.c for each inode in the inode list associated to each leaf. Some times one the first backreferences we find in a fs tree is optimal to satisfy the clone operation that send wants to perform, and in that case we could stop immediately and avoid resolving all the remaining indirect backreferences (search fs trees for the respective file extent items, etc). This possibly if when we find a fs tree leaf with a file extent item we are able to know what are all the roots that can lead to the leaf - this is now possible after the previous patch in the series that adds a cache that maps leaves to a list of roots. So we can now shortcircuit backref walking during send, by having the callback we pass to iterate_extent_inodes() to be called when we find a file extent item for an indirect backreference, and have it return a special value when it found a suitable backreference and it does not need to look for more backreferences. This change does that. This change is part of a patchset comprised of the following patches: 01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs() 02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes() 03/17 btrfs: fix ulist leaks in error paths of qgroup self tests 04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests 05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone 06/17 btrfs: send: update comment at find_extent_clone() 07/17 btrfs: send: drop unnecessary backref context field initializations 08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source 09/17 btrfs: send: optimize clone detection to increase extent sharing 10/17 btrfs: use a single argument for extent offset in backref walking functions 11/17 btrfs: use a structure to pass arguments to backref walking functions 12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes() 13/17 btrfs: constify ulist parameter of ulist_next() 14/17 btrfs: send: cache leaf to roots mapping during backref walking 15/17 btrfs: send: skip unnecessary backref iterations 16/17 btrfs: send: avoid double extent tree search when finding clone source 17/17 btrfs: send: skip resolution of our own backref when finding clone source Performance test results are in the changelog of patch 17/17. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
During a send operation, when doing backref walking to determine which inodes/offsets/roots we can clone from, the most repetitive and expensive step is to map each leaf that has file extent items pointing to the target data extent to the IDs of the roots from which the leaves are accessible, which happens at iterate_extent_inodes(). That step requires finding every parent node of a leaf, then the parent of each parent, and so on until we reach a root node. So it's a naturally expensive operation, and repetitive because each leaf can have hundreds of file extent items (for a nodesize of 16K, that can be slightly over 200 file extent items). There's also temporal locality, as we process all file extent items from a leave before moving the next leaf. This change caches the mapping of leaves to root IDs, to avoid repeating those computations over and over again. The cache is limited to a maximum of 128 entries, with each entry being a struct with a size of 128 bytes, so the maximum cache size is 16K plus any nodes internally allocated by the maple tree that is used to index pointers to those structs. The cache is invalidated whenever we detect relocation happened since we started filling the cache, because if relocation happened then extent buffers for leaves and nodes of the trees used by a send operation may have been reallocated. This cache also allows for another important optimization that is introduced in the next patch in the series. This change is part of a patchset comprised of the following patches: 01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs() 02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes() 03/17 btrfs: fix ulist leaks in error paths of qgroup self tests 04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests 05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone 06/17 btrfs: send: update comment at find_extent_clone() 07/17 btrfs: send: drop unnecessary backref context field initializations 08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source 09/17 btrfs: send: optimize clone detection to increase extent sharing 10/17 btrfs: use a single argument for extent offset in backref walking functions 11/17 btrfs: use a structure to pass arguments to backref walking functions 12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes() 13/17 btrfs: constify ulist parameter of ulist_next() 14/17 btrfs: send: cache leaf to roots mapping during backref walking 15/17 btrfs: send: skip unnecessary backref iterations 16/17 btrfs: send: avoid double extent tree search when finding clone source 17/17 btrfs: send: skip resolution of our own backref when finding clone source Performance test results are in the changelog of patch 17/17. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The ulist_next() iterator function does not need to change the given ulist so make it const. This will allow the next patch in the series to pass a ulist to a function that does not need, and should not, modify the ulist. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
At iterate_extent_inodes() we collect a ulist of leaves for a given extent with a call to btrfs_find_all_leafs() and then we enter a loop where we iterate over all the collected leaves. Each iteration of that loop does a call to btrfs_find_all_roots_safe(), to determine all roots from which a leaf is accessible, and that results in allocating and releasing a ulist to store the root IDs. Instead of allocating and releasing the roots ulist on every iteration, allocate a ulist before entering the loop and keep using it on each iteration, reinitializing the ulist at the end of each iteration. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The public backref walking functions have quite a lot of arguments that are passed down the call stack to find_parent_nodes(), the core function of the backref walking code. The next patches in series will need to add even arguments to these functions that should be passed not only to find_parent_nodes(), but also to other functions used by the later (directly or even lower in the call stack). So create a structure to hold all these arguments and state used by the main backref walking function, find_parent_nodes(), and use it as the argument for the public backref walking functions iterate_extent_inodes(), btrfs_find_all_leafs() and btrfs_find_all_roots(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-