- 05 Dec, 2022 40 commits
-
-
David Sterba authored
There's a callback function parameter for btrfs_wq_submit_bio that can be one of: metadata, buffered data, direct io data. The callback abstraction is unnecessary as we have all functions available. Replace the parameter with a command that leads to a direct call in run_one_async_start. The called functions can be then simplified and we can also remove the extent_submit_bio_start_t typedef. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Compression and direct io don't work together so the compression parameter can be dropped after previous patch that changed the call to direct. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
There's a function pointer passed to btrfs_repair_one_sector that will submit the right bio for repair. However there are only two callbacks, for buffered and for direct IO. This can be simplified to a bool-based switch and call either function, indirect calls in this case is an unnecessary abstraction. This allows to remove the submit_bio_hook_t typedef. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
In zoned mode the sequential status of zone can be also tracked in the runtime flags of block group. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
We already have flags in block group to track various status bits, convert needs_free_space as well and reduce size of btrfs_block_group. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
We have a 64bit compatible helper to check if a value is a power of two, use it instead of open coding it. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The copy_page helper may use an optimized version for full page copy (eg. on s390 there's a special instruction for that), there's one more left to convert. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
After the previous patchset which is comprised of the following patches: 01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs() 02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes() 03/17 btrfs: fix ulist leaks in error paths of qgroup self tests 04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests 05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone 06/17 btrfs: send: update comment at find_extent_clone() 07/17 btrfs: send: drop unnecessary backref context field initializations 08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source 09/17 btrfs: send: optimize clone detection to increase extent sharing 10/17 btrfs: use a single argument for extent offset in backref walking functions 11/17 btrfs: use a structure to pass arguments to backref walking functions 12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes() 13/17 btrfs: constify ulist parameter of ulist_next() 14/17 btrfs: send: cache leaf to roots mapping during backref walking 15/17 btrfs: send: skip unnecessary backref iterations 16/17 btrfs: send: avoid double extent tree search when finding clone source 17/17 btrfs: send: skip resolution of our own backref when finding clone source we have now much better performance when doing backref walking in the send code, so we can increase the current limit from 64 to 1024 references. This limit is still a bit conservative because there are still edge cases where backref walking will be too slow and spend a lot of cpu time, some IO reading b+tree nodes/leaves and memory. The goal is to eventually get rid of any limit, but for now bump it as it benefits users with extents shared more than 64 times and up to 1024 times, allowing for more deduplication at the destination without having to run a dedupe tool after a receive. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When doing backref walking to determine a source range to clone from, it is worthless to collect and resolve our own data backref, as we can't obviously use it as a clone source and it represents the range we want to clone into. Collecting the backref implies doing the extra work to resolve it, doing the search for a file extent item in a subvolume tree, etc. Skipping the data backref is valid as long as we only have the send root as the single clone root, otherwise the leaf with the file extent item may be accessible from another clone root due to shared subtrees created by snapshots, and therefore we have to collect the backref and resolve it. So add a callback to the backref walking code to guide it to skip data backrefs. This change is part of a patchset comprised of the following patches: 01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs() 02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes() 03/17 btrfs: fix ulist leaks in error paths of qgroup self tests 04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests 05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone 06/17 btrfs: send: update comment at find_extent_clone() 07/17 btrfs: send: drop unnecessary backref context field initializations 08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source 09/17 btrfs: send: optimize clone detection to increase extent sharing 10/17 btrfs: use a single argument for extent offset in backref walking functions 11/17 btrfs: use a structure to pass arguments to backref walking functions 12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes() 13/17 btrfs: constify ulist parameter of ulist_next() 14/17 btrfs: send: cache leaf to roots mapping during backref walking 15/17 btrfs: send: skip unnecessary backref iterations 16/17 btrfs: send: avoid double extent tree search when finding clone source 17/17 btrfs: send: skip resolution of our own backref when finding clone source The following test was run on non-debug kernel (Debian's default kernel config) before and after applying the patchset: $ cat test-send-many-shared-extents.sh #!/bin/bash DEV=/dev/sdh MNT=/mnt/sdh umount $DEV &> /dev/null mkfs.btrfs -f $DEV mount $DEV $MNT num_files=50000 num_clones_per_file=50 for ((i = 1; i <= $num_files; i++)); do xfs_io -f -c "pwrite 0 64K" $MNT/file_$i > /dev/null echo -ne "\r$i files created..." done echo btrfs subvolume snapshot -r $MNT $MNT/snap1 cloned=0 for ((i = 1; i <= $num_clones_per_file; i++)); do for ((j = 1; j <= $num_files; j++)); do cp --reflink=always $MNT/file_$j $MNT/file_${j}_clone_${i} cloned=$((cloned + 1)) echo -ne "\r$cloned / $((num_files * num_clones_per_file)) clone operations" done done echo btrfs subvolume snapshot -r $MNT $MNT/snap2 # Unmount and mount again to clear all cached metadata (and data). umount $DEV mount $DEV $MNT start=$(date +%s%N) btrfs send $MNT/snap2 > /dev/null end=$(date +%s%N) dur=$(( (end - start) / 1000000000 )) echo -e "\nFull send took $dur seconds" # Unmount and mount again to clear all cached metadata (and data). umount $DEV mount $DEV $MNT start=$(date +%s%N) btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null end=$(date +%s%N) dur=$(( (end - start) / 1000000000 )) echo -e "\nIncremental send took $dur seconds" umount $MNT Before applying the patchset: (...) Full send took 1108 seconds (...) Incremental send took 1135 seconds After applying the whole patchset: (...) Full send took 268 seconds (-75.8%) (...) Incremental send took 316 seconds (-72.2%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
At find_extent_clone() we search twice for the extent item corresponding to the data extent that the current file extent items points to: 1) Once with a call to extent_from_logical(); 2) Once again during backref walking, through iterate_extent_inodes() which eventually leads to find_parent_nodes() where we will search again the extent tree for the same extent item. The extent tree can be huge, so doing this one extra search for every extent we want to send adds up and it's expensive. The first call is there since the send code was introduced and it accomplishes two things: 1) Check that the extent is flagged as a data extent in the extent tree. But it can not be anything else, otherwise we wouldn't have a file extent item in the send root pointing to it. This was probably added to catch bugs in the early days where send was yet too young and the interaction with everything else was far from perfect; 2) Check how many direct references there are on the extent, and if there's too many (more than SEND_MAX_EXTENT_REFS), avoid doing the backred walking as it may take too long and slowdown send. So improve on this by having a callback in the backref walking code that is called when it finds the extent item in the extent tree, and have those checks done in the callback. When the callback returns anything different from 0, it stops the backref walking code. This way we do a single search on the extent tree for the extent item of our data extent. Also, before this change we were only checking the number of references on the data extent against SEND_MAX_EXTENT_REFS, but after starting backref walking we will end up resolving backrefs for extent buffers in the path from a leaf having a file extent item pointing to our data extent, up to roots of trees from which the extent buffer is accessible from, due to shared subtrees resulting from snapshoting. We were therefore allowing for the possibility for send taking too long due to some node in the path from the leaf to a root node being shared too many times. After this change we check for reference counts being greater than SEND_MAX_EXTENT_REFS for both data extents and metadata extents. This change is part of a patchset comprised of the following patches: 01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs() 02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes() 03/17 btrfs: fix ulist leaks in error paths of qgroup self tests 04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests 05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone 06/17 btrfs: send: update comment at find_extent_clone() 07/17 btrfs: send: drop unnecessary backref context field initializations 08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source 09/17 btrfs: send: optimize clone detection to increase extent sharing 10/17 btrfs: use a single argument for extent offset in backref walking functions 11/17 btrfs: use a structure to pass arguments to backref walking functions 12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes() 13/17 btrfs: constify ulist parameter of ulist_next() 14/17 btrfs: send: cache leaf to roots mapping during backref walking 15/17 btrfs: send: skip unnecessary backref iterations 16/17 btrfs: send: avoid double extent tree search when finding clone source 17/17 btrfs: send: skip resolution of our own backref when finding clone source Performance test results are in the changelog of patch 17/17. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When looking for a clone source for an extent, we are iterating over all the backreferences for an extent. This is often a waste of time, because once we find a good clone source we could stop immediately instead of continuing backref walking, which is expensive. Basically what happens currently is this: 1) Call iterate_extent_inodes() to iterate over all the backreferences; 2) It calls btrfs_find_all_leafs() which in turn calls the main function to walk over backrefs and collect them - find_parent_nodes(); 3) Then we collect all the references for our target data extent from the extent tree (and delayed refs if any), add them to the rb trees, resolve all the indirect backreferences and search for all the file extent items in fs trees, building a list of inodes for each one of them (struct extent_inode_elem); 4) Then back at iterate_extent_inodes() we find all the roots associated to each found leaf, and call the callback __iterate_backrefs defined at send.c for each inode in the inode list associated to each leaf. Some times one the first backreferences we find in a fs tree is optimal to satisfy the clone operation that send wants to perform, and in that case we could stop immediately and avoid resolving all the remaining indirect backreferences (search fs trees for the respective file extent items, etc). This possibly if when we find a fs tree leaf with a file extent item we are able to know what are all the roots that can lead to the leaf - this is now possible after the previous patch in the series that adds a cache that maps leaves to a list of roots. So we can now shortcircuit backref walking during send, by having the callback we pass to iterate_extent_inodes() to be called when we find a file extent item for an indirect backreference, and have it return a special value when it found a suitable backreference and it does not need to look for more backreferences. This change does that. This change is part of a patchset comprised of the following patches: 01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs() 02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes() 03/17 btrfs: fix ulist leaks in error paths of qgroup self tests 04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests 05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone 06/17 btrfs: send: update comment at find_extent_clone() 07/17 btrfs: send: drop unnecessary backref context field initializations 08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source 09/17 btrfs: send: optimize clone detection to increase extent sharing 10/17 btrfs: use a single argument for extent offset in backref walking functions 11/17 btrfs: use a structure to pass arguments to backref walking functions 12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes() 13/17 btrfs: constify ulist parameter of ulist_next() 14/17 btrfs: send: cache leaf to roots mapping during backref walking 15/17 btrfs: send: skip unnecessary backref iterations 16/17 btrfs: send: avoid double extent tree search when finding clone source 17/17 btrfs: send: skip resolution of our own backref when finding clone source Performance test results are in the changelog of patch 17/17. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
During a send operation, when doing backref walking to determine which inodes/offsets/roots we can clone from, the most repetitive and expensive step is to map each leaf that has file extent items pointing to the target data extent to the IDs of the roots from which the leaves are accessible, which happens at iterate_extent_inodes(). That step requires finding every parent node of a leaf, then the parent of each parent, and so on until we reach a root node. So it's a naturally expensive operation, and repetitive because each leaf can have hundreds of file extent items (for a nodesize of 16K, that can be slightly over 200 file extent items). There's also temporal locality, as we process all file extent items from a leave before moving the next leaf. This change caches the mapping of leaves to root IDs, to avoid repeating those computations over and over again. The cache is limited to a maximum of 128 entries, with each entry being a struct with a size of 128 bytes, so the maximum cache size is 16K plus any nodes internally allocated by the maple tree that is used to index pointers to those structs. The cache is invalidated whenever we detect relocation happened since we started filling the cache, because if relocation happened then extent buffers for leaves and nodes of the trees used by a send operation may have been reallocated. This cache also allows for another important optimization that is introduced in the next patch in the series. This change is part of a patchset comprised of the following patches: 01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs() 02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes() 03/17 btrfs: fix ulist leaks in error paths of qgroup self tests 04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests 05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone 06/17 btrfs: send: update comment at find_extent_clone() 07/17 btrfs: send: drop unnecessary backref context field initializations 08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source 09/17 btrfs: send: optimize clone detection to increase extent sharing 10/17 btrfs: use a single argument for extent offset in backref walking functions 11/17 btrfs: use a structure to pass arguments to backref walking functions 12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes() 13/17 btrfs: constify ulist parameter of ulist_next() 14/17 btrfs: send: cache leaf to roots mapping during backref walking 15/17 btrfs: send: skip unnecessary backref iterations 16/17 btrfs: send: avoid double extent tree search when finding clone source 17/17 btrfs: send: skip resolution of our own backref when finding clone source Performance test results are in the changelog of patch 17/17. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The ulist_next() iterator function does not need to change the given ulist so make it const. This will allow the next patch in the series to pass a ulist to a function that does not need, and should not, modify the ulist. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
At iterate_extent_inodes() we collect a ulist of leaves for a given extent with a call to btrfs_find_all_leafs() and then we enter a loop where we iterate over all the collected leaves. Each iteration of that loop does a call to btrfs_find_all_roots_safe(), to determine all roots from which a leaf is accessible, and that results in allocating and releasing a ulist to store the root IDs. Instead of allocating and releasing the roots ulist on every iteration, allocate a ulist before entering the loop and keep using it on each iteration, reinitializing the ulist at the end of each iteration. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The public backref walking functions have quite a lot of arguments that are passed down the call stack to find_parent_nodes(), the core function of the backref walking code. The next patches in series will need to add even arguments to these functions that should be passed not only to find_parent_nodes(), but also to other functions used by the later (directly or even lower in the call stack). So create a structure to hold all these arguments and state used by the main backref walking function, find_parent_nodes(), and use it as the argument for the public backref walking functions iterate_extent_inodes(), btrfs_find_all_leafs() and btrfs_find_all_roots(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The interface for find_parent_nodes() has two extent offset related arguments: 1) One u64 pointer argument for the extent offset; 2) One boolean argument to tell if the extent offset should be ignored or not. These are confusing, becase the extent offset pointer can be NULL and in some cases callers pass a NULL value as a way to tell the backref walking code to ignore offsets in file extent items (and simply consider all file extent items that point to the target data extent). The boolean argument was added in commit c995ab3c ("btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents"), but it was never really necessary, it was enough if it could find a way to get a NULL value passed to the "extent_item_pos" argument of find_parent_nodes(). The arguments are also passed to functions called by find_parent_nodes() and respective helper functions, which further makes everything more complicated than needed. Then we have several backref walking related functions that end up calling find_parent_nodes(), either directly or through some other function that they call, and for many we have to use an "extent_item_pos" (u64) argument and a boolean "ignore_offset" argument too. This is confusing and not really necessary. So use a single argument to specify the extent offset, as a simple u64 and not as a pointer, but using a special value of (u64)-1, defined as a documented constant, to indicate when the extent offset should be ignored. This is also preparation work for the upcoming patches in the series that add other arguments to find_parent_nodes() and other related functions that use it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
Currently send does not do the best decisions when it comes to decide between multiple clone sources, which results in clone operations for partial extent ranges, which has the following disadvantages: 1) We get less shared extents at the destination; 2) We have to read more data during the send operation and emit more write commands. Besides not being optimal behaviour, it also breaks user expectations and is often reported by users, with a recent example in the Link tag at the bottom of this change log. Part of the reason for this non-optimal behaviour is that the backref walking code does not provide information about the length of the file extent items that were found for each backref, so send is blind about which backref is the best to chose as a cloning source. The other existing reasons are just silliness, namely always prefering the inode with the lowest number when multiple are found for the same root and when we can clone from multiple roots, always prefer the send root over any of the other clone roots. This does not make any sense since any inode or root is fine and as good as any other inode/root. Fix this by making backref walking pass information about the number of bytes referenced by each file extent item and then have send's backref callback pick the inode with the highest number of bytes for each root. Finally select the root from which we can clone more bytes from. Example reproducer: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount $DEV $MNT xfs_io -f -c "pwrite -S 0xab -b 2M 0 2M" $MNT/foo cp --reflink=always $MNT/foo $MNT/bar cp --reflink=always $MNT/foo $MNT/baz sync # Overwrite the second half of file foo. xfs_io -c "pwrite -S 0xcd -b 1M 1M 1M" $MNT/foo sync echo echo "*** fiemap in the original filesystem ***" echo xfs_io -c "fiemap -v" $MNT/foo xfs_io -c "fiemap -v" $MNT/bar xfs_io -c "fiemap -v" $MNT/baz echo btrfs filesystem du $MNT btrfs subvolume snapshot -r $MNT $MNT/snap btrfs send -f /tmp/send_stream $MNT/snap umount $MNT mkfs.btrfs -f $DEV &> /dev/null mount $DEV $MNT btrfs receive -f /tmp/send_stream $MNT echo echo "*** fiemap in the new filesystem ***" echo xfs_io -r -c "fiemap -v" $MNT/snap/foo xfs_io -r -c "fiemap -v" $MNT/snap/bar xfs_io -r -c "fiemap -v" $MNT/snap/baz echo btrfs filesystem du $MNT rm -f /tmp/send_stream rm -f /tmp/snap.fssum umount $MNT Before this change: $ ./test.sh (...) *** fiemap in the original filesystem *** /mnt/sdi/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 30720..32767 2048 0x1 /mnt/sdi/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 /mnt/sdi/baz: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 Total Exclusive Set shared Filename 2.00MiB 1.00MiB - /mnt/sdi/foo 2.00MiB 0.00B - /mnt/sdi/bar 2.00MiB 0.00B - /mnt/sdi/baz 6.00MiB 1.00MiB 2.00MiB /mnt/sdi Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap' At subvol /mnt/sdi/snap At subvol snap *** fiemap in the new filesystem *** /mnt/sdi/snap/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 /mnt/sdi/snap/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 30720..32767 2048 0x1 /mnt/sdi/snap/baz: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 32768..34815 2048 0x1 Total Exclusive Set shared Filename 2.00MiB 0.00B - /mnt/sdi/snap/foo 2.00MiB 1.00MiB - /mnt/sdi/snap/bar 2.00MiB 1.00MiB - /mnt/sdi/snap/baz 6.00MiB 2.00MiB - /mnt/sdi/snap 6.00MiB 2.00MiB 2.00MiB /mnt/sdi We end up with two 1M extents that are not shared for files bar and baz. After this change: $ ./test.sh (...) *** fiemap in the original filesystem *** /mnt/sdi/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 30720..32767 2048 0x1 /mnt/sdi/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 /mnt/sdi/baz: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 Total Exclusive Set shared Filename 2.00MiB 1.00MiB - /mnt/sdi/foo 2.00MiB 0.00B - /mnt/sdi/bar 2.00MiB 0.00B - /mnt/sdi/baz 6.00MiB 1.00MiB 2.00MiB /mnt/sdi Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap' At subvol /mnt/sdi/snap At subvol snap *** fiemap in the new filesystem *** /mnt/sdi/snap/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 /mnt/sdi/snap/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 30720..32767 2048 0x2001 /mnt/sdi/snap/baz: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 30720..32767 2048 0x2001 Total Exclusive Set shared Filename 2.00MiB 0.00B - /mnt/sdi/snap/foo 2.00MiB 0.00B - /mnt/sdi/snap/bar 2.00MiB 0.00B - /mnt/sdi/snap/baz 6.00MiB 0.00B - /mnt/sdi/snap 6.00MiB 0.00B 3.00MiB /mnt/sdi Now there's a much better sharing, files bar and baz share 1M of the extent of file foo and the second extent of files bar and baz is shared between themselves. This will later be turned into a test case for fstests. Link: https://lore.kernel.org/linux-btrfs/20221008005704.795b44b0@crass-HP-ZBook-15-G2/Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
At find_extent_clone(), unless we are given an inline extent, a file extent item that represents hole or an extent that starts beyond the i_size, we always do backref walking to look for clone sources, unless if we have more than SEND_MAX_EXTENT_REFS (64) known references on the extent. However if we know we only have one reference in the extent item and only one clone source (the send root), then it's pointless to do the backref walking to search for clone sources, as we can't clone from any other root. So skip the backref walking in that case. The following test was run on a non-debug kernel (Debian's default kernel config): $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount $DEV $MNT # Create an extent tree that's not too small and none of the # extents is shared. for ((i = 1; i <= 50000; i++)); do xfs_io -f -c "pwrite 0 4K" $MNT/file_$i > /dev/null echo -ne "\r$i files created..." done echo btrfs subvolume snapshot -r $MNT $MNT/snap start=$(date +%s%N) btrfs send $MNT/snap > /dev/null end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo -e "\nsend took $dur milliseconds" umount $MNT Before this change: send took 5389 milliseconds After this change: send took 4519 milliseconds (-16.1%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
At find_extent_clone() we are initializing to zero the 'found_itself' and 'found' fields of the backref context before we use it but we have already initialized the structure to zeroes when we declared it on stack, so it's pointless to initialize those fields and they are unnecessarily increasing the object text size with two "mov" instructions (x86_64). Similarly make the 'extent_len' initialization more clear by using an if- -then-else instead of a double assignment to it in case the extent's end crosses the i_size boundary. Before this change: $ size fs/btrfs/send.o text data bss dec hex filename 68694 4252 16 72962 11d02 fs/btrfs/send.o After this change: $ size fs/btrfs/send.o text data bss dec hex filename 68678 4252 16 72946 11cf2 fs/btrfs/send.o Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
We have this unclear comment at find_extent_clone() about extents starting at a file offset greater than or equals to the i_size of the inode. It's not really informative and it's misleading, since it mentions the author found such extents with snapshots and large files. Such extents are a result of fallocate with FALLOC_FL_KEEP_SIZE and there is no relation to snapshots or large files (all write paths update the i_size before inserting a new file extent item). So update the comment to be precise about it and why we don't bother looking for clone sources in that case. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When looking for an extent clone, at find_extent_clone(), we start by allocating a path and then check for cases where we can't have clones and exit immediately in those cases. It's a waste of time to allocate the path before those cases, so reorder the logic so that we check for those cases before allocating the path. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Since we have switched all raid56 workload to submit-and-wait method, there is no use for btrfs_fs_info::endio_raid56_workers workqueue and btrfs_raid_bio::end_io_work. Remove them to save some memory. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
This switch involves the following changes: - Make finish_parity_scrub() only to submit the write bios It will no longer call rbio_orig_end_io(), and now it will return error. - Add a new helper, recover_scrub_rbio(), to handle recovery It's just doing extra scrub related checks, and then call recover_sectors(). - Rename raid56_parity_scrub_stripe() to scrub_rbio() - Rename scrub_parity_work() to scrub_rbio_work_locked() To follow the existing naming scheme. - Delete unused functions Including: * finish_rmw() * raid_write_end_io() * raid56_bio_end_io() * __raid_recover_end_io() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Just like what we did for write/recovery, also extract the read bio assembly code into a helper for scrub. The difference between the three are: - rmw_assemble_read_bios() only submit reads for missing sectors Thus it will skip cached sectors, but will also read sectors which is not covered by any full stripe. (For cache usage) - recover_assemble_read_bios() reads every sector which has not failed - scrub_assemble_read_bios() has extra check for vertical stripes It's mostly the same as rmw_assemble_read_bios(), but will skip sectors which is not covered by a vertical stripe. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
This includes the following changes: - Implement new raid_unplug() functions Now we don't need a workqueue to run the plug, as all our work is just queue rmw_rbio_work() call, which can be executed without sleep. - Implement a rmw_rbio_work_locked() helper This is for unlock_stripe(), which is already holding the full stripe lock. - Remove all the old functions This should already shows how complex the old functions are, as we ended up removing the following functions: * rmw_work() * validate_rbio_for_rmw() * raid56_rmw_end_io_work() * raid56_rmw_stripe() * full_stripe_write() * partial_stripe_write() * __raid56_parity_write() * run_plug() * unplug_work() * btrfs_raid_unplug() * rmw_work() * __raid56_parity_recover() * raid_recover_end_io_work() - Unexport rmw_rbio() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
The new entrance will be called rmw_rbio(), it will have a streamlined workflow by using submit-and-wait method. Thus there will be no weird jumps between tons of functions, thus way more reader friendly, and will make later expansion easier, as it's now a straight workflow, the timing is way more clear. Unfortunately we can not yet migrate the RMW path to use this new entrance as we still need extra work to address the plug and unlock_stripe() function. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
The helper will be later used to refactor the rmw write path. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
The helper will later be used to refactor the whole RMW path. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Currently btrfs uses end_io functions to jump between different stages of recovery. For example, we go the following different functions: - raid56_bio_end_io() This handles the read for all the sectors (except the missing device). - __raid_recover_end_io() This does the real work, it's called inside the delayed work function raid_recover_end_io_work(). This one recovery path involves at least 3 different functions, which is a big burden for readers. This patch will change the behavior by: - Introduce a unified recovery entrance, recover_rbio() - Use submit-and-wait method So the workflow is not interrupted by the endio function jump. This doesn't bring performance change, but reduce the burden for reviewers. - Run the main function in the rmw_workers workqueue Now raid56_parity_recover() only needs to setup the work, and queue the work using start_async_work(). Now readers only need to do one function jump (start_async_work()) to find out the main entrance of recovery path. Furthermore, recover_rbio() function can easily be reused by other paths. The old recovery path is still utilized by degraded write path. It will be cleaned up when we have migrated the write path. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
This includes extra changes: - The allocation for unmap_array[] and pointers[] Now we allocate them in one go, and free them together. - Remove @err Use errno_to_blk_status(ret) instead. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
This new helper will be also utilized in the incoming refactor of recovery path. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Currently finish_rmw() will update the P/Q stripes before submitting the writes. It's done behind a for(;;) loop, it's a little congested indent-wise, so extract the code into a helper called generate_pq_vertical(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
This refactor includes the following behavior change first: - Don't error out if only P/Q is corrupted The old code will directly error out if only P/Q is corrupted. Although it is an logical error if we go into rebuild path with only P/Q corrupted, there is no need to error out. Just skip the rebuild and return the already good data. Then comes the following refactor which shouldn't cause behavior changes: - Introduce a helper to do vertical stripe recovery This not only reduce one indent level, but also paves the road for later data checksum verification in RMW cycles. - Sort rbio->faila/b before recovery So we don't need to do the same swap every vertical stripe - Replace a BUG_ON() with ASSERT() Or checkpatch won't let me pass. - Mark recovered sectors uptodate after the recover loop - Do the cleanup for pointers unconditionally We only need to initialize @pointers and @unmap_array to NULL, so we can safely free them unconditionally. - Mark the repaired sector uptodate in recover_vertical() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The two structures appear on the same call paths, btrfs_bio_ctrl is embedded in extent_page_data and we pass bio_ctrl to some functions. After merging there are fewer indirections and we have only one control structure. The packing remains same. The btrfs_bio_ctrl was selected as the target structure as the operation is closer to bio processing. Structure layout: struct btrfs_bio_ctrl { struct bio * bio; /* 0 8 */ int mirror_num; /* 8 4 */ enum btrfs_compression_type compress_type; /* 12 4 */ u32 len_to_stripe_boundary; /* 16 4 */ u32 len_to_oe_boundary; /* 20 4 */ btrfs_bio_end_io_t end_io_func; /* 24 8 */ bool extent_locked; /* 32 1 */ bool sync_io; /* 33 1 */ /* size: 40, cachelines: 1, members: 8 */ /* padding: 6 */ /* last cacheline: 40 bytes */ }; Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The semantics of the two members is a boolean, so change the type accordingly. We have space in extent_page_data due to alignment there's no change in size. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The div_factor* helpers calculate fraction or percentage fraction. The name is a bit confusing, we use it only for percentage calculations and there are two helpers. There's a helper mult_frac that's for general fractions, that tries to be accurate but we multiply and divide by small numbers so we can use the div_u64 helper. Rename the div_factor* helpers and use 1..100 percentage range, also drop the case checking for percentage == 100, it's never hit. The conversions: * div_factor calculates tenths and the numbers need to be adjusted * div_factor_fine is direct replacement Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
If when doing a direct IO write we need to fallback to buffered IO, we this comment at btrfs_direct_write() that says we can't directly fallback to buffered IO if we have a NOWAIT iocb, because we have no support for NOWAIT buffered writes. That is not true anymore, as support for NOWAIT buffered writes was added recently in commit 926078b2 ("btrfs: enable nowait async buffered writes"). However we still can't fallback to a buffered write in case we have a NOWAIT iocb, because we'll need to flush delalloc and wait for it to complete after doing the buffered write, and that can block for several reasons, the main reason being waiting for IO to complete. So update the comment to mention all that. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The header files should use the /* */ comment style, introduced in commit f3a84ccd ("btrfs: move the tree mod log code into its own file"). Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Currently we have inline extent read code behind two levels of indentation, factor them them out into a new function, read_inline_extent(), to make it a little easier to read. Since we're here, also remove @extent_offset and @pg_offset arguments from uncompress_inline() function, as it's not possible to have inline extents at non-inline file offset. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
The argument @new_inline changes the following members of extent_map: - em->compress_type - EXTENT_FLAG_COMPRESSED of em->flags However neither members makes a difference for inline extents: - Inline extent read never use above em members As inside btrfs_get_extent() we directly use the file extent item to do the read. - Inline extents are never to be split Thus code really needs em->compress_type or that flag will never be executed on inlined extents. (btrfs_drop_extent_cache() would be one example) - Fiemap no longer relies on extent maps Recent fiemap optimization makes fiemap to search subvolume tree directly, without using any extent map at all. Thus those members make no difference for inline extents any more. Furthermore such exception without much explanation is really a source of confusion. Thus this patch will completely remove the argument, and always set the involved members, unifying the behavior. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-