Commits · 35da5a7edec32935135737608c9d9da24a419bab · Kirill Smelkov / linux

05 Dec, 2022 40 commits

btrfs: drop private_data parameter from extent_io_tree_init · 35da5a7e

David Sterba authored Oct 28, 2022

All callers except one pass NULL, so the parameter can be dropped and
the inode::io_tree initialization can be open coded.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

35da5a7e

btrfs: pass btrfs_inode to btrfs_delete_subvolume · 3c4f91e2

David Sterba authored Oct 27, 2022

The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

3c4f91e2

btrfs: pass btrfs_inode to __unlink_start_trans · e569b1d5

David Sterba authored Oct 27, 2022

The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e569b1d5

btrfs: pass btrfs_inode to btrfs_check_data_csum · 621af94a

David Sterba authored Oct 27, 2022

The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

621af94a

btrfs: switch btrfs_writepage_fixup::inode to btrfs_inode · 36eeaef5

David Sterba authored Oct 28, 2022

The btrfs_writepage_fixup structure is for internal interfaces so we
should use the btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

36eeaef5

btrfs: pass btrfs_inode to btrfs_add_delalloc_inodes · 82ca5a04

David Sterba authored Oct 27, 2022

The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

82ca5a04

btrfs: pass btrfs_inode to btrfs_dirty_inode · 7152b425

David Sterba authored Oct 27, 2022

The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

7152b425

btrfs: pass btrfs_inode to btrfs_inode_unlock · e5d4d75b

David Sterba authored Oct 27, 2022

The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e5d4d75b

btrfs: pass btrfs_inode to btrfs_inode_lock · 29b6352b

David Sterba authored Oct 27, 2022

The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

29b6352b

btrfs: pass btrfs_inode to btrfs_truncate · d9dcae67

David Sterba authored Oct 27, 2022

The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

d9dcae67

btrfs: pass btrfs_inode to btrfs_submit_dio_bio · bb41632e

David Sterba authored Oct 27, 2022

The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

bb41632e

btrfs: switch btrfs_dio_private::inode to btrfs_inode · e2884c3d

David Sterba authored Oct 27, 2022

The btrfs_dio_private structure is for internal interfaces so we should
use the btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e2884c3d

btrfs: pass btrfs_inode to btrfs_repair_one_sector · d8f9268e

David Sterba authored Oct 27, 2022

The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

d8f9268e

btrfs: pass btrfs_inode to submit_one_bio · c5ca391b

David Sterba authored Oct 27, 2022

The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c5ca391b

btrfs: pass btrfs_inode to btrfs_submit_dio_repair_bio · d781c1c3

David Sterba authored Oct 27, 2022

The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

d781c1c3

btrfs: pass btrfs_inode to btrfs_submit_data_read_bio · b7620416

David Sterba authored Oct 27, 2022

The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

b7620416

btrfs: pass btrfs_inode to btrfs_submit_data_write_bio · 535a7e5d

David Sterba authored Oct 27, 2022

The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

535a7e5d

btrfs: pass btrfs_inode to btrfs_submit_metadata_bio · 644094fd

David Sterba authored Oct 27, 2022

The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

644094fd

btrfs: pass btrfs_inode to btrfs_wq_submit_bio · 5fcdadc2

David Sterba authored Oct 27, 2022

The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

5fcdadc2

btrfs: pass btrfs_inode to btrfs_submit_bio_start_direct_io · bfa17066

David Sterba authored Oct 27, 2022

The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

bfa17066

btrfs: pass btrfs_inode to btrfs_submit_bio_start · 882681ac

David Sterba authored Oct 27, 2022

The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

882681ac

btrfs: switch async_submit_bio::inode to btrfs_inode · da67daab

David Sterba authored Oct 27, 2022

The async bio submit is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

da67daab

btrfs: simplify btree_submit_bio_start and btrfs_submit_bio_start parameters · ad65ecf3

David Sterba authored Oct 27, 2022

After previous patches the unused parameters can be removed from
btree_submit_bio_start and btrfs_submit_bio_start as they don't need to
conform to the extent_submit_bio_start_t typedef.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ad65ecf3

btrfs: change how submit bio callback is passed to btrfs_wq_submit_bio · ab2072b2

David Sterba authored Oct 27, 2022

There's a callback function parameter for btrfs_wq_submit_bio that can
be one of: metadata, buffered data, direct io data. The callback
abstraction is unnecessary as we have all functions available.

Replace the parameter with a command that leads to a direct call in
run_one_async_start. The called functions can be then simplified and we
can also remove the extent_submit_bio_start_t typedef.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ab2072b2

btrfs: drop parameter compression_type from btrfs_submit_dio_repair_bio · 7920b773

David Sterba authored Oct 27, 2022

Compression and direct io don't work together so the compression
parameter can be dropped after previous patch that changed the call
to direct.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

7920b773

btrfs: change how repair action is passed to btrfs_repair_one_sector · 19af6a7d

David Sterba authored Oct 27, 2022

There's a function pointer passed to btrfs_repair_one_sector that will
submit the right bio for repair. However there are only two callbacks,
for buffered and for direct IO. This can be simplified to a bool-based
switch and call either function, indirect calls in this case is an
unnecessary abstraction. This allows to remove the submit_bio_hook_t
typedef.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

19af6a7d

btrfs: convert btrfs_block_group::seq_zone to runtime flag · 961f5b8b

David Sterba authored Oct 31, 2022

In zoned mode the sequential status of zone can be also tracked in the
runtime flags of block group.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

961f5b8b

btrfs: convert btrfs_block_group::needs_free_space to runtime flag · 0d7764ff

David Sterba authored Oct 31, 2022

We already have flags in block group to track various status bits,
convert needs_free_space as well and reduce size of btrfs_block_group.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

0d7764ff

btrfs: zoned: use helper to check a power of two zone size · fd463ac4

David Sterba authored Oct 31, 2022

We have a 64bit compatible helper to check if a value is a power of two,
use it instead of open coding it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

fd463ac4

btrfs: zlib: use copy_page for full page copy · 9e5e6d4e

David Sterba authored Oct 31, 2022

The copy_page helper may use an optimized version for full page copy
(eg. on s390 there's a special instruction for that), there's one more
left to convert.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

9e5e6d4e

btrfs: send: bump the extent reference count limit for backref walking · e2a04165

Filipe Manana authored Nov 01, 2022

After the previous patchset which is comprised of the following patches:

01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source

we have now much better performance when doing backref walking in the send
code, so we can increase the current limit from 64 to 1024 references.
This limit is still a bit conservative because there are still edge cases
where backref walking will be too slow and spend a lot of cpu time, some IO
reading b+tree nodes/leaves and memory. The goal is to eventually get rid
of any limit, but for now bump it as it benefits users with extents shared
more than 64 times and up to 1024 times, allowing for more deduplication
at the destination without having to run a dedupe tool after a receive.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e2a04165

btrfs: send: skip resolution of our own backref when finding clone source · adf02418

Filipe Manana authored Nov 01, 2022

When doing backref walking to determine a source range to clone from, it
is worthless to collect and resolve our own data backref, as we can't
obviously use it as a clone source and it represents the range we want to
clone into. Collecting the backref implies doing the extra work to resolve
it, doing the search for a file extent item in a subvolume tree, etc.
Skipping the data backref is valid as long as we only have the send root
as the single clone root, otherwise the leaf with the file extent item may
be accessible from another clone root due to shared subtrees created by
snapshots, and therefore we have to collect the backref and resolve it.

So add a callback to the backref walking code to guide it to skip data
backrefs.

This change is part of a patchset comprised of the following patches:

  01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
  02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
  03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
  04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
  05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
  06/17 btrfs: send: update comment at find_extent_clone()
  07/17 btrfs: send: drop unnecessary backref context field initializations
  08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
  09/17 btrfs: send: optimize clone detection to increase extent sharing
  10/17 btrfs: use a single argument for extent offset in backref walking functions
  11/17 btrfs: use a structure to pass arguments to backref walking functions
  12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
  13/17 btrfs: constify ulist parameter of ulist_next()
  14/17 btrfs: send: cache leaf to roots mapping during backref walking
  15/17 btrfs: send: skip unnecessary backref iterations
  16/17 btrfs: send: avoid double extent tree search when finding clone source
  17/17 btrfs: send: skip resolution of our own backref when finding clone source

The following test was run on non-debug kernel (Debian's default kernel
config) before and after applying the patchset:

   $ cat test-send-many-shared-extents.sh
   #!/bin/bash

   DEV=/dev/sdh
   MNT=/mnt/sdh

   umount $DEV &> /dev/null
   mkfs.btrfs -f $DEV
   mount $DEV $MNT

   num_files=50000
   num_clones_per_file=50

   for ((i = 1; i <= $num_files; i++)); do
       xfs_io -f -c "pwrite 0 64K" $MNT/file_$i > /dev/null
       echo -ne "\r$i files created..."
   done
   echo

   btrfs subvolume snapshot -r $MNT $MNT/snap1

   cloned=0
   for ((i = 1; i <= $num_clones_per_file; i++)); do
       for ((j = 1; j <= $num_files; j++)); do
           cp --reflink=always $MNT/file_$j $MNT/file_${j}_clone_${i}
           cloned=$((cloned + 1))
           echo -ne "\r$cloned / $((num_files * num_clones_per_file)) clone operations"
       done
   done
   echo

   btrfs subvolume snapshot -r $MNT $MNT/snap2

   # Unmount and mount again to clear all cached metadata (and data).
   umount $DEV
   mount $DEV $MNT

   start=$(date +%s%N)
   btrfs send $MNT/snap2 > /dev/null
   end=$(date +%s%N)

   dur=$(( (end - start) / 1000000000 ))
   echo -e "\nFull send took $dur seconds"

   # Unmount and mount again to clear all cached metadata (and data).
   umount $DEV
   mount $DEV $MNT

   start=$(date +%s%N)
   btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
   end=$(date +%s%N)

   dur=$(( (end - start) / 1000000000 ))
   echo -e "\nIncremental send took $dur seconds"

   umount $MNT

Before applying the patchset:

   (...)
   Full send took 1108 seconds
   (...)
   Incremental send took 1135 seconds

After applying the whole patchset:

   (...)
   Full send took 268 seconds            (-75.8%)
   (...)
   Incremental send took 316 seconds     (-72.2%)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

adf02418

btrfs: send: avoid double extent tree search when finding clone source · f73853c7

Filipe Manana authored Nov 01, 2022

At find_extent_clone() we search twice for the extent item corresponding
to the data extent that the current file extent items points to:

1) Once with a call to extent_from_logical();

2) Once again during backref walking, through iterate_extent_inodes()
which eventually leads to find_parent_nodes() where we will search
again the extent tree for the same extent item.

The extent tree can be huge, so doing this one extra search for every
extent we want to send adds up and it's expensive.

The first call is there since the send code was introduced and it
accomplishes two things:

1) Check that the extent is flagged as a data extent in the extent tree.
But it can not be anything else, otherwise we wouldn't have a file
extent item in the send root pointing to it.
This was probably added to catch bugs in the early days where send was
yet too young and the interaction with everything else was far from
perfect;

2) Check how many direct references there are on the extent, and if
there's too many (more than SEND_MAX_EXTENT_REFS), avoid doing the
backred walking as it may take too long and slowdown send.

So improve on this by having a callback in the backref walking code that
is called when it finds the extent item in the extent tree, and have those
checks done in the callback. When the callback returns anything different
from 0, it stops the backref walking code. This way we do a single search
on the extent tree for the extent item of our data extent.

Also, before this change we were only checking the number of references on
the data extent against SEND_MAX_EXTENT_REFS, but after starting backref
walking we will end up resolving backrefs for extent buffers in the path
from a leaf having a file extent item pointing to our data extent, up to
roots of trees from which the extent buffer is accessible from, due to
shared subtrees resulting from snapshoting. We were therefore allowing for
the possibility for send taking too long due to some node in the path from
the leaf to a root node being shared too many times. After this change we
check for reference counts being greater than SEND_MAX_EXTENT_REFS for
both data extents and metadata extents.

This change is part of a patchset comprised of the following patches:

Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

f73853c7

btrfs: send: skip unnecessary backref iterations · 88ffb665

Filipe Manana authored Nov 01, 2022

When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.

Basically what happens currently is this:

1) Call iterate_extent_inodes() to iterate over all the backreferences;

2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();

3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);

4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.

Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.

This change is part of a patchset comprised of the following patches:

Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

88ffb665

btrfs: send: cache leaf to roots mapping during backref walking · 66d04209

Filipe Manana authored Nov 01, 2022

During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.

This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.

This cache also allows for another important optimization that is
introduced in the next patch in the series.

This change is part of a patchset comprised of the following patches:

Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

66d04209

btrfs: constify ulist parameter of ulist_next() · fa104a87

Filipe Manana authored Nov 01, 2022

The ulist_next() iterator function does not need to change the given ulist
so make it const. This will allow the next patch in the series to pass a
ulist to a function that does not need, and should not, modify the ulist.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

fa104a87

btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes() · 1baea6f1

Filipe Manana authored Nov 01, 2022

At iterate_extent_inodes() we collect a ulist of leaves for a given extent
with a call to btrfs_find_all_leafs() and then we enter a loop where we
iterate over all the collected leaves. Each iteration of that loop does a
call to btrfs_find_all_roots_safe(), to determine all roots from which a
leaf is accessible, and that results in allocating and releasing a ulist
to store the root IDs.

Instead of allocating and releasing the roots ulist on every iteration,
allocate a ulist before entering the loop and keep using it on each
iteration, reinitializing the ulist at the end of each iteration.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

1baea6f1

btrfs: use a structure to pass arguments to backref walking functions · a2c8d27e

Filipe Manana authored Nov 01, 2022

The public backref walking functions have quite a lot of arguments that
are passed down the call stack to find_parent_nodes(), the core function
of the backref walking code.

The next patches in series will need to add even arguments to these
functions that should be passed not only to find_parent_nodes(), but also
to other functions used by the later (directly or even lower in the call
stack).

So create a structure to hold all these arguments and state used by the
main backref walking function, find_parent_nodes(), and use it as the
argument for the public backref walking functions iterate_extent_inodes(),
btrfs_find_all_leafs() and btrfs_find_all_roots().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

a2c8d27e

btrfs: use a single argument for extent offset in backref walking functions · 6ce6ba53

Filipe Manana authored Nov 01, 2022

The interface for find_parent_nodes() has two extent offset related
arguments:

1) One u64 pointer argument for the extent offset;

2) One boolean argument to tell if the extent offset should be ignored or
   not.

These are confusing, becase the extent offset pointer can be NULL and in
some cases callers pass a NULL value as a way to tell the backref walking
code to ignore offsets in file extent items (and simply consider all file
extent items that point to the target data extent).

The boolean argument was added in commit c995ab3c ("btrfs: add a flag
to iterate_inodes_from_logical to find all extent refs for uncompressed
extents"), but it was never really necessary, it was enough if it could
find a way to get a NULL value passed to the "extent_item_pos" argument of
find_parent_nodes(). The arguments are also passed to functions called
by find_parent_nodes() and respective helper functions, which further
makes everything more complicated than needed.

Then we have several backref walking related functions that end up calling
find_parent_nodes(), either directly or through some other function that
they call, and for many we have to use an "extent_item_pos" (u64) argument
and a boolean "ignore_offset" argument too.

This is confusing and not really necessary. So use a single argument to
specify the extent offset, as a simple u64 and not as a pointer, but
using a special value of (u64)-1, defined as a documented constant, to
indicate when the extent offset should be ignored.

This is also preparation work for the upcoming patches in the series that
add other arguments to find_parent_nodes() and other related functions
that use it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

6ce6ba53

btrfs: send: optimize clone detection to increase extent sharing · c7499a64

Filipe Manana authored Nov 01, 2022

Currently send does not do the best decisions when it comes to decide
between multiple clone sources, which results in clone operations for
partial extent ranges, which has the following disadvantages:

1) We get less shared extents at the destination;

2) We have to read more data during the send operation and emit more
   write commands.

Besides not being optimal behaviour, it also breaks user expectations and
is often reported by users, with a recent example in the Link tag at the
bottom of this change log.

Part of the reason for this non-optimal behaviour is that the backref
walking code does not provide information about the length of the file
extent items that were found for each backref, so send is blind about
which backref is the best to chose as a cloning source.

The other existing reasons are just silliness, namely always prefering
the inode with the lowest number when multiple are found for the same
root and when we can clone from multiple roots, always prefer the send
root over any of the other clone roots. This does not make any sense
since any inode or root is fine and as good as any other inode/root.

Fix this by making backref walking pass information about the number of
bytes referenced by each file extent item and then have send's backref
callback pick the inode with the highest number of bytes for each root.
Finally select the root from which we can clone more bytes from.

Example reproducer:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/sdi
   MNT=/mnt/sdi

   mkfs.btrfs -f $DEV
   mount $DEV $MNT

   xfs_io -f -c "pwrite -S 0xab -b 2M 0 2M" $MNT/foo
   cp --reflink=always $MNT/foo $MNT/bar
   cp --reflink=always $MNT/foo $MNT/baz
   sync

   # Overwrite the second half of file foo.
   xfs_io -c "pwrite -S 0xcd -b 1M 1M 1M" $MNT/foo
   sync

   echo
   echo "*** fiemap in the original filesystem ***"
   echo
   xfs_io -c "fiemap -v" $MNT/foo
   xfs_io -c "fiemap -v" $MNT/bar
   xfs_io -c "fiemap -v" $MNT/baz
   echo

   btrfs filesystem du $MNT

   btrfs subvolume snapshot -r $MNT $MNT/snap

   btrfs send -f /tmp/send_stream $MNT/snap

   umount $MNT
   mkfs.btrfs -f $DEV &> /dev/null
   mount $DEV $MNT

   btrfs receive -f /tmp/send_stream $MNT

   echo
   echo "*** fiemap in the new filesystem ***"
   echo
   xfs_io -r -c "fiemap -v" $MNT/snap/foo
   xfs_io -r -c "fiemap -v" $MNT/snap/bar
   xfs_io -r -c "fiemap -v" $MNT/snap/baz
   echo

   btrfs filesystem du $MNT

   rm -f /tmp/send_stream
   rm -f /tmp/snap.fssum

   umount $MNT

Before this change:

   $ ./test.sh
   (...)

   *** fiemap in the original filesystem ***

   /mnt/sdi/foo:
    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
      0: [0..2047]:       26624..28671      2048 0x2000
      1: [2048..4095]:    30720..32767      2048   0x1
   /mnt/sdi/bar:
    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
      0: [0..4095]:       26624..30719      4096 0x2001
   /mnt/sdi/baz:
    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
      0: [0..4095]:       26624..30719      4096 0x2001

        Total   Exclusive  Set shared  Filename
      2.00MiB     1.00MiB           -  /mnt/sdi/foo
      2.00MiB       0.00B           -  /mnt/sdi/bar
      2.00MiB       0.00B           -  /mnt/sdi/baz
      6.00MiB     1.00MiB     2.00MiB  /mnt/sdi

   Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap'
   At subvol /mnt/sdi/snap
   At subvol snap

   *** fiemap in the new filesystem ***

   /mnt/sdi/snap/foo:
    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
      0: [0..4095]:       26624..30719      4096 0x2001
   /mnt/sdi/snap/bar:
    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
      0: [0..2047]:       26624..28671      2048 0x2000
      1: [2048..4095]:    30720..32767      2048   0x1
   /mnt/sdi/snap/baz:
    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
      0: [0..2047]:       26624..28671      2048 0x2000
      1: [2048..4095]:    32768..34815      2048   0x1

        Total   Exclusive  Set shared  Filename
      2.00MiB       0.00B           -  /mnt/sdi/snap/foo
      2.00MiB     1.00MiB           -  /mnt/sdi/snap/bar
      2.00MiB     1.00MiB           -  /mnt/sdi/snap/baz
      6.00MiB     2.00MiB           -  /mnt/sdi/snap
      6.00MiB     2.00MiB     2.00MiB  /mnt/sdi

We end up with two 1M extents that are not shared for files bar and baz.

After this change:

   $ ./test.sh
   (...)

   *** fiemap in the original filesystem ***

   /mnt/sdi/foo:
    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
      0: [0..2047]:       26624..28671      2048 0x2000
      1: [2048..4095]:    30720..32767      2048   0x1
   /mnt/sdi/bar:
    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
      0: [0..4095]:       26624..30719      4096 0x2001
   /mnt/sdi/baz:
    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
      0: [0..4095]:       26624..30719      4096 0x2001

        Total   Exclusive  Set shared  Filename
      2.00MiB     1.00MiB           -  /mnt/sdi/foo
      2.00MiB       0.00B           -  /mnt/sdi/bar
      2.00MiB       0.00B           -  /mnt/sdi/baz
      6.00MiB     1.00MiB     2.00MiB  /mnt/sdi
   Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap'
   At subvol /mnt/sdi/snap
   At subvol snap

   *** fiemap in the new filesystem ***

   /mnt/sdi/snap/foo:
    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
      0: [0..4095]:       26624..30719      4096 0x2001
   /mnt/sdi/snap/bar:
    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
      0: [0..2047]:       26624..28671      2048 0x2000
      1: [2048..4095]:    30720..32767      2048 0x2001
   /mnt/sdi/snap/baz:
    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
      0: [0..2047]:       26624..28671      2048 0x2000
      1: [2048..4095]:    30720..32767      2048 0x2001

        Total   Exclusive  Set shared  Filename
      2.00MiB       0.00B           -  /mnt/sdi/snap/foo
      2.00MiB       0.00B           -  /mnt/sdi/snap/bar
      2.00MiB       0.00B           -  /mnt/sdi/snap/baz
      6.00MiB       0.00B           -  /mnt/sdi/snap
      6.00MiB       0.00B     3.00MiB  /mnt/sdi

Now there's a much better sharing, files bar and baz share 1M of the
extent of file foo and the second extent of files bar and baz is shared
between themselves.

This will later be turned into a test case for fstests.

Link: https://lore.kernel.org/linux-btrfs/20221008005704.795b44b0@crass-HP-ZBook-15-G2/Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c7499a64