1. 05 Dec, 2022 40 commits
    • David Sterba's avatar
      btrfs: pass btrfs_inode to btrfs_submit_bio_start · 882681ac
      David Sterba authored
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      882681ac
    • David Sterba's avatar
      btrfs: switch async_submit_bio::inode to btrfs_inode · da67daab
      David Sterba authored
      The async bio submit is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      da67daab
    • David Sterba's avatar
      btrfs: simplify btree_submit_bio_start and btrfs_submit_bio_start parameters · ad65ecf3
      David Sterba authored
      After previous patches the unused parameters can be removed from
      btree_submit_bio_start and btrfs_submit_bio_start as they don't need to
      conform to the extent_submit_bio_start_t typedef.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ad65ecf3
    • David Sterba's avatar
      btrfs: change how submit bio callback is passed to btrfs_wq_submit_bio · ab2072b2
      David Sterba authored
      There's a callback function parameter for btrfs_wq_submit_bio that can
      be one of: metadata, buffered data, direct io data. The callback
      abstraction is unnecessary as we have all functions available.
      
      Replace the parameter with a command that leads to a direct call in
      run_one_async_start. The called functions can be then simplified and we
      can also remove the extent_submit_bio_start_t typedef.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ab2072b2
    • David Sterba's avatar
      btrfs: drop parameter compression_type from btrfs_submit_dio_repair_bio · 7920b773
      David Sterba authored
      Compression and direct io don't work together so the compression
      parameter can be dropped after previous patch that changed the call
      to direct.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7920b773
    • David Sterba's avatar
      btrfs: change how repair action is passed to btrfs_repair_one_sector · 19af6a7d
      David Sterba authored
      There's a function pointer passed to btrfs_repair_one_sector that will
      submit the right bio for repair. However there are only two callbacks,
      for buffered and for direct IO. This can be simplified to a bool-based
      switch and call either function, indirect calls in this case is an
      unnecessary abstraction. This allows to remove the submit_bio_hook_t
      typedef.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      19af6a7d
    • David Sterba's avatar
      btrfs: convert btrfs_block_group::seq_zone to runtime flag · 961f5b8b
      David Sterba authored
      In zoned mode the sequential status of zone can be also tracked in the
      runtime flags of block group.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      961f5b8b
    • David Sterba's avatar
      btrfs: convert btrfs_block_group::needs_free_space to runtime flag · 0d7764ff
      David Sterba authored
      We already have flags in block group to track various status bits,
      convert needs_free_space as well and reduce size of btrfs_block_group.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0d7764ff
    • David Sterba's avatar
      btrfs: zoned: use helper to check a power of two zone size · fd463ac4
      David Sterba authored
      We have a 64bit compatible helper to check if a value is a power of two,
      use it instead of open coding it.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fd463ac4
    • David Sterba's avatar
      btrfs: zlib: use copy_page for full page copy · 9e5e6d4e
      David Sterba authored
      The copy_page helper may use an optimized version for full page copy
      (eg. on s390 there's a special instruction for that), there's one more
      left to convert.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9e5e6d4e
    • Filipe Manana's avatar
      btrfs: send: bump the extent reference count limit for backref walking · e2a04165
      Filipe Manana authored
      After the previous patchset which is comprised of the following patches:
      
        01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
        02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
        03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
        04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
        05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
        06/17 btrfs: send: update comment at find_extent_clone()
        07/17 btrfs: send: drop unnecessary backref context field initializations
        08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
        09/17 btrfs: send: optimize clone detection to increase extent sharing
        10/17 btrfs: use a single argument for extent offset in backref walking functions
        11/17 btrfs: use a structure to pass arguments to backref walking functions
        12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
        13/17 btrfs: constify ulist parameter of ulist_next()
        14/17 btrfs: send: cache leaf to roots mapping during backref walking
        15/17 btrfs: send: skip unnecessary backref iterations
        16/17 btrfs: send: avoid double extent tree search when finding clone source
        17/17 btrfs: send: skip resolution of our own backref when finding clone source
      
      we have now much better performance when doing backref walking in the send
      code, so we can increase the current limit from 64 to 1024 references.
      This limit is still a bit conservative because there are still edge cases
      where backref walking will be too slow and spend a lot of cpu time, some IO
      reading b+tree nodes/leaves and memory. The goal is to eventually get rid
      of any limit, but for now bump it as it benefits users with extents shared
      more than 64 times and up to 1024 times, allowing for more deduplication
      at the destination without having to run a dedupe tool after a receive.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e2a04165
    • Filipe Manana's avatar
      btrfs: send: skip resolution of our own backref when finding clone source · adf02418
      Filipe Manana authored
      When doing backref walking to determine a source range to clone from, it
      is worthless to collect and resolve our own data backref, as we can't
      obviously use it as a clone source and it represents the range we want to
      clone into. Collecting the backref implies doing the extra work to resolve
      it, doing the search for a file extent item in a subvolume tree, etc.
      Skipping the data backref is valid as long as we only have the send root
      as the single clone root, otherwise the leaf with the file extent item may
      be accessible from another clone root due to shared subtrees created by
      snapshots, and therefore we have to collect the backref and resolve it.
      
      So add a callback to the backref walking code to guide it to skip data
      backrefs.
      
      This change is part of a patchset comprised of the following patches:
      
        01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
        02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
        03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
        04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
        05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
        06/17 btrfs: send: update comment at find_extent_clone()
        07/17 btrfs: send: drop unnecessary backref context field initializations
        08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
        09/17 btrfs: send: optimize clone detection to increase extent sharing
        10/17 btrfs: use a single argument for extent offset in backref walking functions
        11/17 btrfs: use a structure to pass arguments to backref walking functions
        12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
        13/17 btrfs: constify ulist parameter of ulist_next()
        14/17 btrfs: send: cache leaf to roots mapping during backref walking
        15/17 btrfs: send: skip unnecessary backref iterations
        16/17 btrfs: send: avoid double extent tree search when finding clone source
        17/17 btrfs: send: skip resolution of our own backref when finding clone source
      
      The following test was run on non-debug kernel (Debian's default kernel
      config) before and after applying the patchset:
      
         $ cat test-send-many-shared-extents.sh
         #!/bin/bash
      
         DEV=/dev/sdh
         MNT=/mnt/sdh
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         num_files=50000
         num_clones_per_file=50
      
         for ((i = 1; i <= $num_files; i++)); do
             xfs_io -f -c "pwrite 0 64K" $MNT/file_$i > /dev/null
             echo -ne "\r$i files created..."
         done
         echo
      
         btrfs subvolume snapshot -r $MNT $MNT/snap1
      
         cloned=0
         for ((i = 1; i <= $num_clones_per_file; i++)); do
             for ((j = 1; j <= $num_files; j++)); do
                 cp --reflink=always $MNT/file_$j $MNT/file_${j}_clone_${i}
                 cloned=$((cloned + 1))
                 echo -ne "\r$cloned / $((num_files * num_clones_per_file)) clone operations"
             done
         done
         echo
      
         btrfs subvolume snapshot -r $MNT $MNT/snap2
      
         # Unmount and mount again to clear all cached metadata (and data).
         umount $DEV
         mount $DEV $MNT
      
         start=$(date +%s%N)
         btrfs send $MNT/snap2 > /dev/null
         end=$(date +%s%N)
      
         dur=$(( (end - start) / 1000000000 ))
         echo -e "\nFull send took $dur seconds"
      
         # Unmount and mount again to clear all cached metadata (and data).
         umount $DEV
         mount $DEV $MNT
      
         start=$(date +%s%N)
         btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
         end=$(date +%s%N)
      
         dur=$(( (end - start) / 1000000000 ))
         echo -e "\nIncremental send took $dur seconds"
      
         umount $MNT
      
      Before applying the patchset:
      
         (...)
         Full send took 1108 seconds
         (...)
         Incremental send took 1135 seconds
      
      After applying the whole patchset:
      
         (...)
         Full send took 268 seconds            (-75.8%)
         (...)
         Incremental send took 316 seconds     (-72.2%)
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      adf02418
    • Filipe Manana's avatar
      btrfs: send: avoid double extent tree search when finding clone source · f73853c7
      Filipe Manana authored
      At find_extent_clone() we search twice for the extent item corresponding
      to the data extent that the current file extent items points to:
      
      1) Once with a call to extent_from_logical();
      
      2) Once again during backref walking, through iterate_extent_inodes()
         which eventually leads to find_parent_nodes() where we will search
         again the extent tree for the same extent item.
      
      The extent tree can be huge, so doing this one extra search for every
      extent we want to send adds up and it's expensive.
      
      The first call is there since the send code was introduced and it
      accomplishes two things:
      
      1) Check that the extent is flagged as a data extent in the extent tree.
         But it can not be anything else, otherwise we wouldn't have a file
         extent item in the send root pointing to it.
         This was probably added to catch bugs in the early days where send was
         yet too young and the interaction with everything else was far from
         perfect;
      
      2) Check how many direct references there are on the extent, and if
         there's too many (more than SEND_MAX_EXTENT_REFS), avoid doing the
         backred walking as it may take too long and slowdown send.
      
      So improve on this by having a callback in the backref walking code that
      is called when it finds the extent item in the extent tree, and have those
      checks done in the callback. When the callback returns anything different
      from 0, it stops the backref walking code. This way we do a single search
      on the extent tree for the extent item of our data extent.
      
      Also, before this change we were only checking the number of references on
      the data extent against SEND_MAX_EXTENT_REFS, but after starting backref
      walking we will end up resolving backrefs for extent buffers in the path
      from a leaf having a file extent item pointing to our data extent, up to
      roots of trees from which the extent buffer is accessible from, due to
      shared subtrees resulting from snapshoting. We were therefore allowing for
      the possibility for send taking too long due to some node in the path from
      the leaf to a root node being shared too many times. After this change we
      check for reference counts being greater than SEND_MAX_EXTENT_REFS for
      both data extents and metadata extents.
      
      This change is part of a patchset comprised of the following patches:
      
        01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
        02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
        03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
        04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
        05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
        06/17 btrfs: send: update comment at find_extent_clone()
        07/17 btrfs: send: drop unnecessary backref context field initializations
        08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
        09/17 btrfs: send: optimize clone detection to increase extent sharing
        10/17 btrfs: use a single argument for extent offset in backref walking functions
        11/17 btrfs: use a structure to pass arguments to backref walking functions
        12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
        13/17 btrfs: constify ulist parameter of ulist_next()
        14/17 btrfs: send: cache leaf to roots mapping during backref walking
        15/17 btrfs: send: skip unnecessary backref iterations
        16/17 btrfs: send: avoid double extent tree search when finding clone source
        17/17 btrfs: send: skip resolution of our own backref when finding clone source
      
      Performance test results are in the changelog of patch 17/17.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f73853c7
    • Filipe Manana's avatar
      btrfs: send: skip unnecessary backref iterations · 88ffb665
      Filipe Manana authored
      When looking for a clone source for an extent, we are iterating over all
      the backreferences for an extent. This is often a waste of time, because
      once we find a good clone source we could stop immediately instead of
      continuing backref walking, which is expensive.
      
      Basically what happens currently is this:
      
      1) Call iterate_extent_inodes() to iterate over all the backreferences;
      
      2) It calls btrfs_find_all_leafs() which in turn calls the main function
         to walk over backrefs and collect them - find_parent_nodes();
      
      3) Then we collect all the references for our target data extent from the
         extent tree (and delayed refs if any), add them to the rb trees,
         resolve all the indirect backreferences and search for all the file
         extent items in fs trees, building a list of inodes for each one of
         them (struct extent_inode_elem);
      
      4) Then back at iterate_extent_inodes() we find all the roots associated
         to each found leaf, and call the callback __iterate_backrefs defined
         at send.c for each inode in the inode list associated to each leaf.
      
      Some times one the first backreferences we find in a fs tree is optimal
      to satisfy the clone operation that send wants to perform, and in that
      case we could stop immediately and avoid resolving all the remaining
      indirect backreferences (search fs trees for the respective file extent
      items, etc). This possibly if when we find a fs tree leaf with a file
      extent item we are able to know what are all the roots that can lead to
      the leaf - this is now possible after the previous patch in the series
      that adds a cache that maps leaves to a list of roots. So we can now
      shortcircuit backref walking during send, by having the callback we
      pass to iterate_extent_inodes() to be called when we find a file extent
      item for an indirect backreference, and have it return a special value
      when it found a suitable backreference and it does not need to look for
      more backreferences. This change does that.
      
      This change is part of a patchset comprised of the following patches:
      
        01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
        02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
        03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
        04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
        05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
        06/17 btrfs: send: update comment at find_extent_clone()
        07/17 btrfs: send: drop unnecessary backref context field initializations
        08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
        09/17 btrfs: send: optimize clone detection to increase extent sharing
        10/17 btrfs: use a single argument for extent offset in backref walking functions
        11/17 btrfs: use a structure to pass arguments to backref walking functions
        12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
        13/17 btrfs: constify ulist parameter of ulist_next()
        14/17 btrfs: send: cache leaf to roots mapping during backref walking
        15/17 btrfs: send: skip unnecessary backref iterations
        16/17 btrfs: send: avoid double extent tree search when finding clone source
        17/17 btrfs: send: skip resolution of our own backref when finding clone source
      
      Performance test results are in the changelog of patch 17/17.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      88ffb665
    • Filipe Manana's avatar
      btrfs: send: cache leaf to roots mapping during backref walking · 66d04209
      Filipe Manana authored
      During a send operation, when doing backref walking to determine which
      inodes/offsets/roots we can clone from, the most repetitive and expensive
      step is to map each leaf that has file extent items pointing to the target
      data extent to the IDs of the roots from which the leaves are accessible,
      which happens at iterate_extent_inodes(). That step requires finding every
      parent node of a leaf, then the parent of each parent, and so on until we
      reach a root node. So it's a naturally expensive operation, and repetitive
      because each leaf can have hundreds of file extent items (for a nodesize
      of 16K, that can be slightly over 200 file extent items). There's also
      temporal locality, as we process all file extent items from a leave before
      moving the next leaf.
      
      This change caches the mapping of leaves to root IDs, to avoid repeating
      those computations over and over again. The cache is limited to a maximum
      of 128 entries, with each entry being a struct with a size of 128 bytes,
      so the maximum cache size is 16K plus any nodes internally allocated by
      the maple tree that is used to index pointers to those structs. The cache
      is invalidated whenever we detect relocation happened since we started
      filling the cache, because if relocation happened then extent buffers for
      leaves and nodes of the trees used by a send operation may have been
      reallocated.
      
      This cache also allows for another important optimization that is
      introduced in the next patch in the series.
      
      This change is part of a patchset comprised of the following patches:
      
        01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
        02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
        03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
        04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
        05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
        06/17 btrfs: send: update comment at find_extent_clone()
        07/17 btrfs: send: drop unnecessary backref context field initializations
        08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
        09/17 btrfs: send: optimize clone detection to increase extent sharing
        10/17 btrfs: use a single argument for extent offset in backref walking functions
        11/17 btrfs: use a structure to pass arguments to backref walking functions
        12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
        13/17 btrfs: constify ulist parameter of ulist_next()
        14/17 btrfs: send: cache leaf to roots mapping during backref walking
        15/17 btrfs: send: skip unnecessary backref iterations
        16/17 btrfs: send: avoid double extent tree search when finding clone source
        17/17 btrfs: send: skip resolution of our own backref when finding clone source
      
      Performance test results are in the changelog of patch 17/17.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      66d04209
    • Filipe Manana's avatar
      btrfs: constify ulist parameter of ulist_next() · fa104a87
      Filipe Manana authored
      The ulist_next() iterator function does not need to change the given ulist
      so make it const. This will allow the next patch in the series to pass a
      ulist to a function that does not need, and should not, modify the ulist.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fa104a87
    • Filipe Manana's avatar
      btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes() · 1baea6f1
      Filipe Manana authored
      At iterate_extent_inodes() we collect a ulist of leaves for a given extent
      with a call to btrfs_find_all_leafs() and then we enter a loop where we
      iterate over all the collected leaves. Each iteration of that loop does a
      call to btrfs_find_all_roots_safe(), to determine all roots from which a
      leaf is accessible, and that results in allocating and releasing a ulist
      to store the root IDs.
      
      Instead of allocating and releasing the roots ulist on every iteration,
      allocate a ulist before entering the loop and keep using it on each
      iteration, reinitializing the ulist at the end of each iteration.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1baea6f1
    • Filipe Manana's avatar
      btrfs: use a structure to pass arguments to backref walking functions · a2c8d27e
      Filipe Manana authored
      The public backref walking functions have quite a lot of arguments that
      are passed down the call stack to find_parent_nodes(), the core function
      of the backref walking code.
      
      The next patches in series will need to add even arguments to these
      functions that should be passed not only to find_parent_nodes(), but also
      to other functions used by the later (directly or even lower in the call
      stack).
      
      So create a structure to hold all these arguments and state used by the
      main backref walking function, find_parent_nodes(), and use it as the
      argument for the public backref walking functions iterate_extent_inodes(),
      btrfs_find_all_leafs() and btrfs_find_all_roots().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a2c8d27e
    • Filipe Manana's avatar
      btrfs: use a single argument for extent offset in backref walking functions · 6ce6ba53
      Filipe Manana authored
      The interface for find_parent_nodes() has two extent offset related
      arguments:
      
      1) One u64 pointer argument for the extent offset;
      
      2) One boolean argument to tell if the extent offset should be ignored or
         not.
      
      These are confusing, becase the extent offset pointer can be NULL and in
      some cases callers pass a NULL value as a way to tell the backref walking
      code to ignore offsets in file extent items (and simply consider all file
      extent items that point to the target data extent).
      
      The boolean argument was added in commit c995ab3c ("btrfs: add a flag
      to iterate_inodes_from_logical to find all extent refs for uncompressed
      extents"), but it was never really necessary, it was enough if it could
      find a way to get a NULL value passed to the "extent_item_pos" argument of
      find_parent_nodes(). The arguments are also passed to functions called
      by find_parent_nodes() and respective helper functions, which further
      makes everything more complicated than needed.
      
      Then we have several backref walking related functions that end up calling
      find_parent_nodes(), either directly or through some other function that
      they call, and for many we have to use an "extent_item_pos" (u64) argument
      and a boolean "ignore_offset" argument too.
      
      This is confusing and not really necessary. So use a single argument to
      specify the extent offset, as a simple u64 and not as a pointer, but
      using a special value of (u64)-1, defined as a documented constant, to
      indicate when the extent offset should be ignored.
      
      This is also preparation work for the upcoming patches in the series that
      add other arguments to find_parent_nodes() and other related functions
      that use it.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6ce6ba53
    • Filipe Manana's avatar
      btrfs: send: optimize clone detection to increase extent sharing · c7499a64
      Filipe Manana authored
      Currently send does not do the best decisions when it comes to decide
      between multiple clone sources, which results in clone operations for
      partial extent ranges, which has the following disadvantages:
      
      1) We get less shared extents at the destination;
      
      2) We have to read more data during the send operation and emit more
         write commands.
      
      Besides not being optimal behaviour, it also breaks user expectations and
      is often reported by users, with a recent example in the Link tag at the
      bottom of this change log.
      
      Part of the reason for this non-optimal behaviour is that the backref
      walking code does not provide information about the length of the file
      extent items that were found for each backref, so send is blind about
      which backref is the best to chose as a cloning source.
      
      The other existing reasons are just silliness, namely always prefering
      the inode with the lowest number when multiple are found for the same
      root and when we can clone from multiple roots, always prefer the send
      root over any of the other clone roots. This does not make any sense
      since any inode or root is fine and as good as any other inode/root.
      
      Fix this by making backref walking pass information about the number of
      bytes referenced by each file extent item and then have send's backref
      callback pick the inode with the highest number of bytes for each root.
      Finally select the root from which we can clone more bytes from.
      
      Example reproducer:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/sdi
         MNT=/mnt/sdi
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         xfs_io -f -c "pwrite -S 0xab -b 2M 0 2M" $MNT/foo
         cp --reflink=always $MNT/foo $MNT/bar
         cp --reflink=always $MNT/foo $MNT/baz
         sync
      
         # Overwrite the second half of file foo.
         xfs_io -c "pwrite -S 0xcd -b 1M 1M 1M" $MNT/foo
         sync
      
         echo
         echo "*** fiemap in the original filesystem ***"
         echo
         xfs_io -c "fiemap -v" $MNT/foo
         xfs_io -c "fiemap -v" $MNT/bar
         xfs_io -c "fiemap -v" $MNT/baz
         echo
      
         btrfs filesystem du $MNT
      
         btrfs subvolume snapshot -r $MNT $MNT/snap
      
         btrfs send -f /tmp/send_stream $MNT/snap
      
         umount $MNT
         mkfs.btrfs -f $DEV &> /dev/null
         mount $DEV $MNT
      
         btrfs receive -f /tmp/send_stream $MNT
      
         echo
         echo "*** fiemap in the new filesystem ***"
         echo
         xfs_io -r -c "fiemap -v" $MNT/snap/foo
         xfs_io -r -c "fiemap -v" $MNT/snap/bar
         xfs_io -r -c "fiemap -v" $MNT/snap/baz
         echo
      
         btrfs filesystem du $MNT
      
         rm -f /tmp/send_stream
         rm -f /tmp/snap.fssum
      
         umount $MNT
      
      Before this change:
      
         $ ./test.sh
         (...)
      
         *** fiemap in the original filesystem ***
      
         /mnt/sdi/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    30720..32767      2048   0x1
         /mnt/sdi/bar:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
         /mnt/sdi/baz:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
      
              Total   Exclusive  Set shared  Filename
            2.00MiB     1.00MiB           -  /mnt/sdi/foo
            2.00MiB       0.00B           -  /mnt/sdi/bar
            2.00MiB       0.00B           -  /mnt/sdi/baz
            6.00MiB     1.00MiB     2.00MiB  /mnt/sdi
      
         Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap'
         At subvol /mnt/sdi/snap
         At subvol snap
      
         *** fiemap in the new filesystem ***
      
         /mnt/sdi/snap/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
         /mnt/sdi/snap/bar:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    30720..32767      2048   0x1
         /mnt/sdi/snap/baz:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    32768..34815      2048   0x1
      
              Total   Exclusive  Set shared  Filename
            2.00MiB       0.00B           -  /mnt/sdi/snap/foo
            2.00MiB     1.00MiB           -  /mnt/sdi/snap/bar
            2.00MiB     1.00MiB           -  /mnt/sdi/snap/baz
            6.00MiB     2.00MiB           -  /mnt/sdi/snap
            6.00MiB     2.00MiB     2.00MiB  /mnt/sdi
      
      We end up with two 1M extents that are not shared for files bar and baz.
      
      After this change:
      
         $ ./test.sh
         (...)
      
         *** fiemap in the original filesystem ***
      
         /mnt/sdi/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    30720..32767      2048   0x1
         /mnt/sdi/bar:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
         /mnt/sdi/baz:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
      
              Total   Exclusive  Set shared  Filename
            2.00MiB     1.00MiB           -  /mnt/sdi/foo
            2.00MiB       0.00B           -  /mnt/sdi/bar
            2.00MiB       0.00B           -  /mnt/sdi/baz
            6.00MiB     1.00MiB     2.00MiB  /mnt/sdi
         Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap'
         At subvol /mnt/sdi/snap
         At subvol snap
      
         *** fiemap in the new filesystem ***
      
         /mnt/sdi/snap/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
         /mnt/sdi/snap/bar:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    30720..32767      2048 0x2001
         /mnt/sdi/snap/baz:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    30720..32767      2048 0x2001
      
              Total   Exclusive  Set shared  Filename
            2.00MiB       0.00B           -  /mnt/sdi/snap/foo
            2.00MiB       0.00B           -  /mnt/sdi/snap/bar
            2.00MiB       0.00B           -  /mnt/sdi/snap/baz
            6.00MiB       0.00B           -  /mnt/sdi/snap
            6.00MiB       0.00B     3.00MiB  /mnt/sdi
      
      Now there's a much better sharing, files bar and baz share 1M of the
      extent of file foo and the second extent of files bar and baz is shared
      between themselves.
      
      This will later be turned into a test case for fstests.
      
      Link: https://lore.kernel.org/linux-btrfs/20221008005704.795b44b0@crass-HP-ZBook-15-G2/Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c7499a64
    • Filipe Manana's avatar
      btrfs: send: avoid unnecessary backref lookups when finding clone source · 22a3c0ac
      Filipe Manana authored
      At find_extent_clone(), unless we are given an inline extent, a file
      extent item that represents hole or an extent that starts beyond the
      i_size, we always do backref walking to look for clone sources, unless
      if we have more than SEND_MAX_EXTENT_REFS (64) known references on the
      extent.
      
      However if we know we only have one reference in the extent item and only
      one clone source (the send root), then it's pointless to do the backref
      walking to search for clone sources, as we can't clone from any other
      root. So skip the backref walking in that case.
      
      The following test was run on a non-debug kernel (Debian's default kernel
      config):
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/sdi
         MNT=/mnt/sdi
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         # Create an extent tree that's not too small and none of the
         # extents is shared.
         for ((i = 1; i <= 50000; i++)); do
            xfs_io -f -c "pwrite 0 4K" $MNT/file_$i > /dev/null
            echo -ne "\r$i files created..."
         done
         echo
      
         btrfs subvolume snapshot -r $MNT $MNT/snap
      
         start=$(date +%s%N)
         btrfs send $MNT/snap > /dev/null
         end=$(date +%s%N)
      
         dur=$(( (end - start) / 1000000 ))
         echo -e "\nsend took $dur milliseconds"
      
         umount $MNT
      
      Before this change:
      
         send took 5389 milliseconds
      
      After this change:
      
         send took 4519 milliseconds  (-16.1%)
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      22a3c0ac
    • Filipe Manana's avatar
      btrfs: send: drop unnecessary backref context field initializations · 344174a1
      Filipe Manana authored
      At find_extent_clone() we are initializing to zero the 'found_itself' and
      'found' fields of the backref context before we use it but we have already
      initialized the structure to zeroes when we declared it on stack, so it's
      pointless to initialize those fields and they are unnecessarily increasing
      the object text size with two "mov" instructions (x86_64).
      
      Similarly make the 'extent_len' initialization more clear by using an if-
      -then-else instead of a double assignment to it in case the extent's end
      crosses the i_size boundary.
      
      Before this change:
      
         $ size fs/btrfs/send.o
            text	   data	    bss	    dec	    hex	filename
           68694	   4252	     16	  72962	  11d02	fs/btrfs/send.o
      
      After this change:
      
         $ size fs/btrfs/send.o
            text	   data	    bss	    dec	    hex	filename
           68678	   4252	     16	  72946	  11cf2	fs/btrfs/send.o
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      344174a1
    • Filipe Manana's avatar
      btrfs: send: update comment at find_extent_clone() · d3f41317
      Filipe Manana authored
      We have this unclear comment at find_extent_clone() about extents starting
      at a file offset greater than or equals to the i_size of the inode. It's
      not really informative and it's misleading, since it mentions the author
      found such extents with snapshots and large files.
      
      Such extents are a result of fallocate with FALLOC_FL_KEEP_SIZE and there
      is no relation to snapshots or large files (all write paths update the
      i_size before inserting a new file extent item). So update the comment to
      be precise about it and why we don't bother looking for clone sources in
      that case.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d3f41317
    • Filipe Manana's avatar
      btrfs: send: avoid unnecessary path allocations when finding extent clone · 61ce908a
      Filipe Manana authored
      When looking for an extent clone, at find_extent_clone(), we start by
      allocating a path and then check for cases where we can't have clones
      and exit immediately in those cases. It's a waste of time to allocate
      the path before those cases, so reorder the logic so that we check for
      those cases before allocating the path.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      61ce908a
    • Qu Wenruo's avatar
      btrfs: remove the unused endio_raid56_workers and btrfs_raid_bio::end_io_work · 1a1a2851
      Qu Wenruo authored
      Since we have switched all raid56 workload to submit-and-wait method,
      there is no use for btrfs_fs_info::endio_raid56_workers workqueue and
      btrfs_raid_bio::end_io_work.
      
      Remove them to save some memory.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1a1a2851
    • Qu Wenruo's avatar
      btrfs: raid56: switch scrub path to use a single function · 6bfd0133
      Qu Wenruo authored
      This switch involves the following changes:
      
      - Make finish_parity_scrub() only to submit the write bios
        It will no longer call rbio_orig_end_io(), and now it will
        return error.
      
      - Add a new helper, recover_scrub_rbio(), to handle recovery
        It's just doing extra scrub related checks, and then call
        recover_sectors().
      
      - Rename raid56_parity_scrub_stripe() to scrub_rbio()
      - Rename scrub_parity_work() to scrub_rbio_work_locked()
        To follow the existing naming scheme.
      
      - Delete unused functions
        Including:
        * finish_rmw()
        * raid_write_end_io()
        * raid56_bio_end_io()
        * __raid_recover_end_io()
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6bfd0133
    • Qu Wenruo's avatar
      btrfs: raid56: extract scrub read bio list assembly code into a helper · cb3450b7
      Qu Wenruo authored
      Just like what we did for write/recovery, also extract the read bio
      assembly code into a helper for scrub.
      
      The difference between the three are:
      
      - rmw_assemble_read_bios() only submit reads for missing sectors
        Thus it will skip cached sectors, but will also read sectors which
        is not covered by any full stripe. (For cache usage)
      
      - recover_assemble_read_bios() reads every sector which has not failed
      
      - scrub_assemble_read_bios() has extra check for vertical stripes
        It's mostly the same as rmw_assemble_read_bios(), but will skip
        sectors which is not covered by a vertical stripe.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cb3450b7
    • Qu Wenruo's avatar
      btrfs: raid56: switch write path to rmw_rbio() · 93723095
      Qu Wenruo authored
      This includes the following changes:
      
      - Implement new raid_unplug() functions
        Now we don't need a workqueue to run the plug, as all our
        work is just queue rmw_rbio_work() call, which can be executed
        without sleep.
      
      - Implement a rmw_rbio_work_locked() helper
        This is for unlock_stripe(), which is already holding the full stripe
        lock.
      
      - Remove all the old functions
        This should already shows how complex the old functions are, as we
        ended up removing the following functions:
      
        * rmw_work()
        * validate_rbio_for_rmw()
        * raid56_rmw_end_io_work()
        * raid56_rmw_stripe()
        * full_stripe_write()
        * partial_stripe_write()
        * __raid56_parity_write()
        * run_plug()
        * unplug_work()
        * btrfs_raid_unplug()
        * rmw_work()
        * __raid56_parity_recover()
        * raid_recover_end_io_work()
      
      - Unexport rmw_rbio()
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      93723095
    • Qu Wenruo's avatar
      btrfs: raid56: introduce the main entrance for RMW path · 5eb30ee2
      Qu Wenruo authored
      The new entrance will be called rmw_rbio(), it will have a streamlined
      workflow by using submit-and-wait method.
      
      Thus there will be no weird jumps between tons of functions, thus way
      more reader friendly, and will make later expansion easier, as it's now
      a straight workflow, the timing is way more clear.
      
      Unfortunately we can not yet migrate the RMW path to use this new
      entrance as we still need extra work to address the plug and
      unlock_stripe() function.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5eb30ee2
    • Qu Wenruo's avatar
      btrfs: raid56: extract rwm write bios assembly into a helper · 6486d21c
      Qu Wenruo authored
      The helper will be later used to refactor the rmw write path.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6486d21c
    • Qu Wenruo's avatar
      btrfs: raid56: extract the rmw bio list build code into a helper · 509c27aa
      Qu Wenruo authored
      The helper will later be used to refactor the whole RMW path.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      509c27aa
    • Qu Wenruo's avatar
      btrfs: raid56: switch recovery path to a single function · d817ce35
      Qu Wenruo authored
      Currently btrfs uses end_io functions to jump between different stages
      of recovery.
      
      For example, we go the following different functions:
      
      - raid56_bio_end_io()
        This handles the read for all the sectors (except the missing device).
      
      - __raid_recover_end_io()
        This does the real work, it's called inside the delayed work function
        raid_recover_end_io_work().
      
      This one recovery path involves at least 3 different functions, which is
      a big burden for readers.
      
      This patch will change the behavior by:
      
      - Introduce a unified recovery entrance, recover_rbio()
      
      - Use submit-and-wait method
        So the workflow is not interrupted by the endio function jump.
        This doesn't bring performance change, but reduce the burden for
        reviewers.
      
      - Run the main function in the rmw_workers workqueue
        Now raid56_parity_recover() only needs to setup the work, and
        queue the work using start_async_work().
      
      Now readers only need to do one function jump (start_async_work()) to
      find out the main entrance of recovery path.
      
      Furthermore, recover_rbio() function can easily be reused by other paths.
      
      The old recovery path is still utilized by degraded write path.
      It will be cleaned up when we have migrated the write path.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d817ce35
    • Qu Wenruo's avatar
      btrfs: raid56: extract sector recovery code into a helper · ec936b03
      Qu Wenruo authored
      This includes extra changes:
      
      - The allocation for unmap_array[] and pointers[]
        Now we allocate them in one go, and free them together.
      
      - Remove @err
        Use errno_to_blk_status(ret) instead.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ec936b03
    • Qu Wenruo's avatar
      btrfs: raid56: extract the recovery bio list build code into a helper · d31968d9
      Qu Wenruo authored
      This new helper will be also utilized in the incoming refactor of
      recovery path.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d31968d9
    • Qu Wenruo's avatar
      btrfs: raid56: extract the pq generation code into a helper · 30e3c897
      Qu Wenruo authored
      Currently finish_rmw() will update the P/Q stripes before submitting
      the writes.
      
      It's done behind a for(;;) loop, it's a little congested indent-wise, so
      extract the code into a helper called generate_pq_vertical().
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      30e3c897
    • Qu Wenruo's avatar
      btrfs: raid56: extract the vertical stripe recovery code into recover_vertical() · 9c5ff9b4
      Qu Wenruo authored
      This refactor includes the following behavior change first:
      
      - Don't error out if only P/Q is corrupted
      
        The old code will directly error out if only P/Q is corrupted.
        Although it is an logical error if we go into rebuild path with
        only P/Q corrupted, there is no need to error out.
      
        Just skip the rebuild and return the already good data.
      
      Then comes the following refactor which shouldn't cause behavior
      changes:
      
      - Introduce a helper to do vertical stripe recovery
      
        This not only reduce one indent level, but also paves the road for
        later data checksum verification in RMW cycles.
      
      - Sort rbio->faila/b before recovery
      
        So we don't need to do the same swap every vertical stripe
      
      - Replace a BUG_ON() with ASSERT()
      
        Or checkpatch won't let me pass.
      
      - Mark recovered sectors uptodate after the recover loop
      
      - Do the cleanup for pointers unconditionally
      
        We only need to initialize @pointers and @unmap_array to NULL, so
        we can safely free them unconditionally.
      
      - Mark the repaired sector uptodate in recover_vertical()
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9c5ff9b4
    • David Sterba's avatar
      btrfs: merge struct extent_page_data to btrfs_bio_ctrl · ee5f017d
      David Sterba authored
      The two structures appear on the same call paths, btrfs_bio_ctrl is
      embedded in extent_page_data and we pass bio_ctrl to some functions.
      After merging there are fewer indirections and we have only one control
      structure. The packing remains same.
      
      The btrfs_bio_ctrl was selected as the target structure as the operation
      is closer to bio processing.
      
      Structure layout:
      
      struct btrfs_bio_ctrl {
              struct bio *               bio;                  /*     0     8 */
              int                        mirror_num;           /*     8     4 */
              enum btrfs_compression_type compress_type;       /*    12     4 */
              u32                        len_to_stripe_boundary; /*    16     4 */
              u32                        len_to_oe_boundary;   /*    20     4 */
              btrfs_bio_end_io_t         end_io_func;          /*    24     8 */
              bool                       extent_locked;        /*    32     1 */
              bool                       sync_io;              /*    33     1 */
      
              /* size: 40, cachelines: 1, members: 8 */
              /* padding: 6 */
              /* last cacheline: 40 bytes */
      };
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ee5f017d
    • David Sterba's avatar
      btrfs: switch extent_page_data bit fields to bools · 8ec8519b
      David Sterba authored
      The semantics of the two members is a boolean, so change the type
      accordingly.  We have space in extent_page_data due to alignment there's
      no change in size.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8ec8519b
    • David Sterba's avatar
      btrfs: simplify percent calculation helpers, rename div_factor · 428c8e03
      David Sterba authored
      The div_factor* helpers calculate fraction or percentage fraction. The
      name is a bit confusing, we use it only for percentage calculations and
      there are two helpers.
      
      There's a helper mult_frac that's for general fractions, that tries to
      be accurate but we multiply and divide by small numbers so we can use
      the div_u64 helper.
      
      Rename the div_factor* helpers and use 1..100 percentage range, also drop
      the case checking for percentage == 100, it's never hit.
      
      The conversions:
      
      * div_factor calculates tenths and the numbers need to be adjusted
      * div_factor_fine is direct replacement
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      428c8e03
    • Filipe Manana's avatar
      btrfs: update stale comment for nowait direct IO writes · 20af93d9
      Filipe Manana authored
      If when doing a direct IO write we need to fallback to buffered IO, we
      this comment at btrfs_direct_write() that says we can't directly fallback
      to buffered IO if we have a NOWAIT iocb, because we have no support for
      NOWAIT buffered writes. That is not true anymore, as support for NOWAIT
      buffered writes was added recently in commit 926078b2 ("btrfs: enable
      nowait async buffered writes").
      
      However we still can't fallback to a buffered write in case we have a
      NOWAIT iocb, because we'll need to flush delalloc and wait for it to
      complete after doing the buffered write, and that can block for several
      reasons, the main reason being waiting for IO to complete.
      
      So update the comment to mention all that.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      20af93d9