1. 05 Dec, 2022 40 commits
    • Filipe Manana's avatar
      btrfs: send: skip unnecessary backref iterations · 88ffb665
      Filipe Manana authored
      When looking for a clone source for an extent, we are iterating over all
      the backreferences for an extent. This is often a waste of time, because
      once we find a good clone source we could stop immediately instead of
      continuing backref walking, which is expensive.
      
      Basically what happens currently is this:
      
      1) Call iterate_extent_inodes() to iterate over all the backreferences;
      
      2) It calls btrfs_find_all_leafs() which in turn calls the main function
         to walk over backrefs and collect them - find_parent_nodes();
      
      3) Then we collect all the references for our target data extent from the
         extent tree (and delayed refs if any), add them to the rb trees,
         resolve all the indirect backreferences and search for all the file
         extent items in fs trees, building a list of inodes for each one of
         them (struct extent_inode_elem);
      
      4) Then back at iterate_extent_inodes() we find all the roots associated
         to each found leaf, and call the callback __iterate_backrefs defined
         at send.c for each inode in the inode list associated to each leaf.
      
      Some times one the first backreferences we find in a fs tree is optimal
      to satisfy the clone operation that send wants to perform, and in that
      case we could stop immediately and avoid resolving all the remaining
      indirect backreferences (search fs trees for the respective file extent
      items, etc). This possibly if when we find a fs tree leaf with a file
      extent item we are able to know what are all the roots that can lead to
      the leaf - this is now possible after the previous patch in the series
      that adds a cache that maps leaves to a list of roots. So we can now
      shortcircuit backref walking during send, by having the callback we
      pass to iterate_extent_inodes() to be called when we find a file extent
      item for an indirect backreference, and have it return a special value
      when it found a suitable backreference and it does not need to look for
      more backreferences. This change does that.
      
      This change is part of a patchset comprised of the following patches:
      
        01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
        02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
        03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
        04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
        05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
        06/17 btrfs: send: update comment at find_extent_clone()
        07/17 btrfs: send: drop unnecessary backref context field initializations
        08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
        09/17 btrfs: send: optimize clone detection to increase extent sharing
        10/17 btrfs: use a single argument for extent offset in backref walking functions
        11/17 btrfs: use a structure to pass arguments to backref walking functions
        12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
        13/17 btrfs: constify ulist parameter of ulist_next()
        14/17 btrfs: send: cache leaf to roots mapping during backref walking
        15/17 btrfs: send: skip unnecessary backref iterations
        16/17 btrfs: send: avoid double extent tree search when finding clone source
        17/17 btrfs: send: skip resolution of our own backref when finding clone source
      
      Performance test results are in the changelog of patch 17/17.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      88ffb665
    • Filipe Manana's avatar
      btrfs: send: cache leaf to roots mapping during backref walking · 66d04209
      Filipe Manana authored
      During a send operation, when doing backref walking to determine which
      inodes/offsets/roots we can clone from, the most repetitive and expensive
      step is to map each leaf that has file extent items pointing to the target
      data extent to the IDs of the roots from which the leaves are accessible,
      which happens at iterate_extent_inodes(). That step requires finding every
      parent node of a leaf, then the parent of each parent, and so on until we
      reach a root node. So it's a naturally expensive operation, and repetitive
      because each leaf can have hundreds of file extent items (for a nodesize
      of 16K, that can be slightly over 200 file extent items). There's also
      temporal locality, as we process all file extent items from a leave before
      moving the next leaf.
      
      This change caches the mapping of leaves to root IDs, to avoid repeating
      those computations over and over again. The cache is limited to a maximum
      of 128 entries, with each entry being a struct with a size of 128 bytes,
      so the maximum cache size is 16K plus any nodes internally allocated by
      the maple tree that is used to index pointers to those structs. The cache
      is invalidated whenever we detect relocation happened since we started
      filling the cache, because if relocation happened then extent buffers for
      leaves and nodes of the trees used by a send operation may have been
      reallocated.
      
      This cache also allows for another important optimization that is
      introduced in the next patch in the series.
      
      This change is part of a patchset comprised of the following patches:
      
        01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
        02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
        03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
        04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
        05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
        06/17 btrfs: send: update comment at find_extent_clone()
        07/17 btrfs: send: drop unnecessary backref context field initializations
        08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
        09/17 btrfs: send: optimize clone detection to increase extent sharing
        10/17 btrfs: use a single argument for extent offset in backref walking functions
        11/17 btrfs: use a structure to pass arguments to backref walking functions
        12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
        13/17 btrfs: constify ulist parameter of ulist_next()
        14/17 btrfs: send: cache leaf to roots mapping during backref walking
        15/17 btrfs: send: skip unnecessary backref iterations
        16/17 btrfs: send: avoid double extent tree search when finding clone source
        17/17 btrfs: send: skip resolution of our own backref when finding clone source
      
      Performance test results are in the changelog of patch 17/17.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      66d04209
    • Filipe Manana's avatar
      btrfs: constify ulist parameter of ulist_next() · fa104a87
      Filipe Manana authored
      The ulist_next() iterator function does not need to change the given ulist
      so make it const. This will allow the next patch in the series to pass a
      ulist to a function that does not need, and should not, modify the ulist.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fa104a87
    • Filipe Manana's avatar
      btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes() · 1baea6f1
      Filipe Manana authored
      At iterate_extent_inodes() we collect a ulist of leaves for a given extent
      with a call to btrfs_find_all_leafs() and then we enter a loop where we
      iterate over all the collected leaves. Each iteration of that loop does a
      call to btrfs_find_all_roots_safe(), to determine all roots from which a
      leaf is accessible, and that results in allocating and releasing a ulist
      to store the root IDs.
      
      Instead of allocating and releasing the roots ulist on every iteration,
      allocate a ulist before entering the loop and keep using it on each
      iteration, reinitializing the ulist at the end of each iteration.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1baea6f1
    • Filipe Manana's avatar
      btrfs: use a structure to pass arguments to backref walking functions · a2c8d27e
      Filipe Manana authored
      The public backref walking functions have quite a lot of arguments that
      are passed down the call stack to find_parent_nodes(), the core function
      of the backref walking code.
      
      The next patches in series will need to add even arguments to these
      functions that should be passed not only to find_parent_nodes(), but also
      to other functions used by the later (directly or even lower in the call
      stack).
      
      So create a structure to hold all these arguments and state used by the
      main backref walking function, find_parent_nodes(), and use it as the
      argument for the public backref walking functions iterate_extent_inodes(),
      btrfs_find_all_leafs() and btrfs_find_all_roots().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a2c8d27e
    • Filipe Manana's avatar
      btrfs: use a single argument for extent offset in backref walking functions · 6ce6ba53
      Filipe Manana authored
      The interface for find_parent_nodes() has two extent offset related
      arguments:
      
      1) One u64 pointer argument for the extent offset;
      
      2) One boolean argument to tell if the extent offset should be ignored or
         not.
      
      These are confusing, becase the extent offset pointer can be NULL and in
      some cases callers pass a NULL value as a way to tell the backref walking
      code to ignore offsets in file extent items (and simply consider all file
      extent items that point to the target data extent).
      
      The boolean argument was added in commit c995ab3c ("btrfs: add a flag
      to iterate_inodes_from_logical to find all extent refs for uncompressed
      extents"), but it was never really necessary, it was enough if it could
      find a way to get a NULL value passed to the "extent_item_pos" argument of
      find_parent_nodes(). The arguments are also passed to functions called
      by find_parent_nodes() and respective helper functions, which further
      makes everything more complicated than needed.
      
      Then we have several backref walking related functions that end up calling
      find_parent_nodes(), either directly or through some other function that
      they call, and for many we have to use an "extent_item_pos" (u64) argument
      and a boolean "ignore_offset" argument too.
      
      This is confusing and not really necessary. So use a single argument to
      specify the extent offset, as a simple u64 and not as a pointer, but
      using a special value of (u64)-1, defined as a documented constant, to
      indicate when the extent offset should be ignored.
      
      This is also preparation work for the upcoming patches in the series that
      add other arguments to find_parent_nodes() and other related functions
      that use it.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6ce6ba53
    • Filipe Manana's avatar
      btrfs: send: optimize clone detection to increase extent sharing · c7499a64
      Filipe Manana authored
      Currently send does not do the best decisions when it comes to decide
      between multiple clone sources, which results in clone operations for
      partial extent ranges, which has the following disadvantages:
      
      1) We get less shared extents at the destination;
      
      2) We have to read more data during the send operation and emit more
         write commands.
      
      Besides not being optimal behaviour, it also breaks user expectations and
      is often reported by users, with a recent example in the Link tag at the
      bottom of this change log.
      
      Part of the reason for this non-optimal behaviour is that the backref
      walking code does not provide information about the length of the file
      extent items that were found for each backref, so send is blind about
      which backref is the best to chose as a cloning source.
      
      The other existing reasons are just silliness, namely always prefering
      the inode with the lowest number when multiple are found for the same
      root and when we can clone from multiple roots, always prefer the send
      root over any of the other clone roots. This does not make any sense
      since any inode or root is fine and as good as any other inode/root.
      
      Fix this by making backref walking pass information about the number of
      bytes referenced by each file extent item and then have send's backref
      callback pick the inode with the highest number of bytes for each root.
      Finally select the root from which we can clone more bytes from.
      
      Example reproducer:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/sdi
         MNT=/mnt/sdi
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         xfs_io -f -c "pwrite -S 0xab -b 2M 0 2M" $MNT/foo
         cp --reflink=always $MNT/foo $MNT/bar
         cp --reflink=always $MNT/foo $MNT/baz
         sync
      
         # Overwrite the second half of file foo.
         xfs_io -c "pwrite -S 0xcd -b 1M 1M 1M" $MNT/foo
         sync
      
         echo
         echo "*** fiemap in the original filesystem ***"
         echo
         xfs_io -c "fiemap -v" $MNT/foo
         xfs_io -c "fiemap -v" $MNT/bar
         xfs_io -c "fiemap -v" $MNT/baz
         echo
      
         btrfs filesystem du $MNT
      
         btrfs subvolume snapshot -r $MNT $MNT/snap
      
         btrfs send -f /tmp/send_stream $MNT/snap
      
         umount $MNT
         mkfs.btrfs -f $DEV &> /dev/null
         mount $DEV $MNT
      
         btrfs receive -f /tmp/send_stream $MNT
      
         echo
         echo "*** fiemap in the new filesystem ***"
         echo
         xfs_io -r -c "fiemap -v" $MNT/snap/foo
         xfs_io -r -c "fiemap -v" $MNT/snap/bar
         xfs_io -r -c "fiemap -v" $MNT/snap/baz
         echo
      
         btrfs filesystem du $MNT
      
         rm -f /tmp/send_stream
         rm -f /tmp/snap.fssum
      
         umount $MNT
      
      Before this change:
      
         $ ./test.sh
         (...)
      
         *** fiemap in the original filesystem ***
      
         /mnt/sdi/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    30720..32767      2048   0x1
         /mnt/sdi/bar:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
         /mnt/sdi/baz:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
      
              Total   Exclusive  Set shared  Filename
            2.00MiB     1.00MiB           -  /mnt/sdi/foo
            2.00MiB       0.00B           -  /mnt/sdi/bar
            2.00MiB       0.00B           -  /mnt/sdi/baz
            6.00MiB     1.00MiB     2.00MiB  /mnt/sdi
      
         Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap'
         At subvol /mnt/sdi/snap
         At subvol snap
      
         *** fiemap in the new filesystem ***
      
         /mnt/sdi/snap/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
         /mnt/sdi/snap/bar:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    30720..32767      2048   0x1
         /mnt/sdi/snap/baz:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    32768..34815      2048   0x1
      
              Total   Exclusive  Set shared  Filename
            2.00MiB       0.00B           -  /mnt/sdi/snap/foo
            2.00MiB     1.00MiB           -  /mnt/sdi/snap/bar
            2.00MiB     1.00MiB           -  /mnt/sdi/snap/baz
            6.00MiB     2.00MiB           -  /mnt/sdi/snap
            6.00MiB     2.00MiB     2.00MiB  /mnt/sdi
      
      We end up with two 1M extents that are not shared for files bar and baz.
      
      After this change:
      
         $ ./test.sh
         (...)
      
         *** fiemap in the original filesystem ***
      
         /mnt/sdi/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    30720..32767      2048   0x1
         /mnt/sdi/bar:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
         /mnt/sdi/baz:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
      
              Total   Exclusive  Set shared  Filename
            2.00MiB     1.00MiB           -  /mnt/sdi/foo
            2.00MiB       0.00B           -  /mnt/sdi/bar
            2.00MiB       0.00B           -  /mnt/sdi/baz
            6.00MiB     1.00MiB     2.00MiB  /mnt/sdi
         Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap'
         At subvol /mnt/sdi/snap
         At subvol snap
      
         *** fiemap in the new filesystem ***
      
         /mnt/sdi/snap/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
         /mnt/sdi/snap/bar:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    30720..32767      2048 0x2001
         /mnt/sdi/snap/baz:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    30720..32767      2048 0x2001
      
              Total   Exclusive  Set shared  Filename
            2.00MiB       0.00B           -  /mnt/sdi/snap/foo
            2.00MiB       0.00B           -  /mnt/sdi/snap/bar
            2.00MiB       0.00B           -  /mnt/sdi/snap/baz
            6.00MiB       0.00B           -  /mnt/sdi/snap
            6.00MiB       0.00B     3.00MiB  /mnt/sdi
      
      Now there's a much better sharing, files bar and baz share 1M of the
      extent of file foo and the second extent of files bar and baz is shared
      between themselves.
      
      This will later be turned into a test case for fstests.
      
      Link: https://lore.kernel.org/linux-btrfs/20221008005704.795b44b0@crass-HP-ZBook-15-G2/Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c7499a64
    • Filipe Manana's avatar
      btrfs: send: avoid unnecessary backref lookups when finding clone source · 22a3c0ac
      Filipe Manana authored
      At find_extent_clone(), unless we are given an inline extent, a file
      extent item that represents hole or an extent that starts beyond the
      i_size, we always do backref walking to look for clone sources, unless
      if we have more than SEND_MAX_EXTENT_REFS (64) known references on the
      extent.
      
      However if we know we only have one reference in the extent item and only
      one clone source (the send root), then it's pointless to do the backref
      walking to search for clone sources, as we can't clone from any other
      root. So skip the backref walking in that case.
      
      The following test was run on a non-debug kernel (Debian's default kernel
      config):
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/sdi
         MNT=/mnt/sdi
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         # Create an extent tree that's not too small and none of the
         # extents is shared.
         for ((i = 1; i <= 50000; i++)); do
            xfs_io -f -c "pwrite 0 4K" $MNT/file_$i > /dev/null
            echo -ne "\r$i files created..."
         done
         echo
      
         btrfs subvolume snapshot -r $MNT $MNT/snap
      
         start=$(date +%s%N)
         btrfs send $MNT/snap > /dev/null
         end=$(date +%s%N)
      
         dur=$(( (end - start) / 1000000 ))
         echo -e "\nsend took $dur milliseconds"
      
         umount $MNT
      
      Before this change:
      
         send took 5389 milliseconds
      
      After this change:
      
         send took 4519 milliseconds  (-16.1%)
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      22a3c0ac
    • Filipe Manana's avatar
      btrfs: send: drop unnecessary backref context field initializations · 344174a1
      Filipe Manana authored
      At find_extent_clone() we are initializing to zero the 'found_itself' and
      'found' fields of the backref context before we use it but we have already
      initialized the structure to zeroes when we declared it on stack, so it's
      pointless to initialize those fields and they are unnecessarily increasing
      the object text size with two "mov" instructions (x86_64).
      
      Similarly make the 'extent_len' initialization more clear by using an if-
      -then-else instead of a double assignment to it in case the extent's end
      crosses the i_size boundary.
      
      Before this change:
      
         $ size fs/btrfs/send.o
            text	   data	    bss	    dec	    hex	filename
           68694	   4252	     16	  72962	  11d02	fs/btrfs/send.o
      
      After this change:
      
         $ size fs/btrfs/send.o
            text	   data	    bss	    dec	    hex	filename
           68678	   4252	     16	  72946	  11cf2	fs/btrfs/send.o
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      344174a1
    • Filipe Manana's avatar
      btrfs: send: update comment at find_extent_clone() · d3f41317
      Filipe Manana authored
      We have this unclear comment at find_extent_clone() about extents starting
      at a file offset greater than or equals to the i_size of the inode. It's
      not really informative and it's misleading, since it mentions the author
      found such extents with snapshots and large files.
      
      Such extents are a result of fallocate with FALLOC_FL_KEEP_SIZE and there
      is no relation to snapshots or large files (all write paths update the
      i_size before inserting a new file extent item). So update the comment to
      be precise about it and why we don't bother looking for clone sources in
      that case.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d3f41317
    • Filipe Manana's avatar
      btrfs: send: avoid unnecessary path allocations when finding extent clone · 61ce908a
      Filipe Manana authored
      When looking for an extent clone, at find_extent_clone(), we start by
      allocating a path and then check for cases where we can't have clones
      and exit immediately in those cases. It's a waste of time to allocate
      the path before those cases, so reorder the logic so that we check for
      those cases before allocating the path.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      61ce908a
    • Qu Wenruo's avatar
      btrfs: remove the unused endio_raid56_workers and btrfs_raid_bio::end_io_work · 1a1a2851
      Qu Wenruo authored
      Since we have switched all raid56 workload to submit-and-wait method,
      there is no use for btrfs_fs_info::endio_raid56_workers workqueue and
      btrfs_raid_bio::end_io_work.
      
      Remove them to save some memory.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1a1a2851
    • Qu Wenruo's avatar
      btrfs: raid56: switch scrub path to use a single function · 6bfd0133
      Qu Wenruo authored
      This switch involves the following changes:
      
      - Make finish_parity_scrub() only to submit the write bios
        It will no longer call rbio_orig_end_io(), and now it will
        return error.
      
      - Add a new helper, recover_scrub_rbio(), to handle recovery
        It's just doing extra scrub related checks, and then call
        recover_sectors().
      
      - Rename raid56_parity_scrub_stripe() to scrub_rbio()
      - Rename scrub_parity_work() to scrub_rbio_work_locked()
        To follow the existing naming scheme.
      
      - Delete unused functions
        Including:
        * finish_rmw()
        * raid_write_end_io()
        * raid56_bio_end_io()
        * __raid_recover_end_io()
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6bfd0133
    • Qu Wenruo's avatar
      btrfs: raid56: extract scrub read bio list assembly code into a helper · cb3450b7
      Qu Wenruo authored
      Just like what we did for write/recovery, also extract the read bio
      assembly code into a helper for scrub.
      
      The difference between the three are:
      
      - rmw_assemble_read_bios() only submit reads for missing sectors
        Thus it will skip cached sectors, but will also read sectors which
        is not covered by any full stripe. (For cache usage)
      
      - recover_assemble_read_bios() reads every sector which has not failed
      
      - scrub_assemble_read_bios() has extra check for vertical stripes
        It's mostly the same as rmw_assemble_read_bios(), but will skip
        sectors which is not covered by a vertical stripe.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cb3450b7
    • Qu Wenruo's avatar
      btrfs: raid56: switch write path to rmw_rbio() · 93723095
      Qu Wenruo authored
      This includes the following changes:
      
      - Implement new raid_unplug() functions
        Now we don't need a workqueue to run the plug, as all our
        work is just queue rmw_rbio_work() call, which can be executed
        without sleep.
      
      - Implement a rmw_rbio_work_locked() helper
        This is for unlock_stripe(), which is already holding the full stripe
        lock.
      
      - Remove all the old functions
        This should already shows how complex the old functions are, as we
        ended up removing the following functions:
      
        * rmw_work()
        * validate_rbio_for_rmw()
        * raid56_rmw_end_io_work()
        * raid56_rmw_stripe()
        * full_stripe_write()
        * partial_stripe_write()
        * __raid56_parity_write()
        * run_plug()
        * unplug_work()
        * btrfs_raid_unplug()
        * rmw_work()
        * __raid56_parity_recover()
        * raid_recover_end_io_work()
      
      - Unexport rmw_rbio()
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      93723095
    • Qu Wenruo's avatar
      btrfs: raid56: introduce the main entrance for RMW path · 5eb30ee2
      Qu Wenruo authored
      The new entrance will be called rmw_rbio(), it will have a streamlined
      workflow by using submit-and-wait method.
      
      Thus there will be no weird jumps between tons of functions, thus way
      more reader friendly, and will make later expansion easier, as it's now
      a straight workflow, the timing is way more clear.
      
      Unfortunately we can not yet migrate the RMW path to use this new
      entrance as we still need extra work to address the plug and
      unlock_stripe() function.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5eb30ee2
    • Qu Wenruo's avatar
      btrfs: raid56: extract rwm write bios assembly into a helper · 6486d21c
      Qu Wenruo authored
      The helper will be later used to refactor the rmw write path.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6486d21c
    • Qu Wenruo's avatar
      btrfs: raid56: extract the rmw bio list build code into a helper · 509c27aa
      Qu Wenruo authored
      The helper will later be used to refactor the whole RMW path.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      509c27aa
    • Qu Wenruo's avatar
      btrfs: raid56: switch recovery path to a single function · d817ce35
      Qu Wenruo authored
      Currently btrfs uses end_io functions to jump between different stages
      of recovery.
      
      For example, we go the following different functions:
      
      - raid56_bio_end_io()
        This handles the read for all the sectors (except the missing device).
      
      - __raid_recover_end_io()
        This does the real work, it's called inside the delayed work function
        raid_recover_end_io_work().
      
      This one recovery path involves at least 3 different functions, which is
      a big burden for readers.
      
      This patch will change the behavior by:
      
      - Introduce a unified recovery entrance, recover_rbio()
      
      - Use submit-and-wait method
        So the workflow is not interrupted by the endio function jump.
        This doesn't bring performance change, but reduce the burden for
        reviewers.
      
      - Run the main function in the rmw_workers workqueue
        Now raid56_parity_recover() only needs to setup the work, and
        queue the work using start_async_work().
      
      Now readers only need to do one function jump (start_async_work()) to
      find out the main entrance of recovery path.
      
      Furthermore, recover_rbio() function can easily be reused by other paths.
      
      The old recovery path is still utilized by degraded write path.
      It will be cleaned up when we have migrated the write path.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d817ce35
    • Qu Wenruo's avatar
      btrfs: raid56: extract sector recovery code into a helper · ec936b03
      Qu Wenruo authored
      This includes extra changes:
      
      - The allocation for unmap_array[] and pointers[]
        Now we allocate them in one go, and free them together.
      
      - Remove @err
        Use errno_to_blk_status(ret) instead.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ec936b03
    • Qu Wenruo's avatar
      btrfs: raid56: extract the recovery bio list build code into a helper · d31968d9
      Qu Wenruo authored
      This new helper will be also utilized in the incoming refactor of
      recovery path.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d31968d9
    • Qu Wenruo's avatar
      btrfs: raid56: extract the pq generation code into a helper · 30e3c897
      Qu Wenruo authored
      Currently finish_rmw() will update the P/Q stripes before submitting
      the writes.
      
      It's done behind a for(;;) loop, it's a little congested indent-wise, so
      extract the code into a helper called generate_pq_vertical().
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      30e3c897
    • Qu Wenruo's avatar
      btrfs: raid56: extract the vertical stripe recovery code into recover_vertical() · 9c5ff9b4
      Qu Wenruo authored
      This refactor includes the following behavior change first:
      
      - Don't error out if only P/Q is corrupted
      
        The old code will directly error out if only P/Q is corrupted.
        Although it is an logical error if we go into rebuild path with
        only P/Q corrupted, there is no need to error out.
      
        Just skip the rebuild and return the already good data.
      
      Then comes the following refactor which shouldn't cause behavior
      changes:
      
      - Introduce a helper to do vertical stripe recovery
      
        This not only reduce one indent level, but also paves the road for
        later data checksum verification in RMW cycles.
      
      - Sort rbio->faila/b before recovery
      
        So we don't need to do the same swap every vertical stripe
      
      - Replace a BUG_ON() with ASSERT()
      
        Or checkpatch won't let me pass.
      
      - Mark recovered sectors uptodate after the recover loop
      
      - Do the cleanup for pointers unconditionally
      
        We only need to initialize @pointers and @unmap_array to NULL, so
        we can safely free them unconditionally.
      
      - Mark the repaired sector uptodate in recover_vertical()
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9c5ff9b4
    • David Sterba's avatar
      btrfs: merge struct extent_page_data to btrfs_bio_ctrl · ee5f017d
      David Sterba authored
      The two structures appear on the same call paths, btrfs_bio_ctrl is
      embedded in extent_page_data and we pass bio_ctrl to some functions.
      After merging there are fewer indirections and we have only one control
      structure. The packing remains same.
      
      The btrfs_bio_ctrl was selected as the target structure as the operation
      is closer to bio processing.
      
      Structure layout:
      
      struct btrfs_bio_ctrl {
              struct bio *               bio;                  /*     0     8 */
              int                        mirror_num;           /*     8     4 */
              enum btrfs_compression_type compress_type;       /*    12     4 */
              u32                        len_to_stripe_boundary; /*    16     4 */
              u32                        len_to_oe_boundary;   /*    20     4 */
              btrfs_bio_end_io_t         end_io_func;          /*    24     8 */
              bool                       extent_locked;        /*    32     1 */
              bool                       sync_io;              /*    33     1 */
      
              /* size: 40, cachelines: 1, members: 8 */
              /* padding: 6 */
              /* last cacheline: 40 bytes */
      };
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ee5f017d
    • David Sterba's avatar
      btrfs: switch extent_page_data bit fields to bools · 8ec8519b
      David Sterba authored
      The semantics of the two members is a boolean, so change the type
      accordingly.  We have space in extent_page_data due to alignment there's
      no change in size.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8ec8519b
    • David Sterba's avatar
      btrfs: simplify percent calculation helpers, rename div_factor · 428c8e03
      David Sterba authored
      The div_factor* helpers calculate fraction or percentage fraction. The
      name is a bit confusing, we use it only for percentage calculations and
      there are two helpers.
      
      There's a helper mult_frac that's for general fractions, that tries to
      be accurate but we multiply and divide by small numbers so we can use
      the div_u64 helper.
      
      Rename the div_factor* helpers and use 1..100 percentage range, also drop
      the case checking for percentage == 100, it's never hit.
      
      The conversions:
      
      * div_factor calculates tenths and the numbers need to be adjusted
      * div_factor_fine is direct replacement
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      428c8e03
    • Filipe Manana's avatar
      btrfs: update stale comment for nowait direct IO writes · 20af93d9
      Filipe Manana authored
      If when doing a direct IO write we need to fallback to buffered IO, we
      this comment at btrfs_direct_write() that says we can't directly fallback
      to buffered IO if we have a NOWAIT iocb, because we have no support for
      NOWAIT buffered writes. That is not true anymore, as support for NOWAIT
      buffered writes was added recently in commit 926078b2 ("btrfs: enable
      nowait async buffered writes").
      
      However we still can't fallback to a buffered write in case we have a
      NOWAIT iocb, because we'll need to flush delalloc and wait for it to
      complete after doing the buffered write, and that can block for several
      reasons, the main reason being waiting for IO to complete.
      
      So update the comment to mention all that.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      20af93d9
    • David Sterba's avatar
      btrfs: fix SPDX comment in tree-mod-log.h · c30ff698
      David Sterba authored
      The header files should use the /* */ comment style, introduced in
      commit f3a84ccd ("btrfs: move the tree mod log code into its own
      file").
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c30ff698
    • Qu Wenruo's avatar
      btrfs: extract the inline extent read code into its own function · a982fc82
      Qu Wenruo authored
      Currently we have inline extent read code behind two levels of
      indentation, factor them them out into a new function,
      read_inline_extent(), to make it a little easier to read.
      
      Since we're here, also remove @extent_offset and @pg_offset arguments
      from uncompress_inline() function, as it's not possible to have inline
      extents at non-inline file offset.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a982fc82
    • Qu Wenruo's avatar
      btrfs: remove new_inline argument from btrfs_extent_item_to_extent_map() · 280f15cb
      Qu Wenruo authored
      The argument @new_inline changes the following members of extent_map:
      
      - em->compress_type
      - EXTENT_FLAG_COMPRESSED of em->flags
      
      However neither members makes a difference for inline extents:
      
      - Inline extent read never use above em members
      
        As inside btrfs_get_extent() we directly use the file extent item to
        do the read.
      
      - Inline extents are never to be split
      
        Thus code really needs em->compress_type or that flag will never be
        executed on inlined extents.
        (btrfs_drop_extent_cache() would be one example)
      
      - Fiemap no longer relies on extent maps
      
        Recent fiemap optimization makes fiemap to search subvolume tree
        directly, without using any extent map at all.
      
        Thus those members make no difference for inline extents any more.
      
      Furthermore such exception without much explanation is really a source
      of confusion.
      
      Thus this patch will completely remove the argument, and always set the
      involved members, unifying the behavior.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      280f15cb
    • Qu Wenruo's avatar
      btrfs: do not reset extent map members for inline extents read · a196a894
      Qu Wenruo authored
      Currently for inline extents read inside btrfs_get_extent(), we will
      reset several extent map members:
      
      - em->start
      
        Reset to extent_start, which is completely unnecessary.
        The extent_start and em->start should have already be zero, ensured by
        tree-checker already.
      
      - em->len
      
        Reset the round_up(copy_size, fs_info->sectorsize), which is again
        unnecessary.
      
      - em->orig_block_len
      
        Reset to em->len (sectorsize), while it is originally unset from
        btrfs_extent_item_to_extent_map().
      
        This makes no difference, as all extent map handling paths will
        ignore the orig_block_len if they found it's an inlined extent.
      
        Such inline extent orig_block_len ignoring examples can be found in
        btrfs_drop_extent_cache().
      
      - em->orig_start
      
        Reset to em->start (0), while it is originally set to EXTENT_MAP_HOLE.
      
        This makes no difference either, as all extent map handling paths will
        ignore the em->orig_start if they found it's an inline extent.
      
      Thus all these em members resetting are unnecessary.
      
      Replace them with ASSERT()s checking the only two members (block_start
      and length) that make sense.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a196a894
    • Qu Wenruo's avatar
      btrfs: make inline extent read calculation much simpler · affc5424
      Qu Wenruo authored
      Currently we calculate inline extent read in a way that inline extent
      can start at non-zero offset.
      
      This is consistent with the inode selftests, which puts an inline extent
      at file offset 5.
      
      Meanwhile the inline extent creation code will only create inline extent
      at file offset 0.
      
      Furthermore with the introduction of tree-checker on file extents, we are
      actively rejecting inline extent which starts at non-zero file offset.
      And so far we haven't yet seen any report of rejected inline extents at
      non-zero file offset.
      
      This all means, the extra calculation to support inline extents at
      non-zero file offset is mostly paper weight, and damaging the
      readability of the code.
      
      Thus this patch will:
      
      - Add extra ASSERT()s to make sure involved file offset are all 0
      
      - Remove @extent_offset calculation
      
      - Simplify the involved code
        As several variables are now single-use, no need to declare them as
        a variable anymore.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      affc5424
    • Qu Wenruo's avatar
      btrfs: selftests: remove impossible inline extent at non-zero file offset · d52a1365
      Qu Wenruo authored
      In our inode-tests.c, we create an inline offset at file offset 5, which
      is no longer possible since the introduction of tree-checker.
      
      Thus I don't think we should spend time maintaining some corner cases
      which are already ruled out by tree-checker.
      
      So this patch will:
      
      - Change the inline extent to start at file offset 0
      
        Also change its length to 6 to cover the original length
      
      - Add an extra ASSERT() for btrfs_add_extent_mapping()
      
        This is to make sure tree-checker is working correctly.
      
      - Update the inode selftest
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d52a1365
    • Josef Bacik's avatar
      btrfs: move orphan prototypes into orphan.h · aa5d3003
      Josef Bacik authored
      Move these out of ctree.h into orphan.h to cut down on code in ctree.h.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      aa5d3003
    • Josef Bacik's avatar
      btrfs: move super_block specific helpers into super.h · 7f0add25
      Josef Bacik authored
      This will make syncing fs.h to user space a little easier if we can pull
      the super block specific helpers out of fs.h and put them in super.h.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7f0add25
    • Josef Bacik's avatar
      btrfs: move super prototypes into super.h · c03b2207
      Josef Bacik authored
      Move these out of ctree.h into super.h to cut down on code in ctree.h.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c03b2207
    • Josef Bacik's avatar
      btrfs: move CONFIG_BTRFS_FS_RUN_SANITY_TESTS checks to fs.h · 6a6b4daf
      Josef Bacik authored
      We already have a few of these in fs.h, move the remaining checks out of
      ctree.h into fs.h.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6a6b4daf
    • Josef Bacik's avatar
      btrfs: move verity prototypes into verity.h · 5c11adcc
      Josef Bacik authored
      Move these out of ctree.h into verity.h to cut down on code in ctree.h.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5c11adcc
    • Josef Bacik's avatar
      btrfs: move dev-replace prototypes into dev-replace.h · 77407dc0
      Josef Bacik authored
      We already have a dev-replace.h, simply move these prototypes and
      helpers into dev-replace.h where they belong.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      77407dc0
    • Josef Bacik's avatar
      btrfs: move scrub prototypes into scrub.h · 2fc6822c
      Josef Bacik authored
      Move these out of ctree.h into scrub.h to cut down on code in ctree.h.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2fc6822c