1. 19 Jun, 2023 40 commits
    • Christoph Hellwig's avatar
      btrfs: mark extent_buffer_under_io static · 7f26fb1c
      Christoph Hellwig authored
      extent_buffer_under_io is only used in extent_io.c, so mark it static.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7f26fb1c
    • Qu Wenruo's avatar
      btrfs: trigger orphan inode cleanup during START_SYNC ioctl · edc72881
      Qu Wenruo authored
      There is an internal error report that scrub found an error in an orphan
      inode's data.
      
      However there are very limited ways to cleanup such orphan inodes:
      
      - btrfs_start_pre_rw_mount()
        This happens at either mount, or RO->RW switch.
        This is not a viable solution for root fs which may not be unmounted
        or RO mounted.
      
        Furthermore this doesn't cover every subvolume, it only covers the
        currently cached subvolumes.
      
      - btrfs_lookup_dentry()
        This happens when we first lookup the subvolume dentry.
        But dentry can be cached thus it's not ensured to be triggered every
        time.
      
      - create_snapshot()
        This only happens for the created snapshot, not the source one.
      
      This means if we didn't trigger orphan items cleanup, there is really no
      other way to manually trigger it. Add this step to the START_SYNC ioctl.
      This is a slight change in the semantics of the ioctl but as sync can be
      potentially slow and is usually paired with WAIT_SYNC ioctl.
      
      The errors are not handled because the main point of the ioctl is the
      async commit, orphan cleanup is a side effect.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      edc72881
    • Filipe Manana's avatar
      btrfs: fix comment referring to no longer existing btrfs_clean_tree_block() · 618d1d7d
      Filipe Manana authored
      There's a comment at btrfs_init_new_buffer() that refers to a function
      named btrfs_clean_tree_block(), however the function was renamed to
      btrfs_clear_buffer_dirty() in commit 190a8339 ("btrfs: rename
      btrfs_clean_tree_block to btrfs_clear_buffer_dirty"). So update the
      comment to refer to the current name.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      618d1d7d
    • Filipe Manana's avatar
      btrfs: change for_rename argument of btrfs_record_unlink_dir() to bool · 59fcf388
      Filipe Manana authored
      The for_rename argument of btrfs_record_unlink_dir() is defined as an
      integer, but the argument is in fact used as a boolean. So change it to
      a boolean to make its use more clear.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      59fcf388
    • Filipe Manana's avatar
      btrfs: remove pointless label and goto at btrfs_record_unlink_dir() · acfb5a4f
      Filipe Manana authored
      There's no point of having a label and goto at btrfs_record_unlink_dir()
      because the function is trivial and can just return early if we are not
      in a rename context. So remove the label and goto and instead return
      early if we are not in a rename.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      acfb5a4f
    • Filipe Manana's avatar
      btrfs: update comments at btrfs_record_unlink_dir() to be more clear · 1e75ef03
      Filipe Manana authored
      Update the comments at btrfs_record_unlink_dir() so that they mention
      where new names are logged and where old names are removed. Also, while
      at it make the width of the comments closer to 80 columns and capitalize
      the sentences and finish them with punctuation.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1e75ef03
    • Filipe Manana's avatar
      btrfs: use inode_logged() at btrfs_record_unlink_dir() · d67ba263
      Filipe Manana authored
      At btrfs_record_unlink_dir() we directly check the logged_trans field of
      the given inodes to check if they were previously logged in the current
      transaction, and if any of them were, then we can avoid setting the field
      last_unlink_trans of the directory to the id of the current transaction if
      we are in a rename path. Avoiding that can later prevent falling back to
      a transaction commit if anyone attempts to log the directory.
      
      However the logged_trans field, store in struct btrfs_inode, is transient,
      not persisted in the inode item on its subvolume b+tree, so that means
      that if an inode is evicted and then loaded again, its original value is
      lost and it's reset to 0. So directly checking the logged_trans field can
      lead to some false negative, and that only results in a performance impact
      as mentioned before.
      
      Instead of directly checking the logged_trans field of the inodes, use the
      inode_logged() helper, which will check in the log tree if an inode was
      logged before in case its logged_trans field has a value of 0. This way
      we can avoid setting the directory inode's last_unlink_trans and cause
      future logging attempts of it to fallback to transaction commits. The
      following test script shows one example where this happens without this
      patch:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nullb0
        MNT=/mnt/nullb0
      
        num_init_files=10000
        num_new_files=10000
      
        mkfs.btrfs -f $DEV
        mount -o ssd $DEV $MNT
      
        mkdir $MNT/testdir
        for ((i = 1; i <= $num_init_files; i++)); do
            echo -n > $MNT/testdir/file_$i
         done
      
        echo -n > $MNT/testdir/foo
      
        sync
      
        # Add some files so that there's more work in the transaction other
        # than just renaming file foo.
        for ((i = 1; i <= $num_new_files; i++)); do
            echo -n > $MNT/testdir/new_file_$i
        done
      
        # Change the file, fsync it.
        setfattr -n user.x1 -v 123 $MNT/testdir/foo
        xfs_io -c "fsync" $MNT/testdir/foo
      
        # Now triggger eviction of file foo but no eviction for our test
        # directory, since it is being used by the process below. This will
        # set logged_trans of the file's inode to 0 once it is loaded again.
        (
            cd $MNT/testdir
            while true; do
                :
            done
        ) &
        pid=$!
      
        echo 2 > /proc/sys/vm/drop_caches
      
        kill $pid
        wait $pid
      
        # Move foo out of our testdir. This will set last_unlink_trans
        # of the directory inode to the current transaction, because
        # logged_trans of both the directory and the file are set to 0.
        mv $MNT/testdir/foo $MNT/foo
      
        # Change file foo again and fsync it.
        # This fsync will result in a transaction commit because the rename
        # above has set last_unlink_trans of the parent directory to the id
        # of the current transaction and because our inode for file foo has
        # last_unlink_trans set to the current transaction, since it was
        # evicted and reloaded and it was previously modified in the current
        # transaction (the xattr addition).
        xfs_io -c "pwrite 0 64K" $MNT/foo
        start=$(date +%s%N)
        xfs_io -c "fsync" $MNT/foo
        end=$(date +%s%N)
        dur=$(( (end - start) / 1000000 ))
      
        echo "file fsync took: $dur milliseconds"
      
        umount $MNT
      
      Before this patch:   fsync took 19 milliseconds
      After this patch:    fsync took  5 milliseconds
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d67ba263
    • Filipe Manana's avatar
      btrfs: use inode_logged() at need_log_inode() · bf1f4fd3
      Filipe Manana authored
      At need_log_inode() we directly check the ->logged_trans field of the
      given inode to check if it was previously logged in the transaction, with
      the goal of skipping logging the inode again when it's not necessary.
      The ->logged_trans field in not persisted in the inode item or elsewhere,
      it's only stored in memory (struct btrfs_inode), so it's transient and
      lost once the inode is evicted and then loaded again. Once an inode is
      loaded, we are conservative and set ->logged_trans to 0, which may mean
      that either the inode was never logged in the current transaction or it
      was logged but evicted before being loaded again.
      
      Instead of checking the inode's ->logged_trans field directly, we can
      use instead the helper inode_logged(), which will really check if the
      inode was logged before in the current transaction in case we have a
      ->logged_trans field with a value of 0. This will prevent unnecessarily
      logging an inode when it's not needed, and in some cases preventing a
      transaction commit, in case the logging requires a fallback to a
      transaction commit. The following test script shows a scenario where
      due to eviction we fallback a transaction commit when trying to fsync
      a file that was renamed:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nullb0
        MNT=/mnt/nullb0
      
        num_init_files=10000
        num_new_files=10000
      
        mkfs.btrfs -f $DEV
        mount -o ssd $DEV $MNT
      
        mkdir $MNT/testdir
        for ((i = 1; i <= $num_init_files; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        echo -n > $MNT/testdir/foo
      
        sync
      
        # Add some files so that there's more work in the transaction other
        # than just renaming file foo.
        for ((i = 1; i <= $num_new_files; i++)); do
            echo -n > $MNT/testdir/new_file_$i
        done
      
        # Fsync the directory first.
        xfs_io -c "fsync" $MNT/testdir
      
        # Rename file foo.
        mv $MNT/testdir/foo $MNT/testdir/bar
      
        # Now trigger eviction of the test directory's inode.
        # Once loaded again, it will have logged_trans set to 0 and
        # last_unlink_trans set to the current transaction.
        echo 2 > /proc/sys/vm/drop_caches
      
        # Fsync file bar (ex-foo).
        # Before the patch the fsync would result in a transaction commit
        # because the inode for file bar has last_unlink_trans set to the
        # current transaction, so it will attempt to log the parent directory
        # as well, which will fallback to a full transaction commit because
        # it also has its last_unlink_trans set to the current transaction,
        # due to the inode eviction.
        start=$(date +%s%N)
        xfs_io -c "fsync" $MNT/testdir/bar
        end=$(date +%s%N)
        dur=$(( (end - start) / 1000000 ))
      
        echo "file fsync took: $dur milliseconds"
      
        umount $MNT
      
      Before this patch:  fsync took 22 milliseconds
      After this patch:   fsync took  8 milliseconds
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bf1f4fd3
    • Jiapeng Chong's avatar
      btrfs: scrub: remove more unused functions · b9cb105e
      Jiapeng Chong authored
      These functions are defined in the scrub.c file, but last callers were
      removed in e9255d6c ("btrfs: scrub: remove the old scrub recheck
      code").
      
      fs/btrfs/scrub.c:553:20: warning: unused function 'scrub_stripe_index_and_offset'.
      fs/btrfs/scrub.c:543:19: warning: unused function 'scrub_nr_raid_mirrors'.
      Reported-by: default avatarAbaci Robot <abaci@linux.alibaba.com>
      Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=4937Signed-off-by: default avatarJiapeng Chong <jiapeng.chong@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b9cb105e
    • Qu Wenruo's avatar
      btrfs: handle tree backref walk error properly · b7f9945a
      Qu Wenruo authored
      [BUG]
      Smatch reports the following errors related to commit ("btrfs: output
      affected files when relocation fails"):
      
      	fs/btrfs/inode.c:283 print_data_reloc_error()
      	error: uninitialized symbol 'ref_level'.
      
      [CAUSE]
      That part of code is mostly copied from scrub, but unfortunately scrub
      code from the beginning is not doing the error handling properly.
      
      The offending code looks like this:
      
      	do {
      		ret = tree_backref_for_extent();
      		btrfs_warn_rl();
      	} while (ret != 1);
      
      There are several problems involved:
      
      - No error handling
        If that tree_backref_for_extent() failed, we would output the same
        error again and again, never really exit as it requires ret == 1 to
        exit.
      
      - Always do one extra output
        As tree_backref_for_extent() only return > 0 if there is no more
        backref item.
        This means after the last item we hit, we would output an invalid
        error message for ret > 0 case.
      
      [FIX]
      Fix the old code by:
      
      - Move @ref_root and @ref_level into the if branch
        And do not initialize them, so we can catch such uninitialized values
        just like what we do in the inode.c
      
      - Explicitly check the return value of tree_backref_for_extent()
        And handle ret < 0 and ret > 0 cases properly.
      
      - No more do {} while () loop
        Instead go while (true) {} loop since we will handle @ret manually.
      Reported-by: default avatarDan Carpenter <dan.carpenter@linaro.org>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b7f9945a
    • Christoph Hellwig's avatar
      btrfs: don't hold an extra reference for redirtied buffers · f880fe6e
      Christoph Hellwig authored
      When btrfs_redirty_list_add redirties a buffer, it also acquires
      an extra reference that is released on transaction commit.  But
      this is not required as buffers that are dirty or under writeback
      are never freed (look for calls to extent_buffer_under_io())).
      
      Remove the extra reference and the infrastructure used to drop it
      again.
      
      History behind redirty logic:
      
      In the first place, it used releasing_list to hold all the
      to-be-released extent buffers, and decided which buffers to re-dirty at
      the commit time. Then, in a later version, the behaviour got changed to
      re-dirty a necessary buffer and add re-dirtied one to the list in
      btrfs_free_tree_block(). In short, the list was there mostly for the
      patch series' historical reason.
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      [ add Naohiro's comment regarding history ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f880fe6e
    • Christoph Hellwig's avatar
      btrfs: fix dirty_metadata_bytes for redirtied buffers · f18cc978
      Christoph Hellwig authored
      dirty_metadata_bytes is decremented in both places that clear the dirty
      bit in a buffer, but only incremented in btrfs_mark_buffer_dirty, which
      means that a buffer that is redirtied using btrfs_redirty_list_add won't
      be added to dirty_metadata_bytes, but it will be subtracted when written
      out, leading an inconsistency in the counter.
      
      Move the dirty_metadata_bytes from btrfs_mark_buffer_dirty into
      set_extent_buffer_dirty to also account for the redirty case, and remove
      the now unused set_extent_buffer_dirty return value.
      
      Fixes: d3575156 ("btrfs: zoned: redirty released extent buffers")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f18cc978
    • Johannes Thumshirn's avatar
      btrfs: unexport btrfs_run_discard_work and make it static · bb5167e6
      Johannes Thumshirn authored
      Mark btrfs_run_discard_work static and move it above its callers.
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bb5167e6
    • Josef Bacik's avatar
      btrfs: rename del_ptr to btrfs_del_ptr and export it · 016f9d0b
      Josef Bacik authored
      This exists internal to ctree.c, however btrfs check needs to use it for
      some of its operations.  I'd rather not duplicate that code inside of
      btrfs check as this is low level and I want to keep this code in one
      place, so rename the function to btrfs_del_ptr and export it so that it
      can be used inside of btrfs-progs safely.  Add a comment to make sure
      this doesn't get removed by a future cleanup.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      016f9d0b
    • Josef Bacik's avatar
      btrfs: add a btrfs_csum_type_size helper · b3cbfb0d
      Josef Bacik authored
      This is needed in btrfs-progs for the tools that convert the checksum
      types for file systems and a few other things.  We don't have it in the
      kernel as we just want to get the size for the super blocks type.
      However I don't want to have to manually add this every time we sync
      ctree.c into btrfs-progs, so add the helper in the kernel with a note so
      it doesn't get removed by a later cleanup.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b3cbfb0d
    • Josef Bacik's avatar
      btrfs: add __KERNEL__ check for btrfs_no_printk · a95b7f93
      Josef Bacik authored
      We want to override this in btrfs-progs, so wrap this in the __KERNEL__
      check so we can easily sync this to btrfs-progs and have our local
      version of btrfs_no_printk do the work.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a95b7f93
    • Josef Bacik's avatar
      btrfs: move split_flags/combine_flags helpers to inode-item.h · f541833c
      Josef Bacik authored
      These are more related to the inode item flags on disk than the
      in-memory btrfs_inode, move the helpers to inode-item.h.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f541833c
    • Josef Bacik's avatar
      btrfs: move btrfs_verify_level_key into tree-checker.c · 2cac5af1
      Josef Bacik authored
      This is more a buffer validation helper, move it into the tree-checker
      files where it makes more sense.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2cac5af1
    • Josef Bacik's avatar
      btrfs: add __btrfs_check_node helper · c26fa931
      Josef Bacik authored
      This helper returns a btrfs_tree_block_status for the various errors,
      and then btrfs_check_node() will return -EUCLEAN if it gets anything
      other than BTRFS_TREE_BLOCK_CLEAN which will be used by the kernel.  In
      the future btrfs-progs will use this helper instead.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c26fa931
    • Josef Bacik's avatar
      btrfs: extend btrfs_leaf_check to return btrfs_tree_block_status · 924452c8
      Josef Bacik authored
      Instead of blanket returning -EUCLEAN for all the failures in
      btrfs_check_leaf, use btrfs_tree_block_status and return the appropriate
      status for each failure.  Rename the helper to __btrfs_check_leaf and
      then make a wrapper of btrfs_check_leaf that will return -EUCLEAN to
      non-clean error codes.  This will allow us to have the
      __btrfs_check_leaf variant in btrfs-progs while keeping the behavior in
      the kernel consistent.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      924452c8
    • Josef Bacik's avatar
      btrfs: use btrfs_tree_block_status for leaf item errors · c8d54215
      Josef Bacik authored
      We have a variety of item specific errors that can occur.  For now
      simply put these under the umbrella of BTRFS_TREE_BLOCK_INVALID_ITEM,
      this can be fleshed out as we need in the future.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c8d54215
    • Josef Bacik's avatar
      btrfs: add btrfs_tree_block_status definitions to tree-checker.h · a7b4e6c7
      Josef Bacik authored
      We use this in btrfs-progs to determine if we can fix different types of
      corruptions.  We don't care about this in the kernel, however it would
      be good to share this code between the kernel and btrfs-progs, so add
      the status definitions so we can start converting the tree-checker code
      over to using these status flags instead of blanket returning -EUCLEAN.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a7b4e6c7
    • Josef Bacik's avatar
      btrfs: simplify btrfs_check_leaf_* helpers into a single helper · 85d8a826
      Josef Bacik authored
      We have two helpers for checking leaves, because we have an extra check
      for debugging in btrfs_mark_buffer_dirty(), and at that stage we may
      have item data that isn't consistent yet.  However we can handle this
      case internally in the helper, if BTRFS_HEADER_FLAG_WRITTEN is set we
      know the buffer should be internally consistent, otherwise we need to
      skip checking the item data.
      
      Simplify this helper down a single helper and handle the item data
      checking logic internally to the helper.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      85d8a826
    • Josef Bacik's avatar
      btrfs: remove level argument from btrfs_set_block_flags · 4aec05fa
      Josef Bacik authored
      We just pass in btrfs_header_level(eb) for the level, and we're passing
      in the eb already, so simply get the level from the eb inside of
      btrfs_set_block_flags.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4aec05fa
    • Josef Bacik's avatar
      btrfs: move btrfs_check_trunc_cache_free_space into block-rsv.c · 54d687c1
      Josef Bacik authored
      This is completely related to block rsv's, move it out of the free space
      cache code and into block-rsv.c.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      54d687c1
    • Qu Wenruo's avatar
      btrfs: scrub: use recovered data stripes as cache to avoid unnecessary read · 94ead93e
      Qu Wenruo authored
      For P/Q stripe scrub, we have quite some duplicated read IO:
      
      - Data stripes read for verification
        This is triggered by the scrub_submit_initial_read() inside
        scrub_raid56_parity_stripe().
      
      - Data stripes read (again) for P/Q stripe verification
        This is triggered by scrub_assemble_read_bios() from scrub_rbio().
      
        Although we can have hit rbio cache and avoid unnecessary read, the
        chance is very low, as scrub would easily flush the whole rbio cache.
      
      This means, even we're just scrubbing a single P/Q stripe, we would read
      the data stripes twice for the best case scenario.  If we need to
      recover some data stripes, it would cause more reads on the same data
      stripes, again and again.
      
      However before we call raid56_parity_submit_scrub_rbio() we already
      have all data stripes repaired and their contents ready to use.
      But RAID56 cache is unaware about the scrub cache, thus RAID56 layer
      itself still needs to re-read the data stripes.
      
      To avoid such cache miss, this patch would:
      
      - Introduce a new helper, raid56_parity_cache_data_pages()
        This function would grab the pages from an array, and copy the content
        to the rbio, marking all the involved sectors uptodate.
      
        The page copy is unavoidable because of the cache pages of rbio are all
        self managed, thus can not utilize outside pages without screwing up
        the lifespan.
      
      - Use the repaired data stripes as cache inside
        scrub_raid56_parity_stripe()
      
      By this, we ensure all the data sectors of the scrub rbio are already
      uptodate, and no need to read them again from disk.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      94ead93e
    • Filipe Manana's avatar
      btrfs: assert tree lock is held when removing free space entries · 7e5ba559
      Filipe Manana authored
      Removing a free space entry from an in memory space cache requires having
      the corresponding btrfs_free_space_ctl's 'tree_lock' held. We have several
      code paths that remove an entry, so add assertions where appropriate to
      verify we are holding the lock, as the lock is acquired by some other
      function up in the call chain, which makes it easy to miss in the future.
      
      Note: for this to work we need to lock the local btrfs_free_space_ctl at
      load_free_space_cache(), which was not being done because it's local,
      declared on the stack, so no other task has access to it.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7e5ba559
    • Filipe Manana's avatar
      btrfs: assert tree lock is held when linking free space · 9649bd9a
      Filipe Manana authored
      When linking a free space entry, at link_free_space(), the caller should
      be holding the spinlock 'tree_lock' of the given btrfs_free_space_ctl
      argument, which is necessary for manipulating the red black tree of free
      space entries (done by tree_insert_offset(), which already asserts the
      lock is held) and for manipulating the 'free_space', 'free_extents',
      'discardable_extents' and 'discardable_bytes' counters of the given
      struct btrfs_free_space_ctl.
      
      So assert that the spinlock 'tree_lock' of the given btrfs_free_space_ctl
      is held by the current task. We have multiple code paths that end up
      calling link_free_space(), and all currently take the lock before calling
      it.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9649bd9a
    • Filipe Manana's avatar
      btrfs: assert tree lock is held when searching for free space entries · 91de9e97
      Filipe Manana authored
      When searching for a free space entry by offset, at tree_search_offset(),
      we are supposed to have the btrfs_free_space_ctl's 'tree_lock' held, so
      assert that. We have multiple callers of tree_search_offset(), and all
      currently hold the necessary lock before calling it.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      91de9e97
    • Filipe Manana's avatar
      btrfs: assert proper locks are held at tree_insert_offset() · 13c2018f
      Filipe Manana authored
      There are multiple code paths leading to tree_insert_offset(), and each
      path takes the necessary locks before tree_insert_offset() is called,
      since they do other things that require those locks to be held. This makes
      it easy to miss the locking somewhere, so make tree_insert_offset() assert
      that the required locks are being held by the calling task.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      13c2018f
    • Filipe Manana's avatar
      btrfs: simplify arguments to tree_insert_offset() · 0d6bac4d
      Filipe Manana authored
      For the in-memory component of space caching (free space cache and free
      space tree), three of the arguments passed to tree_insert_offset() can
      always be taken from the new free space entry that we are about to add.
      
      So simplify tree_insert_offset() to take the new entry instead of the
      'offset', 'node' and 'bitmap' arguments. This will also allow to make
      further changes simpler.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0d6bac4d
    • Filipe Manana's avatar
      btrfs: use precomputed end offsets at do_trimming() · b77433b1
      Filipe Manana authored
      The are two computations of end offsets at do_trimming() that are not
      necessary, as they were previously computed and stored in local const
      variables. So just use the variables instead, to make the source code
      shorter and easier to read.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b77433b1
    • Filipe Manana's avatar
      btrfs: avoid searching twice for previous node when merging free space entries · 9085f425
      Filipe Manana authored
      At try_merge_free_space(), avoid calling twice rb_prev() to find the
      previous node, as that requires looping through the red black tree, so
      store the result of the rb_prev() call and then use it.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9085f425
    • Filipe Manana's avatar
      btrfs: avoid extra memory allocation when copying free space cache · fbb2e654
      Filipe Manana authored
      At copy_free_space_cache(), we add a new entry to the block group's ctl
      before we free the entry from the temporary ctl. Adding a new entry
      requires the allocation of a new struct btrfs_free_space, so we can
      avoid a temporary extra allocation by freeing the entry from the
      temporary ctl before we add a new entry to the main ctl, which possibly
      also reduces the chances for a memory allocation failure in case of very
      high memory pressure. So just do that.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fbb2e654
    • Tom Rix's avatar
      btrfs: simplify transid initialization in btrfs_ioctl_wait_sync · 12df6a62
      Tom Rix authored
      A small code simplification, move the default value of transid to its
      initialization and remove the else-statement.
      Signed-off-by: default avatarTom Rix <trix@redhat.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      12df6a62
    • Qu Wenruo's avatar
      btrfs: output affected files when relocation fails · b9a9a850
      Qu Wenruo authored
      [PROBLEM]
      When relocation fails (mostly due to checksum mismatch), we only got
      very cryptic error messages like:
      
        BTRFS info (device dm-4): relocating block group 13631488 flags data
        BTRFS warning (device dm-4): csum failed root -9 ino 257 off 0 csum 0x373e1ae3 expected csum 0x98757625 mirror 1
        BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
        BTRFS info (device dm-4): balance: ended with status: -5
      
      The end user has to decipher the above messages and use various tools to
      locate the affected files and find a way to fix the problem (mostly
      deleting the file).  This is not an easy work even for experienced
      developer, not to mention the end users.
      
      [SCRUB IS DOING BETTER]
      By contrast, scrub is providing much better error messages:
      
        BTRFS error (device dm-4): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
        BTRFS warning (device dm-4): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
        BTRFS info (device dm-4): scrub: finished on devid 1 with status: 0
      
      Which provides the affected files directly to the end user.
      
      [IMPROVEMENT]
      Instead of the generic data checksum error messages, which is not doing
      a good job for data reloc inodes, this patch introduce a scrub like
      backref walking based solution.
      
      When a sector fails its checksum for data reloc inode, we go the
      following workflow:
      
      - Get the real logical bytenr
        For data reloc inode, the file offset is the offset inside the block
        group.
        Thus the real logical bytenr is @file_off + @block_group->start.
      
      - Do an extent type check
        If it's tree blocks it's much easier to handle, just go through
        all the tree block backref.
      
      - Do a backref walk and inode path resolution for data extents
        This is mostly the same as scrub.
        But unfortunately we can not reuse the same function as the output
        format is different.
      
      Now the new output would be more user friendly:
      
        BTRFS info (device dm-4): relocating block group 13631488 flags data
        BTRFS warning (device dm-4): csum failed root -9 ino 257 off 0 logical 13631488 csum 0x373e1ae3 expected csum 0x98757625 mirror 1
        BTRFS warning (device dm-4): checksum error at logical 13631488 mirror 1 root 5 inode 257 offset 0 length 4096 links 1 (path: file)
        BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
        BTRFS info (device dm-4): balance: ended with status: -5
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b9a9a850
    • Christoph Hellwig's avatar
      btrfs: remove hipri_workers workqueue · 8bfec2e4
      Christoph Hellwig authored
      Now that btrfs_wq_submit_bio is never called for synchronous I/O,
      the hipri_workers workqueue is not used anymore and can be removed.
      Reviewed-by: default avatarChris Mason <clm@fb.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8bfec2e4
    • Christoph Hellwig's avatar
      btrfs: determine synchronous writers from bio or writeback control · e917ff56
      Christoph Hellwig authored
      The writeback_control structure already passes down the information about
      a writeback being synchronous from the core VM code, and thus information
      is propagated into the bio REQ_SYNC flag through the wbc_to_write_flags
      helper.
      
      Use that information to decide if checksums calculation is offloaded to
      a workqueue instead of btrfs_inode::sync_writers field that not only
      bloats the inode but also has too wide scope, being inode wide instead
      of limited to the actual writeback request.
      
      The sync writes were set in:
      
      - btrfs_do_write_iter - regular IO, sync status is set
      - start_ordered_ops - ordered write start, writeback with WB_SYNC_ALL
        mode
      - btrfs_write_marked_extents - write marked extents, writeback with
        WB_SYNC_ALL mode
      Reviewed-by: default avatarChris Mason <clm@fb.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e917ff56
    • Christoph Hellwig's avatar
      btrfs: submit IO synchronously for fast checksum implementations · da023618
      Christoph Hellwig authored
      Most modern hardware supports very fast accelerated crc32c calculation.
      If that is supported the CPU overhead of the checksum calculation is
      very limited, and offloading the calculation to special worker threads
      has a lot of overhead for no gain.
      
      E.g. on an Intel Optane device is actually very much slows down even
      1M buffered writes with fio:
      
      Unpatched:
      
      write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets
      
      With synchronous CRCs:
      
      write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets
      
      With a lot of variation during the unpatched run going down as low as
      1100MB/s, while the synchronous CRC version has about the same peak write
      speed but much lower dips, and fewer kworkers churning around.
      Both tests had fio saturated at 100% CPU.
      
      (thanks to Jens Axboe via Chris Mason for the benchmarking)
      Reviewed-by: default avatarChris Mason <clm@fb.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      da023618
    • Anand Jain's avatar
      btrfs: use SECTOR_SHIFT to convert LBA to physical offset · adbe7e38
      Anand Jain authored
      Using SECTOR_SHIFT to convert LBA to physical address makes it more
      readable.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      adbe7e38