1. 11 Jul, 2024 40 commits
    • Qu Wenruo's avatar
      btrfs: remove extent_map::orig_start member · 4aa7b5d1
      Qu Wenruo authored
      Since we have extent_map::offset, the old extent_map::orig_start is just
      extent_map::start - extent_map::offset for non-hole/inline extents.
      
      And since the new extent_map::offset is already verified by
      validate_extent_map() while the old orig_start is not, let's just remove
      the old member from all call sites.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4aa7b5d1
    • Qu Wenruo's avatar
      btrfs: introduce extra sanity checks for extent maps · 3f255ece
      Qu Wenruo authored
      Since extent_map structure has the all the needed members to represent a
      file extent directly, we can apply all the file extent sanity checks to
      an extent map.
      
      The new sanity checks will cross check both the old members
      (block_start/block_len/orig_start) and the new members
      (disk_bytenr/disk_num_bytes/offset).
      
      There is a special case for offset/orig_start/start cross check, we only
      do such sanity check for compressed extent, as only compressed
      read/encoded write really utilize orig_start.
      This can be proved by the cleanup patch of orig_start.
      
      The checks happens at the following times:
      
      - add_extent_mapping()
        This is for newly added extent map
      
      - replace_extent_mapping()
        This is for btrfs_drop_extent_map_range() and split_extent_map()
      
      - try_merge_map()
      
      For a lot of call sites we have to properly populate all the members to
      pass the sanity check, meanwhile the following code needs extra
      modification:
      
      - setup_file_extents() from inode-tests
        The file extents layout of setup_file_extents() is already too invalid
        that tree-checker would reject most of them in real world.
      
        However there is just a special unaligned regular extent which has
        mismatched disk_num_bytes (4096) and ram_bytes (4096 - 1).
        So instead of dropping the whole test case, here we just unify
        disk_num_bytes and ram_bytes to 4096 - 1.
      
      - test_case_7() from extent-map-tests
        An extent is inserted with 16K length, but on-disk extent size is
        only 4K.
        This means it must be a compressed extent, so set the compressed flag
        for it.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3f255ece
    • Qu Wenruo's avatar
      btrfs: introduce new members for extent_map · 3d2ac992
      Qu Wenruo authored
      Introduce two new members for extent_map:
      
      - disk_bytenr
      - offset
      
      Both are matching the members with the same name inside
      btrfs_file_extent_items.
      
      For now this patch only touches those members when:
      
      - Reading btrfs_file_extent_items from disk
      - Inserting new holes
      - Merging two extent maps
        With the new disk_bytenr and disk_num_bytes, doing merging would be a
        little more complex, as we have 3 different cases:
      
        * Both extent maps are referring to the same data extents
          |<----- data extent A ----->|
             |<- em 1 ->|<- em 2 ->|
      
        * Both extent maps are referring to different data extents
          |<-- data extent A -->|<-- data extent B -->|
                     |<- em 1 ->|<- em 2 ->|
      
        * One of the extent maps is referring to a merged and larger data
          extent that covers both extent maps
      
          This is not really valid case other than some selftests.
          So this test case would be removed.
      
        A new helper merge_ondisk_extents() is introduced to handle the above
        valid cases.
      
      To properly assign values for those new members, a new btrfs_file_extent
      parameter is introduced to all the involved call sites.
      
      - For NOCOW writes the btrfs_file_extent would be exposed from
        can_nocow_file_extent().
      
      - For other writes, the members can be easily calculated
        As most of them have 0 offset and utilizing the whole on-disk data
        extent.
        The exception is encoded write, but thankfully that interface provided
        offset directly and all other needed info.
      
      For now, both the old members (block_start/block_len/orig_start) are
      co-existing with the new members (disk_bytenr/offset), meanwhile all the
      critical code is still using the old members only.
      
      The cleanup will happen later after all the old and new members are
      properly validated.
      
      There would be some re-ordering for the assignment of the extent_map
      members, now we follow the new ordering:
      
      - start and len
        Or file_pos and num_bytes for other structures.
      
      - disk_bytenr and disk_num_bytes
      - offset and ram_bytes
      - compression
      
      So expect some seemingly unrelated line movement.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3d2ac992
    • Qu Wenruo's avatar
      btrfs: export the expected file extent through can_nocow_extent() · 87a6962f
      Qu Wenruo authored
      Currently function can_nocow_extent() only returns members needed for
      extent_map.
      
      However since we will soon change the extent_map structure to be more
      like btrfs_file_extent_item, we want to expose the expected file extent
      caused by the NOCOW write for future usage.
      
      This introduces a new structure, btrfs_file_extent, to be a more
      memory access friendly representation of btrfs_file_extent_item.
      And use that structure to expose the expected file extent caused by the
      NOCOW write.
      
      For now there is no user of the new structure yet.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      87a6962f
    • Qu Wenruo's avatar
      btrfs: rename extent_map::orig_block_len to disk_num_bytes · e8fe524d
      Qu Wenruo authored
      This would make it very obvious that the member just matches
      btrfs_file_extent_item::disk_num_bytes.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e8fe524d
    • Filipe Manana's avatar
      btrfs: move fiemap code into its own file · 8996f61a
      Filipe Manana authored
      Currently the core of the fiemap code lives in extent_io.c, which does
      not make any sense because it's not related to extent IO at all (and it
      was not as well before the big rewrite of fiemap I did some time ago).
      The entry point for fiemap, btrfs_fiemap(), lives in inode.c since it's
      an inode operation.
      
      Since there's a significant amount of fiemap code, move all of it into a
      dedicated file, including its entry point inode.c:btrfs_fiemap().
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8996f61a
    • Filipe Manana's avatar
      btrfs: send: get rid of the label and gotos at ensure_commit_roots_uptodate() · f9763e4d
      Filipe Manana authored
      Now that there is a helper to commit the current transaction and we are
      using it, there's no need for the label and goto statements at
      ensure_commit_roots_uptodate(). So replace them with direct return
      statements that call btrfs_commit_current_transaction().
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f9763e4d
    • Filipe Manana's avatar
      btrfs: add and use helper to commit the current transaction · ded980eb
      Filipe Manana authored
      We have several places that attach to the current transaction with
      btrfs_attach_transaction_barrier() and then commit the transaction if
      there is one. Add a helper and use it to deduplicate this pattern.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ded980eb
    • Filipe Manana's avatar
      btrfs: scrub: avoid create/commit empty transaction at finish_extent_writes_for_zoned() · 1f8aee29
      Filipe Manana authored
      At finish_extent_writes_for_zoned() we use btrfs_join_transaction() to
      catch any running transaction and then commit it. This will however create
      a new and empty transaction in case there's no running transaction anymore
      (got committed by the transaction kthread or other task for example) or
      there's a running transaction finishing its commit and with a state >=
      TRANS_STATE_UNBLOCKED. In the former case we don't need to do anything
      while in the second case we just need to wait for the transaction to
      complete its commit.
      
      So improve this by using btrfs_attach_transaction_barrier() instead, which
      does not create a new transaction if there's none running, and if there's
      a current transaction that is committing, it will wait for it to fully
      commit and not create a new transaction. This helps avoiding creating and
      committing empty transactions, saving IO, time and unnecessary rotation of
      the backup roots in the super block.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1f8aee29
    • Filipe Manana's avatar
      btrfs: send: avoid create/commit empty transaction at ensure_commit_roots_uptodate() · 0557feab
      Filipe Manana authored
      At ensure_commit_roots_uptodate() we use btrfs_join_transaction() to
      catch any running transaction and then commit it. This will however create
      a new and empty transaction in case there's no running transaction anymore
      (got committed by the transaction kthread or other task for example) or
      there's a running transaction finishing its commit and with a state >=
      TRANS_STATE_UNBLOCKED. In the former case we don't need to do anything
      while in the second case we just need to wait for the transaction to
      complete its commit.
      
      So improve this by using btrfs_attach_transaction_barrier() instead, which
      does not create a new transaction if there's none running, and if there's
      a current transaction that is committing, it will wait for it to fully
      commit and not create a new transaction. This helps avoiding creating and
      committing empty transactions, saving IO, time and unnecessary rotation of
      the backup roots in the super block.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0557feab
    • Filipe Manana's avatar
      btrfs: send: make ensure_commit_roots_uptodate() simpler and more efficient · 9e79c497
      Filipe Manana authored
      Before starting a send operation we have to make sure that every root has
      its commit root matching the regular root, to that send doesn't find stale
      inodes in the commit root (inodes that were deleted in the regular root)
      and fails the inode lookups with -ESTALE.
      
      Currently we keep looking for roots used by the send operation and as soon
      as we find one we commit the current transaction (or a new one since
      btrfs_join_transaction() creates one if there isn't any running or the
      running one is in a state >= TRANS_STATE_UNBLOCKED). It's pointless to
      keep looking until we don't find any, because after the first transaction
      commit all the other roots are updated too, as they were already tagged in
      the fs_info->fs_roots_radix radix tree when they were modified in order to
      have a commit root different from the regular root.
      
      Currently we are also always passing the main send root into
      btrfs_join_transaction(), which despite not having any functional issue,
      it is not optimal because in case the root wasn't modified we end up
      adding it to fs_info->fs_roots_radix and then update its root item in the
      root tree when committing the transaction, causing unnecessary work.
      
      So simplify and make this more efficient by removing the looping and by
      passing the first root we found that is modified as the argument to
      btrfs_join_transaction().
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9e79c497
    • Filipe Manana's avatar
      btrfs: avoid create and commit empty transaction when committing super · cab0d862
      Filipe Manana authored
      At btrfs_commit_super(), called in a few contexts such as when unmounting
      a filesystem, we use btrfs_join_transaction() to catch any running
      transaction and then commit it. This will however create a new and empty
      transaction in case there's no running transaction or there's a running
      transaction with a state >= TRANS_STATE_UNBLOCKED.
      
      As we just want to be sure that any existing transaction is fully
      committed, we can use btrfs_attach_transaction_barrier() instead of
      btrfs_join_transaction(), therefore avoiding the creation and commit of
      empty transactions, which only waste IO and causes rotation of the
      precious backup roots.
      
      Example where we create and commit a pointless empty transaction:
      
        $ mkfs.btrfs -f /dev/sdj
        $ btrfs inspect-internal dump-super /dev/sdj | grep -e '^generation'
        generation            6
      
        $ mount /dev/sdj /mnt/sdj
        $ touch /mnt/sdj/foo
      
        # Commit the currently open transaction. Just 'sync' or wait ~30
        # seconds for the transaction kthread to commit it.
        $ sync
      
        $ btrfs inspect-internal dump-super /dev/sdj | grep -e '^generation'
        generation            7
      
        $ umount /mnt/sdj
      
        $ btrfs inspect-internal dump-super /dev/sdj | grep -e '^generation'
        generation            8
      
      The transaction with id 8 was pointless, an empty transaction that did
      not achieve anything.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cab0d862
    • Filipe Manana's avatar
      btrfs: qgroup: avoid start/commit empty transaction when flushing reservations · de18fba8
      Filipe Manana authored
      When flushing reservations we are using btrfs_join_transaction() to get a
      handle for the current transaction and then commit it to try to release
      space. However btrfs_join_transaction() has some undesirable consequences:
      
      1) If there's no running transaction, it will create one, and we will
         commit it right after. This is unnecessary because it will not release
         any space, and it will result in unnecessary IO and rotation of backup
         roots in the superblock;
      
      2) If there's a current transaction and that transaction is committing
         (its state is >= TRANS_STATE_COMMIT_DOING), it will wait for that
         transaction to almost finish its commit (for its state to be >=
         TRANS_STATE_UNBLOCKED) and then start and return a new transaction.
      
         We will then commit that new transaction, which is pointless because
         all we wanted was to wait for the current (previous) transaction to
         fully finish its commit (state == TRANS_STATE_COMPLETED), and by
         starting and committing a new transaction we are wasting IO too and
         causing unnecessary rotation of backup roots in the superblock.
      
      So improve this by using btrfs_attach_transaction_barrier() instead, which
      does not create a new transaction if there's none running, and if there's
      a current transaction that is committing, it will wait for it to fully
      commit and not create a new transaction.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      de18fba8
    • David Sterba's avatar
      btrfs: simplify range parameters of btrfs_wait_ordered_roots() · 42317ab4
      David Sterba authored
      The range is specified only in two ways, we can simplify the case for
      the whole filesystem range as a NULL block group parameter.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      42317ab4
    • Qu Wenruo's avatar
      btrfs: automatically remove the subvolume qgroup · 839d6ea4
      Qu Wenruo authored
      Currently if we fully clean a subvolume (not only delete its directory,
      but fully clean all it's related data and root item), the associated
      qgroup would not be removed.
      
      We have "btrfs qgroup clear-stale" to handle such 0 level qgroups.
      
      Change the behavior to automatically removie the qgroup of a fully
      cleaned subvolume when possible:
      
      - Full qgroup but still consistent
        We can and should remove the qgroup.
        The qgroup numbers should be 0, without any rsv.
      
      - Full qgroup but inconsistent
        Can happen with drop_subtree_threshold feature (skip accounting
        and mark qgroup inconsistent).
      
        We can and should remove the qgroup.
        Higher level qgroup numbers will be incorrect, but since qgroup
        is already inconsistent, it should not be a problem.
      
      - Squota mode
        This is the special case, we can only drop the qgroup if its numbers
        are all 0.
      
        This would be handled by can_delete_qgroup(), so we only need to check
        the return value and ignore the -EBUSY error.
      
      Link: https://bugzilla.suse.com/show_bug.cgi?id=1222847Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      839d6ea4
    • Qu Wenruo's avatar
      btrfs: slightly loosen the requirement for qgroup removal · a776bf5f
      Qu Wenruo authored
      [BUG]
      Currently if one is utilizing "qgroups/drop_subtree_threshold" sysfs,
      and a snapshot with level higher than that value is dropped, we will
      not be able to delete the qgroup until next qgroup rescan:
      
        uuid=ffffffff-eeee-dddd-cccc-000000000000
      
        wipefs -fa $dev
        mkfs.btrfs -f $dev -O quota -s 4k -n 4k -U $uuid
        mount $dev $mnt
      
        btrfs subvolume create $mnt/subv1/
        for (( i = 0; i < 1024; i++ )); do
        	xfs_io -f -c "pwrite 0 2k" $mnt/subv1/file_$i > /dev/null
        done
        sync
        btrfs subvolume snapshot $mnt/subv1 $mnt/snapshot
        btrfs quota enable $mnt
        btrfs quota rescan -w $mnt
        sync
        echo 1 > /sys/fs/btrfs/$uuid/qgroups/drop_subtree_threshold
        btrfs subvolume delete $mnt/snapshot
        btrfs subvolume sync $mnt
        btrfs qgroup show -prce --sync $mnt
        btrfs qgroup destroy 0/257 $mnt
        umount $mnt
      
      The final qgroup removal would fail with the following error:
      
        ERROR: unable to destroy quota group: Device or resource busy
      
      [CAUSE]
      The above script would generate a subvolume of level 2, then snapshot
      it, enable qgroup, set the drop_subtree_threshold, then drop the
      snapshot.
      
      Since the subvolume drop would meet the threshold, qgroup would be
      marked inconsistent and skip accounting to avoid hanging the system at
      transaction commit.
      
      But currently we do not allow a qgroup with any rfer/excl numbers to be
      dropped, and this is not really compatible with the new
      drop_subtree_threshold behavior.
      
      [FIX]
      Only require the strict zero rfer/excl/rfer_cmpr/excl_cmpr for squota
      mode.  This is due to the fact that squota can never go inconsistent,
      and it can have dropped subvolume but with non-zero qgroup numbers for
      future accounting.
      
      For full qgroup mode, we only check if there is a subvolume for it.
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a776bf5f
    • David Sterba's avatar
      btrfs: constify parameters of write_eb_member() and its users · 56e6f268
      David Sterba authored
      Reported by 'gcc -Wcast-qual', the argument from which write_extent_buffer()
      reads data to write to the eb should be const. In addition the const
      needs to be also added to __write_extent_buffer() local buffers.
      
      All callers of write_eb_member() can now be updated to use const for the
      input buffer structure or type.
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      56e6f268
    • David Sterba's avatar
      btrfs: keep const when returning value from get_unaligned_le8() · 840a97bd
      David Sterba authored
      This was reported by 'gcc -Wcast-qual', the get_unaligned_le8() simply
      returns the argument and there's no reason to drop the cast.
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      840a97bd
    • David Sterba's avatar
      btrfs: remove unused define EXTENT_SIZE_PER_ITEM · 5100c4eb
      David Sterba authored
      This was added  in c61a16a7 ("Btrfs: fix the confusion between
      delalloc bytes and metadata bytes") and removed in 03fe78cc
      ("btrfs: use delalloc_bytes to determine flush amount for
      shrink_delalloc") where the calculation was reworked to use a
      non-constant numbers. This was found by 'make W=2'.
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5100c4eb
    • David Sterba's avatar
      btrfs: use for-local variables that shadow function variables · d2715d1d
      David Sterba authored
      We've started to use for-loop local variables and in a few places this
      shadows a function variable. Convert a few cases reported by 'make W=2'.
      If applicable also change the style to post-increment, that's the
      preferred one.
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d2715d1d
    • David Sterba's avatar
      btrfs: rename macro local variables that clash with other variables · 91629e6d
      David Sterba authored
      Fix variable names in two macros where there's a local function variable
      of the same name.  In subpage_calc_start_bit() it's in several callers,
      in btrfs_abort_transaction() it's only in replace_file_extents().
      Found by 'make W=2'.
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      91629e6d
    • David Sterba's avatar
      btrfs: remove duplicate name variable declarations · 9c5e1fb0
      David Sterba authored
      When running 'make W=2' there are a few reports where a variable of the
      same name is declared in a nested block. In all the cases we can use the
      one declared in the parent block, no problematic cases were found.
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9c5e1fb0
    • Filipe Manana's avatar
      btrfs: use a btrfs_inode local variable at btrfs_sync_file() · 56b7169f
      Filipe Manana authored
      Instead of using a VFS inode local pointer and then doing many BTRFS_I()
      calls inside btrfs_sync_file(), use a btrfs_inode pointer instead. This
      makes everything a bit easier to read and less confusing, allowing to
      make some statements shorter.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      56b7169f
    • Filipe Manana's avatar
      btrfs: pass a btrfs_inode to btrfs_wait_ordered_range() · e641e323
      Filipe Manana authored
      Instead of passing a (VFS) inode pointer argument, pass a btrfs_inode
      instead, as this is generally what we do for internal APIs, making it
      more consistent with most of the code base. This will later allow to
      help to remove a lot of BTRFS_I() calls in btrfs_sync_file().
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e641e323
    • Filipe Manana's avatar
      btrfs: pass a btrfs_inode to btrfs_fdatawrite_range() · cef2daba
      Filipe Manana authored
      Instead of passing a (VFS) inode pointer argument, pass a btrfs_inode
      instead, as this is generally what we do for internal APIs, making it
      more consistent with most of the code base. This will later allow to
      help to remove a lot of BTRFS_I() calls in btrfs_sync_file().
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cef2daba
    • Filipe Manana's avatar
      btrfs: use a btrfs_inode in the log context (struct btrfs_log_ctx) · 4d0120a5
      Filipe Manana authored
      Instead of using a inode pointer, use a btrfs_inode pointer in the log
      context structure, as this is generally what we need and allows for some
      internal APIs to take a btrfs_inode instead, making them more consistent
      with most of the code base. This will later allow to help to remove a lot
      of BTRFS_I() calls in btrfs_sync_file().
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4d0120a5
    • Filipe Manana's avatar
      btrfs: make btrfs_finish_ordered_extent() return void · c41881ae
      Filipe Manana authored
      Currently btrfs_finish_ordered_extent() returns a boolean indicating if
      the ordered extent was added to the work queue for completion, but none
      of its callers cares about it, so make it return void.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c41881ae
    • Anand Jain's avatar
      btrfs: move btrfs_block_group_root() to block-group.c · 83937fb6
      Anand Jain authored
      The function btrfs_block_group_root() is declared in disk-io.c; however,
      all its callers are in block-group.c. Move it to the latter file and
      declare it static.
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      83937fb6
    • Anand Jain's avatar
      btrfs: drop bytenr_orig and fix comment in btrfs_scan_one_device() · 70559abf
      Anand Jain authored
      Drop the single-use variable bytenr_orig and instead use btrfs_sb_offset()
      in the function argument passing.
      
      Fix a stale comment about not automatically fixing a bad primary
      superblock from the backup mirror copies. Also, move the comment closer
      to where the primary superblock read occurs.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      70559abf
    • Filipe Manana's avatar
      btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree · 4e660ca3
      Filipe Manana authored
      We are currently using a cached rb_root (struct rb_root_cached) for the
      rb root of struct extent_map_tree. This doesn't offer much of an advantage
      here because:
      
      1) It's only advantage over the regular rb_root is that it caches a
         pointer to the left most node (first node), so a call to
         rb_first_cached() doesn't have to chase pointers until it reaches
         the left most node;
      
      2) We only have two scenarios that access left most node with
         rb_first_cached():
      
            When dropping all extent maps from an inode, during inode eviction;
      
            When iterating over extent maps during the extent map shrinker;
      
      3) In both cases we keep removing extent maps, which causes deletion of
         the left most node so rb_erase_cached() has to call rb_next() to find
         out what's the next left most node and assign it to
         struct rb_root_cached::rb_leftmost;
      
      4) We can do that ourselves in those two uses cases and stop using a
         rb_root_cached rb tree and use instead a regular rb_root rb tree.
      
         This reduces the size of struct extent_map_tree by 8 bytes and, since
         this structure is embedded in struct btrfs_inode, it also reduces the
         size of that structure by 8 bytes.
      
         So on a 64 bits platform the size of btrfs_inode is reduced from 1032
         bytes down to 1024 bytes.
      
         This means we will be able to have 4 inodes per 4K page instead of 3.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4e660ca3
    • Filipe Manana's avatar
      btrfs: rename rb_root member of extent_map_tree from map to root · 7f5830bc
      Filipe Manana authored
      Currently we name the rb_root member of struct extent_map_tree as 'map',
      which is odd and confusing. Since it's a root node, rename it to 'root'.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7f5830bc
    • Filipe Manana's avatar
      btrfs: remove objectid from struct btrfs_inode on 64 bits platforms · 7a7bc214
      Filipe Manana authored
      On 64 bits platforms we don't really need to have a dedicated member (the
      objectid field) for the inode's number since we store in the VFS inode's
      i_ino member, which is an unsigned long and this type is 64 bits wide on
      64 bits platforms. We only need that field in case we are on a 32 bits
      platform because the unsigned long type is 32 bits wide on such platforms
      See commit 33345d01 ("Btrfs: Always use 64bit inode number") regarding
      this 64/32 bits detail.
      
      The objectid field of struct btrfs_inode is also used to store the ID of
      a root for directories that are stubs for unreferenced roots. In such
      cases the inode is a directory and has the BTRFS_INODE_ROOT_STUB runtime
      flag set.
      
      So in order to reduce the size of btrfs_inode structure on 64 bits
      platforms we can remove the objectid member and use the VFS inode's i_ino
      member instead whenever we need to get the inode number. In case the inode
      is a root stub (BTRFS_INODE_ROOT_STUB set) we can use the member
      last_reflink_trans to store the ID of the unreferenced root, since such
      inode is a directory and reflinks can't be done against directories.
      
      So remove the objectid fields for 64 bits platforms and alias the
      last_reflink_trans field with a name of ref_root_id in a union.
      On a release kernel config, this reduces the size of struct btrfs_inode
      from 1040 bytes down to 1032 bytes.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7a7bc214
    • Filipe Manana's avatar
      btrfs: remove location key from struct btrfs_inode · 068fc8f9
      Filipe Manana authored
      Currently struct btrfs_inode has a key member, named "location", that is
      either:
      
      1) The key of the inode's item. In this case the objectid is the number
         of the inode;
      
      2) A key stored in a dir entry with a type of BTRFS_ROOT_ITEM_KEY, for
         the case where we have a root that is a snapshot of a subvolume that
         points to other subvolumes. In this case the objectid is the ID of
         a subvolume inside the snapshotted parent subvolume.
      
      The key is only used to lookup the inode item for the first case, while
      for the second it's never used since it corresponds to directory stubs
      created with new_simple_dir() and which are marked as dummy, so there's
      no actual inode item to ever update. In the second case we only check
      the key type at btrfs_ino() for 32 bits platforms and its objectid is
      only needed for unlink.
      
      Instead of using a key we can do fine with just the objectid, since we
      can generate the key whenever we need it having only the objectid, as
      in all use cases the type is always BTRFS_INODE_ITEM_KEY and the offset
      is always 0.
      
      So use only an objectid instead of a full key. This reduces the size of
      struct btrfs_inode from 1048 bytes down to 1040 bytes on a release kernel.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      068fc8f9
    • Filipe Manana's avatar
      btrfs: don't allocate file extent tree for non regular files · 3d7db6e8
      Filipe Manana authored
      When not using the NO_HOLES feature we always allocate an io tree for an
      inode's file_extent_tree. This is wasteful because that io tree is only
      used for regular files, so we allocate more memory than needed for inodes
      that represent directories or symlinks for example, or for inodes that
      correspond to free space inodes.
      
      So improve on this by allocating the io tree only for inodes of regular
      files that are not free space inodes.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3d7db6e8
    • Filipe Manana's avatar
      btrfs: unify index_cnt and csum_bytes from struct btrfs_inode · d9891ae2
      Filipe Manana authored
      The index_cnt field of struct btrfs_inode is used only for two purposes:
      
      1) To store the index for the next entry added to a directory;
      
      2) For the data relocation inode to track the logical start address of the
         block group currently being relocated.
      
      For the relocation case we use index_cnt because it's not used for
      anything else in the relocation use case - we could have used other fields
      that are not used by relocation such as defrag_bytes, last_unlink_trans
      or last_reflink_trans for example (among others).
      
      Since the csum_bytes field is not used for directories, do the following
      changes:
      
      1) Put index_cnt and csum_bytes in a union, and index_cnt is only
         initialized when the inode is a directory. The csum_bytes is only
         accessed in IO paths for regular files, so we're fine here;
      
      2) Use the defrag_bytes field for relocation, since the data relocation
         inode is never used for defrag purposes. And to make the naming better,
         alias it to reloc_block_group_start by using a union.
      
      This reduces the size of struct btrfs_inode by 8 bytes in a release
      kernel, from 1056 bytes down to 1048 bytes.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d9891ae2
    • Filipe Manana's avatar
      btrfs: remove inode_lock from struct btrfs_root and use xarray locks · e2844cce
      Filipe Manana authored
      Currently we use the spinlock inode_lock from struct btrfs_root to
      serialize access to two different data structures:
      
      1) The delayed inodes xarray (struct btrfs_root::delayed_nodes);
      2) The inodes xarray (struct btrfs_root::inodes).
      
      Instead of using our own lock, we can use the spinlock that is part of the
      xarray implementation, by using the xa_lock() and xa_unlock() APIs and
      using the xarray APIs with the double underscore prefix that don't take
      the xarray locks and assume the caller is using xa_lock() and xa_unlock().
      
      So remove the spinlock inode_lock from struct btrfs_root and use the
      corresponding xarray locks. This brings 2 benefits:
      
      1) We reduce the size of struct btrfs_root, from 1336 bytes down to
         1328 bytes on a 64 bits release kernel config;
      
      2) We reduce lock contention by not using anymore the same lock for
         changing two different and unrelated xarrays.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e2844cce
    • Filipe Manana's avatar
      btrfs: reduce nesting and deduplicate error handling at btrfs_iget_path() · d25f4ec1
      Filipe Manana authored
      Make btrfs_iget_path() simpler and easier to read by avoiding nesting of
      if-then-else statements and having an error label to do all the error
      handling instead of repeating it a couple times.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d25f4ec1
    • Filipe Manana's avatar
      btrfs: preallocate inodes xarray entry to avoid transaction abort · 061ea858
      Filipe Manana authored
      When creating a new inode, at btrfs_create_new_inode(), one of the very
      last steps is to add the inode to the root's inodes xarray. This often
      requires allocating memory which may fail (even though xarrays have a
      dedicated kmem_cache which make it less likely to fail), and at that point
      we are forced to abort the current transaction (as some, but not all, of
      the inode metadata was added to its subvolume btree).
      
      To avoid a transaction abort, preallocate memory for the xarray early at
      btrfs_create_new_inode(), so that if we fail we don't need to abort the
      transaction and the insertion into the xarray is guaranteed to succeed.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      061ea858
    • Filipe Manana's avatar
      btrfs: use an xarray to track open inodes in a root · 310b2f5d
      Filipe Manana authored
      Currently we use a red black tree (rb-tree) to track the currently open
      inodes of a root (in struct btrfs_root::inode_tree). This however is not
      very efficient when the number of inodes is large since rb-trees are
      binary trees. For example for 100K open inodes, the tree has a depth of
      17. Besides that, inserting into the tree requires navigating through it
      and pulling useless cache lines in the process since the red black tree
      nodes are embedded within the btrfs inode - on the other hand, by being
      embedded, it requires no extra memory allocations.
      
      We can improve this by using an xarray instead, which is efficient when
      indices are densely clustered (such as inode numbers), is more cache
      friendly and behaves like a resizable array, with a much better search
      and insertion complexity than a red black tree. This only has one small
      disadvantage which is that insertion will sometimes require allocating
      memory for the xarray - which may fail (not that often since it uses a
      kmem_cache) - but on the other hand we can reduce the btrfs inode
      structure size by 24 bytes (from 1080 down to 1056 bytes) after removing
      the embedded red black tree node, which after the next patches will allow
      to reduce the size of the structure to 1024 bytes, meaning we will be able
      to store 4 inodes per 4K page instead of 3 inodes.
      
      This change does a straightforward change to use an xarray, and results
      in a transaction abort if we can't allocate memory for the xarray when
      creating an inode - but the next patch changes things so that we don't
      need to abort.
      
      Running the following fs_mark test showed some improvements:
      
          $ cat test.sh
          #!/bin/bash
      
          DEV=/dev/nullb0
          MNT=/mnt/nullb0
          MOUNT_OPTIONS="-o ssd"
          FILES=100000
          THREADS=$(nproc --all)
      
          echo "performance" | \
              tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
          mkfs.btrfs -f $DEV
          mount $MOUNT_OPTIONS $DEV $MNT
      
          OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k"
          for ((i = 1; i <= $THREADS; i++)); do
              OPTS="$OPTS -d $MNT/d$i"
          done
      
          fs_mark $OPTS
      
          umount $MNT
      
      Before this patch:
      
          FSUse%        Count         Size    Files/sec     App Overhead
              10      1200000            0      92081.6         12505547
              16      2400000            0     138222.6         13067072
              23      3600000            0     148833.1         13290336
              43      4800000            0      97864.7         13931248
              53      6000000            0      85597.3         14384313
      
      After this patch:
      
          FSUse%        Count         Size    Files/sec     App Overhead
              10      1200000            0      93225.1         12571078
              16      2400000            0     146720.3         12805007
              23      3600000            0     160626.4         13073835
              46      4800000            0     116286.2         13802927
              53      6000000            0      90087.9         14754892
      
      The test was run with a release kernel config (Debian's default config).
      
      Also capturing the insertion times into the rb tree and into the xarray,
      that is measuring the duration of the old function inode_tree_add() and
      the duration of the new btrfs_add_inode_to_root() function, gave the
      following results (in nanoseconds):
      
      Before this patch, inode_tree_add() execution times:
      
         Count: 5000000
         Range:  0.000 - 5536887.000; Mean: 775.674; Median: 729.000; Stddev: 4820.961
         Percentiles:  90th: 1015.000; 95th: 1139.000; 99th: 1397.000
               0.000 -       7.816:      40 |
               7.816 -      37.858:     209 |
              37.858 -     170.278:    6059 |
             170.278 -     753.961: 2754890 #####################################################
             753.961 -    3326.728: 2232312 ###########################################
            3326.728 -   14667.018:    4366 |
           14667.018 -   64652.943:     852 |
           64652.943 -  284981.761:     550 |
          284981.761 - 1256150.914:     221 |
         1256150.914 - 5536887.000:       7 |
      
      After this patch, btrfs_add_inode_to_root() execution times:
      
         Count: 5000000
         Range:  0.000 - 2900652.000; Mean: 272.148; Median: 241.000; Stddev: 2873.369
         Percentiles:  90th: 342.000; 95th: 432.000; 99th: 572.000
              0.000 -       7.264:     104 |
              7.264 -      33.145:     352 |
             33.145 -     140.081:  109606 #
            140.081 -     581.930: 4840090 #####################################################
            581.930 -    2407.590:   43532 |
           2407.590 -    9950.979:    2245 |
           9950.979 -   41119.278:     514 |
          41119.278 -  169902.616:     155 |
         169902.616 -  702018.539:      47 |
         702018.539 - 2900652.000:       9 |
      
      Average, percentiles, standard deviation, etc, are all much better.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      310b2f5d
    • Qu Wenruo's avatar
      btrfs: raid56: do extra dumping for CONFIG_BTRFS_ASSERT · bbbee460
      Qu Wenruo authored
      There are several hard-to-hit ASSERT()s hit inside raid56.
      Unfortunately the ASSERT() expression is a little complex, and except
      the ASSERT(), there is nothing to provide any clue.
      
      Considering if race is involved, it's pretty hard to reproduce.
      Meanwhile sometimes the dump of the rbio structure can provide some
      pretty good clues, it's worth to do the extra multi-line dump for
      btrfs raid56 related code.
      
      The dump looks like this:
      
        BTRFS critical (device dm-3): bioc logical=4598530048 full_stripe=4598530048 size=0 map_type=0x81 mirror=0 replace_nr_stripes=0 replace_stripe_src=-1 num_stripes=5
        BTRFS critical (device dm-3):     nr=0 devid=1 physical=1166147584
        BTRFS critical (device dm-3):     nr=1 devid=2 physical=1145176064
        BTRFS critical (device dm-3):     nr=2 devid=4 physical=1145176064
        BTRFS critical (device dm-3):     nr=3 devid=5 physical=1145176064
        BTRFS critical (device dm-3):     nr=4 devid=3 physical=1145176064
        BTRFS critical (device dm-3): rbio flags=0x0 nr_sectors=80 nr_data=4 real_stripes=5 stripe_nsectors=16 scrubp=0 dbitmap=0x0
        BTRFS critical (device dm-3): logical=4598530048
        assertion failed: orig_logical >= full_stripe_start && orig_logical + orig_len <= full_stripe_start + rbio->nr_data * BTRFS_STRIPE_LEN, in fs/btrfs/raid56.c:1702
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bbbee460