1. 12 Oct, 2023 40 commits
    • btrfs: abort transaction on generation mismatch when marking eb as dirty · 50564b65
      Filipe Manana authored
      When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(),
      we check if its generation matches the running transaction and if not we
      just print a warning. Such a mismatch is an indicator that something went
      really wrong, and only printing a warning message (and stack trace) is
      not enough to prevent corruption. Allowing a transaction to commit with
      such an extent buffer will trigger an error if we ever try to read it
      from disk, due to a generation mismatch with its parent generation.
      
      So abort the current transaction with -EUCLEAN if we notice a generation
      mismatch. For this we need to pass a transaction handle to
      btrfs_mark_buffer_dirty() which is always available except in test code,
      in which case we can pass NULL since it operates on dummy extent buffers
      and all test roots have a single node/leaf (root node at level 0).
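
The check described above can be sketched as a small userspace model. All names here (model_trans, model_eb, model_mark_buffer_dirty) are hypothetical stand-ins, not the kernel's structures or signatures, and leaving the buffer clean on a mismatch is a simplification chosen for this sketch rather than something the commit message states.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* Linux errno value for "Structure needs cleaning" */
#endif

/* Hypothetical stand-ins, reduced to what the check needs. */
struct model_trans {
	uint64_t transid; /* generation of the running transaction */
	int aborted;      /* 0, or the negative errno that aborted it */
};

struct model_eb {
	uint64_t generation;
	int dirty;
};

/*
 * Mark an extent buffer dirty. A NULL handle models the test code path,
 * which operates on dummy extent buffers and skips the check.
 */
static int model_mark_buffer_dirty(struct model_trans *trans,
				   struct model_eb *eb)
{
	assert(eb != NULL);

	if (trans && eb->generation != trans->transid) {
		/* Abort the transaction instead of only warning. */
		trans->aborted = -EUCLEAN;
		return -EUCLEAN;
	}
	eb->dirty = 1;
	return 0;
}
```

In this model a buffer whose generation equals trans->transid is marked dirty and 0 is returned, while any other generation records -EUCLEAN in the handle, so a later commit attempt can refuse to proceed.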
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scan but don't register device on single device filesystem · bc27d6f0
      Anand Jain authored
      After the commit 5f58d783 ("btrfs: free device in btrfs_close_devices
      for a single device filesystem") we unregister the device from the kernel
      memory upon unmounting for a single device.
      
      So any device registration performed before mounting is no longer in
      the kernel memory.
      
      However, device registration is unnecessary for a single-device btrfs
      filesystem unless it's a seed device.
      
      So commands like 'btrfs device scan' or 'btrfs device ready' on a
      non-seed single-device btrfs filesystem can return success just after
      superblock verification, without the actual device scan. When
      'device scan --forget' is called on such a device, no error is returned.
      
      The seed device must remain in the kernel memory to allow the sprout
      device to mount without the need to specify the seed device explicitly.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename errno identifiers to error · ed164802
      David Sterba authored
      We sync the kernel files to userspace, where the 'errno' symbol is
      defined by the standard library. This does not matter in the kernel, but
      parameters or local variables named 'errno' could clash in userspace.
      Rename them all.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: always reserve space for delayed refs when starting transaction · 28270e25
      Filipe Manana authored
      When starting a transaction (or joining an existing one with
      btrfs_start_transaction()), we reserve space for the number of items we
      want to insert in a btree, but we don't do it for the delayed refs we
      will generate while using the transaction to modify (COW) extent buffers
      in a btree or allocate new extent buffers. Basically how it works:
      
      1) When we start a transaction we reserve space for the number of items
         the caller wants to be inserted/modified/deleted in a btree. This space
         goes to the transaction block reserve;
      
      2) If the delayed refs block reserve is not full (its size is greater
         than the amount of its reserved space) and the flush method is
         BTRFS_RESERVE_FLUSH_ALL, then we attempt to reserve more space for
         it corresponding to the number of items the caller wants to
         insert/modify/delete in a btree;
      
      3) The size of the delayed refs block reserve is increased when a task
         creates delayed refs after COWing an extent buffer, allocating a new
         one or deleting (freeing) an extent buffer. This happens after the
         task started or joined a transaction, whenever it calls
         btrfs_update_delayed_refs_rsv();
      
      4) The delayed refs block reserve is then refilled by anyone calling
         btrfs_delayed_refs_rsv_refill(), either during unlink/truncate
         operations or when someone else calls btrfs_start_transaction() with
         a 0 number of items and flush method BTRFS_RESERVE_FLUSH_ALL;
      
      5) As a task COWs or allocates extent buffers, it consumes space from the
         transaction block reserve. When the task releases its transaction
         handle (btrfs_end_transaction()) or it attempts to commit the
         transaction, it releases any remaining space in the transaction block
         reserve that it did not use, as not all space may have been used (due
         to pessimistic space calculation) by calling btrfs_block_rsv_release()
         which will try to add that unused space to the delayed refs block
         reserve (if its current size is greater than its reserved space).
         That transferred space may not be enough to completely fulfill the
         delayed refs block reserve.
      
         Plus we have some tasks that will attempt to modify as many leaves
         as they can before getting -ENOSPC (and then reserving more space and
         retrying), such as hole punching and extent cloning which call
         btrfs_replace_file_extents(). Such tasks can therefore generate a
         high number of delayed refs, for both metadata and data (we can't
         know in advance how many file extent items we will find in a range
         and therefore how many delayed refs for dropping references on data
         extents we will generate);
      
      6) If a transaction starts its commit before the delayed refs block
         reserve is refilled, for example by the transaction kthread or by
         someone who called btrfs_join_transaction() before starting the
         commit, then when running delayed references if we don't have enough
         reserved space in the delayed refs block reserve, we will consume
         space from the global block reserve.
      
      Now this doesn't make a lot of sense because:
      
      1) We should reserve space for delayed references when starting the
         transaction, since we have no guarantees the delayed refs block
         reserve will be refilled;
      
      2) If no refill happens then we will consume from the global block reserve
         when running delayed refs during the transaction commit;
      
      3) If we have a bunch of tasks calling btrfs_start_transaction() with a
         number of items greater than zero and at the time the delayed refs
         reserve is full, then we don't reserve any space at
         btrfs_start_transaction() for the delayed refs that will be generated
         by a task, and we can therefore end up using a lot of space from the
         global reserve when running the delayed refs during a transaction
         commit;
      
      4) There are also other operations that result in bumping the size of the
         delayed refs reserve, such as creating and deleting block groups, as
         well as the need to update a block group item because we allocated or
         freed an extent from the respective block group;
      
      5) If we have a significant gap between the delayed refs reserve's size
         and its reserved space, two very bad things may happen:
      
         1) The reserved space of the global reserve may not be enough and we
            fail the transaction commit with -ENOSPC when running delayed refs;
      
         2) If the available space in the global reserve is enough it may result
            in nearly exhausting it. If the fs has no more unallocated device
            space for allocating a new block group and all the available space
            in existing metadata block groups is not far from the global
            reserve's size before we started the transaction commit, we may end
            up in a situation where after the transaction commit we have too
            little available metadata space, and any future transaction commit
            will fail with -ENOSPC, because although we were able to reserve
            space to start the transaction, we were not able to commit it, as
            running delayed refs generates some more delayed refs (to update the
            extent tree for example) - this includes not even being able to
            commit a transaction that was started with the goal of unlinking a
            file, removing an empty data block group or doing reclaim/balance,
            so there's no way to release metadata space.
      
            In the worst case the next time we mount the filesystem we may
            also fail with -ENOSPC due to failure to commit a transaction to
            clean up orphan inodes. This latter case was reported by someone
            running a SLE (SUSE Linux Enterprise) distribution, where the fs
            had no more unallocated space that could be used to allocate a
            new metadata block group, and the available metadata space was
            about 1.5M, not enough to commit a transaction to clean up an
            orphan inode (or do relocation of data block groups that were far
            from being full).
      
      So improve on this situation by always reserving space for delayed refs
      when calling start_transaction(), and if the flush method is
      BTRFS_RESERVE_FLUSH_ALL, also try to refill the delayed refs block
      reserve if it's not full. The space reserved for the delayed refs is added
      to a local block reserve that is part of the transaction handle, and when
      a task updates the delayed refs block reserve size, after creating a
      delayed ref, the space is transferred from that local reserve to the
      global delayed refs reserve (fs_info->delayed_refs_rsv). In case the
      local reserve does not have enough space, which may happen for tasks
      that generate a variable and potentially large number of delayed refs
      (such as the hole punching and extent cloning cases mentioned before),
      we transfer any available space and then rely on the current behaviour
      of hoping some other task refills the delayed refs reserve or falling back
      to the global block reserve.
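
The transfer from the per-handle local reserve to the global delayed refs reserve can be sketched as follows. This is a simplified userspace model under stated assumptions: the structure and function names are hypothetical, byte amounts are arbitrary, and locking, flushing and the refill path are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct model_rsv {
	uint64_t size;     /* bytes the reserve should hold */
	uint64_t reserved; /* bytes it actually holds */
};

struct model_trans_handle {
	/* Local reserve filled when the transaction is started. */
	uint64_t local_delayed_reserved;
};

/* At transaction start: always reserve space for delayed refs. */
static void model_start_transaction(struct model_trans_handle *trans,
				    uint64_t delayed_refs_bytes)
{
	trans->local_delayed_reserved = delayed_refs_bytes;
}

/*
 * After creating delayed refs worth 'bytes': grow the global reserve's size
 * and move as much as possible from the local reserve into it.
 * Returns the shortfall that must come from a later refill or, failing
 * that, from the global block reserve.
 */
static uint64_t model_update_delayed_refs_rsv(struct model_trans_handle *trans,
					      struct model_rsv *global,
					      uint64_t bytes)
{
	uint64_t moved = bytes < trans->local_delayed_reserved ?
			 bytes : trans->local_delayed_reserved;

	assert(global != NULL);
	global->size += bytes;
	trans->local_delayed_reserved -= moved;
	global->reserved += moved;
	return bytes - moved;
}
```

A task that generates more delayed refs than it reserved for, as in the hole punching and extent cloning cases, sees a non-zero shortfall in this model, which corresponds to the gap between the global reserve's size and its reserved space.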
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: stop doing excessive space reservation for csum deletion · adb86dbe
      Filipe Manana authored
      Currently when reserving space for deleting the csum items for a data
      extent, when adding or updating a delayed ref head, we determine how
      many leaves of csum items we can have and then pass that number to the
      helper btrfs_calc_delayed_ref_bytes(). This helper is used for calculating
      space for all tree modifications we need when running delayed references,
      however the amount of space it computes is excessive for deleting csum
      items because:
      
      1) It uses btrfs_calc_insert_metadata_size() which is excessive because
         we only need to delete csum items from the csum tree, we don't need
         to insert any items, so btrfs_calc_metadata_size() is all we need (as
         it computes space needed to delete an item);
      
      2) If the free space tree is enabled, it doubles the amount of space,
         which is pointless for csum deletion since we don't need to touch the
         free space tree or any other tree other than the csum tree.
      
      So improve on this by tracking how many csum deletions we have and using
      a new helper to calculate space for csum deletions (just a wrapper around
      btrfs_calc_metadata_size() with a comment). This reduces the amount of
      space we need to reserve for csum deletions by a factor of 4, and it helps
      reduce the number of times we have to block space reservations and have
      the reclaim task enter the space flushing algorithm (flush delayed items,
      flush delayed refs, etc) in order to satisfy tickets.
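
The factor of 4 can be checked with a little arithmetic. The sketch below assumes a maximum btree height of 8 and that the insertion estimate is twice the deletion estimate, which is what the stated factor implies together with the free space tree doubling. The function names are hypothetical models, not the kernel helpers themselves.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_LEVEL 8 /* assumed maximum btree height */

/* Space to modify or delete items: one full path per item. */
static uint64_t model_calc_metadata_size(uint32_t nodesize,
					 unsigned int num_items)
{
	assert(nodesize > 0);
	return (uint64_t)nodesize * MODEL_MAX_LEVEL * num_items;
}

/* Space to insert items: assumed to be twice the deletion estimate. */
static uint64_t model_calc_insert_metadata_size(uint32_t nodesize,
						unsigned int num_items)
{
	return 2 * model_calc_metadata_size(nodesize, num_items);
}

/* Old csum reservation: insertion size, doubled with the free space tree. */
static uint64_t model_old_csum_bytes(uint32_t nodesize,
				     unsigned int csum_leaves,
				     int free_space_tree)
{
	uint64_t bytes = model_calc_insert_metadata_size(nodesize, csum_leaves);

	return free_space_tree ? bytes * 2 : bytes;
}

/* New csum reservation: deletion only, csum tree only. */
static uint64_t model_new_csum_bytes(uint32_t nodesize,
				     unsigned int csum_leaves)
{
	return model_calc_metadata_size(nodesize, csum_leaves);
}
```

With a 16K node size and one csum leaf, the model gives 131072 bytes for the new reservation against 524288 bytes for the old one when the free space tree is enabled. Without the free space tree the ratio in this model is 2 rather than 4.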
      
      For example this results in a total time decrease when unlinking (or
      truncating) files with many extents, as we end up having to block on
      metadata space reservations less often. Example test:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nullb0
        MNT=/mnt/test
      
        umount $DEV &> /dev/null
        mkfs.btrfs -f $DEV
        # Use compression to quickly create files with a lot of extents
        # (each with a size of 128K).
        mount -o compress=lzo $DEV $MNT
      
        # 120G gives at least 983040 extents with a size of 128K.
        xfs_io -f -c "pwrite -S 0xab -b 1M 0 120G" $MNT/foobar
      
        # Flush all delalloc and clear all metadata from memory.
        umount $MNT
        mount -o compress=lzo $DEV $MNT
      
        start=$(date +%s%N)
        rm -f $MNT/foobar
        end=$(date +%s%N)
        dur=$(( (end - start) / 1000000 ))
        echo "rm took $dur milliseconds"
      
        umount $MNT
      
      Before this change rm took: 7504 milliseconds
      After this change rm took:  6574 milliseconds  (-12.4%)
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove pointless initialization at btrfs_delayed_refs_rsv_release() · b6ea3e6a
      Filipe Manana authored
      There's no point in initializing to 0 the local variable 'released' as
      we don't use it before the next assignment to it. So remove the
      initialization. This may help avoid some warnings with clang tools such
      as the one reported/fixed by commit 966de47f ("btrfs: remove redundant
      initialization of variables in log_new_ancestors").
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reserve space for delayed refs on a per ref basis · 3ee56a58
      Filipe Manana authored
      Currently when reserving space for delayed refs we do it on a per ref head
      basis. This is generally enough because most back refs for an extent end
      up being inlined in the extent item - with the default leaf size of 16K we
      can have at most 33 inline back refs (this is calculated by the macro
      BTRFS_MAX_EXTENT_ITEM_SIZE()). The amount of bytes reserved for each ref
      head is given by btrfs_calc_delayed_ref_bytes(), which basically
      corresponds to a single path for insertion into the extent tree plus
      another path for insertion into the free space tree if it's enabled.
      
      However if we have reached the limit of inline refs or we have a mix of
      inline and non-inline refs, then we will need to insert a non-inline ref
      and update the existing extent item to update the total number of
      references for the extent. This implies we need reserved space for two
      insertion paths in the extent tree, but we only reserved for one path.
      The extent item and the non-inline ref item may be located in different
      leaves, or even if they are located in the same leaf, after updating the
      extent item and before inserting the non-inline ref item, the extent
      buffers in the btree path may have been written (due to memory pressure
      for example), in which case we need to COW the entire path again. In this
      case since we have not reserved enough space for the delayed refs block
      reserve, we will use the global block reserve.
      
      If the fs does not have enough unallocated space to allocate a new
      metadata block group, and the available space in the existing metadata
      block groups is close to the maximum size of the global block reserve
      (512M), we may end up consuming so much of the free metadata space that
      no future transaction can be committed: the commit will fail with
      -ENOSPC when trying to allocate an extent for some COW operation
      (running delayed refs generated by running delayed refs, or COWing the
      root tree's root node at commit_cowonly_roots(), for example). Such a
      dramatic scenario can happen if we have many delayed refs that require
      the insertion of non-inline ref items, due to too many reflinks or
      snapshots.
      
      We also have situations where we use the global block reserve because
      we could not know in advance that we will need space to update some
      trees (block group creation for example). This all adds up to increase
      the chances of exhausting the global block reserve, making any future
      transaction commit fail with -ENOSPC and turning the fs into RO mode,
      or failing the mount operation in case the mount needs to start and
      commit a transaction, such as when we have orphans to clean up. Such a
      case was reported by someone running a SLE (SUSE Linux Enterprise)
      distribution, where the fs had no more unallocated space that could be
      used to allocate a new metadata block group, and the available metadata
      space was about 1.5M, not enough to commit a transaction to clean up an
      orphan inode (or do relocation of data block groups that were far from
      being full).
      
      So reserve space for delayed refs by individual refs and not by ref heads,
      as we may need to COW multiple extent tree paths due to non-inline ref
      items.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow to run delayed refs by bytes to be released instead of count · 8a526c44
      Filipe Manana authored
      When running delayed references, through btrfs_run_delayed_refs(), we can
      specify how many to run, run all existing delayed references, or keep
      running delayed references while we can find any. This is controlled with
      the value of the 'count' argument, where a value of 0 means to run all
      delayed references that exist by the time btrfs_run_delayed_refs() is
      called, (unsigned long)-1 means to keep running delayed references while
      we are able to find any, and any other value means to run that exact
      number of delayed references.
      
      Typically a specific value other than 0 or -1 is used when flushing space
      to try to release a certain amount of bytes for a ticket. In this case
      we just simply calculate how many delayed reference heads correspond to a
      specific amount of bytes, with calc_delayed_refs_nr(). However that only
      takes into account the space reserved for the reference heads themselves,
      and does not account for the space reserved for deleting checksums from
      the csum tree (see add_delayed_ref_head() and update_existing_head_ref())
      in case we are going to delete a data extent. This means we may end up
      running more delayed references than necessary in case we process delayed
      references for deleting a data extent.
      
      So change the logic of btrfs_run_delayed_refs() to take a bytes argument
      to specify how many bytes of delayed references to run/release, using the
      special values of 0 to mean all existing delayed references and U64_MAX
      (or (u64)-1) to keep running delayed references while we can find any.
      
      This prevents running more delayed references than necessary, when we have
      delayed references for deleting data extents, but also makes the upcoming
      changes/patches simpler and it's preparatory work for them.
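
The byte-based stop condition can be illustrated with a small userspace model. It is a sketch under assumptions: the model_ref_head structure and model_run_delayed_refs() are hypothetical, each head simply carries the bytes reserved for it (including any csum deletion space), and no new references arrive while running, so UINT64_MAX behaves like "run everything" here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct model_ref_head {
	uint64_t reserved_bytes; /* includes space for csum deletion, if any */
	int done;
};

/*
 * Run pending heads in order until at least 'bytes' have been released.
 * A value of 0 means "run everything queued at call time".
 * Returns the number of bytes released.
 */
static uint64_t model_run_delayed_refs(struct model_ref_head *heads,
				       size_t nr, uint64_t bytes)
{
	uint64_t released = 0;

	assert(heads != NULL || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		if (heads[i].done)
			continue;
		if (bytes != 0 && released >= bytes)
			break;
		heads[i].done = 1;
		released += heads[i].reserved_bytes;
	}
	return released;
}
```

With heads reserving 100, 400, 100 and 100 bytes, a request for 450 bytes stops after the second head in this model, because the larger data extent head already covers the request. A conversion to a head count that ignores the csum space would keep going.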
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify check for extent item overrun at lookup_inline_extent_backref() · da8848ac
      Filipe Manana authored
      At lookup_inline_extent_backref() we can simplify the check for an overrun
      of the extent item by making the while loop's condition "ptr < end" and
      then checking after the loop if an overrun happened ("ptr > end"). This
      reduces indentation and makes the loop condition clearer. So move the
      check out of the loop and change the loop condition accordingly, while
      also adding the 'unlikely' tag to the check since it's not supposed to be
      triggered.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: return -EUCLEAN if extent item is missing when searching inline backref · eba444f1
      Filipe Manana authored
      At lookup_inline_extent_backref() when trying to insert an inline backref,
      if we don't find the extent item we log an error and then return -EIO.
      This error code is confusing because there was actually no IO error, and
      this means we have some corruption, either caused by a bug or something
      like a memory bitflip for example. So change the error code from -EIO to
      -EUCLEAN.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use a single variable for return value at lookup_inline_extent_backref() · cc925b96
      Filipe Manana authored
      At lookup_inline_extent_backref(), instead of using a 'ret' and an 'err'
      variable for tracking the return value, use a single one ('ret'). This
      simplifies the code, makes it comply with most of the existing code and
      makes it less prone to logic errors, as time has proven over and over in
      the btrfs code.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use a single variable for return value at run_delayed_extent_op() · 20fb05a6
      Filipe Manana authored
      Instead of using a 'ret' and an 'err' variable at run_delayed_extent_op()
      for tracking the return value, use a single one ('ret'). This simplifies
      the code, makes it comply with most of the existing code and makes it
      less prone to logic errors, as time has proven over and over in the btrfs
      code.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove pointless 'ref_root' variable from run_delayed_data_ref() · e721043a
      Filipe Manana authored
      The 'ref_root' variable, at run_delayed_data_ref(), is not really needed
      as we can always use ref->root directly, plus its initialization to 0 is
      completely pointless as we assign it ref->root before its first use.
      So just drop that variable and use ref->root directly.
      
      This may help avoid some warnings with clang tools such as the one
      reported/fixed by commit 966de47f ("btrfs: remove redundant
      initialization of variables in log_new_ancestors").
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: initialize key where it's used when running delayed data ref · 7cce0d69
      Filipe Manana authored
      At run_delayed_data_ref() we are always initializing a key but the key
      is only needed and used if we are inserting a new extent. So move the
      declaration and initialization of the key to the 'if' branch where it's
      used. Also rename the key from 'ins' to 'key', as it's a clearer name.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove refs_to_drop argument from __btrfs_free_extent() · 1df6b3c0
      Filipe Manana authored
      Currently the 'refs_to_drop' argument of __btrfs_free_extent() always
      matches the value of node->ref_mod, so remove the argument and use
      node->ref_mod at __btrfs_free_extent().
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove refs_to_add argument from __btrfs_inc_extent_ref() · 88b2d088
      Filipe Manana authored
      Currently the 'refs_to_add' argument of __btrfs_inc_extent_ref() always
      matches the value of node->ref_mod, so remove the argument and use
      node->ref_mod at __btrfs_inc_extent_ref().
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove the refcount warning/check at btrfs_put_delayed_ref() · abff279e
      Filipe Manana authored
      At btrfs_put_delayed_ref(), it's pointless to have a WARN_ON() to check if
      the refcount of the delayed ref is zero. Such check is already done by the
      refcount_t module and refcount_dec_and_test(), which loudly complains if
      we try to decrement a reference count that is currently 0.
      
      The WARN_ON() dates back to the time when we used a regular atomic_t type
      for the reference counter, before we switched to the refcount_t type.
      The main goal of the refcount_t type/module is precisely to catch such
      types of bugs and loudly complain if they happen.
      
      This also slightly reduces the module's text size (by 112 bytes).
      Before this change:
      
         $ size fs/btrfs/btrfs.ko
            text	   data	    bss	    dec	    hex	filename
         1612483	 167145	  16864	1796492	 1b698c	fs/btrfs/btrfs.ko
      
      After this change:
      
         $ size fs/btrfs/btrfs.ko
            text	   data	    bss	    dec	    hex	filename
         1612371	 167073	  16864	1796308	 1b68d4	fs/btrfs/btrfs.ko
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unnecessary logic when running new delayed references · 3cbb9f51
      Filipe Manana authored
      When running delayed references, at btrfs_run_delayed_refs(), we have this
      logic to run any new delayed references that might have been added just
      after we ran all delayed references. This logic grabs the first delayed
      reference, then locks it to wait for any contention on it before running
      all new delayed references. This however is pointless and not necessary
      because at __btrfs_run_delayed_refs() when we start running delayed
      references, we pick the first reference with btrfs_obtain_ref_head() and
      then we will lock it (with btrfs_delayed_ref_lock()).
      
      So remove the duplicate and unnecessary logic at btrfs_run_delayed_refs().
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: pass a space_info argument to btrfs_reserve_metadata_bytes() · 03551d65
      Filipe Manana authored
      We are passing a block reserve argument to btrfs_reserve_metadata_bytes()
      which is not really used, all we need is to pass the space_info associated
      to the block reserve, we don't change the block reserve at all.
      
      Not only is it pointless to pass the block reserve, it's also confusing as
      one might think that the reserved bytes will end up being added to the
      passed block reserve, when that's not the case. The pattern for reserving
      space and adding it to a block reserve is to first reserve space with
      btrfs_reserve_metadata_bytes() and if that succeeds, then add the space to
      a block reserve by calling btrfs_block_rsv_add_bytes().
      
      Also the reverse of btrfs_reserve_metadata_bytes(), which is
      btrfs_space_info_free_bytes_may_use(), takes a space_info argument and
      not a block reserve, so one more reason to pass a space_info and not a
      block reserve to btrfs_reserve_metadata_bytes().
      
      So change btrfs_reserve_metadata_bytes() and its callers to pass a
      space_info argument instead of a block reserve argument.
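
The two-step pattern mentioned above (reserve against the space_info first, then account the bytes in a block reserve) can be sketched as a userspace model. The structures and function names are hypothetical and greatly simplified: there is no overcommit, flushing or ticketing, and the model assumes bytes_may_use never exceeds total_bytes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct model_space_info {
	uint64_t total_bytes;
	uint64_t bytes_may_use; /* space promised to reservations */
};

struct model_block_rsv {
	struct model_space_info *space_info;
	uint64_t reserved;
};

/* Step 1: reserve against the space_info only; no block reserve changes. */
static int model_reserve_metadata_bytes(struct model_space_info *sinfo,
					uint64_t bytes)
{
	assert(sinfo != NULL);

	if (bytes > sinfo->total_bytes - sinfo->bytes_may_use)
		return -ENOSPC;
	sinfo->bytes_may_use += bytes;
	return 0;
}

/* Step 2: account the already reserved bytes in a block reserve. */
static void model_block_rsv_add_bytes(struct model_block_rsv *rsv,
				      uint64_t bytes)
{
	rsv->reserved += bytes;
}

/* The two steps combined: add to the reserve only if step 1 succeeded. */
static int model_reserve_into(struct model_block_rsv *rsv, uint64_t bytes)
{
	int ret = model_reserve_metadata_bytes(rsv->space_info, bytes);

	if (ret == 0)
		model_block_rsv_add_bytes(rsv, bytes);
	return ret;
}
```

Taking a space_info in the first step makes it visible at the call site that the block reserve is untouched until the second step runs, which is the confusion the commit sets out to remove.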
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove the need_raid_map parameter from btrfs_map_block() · 9fb2acc2
      Qu Wenruo authored
      The parameter @need_raid_map is mostly a legacy from the old days, when
      we did not yet have a solid definition of @mirror_num and only
      check-integrity was using that parameter, while all other call sites
      just passed 1 for it.
      
      Now since we have removed check-integrity functionality, we can also
      remove the @need_raid_map parameter.
      
      This change will also remove the ability to read P/Q stripe directly
      when passing 0 as @need_raid_map.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: check-integrity: remove CONFIG_BTRFS_FS_CHECK_INTEGRITY option · 732fab95
      Qu Wenruo authored
      Since all check-integrity entry points have been removed, let's also
      remove the config and all related code relying on that.
      
      And since we have removed the mount option for check-integrity, we also
      need to re-number all the BTRFS_MOUNT_* enums.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      732fab95
    • Qu Wenruo's avatar
      btrfs: check-integrity: remove btrfsic_unmount() function · fb2a836d
      Qu Wenruo authored
      The function btrfsic_unmount() is part of the deprecated check-integrity
      functionality.
      
      Now let's remove this entry point of check-integrity. Thankfully most of
      the check-integrity code is self-contained inside check-integrity.c, so
      we can safely remove the function without huge changes to the btrfs
      code base.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fb2a836d
    • Qu Wenruo's avatar
      btrfs: check-integrity: remove btrfsic_mount() function · af32d363
      Qu Wenruo authored
      The function btrfsic_mount() is part of the deprecated check-integrity
      functionality.
      
      Now let's remove the main entry point of check-integrity. Thankfully
      most of the check-integrity code is self-contained inside
      check-integrity.c, so we can safely remove the function without huge
      changes to the btrfs code base.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      af32d363
    • Qu Wenruo's avatar
      btrfs: check-integrity: remove btrfsic_check_bio() function · 51cf580c
      Qu Wenruo authored
      The function btrfsic_check_bio() is part of the deprecated
      check-integrity functionality.
      
      Now let's remove the main entry point of check-integrity. Thankfully
      most of the check-integrity code is self-contained inside
      check-integrity.c, so we can safely remove the function without huge
      changes to the btrfs code base.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      51cf580c
    • David Sterba's avatar
      btrfs: move extent_buffer::lock_owner to debug section · 150cce2d
      David Sterba authored
      The lock_owner is used for a rare corruption case and we haven't seen
      any reports in years. Move it to the debugging section of eb.  To close
      the holes also move log_index so the final layout looks like:
      
      struct extent_buffer {
              u64                        start;                /*     0     8 */
              long unsigned int          len;                  /*     8     8 */
              long unsigned int          bflags;               /*    16     8 */
              struct btrfs_fs_info *     fs_info;              /*    24     8 */
              spinlock_t                 refs_lock;            /*    32     4 */
              atomic_t                   refs;                 /*    36     4 */
              int                        read_mirror;          /*    40     4 */
              s8                         log_index;            /*    44     1 */
      
              /* XXX 3 bytes hole, try to pack */
      
              struct callback_head       callback_head __attribute__((__aligned__(8))); /*    48    16 */
              /* --- cacheline 1 boundary (64 bytes) --- */
              struct rw_semaphore        lock;                 /*    64    40 */
              struct page *              pages[16];            /*   104   128 */
      
              /* size: 232, cachelines: 4, members: 11 */
              /* sum members: 229, holes: 1, sum holes: 3 */
              /* forced alignments: 1, forced holes: 1, sum forced holes: 3 */
              /* last cacheline: 40 bytes */
      } __attribute__((__aligned__(8)));
      
      This saves 8 bytes in total and still keeps the lock on a separate cacheline.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      150cce2d
    • David Sterba's avatar
      btrfs: reduce size of struct btrfs_ref · 321f4992
      David Sterba authored
      We can reduce the size of two members, which in turn reduces the size of
      struct btrfs_ref from 64 to 56 bytes. As the structure is often used as a
      local variable, several functions reduce their stack usage.
      
      - make enum btrfs_ref_type packed, there are only 4 values
      
      - switch action and its values to a packed enum
      
      Final structure layout:
      
      struct btrfs_ref {
              enum btrfs_ref_type        type;                 /*     0     1 */
              enum btrfs_delayed_ref_action action;            /*     1     1 */
              bool                       skip_qgroup;          /*     2     1 */
      
              /* XXX 5 bytes hole, try to pack */
      
              u64                        bytenr;               /*     8     8 */
              u64                        len;                  /*    16     8 */
              u64                        parent;               /*    24     8 */
              union {
                      struct btrfs_data_ref data_ref;          /*    32    24 */
                      struct btrfs_tree_ref tree_ref;          /*    32    16 */
              };                                               /*    32    24 */
      
              /* size: 56, cachelines: 1, members: 7 */
              /* sum members: 51, holes: 1, sum holes: 5 */
              /* last cacheline: 56 bytes */
      };
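
The effect of packing an enum can be checked with a small standalone example. This is a sketch assuming GCC or Clang on a typical 64-bit ABI; the enum and struct names are hypothetical and only mirror the first few members of the layout above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical enums: a plain enum is usually as wide as an int. */
enum ref_type_plain_sketch { PLAIN_A, PLAIN_B, PLAIN_C, PLAIN_D };

/* With the packed attribute the compiler picks the smallest fitting type. */
enum ref_type_packed_sketch {
	PACKED_A, PACKED_B, PACKED_C, PACKED_D
} __attribute__((__packed__));

/* Header with plain enums: 4 + 4 + 1 bytes, then padding before the u64. */
struct ref_plain_sketch {
	enum ref_type_plain_sketch type;
	enum ref_type_plain_sketch action;
	bool skip_qgroup;
	uint64_t bytenr;
};

/* Header with packed enums: 1 + 1 + 1 bytes, then padding before the u64. */
struct ref_packed_sketch {
	enum ref_type_packed_sketch type;
	enum ref_type_packed_sketch action;
	bool skip_qgroup;
	uint64_t bytenr;
};

/* Bytes saved by switching both enum members to the packed variant. */
static size_t ref_bytes_saved_sketch(void)
{
	return sizeof(struct ref_plain_sketch) -
	       sizeof(struct ref_packed_sketch);
}
```

Under these assumptions the three leading members fit in the first 8-byte slot, which is consistent with the 8 bytes saved in the commit message (64 to 56).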
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      321f4992
    • David Sterba's avatar
      btrfs: reduce size and reorder compression members in struct btrfs_inode · e41570d3
      David Sterba authored
      Currently the compression type values are bounded and fit into a u8, so
      we can pack the btrfs_inode a bit by moving them into the space created
      by the location key. This reduces the size from 1112 to 1104 bytes.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e41570d3
    • David Sterba's avatar
      btrfs: reduce size of prelim_ref::level · 105c8c42
      David Sterba authored
      The values of level are bounded and fit into a byte so let's use it for
      the structure to reduce size from 88 to 80 bytes on a release build,
      which increases number of objects in the default 8K slab from 93 to 102.
      
      struct prelim_ref {
              struct rb_node             rbnode __attribute__((__aligned__(8))); /*     0    24 */
              u64                        root_id;              /*    24     8 */
              struct btrfs_key           key_for_search;       /*    32    17 */
              u8                         level;                /*    49     1 */
      
              /* XXX 2 bytes hole, try to pack */
      
              int                        count;                /*    52     4 */
              struct extent_inode_elem * inode_list;           /*    56     8 */
              /* --- cacheline 1 boundary (64 bytes) --- */
              u64                        parent;               /*    64     8 */
              u64                        wanted_disk_byte;     /*    72     8 */
      
              /* size: 80, cachelines: 2, members: 8 */
              /* sum members: 78, holes: 1, sum holes: 2 */
              /* forced alignments: 1 */
              /* last cacheline: 16 bytes */
      } __attribute__((__aligned__(8)));
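
The object counts quoted above follow from integer division of the slab size by the object size. A minimal sketch, assuming an 8192-byte slab and ignoring any per-slab metadata or alignment the allocator may add (the function name is hypothetical):

```c
#include <stddef.h>

/* Number of whole objects of @obj_size bytes that fit in @slab_bytes. */
static size_t objects_per_slab_sketch(size_t slab_bytes, size_t obj_size)
{
	return slab_bytes / obj_size;
}
```

With 8192 bytes this gives 93 objects at 88 bytes each and 102 objects at 80 bytes each, matching the numbers in the commit message.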
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      105c8c42
    • David Sterba's avatar
      btrfs: reduce arguments of helpers space accounting root item · 02cd00fa
      David Sterba authored
      There are two helpers to increase used bytes of root items that add or
      subtract one node size, we don't need to pass the argument for that.
      Rename the function so it matches the root item member that gets
      changed.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      02cd00fa
    • David Sterba's avatar
      btrfs: reduce parameters of btrfs_pin_extent_for_log_replay · 007dec8c
      David Sterba authored
      Both callers of btrfs_pin_extent_for_log_replay expand the parameters to
      extent buffer members. We can simply pass the extent buffer instead.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      007dec8c
    • David Sterba's avatar
      btrfs: reduce parameters of btrfs_pin_reserved_extent · f863c502
      David Sterba authored
      There is only one caller of btrfs_pin_reserved_extent that expands the
      parameters to extent buffer members. We can simply pass the extent
      buffer instead.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f863c502
    • David Sterba's avatar
      btrfs: drop __must_check annotations · 203f6a87
      David Sterba authored
      Drop all __must_check annotations because they're used in random
      functions and not consistently. All errors should be handled.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      203f6a87
    • David Sterba's avatar
      btrfs: reformat remaining kdoc style comments · 9580503b
      David Sterba authored
      Function name in the comment does not bring much value to code not
      exposed as API and we don't stick to the kdoc format anymore. Update
      formatting of parameter descriptions.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9580503b
    • David Sterba's avatar
      btrfs: move functions comments from qgroup.h to qgroup.c · 33b6b251
      David Sterba authored
      We keep the comments next to the implementation, there were some left
      to move.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      33b6b251
    • Anand Jain's avatar
      btrfs: comment about fsid and metadata_uuid relationship · cb6eb475
      Anand Jain authored
      Add a comment explaining the relationship between fsid and metadata_uuid
      in the on-disk superblock and the in-memory struct btrfs_fs_devices.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cb6eb475
    • Jiapeng Chong's avatar
      btrfs: qgroup: remove unused helpers for ulist aux data · 12468731
      Jiapeng Chong authored
      These functions are defined in the qgroup.c file, but not called
      anymore since commit "btrfs: qgroup: use qgroup_iterator_nested to in
      qgroup_update_refcnt()" so we can delete them.
      
      fs/btrfs/qgroup.c:149:19: warning: unused function 'qgroup_to_aux'.
      fs/btrfs/qgroup.c:154:36: warning: unused function 'unode_aux_to_qgroup'.
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6566
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      12468731
    • Qu Wenruo's avatar
      btrfs: qgroup: prealloc btrfs_qgroup_list for __add_relation_rb() · 79ace7b8
      Qu Wenruo authored
      Currently we use GFP_ATOMIC allocations when adding a qgroup relation.
      This includes the following 3 call sites:
      
      - btrfs_read_qgroup_config()
        This is not really needed, as at that time we're still in single
        thread mode, and no spin lock is held.
      
      - btrfs_add_qgroup_relation()
        This one is holding a spinlock, but we're guaranteed to add at most one
        relation, thus we can easily do a preallocation and use the
        preallocated memory to avoid GFP_ATOMIC.
      
      - btrfs_qgroup_inherit()
        This is a little more tricky, as we may have as many relationships as
        inherit::num_qgroups.
        Thus we have to properly allocate an array then preallocate all the
        memory.
      
      This patch removes the GFP_ATOMIC allocation for the call sites above
      by doing the preallocation before taking the spinlock, and lets
      __add_relation_rb() handle the freeing of the structure.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      79ace7b8
    • Qu Wenruo's avatar
      btrfs: qgroup: pre-allocate btrfs_qgroup to reduce GFP_ATOMIC usage · 8d54518b
      Qu Wenruo authored
      Qgroup is the heaviest user of GFP_ATOMIC, but one call site does not
      really need GFP_ATOMIC, that is add_qgroup_rb().
      
      That function only searches the rbtree to find if we already have such
      entry.  If not, then it would try to allocate memory for it.
      
      This means we can afford to pre-allocate such structure unconditionally,
      then free the memory if it's not needed.
      
      This function is not a hot path, as it is only used by the
      following functions:
      
      - btrfs_qgroup_inherit()
        For "btrfs subvolume snapshot -i" option.
      
      - btrfs_read_qgroup_config()
        At mount time, when we're guaranteed there is no existing rb tree
        entry for each qgroup.
      
      - btrfs_create_qgroup()
      
      Thus we're completely safe to pre-allocate the extra memory for btrfs_qgroup
      structure, and reduce unnecessary GFP_ATOMIC usage.
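
The pre-allocate-then-maybe-free idea can be illustrated in userspace C. This is a sketch under stated assumptions, not the kernel code: a pthread mutex stands in for the spinlock, a singly linked list stands in for the rbtree, calloc() stands in for a sleeping allocation, and all `_sketch` names are hypothetical.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

struct qgroup_sketch {
	uint64_t id;
	struct qgroup_sketch *next;
};

struct qgroup_table_sketch {
	pthread_mutex_t lock;
	struct qgroup_sketch *head;
};

/*
 * Caller holds the lock. Inserts @prealloc unless @id already exists,
 * in which case the unused preallocation is freed. No allocation here.
 */
static struct qgroup_sketch *add_qgroup_sketch(struct qgroup_table_sketch *t,
					       struct qgroup_sketch *prealloc,
					       uint64_t id)
{
	for (struct qgroup_sketch *q = t->head; q; q = q->next) {
		if (q->id == id) {
			free(prealloc);
			return q;
		}
	}
	prealloc->id = id;
	prealloc->next = t->head;
	t->head = prealloc;
	return prealloc;
}

/* Allocate unconditionally before taking the lock, then insert under it. */
static struct qgroup_sketch *create_qgroup_sketch(struct qgroup_table_sketch *t,
						  uint64_t id)
{
	struct qgroup_sketch *prealloc = calloc(1, sizeof(*prealloc));
	struct qgroup_sketch *q;

	if (!prealloc)
		return NULL;
	pthread_mutex_lock(&t->lock);
	q = add_qgroup_sketch(t, prealloc, id);
	pthread_mutex_unlock(&t->lock);
	return q;
}
```

The cost is one wasted allocation whenever the entry already exists, which is acceptable on a path that is not hot.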
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8d54518b
    • Qu Wenruo's avatar
      btrfs: qgroup: use qgroup_iterator_nested to in qgroup_update_refcnt() · dce28769
      Qu Wenruo authored
      The ulist @qgroups is utilized to record all involved qgroups from both
      old and new roots inside btrfs_qgroup_account_extent().
      
      Due to the fact that qgroup_update_refcnt() itself is already utilizing
      qgroup_iterator, here we have to introduce another list_head,
      btrfs_qgroup::nested_iterator, allowing nested iteration.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      dce28769
    • Qu Wenruo's avatar
      btrfs: qgroup: use qgroup_iterator to replace tmp ulist in qgroup_update_refcnt() · a4a81383
      Qu Wenruo authored
      For function qgroup_update_refcnt(), we use @tmp list to iterate all the
      involved qgroups of a subvolume.
      
      It's a perfect match for qgroup_iterator facility, as that @tmp ulist
      has a very limited lifespan (just inside the while() loop).
      
      By migrating to qgroup_iterator, we can get rid of the GFP_ATOMIC memory
      allocation and no error handling is needed.
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a4a81383