1. 12 Oct, 2023 40 commits
    • btrfs: rename tree_ref and data_ref owning_root · 610647d7
      Boris Burkov authored
      commit 113479d5 ("btrfs: rename root fields in delayed refs structs")
      changed these from ref_root to owning_root. However, there are many
      circumstances where that name is not really accurate and the root on the
      ref struct _is_ the referring root. In general, these are not the owning
      root, though it does happen in some ref merging cases involving
      overwrites during snapshots and similar.
      
      Simple quotas cares quite a bit about tracking the original owner of an
      extent through delayed refs, so rename these back to free up the name
      for the real owning root (which will live on the generic btrfs_ref and
      the head ref).
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add helper for recording simple quota deltas · 1e0e9d57
      Boris Burkov authored
      Rather than re-computing shared/exclusive ownership based on backrefs
      and walking roots for implicit backrefs, simple quotas does an increment
      when creating an extent and a decrement when deleting it. Add the API
      for the extent item code to use to track those events.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
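The increment-on-create, decrement-on-delete accounting described in this commit can be sketched roughly as follows. This is a minimal illustration assuming a single per-subvolume usage counter; `squota_sketch`, `squota_delta_sketch`, and `record_delta_sketch` are hypothetical names and not the kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a qgroup's usage counter; not a kernel struct. */
struct squota_sketch {
	uint64_t usage_bytes;
};

/* Hypothetical event record: one extent being created or deleted. */
struct squota_delta_sketch {
	uint64_t num_bytes;
	bool is_inc;	/* true on extent creation, false on deletion */
};

/*
 * Apply one delta to the counter. Returns 0 on success, or -1 if a
 * decrement would underflow the counter (an accounting bug in this sketch).
 */
int record_delta_sketch(struct squota_sketch *qg,
			const struct squota_delta_sketch *delta)
{
	assert(qg != NULL && delta != NULL);

	if (delta->is_inc) {
		qg->usage_bytes += delta->num_bytes;
		return 0;
	}
	if (delta->num_bytes > qg->usage_bytes)
		return -1;
	qg->usage_bytes -= delta->num_bytes;
	return 0;
}
```

No backref walk is needed in this model: each extent event touches exactly one counter.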
    • btrfs: create qgroup earlier in snapshot creation · 6ed05643
      Boris Burkov authored
      Pull creating the qgroup earlier in the snapshot. This allows simple
      quotas qgroups to see all the metadata writes related to the snapshot
      being created and to be born with the root node accounted.
      
      Note this has an impact on transaction commit where the qgroup creation
      can do a lot of work, allocate memory and take locks. The change is done
      for correctness, potential performance issues will be fixed in the
      future.
      Signed-off-by: Boris Burkov <boris@bur.io>
      [ add note ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: flush reservations during quota disable · af0e2aab
      Boris Burkov authored
      The following sequence:
      
        enable simple quotas
        do some writes
            reserve space
            create ordered_extent
      	  release rsv (store rsv_bytes in OE, mark QGROUP_RESERVED bits)
        disable quotas
        enable simple quotas
            set qgroup rsv to 0 on all subvolumes
        ordered_extent finishes
            create delayed ref with rsv_bytes from before
        run delayed ref
            record_simple_quota_delta
      	  free rsv_bytes (0 -> -rsv_delta)
      
      results in us reliably underflowing the subvolume's qgroup rsv counter,
      because disabling/re-enabling quotas toggles reservation counters down
      to 0, but does not remove other file system state which represents
      successful acquisition of qgroup rsv space. Specifically metadata rsv
      counters on the root object and rsv_bytes on ordered_extent objects that
      have released their reservation as well as the corresponding
      QGROUP_RESERVED extent bits.
      
      Normal qgroups gets away with this, I believe because it forces more
      work to happen on transaction commit, but I am not certain it is totally
      safe from the ordered_extent/leaked extent bit variant. Simple quotas
      hits this reliably.
      
      The intent of the fix is to make disable take the time to clear that
      external-to-qgroups state as well: after flipping off the quota bit on
      fs_info, flush delalloc and ordered extents, clearing the extent bits
      along the way. This makes it so there are no ordered extents or meta
      prealloc hanging around from the first enablement period during the second.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs: add simple_quota incompat feature entry · a744986a
      Boris Burkov authored
      Add an entry in the features directory for the new incompat flag.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs: expose quota mode via sysfs · 0182764a
      Boris Burkov authored
      Add a new sysfs file /sys/fs/btrfs/<uuid>/qgroups/mode
      which prints out the mode qgroups is running in. The possible modes are
      qgroup and squota.
      
      If quotas are not enabled, then the qgroups directory will not exist,
      so don't handle that mode.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
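As a rough sketch of the mapping between the quota mode and the string printed by the `mode` file, assuming the three modes named in this series (disabled, full, simple): the enum and both functions below are hypothetical illustrations, not the kernel's sysfs code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the quota modes named in the commit messages. */
enum quota_mode_sketch {
	QUOTA_MODE_SKETCH_DISABLED,
	QUOTA_MODE_SKETCH_FULL,		/* normal btrfs qgroups */
	QUOTA_MODE_SKETCH_SIMPLE,	/* simple quotas */
};

/*
 * String the mode file would show. The disabled mode has no string
 * because the qgroups directory does not exist in that case.
 */
const char *quota_mode_name_sketch(enum quota_mode_sketch mode)
{
	switch (mode) {
	case QUOTA_MODE_SKETCH_FULL:
		return "qgroup";
	case QUOTA_MODE_SKETCH_SIMPLE:
		return "squota";
	default:
		return NULL;
	}
}

/* Reverse mapping, as a userspace reader of the file might do it. */
enum quota_mode_sketch quota_mode_from_name_sketch(const char *name)
{
	assert(name != NULL);

	if (strcmp(name, "qgroup") == 0)
		return QUOTA_MODE_SKETCH_FULL;
	if (strcmp(name, "squota") == 0)
		return QUOTA_MODE_SKETCH_SIMPLE;
	return QUOTA_MODE_SKETCH_DISABLED;
}
```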
    • btrfs: qgroup: add new quota mode for simple quotas · 182940f4
      Boris Burkov authored
      Add a new quota mode called "simple quotas". It can be enabled by the
      existing quota enable ioctl via a new command, and sets an incompat
      bit, as the implementation of simple quotas will make backwards
      incompatible changes to the disk format of the extent tree.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: introduce quota mode · 6b0cd63b
      Boris Burkov authored
      In preparation for introducing simple quotas, change from a binary
      setting for quotas to an enum based mode. Initially, the possible modes
      are disabled/full. Full quotas is normal btrfs qgroups.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: merge ordered work callbacks in btrfs_work into one · 078b8b90
      David Sterba authored
      There are two ordered callbacks defined in btrfs_work, but only two callers
      actually make use of them; all others pass NULL. We can get rid of the freeing
      callback making it a special case of the normal work. This reduces the
      size of btrfs_work by 8 bytes, final layout:
      
      struct btrfs_work {
              btrfs_func_t               func;                 /*     0     8 */
              btrfs_ordered_func_t       ordered_func;         /*     8     8 */
              struct work_struct         normal_work;          /*    16    32 */
              struct list_head           ordered_list;         /*    48    16 */
              /* --- cacheline 1 boundary (64 bytes) --- */
              struct btrfs_workqueue *   wq;                   /*    64     8 */
              long unsigned int          flags;                /*    72     8 */
      
              /* size: 80, cachelines: 2, members: 6 */
              /* last cacheline: 16 bytes */
      };
      
      This in turn reduces size of other structures (on a release config):
      
      - async_chunk			 160 ->  152
      - async_submit_bio		 152 ->  144
      - btrfs_async_delayed_work	 104 ->   96
      - btrfs_caching_control		 176 ->  168
      - btrfs_delalloc_work		 144 ->  136
      - btrfs_fs_info			3608 -> 3600
      - btrfs_ordered_extent		 440 ->  424
      - btrfs_writepage_fixup		 104 ->   96
      Signed-off-by: David Sterba <dsterba@suse.com>
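The 8-byte saving can be reproduced in userspace with stand-in types. This is only a sketch: the `*_sketch` types approximate the kernel ones on a 64-bit build without debug options, the `ordered_free` member name and the second parameter of the merged callback are assumptions, and the sizes hold only where pointers are 8 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins sized like the kernel types in the layout above (64-bit). */
struct list_head_sketch {
	struct list_head_sketch *next, *prev;		/* 16 bytes */
};

struct work_struct_sketch {
	long data;					/* for atomic_long_t */
	struct list_head_sketch entry;
	void (*func)(struct work_struct_sketch *);	/* 32 bytes total */
};

struct btrfs_work_sketch;
typedef void (*btrfs_func_sketch_t)(struct btrfs_work_sketch *);
/* Assumed: the merged callback takes a flag selecting the freeing case. */
typedef void (*btrfs_ordered_func_sketch_t)(struct btrfs_work_sketch *, int);

/* Layout after the change: six members. */
struct btrfs_work_sketch {
	btrfs_func_sketch_t func;
	btrfs_ordered_func_sketch_t ordered_func;
	struct work_struct_sketch normal_work;
	struct list_head_sketch ordered_list;
	void *wq;					/* for btrfs_workqueue * */
	unsigned long flags;
};

/* Layout before the change, with a separate freeing callback. */
struct btrfs_work_before_sketch {
	btrfs_func_sketch_t func;
	btrfs_func_sketch_t ordered_func;
	btrfs_func_sketch_t ordered_free;		/* assumed member name */
	struct work_struct_sketch normal_work;
	struct list_head_sketch ordered_list;
	void *wq;
	unsigned long flags;
};

static_assert(offsetof(struct btrfs_work_sketch, func) == 0,
	      "func is the first member");
```

Every structure that embeds a `btrfs_work` shrinks by the same pointer-sized amount per embedded instance, which is consistent with the table above.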
    • btrfs: add raid stripe tree to features enabled with debug config · e9b9b911
      Johannes Thumshirn authored
      Until the raid stripe tree code is well enough tested and feature
      complete, "hide" it behind CONFIG_BTRFS_DEBUG so only people who
      want to use it are actually using it.
      
      The scrub support may still fail some tests (btrfs/060 and up) and will
      be fixed. RAID5/6 is not supported.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tree-checker: add support for raid stripe tree · e0b4077f
      Johannes Thumshirn authored
      Add tree checker support for RAID stripe tree items. Verify:
      
      - alignment
      - presence of the incompat bit
      - supported encoding
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tracepoints: add events for raid stripe tree · b5e2c2ff
      Johannes Thumshirn authored
      Add trace events for raid-stripe-tree operations.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs: announce presence of raid-stripe-tree · 9f9918a8
      Johannes Thumshirn authored
      If a filesystem with a raid-stripe-tree is mounted, show the RST feature
      in sysfs, currently still under the CONFIG_BTRFS_DEBUG option.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add raid stripe tree pretty printer · edde81f1
      Johannes Thumshirn authored
      Decode raid-stripe-tree entries on btrfs_print_tree().
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: support RAID0/1/10 on top of raid stripe tree · 568220fa
      Johannes Thumshirn authored
      When we have a raid-stripe-tree, we can do RAID0/1/10 on zoned devices
      for data block groups. For metadata block groups, we don't actually
      need anything special, as all metadata I/O is protected by the
      btrfs_zoned_meta_io_lock() already.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: implement raid stripe tree support · 9acaa641
      Johannes Thumshirn authored
      A filesystem that uses the raid stripe tree for logical to physical
      address translation can't use the regular scrub path, which reads all
      stripes and then checks afterwards whether a sector is unused.
      
      When using the raid stripe tree, this will result in lookup errors, as
      the stripe tree doesn't know the requested logical addresses.
      
      In case we're scrubbing a filesystem which uses the RAID stripe tree for
      multi-device logical to physical address translation, perform an extra
      block mapping step to get the real on-disk stripe length from the stripe
      tree when scrubbing the sectors.
      
      This prevents a double completion of the btrfs_bio caused by splitting the
      underlying bio and ultimately a use-after-free.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: lookup physical address from stripe extent · 10e27980
      Johannes Thumshirn authored
      Look up the physical address from the raid stripe tree when a read on a
      RAID volume formatted with the raid stripe tree is attempted.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: delete stripe extent on extent deletion · ca41504e
      Johannes Thumshirn authored
      As each stripe extent is tied to an extent item, delete the stripe extent
      once the corresponding extent item is deleted.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add support for inserting raid stripe extents · 02c372e1
      Johannes Thumshirn authored
      Add support for inserting stripe extents into the raid stripe tree on
      completion of every write that needs an extra logical-to-physical
      translation when using RAID.
      
      Inserting the stripe extents happens after the data I/O has completed.
      This is done to
      
        a) support zone-append and
        b) rule out the possibility of a RAID-write-hole.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: read raid stripe tree from disk · 51502090
      Johannes Thumshirn authored
      If we find the raid-stripe-tree on mount, read it from disk. This is
      a backward incompatible feature. The rescue=ignorebadroots mount option
      will skip this tree.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add raid stripe tree definitions · ee129330
      Johannes Thumshirn authored
      Add definitions for the raid stripe tree. This tree will hold information
      about the on-disk layout of the stripes in a RAID set.
      
      Each stripe extent has a 1:1 relationship with an on-disk extent item and
      is doing the logical to per-drive physical address translation for the
      extent item in question.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: warn on tree blocks which are not nodesize aligned · 6d3a6194
      Qu Wenruo authored
      A long time ago, we had some metadata chunks which started at a sector
      boundary but were not aligned to a nodesize boundary.
      
      This led to some older filesystems having tree blocks aligned only to
      sectorsize, but not to nodesize.
      
      Later 'btrfs check' gained the ability to detect and warn about such tree
      blocks, and the kernel fixed the chunk allocation behavior, so nowadays
      those tree blocks should be pretty rare.
      
      But in the future, if we want to migrate metadata to folio, we cannot
      have such tree blocks, as filemap_add_folio() requires the page index to
      be aligned with the folio number of pages.  Such unaligned tree blocks
      can lead to VM_BUG_ON().
      
      So this patch adds an extra warning for those unaligned tree blocks, in
      preparation for the future folio migration.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
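The alignment condition behind this warning can be illustrated with a small check. This is a sketch only: `tree_block_is_nodesize_aligned_sketch` is a hypothetical helper, and it assumes nodesize is a power of two so that a bit mask can replace the modulo.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if a tree block starting at logical address bytenr is
 * aligned to nodesize. Assumes nodesize is a non-zero power of two.
 */
bool tree_block_is_nodesize_aligned_sketch(uint64_t bytenr, uint32_t nodesize)
{
	assert(nodesize != 0 && (nodesize & (nodesize - 1)) == 0);

	return (bytenr & ((uint64_t)nodesize - 1)) == 0;
}
```

For example, with a 4K sectorsize and a 16K nodesize, a tree block at byte 20480 is sector aligned but not nodesize aligned, which is the case the warning targets.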
    • btrfs: don't arbitrarily slow down delalloc if we're committing · 11aeb97b
      Josef Bacik authored
      We have a random schedule_timeout() if the current transaction is
      committing, which seems to be a holdover from the original delalloc
      reservation code.
      
      Remove this, we have the proper flushing stuff, we shouldn't be hoping
      for random timing things to make everything work.  This just induces
      latency for no reason.
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove useless comment from btrfs_pin_extent_for_log_replay() · c967c19e
      Filipe Manana authored
      The comment on top of btrfs_pin_extent_for_log_replay() mentioning that
      the function must be called within a transaction is pointless as of
      commit 9fce5704 ("btrfs: Make btrfs_pin_extent_for_log_replay take
      transaction handle"), since the function now takes a transaction handle
      as its first argument. So remove the comment because it's completely
      useless now.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove stale comment from btrfs_free_extent() · df423ee2
      Filipe Manana authored
      A comment at btrfs_free_extent() mentions the call to btrfs_pin_extent()
      unlocks the pinned mutex, however that mutex is long gone, it was removed
      in 2009 by commit 04018de5 ("Btrfs: kill the pinned_mutex"). So just
      delete the comment.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: factor out DUP bg handling from btrfs_load_block_group_zone_info · 87463f7e
      Christoph Hellwig authored
      Split the code handling a type DUP block group from
      btrfs_load_block_group_zone_info to make the code more readable.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: factor out single bg handling from btrfs_load_block_group_zone_info · 9e0e3e74
      Christoph Hellwig authored
      Split the code handling a type single block group from
      btrfs_load_block_group_zone_info to make the code more readable.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: factor out per-zone logic from btrfs_load_block_group_zone_info · 09a46725
      Christoph Hellwig authored
      Split out a helper for the body of the per-zone loop in
      btrfs_load_block_group_zone_info to make the function easier to read and
      modify.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: introduce a zone_info struct in btrfs_load_block_group_zone_info · 15c12fcc
      Christoph Hellwig authored
      Add a new zone_info structure to hold per-zone information in
      btrfs_load_block_group_zone_info and prepare for breaking out helpers
      from it.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove pointless loop from btrfs_update_block_group() · 4d20c1de
      Filipe Manana authored
      When an extent is allocated or freed, we call btrfs_update_block_group()
      to update its block group and space info. An extent always belongs to a
      single block group, it can never span multiple block groups, so the loop
      we have at btrfs_update_block_group() is pointless, as it always has a
      single iteration. The loop was added in the very early days, 2007, when
      the block group code was added in commit 9078a3e1 ("Btrfs: start of
      block group code"), but even back then it seemed pointless.
      
      So remove the loop and assert the block group containing the start offset
      of the extent also contains the whole extent.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
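The invariant that replaces the loop, namely that the block group containing the start of an extent also contains the whole extent, can be sketched as a containment check. `block_group_sketch` and `extent_in_block_group_sketch` are hypothetical names used for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a block group's logical address range. */
struct block_group_sketch {
	uint64_t start;
	uint64_t length;
};

/*
 * True if [bytenr, bytenr + num_bytes) lies entirely inside the block
 * group. Written with subtractions so that it cannot overflow.
 */
bool extent_in_block_group_sketch(const struct block_group_sketch *bg,
				  uint64_t bytenr, uint64_t num_bytes)
{
	assert(bg != NULL);

	return bytenr >= bg->start &&
	       num_bytes <= bg->length &&
	       bytenr - bg->start <= bg->length - num_bytes;
}
```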
    • btrfs: mark transaction id check as unlikely at btrfs_mark_buffer_dirty() · 4ebe8d47
      Filipe Manana authored
      At btrfs_mark_buffer_dirty(), having a transaction id mismatch is never
      expected to happen and it usually means there's a bug or some memory
      corruption due to a bitflip for example. So mark the condition as unlikely
      to optimize code generation as well as to make it obvious for human
      readers that it is a very unexpected condition.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_crit at btrfs_mark_buffer_dirty() · 20cbe460
      Filipe Manana authored
      There's no need to use WARN() at btrfs_mark_buffer_dirty() to print an
      error message, as we have the fs_info pointer we can use btrfs_crit()
      which prints device information and makes the message have a more uniform
      format. As we are already aborting the transaction we already have a stack
      trace printed as well. So replace the use of WARN() with btrfs_crit().
      
      Also slightly reword the message to use 'logical' instead of 'block' as
      it's what is used in other error/warning messages.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: abort transaction on generation mismatch when marking eb as dirty · 50564b65
      Filipe Manana authored
      When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(),
      we check if its generation matches the running transaction and if not we
      just print a warning. Such mismatch is an indicator that something really
      went wrong and only printing a warning message (and stack trace) is not
      enough to prevent a corruption. Allowing a transaction to commit with such
      an extent buffer will trigger an error if we ever try to read it from disk
      due to a generation mismatch with its parent generation.
      
      So abort the current transaction with -EUCLEAN if we notice a generation
      mismatch. For this we need to pass a transaction handle to
      btrfs_mark_buffer_dirty() which is always available except in test code,
      in which case we can pass NULL since it operates on dummy extent buffers
      and all test roots have a single node/leaf (root node at level 0).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
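A simplified userspace model of the behaviour described in this commit: a generation mismatch aborts the transaction with -EUCLEAN instead of only warning, and test code may pass a NULL handle. All `*_sketch` names are hypothetical, the return value is added for illustration, and the error number is defined locally (117 is the generic Linux value of EUCLEAN).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EUCLEAN 117	/* generic Linux value of EUCLEAN */

/* Branch hint for the never-expected path; a no-op on other compilers. */
#if defined(__GNUC__)
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)
#else
#define sketch_unlikely(x) (x)
#endif

struct trans_sketch {
	uint64_t transid;
	int aborted;		/* 0, or the negative error that aborted it */
};

struct eb_sketch {
	uint64_t generation;
	int dirty;
};

/*
 * Mark the buffer dirty, or abort the transaction on a generation
 * mismatch. A NULL trans models the test-code case and skips the check.
 */
int mark_buffer_dirty_sketch(struct trans_sketch *trans, struct eb_sketch *eb)
{
	assert(eb != NULL);

	if (trans && sketch_unlikely(eb->generation != trans->transid)) {
		trans->aborted = -SKETCH_EUCLEAN;
		return -SKETCH_EUCLEAN;
	}
	eb->dirty = 1;
	return 0;
}
```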
    • btrfs: scan but don't register device on single device filesystem · bc27d6f0
      Anand Jain authored
      After the commit 5f58d783 ("btrfs: free device in btrfs_close_devices
      for a single device filesystem") we unregister the device from the kernel
      memory upon unmounting for a single device.
      
      So any device registration that was performed before mounting is no
      longer in the kernel memory.
      
      Note, however, that device registration is unnecessary for a
      single-device btrfs filesystem unless it's a seed device.
      
      So for commands like 'btrfs device scan' or 'btrfs device ready' with a
      non-seed single-device btrfs filesystem, they can return success just
      after superblock verification and without the actual device scan.  When
      'device scan --forget' is called on such device no error is returned.
      
      The seed device must remain in the kernel memory to allow the sprout
      device to mount without the need to specify the seed device explicitly.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename errno identifiers to error · ed164802
      David Sterba authored
      We sync the kernel files to userspace, where the 'errno' symbol is defined
      by the standard library. This does not matter in the kernel, but parameters
      or local variables with that name could clash in userspace. Rename them all.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: always reserve space for delayed refs when starting transaction · 28270e25
      Filipe Manana authored
      When starting a transaction (or joining an existing one with
      btrfs_start_transaction()), we reserve space for the number of items we
      want to insert in a btree, but we don't do it for the delayed refs we
      will generate while using the transaction to modify (COW) extent buffers
      in a btree or allocate new extent buffers. Basically how it works:
      
      1) When we start a transaction we reserve space for the number of items
         the caller wants to be inserted/modified/deleted in a btree. This space
         goes to the transaction block reserve;
      
      2) If the delayed refs block reserve is not full, its size is greater
         than the amount of its reserved space, and the flush method is
         BTRFS_RESERVE_FLUSH_ALL, then we attempt to reserve more space for
         it corresponding to the number of items the caller wants to
         insert/modify/delete in a btree;
      
      3) The size of the delayed refs block reserve is increased when a task
         creates delayed refs after COWing an extent buffer, allocating a new
         one or deleting (freeing) an extent buffer. This happens after the
         task started or joined a transaction, whenever it calls
         btrfs_update_delayed_refs_rsv();
      
      4) The delayed refs block reserve is then refilled by anyone calling
         btrfs_delayed_refs_rsv_refill(), either during unlink/truncate
         operations or when someone else calls btrfs_start_transaction() with
         a 0 number of items and flush method BTRFS_RESERVE_FLUSH_ALL;
      
      5) As a task COWs or allocates extent buffers, it consumes space from the
         transaction block reserve. When the task releases its transaction
         handle (btrfs_end_transaction()) or it attempts to commit the
         transaction, it releases any remaining space in the transaction block
         reserve that it did not use, as not all space may have been used (due
         to pessimistic space calculation) by calling btrfs_block_rsv_release()
         which will try to add that unused space to the delayed refs block
         reserve (if its current size is greater than its reserved space).
         That transferred space may not be enough to completely fulfill the
         delayed refs block reserve.
      
         Plus we have some tasks that will attempt to modify as many leaves
         as they can before getting -ENOSPC (and then reserving more space and
         retrying), such as hole punching and extent cloning which call
         btrfs_replace_file_extents(). Such tasks can therefore generate a
         high number of delayed refs, for both metadata and data (we can't
         know in advance how many file extent items we will find in a range
         and therefore how many delayed refs for dropping references on data
         extents we will generate);
      
      6) If a transaction starts its commit before the delayed refs block
         reserve is refilled, for example by the transaction kthread or by
         someone who called btrfs_join_transaction() before starting the
         commit, then when running delayed references if we don't have enough
         reserved space in the delayed refs block reserve, we will consume
         space from the global block reserve.
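The interplay between a reserve's size and its reserved space in steps 3, 5 and 6 can be modelled very roughly as below. This is a toy model, not the kernel's block reserve code: `rsv_sketch` and the three functions are hypothetical, amounts are plain byte counts, and flushing is ignored entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy block reserve: size is the target, reserved is what is backed. */
struct rsv_sketch {
	uint64_t size;
	uint64_t reserved;
};

/* Step 3: creating delayed refs only raises the target size. */
void delayed_refs_grow_sketch(struct rsv_sketch *delayed, uint64_t bytes)
{
	assert(delayed != NULL);
	delayed->size += bytes;
}

/*
 * Step 5: unused transaction space tops up the delayed refs reserve,
 * but only up to its deficit. Returns the bytes that were not needed.
 */
uint64_t release_unused_sketch(struct rsv_sketch *delayed, uint64_t unused)
{
	uint64_t deficit, moved;

	assert(delayed != NULL);
	deficit = delayed->size > delayed->reserved ?
		  delayed->size - delayed->reserved : 0;
	moved = unused < deficit ? unused : deficit;
	delayed->reserved += moved;
	return unused - moved;
}

/*
 * Step 6: running delayed refs uses the delayed refs reserve first and
 * takes any shortfall from the global reserve. Returns 0 on success or
 * -1 when the global reserve cannot cover the shortfall (the -ENOSPC case).
 */
int run_delayed_refs_sketch(struct rsv_sketch *delayed,
			    struct rsv_sketch *global, uint64_t bytes)
{
	uint64_t from_delayed, shortfall;

	assert(delayed != NULL && global != NULL);
	from_delayed = bytes < delayed->reserved ? bytes : delayed->reserved;
	shortfall = bytes - from_delayed;
	if (shortfall > global->reserved)
		return -1;
	delayed->reserved -= from_delayed;
	delayed->size = delayed->size > bytes ? delayed->size - bytes : 0;
	global->reserved -= shortfall;
	return 0;
}
```

In this model, the larger the gap between `size` and `reserved` when the commit starts, the more is drained from the global reserve.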
      
      Now this doesn't make a lot of sense because:
      
      1) We should reserve space for delayed references when starting the
         transaction, since we have no guarantees the delayed refs block
         reserve will be refilled;
      
      2) If no refill happens then we will consume from the global block reserve
         when running delayed refs during the transaction commit;
      
      3) If we have a bunch of tasks calling btrfs_start_transaction() with a
         number of items greater than zero and at the time the delayed refs
         reserve is full, then we don't reserve any space at
         btrfs_start_transaction() for the delayed refs that will be generated
         by a task, and we can therefore end up using a lot of space from the
         global reserve when running the delayed refs during a transaction
         commit;
      
      4) There are also other operations that result in bumping the size of the
         delayed refs reserve, such as creating and deleting block groups, as
         well as the need to update a block group item because we allocated or
         freed an extent from the respective block group;
      
      5) If we have a significant gap between the delayed refs reserve's size
         and its reserved space, two very bad things may happen:
      
         1) The reserved space of the global reserve may not be enough and we
            fail the transaction commit with -ENOSPC when running delayed refs;
      
         2) If the available space in the global reserve is enough it may result
            in nearly exhausting it. If the fs has no more unallocated device
            space for allocating a new block group and all the available space
            in existing metadata block groups is not far from the global
            reserve's size before we started the transaction commit, we may end
            up in a situation where after the transaction commit we have too
            little available metadata space, and any future transaction commit
            will fail with -ENOSPC, because although we were able to reserve
            space to start the transaction, we were not able to commit it, as
            running delayed refs generates some more delayed refs (to update the
            extent tree for example) - this includes not even being able to
            commit a transaction that was started with the goal of unlinking a
            file, removing an empty data block group or doing reclaim/balance,
            so there's no way to release metadata space.
      
            In the worst case the next time we mount the filesystem we may
            also fail with -ENOSPC due to failure to commit a transaction to
            clean up orphan inodes. This latter case was reported by someone
            running a SLE (SUSE Linux Enterprise) distribution, where the fs
            had no more unallocated space that could be used to allocate a
            new metadata block group, and the available metadata space was
            about 1.5M, not enough to commit a transaction to clean up an
            orphan inode (or do relocation of data block groups that were
            far from being full).
      
      So improve on this situation by always reserving space for delayed refs
      when calling start_transaction(), and if the flush method is
      BTRFS_RESERVE_FLUSH_ALL, also try to refill the delayed refs block
      reserve if it's not full. The space reserved for the delayed refs is added
      to a local block reserve that is part of the transaction handle, and when
      a task updates the delayed refs block reserve size, after creating a
      delayed ref, the space is transferred from that local reserve to the
      global delayed refs reserve (fs_info->delayed_refs_rsv). In case the
      local reserve does not have enough space, which may happen for tasks
      that generate a variable and potentially large number of delayed refs
      (such as the hole punching and extent cloning cases mentioned before),
      we transfer any available space and then rely on the current behaviour
      of hoping some other task refills the delayed refs reserve, or fall back
      to the global block reserve.
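
The transfer described above can be pictured with a small user-space model. This is only a sketch under stated assumptions: `struct rsv` and `update_delayed_refs_rsv()` are hypothetical names, not the kernel's types or helpers, and locking, flushing and the fallback to the global block reserve are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified block reserve: 'size' is the space the reserve
 * should hold, 'reserved' is the space it actually holds. */
struct rsv {
	uint64_t size;
	uint64_t reserved;
};

/*
 * Model of a task that just created delayed refs needing 'bytes' of space.
 * The shared delayed refs reserve grows by 'bytes', and as much of that as
 * possible is backed by space moved out of the task's local reserve.
 * Returns the shortfall, i.e. the bytes that some later refill (or another
 * reserve) would still have to cover.
 */
uint64_t update_delayed_refs_rsv(struct rsv *local, struct rsv *shared,
				 uint64_t bytes)
{
	uint64_t moved;

	assert(local != NULL && shared != NULL);

	/* Move what we can: never more than the local reserve holds. */
	moved = bytes < local->reserved ? bytes : local->reserved;

	shared->size += bytes;
	shared->reserved += moved;
	local->reserved -= moved;

	return bytes - moved;
}
```

In this model a task that generates more delayed refs than it reserved for (the hole punching and extent cloning cases, for instance) gets a non-zero return value, which corresponds to the gap between the shared reserve's size and its reserved space.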
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      28270e25
    • btrfs: stop doing excessive space reservation for csum deletion · adb86dbe
      Filipe Manana authored
      Currently when reserving space for deleting the csum items for a data
      extent, when adding or updating a delayed ref head, we determine how
      many leaves of csum items we can have and then pass that number to the
      helper btrfs_calc_delayed_ref_bytes(). This helper is used for calculating
      space for all tree modifications we need when running delayed references,
      however the amount of space it computes is excessive for deleting csum
      items because:
      
      1) It uses btrfs_calc_insert_metadata_size() which is excessive because
         we only need to delete csum items from the csum tree, we don't need
         to insert any items, so btrfs_calc_metadata_size() is all we need (as
         it computes space needed to delete an item);
      
      2) If the free space tree is enabled, it doubles the amount of space,
         which is pointless for csum deletion since we don't need to touch the
         free space tree or any other tree other than the csum tree.
      
      So improve on this by tracking how many csum deletions we have and using
      a new helper to calculate space for csum deletions (just a wrapper around
      btrfs_calc_metadata_size() with a comment). This reduces the amount of
      space we need to reserve for csum deletions by a factor of 4, and it helps
      reduce the number of times we have to block space reservations and have
      the reclaim task enter the space flushing algorithm (flush delayed items,
      flush delayed refs, etc) in order to satisfy tickets.
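
The factor of 4 can be checked with a rough user-space model of the two calculations. This is a sketch, not the kernel code: the function names are hypothetical, `MAX_LEVEL` is assumed to mirror BTRFS_MAX_LEVEL, and the insertion size is assumed to be twice the plain metadata size. Under those assumptions the factor is 4 with the free space tree enabled and 2 without it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_LEVEL 8 /* assumption: mirrors BTRFS_MAX_LEVEL */

/* Space to modify or delete items: one full tree path per item. */
uint64_t calc_metadata_size(uint32_t nodesize, unsigned int num_items)
{
	assert(nodesize > 0);
	return (uint64_t)nodesize * MAX_LEVEL * num_items;
}

/* Space to insert items: assumed here to be twice the size above. */
uint64_t calc_insert_metadata_size(uint32_t nodesize, unsigned int num_items)
{
	return 2 * calc_metadata_size(nodesize, num_items);
}

/* Old behaviour (modelled): generic delayed ref calculation, which uses the
 * insertion size and doubles it when the free space tree is enabled. */
uint64_t old_csum_deletion_bytes(uint32_t nodesize, unsigned int csum_leaves,
				 bool free_space_tree)
{
	uint64_t bytes = calc_insert_metadata_size(nodesize, csum_leaves);

	return free_space_tree ? bytes * 2 : bytes;
}

/* New behaviour (modelled): only the csum tree is touched, and only for
 * deletion, so the plain metadata size is enough. */
uint64_t new_csum_deletion_bytes(uint32_t nodesize, unsigned int csum_leaves)
{
	return calc_metadata_size(nodesize, csum_leaves);
}
```

For a 16K node size and 3 csum leaves, the modelled old reservation is 1572864 bytes with the free space tree enabled, against 393216 bytes for the new one.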
      
      For example this results in a total time decrease when unlinking (or
      truncating) files with many extents, as we end up having to block on
      metadata space reservations less often. Example test:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nullb0
        MNT=/mnt/test
      
        umount $DEV &> /dev/null
        mkfs.btrfs -f $DEV
        # Use compression to quickly create files with a lot of extents
        # (each with a size of 128K).
        mount -o compress=lzo $DEV $MNT
      
        # 120G gives at least 983040 extents with a size of 128K.
        xfs_io -f -c "pwrite -S 0xab -b 1M 0 120G" $MNT/foobar
      
        # Flush all delalloc and clear all metadata from memory.
        umount $MNT
        mount -o compress=lzo $DEV $MNT
      
        start=$(date +%s%N)
        rm -f $MNT/foobar
        end=$(date +%s%N)
        dur=$(( (end - start) / 1000000 ))
        echo "rm took $dur milliseconds"
      
        umount $MNT
      
      Before this change rm took: 7504 milliseconds
      After this change rm took:  6574 milliseconds  (-12.4%)
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      adb86dbe
    • btrfs: remove pointless initialization at btrfs_delayed_refs_rsv_release() · b6ea3e6a
      Filipe Manana authored
      There's no point in initializing to 0 the local variable 'released' as
      we don't use it before the next assignment to it. So remove the
      initialization. This may help avoid some warnings with clang tools such
      as the one reported/fixed by commit 966de47f ("btrfs: remove redundant
      initialization of variables in log_new_ancestors").
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b6ea3e6a
    • btrfs: reserve space for delayed refs on a per ref basis · 3ee56a58
      Filipe Manana authored
      Currently when reserving space for delayed refs we do it on a per ref head
      basis. This is generally enough because most back refs for an extent end
      up being inlined in the extent item - with the default leaf size of 16K we
      can have at most 33 inline back refs (this is calculated by the macro
      BTRFS_MAX_EXTENT_ITEM_SIZE()). The amount of bytes reserved for each ref
      head is given by btrfs_calc_delayed_ref_bytes(), which basically
      corresponds to a single path for insertion into the extent tree plus
      another path for insertion into the free space tree if it's enabled.
      
      However if we have reached the limit of inline refs or we have a mix of
      inline and non-inline refs, then we will need to insert a non-inline ref
      and update the existing extent item to update the total number of
      references for the extent. This implies we need reserved space for two
      insertion paths in the extent tree, but we only reserved for one path.
      The extent item and the non-inline ref item may be located in different
      leaves, or even if they are located in the same leaf, after updating the
      extent item and before inserting the non-inline ref item, the extent
      buffers in the btree path may have been written (due to memory pressure,
      for example), in which case we need to COW the entire path again. In this
      case since we have not reserved enough space for the delayed refs block
      reserve, we will use the global block reserve.
      
      If we are in a situation where the fs does not have enough unallocated
      space to allocate a new metadata block group and the available space in
      the existing metadata block groups is close to the maximum size of the
      global block reserve (512M), we may end up consuming too much of the free
      metadata space, to the point where we can't commit any future transaction
      because it will fail, with -ENOSPC, during its commit when trying to
      allocate an extent for some COW operation (running delayed refs generated
      by running delayed refs or COWing the root tree's root node at
      commit_cowonly_roots() for example). Such a dramatic scenario can happen
      if we have many delayed refs that require the insertion of non-inline ref
      items, due to too many reflinks or snapshots. We also have situations
      where we use the global block reserve because we could not know in
      advance that we will need space to update some trees (block group
      creation for example), so this all adds up to increase the chances of
      exhausting the global block reserve and making any future transaction
      commit fail with -ENOSPC and turn the fs into RO mode, or fail the mount
      operation in case the mount needs to start and commit a transaction, such
      as when we have orphans to clean up. Such a case was reported by someone
      running a SLE (SUSE Linux Enterprise) distribution, where the fs had no
      more unallocated space that could be used to allocate a new metadata block
      group, and the available metadata space was about 1.5M, not enough to
      commit a transaction to clean up an orphan inode (or do relocation of data
      block groups that were far from being full).
      
      So reserve space for delayed refs by individual refs and not by ref heads,
      as we may need to COW multiple extent tree paths due to non-inline ref
      items.
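
The difference between the two reservation schemes can be illustrated with a small model. This is a sketch under assumptions: the function names are hypothetical, `MAX_LEVEL` is assumed to mirror BTRFS_MAX_LEVEL, and one insertion path is assumed to cost twice the node size times the maximum tree height. It does not reproduce the kernel's accounting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_LEVEL 8 /* assumption: mirrors BTRFS_MAX_LEVEL */

/* Modelled reservation unit: one insertion path into the extent tree, plus
 * another one into the free space tree if it is enabled. */
uint64_t delayed_ref_unit_bytes(uint32_t nodesize, bool free_space_tree)
{
	uint64_t bytes;

	assert(nodesize > 0);
	bytes = (uint64_t)nodesize * MAX_LEVEL * 2;

	return free_space_tree ? bytes * 2 : bytes;
}

/* Before: one unit per delayed ref head, however many refs hang off it. */
uint64_t reserve_by_heads(uint32_t nodesize, bool free_space_tree,
			  unsigned int num_heads)
{
	return delayed_ref_unit_bytes(nodesize, free_space_tree) * num_heads;
}

/* After: one unit per individual delayed ref, so that a non-inline ref item
 * living in a different leaf (or a path that must be COWed again) is
 * covered as well. */
uint64_t reserve_by_refs(uint32_t nodesize, bool free_space_tree,
			 unsigned int num_refs)
{
	return delayed_ref_unit_bytes(nodesize, free_space_tree) * num_refs;
}
```

With a 16K node size and no free space tree, a single head carrying 3 refs gets 262144 bytes in the per-head model and 786432 bytes in the per-ref model.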
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3ee56a58
    • btrfs: allow to run delayed refs by bytes to be released instead of count · 8a526c44
      Filipe Manana authored
      When running delayed references, through btrfs_run_delayed_refs(), we can
      specify how many to run, run all existing delayed references, or keep
      running delayed references while we can find any. This is controlled with
      the value of the 'count' argument, where a value of 0 means to run all
      delayed references that exist by the time btrfs_run_delayed_refs() is
      called, (unsigned long)-1 means to keep running delayed references while
      we are able to find any, and any other value means to run that exact
      number of delayed references.
      
      Typically a specific value other than 0 or -1 is used when flushing space
      to try to release a certain amount of bytes for a ticket. In this case
      we simply calculate how many delayed reference heads correspond to a
      specific amount of bytes, with calc_delayed_refs_nr(). However that only
      takes into account the space reserved for the reference heads themselves,
      and does not account for the space reserved for deleting checksums from
      the csum tree (see add_delayed_ref_head() and update_existing_head_ref())
      in case we are going to delete a data extent. This means we may end up
      running more delayed references than necessary in case we process delayed
      references for deleting a data extent.
      
      So change the logic of btrfs_run_delayed_refs() to take a bytes argument
      to specify how many bytes of delayed references to run/release, using the
      special values of 0 to mean all existing delayed references and U64_MAX
      (or (u64)-1) to keep running delayed references while we can find any.
      
      This prevents running more delayed references than necessary, when we have
      delayed references for deleting data extents, but also makes the upcoming
      changes/patches simpler and it's preparatory work for them.
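
The effect of switching from a count to a byte target can be sketched with a toy model. Everything here is hypothetical: `run_by_count()` and `run_by_bytes()` are illustrative names, each array entry stands for the space released by running one delayed ref head (including any csum deletion space), and no new delayed refs are generated while running, so the 0 and U64_MAX cases behave the same in this model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count based model: run exactly 'count' heads (or all of them if fewer
 * exist), whatever amount of space each one releases. */
size_t run_by_count(const uint64_t *pending, size_t n, size_t count,
		    uint64_t *released)
{
	size_t ran = 0;

	assert(released != NULL);
	*released = 0;
	while (ran < n && ran < count)
		*released += pending[ran++];

	return ran;
}

/* Byte based model: stop as soon as at least 'min_bytes' were released.
 * 0 and UINT64_MAX both mean "no byte limit" here. */
size_t run_by_bytes(const uint64_t *pending, size_t n, uint64_t min_bytes,
		    uint64_t *released)
{
	size_t ran = 0;
	int limited = (min_bytes != 0 && min_bytes != UINT64_MAX);

	assert(released != NULL);
	*released = 0;
	while (ran < n) {
		if (limited && *released >= min_bytes)
			break;
		*released += pending[ran++];
	}

	return ran;
}
```

Take three pending heads releasing 400, 400 and 100 bytes, where the first two include csum deletion space, and a ticket that needs 200 bytes. If the count is derived by assuming 100 bytes per head, the count based model runs 2 heads and releases 800 bytes, while the byte based model stops after 1 head and 400 bytes.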
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8a526c44