1. 05 Dec, 2022 40 commits
    • btrfs: sysfs: convert remaining scnprintf to sysfs_emit · 467761f9
      David Sterba authored
      sysfs_emit is the safe API for writing to sysfs files. The other
      callers were previously converted from scnprintf; the one left to do
      is in btrfs_read_policy_show.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      467761f9
    • btrfs: do not panic if we can't allocate a prealloc extent state · 5a75034e
      Josef Bacik authored
      We sometimes have to allocate new extent states when clearing or setting
      new bits in an extent io tree.  Generally we preallocate this before
      taking the tree spin lock, but we can use this preallocated extent state
      sometimes and then need to try to do a GFP_ATOMIC allocation under the
      lock.
      
      Unfortunately sometimes this fails, and then we hit the BUG_ON() and
      bring the box down.  This happens roughly 20 times a week in our fleet.
      
      However the vast majority of callers use GFP_NOFS, which means that if
      this GFP_ATOMIC allocation fails, we could simply drop the spin lock, go
      back and allocate a new extent state with our given gfp mask, and begin
      again from where we left off.
      
      For the remaining callers that do not use GFP_NOFS, they are generally
      using GFP_NOWAIT, which still allows for some reclaim.  So allow these
      allocations to attempt to happen outside of the spin lock so we don't
      need to rely on GFP_ATOMIC allocations.
      
      This in essence creates an infinite loop for anything that isn't
      GFP_NOFS.  To address this we may want to migrate to using mempools for
      extent states so that we will always have emergency reserves in order to
      make our allocations.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5a75034e
    • btrfs: remove unused unlock_extent_atomic · da2a071b
      Josef Bacik authored
      As of "btrfs: do not use GFP_ATOMIC in the read endio" we no longer have
      any users of unlock_extent_atomic, remove it.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      da2a071b
    • btrfs: do not use GFP_ATOMIC in the read endio · 48acc47d
      Josef Bacik authored
      We have done read endio in an async thread for a very, very long time,
      which makes the use of GFP_ATOMIC and unlock_extent_atomic() unneeded in
      our read endio path.  We've noticed under heavy memory pressure in our
      fleet that we can fail these allocations, and then often trip a
      BUG_ON(!allocation), which isn't an ideal outcome.  Begin to address
      this by simply not using GFP_ATOMIC, which will allow us to do things
      like actually allocate an extent state when doing
      set_extent_bits(UPTODATE) in the endio handler.
      
      End io handlers are not called in atomic context; besides, we have
      been allocating the failrec with GFP_NOFS, so we would have noticed
      a problem.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      48acc47d
    • btrfs: skip update of block group item if used bytes are the same · 7248e0ce
      Qu Wenruo authored
      [BACKGROUND]
      
      When committing a transaction, we will update block group items for all
      dirty block groups.
      
      But in fact, dirty block groups don't always need to update their block
      group items.
      It's pretty common to have a metadata block group which experienced
      several COW operations, but still have the same amount of used bytes.
      
      In that case, we may unnecessarily COW a tree block doing nothing.
      
      [ENHANCEMENT]
      
      This patch will introduce btrfs_block_group::commit_used member to
      remember the last used bytes, and use that new member to skip
      unnecessary block group item update.
      
      This would be more common for large filesystems, where metadata block
      group can be as large as 1GiB, containing at most 64K metadata items.
      
      In that case, if COW added and then deleted one metadata item near the
      end of the block group, then it's completely possible we don't need to
      touch the block group item at all.
      
      [BENCHMARK]
      
      The change itself has quite a high chance (20~80%) of skipping block
      group item updates in a lot of workloads.
      
      As a result, less time is spent on btrfs_write_dirty_block_groups(),
      which overall reduces the execution time of the critical section of
      btrfs_commit_transaction().
      
      Here is a fio command which does random writes with a 4K block size,
      causing very heavy metadata updates:
      
      fio --filename=$mnt/file --size=512M --rw=randwrite --direct=1 --bs=4k \
          --ioengine=libaio --iodepth=64 --runtime=300 --numjobs=4 \
          --name=random_write --fallocate=none --time_based --fsync_on_close=1
      
      The file size (512M) and number of threads (4) means 2GiB file size in
      total, but during the full 300s run time, my dedicated SATA SSD is able
      to write around 20~25GiB, which is over 10 times the file size.
      
      Thus after we fill the initial 2G, we should not cause many block
      group item updates.
      
      Please note, the fio numbers by themselves don't change much, but if
      we look deeper, there is some reduced execution time, especially for
      the critical section of btrfs_commit_transaction().
      
      I added extra trace_printk() to measure the following per-transaction
      execution time:
      
      - Critical section of btrfs_commit_transaction()
        By re-using the existing update_commit_stats() function, which
        has already calculated the interval correctly.
      
      - The while() loop for btrfs_write_dirty_block_groups()
        Although this includes the execution time of btrfs_run_delayed_refs(),
        it should still be representative overall.
      
      Both results involve transids 7~30, i.e. the same number of
      committed transactions.
      
      The result looks like this:
      
                            |      Before       |     After      |  Diff
      ----------------------+-------------------+----------------+--------
      Transaction interval  | 229247198.5       | 215016933.6    | -6.2%
      Block group interval  | 23133.33333       | 18970.83333    | -18.0%
      
      The change in block group item updates is more obvious, as skipped block
      group item updates also mean less delayed refs.
      
      And the overall execution time for that block group update loop is
      pretty small, thus we can assume the extent tree is already mostly
      cached.  If we could skip an uncached tree block, the change would
      be more obvious.
      
      Unfortunately the overall reduction in commit transaction critical
      section is much smaller, as the block group item updates loop is not
      really the major part, at least not for the above fio script.
      
      But still we have an observable reduction in the critical section.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7248e0ce
    • btrfs: convert __TRANS_* defines to enum bits · cc37ea61
      David Sterba authored
      The base transaction bits can be defined as bits in a contiguous
      sequence, although right now there's a hole from bit 1 to 8.
      
      The bits are used for btrfs_trans_handle::type, and there's another
      set of TRANS_STATE_* defines that are for btrfs_transaction::state.
      They are mutually exclusive, though the hole in the sequence looks
      like it was made for the states.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cc37ea61
    • btrfs: convert QGROUP_* defines to enum bits · e0a8b9a7
      David Sterba authored
      The defines/enums are used only for tracepoints and are not part of the
      on-disk format.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e0a8b9a7
    • btrfs: convert EXTENT_* bits to enums · d3b4d0fd
      David Sterba authored
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d3b4d0fd
    • David Sterba · c7321b76
    • btrfs: add helper for bit enumeration · d549ff7b
      David Sterba authored
      Define helper macro that can be used in enum {} to utilize the automatic
      increment to define all bits without directly defining the values or
      using additional linear bits.
      
      1. capture the sequence value, N
      2. use the value to define the given enum with N-th bit set
      3. reset the sequence back to N
      
      Use for enums that do not require fixed values for symbolic names (like
      for on-disk structures):
      
      enum {
      	ENUM_BIT(FIRST),
      	ENUM_BIT(SECOND),
      	ENUM_BIT(THIRD)
      };
      
      Where the values would be 0x1, 0x2 and 0x4.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d549ff7b
    • btrfs: make module init/exit match their sequence · 5565b8e0
      Qu Wenruo authored
      [BACKGROUND]
      In theory init_btrfs_fs() and exit_btrfs_fs() should mirror each
      other's sequence, thus normally they should look like this:
      
          init_btrfs_fs()   |   exit_btrfs_fs()
      ----------------------+------------------------
          init_A();         |
          init_B();         |
          init_C();         |
                            |   exit_C();
                            |   exit_B();
                            |   exit_A();
      
      The same should hold for the error path of init_btrfs_fs().

      But that's not the case: some exit functions don't match the
      sequence of their init functions in init_btrfs_fs().
      
      Furthermore in init_btrfs_fs(), we need to add a new error label for
      each new init function.  This does not scale well, especially as we
      have recently been adding several new functions to init_btrfs_fs().
      
      [ENHANCEMENT]
      The patch will introduce the following things to enhance the situation:
      
      - struct init_sequence
        Just a wrapper of init and exit function pointers.
      
        The init function must return int, thus some init functions need
        to be updated to return 0.

        The exit function can be NULL, as some init steps just output a
        message.
      
      - struct mod_init_seq[] array
        This is a const array, recording all the initialization we need to do
        in init_btrfs_fs(), and the order follows the old init_btrfs_fs().
      
      - bool mod_init_result[] array
        This is a bool array, recording if we have initialized one entry in
        mod_init_seq[].
      
        The reason to split mod_init_seq[] and mod_init_result[] is to avoid
        section mismatch in reference.
      
        All init functions are in .init.text, but if mod_init_seq[]
        recorded the @initialized member it could no longer be const, thus
        would be put into the .data section and cause a modpost warning.
      
      For init_btrfs_fs() we just call all init functions in the order of
      the mod_init_seq[] array, and after each call set the corresponding
      mod_init_result[] entry to true.
      
      For exit_btrfs_fs() and the error handling path of init_btrfs_fs(),
      we just iterate mod_init_seq[] in reverse order and skip all
      uninitialized entries.
      
      With this patch, init_btrfs_fs()/exit_btrfs_fs() will be much easier to
      expand and will always follow the strict order.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5565b8e0
    • btrfs: remove gfp_t flag from btrfs_tree_mod_log_insert_key() · 33cff222
      Filipe Manana authored
      All callers of btrfs_tree_mod_log_insert_key() are now passing a
      GFP_NOFS flag to it, so remove the flag from it and from
      alloc_tree_mod_elem(), and use GFP_NOFS directly within
      alloc_tree_mod_elem().
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      33cff222
    • btrfs: switch GFP_ATOMIC to GFP_NOFS when fixing up low keys · 879b2221
      Filipe Manana authored
      When fixing up the first key of each node above the current level, at
      fixup_low_keys(), we are doing a GFP_ATOMIC allocation for inserting an
      operation record for the tree mod log. However we can do just fine with
      GFP_NOFS nowadays. The need for GFP_ATOMIC was for the old days when we
      had custom locks with spinning behaviour for extent buffers and we were
      in spinning mode while at fixup_low_keys(). Now we use rw semaphores for
      extent buffer locks, so we can safely use GFP_NOFS.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      879b2221
    • btrfs: re-check reclaim condition in reclaim worker · 81531225
      Boris Burkov authored
      I have observed the following case play out and lead to unnecessary
      relocations:
      
      1. write a file across multiple block groups
      2. delete the file
      3. several block groups fall below the reclaim threshold
      4. reclaim the first, moving extents into the others
      5. reclaim the others which are now actually very full, leading to poor
         reclaim behavior with lots of writing, allocating new block groups,
         etc.
      
      Fix this by re-checking the reclaim condition when the reclaim
      worker actually runs, and skipping block groups that no longer meet
      it.  I believe the risk of missing some reasonable reclaims is worth
      it when traded off against the savings of avoiding overfull
      reclaims.
      
      Going forward, it could be interesting to make the check more advanced
      (zoned aware, fragmentation aware, etc...) so that it can be a really
      strong signal both at extent delete and reclaim time.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      81531225
    • btrfs: skip reclaim if block_group is empty · cc4804bf
      Boris Burkov authored
      As we delete extents from a block group, at some deletion we cross below
      the reclaim threshold. It is possible we are still in the middle of
      deleting more extents and might soon hit 0. If the block group is empty
      by the time the reclaim worker runs, we will still relocate it.
      
      This works just fine, as relocating an empty block group ultimately
      results in properly deleting it. However, we have more direct ways of
      removing empty block groups in the cleaner thread. Those are either
      async discard or the unused_bgs list. In fact, when we decide whether to
      relocate a block group during extent deletion, we do check for emptiness
      and prefer the discard/unused_bgs mechanisms when possible.
      
      Not using relocation for this case reduces some modest overhead from
      empty bg relocation:
      
      - extra transactions
      - extra metadata use/churn for creating relocation metadata
      - trying to read the extent tree to look for extents (and in this case
        finding none)
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cc4804bf
    • btrfs: avoid unnecessary resolution of indirect backrefs during fiemap · 6976201f
      Filipe Manana authored
      During fiemap, when determining if a data extent is shared or not, if we
      don't find the extent is directly shared, then we need to determine if
      it's shared through subtrees. For that we need to resolve the indirect
      reference we found in order to figure out the path in the inode's fs tree,
      which is a path starting at the fs tree's root node and going down to the
      leaf that contains the file extent item that points to the data extent.
      We then proceed to determine if any extent buffer in that path is shared
      with other trees or not.
      
      However when the generation of the data extent is more recent than the
      last generation used to snapshot the root, we don't need to determine
      the path, since the data extent can not be shared through snapshots.
      For this case we currently still determine the leaf of that path (at
      find_parent_nodes()), but then stop determining the other nodes in
      the path (at btrfs_is_data_extent_shared()), as that is pointless.
      
      So do the check of the data extent's generation earlier, at
      find_parent_nodes(), before trying to resolve the indirect reference to
      determine the leaf in the path. This saves us from doing one expensive
      b+tree search in the fs tree of our target inode, as well as other minor
      work.
      
      The following test was run on a non-debug kernel (Debian's default kernel
      config):
      
         $ cat test-fiemap.sh
         #!/bin/bash
      
         DEV=/dev/sdi
         MNT=/mnt/sdi
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         # Use compression to quickly create files with a lot of extents
         # (each with a size of 128K).
         mount -o compress=lzo $DEV $MNT
      
         # 40G gives 327680 extents, each with a size of 128K.
         xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
      
         # Add some more files to increase the size of the fs and extent
         # trees (in the real world there's a lot of files and extents
         # from other files).
         xfs_io -f -c "pwrite -S 0xcd -b 1M 0 20G" $MNT/file1
         xfs_io -f -c "pwrite -S 0xef -b 1M 0 20G" $MNT/file2
         xfs_io -f -c "pwrite -S 0x73 -b 1M 0 20G" $MNT/file3
      
         umount $MNT
         mount -o compress=lzo $DEV $MNT
      
         start=$(date +%s%N)
         filefrag $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds (metadata not cached)"
         echo
      
         start=$(date +%s%N)
         filefrag $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds (metadata cached)"
      
         umount $MNT
      
      Before applying this patch:
      
         (...)
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 1285 milliseconds (metadata not cached)
      
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 742 milliseconds (metadata cached)
      
      After applying this patch:
      
         (...)
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 689 milliseconds (metadata not cached)
      
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 393 milliseconds (metadata cached)
      
      That's a -46.4% total reduction for the metadata not cached case, and
      a -47.0% reduction for the cached metadata case.
      
      The test is somewhat limited in the sense that the gains may be higher in
      practice, because in the test the filesystem is small, so we have small
      fs and extent trees, plus there's no concurrent access to the trees as
      well, therefore no lock contention there.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6976201f
    • btrfs: avoid duplicated resolution of indirect backrefs during fiemap · 877c1476
      Filipe Manana authored
      During fiemap, when determining if a data extent is shared or not, if we
      don't find the extent is directly shared, then we need to determine if
      it's shared through subtrees. For that we need to resolve the indirect
      reference we found in order to figure out the path in the inode's fs tree,
      which is a path starting at the fs tree's root node and going down to the
      leaf that contains the file extent item that points to the data extent.
      We then proceed to determine if any extent buffer in that path is shared
      with other trees or not.
      
      Currently whenever we find the data extent that a file extent item points
      to is not directly shared, we always resolve the path in the fs tree, and
      then check if any extent buffer in the path is shared. This is a lot of
      work and when we have file extent items that belong to the same leaf, we
      have the same path, so we only need to calculate it once.
      
      This change does that, it keeps track of the current and previous leaf,
      and when we find that a data extent is not directly shared, we try to
      compute the fs tree path only once and then use it for every other file
      extent item in the same leaf, using the existing cached path result for
      the leaf as long as the cache results are valid.
      
      This saves us from doing expensive b+tree searches in the fs tree of our
      target inode, as well as other minor work.
      
      The following test was run on a non-debug kernel (Debian's default kernel
      config):
      
         $ cat test-with-snapshots.sh
         #!/bin/bash
      
         DEV=/dev/sdi
         MNT=/mnt/sdi
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         # Use compression to quickly create files with a lot of extents
         # (each with a size of 128K).
         mount -o compress=lzo $DEV $MNT
      
         # 40G gives 327680 extents, each with a size of 128K.
         xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
      
         # Add some more files to increase the size of the fs and extent
         # trees (in the real world there's a lot of files and extents
         # from other files).
         xfs_io -f -c "pwrite -S 0xcd -b 1M 0 20G" $MNT/file1
         xfs_io -f -c "pwrite -S 0xef -b 1M 0 20G" $MNT/file2
         xfs_io -f -c "pwrite -S 0x73 -b 1M 0 20G" $MNT/file3
      
         # Create a snapshot so all the extents become indirectly shared
         # through subtrees, with a generation less than or equals to the
         # generation used to create the snapshot.
         btrfs subvolume snapshot -r $MNT $MNT/snap1
      
         umount $MNT
         mount -o compress=lzo $DEV $MNT
      
         start=$(date +%s%N)
         filefrag $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds (metadata not cached)"
         echo
      
         start=$(date +%s%N)
         filefrag $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds (metadata cached)"
      
         umount $MNT
      
      Result before applying this patch:
      
         (...)
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 1204 milliseconds (metadata not cached)
      
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 729 milliseconds (metadata cached)
      
      Result after applying this patch:
      
         (...)
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 732 milliseconds (metadata not cached)
      
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 421 milliseconds (metadata cached)
      
      That's a -46.1% total reduction for the metadata not cached case, and
      a -42.2% reduction for the cached metadata case.
      
      The test is somewhat limited in the sense that the gains may be higher in
      practice, because in the test the filesystem is small, so we have small
      fs and extent trees, plus there's no concurrent access to the trees as
      well, therefore no lock contention there.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      877c1476
    • btrfs: move up backref sharedness cache store and lookup functions · 583f4ac5
      Filipe Manana authored
      Move the static functions that look up and store the sharedness
      check of an extent buffer to a location above find_parent_nodes(),
      because in the next patch the lookup function will be used by
      find_parent_nodes().
      The store function is also moved just because it's the counter part
      to the lookup function and it's best to have their definitions close
      together.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      583f4ac5
    • btrfs: cache sharedness of the last few data extents during fiemap · 73e339e6
      Filipe Manana authored
      During fiemap we process all the file extent items of an inode, by their
      file offset order (left to right b+tree order), and then check if the data
      extent they point at is shared or not. Until now we didn't cache those
      results, we only did it for b+tree nodes/leaves since for each unique
      b+tree path we have access to hundreds of file extent items. However, it
      is also common to repeat checking the sharedness of a particular data
      extent in a very short time window, and the cases that lead to that are
      the following:
      
      1) COW writes.
      
         If we have a file extent item like this:
      
                        [ bytenr X, offset = 0, num_bytes = 512K ]
         file offset    0                                        512K
      
         Then a 4K write into file offset 64K happens, we end up with the
         following file extent item layout:
      
                        [ bytenr X, offset = 0, num_bytes = 64K ]
         file offset    0                                       64K
      
                        [ bytenr Y, offset = 0, num_bytes = 4K ]
         file offset   64K                                     68K
      
                        [ bytenr X, offset = 68K, num_bytes = 444K ]
         file offset   68K                                         512K
      
         So during fiemap we will check for the sharedness of the data
         extent with bytenr X twice.  Typically for COW writes and for at
         least moderately updated files, we end up with many file extent
         items that point to different sections of the same data extent.
      
      2) Writing into a NOCOW file after a snapshot is taken.
      
         This happens if the target extent was created in a generation older
         than the generation where the last snapshot for the root (the tree the
         inode belongs to) was made.
      
         This leads to a scenario like the previous one.
      
      3) Writing into sections of a preallocated extent.
      
         For example if a file has the following layout:
      
         [ bytenr X, offset = 0, num_bytes = 1M, type = prealloc ]
         0                                                       1M
      
         After doing a 4K write into file offset 0 and another 4K write into
         offset 512K, we get the following layout:
      
            [ bytenr X, offset = 0, num_bytes = 4K, type = regular ]
            0                                                      4K
      
            [ bytenr X, offset = 4K, num_bytes = 508K, type = prealloc ]
           4K                                                          512K
      
            [ bytenr X, offset = 512K, num_bytes = 4K, type = regular ]
         512K                                                         516K
      
            [ bytenr X, offset = 516K, num_bytes = 508K, type = prealloc ]
         516K                                                            1M
      
         So we end up with 4 consecutive file extent items pointing to the data
         extent at bytenr X.
      
      4) Hole punching in the middle of an extent.
      
         For example if a file has the following file extent item:
      
         [ bytenr X, offset = 0, num_bytes = 8M ]
         0                                      8M
      
         And then a hole is punched for the file range [4M, 6M[, so our
         file extent item is split into two:
      
         [ bytenr X, offset = 0, num_bytes = 4M  ]
         0                                       4M
      
         [ 2M hole, implicit or explicit depending on NO_HOLES feature ]
         4M                                                            6M
      
         [ bytenr X, offset = 6M, num_bytes = 2M  ]
         6M                                       8M
      
         Again, we end up with two file extent items pointing to the same
         data extent.
      
      5) When reflinking (clone and deduplication) within the same file.
         This is probably the least common case of all.
      
      In cases 1, 2, 3 and 4, when we have multiple file extent items that
      point to the same data extent, their distance is usually short,
      typically separated by a few slots in a b+tree leaf (or across
      sibling leaves). For case 5, the distance can vary a lot, but it's
      typically the least common case of all.
      
      This change caches the result of the sharedness checks for data extents,
      but only for the last 8 extents that we notice that our inode refers to
      with multiple file extent items. Whenever we want to check if a data
      extent is shared, we lookup the cache which consists of doing a linear
      scan of an 8 elements array, and if we find the data extent there, we
      return the result and don't check the extent tree and delayed refs.
      
      The array/cache is small so that doing the search has no noticeable
      negative impact on the performance in case we don't have file extent items
      within a distance of 8 slots that point to the same data extent.
      
      Slots in the cache/array are overwritten in a simple round robin fashion,
      as that approach fits very well.
      
      Using this simple approach with only the last 8 data extents seen is
      effective as usually when multiple file extents items point to the same
      data extent, their distance is within 8 slots. It also uses very little
      memory and the time to cache a result or lookup the cache is negligible.
      
      The following test was run on non-debug kernel (Debian's default kernel
      config) to measure the impact in the case of COW writes (first example
      given above), where we run fiemap after overwriting 33% of the blocks of
      a file:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/sdi
         MNT=/mnt/sdi
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         FILE_SIZE=$((1 * 1024 * 1024  * 1024))
      
         # Create the file full of 1M extents.
         xfs_io -f -s -c "pwrite -b 1M -S 0xab 0 $FILE_SIZE" $MNT/foobar
      
         block_count=$((FILE_SIZE / 4096))
         # Overwrite about 33% of the file blocks.
         overwrite_count=$((block_count / 3))
      
         echo -e "\nOverwriting $overwrite_count 4K blocks (out of $block_count)..."
         RANDOM=123
         for ((i = 1; i <= $overwrite_count; i++)); do
             off=$(((RANDOM % block_count) * 4096))
             xfs_io -c "pwrite -S 0xcd $off 4K" $MNT/foobar > /dev/null
             echo -ne "\r$i blocks overwritten..."
         done
         echo -e "\n"
      
         # Unmount and mount to clear all cached metadata.
         umount $MNT
         mount $DEV $MNT
      
         start=$(date +%s%N)
         filefrag $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds"
      
         umount $MNT
      
      Result before applying this patch:
      
         fiemap took 128 milliseconds
      
      Result after applying this patch:
      
         fiemap took 92 milliseconds   (-28.1%)
      
       The test is somewhat limited in the sense that the gains may be higher
       in practice: in the test the filesystem is small, so the fs and extent
       trees are small too, and there is no concurrent access to the trees,
       therefore no lock contention.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      73e339e6
    • Filipe Manana's avatar
      btrfs: remove useless logic when finding parent nodes · 56f5c199
      Filipe Manana authored
      At find_parent_nodes(), at its last step, when iterating over all direct
      references, we are checking if we have a share context and if we have
      a reference with a different root from the one in the share context.
       However that logic is pointless for two reasons:
      
      1) After the previous patch in the series (subject "btrfs: remove roots
         ulist when checking data extent sharedness"), the roots argument is
         always NULL when using a share check context (struct share_check), so
         this code is never triggered;
      
      2) Even before that previous patch, we could not hit this code because
         if we had a reference with a root different from the one in our share
         context, then we would have exited earlier when doing either of the
         following:
      
            - Adding a second direct ref to the direct refs red black tree
              resulted in extent_is_shared() returning true when called from
              add_direct_ref() -> add_prelim_ref(), after processing delayed
              references or while processing references in the extent tree;
      
            - When adding a second reference to the indirect refs red black
              tree (same as above, extent_is_shared() returns true);
      
            - If we only have one indirect reference and no direct references,
              then when resolving it at resolve_indirect_refs() we immediately
              return that the target extent is shared, therefore never reaching
              that loop that iterates over all direct references at
              find_parent_nodes();
      
            - If we have 1 indirect reference and 1 direct reference, then we
              also exit early because extent_is_shared() ends up returning true
              when called through add_prelim_ref() (by add_direct_ref() or
              add_indirect_ref()) or add_delayed_refs(). Same applies as when
              having a combination of direct, indirect and indirect with missing
              key references.
      
         This logic had been obsoleted since commit 3ec4d323 ("btrfs:
         allow backref search checks for shared extents"), which introduced the
         early exits in case an extent is shared.
      
      So just remove that logic, and assert at find_parent_nodes() that when we
      have a share context we don't have a roots ulist and that we haven't found
      the extent to be directly shared after processing delayed references and
      all references from the extent tree.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      56f5c199
    • Filipe Manana's avatar
      btrfs: remove roots ulist when checking data extent sharedness · b6296858
      Filipe Manana authored
       Currently btrfs_is_data_extent_shared() is passing a ulist for the roots
       argument of find_parent_nodes(); however, it does not use that ulist for
       anything, and in this context the list always ends up with at most one
       element.
      
      Since find_parent_nodes() is able to deal with a NULL ulist for its roots
      argument, make btrfs_is_data_extent_shared() pass it NULL and avoid the
       burden of allocating memory for the unused roots ulist, initializing it,
      releasing it and allocating one struct ulist_node for it during the call
      to find_parent_nodes().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b6296858
    • Filipe Manana's avatar
      btrfs: move ulists to data extent sharedness check context · 84a7949d
      Filipe Manana authored
      When calling btrfs_is_data_extent_shared() we pass two ulists that were
      allocated by the caller. This is because the single caller, fiemap, calls
      btrfs_is_data_extent_shared() multiple times and the ulists can be reused,
      instead of allocating new ones before each call and freeing them after
      each call.
      
      Now that we have a context structure/object that we pass to
      btrfs_is_data_extent_shared(), we can move those ulists to it, and hide
      their allocation and the context's allocation in a helper function, as
       well as the freeing of the ulists and the context object. This reduces
       the number of parameters passed to btrfs_is_data_extent_shared(), avoids
       the need to pass the ulists from extent_fiemap() to fiemap_process_hole(),
       and spares the caller from allocating and releasing the ulists.
      
       Also rename one of the ulists from 'tmp' / 'tmp_ulist' to 'refs', since
       that's a much better name as it reflects what the list is used for (and
       matches the argument name for find_parent_nodes()).
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      84a7949d
    • Filipe Manana's avatar
      btrfs: turn the backref sharedness check cache into a context object · 61dbb952
      Filipe Manana authored
      Right now we are using a struct btrfs_backref_shared_cache to pass state
      across multiple btrfs_is_data_extent_shared() calls. The structure's name
      closely follows its current purpose, which is to cache previous checks
      for the sharedness of metadata extents. However we will start using the
      structure for more things other than caching sharedness checks, so rename
      it to struct btrfs_backref_share_check_ctx.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      61dbb952
    • Filipe Manana's avatar
      btrfs: directly pass the inode to btrfs_is_data_extent_shared() · ceb707da
      Filipe Manana authored
      Currently we pass a root and an inode number as arguments for
      btrfs_is_data_extent_shared() and the inode number is always from an
      inode that belongs to that root (it wouldn't make sense otherwise).
      In every context that we call btrfs_is_data_extent_shared() (fiemap only),
      we have an inode available, so directly pass the inode to the function
      instead of a root and inode number. This reduces the number of parameters
      and it makes the function's signature conform to most other functions we
      have.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ceb707da
    • Filipe Manana's avatar
      btrfs: remove checks for a 0 inode number during backref walking · a0a5472a
      Filipe Manana authored
      When doing backref walking to determine if an extent is shared, we are
      testing if the inode number, stored in the 'inum' field of struct
       share_check, is 0. However that can never be the case, since all
       instances of the structure are created at btrfs_is_data_extent_shared(),
       which always initializes it with the inode number from a fs tree (and the
       number for any inode from any tree can never be 0). So remove the checks.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a0a5472a
    • Filipe Manana's avatar
      btrfs: remove checks for a root with id 0 during backref walking · c9024219
      Filipe Manana authored
       When doing backref walking to determine if an extent is shared, we are
       testing if the root_objectid of the given share_check struct is 0, but
       that is an impossible case, since btrfs_is_data_extent_shared() always
       initializes the root_objectid field with the id of the given root, and
       no root can have an objectid of 0. So remove those checks.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c9024219
    • Filipe Manana's avatar
      btrfs: drop redundant bflags initialization when allocating extent buffer · 206c1d32
      Filipe Manana authored
      When allocating an extent buffer, at __alloc_extent_buffer(), there's no
      point in explicitly assigning zero to the bflags field of the new extent
      buffer because we allocated it with kmem_cache_zalloc().
      
      So just remove the redundant initialization, it saves one mov instruction
      in the generated assembly code for x86_64 ("movq $0x0,0x10(%rax)").
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      206c1d32
    • Filipe Manana's avatar
      btrfs: drop pointless memset when cloning extent buffer · b98c6cd5
      Filipe Manana authored
      At btrfs_clone_extent_buffer(), before allocating the pages array for the
      new extent buffer we are calling memset() to zero out the pages array of
      the extent buffer. This is pointless however, because the extent buffer
      already has every element in its pages array pointing to NULL, as it was
      allocated with kmem_cache_zalloc(). The memset() was introduced with
      commit dd137dd1 ("btrfs: factor out allocating an array of pages"),
      but even before that commit we already depended on the pages array being
      initialized to NULL for the error paths that need to call
      btrfs_release_extent_buffer().
      
      So remove the memset(), it's useless and slightly increases the object
      text size.
      
      Before this change:
      
         $ size fs/btrfs/extent_io.o
            text	   data	    bss	    dec	    hex	filename
           70580	   5469	     40	  76089	  12939	fs/btrfs/extent_io.o
      
      After this change:
      
         $ size fs/btrfs/extent_io.o
            text	   data	    bss	    dec	    hex	filename
           70564	   5469	     40	  76073	  12929	fs/btrfs/extent_io.o
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b98c6cd5
    • Filipe Manana's avatar
      btrfs: skip unnecessary delalloc search during fiemap and lseek · a2853ffc
      Filipe Manana authored
      During fiemap and lseek (hole and data seeking), there's no point in
      iterating the inode's io tree to count delalloc bits if the inode's
      delalloc bytes counter has a value of zero, as that counter is updated
      whenever we set a range for delalloc or clear a range from delalloc.
      
      So skip the counting and io tree iteration if the inode's delalloc bytes
      counter has a value of zero. This helps save time when processing a file
      range corresponding to a hole or prealloc (unwritten) extent.
      
      This patch is part of a series comprised of the following patches:
      
        btrfs: get the next extent map during fiemap/lseek more efficiently
        btrfs: skip unnecessary extent map searches during fiemap and lseek
        btrfs: skip unnecessary delalloc search during fiemap and lseek
      
      The following test was performed on a release kernel (Debian's default
      kernel config) before and after applying those 3 patches.
      
         # Wrapper to call fiemap in extent count only mode.
         # (struct fiemap::fm_extent_count set to 0)
         $ cat fiemap.c
         #include <stdio.h>
         #include <unistd.h>
         #include <stdlib.h>
         #include <fcntl.h>
         #include <errno.h>
         #include <string.h>
         #include <sys/ioctl.h>
         #include <linux/fs.h>
         #include <linux/fiemap.h>
      
         int main(int argc, char **argv)
         {
                  struct fiemap fiemap = { 0 };
                  int fd;
      
                  if (argc != 2) {
                          printf("usage: %s <path>\n", argv[0]);
                          return 1;
                  }
                  fd = open(argv[1], O_RDONLY);
                  if (fd < 0) {
                          fprintf(stderr, "error opening file: %s\n",
                                  strerror(errno));
                          return 1;
                  }
      
                  /* fiemap.fm_extent_count set to 0, to count extents only. */
                  fiemap.fm_length = FIEMAP_MAX_OFFSET;
                  if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
                          fprintf(stderr, "fiemap error: %s\n",
                                  strerror(errno));
                          return 1;
                  }
                  close(fd);
                  printf("fm_mapped_extents = %d\n", fiemap.fm_mapped_extents);
      
                  return 0;
         }
      
         $ gcc -o fiemap fiemap.c
      
      And the wrapper shell script that creates a file with many holes and runs
      fiemap against it:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/sdi
         MNT=/mnt/sdi
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         FILE_SIZE=$((1 * 1024 * 1024 * 1024))
         echo -n > $MNT/foobar
         for ((off = 0; off < $FILE_SIZE; off += 8192)); do
                 xfs_io -c "pwrite -S 0xab $off 4K" $MNT/foobar > /dev/null
         done
      
         # flush all delalloc
         sync
      
         start=$(date +%s%N)
         ./fiemap $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds"
      
         umount $MNT
      
      Result before applying patchset:
      
         fm_mapped_extents = 131072
         fiemap took 63 milliseconds
      
      Result after applying patchset:
      
         fm_mapped_extents = 131072
         fiemap took 39 milliseconds   (-38.1%)
      
       Running the same test for a 512M file instead of a 1G file gave the
      following results.
      
      Result before applying patchset:
      
         fm_mapped_extents = 65536
         fiemap took 29 milliseconds
      
      Result after applying patchset:
      
         fm_mapped_extents = 65536
         fiemap took 20 milliseconds    (-31.0%)
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a2853ffc
    • Filipe Manana's avatar
      btrfs: skip unnecessary extent map searches during fiemap and lseek · 013f9c70
      Filipe Manana authored
      If we have no outstanding extents it means we don't have any extent maps
      corresponding to delalloc that is flushing, as when an ordered extent is
      created we increment the number of outstanding extents to 1 and when we
       remove the ordered extent we decrement them by 1. So skip extent map
       tree searches if the number of outstanding extents is 0, saving time,
       as the tree is not empty if we have previously made some reads or
       flushed delalloc, and in those cases it can have a very large number
       of extent maps for files with many extents.
      
      This helps save time when processing a file range corresponding to a hole
      or prealloc (unwritten) extent.
      
      The next patch in the series has a performance test in its changelog and
      its subject is:
      
          "btrfs: skip unnecessary delalloc search during fiemap and lseek"
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      013f9c70
    • Filipe Manana's avatar
      btrfs: get the next extent map during fiemap/lseek more efficiently · d47704bd
      Filipe Manana authored
      At find_delalloc_subrange(), when we need to get the next extent map, we
      do a full search on the extent map tree (a red black tree). This is fine
      but it's a lot more efficient to simply use rb_next(), which typically
      requires iterating over less nodes of the tree and never needs to compare
      the ranges of nodes with the one we are looking for.
      
      So add a public helper to extent_map.{h,c} to get the extent map that
      immediately follows another extent map, using rb_next(), and use that
      helper at find_delalloc_subrange().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d47704bd
    • Qu Wenruo's avatar
      btrfs: raid56: make it more explicit that cache rbio should have all its data sectors uptodate · 88074c8b
      Qu Wenruo authored
      For Btrfs RAID56, we have a caching system for btrfs raid bios (rbio).
      
      We call cache_rbio_pages() to mark a qualified rbio ready for cache.
      
       This happens at:
      
      - finish_rmw()
      
         At this point, we have already read all necessary sectors, and along
         with the rbio sectors we have covered all data stripes.
      
      - __raid_recover_end_io()
      
         At this point, we have rebuilt the rbio, thus all data sectors
         involved (either from the stripe or bio list) are uptodate now.
      
       Thus at the time cache_rbio_pages() is called, we should have all data
       sectors uptodate.
      
       This patch will make it explicit that all data sectors are uptodate
       when cache_rbio_pages() is called, mostly to prepare for the incoming
       verification at RMW time.
      
      This patch will add:
      
      - Extra ASSERT()s in cache_rbio_pages()
        This is to make sure all data sectors, which are not covered by bio,
        are already uptodate.
      
       - Extra ASSERT()s in steal_rbio()
         Since only a cached rbio can be stolen, every data sector should
         already be uptodate in the source rbio.
      
       - Update __raid_recover_end_io() to update recovered sector->uptodate
         Previously __raid_recover_end_io() would only mark failed sectors
         uptodate if it was doing an RMW.

         But this can trigger the new ASSERT()s, as in the recovery case a
         recovered failed sector will not be marked uptodate, and will
         trigger the ASSERT() in a later cache_rbio_pages() call.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      88074c8b
    • Qu Wenruo's avatar
      btrfs: raid56: allocate memory separately for rbio pointers · 797d74b7
      Qu Wenruo authored
      Currently inside alloc_rbio(), we allocate a larger memory to contain
      the following members:
      
      - struct btrfs_raid_rbio itself
      - stripe_pages array
      - bio_sectors array
      - stripe_sectors array
      - finish_pointers array
      
       Then we update the rbio pointers to point to the extra space after the
       rbio structure itself.

       Thus it introduced a complex CONSUME_ALLOC() macro to help with this.

       This is too hacky, and it makes later pointer expansion harder.
      
      This patch will change it to use regular kcalloc() for each pointer
      inside btrfs_raid_bio, making the later expansion much easier.
      
      And introduce a helper free_raid_bio_pointers() to free up all the
      pointer members in btrfs_raid_bio, which will be used in both
      free_raid_bio() and error path of alloc_rbio().
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      797d74b7
    • Qu Wenruo's avatar
      btrfs: raid56: cleanup for function __free_raid_bio() · ff2b64a2
      Qu Wenruo authored
      The cleanup involves two things:
      
      - Remove the "__" prefix
         There is no naming conflict.
      
      - Remove the forward declaration
        There is no special function call involved.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ff2b64a2
    • Josef Bacik's avatar
      btrfs: introduce BTRFS_RESERVE_FLUSH_EMERGENCY · 765c3fe9
      Josef Bacik authored
      Inside of FB, as well as some user reports, we've had a consistent
      problem of occasional ENOSPC transaction aborts.  Inside FB we were
      seeing ~100-200 ENOSPC aborts per day in the fleet, which is a really
      low occurrence rate given the size of our fleet, but it's not nothing.
      
      There are two causes of this particular problem.
      
      First is delayed allocation.  The reservation system for delalloc
      assumes that contiguous dirty ranges will result in 1 file extent item.
      However if there is memory pressure that results in fragmented writeout,
      or there is fragmentation in the block groups, this won't necessarily be
      true.  Consider the case where we do a single 256MiB write to a file and
      then close it.  We will have 1 reservation for the inode update, the
      reservations for the checksum updates, and 1 reservation for the file
      extent item.  At some point later we decide to write this entire range
      out, but we're so fragmented that we break this into 100 different file
      extents.  Since we've already closed the file and are no longer writing
      to it there's nothing to trigger a refill of the delalloc block rsv to
      satisfy the 99 new file extent reservations we need.  At this point we
      exhaust our delalloc reservation, and we begin to steal from the global
      reserve.  If you have enough of these cases going in parallel you can
      easily exhaust the global reserve, get an ENOSPC at
      btrfs_alloc_tree_block() time, and then abort the transaction.
      
      The other case is the delayed refs reserve.  The delayed refs reserve
      updates its size based on outstanding delayed refs and dirty block
      groups.  However we only refill this block reserve when returning
      excess reservations and when we call btrfs_start_transaction(root, X).
      We will reserve 2*X credits at transaction start time, and fill in X
      into the delayed refs reserve to make sure it stays topped off.
      Generally this works well, but clearly has downsides.  If we do a
      particularly delayed ref heavy operation we may never catch up in our
      reservations.  Additionally running delayed refs generates more delayed
      refs, and at that point we may be committing the transaction and have no
      way to trigger a refill of our delayed refs rsv.  Then a similar thing
      occurs with the delalloc reserve.
      
      Generally speaking we well over-reserve in all of our block rsvs.  If we
      reserve 1 credit we're usually reserving around 264k of space, but we'll
      often not use any of that reservation, or use a few blocks of that
      reservation.  We can be reasonably sure that as long as you were able to
      reserve space up front for your operation you'll be able to find space
      on disk for that reservation.
      
      So introduce a new flushing state, BTRFS_RESERVE_FLUSH_EMERGENCY.  This
      gets used in the case that we've exhausted our reserve and the global
      reserve.  It simply forces a reservation if we have enough actual space
      on disk to make the reservation, which is almost always the case.  This
      keeps us from hitting ENOSPC aborts in these odd occurrences where we've
      not kept up with the delayed work.
      
      Fixing this in a complete way is going to be relatively complicated and
      time consuming.  This patch is what I discussed with Filipe earlier this
      year, and what I put into our kernels inside FB.  With this patch we're
      down to 1-2 ENOSPC aborts per week, which is a significant reduction.
       This is a decent stop gap until we can work out a more holistic
      solution to these two corner cases.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      765c3fe9
    • Josef Bacik's avatar
      btrfs: move the btrfs_verity_descriptor_item defs up in ctree.h · 7a66eda3
      Josef Bacik authored
      These are wrapped in CONFIG_FS_VERITY, but we can have the definitions
      without verity enabled.  Move these definitions up with the other
      accessor helpers.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7a66eda3
    • Josef Bacik's avatar
      btrfs: move btrfs_next_old_item into ctree.c · 890d2b1a
      Josef Bacik authored
      This uses btrfs_header_nritems, which I will be moving out of ctree.h.
      In order to avoid needing to include the relevant header in ctree.h,
      simply move this helper function into ctree.c.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ rename parameters ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      890d2b1a
    • Josef Bacik's avatar
      btrfs: move free space cachep's out of ctree.h · eda517fd
      Josef Bacik authored
      This is local to the free-space-cache.c code, remove it from ctree.h and
      inode.c, create new init/exit functions for the cachep, and move it
      locally to free-space-cache.c.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      eda517fd
    • Josef Bacik's avatar
      btrfs: move btrfs_path_cachep out of ctree.h · 226463d7
      Josef Bacik authored
      This is local to the ctree code, remove it from ctree.h and inode.c,
      create new init/exit functions for the cachep, and move it locally to
      ctree.c.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      226463d7