1. 11 Oct, 2022 5 commits
    • Filipe Manana's avatar
      btrfs: fix processing of delayed data refs during backref walking · 4fc7b572
      Filipe Manana authored
      When processing delayed data references during backref walking and we are
      using a share context (we are being called through fiemap), whenever we
      find a delayed data reference for an inode different from the one we are
      interested in, then we immediately exit and consider the data extent as
      shared. This is wrong, because:
      
      1) This might be a DROP reference that will cancel out a reference in the
         extent tree;
      
      2) Even if it's an ADD reference, it may be followed by a DROP reference
         that cancels it out.
      
      In either case we should not exit immediately.
      
      Fix this by never exiting when we find a delayed data reference for
      another inode - instead add the reference and if it does not cancel out
      other delayed reference, we will exit early when we call
      extent_is_shared() after processing all delayed references. If we find
      a drop reference, then signal the code that processes references from
      the extent tree (add_inline_refs() and add_keyed_refs()) to not exit
      immediately if it finds there a reference for another inode, since we
      have delayed drop references that may cancel it out. In this later case
      we exit once we don't have references in the rb trees that cancel out
      each other and have two references for different inodes.
      
      Example reproducer for case 1):
      
         $ cat test-1.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         xfs_io -f -c "pwrite 0 64K" $MNT/foo
         cp --reflink=always $MNT/foo $MNT/bar
      
         echo
         echo "fiemap after cloning:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         rm -f $MNT/bar
         echo
         echo "fiemap after removing file bar:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         umount $MNT
      
      Running it before this patch, the extent is still listed as shared, it has
      the flag 0x2000 (FIEMAP_EXTENT_SHARED) set:
      
         $ ./test-1.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
      Example reproducer for case 2):
      
         $ cat test-2.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         xfs_io -f -c "pwrite 0 64K" $MNT/foo
         cp --reflink=always $MNT/foo $MNT/bar
      
         # Flush delayed references to the extent tree and commit current
         # transaction.
         sync
      
         echo
         echo "fiemap after cloning:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         rm -f $MNT/bar
         echo
         echo "fiemap after removing file bar:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         umount $MNT
      
      Running it before this patch, the extent is still listed as shared, it has
      the flag 0x2000 (FIEMAP_EXTENT_SHARED) set:
      
         $ ./test-2.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
      After this patch, after deleting bar in both tests, the extent is not
      reported with the 0x2000 flag anymore, it gets only the flag 0x1
      (which is FIEMAP_EXTENT_LAST):
      
         $ ./test-1.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128   0x1
      
         $ ./test-2.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128   0x1
      
      These tests will later be converted to a test case for fstests.
      
      Fixes: dc046b10 ("Btrfs: make fiemap not blow when you have lots of snapshots")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4fc7b572
    • David Sterba's avatar
      btrfs: delete stale comments after merge conflict resolution · 295a53cc
      David Sterba authored
      There are two comments in btrfs_cache_block_group that I left when
      resolving conflict between commits ced8ecf0 "btrfs: fix space cache
      corruption and potential double allocations" and 527c490f "btrfs:
      delete btrfs_wait_space_cache_v1_finished".
      
      The former reworked the caching logic to wait until the caching ends in
      btrfs_cache_block_group while the latter only open coded the waiting.
      Both removed btrfs_wait_space_cache_v1_finished, the correct code is
      with the waiting and returning error. Thus the conflict resolution was
      OK.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      295a53cc
    • Josef Bacik's avatar
      btrfs: unlock locked extent area if we have contention · 9e769bd7
      Josef Bacik authored
      In production we hit the following deadlock
      
      task 1			task 2			task 3
      ------			------			------
      fiemap(file)		falloc(file)		fsync(file)
      						  write(0, 1MiB)
      						  btrfs_commit_transaction()
      						    wait_on(!pending_ordered)
      			  lock(512MiB, 1GiB)
      			  start_transaction
      			    wait_on_transaction
      
        lock(0, 1GiB)
          wait_extent_bit(512MiB)
      
      task 4
      ------
      finish_ordered_extent(0, 1MiB)
        lock(0, 1MiB)
        **DEADLOCK**
      
      This occurs because when task 1 does it's lock, it locks everything from
      0-512MiB, and then waits for the 512MiB chunk to unlock.  task 2 will
      never unlock because it's waiting on the transaction commit to happen,
      the transaction commit is waiting for the outstanding ordered extents,
      and then the ordered extent thread is blocked waiting on the 0-1MiB
      range to unlock.
      
      To fix this we have to clear anything we've locked so far, wait for the
      extent_state that we contended on, and then try to re-lock the entire
      range again.
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9e769bd7
    • David Sterba's avatar
      btrfs: send: update command for protocol version check · c86eab81
      David Sterba authored
      For a protocol and command compatibility we have a helper that hasn't
      been updated for v3 yet. We use it for verity so update where necessary.
      
      Fixes: 38622010 ("btrfs: send: add support for fs-verity")
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c86eab81
    • Boris Burkov's avatar
      btrfs: send: allow protocol version 3 with CONFIG_BTRFS_DEBUG · 9971a741
      Boris Burkov authored
      We haven't finalized send stream v3 yet, so gate the send stream version
      behind CONFIG_BTRFS_DEBUG as we want some way to test it.
      
      The original verity send did not check the protocol version, so add that
      actual protection as well.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9971a741
  2. 07 Oct, 2022 1 commit
    • Filipe Manana's avatar
      btrfs: add missing path cache update during fiemap · 96dbcc00
      Filipe Manana authored
      When looking the stored result for a cached path node, if the stored
      result is valid and has a value of true, we must update all the nodes for
      all levels below it with a result of true as well. This is necessary when
      moving from one leaf in the fs tree to the next one, as well as when
      moving from a node at any level to the next node at the same level.
      
      Currently this logic is missing as it was somehow forgotten by a recent
      patch with the subject: "btrfs: speedup checking for extent sharedness
      during fiemap".
      
      This adds the missing logic, which is the counter part to what we do
      when adding a shared node to the cache at store_backref_shared_cache().
      
      Fixes: 12a824dc ("btrfs: speedup checking for extent sharedness during fiemap")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      96dbcc00
  3. 29 Sep, 2022 25 commits
    • Tetsuo Handa's avatar
      btrfs: set generation before calling btrfs_clean_tree_block in btrfs_init_new_buffer · cbddcc4f
      Tetsuo Handa authored
      syzbot is reporting uninit-value in btrfs_clean_tree_block() [1], for
      commit bc877d28 ("btrfs: Deduplicate extent_buffer init code")
      missed that btrfs_set_header_generation() in btrfs_init_new_buffer() must
      not be moved to after clean_tree_block() because clean_tree_block() is
      calling btrfs_header_generation() since commit 55c69072 ("Btrfs:
      Fix extent_buffer usage when nodesize != leafsize").
      
      Since memzero_extent_buffer() will reset "struct btrfs_header" part, we
      can't move btrfs_set_header_generation() to before memzero_extent_buffer().
      Just re-add btrfs_set_header_generation() before btrfs_clean_tree_block().
      
      Link: https://syzkaller.appspot.com/bug?extid=fba8e2116a12609b6c59 [1]
      Reported-by: default avatarsyzbot <syzbot+fba8e2116a12609b6c59@syzkaller.appspotmail.com>
      Fixes: bc877d28 ("btrfs: Deduplicate extent_buffer init code")
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cbddcc4f
    • Filipe Manana's avatar
      btrfs: drop extent map range more efficiently · db21370b
      Filipe Manana authored
      Currently when dropping extent maps for a file range, through
      btrfs_drop_extent_map_range(), we do the following non-optimal things:
      
      1) We lookup for extent maps one by one, always starting the search from
         the root of the extent map tree. This is not efficient if we have
         multiple extent maps in the range;
      
      2) We check on every iteration if we have the 'split' and 'split2' spare
         extent maps in case we need to split an extent map that intersects our
         range but also crosses its boundaries (to the left, to the right or
         both cases). If our target range is for example:
      
             [2M, 8M)
      
         And we have 3 extents maps in the range:
      
             [1M, 3M) [3M, 6M) [6M, 10M[
      
         The on the first iteration we allocate two extent maps for 'split' and
         'split2', and use the 'split' to split the first extent map, so after
         the split we set 'split' to 'split2' and then set 'split2' to NULL.
      
         On the second iteration, we don't need to split the second extent map,
         but because 'split2' is now NULL, we allocate a new extent map for
         'split2'.
      
         On the third iteration we need to split the third extent map, so we
         use the extent map pointed by 'split'.
      
         So we ended up allocating 3 extent maps for splitting, but all we
         needed was 2 extent maps. We never need to allocate more than 2,
         because extent maps that need to be split are always the first one
         and the last one in the target range.
      
      Improve on this by:
      
      1) Using rb_next() to move on to the next extent map. This results in
         iterating over less nodes of the tree and it does not require comparing
         the ranges of nodes to our start/end offset;
      
      2) Allocate the 2 extent maps for splitting before entering the loop and
         never allocate more than 2. In practice it's very rare to have the
         combination of both extent map allocations fail, since we have a
         dedicated slab for extent maps, and also have the need to split two
         extent maps.
      
      This patch is part of a patchset comprised of the following patches:
      
         btrfs: fix missed extent on fsync after dropping extent maps
         btrfs: move btrfs_drop_extent_cache() to extent_map.c
         btrfs: use extent_map_end() at btrfs_drop_extent_map_range()
         btrfs: use cond_resched_rwlock_write() during inode eviction
         btrfs: move open coded extent map tree deletion out of inode eviction
         btrfs: add helper to replace extent map range with a new extent map
         btrfs: remove the refcount warning/check at free_extent_map()
         btrfs: remove unnecessary extent map initializations
         btrfs: assert tree is locked when clearing extent map from logging
         btrfs: remove unnecessary NULL pointer checks when searching extent maps
         btrfs: remove unnecessary next extent map search
         btrfs: avoid pointless extent map tree search when flushing delalloc
         btrfs: drop extent map range more efficiently
      
      And the following fio test was done before and after applying the whole
      patchset, on a non-debug kernel (Debian's default kernel config) on a 12
      cores Intel box with 64G of ram:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/nvme0n1
         MNT=/mnt/nvme0n1
         MOUNT_OPTIONS="-o ssd"
         MKFS_OPTIONS="-R free-space-tree -O no-holes"
      
         cat <<EOF > /tmp/fio-job.ini
         [writers]
         rw=randwrite
         fsync=8
         fallocate=none
         group_reporting=1
         direct=0
         bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
         ioengine=psync
         filesize=2G
         runtime=300
         time_based
         directory=$MNT
         numjobs=8
         thread
         EOF
      
         echo performance | \
             tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
         echo
         echo "Using config:"
         echo
         cat /tmp/fio-job.ini
         echo
      
         umount $MNT &> /dev/null
         mkfs.btrfs -f $MKFS_OPTIONS $DEV
         mount $MOUNT_OPTIONS $DEV $MNT
      
         fio /tmp/fio-job.ini
      
         umount $MNT
      
      Result before applying the patchset:
      
         WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec
      
      Result after applying the patchset:
      
         WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      db21370b
    • Filipe Manana's avatar
      btrfs: avoid pointless extent map tree search when flushing delalloc · b54bb865
      Filipe Manana authored
      When flushing delalloc, in COW mode at cow_file_range(), before entering
      the loop that allocates extents and creates ordered extents, we do a call
      to btrfs_drop_extent_map_range() for the whole range. This is pointless
      because in the loop we call create_io_em(), which will also call
      btrfs_drop_extent_map_range() before inserting the new extent map.
      
      So remove that call at cow_file_range() not only because it is not needed,
      but also because it will make the btrfs_drop_extent_map_range() calls made
      from create_io_em() waste time searching the extent map tree, and that
      tree can be large for files with many extents. It also makes us waste time
      at btrfs_drop_extent_map_range() allocating and freeing the split extent
      maps for nothing.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b54bb865
    • Filipe Manana's avatar
      btrfs: remove unnecessary next extent map search · 6c05813e
      Filipe Manana authored
      At __tree_search(), and its single caller __lookup_extent_mapping(), there
      is no point in finding the next extent map that starts after the search
      offset if we were able to find the previous extent map that ends before
      our search offset, because __lookup_extent_mapping() ignores the next
      acceptable extent map if we were able to find the previous one.
      
      So just return immediately if we were able to find the previous extent
      map, therefore avoiding wasting time iterating the tree looking for the
      next extent map which will not be used by __lookup_extent_mapping().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6c05813e
    • Filipe Manana's avatar
      btrfs: remove unnecessary NULL pointer checks when searching extent maps · 08f088dd
      Filipe Manana authored
      The previous and next pointer arguments passed to __tree_search() are
      never NULL as the only caller of this function, __lookup_extent_mapping(),
      always passes the address of two on stack pointers. So remove the NULL
      checks and add assertions to verify the pointers.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      08f088dd
    • Filipe Manana's avatar
      btrfs: assert tree is locked when clearing extent map from logging · 74333c7d
      Filipe Manana authored
      When calling clear_em_logging() we should have a write lock on the extent
      map tree, as we will try to merge the extent map with the previous and
      next ones in the tree. So assert that we have a write lock.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      74333c7d
    • Filipe Manana's avatar
      btrfs: remove unnecessary extent map initializations · 2e0cdaa0
      Filipe Manana authored
      When allocating an extent map, we use kmem_cache_zalloc() which guarantees
      the returned memory is initialized to zeroes, therefore it's pointless
      to initialize the generation and flags of the extent map to zero again.
      
      Remove those initializations, as they are pointless and slightly increase
      the object text size.
      
      Before removing them:
      
         $ size fs/btrfs/extent_map.o
            text	   data	    bss	    dec	    hex	filename
            9241	    274	     24	   9539	   2543	fs/btrfs/extent_map.o
      
      After removing them:
      
         $ size fs/btrfs/extent_map.o
            text	   data	    bss	    dec	    hex	filename
            9209	    274	     24	   9507	   2523	fs/btrfs/extent_map.o
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2e0cdaa0
    • Filipe Manana's avatar
      btrfs: remove the refcount warning/check at free_extent_map() · ad5d6e91
      Filipe Manana authored
      At free_extent_map(), it's pointless to have a WARN_ON() to check if the
      refcount of the extent map is zero. Such check is already done by the
      refcount_t module and refcount_dec_and_test(), which loudly complains if
      we try to decrement a reference count that is currently 0.
      
      The WARN_ON() dates back to the time when used a regular atomic_t type
      for the reference counter, before we switched to the refcount_t type.
      The main goal of the refcount_t type/module is precisely to catch such
      types of bugs and loudly complain if they happen.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ad5d6e91
    • Filipe Manana's avatar
      btrfs: add helper to replace extent map range with a new extent map · a1ba4c08
      Filipe Manana authored
      We have several places that need to drop all the extent maps in a given
      file range and then add a new extent map for that range. Currently they
      call btrfs_drop_extent_map_range() to delete all extent maps in the range
      and then keep trying to add the new extent map in a loop that keeps
      retrying while the insertion of the new extent map fails with -EEXIST.
      
      So instead of repeating this logic, add a helper to extent_map.c that
      does these steps and name it btrfs_replace_extent_map_range(). Also add
      a comment about why the retry loop is necessary.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a1ba4c08
    • Filipe Manana's avatar
      btrfs: move open coded extent map tree deletion out of inode eviction · 9c9d1b4f
      Filipe Manana authored
      Move the loop that removes all the extent maps from the inode's extent
      map tree during inode eviction out of inode.c and into extent_map.c, to
      btrfs_drop_extent_map_range(). Anything manipulating extent maps or the
      extent map tree should be in extent_map.c.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9c9d1b4f
    • Filipe Manana's avatar
      btrfs: use cond_resched_rwlock_write() during inode eviction · 99ba0c81
      Filipe Manana authored
      At evict_inode_truncate_pages(), instead of manually checking if
      rescheduling is needed, then unlock the extent map tree, reschedule and
      then write lock again the tree, use the helper cond_resched_rwlock_write()
      which does all that.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      99ba0c81
    • Filipe Manana's avatar
      btrfs: use extent_map_end() at btrfs_drop_extent_map_range() · f3109e33
      Filipe Manana authored
      Instead of open coding the end offset calculation of an extent map, use
      the helper extent_map_end() and cache its result in a local variable,
      since it's used several times.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f3109e33
    • Filipe Manana's avatar
      btrfs: move btrfs_drop_extent_cache() to extent_map.c · 4c0c8cfc
      Filipe Manana authored
      The function btrfs_drop_extent_cache() doesn't really belong at file.c
      because what it does is drop a range of extent maps for a file range.
      It directly allocates and manipulates extent maps, by dropping,
      splitting and replacing them in an extent map tree, so it should be
      located at extent_map.c, where all manipulations of an extent map tree
      and its extent maps are supposed to be done.
      
      So move it out of file.c and into extent_map.c. Additionally do the
      following changes:
      
      1) Rename it into btrfs_drop_extent_map_range(), as this makes it more
         clear about what it does. The term "cache" is a bit confusing as it's
         not widely used, "extent maps" or "extent mapping" is much more common;
      
      2) Change its 'skip_pinned' argument from int to bool;
      
      3) Turn several of its local variables from int to bool, since they are
         used as booleans;
      
      4) Move the declaration of some variables out of the function's main
         scope and into the scopes where they are used;
      
      5) Remove pointless assignment of false to 'modified' early in the while
         loop, as later that variable is set and it's not used before that
         second assignment;
      
      6) Remove checks for NULL before calling free_extent_map().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4c0c8cfc
    • Filipe Manana's avatar
      btrfs: fix missed extent on fsync after dropping extent maps · cef7820d
      Filipe Manana authored
      When dropping extent maps for a range, through btrfs_drop_extent_cache(),
      if we find an extent map that starts before our target range and/or ends
      before the target range, and we are not able to allocate extent maps for
      splitting that extent map, then we don't fail and simply remove the entire
      extent map from the inode's extent map tree.
      
      This is generally fine, because in case anyone needs to access the extent
      map, it can just load it again later from the respective file extent
      item(s) in the subvolume btree. However, if that extent map is new and is
      in the list of modified extents, then a fast fsync will miss the parts of
      the extent that were outside our range (that needed to be split),
      therefore not logging them. Fix that by marking the inode for a full
      fsync. This issue was introduced after removing BUG_ON()s triggered when
      the split extent map allocations failed, done by commit 7014cdb4
      ("Btrfs: btrfs_drop_extent_cache should never fail"), back in 2012, and
      the fast fsync path already existed but was very recent.
      
      Also, in the case where we could allocate extent maps for the split
      operations but then fail to add a split extent map to the tree, mark the
      inode for a full fsync as well. This is not supposed to ever fail, and we
      assert that, but in case assertions are disabled (CONFIG_BTRFS_ASSERT is
      not set), it's the correct thing to do to make sure a fast fsync will not
      miss a new extent.
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cef7820d
    • Jeff Layton's avatar
      btrfs: remove stale prototype of btrfs_write_inode · 3050dfa6
      Jeff Layton authored
      This function no longer exists, was removed in 3c427693 ("Btrfs: fix
      btrfs_write_inode vs delayed iput deadlock").
      Signed-off-by: default avatarJeff Layton <jlayton@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3050dfa6
    • Stefan Roesch's avatar
      btrfs: enable nowait async buffered writes · 926078b2
      Stefan Roesch authored
      Enable nowait async buffered writes in btrfs_do_write_iter() and
      btrfs_file_open().
      
      In this version encoded buffered writes have the optimization not
      enabled. Encoded writes are enabled by using an ioctl. io_uring
      currently does not support ioctls. This might be enabled in the future.
      
      Performance results:
      
        For fio the following results have been obtained with a queue depth of
        1 and 4k block size (runtime 600 secs):
      
                       sequential writes:
                       without patch           with patch      libaio     psync
        iops:              55k                    134k          117K       148K
        bw:               221MB/s                 538MB/s       469MB/s    592MB/s
        clat:           15286ns                    82ns         994ns     6340ns
      
      For an io depth of 1, the new patch improves throughput by over two
      times (compared to the existing behavior, where buffered writes are
      processed by an io-worker process) and also the latency is considerably
      reduced. To achieve the same or better performance with the existing
      code an io depth of 4 is required.  Increasing the iodepth further does
      not lead to improvements.
      
      The tests have been run like this:
      
      ./fio --name=seq-writers --ioengine=psync --iodepth=1 --rw=write \
        --bs=4k --direct=0 --size=100000m --time_based --runtime=600   \
        --numjobs=1 --filename=...
      ./fio --name=seq-writers --ioengine=io_uring --iodepth=1 --rw=write \
        --bs=4k --direct=0 --size=100000m --time_based --runtime=600   \
        --numjobs=1 --filename=...
      ./fio --name=seq-writers --ioengine=libaio --iodepth=1 --rw=write \
        --bs=4k --direct=0 --size=100000m --time_based --runtime=600   \
        --numjobs=1 --filename=...
      
      Testing:
        This patch has been tested with xfstests, fsx, fio. xfstests shows no new
        diffs compared to running without the patch series.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      926078b2
    • Stefan Roesch's avatar
      btrfs: assert nowait mode is not used for some btree search functions · c922b016
      Stefan Roesch authored
      Adds nowait asserts to btree search functions which are not used by
      buffered IO and direct IO paths.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c922b016
    • Stefan Roesch's avatar
      btrfs: make btrfs_buffered_write nowait compatible · 965f47ae
      Stefan Roesch authored
      We need to avoid unconditionally calling balance_dirty_pages_ratelimited
      as it could wait for some reason. Use balance_dirty_pages_ratelimited_flags
      with the BDP_ASYNC in case the buffered write is nowait, returning
      EAGAIN eventually.
      
      It also moves the function after the again label. This can cause the
      function to be called a bit later, but this should have no impact in the
      real world.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      965f47ae
    • Stefan Roesch's avatar
      btrfs: plumb NOWAIT through the write path · 304e45ac
      Stefan Roesch authored
      We have everywhere setup for nowait, plumb NOWAIT through the write path.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      304e45ac
    • Stefan Roesch's avatar
      btrfs: make lock_and_cleanup_extent_if_need nowait compatible · 2fcab928
      Stefan Roesch authored
      Add the nowait parameter to lock_and_cleanup_extent_if_need(). If the
      nowait parameter is specified we try to lock the extent in nowait mode.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2fcab928
    • Stefan Roesch's avatar
      btrfs: make prepare_pages nowait compatible · fc226000
      Stefan Roesch authored
      Add nowait parameter to the prepare_pages function. In case nowait is
      specified for an async buffered write request, do a nowait allocation or
      return -EAGAIN.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fc226000
    • Josef Bacik's avatar
      btrfs: make btrfs_check_nocow_lock nowait compatible · 80f9d241
      Josef Bacik authored
      Now all the helpers that btrfs_check_nocow_lock uses handle nowait, add
      a nowait flag to btrfs_check_nocow_lock so it can be used by the write
      path.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      80f9d241
    • Josef Bacik's avatar
      btrfs: add btrfs_try_lock_ordered_range · d2c7a19f
      Josef Bacik authored
      For IOCB_NOWAIT we're going to want to use try lock on the extent lock,
      and simply bail if there's an ordered extent in the range because the
      only choice there is to wait for the ordered extent to complete.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d2c7a19f
    • Josef Bacik's avatar
      btrfs: add the ability to use NO_FLUSH for data reservations · 1daedb1d
      Josef Bacik authored
      In order to accommodate NOWAIT IOCB's we need to be able to do NO_FLUSH
      data reservations, so plumb this through the delalloc reservation
      system.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1daedb1d
    • Josef Bacik's avatar
      btrfs: make can_nocow_extent nowait compatible · 26ce9114
      Josef Bacik authored
      If we have NOWAIT specified on our IOCB and we're writing into a
      PREALLOC or NOCOW extent then we need to be able to tell
      can_nocow_extent that we don't want to wait on any locks or metadata IO.
      Fix can_nocow_extent to allow for NOWAIT.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      26ce9114
  4. 26 Sep, 2022 9 commits