1. 02 Nov, 2022 2 commits
    • Filipe Manana's avatar
      btrfs: fix inode list leak during backref walking at find_parent_nodes() · 92876eec
      Filipe Manana authored
      During backref walking, at find_parent_nodes(), if we are dealing with a
      data extent and we get an error while resolving the indirect backrefs, at
      resolve_indirect_refs(), or in the while loop that iterates over the refs
      in the direct refs rbtree, we end up leaking the inode lists attached to
      the direct refs we have in the direct refs rbtree that were not yet added
      to the refs ulist passed as argument to find_parent_nodes(). Since they
      were not yet added to the refs ulist and prelim_release() does not free
      the lists, on error the caller can only free the lists attached to the
      refs that were added to the refs ulist, all the remaining refs get their
      inode lists never freed, therefore leaking their memory.
      
      Fix this by having prelim_release() always free any attached inode list
      to each ref found in the rbtree, and have find_parent_nodes() set the
      ref's inode list to NULL once it transfers ownership of the inode list
      to a ref added to the refs ulist passed to find_parent_nodes().
      
      Fixes: 86d5f994 ("btrfs: convert prelimary reference tracking to use rbtrees")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      92876eec
    • Filipe Manana's avatar
      btrfs: fix inode list leak during backref walking at resolve_indirect_refs() · 5614dc3a
      Filipe Manana authored
      During backref walking, at resolve_indirect_refs(), if we get an error
      we jump to the 'out' label and call ulist_free() on the 'parents' ulist,
      which frees all the elements in the ulist - however that does not free
      any inode lists that may be attached to elements, through the 'aux' field
      of a ulist node, so we end up leaking lists if we have any attached to
      the unodes.
      
      Fix this by calling free_leaf_list() instead of ulist_free() when we exit
      from resolve_indirect_refs(). The static function free_leaf_list() is
      moved up for this to be possible and it's slightly simplified by removing
      unnecessary code.
      
      Fixes: 3301958b ("Btrfs: add inodes before dropping the extent lock in find_all_leafs")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5614dc3a
  2. 31 Oct, 2022 2 commits
  3. 25 Oct, 2022 1 commit
    • Qu Wenruo's avatar
      btrfs: don't use btrfs_chunk::sub_stripes from disk · 76a66ba1
      Qu Wenruo authored
      [BUG]
      There are two reports (the earliest one from LKP, a more recent one from
      kernel bugzilla) that we can have some chunks with 0 as sub_stripes.
      
      This will cause divide-by-zero errors at btrfs_rmap_block, which is
      introduced by a recent kernel patch ac067734 ("btrfs: merge
      calculations for simple striped profiles in btrfs_rmap_block"):
      
      		if (map->type & (BTRFS_BLOCK_GROUP_RAID0 |
      				 BTRFS_BLOCK_GROUP_RAID10)) {
      			stripe_nr = stripe_nr * map->num_stripes + i;
      			stripe_nr = div_u64(stripe_nr, map->sub_stripes); <<<
      		}
      
      [CAUSE]
      From the more recent report, it has been proven that we have some chunks
      with 0 as sub_stripes, mostly caused by older mkfs.
      
      It turns out that the mkfs.btrfs fix is only introduced in 6718ab4d33aa
      ("btrfs-progs: Initialize sub_stripes to 1 in btrfs_alloc_data_chunk")
      which is included in v5.4 btrfs-progs release.
      
      So there would be quite some old filesystems with such 0 sub_stripes.
      
      [FIX]
      Just don't trust the sub_stripes values from disk.
      
      We have a trusted btrfs_raid_array[] to fetch the correct sub_stripes
      numbers for each profile and that are fixed.
      
      By this, we can keep the compatibility with older filesystems while
      still avoid divide-by-zero bugs.
      Reported-by: default avatarkernel test robot <oliver.sang@intel.com>
      Reported-by: default avatarViktor Kuzmin <kvaster@gmail.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216559
      Fixes: ac067734 ("btrfs: merge calculations for simple striped profiles in btrfs_rmap_block")
      CC: stable@vger.kernel.org # 6.0
      Reviewed-by: default avatarSu Yue <glass@fydeos.io>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      76a66ba1
  4. 24 Oct, 2022 7 commits
    • David Sterba's avatar
      btrfs: fix type of parameter generation in btrfs_get_dentry · 2398091f
      David Sterba authored
      The type of parameter generation has been u32 since the beginning,
      however all callers pass a u64 generation, so unify the types to prevent
      potential loss.
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2398091f
    • BingJing Chang's avatar
      btrfs: send: fix send failure of a subcase of orphan inodes · 9b8be45f
      BingJing Chang authored
      Commit 9ed0a72e ("btrfs: send: fix failures when processing inodes with
      no links") tries to fix all incremental send cases of orphan inodes the
      send operation will meet. However, there's still a bug causing the corner
      subcase fails with a ENOENT error.
      
      Here's shortened steps of that subcase:
      
        $ btrfs subvolume create vol
        $ touch vol/foo
      
        $ btrfs subvolume snapshot -r vol snap1
        $ btrfs subvolume snapshot -r vol snap2
      
        # Turn the second snapshot to RW mode and delete the file while
        # holding an open file descriptor on it
        $ btrfs property set snap2 ro false
        $ exec 73<snap2/foo
        $ rm snap2/foo
      
        # Set the second snapshot back to RO mode and do an incremental send
        # with an unusal reverse order
        $ btrfs property set snap2 ro true
        $ btrfs send -p snap2 snap1 > /dev/null
        At subvol snap1
        ERROR: send ioctl failed with -2: No such file or directory
      
      It's subcase 3 of BTRFS_COMPARE_TREE_CHANGED in the commit 9ed0a72e
      ("btrfs: send: fix failures when processing inodes with no links"). And
      it's not a common case. We still have not met it in the real world.
      Theoretically, this case can happen in a batch cascading snapshot backup.
      In cascading backups, the receive operation in the middle may cause orphan
      inodes to appear because of the open file descriptors on the snapshot files
      during receiving. And if we don't do the batch snapshot backups in their
      creation order, then we can have an inode, which is an orphan in the parent
      snapshot but refers to a file in the send snapshot. Since an orphan inode
      has no paths, the send operation will fail with a ENOENT error if it
      tries to generate a path for it.
      
      In that patch, this subcase will be treated as an inode with a new
      generation. However, when the routine tries to delete the old paths in
      the parent snapshot, the function process_all_refs() doesn't check whether
      there are paths recorded or not before it calls the function
      process_recorded_refs(). And the function process_recorded_refs() try
      to get the first path in the parent snapshot in the beginning. Since it has
      no paths in the parent snapshot, the send operation fails.
      
      To fix this, we can easily put a link count check to avoid entering the
      deletion routine like what we do a link count check to avoid creating a
      new one. Moreover, we can assume that the function process_all_refs()
      can always collect references to process because we know it has a
      positive link count.
      
      Fixes: 9ed0a72e ("btrfs: send: fix failures when processing inodes with no links")
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarBingJing Chang <bingjingc@synology.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9b8be45f
    • Qu Wenruo's avatar
      btrfs: make thaw time super block check to also verify checksum · 3d17adea
      Qu Wenruo authored
      Previous commit a05d3c91 ("btrfs: check superblock to ensure the fs
      was not modified at thaw time") only checks the content of the super
      block, but it doesn't really check if the on-disk super block has a
      matching checksum.
      
      This patch will add the checksum verification to thaw time superblock
      verification.
      
      This involves the following extra changes:
      
      - Export btrfs_check_super_csum()
        As we need to call it in super.c.
      
      - Change the argument list of btrfs_check_super_csum()
        Instead of passing a char *, directly pass struct btrfs_super_block *
        pointer.
      
      - Verify that our checksum type didn't change before checking the
        checksum value, like it's done at mount time
      
      Fixes: a05d3c91 ("btrfs: check superblock to ensure the fs was not modified at thaw time")
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3d17adea
    • Josef Bacik's avatar
      btrfs: fix tree mod log mishandling of reallocated nodes · 968b7158
      Josef Bacik authored
      We have been seeing the following panic in production
      
        kernel BUG at fs/btrfs/tree-mod-log.c:677!
        invalid opcode: 0000 [#1] SMP
        RIP: 0010:tree_mod_log_rewind+0x1b4/0x200
        RSP: 0000:ffffc9002c02f890 EFLAGS: 00010293
        RAX: 0000000000000003 RBX: ffff8882b448c700 RCX: 0000000000000000
        RDX: 0000000000008000 RSI: 00000000000000a7 RDI: ffff88877d831c00
        RBP: 0000000000000002 R08: 000000000000009f R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000100c40 R12: 0000000000000001
        R13: ffff8886c26d6a00 R14: ffff88829f5424f8 R15: ffff88877d831a00
        FS:  00007fee1d80c780(0000) GS:ffff8890400c0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fee1963a020 CR3: 0000000434f33002 CR4: 00000000007706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        PKRU: 55555554
        Call Trace:
         btrfs_get_old_root+0x12b/0x420
         btrfs_search_old_slot+0x64/0x2f0
         ? tree_mod_log_oldest_root+0x3d/0xf0
         resolve_indirect_ref+0xfd/0x660
         ? ulist_alloc+0x31/0x60
         ? kmem_cache_alloc_trace+0x114/0x2c0
         find_parent_nodes+0x97a/0x17e0
         ? ulist_alloc+0x30/0x60
         btrfs_find_all_roots_safe+0x97/0x150
         iterate_extent_inodes+0x154/0x370
         ? btrfs_search_path_in_tree+0x240/0x240
         iterate_inodes_from_logical+0x98/0xd0
         ? btrfs_search_path_in_tree+0x240/0x240
         btrfs_ioctl_logical_to_ino+0xd9/0x180
         btrfs_ioctl+0xe2/0x2ec0
         ? __mod_memcg_lruvec_state+0x3d/0x280
         ? do_sys_openat2+0x6d/0x140
         ? kretprobe_dispatcher+0x47/0x70
         ? kretprobe_rethook_handler+0x38/0x50
         ? rethook_trampoline_handler+0x82/0x140
         ? arch_rethook_trampoline_callback+0x3b/0x50
         ? kmem_cache_free+0xfb/0x270
         ? do_sys_openat2+0xd5/0x140
         __x64_sys_ioctl+0x71/0xb0
         do_syscall_64+0x2d/0x40
      
      Which is this code in tree_mod_log_rewind()
      
      	switch (tm->op) {
              case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
      		BUG_ON(tm->slot < n);
      
      This occurs because we replay the nodes in order that they happened, and
      when we do a REPLACE we will log a REMOVE_WHILE_FREEING for every slot,
      starting at 0.  'n' here is the number of items in this block, which in
      this case was 1, but we had 2 REMOVE_WHILE_FREEING operations.
      
      The actual root cause of this was that we were replaying operations for
      a block that shouldn't have been replayed.  Consider the following
      sequence of events
      
      1. We have an already modified root, and we do a btrfs_get_tree_mod_seq().
      2. We begin removing items from this root, triggering KEY_REPLACE for
         it's child slots.
      3. We remove one of the 2 children this root node points to, thus triggering
         the root node promotion of the remaining child, and freeing this node.
      4. We modify a new root, and re-allocate the above node to the root node of
         this other root.
      
      The tree mod log looks something like this
      
      	logical 0	op KEY_REPLACE (slot 1)			seq 2
      	logical 0	op KEY_REMOVE (slot 1)			seq 3
      	logical 0	op KEY_REMOVE_WHILE_FREEING (slot 0)	seq 4
      	logical 4096	op LOG_ROOT_REPLACE (old logical 0)	seq 5
      	logical 8192	op KEY_REMOVE_WHILE_FREEING (slot 1)	seq 6
      	logical 8192	op KEY_REMOVE_WHILE_FREEING (slot 0)	seq 7
      	logical 0	op LOG_ROOT_REPLACE (old logical 8192)	seq 8
      
      >From here the bug is triggered by the following steps
      
      1.  Call btrfs_get_old_root() on the new_root.
      2.  We call tree_mod_log_oldest_root(btrfs_root_node(new_root)), which is
          currently logical 0.
      3.  tree_mod_log_oldest_root() calls tree_mod_log_search_oldest(), which
          gives us the KEY_REPLACE seq 2, and since that's not a
          LOG_ROOT_REPLACE we incorrectly believe that we don't have an old
          root, because we expect that the most recent change should be a
          LOG_ROOT_REPLACE.
      4.  Back in tree_mod_log_oldest_root() we don't have a LOG_ROOT_REPLACE,
          so we don't set old_root, we simply use our existing extent buffer.
      5.  Since we're using our existing extent buffer (logical 0) we call
          tree_mod_log_search(0) in order to get the newest change to start the
          rewind from, which ends up being the LOG_ROOT_REPLACE at seq 8.
      6.  Again since we didn't find an old_root we simply clone logical 0 at
          it's current state.
      7.  We call tree_mod_log_rewind() with the cloned extent buffer.
      8.  Set n = btrfs_header_nritems(logical 0), which would be whatever the
          original nritems was when we COWed the original root, say for this
          example it's 2.
      9.  We start from the newest operation and work our way forward, so we
          see LOG_ROOT_REPLACE which we ignore.
      10. Next we see KEY_REMOVE_WHILE_FREEING for slot 0, which triggers the
          BUG_ON(tm->slot < n), because it expects if we've done this we have a
          completely empty extent buffer to replay completely.
      
      The correct thing would be to find the first LOG_ROOT_REPLACE, and then
      get the old_root set to logical 8192.  In fact making that change fixes
      this particular problem.
      
      However consider the much more complicated case.  We have a child node
      in this tree and the above situation.  In the above case we freed one
      of the child blocks at the seq 3 operation.  If this block was also
      re-allocated and got new tree mod log operations we would have a
      different problem.  btrfs_search_old_slot(orig root) would get down to
      the logical 0 root that still pointed at that node.  However in
      btrfs_search_old_slot() we call tree_mod_log_rewind(buf) directly.  This
      is not context aware enough to know which operations we should be
      replaying.  If the block was re-allocated multiple times we may only
      want to replay a range of operations, and determining what that range is
      isn't possible to determine.
      
      We could maybe solve this by keeping track of which root the node
      belonged to at every tree mod log operation, and then passing this
      around to make sure we're only replaying operations that relate to the
      root we're trying to rewind.
      
      However there's a simpler way to solve this problem, simply disallow
      reallocations if we have currently running tree mod log users.  We
      already do this for leaf's, so we're simply expanding this to nodes as
      well.  This is a relatively uncommon occurrence, and the problem is
      complicated enough I'm worried that we will still have corner cases in
      the reallocation case.  So fix this in the most straightforward way
      possible.
      
      Fixes: bd989ba3 ("Btrfs: add tree modification log functions")
      CC: stable@vger.kernel.org # 3.3+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      968b7158
    • David Sterba's avatar
      btrfs: reorder btrfs_bio for better packing · ae0e5df4
      David Sterba authored
      After changes in commit 917f32a2 ("btrfs: give struct btrfs_bio a
      real end_io handler") the layout of btrfs_bio can be improved.  There
      are two holes and the structure size is 264 bytes on release build. By
      reordering the iterator we can get rid of the holes and the size is 256
      bytes which fits to slabs much better.
      
      Final layout:
      
      struct btrfs_bio {
      	unsigned int               mirror_num;           /*     0     4 */
      	struct bvec_iter           iter;                 /*     4    20 */
      	u64                        file_offset;          /*    24     8 */
      	struct btrfs_device *      device;               /*    32     8 */
      	u8 *                       csum;                 /*    40     8 */
      	u8                         csum_inline[64];      /*    48    64 */
      	/* --- cacheline 1 boundary (64 bytes) was 48 bytes ago --- */
      	btrfs_bio_end_io_t         end_io;               /*   112     8 */
      	void *                     private;              /*   120     8 */
      	/* --- cacheline 2 boundary (128 bytes) --- */
      	struct work_struct         end_io_work;          /*   128    32 */
      	struct bio                 bio;                  /*   160    96 */
      
      	/* size: 256, cachelines: 4, members: 10 */
      };
      
      Fixes: 917f32a2 ("btrfs: give struct btrfs_bio a real end_io handler")
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ae0e5df4
    • Qu Wenruo's avatar
      btrfs: raid56: avoid double freeing for rbio if full_stripe_write() failed · ab4c54c6
      Qu Wenruo authored
      Currently if full_stripe_write() failed to allocate the pages for
      parity, it will call __free_raid_bio() first, then return -ENOMEM.
      
      But some caller of full_stripe_write() will also call __free_raid_bio()
      again, this would cause double freeing.
      
      And it's not a logically sound either, normally we should either free
      the memory at the same level where we allocated it, or let endio to
      handle everything.
      
      So this patch will solve the double freeing by make
      raid56_parity_write() to handle the error and free the rbio.
      
      Just like what we do in raid56_parity_recover().
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ab4c54c6
    • Qu Wenruo's avatar
      btrfs: raid56: properly handle the error when unable to find the missing stripe · f15fb2cd
      Qu Wenruo authored
      In raid56_alloc_missing_rbio(), if we can not determine where the
      missing device is inside the full stripe, we just BUG_ON().
      
      This is not necessary especially the only caller inside scrub.c is
      already properly checking the return value, and will treat it as a
      memory allocation failure.
      
      Fix the error handling by:
      
      - Add an extra warning for the reason
        Although personally speaking it may be better to be an ASSERT().
      
      - Properly free the allocated rbio
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f15fb2cd
  5. 14 Oct, 2022 1 commit
  6. 11 Oct, 2022 7 commits
    • Filipe Manana's avatar
      btrfs: ignore fiemap path cache if we have multiple leaves for a data extent · 63c84b46
      Filipe Manana authored
      The path cache used during fiemap used to determine the sharedness of
      extent buffers in a path from a leaf containing a file extent item
      pointing to our data extent up to the root node of the tree, is meant to
      be used for a single path. Having a single path is by far the most common
      case, and therefore worth to optimize for, but it's possible to actually
      have multiple paths because we have 2 or more leaves.
      
      If we have multiple leaves, the 'level' variable keeps getting incremented
      in each iteration of the while loop at btrfs_is_data_extent_shared(),
      which means we will treat the second leaf in the 'tmp' ulist as a level 1
      node, and so forth. In the worst case this can lead to getting a level
      greater than or equals to BTRFS_MAX_LEVEL (8), which will trigger a
      WARN_ON_ONCE() in the functions to lookup from or store in the path cache
      (lookup_backref_shared_cache() and store_backref_shared_cache()). If the
      current level never goes beyond 8, due to shared nodes in the paths and
      a fs tree height smaller than 8, it can still result in incorrectly
      marking one leaf as shared because some other leaf is shared and is stored
      one level below that other leaf, as when storing a true sharedness value
      in the cache results in updating the sharedness to true of all entries in
      the cache below the current level.
      
      Having multiple leaves happens in a case like the following:
      
        - We have a file extent item point to data extent at bytenr X, for
          a file range [0, 1M[ for example;
      
        - At this moment we have an extent data ref for the extent, with
          an offset of 0 and a count of 1;
      
        - A write into the middle of the extent happens, file range [64K, 128K)
          so the file extent item is split into two (at btrfs_drop_extents()):
      
          1) One for file range [0, 64K), with a length (num_bytes field) of
             64K and an extent offset of 0;
      
          2) Another one for file range [128K, 1M), with a length of 896K
             (1M - 128K) and an extent offset of 128K.
      
        - At this moment the two file extent items are located in the same
          leaf;
      
        - A new file extent item for the range [64K, 128K), pointing to a new
          data extent, is inserted in the leaf. This results in a leaf split
          and now those two file extent items pointing to data extent X end
          up located in different leaves;
      
        - Once delayed refs are run, we still have a single extent data ref
          item for our data extent at bytenr X, for offset 0, but now with a
          count of 2 instead of 1;
      
        - So during fiemap, at btrfs_is_data_extent_shared(), after we call
          find_parent_nodes() for the data extent, we get two leaves, since
          we have two file extent items point to data extent at bytenr X that
          are located in two different leaves.
      
      So skip the use of the path cache when we get more than one leaf.
      
      Fixes: 12a824dc ("btrfs: speedup checking for extent sharedness during fiemap")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      63c84b46
    • Filipe Manana's avatar
      btrfs: fix processing of delayed tree block refs during backref walking · 943553ef
      Filipe Manana authored
      During backref walking, when processing a delayed reference with a type of
      BTRFS_TREE_BLOCK_REF_KEY, we have two bugs there:
      
      1) We are accessing the delayed references extent_op, and its key, without
         the protection of the delayed ref head's lock;
      
      2) If there's no extent op for the delayed ref head, we end up with an
         uninitialized key in the stack, variable 'tmp_op_key', and then pass
         it to add_indirect_ref(), which adds the reference to the indirect
         refs rb tree.
      
         This is wrong, because indirect references should have a NULL key
         when we don't have access to the key, and in that case they should be
         added to the indirect_missing_keys rb tree and not to the indirect rb
         tree.
      
         This means that if have BTRFS_TREE_BLOCK_REF_KEY delayed ref resulting
         from freeing an extent buffer, therefore with a count of -1, it will
         not cancel out the corresponding reference we have in the extent tree
         (with a count of 1), since both references end up in different rb
         trees.
      
         When using fiemap, where we often need to check if extents are shared
         through shared subtrees resulting from snapshots, it means we can
         incorrectly report an extent as shared when it's no longer shared.
         However this is temporary because after the transaction is committed
         the extent is no longer reported as shared, as running the delayed
         reference results in deleting the tree block reference from the extent
         tree.
      
         Outside the fiemap context, the result is unpredictable, as the key was
         not initialized but it's used when navigating the rb trees to insert
         and search for references (prelim_ref_compare()), and we expect all
         references in the indirect rb tree to have valid keys.
      
      The following reproducer triggers the second bug:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         mkfs.btrfs -f $DEV
         mount -o compress $DEV $MNT
      
         # With a compressed 128M file we get a tree height of 2 (level 1 root).
         xfs_io -f -c "pwrite -b 1M 0 128M" $MNT/foo
      
         btrfs subvolume snapshot $MNT $MNT/snap
      
         # Fiemap should output 0x2008 in the flags column.
         # 0x2000 means shared extent
         # 0x8 means encoded extent (because it's compressed)
         echo
         echo "fiemap after snapshot, range [120M, 120M + 128K):"
         xfs_io -c "fiemap -v 120M 128K" $MNT/foo
         echo
      
         # Overwrite one extent and fsync to flush delalloc and COW a new path
         # in the snapshot's tree.
         #
         # After this we have a BTRFS_DROP_DELAYED_REF delayed ref of type
         # BTRFS_TREE_BLOCK_REF_KEY with a count of -1 for every COWed extent
         # buffer in the path.
         #
         # In the extent tree we have inline references of type
         # BTRFS_TREE_BLOCK_REF_KEY, with a count of 1, for the same extent
         # buffers, so they should cancel each other, and the extent buffers in
         # the fs tree should no longer be considered as shared.
         #
         echo "Overwriting file range [120M, 120M + 128K)..."
         xfs_io -c "pwrite -b 128K 120M 128K" $MNT/snap/foo
         xfs_io -c "fsync" $MNT/snap/foo
      
         # Fiemap should output 0x8 in the flags column. The extent in the range
         # [120M, 120M + 128K) is no longer shared, it's now exclusive to the fs
         # tree.
         echo
         echo "fiemap after overwrite range [120M, 120M + 128K):"
         xfs_io -c "fiemap -v 120M 128K" $MNT/foo
         echo
      
         umount $MNT
      
      Running it before this patch:
      
         $ ./test.sh
         (...)
         wrote 134217728/134217728 bytes at offset 0
         128 MiB, 128 ops; 0.1152 sec (1.085 GiB/sec and 1110.5809 ops/sec)
         Create a snapshot of '/mnt/sdj' in '/mnt/sdj/snap'
      
         fiemap after snapshot, range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256 0x2008
      
         Overwriting file range [120M, 120M + 128K)...
         wrote 131072/131072 bytes at offset 125829120
         128 KiB, 1 ops; 0.0001 sec (683.060 MiB/sec and 5464.4809 ops/sec)
      
         fiemap after overwrite range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256 0x2008
      
      The extent in the range [120M, 120M + 128K) is still reported as shared
      (0x2000 bit set) after overwriting that range and flushing delalloc, which
      is not correct - an entire path was COWed in the snapshot's tree and the
      extent is now only referenced by the original fs tree.
      
      Running it after this patch:
      
         $ ./test.sh
         (...)
         wrote 134217728/134217728 bytes at offset 0
         128 MiB, 128 ops; 0.1198 sec (1.043 GiB/sec and 1068.2067 ops/sec)
         Create a snapshot of '/mnt/sdj' in '/mnt/sdj/snap'
      
         fiemap after snapshot, range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256 0x2008
      
         Overwriting file range [120M, 120M + 128K)...
         wrote 131072/131072 bytes at offset 125829120
         128 KiB, 1 ops; 0.0001 sec (694.444 MiB/sec and 5555.5556 ops/sec)
      
         fiemap after overwrite range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256   0x8
      
      Now the extent is not reported as shared anymore.
      
      So fix this by passing a NULL key pointer to add_indirect_ref() when
      processing a delayed reference for a tree block if there's no extent op
      for our delayed ref head with a defined key. Also access the extent op
      only after locking the delayed ref head's lock.
      
      The reproducer will be converted later to a test case for fstests.
      
      Fixes: 86d5f994 ("btrfs: convert prelimary reference tracking to use rbtrees")
      Fixes: a6dbceaf ("btrfs: Remove unused op_key var from add_delayed_refs")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      943553ef
    • Filipe Manana's avatar
      btrfs: fix processing of delayed data refs during backref walking · 4fc7b572
      Filipe Manana authored
      When processing delayed data references during backref walking and we are
      using a share context (we are being called through fiemap), whenever we
      find a delayed data reference for an inode different from the one we are
      interested in, then we immediately exit and consider the data extent as
      shared. This is wrong, because:
      
      1) This might be a DROP reference that will cancel out a reference in the
         extent tree;
      
      2) Even if it's an ADD reference, it may be followed by a DROP reference
         that cancels it out.
      
      In either case we should not exit immediately.
      
      Fix this by never exiting when we find a delayed data reference for
      another inode - instead add the reference and if it does not cancel out
      other delayed reference, we will exit early when we call
      extent_is_shared() after processing all delayed references. If we find
      a drop reference, then signal the code that processes references from
      the extent tree (add_inline_refs() and add_keyed_refs()) to not exit
      immediately if it finds there a reference for another inode, since we
      have delayed drop references that may cancel it out. In this later case
      we exit once we don't have references in the rb trees that cancel out
      each other and have two references for different inodes.
      
      Example reproducer for case 1):
      
         $ cat test-1.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         xfs_io -f -c "pwrite 0 64K" $MNT/foo
         cp --reflink=always $MNT/foo $MNT/bar
      
         echo
         echo "fiemap after cloning:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         rm -f $MNT/bar
         echo
         echo "fiemap after removing file bar:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         umount $MNT
      
      Running it before this patch, the extent is still listed as shared, it has
      the flag 0x2000 (FIEMAP_EXTENT_SHARED) set:
      
         $ ./test-1.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
      Example reproducer for case 2):
      
         $ cat test-2.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         xfs_io -f -c "pwrite 0 64K" $MNT/foo
         cp --reflink=always $MNT/foo $MNT/bar
      
         # Flush delayed references to the extent tree and commit current
         # transaction.
         sync
      
         echo
         echo "fiemap after cloning:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         rm -f $MNT/bar
         echo
         echo "fiemap after removing file bar:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         umount $MNT
      
      Running it before this patch, the extent is still listed as shared, it has
      the flag 0x2000 (FIEMAP_EXTENT_SHARED) set:
      
         $ ./test-2.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
      After this patch, after deleting bar in both tests, the extent is not
      reported with the 0x2000 flag anymore, it gets only the flag 0x1
      (which is FIEMAP_EXTENT_LAST):
      
         $ ./test-1.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128   0x1
      
         $ ./test-2.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128   0x1
      
      These tests will later be converted to a test case for fstests.
      
      Fixes: dc046b10 ("Btrfs: make fiemap not blow when you have lots of snapshots")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4fc7b572
    • David Sterba's avatar
      btrfs: delete stale comments after merge conflict resolution · 295a53cc
      David Sterba authored
      There are two comments in btrfs_cache_block_group that I left when
      resolving conflict between commits ced8ecf0 "btrfs: fix space cache
      corruption and potential double allocations" and 527c490f "btrfs:
      delete btrfs_wait_space_cache_v1_finished".
      
      The former reworked the caching logic to wait until the caching ends in
      btrfs_cache_block_group while the latter only open coded the waiting.
      Both removed btrfs_wait_space_cache_v1_finished, the correct code is
      with the waiting and returning error. Thus the conflict resolution was
      OK.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      295a53cc
    • Josef Bacik's avatar
      btrfs: unlock locked extent area if we have contention · 9e769bd7
      Josef Bacik authored
      In production we hit the following deadlock
      
      task 1			task 2			task 3
      ------			------			------
      fiemap(file)		falloc(file)		fsync(file)
      						  write(0, 1MiB)
      						  btrfs_commit_transaction()
      						    wait_on(!pending_ordered)
      			  lock(512MiB, 1GiB)
      			  start_transaction
      			    wait_on_transaction
      
        lock(0, 1GiB)
          wait_extent_bit(512MiB)
      
      task 4
      ------
      finish_ordered_extent(0, 1MiB)
        lock(0, 1MiB)
        **DEADLOCK**
      
      This occurs because when task 1 does it's lock, it locks everything from
      0-512MiB, and then waits for the 512MiB chunk to unlock.  task 2 will
      never unlock because it's waiting on the transaction commit to happen,
      the transaction commit is waiting for the outstanding ordered extents,
      and then the ordered extent thread is blocked waiting on the 0-1MiB
      range to unlock.
      
      To fix this we have to clear anything we've locked so far, wait for the
      extent_state that we contended on, and then try to re-lock the entire
      range again.
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9e769bd7
    • David Sterba's avatar
      btrfs: send: update command for protocol version check · c86eab81
      David Sterba authored
      For a protocol and command compatibility we have a helper that hasn't
      been updated for v3 yet. We use it for verity so update where necessary.
      
      Fixes: 38622010 ("btrfs: send: add support for fs-verity")
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c86eab81
    • Boris Burkov's avatar
      btrfs: send: allow protocol version 3 with CONFIG_BTRFS_DEBUG · 9971a741
      Boris Burkov authored
      We haven't finalized send stream v3 yet, so gate the send stream version
      behind CONFIG_BTRFS_DEBUG as we want some way to test it.
      
      The original verity send did not check the protocol version, so add that
      actual protection as well.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9971a741
  7. 07 Oct, 2022 1 commit
    • Filipe Manana's avatar
      btrfs: add missing path cache update during fiemap · 96dbcc00
      Filipe Manana authored
      When looking the stored result for a cached path node, if the stored
      result is valid and has a value of true, we must update all the nodes for
      all levels below it with a result of true as well. This is necessary when
      moving from one leaf in the fs tree to the next one, as well as when
      moving from a node at any level to the next node at the same level.
      
      Currently this logic is missing as it was somehow forgotten by a recent
      patch with the subject: "btrfs: speedup checking for extent sharedness
      during fiemap".
      
      This adds the missing logic, which is the counter part to what we do
      when adding a shared node to the cache at store_backref_shared_cache().
      
      Fixes: 12a824dc ("btrfs: speedup checking for extent sharedness during fiemap")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      96dbcc00
  8. 29 Sep, 2022 19 commits
    • Tetsuo Handa's avatar
      btrfs: set generation before calling btrfs_clean_tree_block in btrfs_init_new_buffer · cbddcc4f
      Tetsuo Handa authored
      syzbot is reporting uninit-value in btrfs_clean_tree_block() [1], for
      commit bc877d28 ("btrfs: Deduplicate extent_buffer init code")
      missed that btrfs_set_header_generation() in btrfs_init_new_buffer() must
      not be moved to after clean_tree_block() because clean_tree_block() is
      calling btrfs_header_generation() since commit 55c69072 ("Btrfs:
      Fix extent_buffer usage when nodesize != leafsize").
      
      Since memzero_extent_buffer() will reset "struct btrfs_header" part, we
      can't move btrfs_set_header_generation() to before memzero_extent_buffer().
      Just re-add btrfs_set_header_generation() before btrfs_clean_tree_block().
      
      Link: https://syzkaller.appspot.com/bug?extid=fba8e2116a12609b6c59 [1]
      Reported-by: default avatarsyzbot <syzbot+fba8e2116a12609b6c59@syzkaller.appspotmail.com>
      Fixes: bc877d28 ("btrfs: Deduplicate extent_buffer init code")
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cbddcc4f
    • Filipe Manana's avatar
      btrfs: drop extent map range more efficiently · db21370b
      Filipe Manana authored
      Currently when dropping extent maps for a file range, through
      btrfs_drop_extent_map_range(), we do the following non-optimal things:
      
      1) We lookup for extent maps one by one, always starting the search from
         the root of the extent map tree. This is not efficient if we have
         multiple extent maps in the range;
      
      2) We check on every iteration if we have the 'split' and 'split2' spare
         extent maps in case we need to split an extent map that intersects our
         range but also crosses its boundaries (to the left, to the right or
         both cases). If our target range is for example:
      
             [2M, 8M)
      
         And we have 3 extents maps in the range:
      
             [1M, 3M) [3M, 6M) [6M, 10M[
      
         The on the first iteration we allocate two extent maps for 'split' and
         'split2', and use the 'split' to split the first extent map, so after
         the split we set 'split' to 'split2' and then set 'split2' to NULL.
      
         On the second iteration, we don't need to split the second extent map,
         but because 'split2' is now NULL, we allocate a new extent map for
         'split2'.
      
         On the third iteration we need to split the third extent map, so we
         use the extent map pointed by 'split'.
      
         So we ended up allocating 3 extent maps for splitting, but all we
         needed was 2 extent maps. We never need to allocate more than 2,
         because extent maps that need to be split are always the first one
         and the last one in the target range.
      
      Improve on this by:
      
      1) Using rb_next() to move on to the next extent map. This results in
         iterating over less nodes of the tree and it does not require comparing
         the ranges of nodes to our start/end offset;
      
      2) Allocate the 2 extent maps for splitting before entering the loop and
         never allocate more than 2. In practice it's very rare to have the
         combination of both extent map allocations fail, since we have a
         dedicated slab for extent maps, and also have the need to split two
         extent maps.
      
      This patch is part of a patchset comprised of the following patches:
      
         btrfs: fix missed extent on fsync after dropping extent maps
         btrfs: move btrfs_drop_extent_cache() to extent_map.c
         btrfs: use extent_map_end() at btrfs_drop_extent_map_range()
         btrfs: use cond_resched_rwlock_write() during inode eviction
         btrfs: move open coded extent map tree deletion out of inode eviction
         btrfs: add helper to replace extent map range with a new extent map
         btrfs: remove the refcount warning/check at free_extent_map()
         btrfs: remove unnecessary extent map initializations
         btrfs: assert tree is locked when clearing extent map from logging
         btrfs: remove unnecessary NULL pointer checks when searching extent maps
         btrfs: remove unnecessary next extent map search
         btrfs: avoid pointless extent map tree search when flushing delalloc
         btrfs: drop extent map range more efficiently
      
      And the following fio test was done before and after applying the whole
      patchset, on a non-debug kernel (Debian's default kernel config) on a 12
      cores Intel box with 64G of ram:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/nvme0n1
         MNT=/mnt/nvme0n1
         MOUNT_OPTIONS="-o ssd"
         MKFS_OPTIONS="-R free-space-tree -O no-holes"
      
         cat <<EOF > /tmp/fio-job.ini
         [writers]
         rw=randwrite
         fsync=8
         fallocate=none
         group_reporting=1
         direct=0
         bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
         ioengine=psync
         filesize=2G
         runtime=300
         time_based
         directory=$MNT
         numjobs=8
         thread
         EOF
      
         echo performance | \
             tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
         echo
         echo "Using config:"
         echo
         cat /tmp/fio-job.ini
         echo
      
         umount $MNT &> /dev/null
         mkfs.btrfs -f $MKFS_OPTIONS $DEV
         mount $MOUNT_OPTIONS $DEV $MNT
      
         fio /tmp/fio-job.ini
      
         umount $MNT
      
      Result before applying the patchset:
      
         WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec
      
      Result after applying the patchset:
      
         WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      db21370b
    • Filipe Manana's avatar
      btrfs: avoid pointless extent map tree search when flushing delalloc · b54bb865
      Filipe Manana authored
      When flushing delalloc, in COW mode at cow_file_range(), before entering
      the loop that allocates extents and creates ordered extents, we do a call
      to btrfs_drop_extent_map_range() for the whole range. This is pointless
      because in the loop we call create_io_em(), which will also call
      btrfs_drop_extent_map_range() before inserting the new extent map.
      
      So remove that call at cow_file_range() not only because it is not needed,
      but also because it will make the btrfs_drop_extent_map_range() calls made
      from create_io_em() waste time searching the extent map tree, and that
      tree can be large for files with many extents. It also makes us waste time
      at btrfs_drop_extent_map_range() allocating and freeing the split extent
      maps for nothing.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b54bb865
    • Filipe Manana's avatar
      btrfs: remove unnecessary next extent map search · 6c05813e
      Filipe Manana authored
      At __tree_search(), and its single caller __lookup_extent_mapping(), there
      is no point in finding the next extent map that starts after the search
      offset if we were able to find the previous extent map that ends before
      our search offset, because __lookup_extent_mapping() ignores the next
      acceptable extent map if we were able to find the previous one.
      
      So just return immediately if we were able to find the previous extent
      map, therefore avoiding wasting time iterating the tree looking for the
      next extent map which will not be used by __lookup_extent_mapping().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6c05813e
    • Filipe Manana's avatar
      btrfs: remove unnecessary NULL pointer checks when searching extent maps · 08f088dd
      Filipe Manana authored
      The previous and next pointer arguments passed to __tree_search() are
      never NULL as the only caller of this function, __lookup_extent_mapping(),
      always passes the address of two on stack pointers. So remove the NULL
      checks and add assertions to verify the pointers.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      08f088dd
    • Filipe Manana's avatar
      btrfs: assert tree is locked when clearing extent map from logging · 74333c7d
      Filipe Manana authored
      When calling clear_em_logging() we should have a write lock on the extent
      map tree, as we will try to merge the extent map with the previous and
      next ones in the tree. So assert that we have a write lock.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      74333c7d
    • Filipe Manana's avatar
      btrfs: remove unnecessary extent map initializations · 2e0cdaa0
      Filipe Manana authored
      When allocating an extent map, we use kmem_cache_zalloc() which guarantees
      the returned memory is initialized to zeroes, therefore it's pointless
      to initialize the generation and flags of the extent map to zero again.
      
      Remove those initializations, as they are pointless and slightly increase
      the object text size.
      
      Before removing them:
      
         $ size fs/btrfs/extent_map.o
            text	   data	    bss	    dec	    hex	filename
            9241	    274	     24	   9539	   2543	fs/btrfs/extent_map.o
      
      After removing them:
      
         $ size fs/btrfs/extent_map.o
            text	   data	    bss	    dec	    hex	filename
            9209	    274	     24	   9507	   2523	fs/btrfs/extent_map.o
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2e0cdaa0
    • Filipe Manana's avatar
      btrfs: remove the refcount warning/check at free_extent_map() · ad5d6e91
      Filipe Manana authored
      At free_extent_map(), it's pointless to have a WARN_ON() to check if the
      refcount of the extent map is zero. Such check is already done by the
      refcount_t module and refcount_dec_and_test(), which loudly complains if
      we try to decrement a reference count that is currently 0.
      
      The WARN_ON() dates back to the time when used a regular atomic_t type
      for the reference counter, before we switched to the refcount_t type.
      The main goal of the refcount_t type/module is precisely to catch such
      types of bugs and loudly complain if they happen.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ad5d6e91
    • Filipe Manana's avatar
      btrfs: add helper to replace extent map range with a new extent map · a1ba4c08
      Filipe Manana authored
      We have several places that need to drop all the extent maps in a given
      file range and then add a new extent map for that range. Currently they
      call btrfs_drop_extent_map_range() to delete all extent maps in the range
      and then keep trying to add the new extent map in a loop that keeps
      retrying while the insertion of the new extent map fails with -EEXIST.
      
      So instead of repeating this logic, add a helper to extent_map.c that
      does these steps and name it btrfs_replace_extent_map_range(). Also add
      a comment about why the retry loop is necessary.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a1ba4c08
    • Filipe Manana's avatar
      btrfs: move open coded extent map tree deletion out of inode eviction · 9c9d1b4f
      Filipe Manana authored
      Move the loop that removes all the extent maps from the inode's extent
      map tree during inode eviction out of inode.c and into extent_map.c, to
      btrfs_drop_extent_map_range(). Anything manipulating extent maps or the
      extent map tree should be in extent_map.c.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9c9d1b4f
    • Filipe Manana's avatar
      btrfs: use cond_resched_rwlock_write() during inode eviction · 99ba0c81
      Filipe Manana authored
      At evict_inode_truncate_pages(), instead of manually checking if
      rescheduling is needed, then unlock the extent map tree, reschedule and
      then write lock again the tree, use the helper cond_resched_rwlock_write()
      which does all that.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      99ba0c81
    • Filipe Manana's avatar
      btrfs: use extent_map_end() at btrfs_drop_extent_map_range() · f3109e33
      Filipe Manana authored
      Instead of open coding the end offset calculation of an extent map, use
      the helper extent_map_end() and cache its result in a local variable,
      since it's used several times.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f3109e33
    • Filipe Manana's avatar
      btrfs: move btrfs_drop_extent_cache() to extent_map.c · 4c0c8cfc
      Filipe Manana authored
      The function btrfs_drop_extent_cache() doesn't really belong at file.c
      because what it does is drop a range of extent maps for a file range.
      It directly allocates and manipulates extent maps, by dropping,
      splitting and replacing them in an extent map tree, so it should be
      located at extent_map.c, where all manipulations of an extent map tree
      and its extent maps are supposed to be done.
      
      So move it out of file.c and into extent_map.c. Additionally do the
      following changes:
      
      1) Rename it into btrfs_drop_extent_map_range(), as this makes it more
         clear about what it does. The term "cache" is a bit confusing as it's
         not widely used, "extent maps" or "extent mapping" is much more common;
      
      2) Change its 'skip_pinned' argument from int to bool;
      
      3) Turn several of its local variables from int to bool, since they are
         used as booleans;
      
      4) Move the declaration of some variables out of the function's main
         scope and into the scopes where they are used;
      
      5) Remove pointless assignment of false to 'modified' early in the while
         loop, as later that variable is set and it's not used before that
         second assignment;
      
      6) Remove checks for NULL before calling free_extent_map().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4c0c8cfc
    • Filipe Manana's avatar
      btrfs: fix missed extent on fsync after dropping extent maps · cef7820d
      Filipe Manana authored
      When dropping extent maps for a range, through btrfs_drop_extent_cache(),
      if we find an extent map that starts before our target range and/or ends
      before the target range, and we are not able to allocate extent maps for
      splitting that extent map, then we don't fail and simply remove the entire
      extent map from the inode's extent map tree.
      
      This is generally fine, because in case anyone needs to access the extent
      map, it can just load it again later from the respective file extent
      item(s) in the subvolume btree. However, if that extent map is new and is
      in the list of modified extents, then a fast fsync will miss the parts of
      the extent that were outside our range (that needed to be split),
      therefore not logging them. Fix that by marking the inode for a full
      fsync. This issue was introduced after removing BUG_ON()s triggered when
      the split extent map allocations failed, done by commit 7014cdb4
      ("Btrfs: btrfs_drop_extent_cache should never fail"), back in 2012, and
      the fast fsync path already existed but was very recent.
      
      Also, in the case where we could allocate extent maps for the split
      operations but then fail to add a split extent map to the tree, mark the
      inode for a full fsync as well. This is not supposed to ever fail, and we
      assert that, but in case assertions are disabled (CONFIG_BTRFS_ASSERT is
      not set), it's the correct thing to do to make sure a fast fsync will not
      miss a new extent.
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cef7820d
    • Jeff Layton's avatar
      btrfs: remove stale prototype of btrfs_write_inode · 3050dfa6
      Jeff Layton authored
      This function no longer exists, was removed in 3c427693 ("Btrfs: fix
      btrfs_write_inode vs delayed iput deadlock").
      Signed-off-by: default avatarJeff Layton <jlayton@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3050dfa6
    • Stefan Roesch's avatar
      btrfs: enable nowait async buffered writes · 926078b2
      Stefan Roesch authored
      Enable nowait async buffered writes in btrfs_do_write_iter() and
      btrfs_file_open().
      
      In this version encoded buffered writes have the optimization not
      enabled. Encoded writes are enabled by using an ioctl. io_uring
      currently does not support ioctls. This might be enabled in the future.
      
      Performance results:
      
        For fio the following results have been obtained with a queue depth of
        1 and 4k block size (runtime 600 secs):
      
                       sequential writes:
                       without patch           with patch      libaio     psync
        iops:              55k                    134k          117K       148K
        bw:               221MB/s                 538MB/s       469MB/s    592MB/s
        clat:           15286ns                    82ns         994ns     6340ns
      
      For an io depth of 1, the new patch improves throughput by over two
      times (compared to the existing behavior, where buffered writes are
      processed by an io-worker process) and also the latency is considerably
      reduced. To achieve the same or better performance with the existing
      code an io depth of 4 is required.  Increasing the iodepth further does
      not lead to improvements.
      
      The tests have been run like this:
      
      ./fio --name=seq-writers --ioengine=psync --iodepth=1 --rw=write \
        --bs=4k --direct=0 --size=100000m --time_based --runtime=600   \
        --numjobs=1 --filename=...
      ./fio --name=seq-writers --ioengine=io_uring --iodepth=1 --rw=write \
        --bs=4k --direct=0 --size=100000m --time_based --runtime=600   \
        --numjobs=1 --filename=...
      ./fio --name=seq-writers --ioengine=libaio --iodepth=1 --rw=write \
        --bs=4k --direct=0 --size=100000m --time_based --runtime=600   \
        --numjobs=1 --filename=...
      
      Testing:
        This patch has been tested with xfstests, fsx, fio. xfstests shows no new
        diffs compared to running without the patch series.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      926078b2
    • Stefan Roesch's avatar
      btrfs: assert nowait mode is not used for some btree search functions · c922b016
      Stefan Roesch authored
      Adds nowait asserts to btree search functions which are not used by
      buffered IO and direct IO paths.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c922b016
    • Stefan Roesch's avatar
      btrfs: make btrfs_buffered_write nowait compatible · 965f47ae
      Stefan Roesch authored
      We need to avoid unconditionally calling balance_dirty_pages_ratelimited
      as it could wait for some reason. Use balance_dirty_pages_ratelimited_flags
      with the BDP_ASYNC in case the buffered write is nowait, returning
      EAGAIN eventually.
      
      It also moves the function after the again label. This can cause the
      function to be called a bit later, but this should have no impact in the
      real world.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      965f47ae
    • Stefan Roesch's avatar
      btrfs: plumb NOWAIT through the write path · 304e45ac
      Stefan Roesch authored
      We have everywhere setup for nowait, plumb NOWAIT through the write path.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      304e45ac