16 May, 2022 (40 commits)
    • btrfs: avoid double nocow check when doing nowait dio writes · d7a8ab4e
      Filipe Manana authored
      When doing a NOWAIT direct IO write we are checking twice if we can NOCOW
      into the target file range using can_nocow_extent() - once at the very
      beginning of the write path, at btrfs_write_check() via
      check_nocow_nolock(), and later again at btrfs_get_blocks_direct_write().
      
      The can_nocow_extent() function does a lot of expensive things - searching
      for the file extent item in the inode's subvolume tree, searching for the
      extent item in the extent tree, checking delayed references, etc, so it
      isn't a very cheap call.
      
      We can remove the first check at btrfs_write_check() and add there a
      quick check to verify if the inode has the NODATACOW or PREALLOC flags,
      bailing out early if it has neither of them, as that means we would
      have to COW and therefore can't comply with the NOWAIT semantics.
      
      After this we do only one call to can_nocow_extent(), while we are at
      btrfs_get_blocks_direct_write(), where we have already locked the file
      range and we did a try lock on the range before, at
      btrfs_dio_iomap_begin() (since the previous patch in the series).
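
      As a rough sketch of the quick flag test (illustrative only - the
      helper name here is hypothetical, while BTRFS_INODE_NODATACOW and
      BTRFS_INODE_PREALLOC are the real inode flags):

        /* Sketch: cheap NOWAIT bail-out replacing the expensive check. */
        static bool can_nocow_quick_check(const struct btrfs_inode *inode)
        {
            /* Without NODATACOW or PREALLOC we must COW, so NOWAIT fails. */
            return (inode->flags & (BTRFS_INODE_NODATACOW |
                                    BTRFS_INODE_PREALLOC)) != 0;
        }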
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid blocking nowait dio when locking file range · 59094403
      Filipe Manana authored
      If we are doing a NOWAIT direct IO read/write, we can block when locking
      the file range at btrfs_dio_iomap_begin(), as it's possible the range (or
      a part of it) is already locked by another task (mmap writes, another
      direct IO read/write racing with us, fiemap, etc). We are also waiting for
      completion of any ordered extent we find in the range, which also can
      block us for a significant amount of time.
      
      There's also the incorrect fallback to buffered IO (returning -ENOTBLK)
      when we are dealing with a NOWAIT request and we can't proceed. In this
      case we should return -EAGAIN instead, as falling back to buffered IO
      can result in blocking for many different reasons, so the caller can
      delegate a retry to a context where blocking is more acceptable.

      Fix these cases by:

      1) Doing a try lock on the file range and failing with -EAGAIN if we
         cannot lock right away;

      2) Failing with -EAGAIN if we find an ordered extent;

      3) Returning -EAGAIN instead of -ENOTBLK when we need to fall back to
         buffered IO and we have a NOWAIT request.
      
      This will also allow us to avoid a duplicated check that verifies if we
      are able to do a NOCOW write for NOWAIT direct IO writes, done in the
      next patch.
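
      A minimal sketch of the try-lock pattern, assuming the 5.18-era
      helpers try_lock_extent(), lock_extent_bits() and
      btrfs_lookup_ordered_range() (illustrative, not the literal diff):

        /* 'inode' is a struct btrfs_inode *, 'nowait' from IOMAP_NOWAIT. */
        if (nowait) {
            if (!try_lock_extent(&inode->io_tree, lockstart, lockend))
                return -EAGAIN;
        } else {
            lock_extent_bits(&inode->io_tree, lockstart, lockend,
                             &cached_state);
        }

        /* Finding an ordered extent means we would have to wait on it. */
        ordered = btrfs_lookup_ordered_range(inode, lockstart,
                                             lockend - lockstart + 1);
        if (ordered && nowait) {
            btrfs_put_ordered_extent(ordered);
            unlock_extent_cached(&inode->io_tree, lockstart, lockend,
                                 &cached_state);
            return -EAGAIN;
        }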
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid blocking on page locks with nowait dio on compressed range · b023e675
      Filipe Manana authored
      If we are doing NOWAIT direct IO read/write and our inode has compressed
      extents, we call filemap_fdatawrite_range() against the range in order
      to wait for compressed writeback to complete, since the generic code at
      iomap_dio_rw() calls filemap_write_and_wait_range() once, which is not
      enough to wait for compressed writeback to complete.
      
      This call to filemap_fdatawrite_range() can block on page locks, since
      the first writepages() on a range that we will try to compress results
      only in queuing a work to compress the data while holding the pages
      locked.
      
      Even though the generic code at iomap_dio_rw() will do the right thing
      and return -EAGAIN for NOWAIT requests in case there are pages in the
      range, we can still end up at btrfs_dio_iomap_begin() with pages in the
      range because either of the following can happen:
      
      1) Memory mapped writes, as we haven't locked the range yet;
      
      2) Buffered reads might have started, which lock the pages, and we do
         the filemap_fdatawrite_range() call before locking the file range.
      
      So don't call filemap_fdatawrite_range() at btrfs_dio_iomap_begin() if we
      are doing a NOWAIT read/write. Instead call filemap_range_needs_writeback()
      to check if there are any locked, dirty, or under writeback pages, and
      return -EAGAIN if that's the case.
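
      Sketched in code, the NOWAIT branch might look like this (hedged -
      filemap_range_needs_writeback() is the real mm helper, the
      surrounding variable names are illustrative):

        /* 'inode' here is the VFS inode. */
        if (nowait) {
            /* Never start or wait for writeback in NOWAIT mode. */
            if (filemap_range_needs_writeback(inode->i_mapping, lockstart,
                                              lockend))
                return -EAGAIN;
        } else {
            /* Flush compressed writeback as before. */
            ret = filemap_fdatawrite_range(inode->i_mapping, lockstart,
                                           lockend);
            if (ret)
                return ret;
        }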
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add messages to printk index · b0a66a31
      Jonathan Lassoff authored
      In order for end users to quickly react to new issues that come up in
      production, it is proving useful to leverage this printk indexing
      system. This printk index enables kernel developers to use calls to
      printk() with changeable ad-hoc format strings, while still enabling end
      users to detect changes and develop a semi-stable interface for
      detecting and parsing these messages.
      
      So that detailed Btrfs messages are captured by this printk index, this
      patch wraps btrfs_printk and btrfs_handle_fs_error with macros.
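
      The wrapping technique looks roughly like this (a hedged sketch built
      on the generic printk_index_subsys_emit() helper; the exact btrfs
      macro bodies may differ):

        /* Record the format string in the printk index, then emit it. */
        #define btrfs_printk(fs_info, fmt, args...)                  \
        do {                                                         \
            printk_index_subsys_emit("%sBTRFS%s: ", NULL, fmt);      \
            _btrfs_printk(fs_info, fmt, ##args);                     \
        } while (0)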
      
      Example of the generated list:
      https://lore.kernel.org/lkml/12588e13d51a9c3bf59467d3fc1ac2162f1275c1.1647539056.git.jof@thejof.com
      Signed-off-by: Jonathan Lassoff <jof@thejof.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tree-checker: check extent buffer owner against owner rootid · 88c602ab
      Qu Wenruo authored
      Btrfs doesn't check whether a tree block respects the expected root
      owner. This means that if a tree block is referred to by a parent in
      the extent tree but has an owner of 5, btrfs can still continue
      reading it, as long as it doesn't trigger other sanity checks.

      Normally this is fine, but combined with the empty tree check in
      check_leaf(), if we hit an empty extent tree whose root node has the
      csum tree owner, we can let such an extent buffer sneak in.

      Shrink the hole by:

      - Doing an extra extent buffer owner check at tree read time

      - Making sure the owner of a root's extent buffer exactly matches the
        root id

      Unfortunately we can't yet completely close the hole, as there are
      several call sites that can't pass all the info we need:

      - For reloc/log trees
        Their owner is key::offset, not key::objectid.
        We need the full root key to do an accurate check.

        For now, we just skip the ownership check for those trees.

      - For add_data_references() of relocation
        That call site doesn't have any parent/ownership info, as all the
        bytenrs come from btrfs_find_all_leafs().

      - For direct backref item walks
        Direct backref items record the parent bytenr directly, thus unlike
        indirect backref items, we don't do a full tree search.

        Thus in that case, we don't have the full parent owner to check.

      For the latter two cases, they all pass 0 as @owner_root, so we can
      skip those cases when @owner_root is 0.
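
      The shape of the check is roughly this (a hedged sketch - the helper
      name is illustrative, while btrfs_header_owner() and the objectid
      constants are real):

        /* Sketch: verify an extent buffer's recorded owner for a root. */
        static int check_eb_owner(const struct extent_buffer *eb,
                                  u64 root_owner)
        {
            /* Callers with no ownership info pass 0, nothing to verify. */
            if (root_owner == 0)
                return 0;
            /* Reloc/log trees key their owner off key::offset, skip. */
            if (root_owner == BTRFS_TREE_RELOC_OBJECTID ||
                root_owner == BTRFS_TREE_LOG_OBJECTID)
                return 0;

            return btrfs_header_owner(eb) == root_owner ? 0 : -EUCLEAN;
        }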
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add and use helper to assert an inode range is clean · 63c34cb4
      Filipe Manana authored
      We have four different scenarios where we don't expect to find ordered
      extents after locking a file range:
      
      1) During plain fallocate;
      2) During hole punching;
      3) During zero range;
      4) During reflinks (both cloning and deduplication).
      
      This is because in all these cases we follow the pattern:
      
      1) Lock the inode's VFS lock in exclusive mode;
      
      2) Lock the inode's i_mmap_lock in exclusive mode, to serialize with
         mmap writes;
      
      3) Flush delalloc in a file range and wait for all ordered extents
         to complete - both done through btrfs_wait_ordered_range();
      
      4) Lock the file range in the inode's io_tree.
      
      So add a helper that asserts that we don't have ordered extents for a
      given range. Make the four scenarios listed above use this helper after
      locking the respective file range.
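
      A hedged sketch of such an assertion helper (the lookup call is the
      existing btrfs_lookup_first_ordered_range(); the helper name is
      illustrative):

        /* Assert that a locked file range has no ordered extents. */
        static void assert_range_clean(struct btrfs_inode *inode,
                                       u64 start, u64 end)
        {
            struct btrfs_ordered_extent *ordered;

            ordered = btrfs_lookup_first_ordered_range(inode, start,
                                                       end + 1 - start);
            if (ordered)
                btrfs_put_ordered_extent(ordered);
            ASSERT(ordered == NULL);
        }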
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove ordered extent check and wait during hole punching and zero range · 55961c8a
      Filipe Manana authored
      For hole punching and zero range we have a loop that checks if we have
      ordered extents after locking the file range and, if so, unlocks the
      range, waits for the ordered extents, and retries until we don't find
      any more ordered extents.
      
      This logic was needed in the past because:
      
      1) Direct IO writes within the i_size boundary did not take the inode's
         VFS lock. This was because that lock used to be a mutex, then some
         years ago it was switched to a rw semaphore (commit 9902af79
         ("parallel lookups: actual switch to rwsem")), and then btrfs was
         changed to take the VFS inode's lock in shared mode for writes that
         don't cross the i_size boundary (commit e9adabb9 ("btrfs: use
         shared lock for direct writes within EOF"));
      
      2) We could race with memory mapped writes, because memory mapped writes
         don't acquire the inode's VFS lock. We don't have that race anymore,
         as we have a rw semaphore to synchronize memory mapped writes with
         fallocate (and reflinking too). That change happened with commit
         8d9b4a16 ("btrfs: exclude mmap from happening during all
         fallocate operations").
      
      So stop looking for ordered extents after locking the file range when
      doing hole punching and zero range operations.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: lock the inode first before flushing range when punching hole · bd6526d0
      Filipe Manana authored
      When doing hole punching we are flushing delalloc and waiting for ordered
      extents to complete before locking the inode (VFS lock and the btrfs
      specific i_mmap_lock). This is fine because even if a write happens after
      we call btrfs_wait_ordered_range() and before we lock the inode (call
      btrfs_inode_lock()), we will notice the write at
      btrfs_punch_hole_lock_range() and flush delalloc and wait for its ordered
      extent.
      
      We can however make this simpler by locking the inode first and then
      calling btrfs_wait_ordered_range(), which will allow us to remove the
      ordered extent lookup logic from btrfs_punch_hole_lock_range() in the
      next patch. It also makes the behaviour the same as plain fallocate,
      zero range and reflinks.
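
      In code, the new ordering is simply the following (hedged sketch
      using the existing btrfs_inode_lock() and btrfs_wait_ordered_range()
      helpers; offset/len come from the punch request):

        /* Take the VFS lock and i_mmap_lock first, then flush and wait. */
        btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP);
        ret = btrfs_wait_ordered_range(inode, offset, len);
        if (ret)
            goto out_unlock;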
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove ordered extent check and wait during fallocate · ffa8fc60
      Filipe Manana authored
      For fallocate() we have a loop that checks if we have ordered extents
      after locking the file range and, if so, unlocks the range, waits for
      the ordered extents, and retries until we don't find any more.
      
      This logic was needed in the past because:
      
      1) Direct IO writes within the i_size boundary did not take the inode's
         VFS lock. This was because that lock used to be a mutex, then some
         years ago it was switched to a rw semaphore (commit 9902af79
         ("parallel lookups: actual switch to rwsem")), and then btrfs was
         changed to take the VFS inode's lock in shared mode for writes that
         don't cross the i_size boundary (commit e9adabb9 ("btrfs: use
         shared lock for direct writes within EOF"));
      
      2) We could race with memory mapped writes, because memory mapped writes
         don't acquire the inode's VFS lock. We don't have that race anymore,
         as we have a rw semaphore to synchronize memory mapped writes with
         fallocate (and reflinking too). That change happened with commit
         8d9b4a16 ("btrfs: exclude mmap from happening during all
         fallocate operations").
      
      So stop looking for ordered extents after locking the file range when
      doing a plain fallocate.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove inode_dio_wait() calls when starting reflink operations · 1c6cbbbe
      Filipe Manana authored
      When starting a reflink operation we have these calls to inode_dio_wait()
      which used to be needed because direct IO writes that don't cross the
      i_size boundary did not take the inode's VFS lock, so we could race with
      them and end up with ordered extents in the target range after calling
      btrfs_wait_ordered_range().
      
      However that is not the case anymore, because the inode's VFS lock was
      changed from a mutex to a rw semaphore, by commit 9902af79
      ("parallel lookups: actual switch to rwsem"), and several years later we
      started to lock the inode's VFS lock in shared mode for direct IO writes
      that don't cross the i_size boundary (commit e9adabb9 ("btrfs: use
      shared lock for direct writes within EOF")).
      
      So remove those inode_dio_wait() calls.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove useless dio wait call when doing fallocate zero range · 831e1ee6
      Filipe Manana authored
      When starting a fallocate zero range operation, before getting the first
      extent map for the range, we make a call to inode_dio_wait().
      
      This logic was needed in the past because direct IO writes within the
      i_size boundary did not take the inode's VFS lock. This was because that
      lock used to be a mutex, then some years ago it was switched to a rw
      semaphore (by commit 9902af79 ("parallel lookups: actual switch to
      rwsem")), and then btrfs was changed to take the VFS inode's lock in
      shared mode for writes that don't cross the i_size boundary (done in
      commit e9adabb9 ("btrfs: use shared lock for direct writes within
      EOF")). The lockless direct IO writes could result in a race with the
      zero range operation, resulting in the latter getting a stale extent
      map for the range.
      
      So remove this no longer needed call to inode_dio_wait(), as fallocate
      takes the inode's VFS lock in exclusive mode and direct IO writes within
      i_size take that same lock in shared mode.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: only reserve the needed data space amount during fallocate · 47e1d1c7
      Filipe Manana authored
      During a plain fallocate, we always start by reserving an amount of data
      space that matches the length of the range passed to fallocate. When we
      already have extents allocated in that range, we may end up trying to
      reserve a lot more data space than we need, which can result in several
      undesired behaviours:
      
      1) We fail with -ENOSPC. For example the passed range has a length
         of 1G, but there's only one hole with a size of 1M in that range;
      
      2) We temporarily reserve excessive data space that could be used by
         other operations happening concurrently;
      
      3) By reserving much more data space than we need, we can end up
         doing expensive things like triggering delalloc for other inodes,
         waiting for ordered extents to complete, triggering transaction
         commits, allocating new block groups, etc.
      
      Example:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f -b 1g $DEV
        mount $DEV $MNT
      
        # Create a file with a size of 600M and two holes, one at [200M, 201M[
        # and another at [401M, 402M[
        xfs_io -f -c "pwrite -S 0xab 0 200M" \
                  -c "pwrite -S 0xcd 201M 200M" \
                  -c "pwrite -S 0xef 402M 198M" \
                  $MNT/foobar
      
        # Now call fallocate against the whole file range, see if it fails
        # with -ENOSPC or not - it shouldn't since we only need to allocate
        # 2M of data space.
        xfs_io -c "falloc 0 600M" $MNT/foobar
      
        umount $MNT
      
        $ ./test.sh
        (...)
        wrote 209715200/209715200 bytes at offset 0
        200 MiB, 51200 ops; 0.8063 sec (248.026 MiB/sec and 63494.5831 ops/sec)
        wrote 209715200/209715200 bytes at offset 210763776
        200 MiB, 51200 ops; 0.8053 sec (248.329 MiB/sec and 63572.3172 ops/sec)
        wrote 207618048/207618048 bytes at offset 421527552
        198 MiB, 50688 ops; 0.7925 sec (249.830 MiB/sec and 63956.5548 ops/sec)
        fallocate: No space left on device
        $
      
      So fix this by not allocating an amount of data space that matches the
      length of the range passed to fallocate. Instead allocate an amount of
      data space that corresponds to the sum of the sizes of each hole found
      in the range. This reservation now happens after we have locked the file
      range, which is safe since we know at this point there's no delalloc
      in the range because we've taken the inode's VFS lock in exclusive mode,
      we have taken the inode's i_mmap_lock in exclusive mode, we have flushed
      delalloc and waited for all ordered extents in the range to complete.
      
      This type of failure actually seems to happen in practice with systemd,
      and we had at least one report about this in a very long thread which
      is referenced by the Link tag below.
      
      Link: https://lore.kernel.org/linux-btrfs/bdJVxLiFr_PyQSXRUbZJfFW_jAjsGgoMetqPHJMbg-hdy54Xt_ZHhRetmnJ6cJ99eBlcX76wy-AvWwV715c3YndkxneSlod11P1hlaADx0s=@protonmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: restore inode creation before xattr setting · 6c3636eb
      Sweet Tea Dorminy authored
      According to the tree checker, "all xattrs with a given objectid follow
      the inode with that objectid in the tree" is an invariant. This was
      broken by the recent change "btrfs: move common inode creation code into
      btrfs_create_new_inode()", which moved acl creation and property
      inheritance (stored in xattrs) to before inode insertion into the tree.
      As a result, under certain timings, the xattrs could be written to the
      tree before the inode, causing the tree checker to report violation of
      the invariant.
      
      Move property inheritance and acl creation back to their old ordering
      after the inode insertion.
      Suggested-by: Omar Sandoval <osandov@osandov.com>
      Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move common inode creation code into btrfs_create_new_inode() · caae78e0
      Omar Sandoval authored
      All of our inode creation code paths duplicate the calls to
      btrfs_init_inode_security() and btrfs_add_link(). Subvolume creation
      additionally duplicates property inheritance and the call to
      btrfs_set_inode_index(). Fix this by moving the common code into
      btrfs_create_new_inode(). This accomplishes a few things at once:
      
      1. It reduces code duplication.
      
      2. It allows us to set up the inode completely before inserting the
         inode item, removing calls to btrfs_update_inode().
      
      3. It fixes a leak of an inode on disk in some error cases. For example,
         in btrfs_create(), if btrfs_new_inode() succeeds, then we have
         inserted an inode item and its inode ref. However, if something after
         that fails (e.g., btrfs_init_inode_security()), then we end the
         transaction and then decrement the link count on the inode. If the
         transaction is committed and the system crashes before the failed
         inode is deleted, then we leak that inode on disk. Instead, this
         refactoring aborts the transaction when we can't recover more
         gracefully.
      
      4. It exposes various ways that subvolume creation diverges from mkdir
         in terms of inheriting flags, properties, permissions, and POSIX
         ACLs, a lot of which appears to be accidental. This patch explicitly
         does _not_ change the existing non-standard behavior, but it makes
         those differences more clear in the code and documents them so that
         we can discuss whether they should be changed.
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reserve correct number of items for inode creation · 3538d68d
      Omar Sandoval authored
      The various inode creation code paths do not account for the compression
      property, POSIX ACLs, or the parent inode item when starting a
      transaction. Fix it by refactoring all of these code paths to use a new
      function, btrfs_new_inode_prepare(), which computes the correct number
      of items. To do so, it needs to know whether POSIX ACLs will be created,
      so move the ACL creation into that function. To reduce the number of
      arguments that need to be passed around for inode creation, define
      struct btrfs_new_inode_args containing all of the relevant information.
      
      btrfs_new_inode_prepare() will also be a good place to set up the
      fscrypt context and encrypted filename in the future.
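
      A hedged sketch of the argument bundle idea (the exact field set of
      the real struct btrfs_new_inode_args may differ):

        /* Bundle inode-creation state so the helpers share one context. */
        struct btrfs_new_inode_args {
            /* Input */
            struct inode *dir;       /* parent directory */
            struct dentry *dentry;
            struct inode *inode;     /* preallocated VFS inode */
            bool orphan;             /* tmpfile-style creation */

            /* Output of btrfs_new_inode_prepare() */
            struct posix_acl *default_acl;
            struct posix_acl *acl;
        };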
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: factor out common part of btrfs_{mknod,create,mkdir}() · 5f465bf1
      Omar Sandoval authored
      btrfs_{mknod,create,mkdir}() are now identical other than the inode
      initialization and some inconsequential function call order differences.
      Factor out the common code to reduce code duplication.
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allocate inode outside of btrfs_new_inode() · a1fd0c35
      Omar Sandoval authored
      Instead of calling new_inode() and inode_init_owner() inside of
      btrfs_new_inode(), do it in the callers. This allows us to pass in just
      the inode instead of the mnt_userns and mode and removes the need for
      memalloc_nofs_{save,restore}() since we do it before starting a
      transaction. In create_subvol(), it also means we no longer have to look
      up the inode again to instantiate it. This also paves the way for some
      more cleanups in later patches.
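
      The caller-side pattern then looks roughly like this (hedged sketch;
      new_inode() and inode_init_owner() are the standard VFS helpers):

        /* Allocate and initialize the inode before btrfs_new_inode(). */
        inode = new_inode(dir->i_sb);
        if (!inode)
            return -ENOMEM;
        inode_init_owner(mnt_userns, inode, dir, mode);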
      
      This also removes the comments about Smack checking i_op, which are no
      longer true since commit 5d6c3191 ("xattr: Add
      __vfs_{get,set,remove}xattr helpers"). Now it checks inode->i_opflags &
      IOP_XATTR, which is set based on sb->s_xattr.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: warn when extent buffer leak test fails · b95b78e6
      Qu Wenruo authored
      Although we have btrfs_extent_buffer_leak_debug_check() (enabled by
      CONFIG_BTRFS_DEBUG option) to detect and warn QA testers that we have
      some extent buffer leakage, it's just a pr_err(), which is not noisy
      enough for fstests to catch.
      
      So here we trigger a WARN_ON() if the allocated_ebs list is not empty.
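
      The check then boils down to something like the following (hedged
      sketch; allocated_ebs is the CONFIG_BTRFS_DEBUG leak-tracking list):

        /* A leaked extent buffer now trips a loud, catchable warning. */
        WARN_ON(!list_empty(&fs_info->allocated_ebs));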
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use a local variable for fs_devices pointer in btrfs_dev_replace_finishing · b67d73c1
      Anand Jain authored
      In the function btrfs_dev_replace_finishing() we dereference
      fs_info->fs_devices six times. Use a local variable for it instead.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_for_each_slot in btrfs_listxattr · 184b3d19
      Gabriel Niebler authored
      This function can be simplified by refactoring to use the new iterator
      macro.  No functional changes.
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_for_each_slot in btrfs_read_chunk_tree · 43cb1478
      Gabriel Niebler authored
      This function can be simplified by refactoring to use the new iterator
      macro.  No functional changes.
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_for_each_slot in btrfs_unlink_all_paths · 3d64f060
      Gabriel Niebler authored
      This function can be simplified by refactoring to use the new iterator
      macro.  No functional changes.
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_for_each_slot in process_all_extents · 9930e9d4
      Gabriel Niebler authored
      This function can be simplified by refactoring to use the new iterator
      macro.  No functional changes.
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_for_each_slot in process_all_new_xattrs · 69e43177
      Gabriel Niebler authored
      This function can be simplified by refactoring to use the new iterator
      macro.  No functional changes.
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_for_each_slot in process_all_refs · 649b9635
      Gabriel Niebler authored
      This function can be simplified by refactoring to use the new iterator
      macro.  No functional changes.
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_for_each_slot in is_ancestor · 35a68080
      Gabriel Niebler authored
      This function can be simplified by refactoring to use the new iterator
      macro.  No functional changes.
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_for_each_slot in can_rmdir · 18f80f1f
      Gabriel Niebler authored
      This function can be simplified by refactoring to use the new iterator
      macro.  No functional changes.
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_for_each_slot in did_create_dir · 6dcee260
      Gabriel Niebler authored
      This function can be simplified by refactoring to use the new iterator
      macro.  No functional changes.
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_for_each_slot in btrfs_real_readdir · a8ce68fd
      Gabriel Niebler authored
      This function can be simplified by refactoring to use the new iterator
      macro.  No functional changes.
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_for_each_slot in btrfs_search_dir_index_item · 9dcbe16f
      Gabriel Niebler authored
      This function can be simplified by refactoring to use the new iterator
      macro.  No functional changes.
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_for_each_slot in mark_block_group_to_copy · 9bc5fc04
      Gabriel Niebler authored
      This function can be simplified by refactoring to use the new iterator
      macro.  No functional changes.
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_for_each_slot in find_first_block_group · 36dfbbe2
      Gabriel Niebler authored
      This function can be simplified by refactoring to use the new iterator
      macro.  No functional changes.
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce btrfs_for_each_slot iterator macro · 62142be3
      Gabriel Niebler authored
      There is a common pattern when searching for a key in btrfs:
      
      * Call btrfs_search_slot to find the slot for the key
      * Enter an endless loop:
        * If the found slot is larger than the no. of items in the current
          leaf, check the next leaf
        * If it's still not found in the next leaf, terminate the loop
        * Otherwise do something with the found key
        * Increment the current slot and continue
      
      To reduce code duplication, we can replace this code pattern with an
      iterator macro, similar to the existing for_each_X macros found
      elsewhere in the kernel.  This also makes the code easier to understand
      for newcomers by putting a name to the encapsulated functionality.
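
      Usage then collapses to a single loop, roughly like this (hedged
      sketch; 'root' and 'path' are set up by the caller as usual and the
      variable names are illustrative):

        struct btrfs_key key;        /* key to start the search at */
        struct btrfs_key found_key;  /* key of the current slot */
        int ret;

        btrfs_for_each_slot(root, &key, &found_key, path, ret) {
            /* Use path->nodes[0] / path->slots[0] via found_key here. */
        }
        /* ret < 0 means the search itself failed. */
        if (ret < 0)
            goto error;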
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: rename scrub_bio::pagev and related members · e360d2f5
      Qu Wenruo authored
      Since the subpage support for scrub, one page no longer always
      represents one sector, thus the names scrub_bio::pagev and
      scrub_bio::page_count are no longer accurate.

      Rename them to scrub_bio::sectors and scrub_bio::sector_count
      respectively. This also covers scrub_ctx::pages_per_bio and other
      involved macros.

      With this, the renaming of pages involved in scrub is finished.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: rename scrub_page to scrub_sector · 46343501
      Qu Wenruo authored
      Since the subpage support of scrub, scrub_page in fact just
      represents one sector.

      Thus the name scrub_page is no longer correct, so rename it to
      scrub_sector.

      This also involves the following renames:

      - spage -> sector
        Normally we would just replace "page" with "sector", resulting in
        something like "ssector", but the repeating 's' is not really eye
        friendly.

        So here we simply use "sector", as there is nothing from the MM
        layer called "sector" to cause any confusion.

      - scrub_parity::spages -> sectors_list
        Normally we use a plural to indicate an array, not a list.
        Rename it to @sectors_list to be more explicit about the list part.
      
      - Also reformat and update comments that get changed
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: rename members related to scrub_block::pagev · 7e737cbc
      Qu Wenruo authored
      The following will be renamed in this patch:
      
      - scrub_block::pagev -> sectors
      
      - scrub_block::page_count -> sector_count
      
      - SCRUB_MAX_PAGES_PER_BLOCK -> SCRUB_MAX_SECTORS_PER_BLOCK
      
      - page_num -> sector_num to iterate scrub_block::sectors
      
      For now scrub_page is not yet renamed to keep the patch reasonable and
      it will be updated in a followup.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove trivial wrapper btrfs_read_buffer() · 6a2e9dc4
      Filipe Manana authored
      The function btrfs_read_buffer() is useless, it just calls
      btree_read_extent_buffer_pages() with exactly the same arguments.
      
      So remove it and rename btree_read_extent_buffer_pages() to
      btrfs_read_extent_buffer(), which is a shorter name, has the "btrfs_"
      prefix (since it's used outside disk-io.c) and the name is clear enough
      about what it does.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: update outdated comment for read_block_for_search() · 376a21d7
      Filipe Manana authored
      The comment at the top of read_block_for_search() is very outdated, as it
      refers to the blocking versus spinning path locking modes. We no longer
      have these two locking modes after we switched the btree locks from custom
      code to rw semaphores. So update the comment to stop referring to the
      blocking mode and bring it up to date.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: release upper nodes when reading stale btree node from disk · b246666e
      Filipe Manana authored
      When reading a btree node (or leaf), at read_block_for_search(), if we
      can't find its extent buffer in the cache (the fs_info->buffer_radix
      radix tree), then we unlock all upper level nodes before reading the
      btree node/leaf from disk, to prevent blocking other tasks for too long.
      
      However if we find that the extent buffer is in the cache but it is not
      up to date, we don't unlock upper level nodes before reading it from disk,
      potentially blocking other tasks on upper level nodes for too long.
      
      Fix this inconsistent behaviour by unlocking upper level nodes if we need
      to read a node/leaf from disk because its in-memory extent buffer is not
      up to date. If we unlocked upper level nodes then we must return -EAGAIN
      to the caller, just like the case where the extent buffer is not cached in
      memory. And like that case, we determine if upper level nodes are locked
      by checking only if the parent node is locked - if it isn't, then no other
      upper level nodes are locked.
      
      This is actually a rare case, as if we have an extent buffer in memory,
      it typically has the uptodate flag set and passes all the checks done by
      btrfs_buffer_uptodate().
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid unnecessary btree search restarts when reading node · 4bb59055
      Filipe Manana authored
      When reading a btree node, at read_block_for_search(), if we don't find
      the node's (or leaf) extent buffer in the cache, we will read it from
      disk. Since that requires waiting on IO, we release all upper level nodes
      from our path before reading the target node/leaf, and then return -EAGAIN
      to the caller, which will make the caller restart the whole btree search.
      
      However we are causing the restart of btree search even for cases where
      it is not necessary:
      
      1) We have a path with ->skip_locking set to true, typically when doing
         a search on a commit root, so we are never holding locks on any node;
      
      2) We are doing a read search (the "ins_len" argument passed to
         btrfs_search_slot() is 0), or we are doing a search to modify an
         existing key (the "cow" argument passed to btrfs_search_slot() has
         a value of 1 and "ins_len" is 0), in which case we never hold locks
         for upper level nodes;
      
      3) We are doing a search to insert or delete a key, in which case we may
         or may not have upper level nodes locked. That depends on the current
         minimum write lock levels at btrfs_search_slot(), if we had to split
         or merge parent nodes, if we had to COW upper level nodes and if
         we ever visited slot 0 of an upper level node. It's still common to
         not have upper level nodes locked, but our current node must be at
         least at level 1, for insertions, or at least at level 2 for deletions.
         In these cases when we have locks on upper level nodes, they are always
         write locks.
      
      These cases where we are not holding locks on upper level nodes far
      outweigh the cases where we are holding locks, so it's completely wasteful
      to retry the whole search when we have no upper nodes locked.
      
      So change the logic to not return -EAGAIN, and make the caller retry the
      search, when we don't have the parent node locked - when it's not locked
      it means no other upper level nodes are locked as well.
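
      Sketched out, the idea is roughly the following (illustrative, not
      the literal patch; btrfs_unlock_up_safe() is the existing helper):

        if (path->locks[level + 1]) {
            /* Upper nodes are locked: drop them and ask for a restart. */
            btrfs_unlock_up_safe(path, level + 1);
            ret = -EAGAIN;
        } else {
            /* Nothing locked above us: read the block and carry on. */
            ret = 0;
        }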
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>