07 Oct, 2020 (40 commits)
    • btrfs: introduce BTRFS_NESTING_LEFT/RIGHT_COW · bf59a5a2
      Josef Bacik authored
      For similar reasons as BTRFS_NESTING_COW, we need
      BTRFS_NESTING_LEFT/RIGHT_COW.  The pattern is this:
      
      lock leaf -> BTRFS_NESTING_NORMAL
        cow leaf -> BTRFS_NESTING_COW
          split leaf
            lock left -> BTRFS_NESTING_LEFT
              cow left -> BTRFS_NESTING_LEFT_COW
      
      We need this in order to indicate to lockdep that these locks are
      discrete and are being taken in a safe order.
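      
      For illustration, a minimal sketch of how per-subclass lockdep
      annotation works in general (the enum and helper below are illustrative
      stand-ins, not the actual btrfs API):
      
        #include <linux/rwsem.h>
      
        enum demo_nesting {             /* stand-in for BTRFS_NESTING_* */
                DEMO_NESTING_NORMAL,
                DEMO_NESTING_COW,
                DEMO_NESTING_LEFT,
                DEMO_NESTING_LEFT_COW,
        };
      
        static void demo_lock(struct rw_semaphore *sem, enum demo_nesting nest)
        {
                /* each value selects its own lockdep subclass, so the
                 * NORMAL -> COW -> LEFT -> LEFT_COW chain is not reported
                 * as recursive locking of a single class */
                down_write_nested(sem, nest);
        }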
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce BTRFS_NESTING_LEFT/BTRFS_NESTING_RIGHT · bf77467a
      Josef Bacik authored
      Our lockdep maps are based on rootid+level, however in some cases we
      will lock adjacent blocks on the same level, namely in searching forward
      or in split/balance.  Because of this lockdep will complain, so we need
      a separate subclass to indicate to lockdep that these are different
      locks.
      
      lock leaf -> BTRFS_NESTING_NORMAL
        cow leaf -> BTRFS_NESTING_COW
          split leaf
             lock left -> BTRFS_NESTING_LEFT
             lock right -> BTRFS_NESTING_RIGHT
      
      The above graph illustrates the need for this new nesting subclass.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce BTRFS_NESTING_COW for cow'ing blocks · 9631e4cc
      Josef Bacik authored
      When we COW a block we are holding a lock on the original block, and
      then we lock the new COW block.  Because our lockdep maps are based on
      root + level, this will make lockdep complain.  We need a way to
      indicate a subclass for locking the COW'ed block, so plumb through our
      btrfs_lock_nesting from btrfs_cow_block down to the btrfs_init_buffer,
      and then introduce BTRFS_NESTING_COW to be used for cow'ing blocks.
      
      The reason I've added all this extra infrastructure is because there
      will be need of different nesting classes in follow up patches.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add nesting tags to the locking helpers · fd7ba1c1
      Josef Bacik authored
      We will need these when we switch to an rwsem, so plumb in the
      infrastructure here to use later on.  I violate the 80 character limit
      some here because it'll be cleaned up later.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce btrfs_path::recurse · 51899412
      Josef Bacik authored
      Our current tree locking stuff allows us to recurse with read locks if
      we're already holding the write lock.  This is necessary for the space
      cache inode, as we could be holding a lock on the root_tree root when we
      need to cache a block group, and thus need to be able to read down the
      root_tree to read in the inode cache.
      
      We can get away with this in our current locking, but we won't be able
      to with a rwsem.  Handle this by purposefully annotating the places
      where we require recursion, so that in the future we can maybe come up
      with a way to avoid the recursion.  In the case of the free space inode,
      this will be superseded by the free space tree.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename extent_buffer::lock_nested to extent_buffer::lock_recursed · 329ced79
      Josef Bacik authored
      Nested locking with lockdep and everything else refers to lock hierarchy
      within the same lock map.  This is how we indicate the same locks for
      different objects are ok to take in a specific order, for our use case
      that would be to take the lock on a leaf and then take a lock on an
      adjacent leaf.
      
      What ->lock_nested _actually_ refers to is if we happen to already be
      holding the write lock on the extent buffer and we're allowing a read
      lock to be taken on that extent buffer, which is recursion.  Rename this
      so we don't get confused when we switch to a rwsem and have to start
      using the _nested helpers.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't opencode sync_blockdev in btrfs_init_new_device · b9ba017f
      Nikolay Borisov authored
      Instead of opencoding filemap_write_and_wait, simply call sync_blockdev
      as it makes it abundantly clear what's going on and why this is used. No
      semantic changes.
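      
      For illustration, a hedged sketch of the simplification (not the actual
      diff; the header declaring sync_blockdev varies by kernel version):
      
        #include <linux/fs.h>
      
        static int flush_new_device(struct block_device *bdev)
        {
                /* was: filemap_write_and_wait(bdev->bd_inode->i_mapping); */
                return sync_blockdev(bdev); /* same semantics, clearer intent */
        }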
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant code from btrfs_free_stale_devices · 4ae312e9
      Nikolay Borisov authored
      Following the refactor of btrfs_free_stale_devices in
      7bcb8164 ("btrfs: use device_list_mutex when removing stale devices")
      fs_devices are freed after they have been iterated by the inner
      list_for_each so the use-after-free fixed by introducing the break in
      fd649f10 ("btrfs: Fix use-after-free when cleaning up fs_devs with
      a single stale device") is no longer necessary. Just remove it
      altogether. No functional changes.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: refactor locked condition in btrfs_init_new_device · 44cab9ba
      Nikolay Borisov authored
      Invert unlocked to locked and exploit the fact it can only ever be
      modified if we are adding a new device to a seed filesystem. This allows
      us to simplify the check in the error: label. No semantic changes.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use RCU for quick device check in btrfs_init_new_device · f4cfa9bd
      Nikolay Borisov authored
      When adding a new device there's a mandatory check to see if the device
      is being duplicated to the filesystem it's added to. Since this is a
      read-only operation it's not necessary to take device_list_mutex; we can
      simply make do with an RCU read lock.
      
      Using just RCU is safe because there won't be another device add/delete
      running in parallel, as btrfs_init_new_device is called only from
      btrfs_ioctl_add_dev.
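      
      A hedged sketch of what such an RCU-protected duplicate scan looks like
      (assumes btrfs internal headers; the field and list names follow the
      usual btrfs layout but are written from memory, not copied from the
      diff):
      
        static bool demo_device_present(struct btrfs_fs_devices *fs_devices,
                                        struct block_device *bdev)
        {
                struct btrfs_device *device;
                bool found = false;
      
                rcu_read_lock();        /* reader only, no device_list_mutex */
                list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
                        if (device->bdev == bdev) {
                                found = true;
                                break;
                        }
                }
                rcu_read_unlock();
                return found;
        }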
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: ctree: check key order before merging tree blocks · d16c702f
      Qu Wenruo authored
      [BUG]
      With a crafted image, btrfs can panic at btrfs_del_csums():
      
        kernel BUG at fs/btrfs/ctree.c:3188!
        invalid opcode: 0000 [#1] SMP PTI
        CPU: 0 PID: 1156 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9
        RIP: 0010:btrfs_set_item_key_safe+0x16c/0x180
        RSP: 0018:ffff976141257ab8 EFLAGS: 00010202
        RAX: 0000000000000001 RBX: ffff898a6b890930 RCX: 0000000004b70000
        RDX: 0000000000000000 RSI: ffff976141257bae RDI: ffff976141257acf
        RBP: ffff976141257b10 R08: 0000000000001000 R09: ffff9761412579a8
        R10: 0000000000000000 R11: 0000000000000000 R12: ffff976141257abe
        R13: 0000000000000003 R14: ffff898a6a8be578 R15: ffff976141257bae
        FS: 0000000000000000(0000) GS:ffff898a77a00000(0000) knlGS:0000000000000000
        CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f779d9cd624 CR3: 000000022b2b4006 CR4: 00000000000206f0
        Call Trace:
        truncate_one_csum+0xac/0xf0
        btrfs_del_csums+0x24f/0x3a0
        __btrfs_free_extent.isra.72+0x5a7/0xbe0
        __btrfs_run_delayed_refs+0x539/0x1120
        btrfs_run_delayed_refs+0xdb/0x1b0
        btrfs_commit_transaction+0x52/0x950
        ? start_transaction+0x94/0x450
        transaction_kthread+0x163/0x190
        kthread+0x105/0x140
        ? btrfs_cleanup_transaction+0x560/0x560
        ? kthread_destroy_worker+0x50/0x50
        ret_from_fork+0x35/0x40
        Modules linked in:
        ---[ end trace 93bf9db00e6c374e ]---
      
      [CAUSE]
      This crafted image has a tricky key order corruption:
      
        checksum tree key (CSUM_TREE ROOT_ITEM 0)
        node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE
                ...
                key (EXTENT_CSUM EXTENT_CSUM 73785344) block 29757440 gen 19
                key (EXTENT_CSUM EXTENT_CSUM 77594624) block 29753344 gen 19
                ...
      
        leaf 29757440 items 5 free space 150 generation 19 owner CSUM_TREE
                item 0 key (EXTENT_CSUM EXTENT_CSUM 73785344) itemoff 2323 itemsize 1672
                        range start 73785344 end 75497472 length 1712128
                item 1 key (EXTENT_CSUM EXTENT_CSUM 75497472) itemoff 2319 itemsize 4
                        range start 75497472 end 75501568 length 4096
                item 2 key (EXTENT_CSUM EXTENT_CSUM 75501568) itemoff 579 itemsize 1740
                        range start 75501568 end 77283328 length 1781760
                item 3 key (EXTENT_CSUM EXTENT_CSUM 77283328) itemoff 575 itemsize 4
                        range start 77283328 end 77287424 length 4096
                item 4 key (EXTENT_CSUM EXTENT_CSUM 4120596480) itemoff 275 itemsize 300 <<<
                        range start 4120596480 end 4120903680 length 307200
        leaf 29753344 items 3 free space 1936 generation 19 owner CSUM_TREE
                item 0 key (18446744073457893366 EXTENT_CSUM 77594624) itemoff 2323 itemsize 1672
                        range start 77594624 end 79306752 length 1712128
                ...
      
      Note the item 4 key of leaf 29757440, which is obviously too large, and
      even larger than the first key of the next leaf.
      
      However it still follows the key order inside that tree block, so the
      tree checker is unable to detect it at read time: the tree checker can
      only work inside one leaf, thus such cross-leaf corruption can't be
      detected in advance.
      
      [FIX]
      The next chance to detect such a problem is at tree block merge time,
      which is in push_node_left(), balance_node_right(), push_leaf_left() or
      push_leaf_right().
      
      Now we check that the right-most key of the left block is strictly
      smaller than the left-most key of the right block; anything else
      indicates exactly this kind of corruption.
      
      With this we don't need to run the full tree-checker: the key order
      inside each block is already verified at read time, so only these two
      boundary slots need to be checked.
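      
      A minimal sketch of that boundary check for nodes (the helper name is
      hypothetical; the real patch adds an equivalent check to the four
      helpers named above, using these btrfs accessors):
      
        static bool demo_blocks_out_of_order(struct extent_buffer *left,
                                             struct extent_buffer *right)
        {
                struct btrfs_key last_left;
                struct btrfs_key first_right;
      
                btrfs_node_key_to_cpu(left, &last_left,
                                      btrfs_header_nritems(left) - 1);
                btrfs_node_key_to_cpu(right, &first_right, 0);
      
                /* corruption unless right's first key is strictly larger */
                return btrfs_comp_cpu_keys(&first_right, &last_left) <= 0;
        }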
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202833
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: extent-tree: kill the BUG_ON() in insert_inline_extent_backref() · 07cce5cf
      Qu Wenruo authored
      [BUG]
      With a crafted image, btrfs can panic at insert_inline_extent_backref():
      
        kernel BUG at fs/btrfs/extent-tree.c:1857!
        invalid opcode: 0000 [#1] SMP PTI
        CPU: 0 PID: 1117 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9
        RIP: 0010:insert_inline_extent_backref+0xcc/0xe0
        RSP: 0018:ffffac4dc1287be8 EFLAGS: 00010293
        RAX: 0000000000000000 RBX: 0000000000000007 RCX: 0000000000000001
        RDX: 0000000000001000 RSI: 0000000000000000 RDI: 0000000000000000
        RBP: ffffac4dc1287c28 R08: ffffac4dc1287ab8 R09: ffffac4dc1287ac0
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
        R13: ffff8febef88a540 R14: ffff8febeaa7bc30 R15: 0000000000000000
        FS: 0000000000000000(0000) GS:ffff8febf7a00000(0000) knlGS:0000000000000000
        CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f663ace94c0 CR3: 0000000235698006 CR4: 00000000000206f0
        Call Trace:
        ? _cond_resched+0x1a/0x50
        __btrfs_inc_extent_ref.isra.64+0x7e/0x240
        ? btrfs_merge_delayed_refs+0xa5/0x330
        __btrfs_run_delayed_refs+0x653/0x1120
        btrfs_run_delayed_refs+0xdb/0x1b0
        btrfs_commit_transaction+0x52/0x950
        ? start_transaction+0x94/0x450
        transaction_kthread+0x163/0x190
        kthread+0x105/0x140
        ? btrfs_cleanup_transaction+0x560/0x560
        ? kthread_destroy_worker+0x50/0x50
        ret_from_fork+0x35/0x40
        Modules linked in:
        ---[ end trace 2ad8b3de903cf825 ]---
      
      [CAUSE]
      Due to extent tree corruption (still valid by itself, but with bad cross
      references), we can allocate an extent which is already present in the
      extent tree.  The offending tree block of that case is from the csum
      tree.  The newly allocated tree block is also for the csum tree.
      
      Then we will try to insert a tree block ref for a tree block that
      already has one.
      
      For a tree extent item, a tree block can never be shared directly by the
      same tree twice.  We have a BUG_ON() to catch such a problem, but it is
      not proper error handling.
      
      [FIX]
      Replace that BUG_ON() with a proper error message, plus a leaf dump for
      debug builds.
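      
      A hedged sketch of the general pattern (the condition, message and
      variable names here are illustrative, not the patch's exact code):
      
        /* instead of BUG_ON(bad_backref): */
        if (unlikely(bad_backref)) {
                btrfs_crit(trans->fs_info,
                           "unexpected tree block backref, bytenr %llu",
                           bytenr);
        #ifdef CONFIG_BTRFS_DEBUG
                btrfs_print_leaf(path->nodes[0]);  /* leaf dump, debug builds */
        #endif
                return -EUCLEAN;  /* corruption, not a kernel bug */
        }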
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202829
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent() · 1c2a07f5
      Qu Wenruo authored
      __btrfs_free_extent() is doing two things:
      
      1. Reduce the refs number of an extent backref
         Either it's an inline extent backref (inside EXTENT/METADATA item) or
         a keyed extent backref (SHARED_* item).
         We only need to locate that backref line, either reduce the number or
         remove the backref line completely.
      
      2. Update the refs count in EXTENT/METADATA_ITEM
      
      During step 1), we will try to locate the EXTENT/METADATA_ITEM without
      triggering another btrfs_search_slot() as fast path.
      
      Only when we fail to locate that item, we will trigger another
      btrfs_search_slot() to get that EXTENT/METADATA_ITEM after we
      updated/deleted the backref line.
      
      And we have a lot of strict checks on things like refs_to_drop against
      extent refs and special case checks for single ref extents.
      
      There are 7 BUG_ON()s; although they're doing correct checks, they can
      be triggered by crafted images.
      
      This patch improves the function:
      
      - Introduce two examples to show what __btrfs_free_extent() is doing
        One inline backref case and one keyed case.  Should cover most cases.
      
      - Kill all BUG_ON()s with proper error message and optional leaf dump
      
      - Add comment to show the overall flow
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202819
      [ The report triggers one BUG_ON() in __btrfs_free_extent() ]
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: extent_io: do extra check for extent buffer read write functions · f98b6215
      Qu Wenruo authored
      Although we have start and len checks for the extent buffer read/write
      helpers (e.g. read_extent_buffer()), these checks have limitations:
      
      - No overflow check
        Values like start = 1024, len = -1024 can still pass the basic
        (start + len) > eb->len check.
      
      - Checks are not consistent
        For read_extent_buffer() we only check (start + len) against eb->len.
        While for memcmp_extent_buffer() we also check start against eb->len.
      
      - Different error reporting mechanism
        We use WARN() in read_extent_buffer() but BUG() in
        memcpy_extent_buffer().
      
      - Still modify memory if the request is obviously wrong
        In read_extent_buffer(), even if we find (start + len) > eb->len, we
        still call memset(dst, 0, len), which can easily cause a memory
        access error if start + len overflows.
      
      To address above problems, this patch creates a new common function to
      check such access, check_eb_range().
      
      - Add overflow check
        This function checks start, start + len against eb->len and overflow
        check.
      
      - Unified checks
      
      - Unified error reports
        WARN() is called if CONFIG_BTRFS_DEBUG is configured, and a
        btrfs_warn() message is also emitted for non-debug builds.
      
      - Exit ASAP if check fails
        No more possible memory corruption.
      
      - Add an extra comment for @start and @len used in those functions, as
        they are sometimes confused with logical addresses instead of a range
        inside the eb space
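      
      A hedged sketch of such a unified check built on check_add_overflow()
      (the helper name is illustrative; the real check_eb_range() also emits
      the WARN()/btrfs_warn() reports described above):
      
        #include <linux/overflow.h>
      
        static bool demo_eb_range_ok(const struct extent_buffer *eb,
                                     unsigned long start, unsigned long len)
        {
                unsigned long end;
      
                /* catches start + len wrapping around */
                if (check_add_overflow(start, len, &end))
                        return false;
                /* @start/@len are offsets inside the eb, not logical addrs */
                return start <= eb->len && end <= eb->len;
        }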
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202817
      [ Inspired by above report, the report itself is already addressed ]
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ use check_add_overflow ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rework error detection in init_tree_roots · 217f5004
      Nikolay Borisov authored
      To avoid duplicating 3 lines of code the error detection logic in
      init_tree_roots is somewhat convoluted: it first checks for the presence
      of any error condition, then checks for the specific condition to
      perform any specific actions. Directly checking each respective error
      condition and doing the necessary steps is more obvious. While at it,
      change the -EUCLEAN to -EIO in case the extent buffer is not read
      correctly; this is in line with other sites which return -EIO when the
      eb couldn't be read.
      
      Additionally it results in smaller code and the code reads
      more linearly:
      
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-95 (-95)
      Function                                     old     new   delta
      open_ctree                                 17243   17148     -95
      Total: Before=113104, After=113009, chg -0.08%
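      
      A simplified, hedged sketch of the straightened flow (variable names and
      the read_tree_block() signature are approximate for this kernel era):
      
        root->node = read_tree_block(fs_info, bytenr, generation, level, NULL);
        if (IS_ERR(root->node)) {
                ret = PTR_ERR(root->node);      /* read error */
                root->node = NULL;
        } else if (!extent_buffer_uptodate(root->node)) {
                free_extent_buffer(root->node);
                root->node = NULL;
                ret = -EIO;                     /* eb not read correctly */
        }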
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations · e85fde51
      Qu Wenruo authored
      [BUG]
      When quota is enabled for TEST_DEV, generic/013 sometimes fails like this:
      
        generic/013 14s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//generic/013.dmesg)
      
      And with the following metadata leak:
      
        BTRFS warning (device dm-3): qgroup 0/1370 has unreleased space, type 2 rsv 49152
        ------------[ cut here ]------------
        WARNING: CPU: 2 PID: 47912 at fs/btrfs/disk-io.c:4078 close_ctree+0x1dc/0x323 [btrfs]
        Call Trace:
         btrfs_put_super+0x15/0x17 [btrfs]
         generic_shutdown_super+0x72/0x110
         kill_anon_super+0x18/0x30
         btrfs_kill_super+0x17/0x30 [btrfs]
         deactivate_locked_super+0x3b/0xa0
         deactivate_super+0x40/0x50
         cleanup_mnt+0x135/0x190
         __cleanup_mnt+0x12/0x20
         task_work_run+0x64/0xb0
         __prepare_exit_to_usermode+0x1bc/0x1c0
         __syscall_return_slowpath+0x47/0x230
         do_syscall_64+0x64/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ---[ end trace a6cfd45ba80e4e06 ]---
        BTRFS error (device dm-3): qgroup reserved space leaked
        BTRFS info (device dm-3): disk space caching is enabled
        BTRFS info (device dm-3): has skinny extents
      
      [CAUSE]
      The qgroup preallocated meta rsv operations of that offending root are:
      
        btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
        btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
        btrfs_subvolume_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=49152
        btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
        btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
      
      It's pretty obvious: we reserve the qgroup meta rsv in
      btrfs_subvolume_reserve_metadata(), but don't have corresponding
      release/convert calls in btrfs_subvolume_release_metadata().
      
      This leads to the leak.
      
      [FIX]
      To fix this bug, we should follow what we're doing in
      btrfs_delalloc_reserve_metadata(), where we reserve qgroup space, and
      add it to block_rsv->qgroup_rsv_reserved.
      
      And free the qgroup reserved metadata space when releasing the
      block_rsv.
      
      To do this, we need to change btrfs_subvolume_release_metadata() to
      accept a btrfs_root, record the qgroup_to_release number, and call
      btrfs_qgroup_convert_reserved_meta() for it.
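      
      A hedged sketch of the reworked release path (signatures approximate;
      qgroup_to_release as described above):
      
        void btrfs_subvolume_release_metadata(struct btrfs_root *root,
                                              struct btrfs_block_rsv *rsv)
        {
                u64 qgroup_to_release;
      
                /* release the rsv and learn how much qgroup rsv it carried */
                btrfs_block_rsv_release(root->fs_info, rsv, (u64)-1,
                                        &qgroup_to_release);
                /* meta_prealloc -> meta_pertrans, freed at commit time */
                btrfs_qgroup_convert_reserved_meta(root, qgroup_to_release);
        }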
      
      Fixes: 733e03a0 ("btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: fix wrong qgroup metadata reserve for delayed inode · b4c5d8fd
      Qu Wenruo authored
      For delayed inode facility, qgroup metadata is reserved for it, and
      later freed.
      
      However we're freeing more bytes than we reserved.
      In btrfs_delayed_inode_reserve_metadata():
      
      	num_bytes = btrfs_calc_metadata_size(fs_info, 1);
      	...
      		ret = btrfs_qgroup_reserve_meta_prealloc(root,
      				fs_info->nodesize, true);
      		...
      		if (!ret) {
      			node->bytes_reserved = num_bytes;
      
      But in btrfs_delayed_inode_release_metadata():
      
      	if (qgroup_free)
      		btrfs_qgroup_free_meta_prealloc(node->root,
      				node->bytes_reserved);
      	else
      		btrfs_qgroup_convert_reserved_meta(node->root,
      				node->bytes_reserved);
      
      This means we're always releasing more qgroup metadata rsv than we have
      reserved.
      
      This won't trigger a selftest warning, as the btrfs qgroup metadata rsv
      has extra protection against cases like quota being enabled half-way.
      
      But we still need to fix this problem anyway.
      
      This patch uses the same num_bytes for the qgroup metadata rsv so that
      it is handled correctly.
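      
      Sketched from the snippets above, the fix reserves with the same
      num_bytes that is later recorded and released:
      
      	num_bytes = btrfs_calc_metadata_size(fs_info, 1);
      	...
      		ret = btrfs_qgroup_reserve_meta_prealloc(root, num_bytes, true);
      		...
      		if (!ret)
      			node->bytes_reserved = num_bytes;  /* reserve == release */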
      
      Fixes: f218ea6c ("btrfs: delayed-inode: Remove wrong qgroup meta reservation calls")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not hold device_list_mutex when closing devices · 425c6ed6
      Josef Bacik authored
      The following lockdep splat
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.8.0-rc7-00169-g87212851a027-dirty #929 Not tainted
      ------------------------------------------------------
      fsstress/8739 is trying to acquire lock:
      ffff88bfd0eb0c90 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x43/0x70
      
      but task is already holding lock:
      ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #10 (sb_pagefaults){.+.+}-{0:0}:
             __sb_start_write+0x129/0x210
             btrfs_page_mkwrite+0x6a/0x4a0
             do_page_mkwrite+0x4d/0xc0
             handle_mm_fault+0x103c/0x1730
             exc_page_fault+0x340/0x660
             asm_exc_page_fault+0x1e/0x30
      
      -> #9 (&mm->mmap_lock#2){++++}-{3:3}:
             __might_fault+0x68/0x90
             _copy_to_user+0x1e/0x80
             perf_read+0x141/0x2c0
             vfs_read+0xad/0x1b0
             ksys_read+0x5f/0xe0
             do_syscall_64+0x50/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #8 (&cpuctx_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             perf_event_init_cpu+0x88/0x150
             perf_event_init+0x1db/0x20b
             start_kernel+0x3ae/0x53c
             secondary_startup_64+0xa4/0xb0
      
      -> #7 (pmus_lock){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             perf_event_init_cpu+0x4f/0x150
             cpuhp_invoke_callback+0xb1/0x900
             _cpu_up.constprop.26+0x9f/0x130
             cpu_up+0x7b/0xc0
             bringup_nonboot_cpus+0x4f/0x60
             smp_init+0x26/0x71
             kernel_init_freeable+0x110/0x258
             kernel_init+0xa/0x103
             ret_from_fork+0x1f/0x30
      
      -> #6 (cpu_hotplug_lock){++++}-{0:0}:
             cpus_read_lock+0x39/0xb0
             kmem_cache_create_usercopy+0x28/0x230
             kmem_cache_create+0x12/0x20
             bioset_init+0x15e/0x2b0
             init_bio+0xa3/0xaa
             do_one_initcall+0x5a/0x2e0
             kernel_init_freeable+0x1f4/0x258
             kernel_init+0xa/0x103
             ret_from_fork+0x1f/0x30
      
      -> #5 (bio_slab_lock){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             bioset_init+0xbc/0x2b0
             __blk_alloc_queue+0x6f/0x2d0
             blk_mq_init_queue_data+0x1b/0x70
             loop_add+0x110/0x290 [loop]
             fq_codel_tcf_block+0x12/0x20 [sch_fq_codel]
             do_one_initcall+0x5a/0x2e0
             do_init_module+0x5a/0x220
             load_module+0x2459/0x26e0
             __do_sys_finit_module+0xba/0xe0
             do_syscall_64+0x50/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #4 (loop_ctl_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             lo_open+0x18/0x50 [loop]
             __blkdev_get+0xec/0x570
             blkdev_get+0xe8/0x150
             do_dentry_open+0x167/0x410
             path_openat+0x7c9/0xa80
             do_filp_open+0x93/0x100
             do_sys_openat2+0x22a/0x2e0
             do_sys_open+0x4b/0x80
             do_syscall_64+0x50/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             blkdev_put+0x1d/0x120
             close_fs_devices.part.31+0x84/0x130
             btrfs_close_devices+0x44/0xb0
             close_ctree+0x296/0x2b2
             generic_shutdown_super+0x69/0x100
             kill_anon_super+0xe/0x30
             btrfs_kill_super+0x12/0x20
             deactivate_locked_super+0x29/0x60
             cleanup_mnt+0xb8/0x140
             task_work_run+0x6d/0xb0
             __prepare_exit_to_usermode+0x1cc/0x1e0
             do_syscall_64+0x5c/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             btrfs_run_dev_stats+0x49/0x480
             commit_cowonly_roots+0xb5/0x2a0
             btrfs_commit_transaction+0x516/0xa60
             sync_filesystem+0x6b/0x90
             generic_shutdown_super+0x22/0x100
             kill_anon_super+0xe/0x30
             btrfs_kill_super+0x12/0x20
             deactivate_locked_super+0x29/0x60
             cleanup_mnt+0xb8/0x140
             task_work_run+0x6d/0xb0
             __prepare_exit_to_usermode+0x1cc/0x1e0
             do_syscall_64+0x5c/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             btrfs_commit_transaction+0x4bb/0xa60
             sync_filesystem+0x6b/0x90
             generic_shutdown_super+0x22/0x100
             kill_anon_super+0xe/0x30
             btrfs_kill_super+0x12/0x20
             deactivate_locked_super+0x29/0x60
             cleanup_mnt+0xb8/0x140
             task_work_run+0x6d/0xb0
             __prepare_exit_to_usermode+0x1cc/0x1e0
             do_syscall_64+0x5c/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
             __lock_acquire+0x1272/0x2310
             lock_acquire+0x9e/0x360
             __mutex_lock+0x9f/0x930
             btrfs_record_root_in_trans+0x43/0x70
             start_transaction+0xd1/0x5d0
             btrfs_dirty_inode+0x42/0xd0
             file_update_time+0xc8/0x110
             btrfs_page_mkwrite+0x10c/0x4a0
             do_page_mkwrite+0x4d/0xc0
             handle_mm_fault+0x103c/0x1730
             exc_page_fault+0x340/0x660
             asm_exc_page_fault+0x1e/0x30
      
      other info that might help us debug this:
      
      Chain exists of:
        &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(sb_pagefaults);
                                     lock(&mm->mmap_lock#2);
                                     lock(sb_pagefaults);
        lock(&fs_info->reloc_mutex);
      
       *** DEADLOCK ***
      
      3 locks held by fsstress/8739:
       #0: ffff88bee66eeb68 (&mm->mmap_lock#2){++++}-{3:3}, at: exc_page_fault+0x173/0x660
       #1: ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0
       #2: ffff88bfbd16e630 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3da/0x5d0
      
      stack backtrace:
      CPU: 17 PID: 8739 Comm: fsstress Kdump: loaded Not tainted 5.8.0-rc7-00169-g87212851a027-dirty #929
      Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
      Call Trace:
       dump_stack+0x78/0xa0
       check_noncircular+0x165/0x180
       __lock_acquire+0x1272/0x2310
       ? btrfs_get_alloc_profile+0x150/0x210
       lock_acquire+0x9e/0x360
       ? btrfs_record_root_in_trans+0x43/0x70
       __mutex_lock+0x9f/0x930
       ? btrfs_record_root_in_trans+0x43/0x70
       ? lock_acquire+0x9e/0x360
       ? join_transaction+0x5d/0x450
       ? find_held_lock+0x2d/0x90
       ? btrfs_record_root_in_trans+0x43/0x70
       ? join_transaction+0x3d5/0x450
       ? btrfs_record_root_in_trans+0x43/0x70
       btrfs_record_root_in_trans+0x43/0x70
       start_transaction+0xd1/0x5d0
       btrfs_dirty_inode+0x42/0xd0
       file_update_time+0xc8/0x110
       btrfs_page_mkwrite+0x10c/0x4a0
       ? handle_mm_fault+0x5e/0x1730
       do_page_mkwrite+0x4d/0xc0
       ? __do_fault+0x32/0x150
       handle_mm_fault+0x103c/0x1730
       exc_page_fault+0x340/0x660
       ? asm_exc_page_fault+0x8/0x30
       asm_exc_page_fault+0x1e/0x30
      RIP: 0033:0x7faa6c9969c4
      
      Was seen in testing.  The fix is similar to that of
      
        btrfs: open device without device_list_mutex
      
      where we're holding the device_list_mutex and then grab the bd_mutex,
      which pulls in a bunch of dependencies under the bd_mutex.  We only ever
      call btrfs_close_devices() on mount failure or unmount, so we're safe to
      not take the device_list_mutex here.  We're already holding the
      uuid_mutex which keeps us safe from any external modification of the
      fs_devices.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks · 62cf5391
      Josef Bacik authored
      When closing and freeing the source device we could end up doing our
      final blkdev_put() on the bdev, which will grab the bd_mutex.  As such
      we want to be holding as few locks as possible, so move this call
      outside of the dev_replace->lock_finishing_cancel_unmount lock.  Since
      we're modifying the fs_devices we need to make sure we're holding the
      uuid_mutex here, so take that as well.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove alloc_list splice in btrfs_prepare_sprout · 68abf360
      Nikolay Borisov authored
      btrfs_prepare_sprout is called when the first rw device is added to a
      seed filesystem. This means the filesystem can't have a non-empty
      alloc_list, since seed filesystems are read-only. Simply remove the
      code altogether.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: document some invariants of seed code · 427c8fdd
      Nikolay Borisov authored
      Without a good understanding of how seed devices work it's hard to grok
      some of what the code in open_seed_devices or btrfs_prepare_sprout does.
      
      Add comments, hopefully reducing some of the cognitive load.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch seed device to list api · 944d3f9f
      Nikolay Borisov authored
      While this patch touches a bunch of files the conversion is
      straightforward. Instead of using the implicit linked list anchored at
      btrfs_fs_devices::seed the code is switched to using
      list_for_each_entry.
      
      Previous patches in the series already factored out code that processed
      both main and seed devices so in those cases the factored out functions
      are called on the main fs_devices and then on every seed dev inside
      list_for_each_entry.
      
      Using the list API also allows us to simplify deletion from the seed dev
      list, performed in btrfs_rm_device and btrfs_rm_dev_replace_free_srcdev,
      by substituting a while() loop with a simple list_del_init.
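      
      A hedged sketch of the resulting pattern (the seed_list member follows
      the patch description; init_seed_devs is a hypothetical per-seed call):
      
        struct btrfs_fs_devices *seed_devs;
      
        /* process the main fs_devices first, then every seed fs_devices */
        list_for_each_entry(seed_devs, &fs_devices->seed_list, seed_list)
                init_seed_devs(seed_devs);
      
        /* deletion from the seed list becomes a one-liner */
        list_del_init(&fs_devices->seed_list);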
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify setting/clearing fs_info to btrfs_fs_devices · c4989c2f
      Nikolay Borisov authored
      It makes no sense to have sysfs-related routines be responsible for
      properly initialising the fs_info pointer of struct btrfs_fs_devices.
      Instead this can be streamlined by making it the responsibility of
      btrfs_init_devices_late to initialize it. That function already
      initializes fs_info of every individual device in btrfs_fs_devices.
      
      As far as clearing it is concerned it makes sense to move it to
      close_fs_devices. That function is only called when struct
      btrfs_fs_devices is no longer in use - either for holding seeds or
      main devices for a mounted filesystem.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make close_fs_devices return void · 54eed6ae
      Nikolay Borisov authored
      The return value of this function conveys absolutely no information.
      All callers already check the state of fs_devices->opened to decide how
      to proceed. So convert the function to returning void. While at it make
      btrfs_close_devices also return void.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: factor out loop logic from btrfs_free_extra_devids · 3712ccb7
      Nikolay Borisov authored
      This prepares the code for switching seed devices to a proper list.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: factor out reada loop in __reada_start_machine · dc0ab488
      Nikolay Borisov authored
      This is in preparation for moving fs_devices to proper lists.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove err variable from btrfs_get_extent · 1028d1c4
      Nikolay Borisov authored
      There's no practical reason to use 'err' as a variable to convey errors.
      In fact its value is either set explicitly in the beginning of the
      function or it simply takes the value of 'ret'. Not conforming to the
      usual pattern of having 'ret' be the only variable used to convey errors
      makes the code more prone to bugs. In fact one such bug was introduced
      by 6bf9e4bd ("btrfs: inode: Verify inode mode to avoid NULL pointer
      dereference") by assigning the error value to 'ret' and not 'err'.
      
      Let's fix that issue and make the function less tricky by leaving only
      'ret' to convey error values.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: dio iomap DSYNC workaround · 0eb79294
      Josef Bacik authored
      iomap dio will run generic_write_sync() for us if the iocb is DSYNC.
      This is problematic for us because of 2 reasons:
      
      1. we hold the inode_lock() during this operation, and we take it in
         generic_write_sync()
      2. we hold a read lock on the dio_sem but take the write lock in fsync
      
      Since we don't want to rip out this code right now, but reworking the
      locking is a bit much to do at this point, work around this problem with
      this masterpiece of a patch.
      
      First, we clear DSYNC on the iocb so that the iomap stuff doesn't know
      that it needs to handle the sync.  We save this fact in
      current->journal_info, because we need to do special things once we're
      in iomap_begin, and we have no way to pass private information into
      iomap_dio_rw().
      
      Next we specify a separate iomap_dio_ops for sync, which implements an
      ->end_io() callback that gets called when the dio completes.  This is
      important for AIO, because we really do need to run generic_write_sync()
      if we complete asynchronously.  However if we're still in the submitting
      context when we enter ->end_io() we clear the flag so that the submitter
      knows it's the one that needs to run generic_write_sync().
      
      This is meant to be temporary.  We need to work out how to eliminate the
      inode_lock() and the dio_sem in our fsync and use another mechanism to
      protect these operations.
      Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch to iomap for direct IO · f85781fb
      Goldwyn Rodrigues authored
      We're using a direct IO implementation based on buffer heads. This patch
      switches to the new iomap infrastructure.
      
      Switch from __blockdev_direct_IO() to iomap_dio_rw().  Rename
      btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it as
      iomap_begin() for iomap direct I/O functions. This function allocates
      and locks all the blocks required for the I/O.  btrfs_submit_direct() is
      used as the submit_io() hook for direct I/O ops.
      
      Since we need direct I/O reads to go through iomap_dio_rw(), we change
      file_operations.read_iter() to a btrfs_file_read_iter() which calls
      btrfs_direct_IO() for direct reads and falls back to
      generic_file_buffered_read() for incomplete reads and buffered reads.
      
      We don't need address_space.direct_IO() anymore: set it to noop.
      
      Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
      capable of direct I/O reads from a hole, so we don't need to return
      -ENOENT.
      
      Btrfs direct I/O is now done under i_rwsem, shared in case of reads and
      exclusive in case of writes. This guards against simultaneous truncates.
      
      Use iomap->iomap_end() to check for failed or incomplete direct I/O:
      
        - for writes, call __endio_write_update_ordered()
        - for reads, unlock extents
      
      btrfs_dio_data is now hooked in iomap->private and not
      current->journal_info. It carries the reservation variable and the
      amount of data submitted, so we can calculate the amount of data to call
      __endio_write_update_ordered in case of an error.
      
      This patch removes last use of struct buffer_head from btrfs.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add owner and fs_info to alloc_state io_tree · 154f7cb8
      Qu Wenruo authored
      Commit 1c11b63e ("btrfs: replace pending/pinned chunks lists with io
      tree") introduced the btrfs_device::alloc_state extent io tree, but it
      doesn't initialize the fs_info and owner members.
      
      This means the following features are not properly supported:
      
      - Fs owner report for insert_state() error
        Without fs_info initialized, although btrfs_err() won't panic, it
        won't output which fs is causing the error.
      
      - Wrong owner for trace events
        alloc_state will get the owner as pinned extents.
      
      Fix this by assigning a proper fs_info and owner for
      btrfs_device::alloc_state.
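      
      Per the description above, the fix boils down to passing both at init
      time (sketch; the owner constant is assumed to be the one introduced by
      the patch):
      
        extent_io_tree_init(fs_info, &device->alloc_state,
                            IO_TREE_DEVICE_ALLOC_STATE, NULL);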
      
      Fixes: 1c11b63e ("btrfs: replace pending/pinned chunks lists with io tree")
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make read_block_group_item return void · 4c448ce8
      Marcos Paulo de Souza authored
      Since its inclusion in 9afc6649 ("btrfs: block-group: refactor how
      we read one block group item") this function has always returned 0, so
      there is no need to check the returned value.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs: fix unused-but-set-variable warnings · 24646481
      Leon Romanovsky authored
      The compilation with W=1 generates the following warnings:
       fs/btrfs/sysfs.c:1630:6: warning: variable 'ret' set but not used [-Wunused-but-set-variable]
        1630 |  int ret;
             |      ^~~
       fs/btrfs/sysfs.c:1629:6: warning: variable 'features' set but not used [-Wunused-but-set-variable]
        1629 |  u64 features;
             |      ^~~~~~~~
      
      [ The unused variables are leftover from e410e34f ("Revert "btrfs:
        synchronize incompat feature bits with sysfs files""), which needs
        to be properly fixed by moving feature bit manipulation from the sysfs
        context.  Silence the warning to save people time, we got several
        reports. ]
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make fast fsyncs wait only for writeback · 48778179
      Filipe Manana authored
      Currently regardless of a full or a fast fsync we always wait for ordered
      extents to complete, and then start logging the inode after that. However
      for fast fsyncs we can just wait for the writeback to complete, we don't
      need to wait for the ordered extents to complete since we use the list of
      modified extents maps to figure out which extents we must log and we can
      get their checksums directly from the ordered extents that are still in
      flight, otherwise look them up from the checksums tree.
      
      Until commit b5e6c3e1 ("btrfs: always wait on ordered extents at
      fsync time"), for fast fsyncs, we used to start logging without even
      waiting for the writeback to complete first, we would wait for it to
      complete after logging, while holding a transaction open, which led to
      performance issues when using cgroups and probably for other cases too,
      as waiting for IO while holding a transaction handle should be avoided
      as much as possible. After that, for fast fsyncs, we started to wait for
      ordered extents to complete before starting to log, which adds some
      latency to fsyncs and we even got at least one report about a performance
      drop which bisected to that particular change:
      
      https://lore.kernel.org/linux-btrfs/20181109215148.GF23260@techsingularity.net/
      
      This change makes fast fsyncs only wait for writeback to finish before
      starting to log the inode, instead of waiting for both the writeback to
      finish and for the ordered extents to complete. This brings back part of
      the logic we had that extracts checksums from in flight ordered extents,
      which are not yet in the checksums tree, and making sure transaction
      commits wait for the completion of ordered extents previously logged
      (by far most of the time they have already completed by the time a
      transaction commit starts, resulting in no wait at all), to avoid any
      data loss if an ordered extent completes after the transaction used to
      log an inode is committed, followed by a power failure.
      
      When there are no other tasks accessing the checksums and the subvolume
      btrees, the ordered extent completion is pretty fast, typically taking
      100 to 200 microseconds only in my observations. However when there are
      other tasks accessing these btrees, ordered extent completion can take a
      lot more time due to lock contention on nodes and leaves of these btrees.
      I've seen cases over 2 milliseconds, which starts to be significant. In
      particular when we do have concurrent fsyncs against different files there
      is a lot of contention on the checksums btree, since we have many tasks
      writing the checksums into the btree and other tasks that already started
      the logging phase are doing lookups for checksums in the btree.
      
      This change also turns all ranged fsyncs into full ranged fsyncs, which
      is something we already did when not using the NO_HOLES feature or when
      doing a full fsync. This is to guarantee we never miss checksums due to
      writeback having been triggered only for a part of an extent, and we end
      up logging the full extent but only checksums for the written range, which
      results in missing checksums after log replay. Allowing ranged fsyncs to
      operate again only in the original range, when using the NO_HOLES feature
      and doing a fast fsync is doable but requires some non trivial changes to
      the writeback path, which can always be worked on later if needed, but I
      don't think they are a very common use case.
      
      Several tests were performed using fio for different numbers of concurrent
      jobs, each writing and fsyncing its own file, for both sequential and
      random file writes. The tests were run on bare metal, no virtualization,
      on a box with 12 cores (Intel i7-8700), 64Gb of RAM and a NVMe device,
      with a kernel configuration that is the default of typical distributions
      (debian in this case), without debug options enabled (kasan, kmemleak,
      slub debug, debug of page allocations, lock debugging, etc).
      
      The following script that calls fio was used:
      
        $ cat test-fsync.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/btrfs
        MOUNT_OPTIONS="-o ssd -o space_cache=v2"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 5 ]; then
          echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE [write|randwrite]"
          exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
        BLOCK_SIZE=$4
        WRITE_MODE=$5
      
        if [ "$WRITE_MODE" != "write" ] && [ "$WRITE_MODE" != "randwrite" ]; then
          echo "Invalid WRITE_MODE, must be 'write' or 'randwrite'"
          exit 1
        fi
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=$WRITE_MODE
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=$BLOCK_SIZE
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        EOF
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        echo
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
      
        umount $MNT &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
        fio /tmp/fio-job.ini
        umount $MNT
      
      The results were the following:
      
      *************************
      *** sequential writes ***
      *************************
      
      ==== 1 job, 8GiB file, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=36.6MiB/s (38.4MB/s), 36.6MiB/s-36.6MiB/s (38.4MB/s-38.4MB/s), io=8192MiB (8590MB), run=223689-223689msec
      
      After patch:
      
      WRITE: bw=40.2MiB/s (42.1MB/s), 40.2MiB/s-40.2MiB/s (42.1MB/s-42.1MB/s), io=8192MiB (8590MB), run=203980-203980msec
      (+9.8%, -8.8% runtime)
      
      ==== 2 jobs, 4GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=35.8MiB/s (37.5MB/s), 35.8MiB/s-35.8MiB/s (37.5MB/s-37.5MB/s), io=8192MiB (8590MB), run=228950-228950msec
      
      After patch:
      
      WRITE: bw=43.5MiB/s (45.6MB/s), 43.5MiB/s-43.5MiB/s (45.6MB/s-45.6MB/s), io=8192MiB (8590MB), run=188272-188272msec
      (+21.5% throughput, -17.8% runtime)
      
      ==== 4 jobs, 2GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=50.1MiB/s (52.6MB/s), 50.1MiB/s-50.1MiB/s (52.6MB/s-52.6MB/s), io=8192MiB (8590MB), run=163446-163446msec
      
      After patch:
      
      WRITE: bw=64.5MiB/s (67.6MB/s), 64.5MiB/s-64.5MiB/s (67.6MB/s-67.6MB/s), io=8192MiB (8590MB), run=126987-126987msec
      (+28.7% throughput, -22.3% runtime)
      
      ==== 8 jobs, 1GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=64.0MiB/s (68.1MB/s), 64.0MiB/s-64.0MiB/s (68.1MB/s-68.1MB/s), io=8192MiB (8590MB), run=126075-126075msec
      
      After patch:
      
      WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=8192MiB (8590MB), run=94358-94358msec
      (+35.6% throughput, -25.2% runtime)
      
      ==== 16 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=79.8MiB/s (83.6MB/s), 79.8MiB/s-79.8MiB/s (83.6MB/s-83.6MB/s), io=8192MiB (8590MB), run=102694-102694msec
      
      After patch:
      
      WRITE: bw=107MiB/s (112MB/s), 107MiB/s-107MiB/s (112MB/s-112MB/s), io=8192MiB (8590MB), run=76446-76446msec
      (+34.1% throughput, -25.6% runtime)
      
      ==== 32 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=93.2MiB/s (97.7MB/s), 93.2MiB/s-93.2MiB/s (97.7MB/s-97.7MB/s), io=16.0GiB (17.2GB), run=175836-175836msec
      
      After patch:
      
      WRITE: bw=111MiB/s (117MB/s), 111MiB/s-111MiB/s (117MB/s-117MB/s), io=16.0GiB (17.2GB), run=147001-147001msec
      (+19.1% throughput, -16.4% runtime)
      
      ==== 64 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=108MiB/s (114MB/s), 108MiB/s-108MiB/s (114MB/s-114MB/s), io=32.0GiB (34.4GB), run=302656-302656msec
      
      After patch:
      
      WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=246003-246003msec
      (+23.1% throughput, -18.7% runtime)
      
      ************************
      ***   random writes  ***
      ************************
      
      ==== 1 job, 8GiB file, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=11.5MiB/s (12.0MB/s), 11.5MiB/s-11.5MiB/s (12.0MB/s-12.0MB/s), io=8192MiB (8590MB), run=714281-714281msec
      
      After patch:
      
      WRITE: bw=11.6MiB/s (12.2MB/s), 11.6MiB/s-11.6MiB/s (12.2MB/s-12.2MB/s), io=8192MiB (8590MB), run=705959-705959msec
      (+0.9% throughput, -1.7% runtime)
      
      ==== 2 jobs, 4GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=12.8MiB/s (13.5MB/s), 12.8MiB/s-12.8MiB/s (13.5MB/s-13.5MB/s), io=8192MiB (8590MB), run=638101-638101msec
      
      After patch:
      
      WRITE: bw=13.1MiB/s (13.7MB/s), 13.1MiB/s-13.1MiB/s (13.7MB/s-13.7MB/s), io=8192MiB (8590MB), run=625374-625374msec
      (+2.3% throughput, -2.0% runtime)
      
      ==== 4 jobs, 2GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=15.4MiB/s (16.2MB/s), 15.4MiB/s-15.4MiB/s (16.2MB/s-16.2MB/s), io=8192MiB (8590MB), run=531146-531146msec
      
      After patch:
      
      WRITE: bw=17.8MiB/s (18.7MB/s), 17.8MiB/s-17.8MiB/s (18.7MB/s-18.7MB/s), io=8192MiB (8590MB), run=460431-460431msec
      (+15.6% throughput, -13.3% runtime)
      
      ==== 8 jobs, 1GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=19.9MiB/s (20.8MB/s), 19.9MiB/s-19.9MiB/s (20.8MB/s-20.8MB/s), io=8192MiB (8590MB), run=412664-412664msec
      
      After patch:
      
      WRITE: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=8192MiB (8590MB), run=368589-368589msec
      (+11.6% throughput, -10.7% runtime)
      
      ==== 16 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=8192MiB (8590MB), run=279924-279924msec
      
      After patch:
      
      WRITE: bw=30.4MiB/s (31.9MB/s), 30.4MiB/s-30.4MiB/s (31.9MB/s-31.9MB/s), io=8192MiB (8590MB), run=269258-269258msec
      (+3.8% throughput, -3.8% runtime)
      
      ==== 32 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=36.9MiB/s (38.7MB/s), 36.9MiB/s-36.9MiB/s (38.7MB/s-38.7MB/s), io=16.0GiB (17.2GB), run=443581-443581msec
      
      After patch:
      
      WRITE: bw=41.6MiB/s (43.6MB/s), 41.6MiB/s-41.6MiB/s (43.6MB/s-43.6MB/s), io=16.0GiB (17.2GB), run=394114-394114msec
      (+12.7% throughput, -11.2% runtime)
      
      ==== 64 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=45.9MiB/s (48.1MB/s), 45.9MiB/s-45.9MiB/s (48.1MB/s-48.1MB/s), io=32.0GiB (34.4GB), run=714614-714614msec
      
      After patch:
      
      WRITE: bw=48.8MiB/s (51.1MB/s), 48.8MiB/s-48.8MiB/s (51.1MB/s-51.1MB/s), io=32.0GiB (34.4GB), run=672087-672087msec
      (+6.3% throughput, -6.0% runtime)
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      48778179
    • Filipe Manana's avatar
      btrfs: do not commit logs and transactions during link and rename operations · 75b463d2
      Filipe Manana authored
Since commit d4682ba0 ("Btrfs: sync log after logging new name") we
started to commit logs, and fall back to transaction commits when we failed
to log the new names or commit the logs, after link and rename operations
when the target inodes (or their parents) were previously logged in the
current transaction. This was done to avoid losing directories, despite an
explicit fsync on them, when they are ancestors of some inode that got a
new name logged due to a link or rename operation. However, that adds the
cost of starting IO and waiting for it to complete, which can cause higher
latencies for applications.
      
Instead of doing that, just make sure that when we log a new name for an
inode we don't mark any of its ancestors as logged. Then, if anyone does
an fsync against any of those ancestors, without making any other change
to them, the fsync commits the log. This way we only pay the cost of a log
commit (or a transaction commit if something goes wrong or a new block
group was created) when the application explicitly asks to fsync one of
the parent directories.
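
Conceptually, the fsync fast path for a directory then behaves like the
minimal C sketch below (all names here are hypothetical illustrations of
the mechanism described above, not the actual kernel helpers):

  typedef unsigned long long u64;

  struct dir_inode {
          u64 logged_trans;     /* transaction id it was last logged in */
  };

  static int commit_log_for(struct dir_inode *dir);     /* hypothetical */

  /* Sketch: can an fsync of a directory skip committing the log? */
  static int fsync_dir_sketch(struct dir_inode *dir, u64 current_transid)
  {
          /*
           * Logging a new name for a child no longer marks @dir as
           * logged, so after a link/rename this check fails and the
           * fsync itself pays for the log commit, once and only when
           * the application explicitly asks for it.
           */
          if (dir->logged_trans == current_transid)
                  return 0;     /* already in the log, nothing to do */

          return commit_log_for(dir);
  }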
      
      Using dbench, which mixes several filesystems operations including renames,
      revealed some significant latency gains. The following script that uses
      dbench was used to test this:
      
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/btrfs
        MOUNT_OPTIONS="-o ssd -o space_cache=v2"
        MKFS_OPTIONS="-m single -d single"
        THREADS=16
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -t 300 -D $MNT $THREADS
      
        umount $MNT
      
The test was run on bare metal, without virtualization, on a box with 12
cores (Intel i7-8700), 64GB of RAM and an NVMe device, with a kernel
configuration that is the default of typical distributions (Debian in this
case), without debug options enabled (kasan, kmemleak, slub debug, debug
of page allocations, lock debugging, etc.).
      
      Results before this patch:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    10750455     0.011   155.088
       Close         7896674     0.001     0.243
       Rename         455222     2.158  1101.947
       Unlink        2171189     0.067   121.638
       Deltree           256     2.425     7.816
       Mkdir             128     0.002     0.003
       Qpathinfo     9744323     0.006    21.370
       Qfileinfo     1707092     0.001     0.146
       Qfsinfo       1786756     0.001    11.228
       Sfileinfo      875612     0.003    21.263
       Find          3767281     0.025     9.617
       WriteX        5356924     0.011   211.390
       ReadX        16852694     0.003     9.442
       LockX           35008     0.002     0.119
       UnlockX         35008     0.001     0.138
       Flush          753458     4.252  1102.249
      
      Throughput 1128.35 MB/sec  16 clients  16 procs  max_latency=1102.255 ms
      
      Results after this patch:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    11471098     0.012   448.281
       Close         8426396     0.001     0.925
       Rename         485746     0.123   267.183
       Unlink        2316477     0.080    63.433
       Deltree           288     2.830    11.144
       Mkdir             144     0.003     0.010
       Qpathinfo    10397420     0.006    10.288
       Qfileinfo     1822039     0.001     0.169
       Qfsinfo       1906497     0.002    14.039
       Sfileinfo      934433     0.004     2.438
       Find          4019879     0.026    10.200
       WriteX        5718932     0.011   200.985
       ReadX        17981671     0.003    10.036
       LockX           37352     0.002     0.076
       UnlockX         37352     0.001     0.109
       Flush          804018     5.015   778.033
      
      Throughput 1201.98 MB/sec  16 clients  16 procs  max_latency=778.036 ms
      (+6.5% throughput, -29.4% max latency, -75.8% rename latency)
      
      Test case generic/498 from fstests tests the scenario that the previously
      mentioned commit fixed.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      75b463d2
    • Filipe Manana's avatar
      btrfs: do not take the log_mutex of the subvolume when pinning the log · 5522a27e
      Filipe Manana authored
During a rename we pin the log to make sure no one commits a log that
reflects an ongoing rename operation, as that could result in a committed
log which recorded the unlink of the old name without having recorded the
new name. However, we are taking the subvolume's log_mutex before
incrementing the log_writers counter, which is not necessary: the counter
is atomic, and we only remove the old name from the log and add the new
name after log_writers has been incremented. That alone ensures no one can
commit the log between the removal of the old name and the addition of the
new one.
      
By taking the log_mutex we are just adding unnecessary contention on the
lock, which can become visible for workloads that mix renames with fsyncs,
writes to files opened with O_SYNC, and unlink operations (when the inode
or its parent were fsynced earlier in the current transaction).
      
      So just remove the lock and unlock of the subvolume's log_mutex at
      btrfs_pin_log_trans().
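
Based on the description above, the change amounts to roughly the
following (a sketch of the before and after shapes, not a verbatim quote
of the kernel source):

  /* Before: a mutex wrapped around a plain atomic increment. */
  void btrfs_pin_log_trans(struct btrfs_root *root)
  {
          mutex_lock(&root->log_mutex);
          atomic_inc(&root->log_writers);
          mutex_unlock(&root->log_mutex);
  }

  /* After: the increment is atomic on its own, and the log commit
   * paths that observe log_writers > 0 already wait for it to drop,
   * so the mutex added only contention. */
  void btrfs_pin_log_trans(struct btrfs_root *root)
  {
          atomic_inc(&root->log_writers);
  }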
      
      Using dbench, which mixes different types of operations that end up taking
      that mutex (fsyncs, renames, unlinks and writes into files opened with
      O_SYNC) revealed some small gains. The following script that calls dbench
      was used:
      
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/btrfs
        MOUNT_OPTIONS="-o ssd -o space_cache=v2"
        MKFS_OPTIONS="-m single -d single"
        THREADS=32
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -s -t 600 -D $MNT $THREADS
      
        umount $MNT
      
The test was run on bare metal, without virtualization, on a box with 12
cores (Intel i7-8700), 64GB of RAM and an NVMe device, with a kernel
configuration that is the default of typical distributions (Debian in this
case), without debug options enabled (kasan, kmemleak, slub debug, debug
of page allocations, lock debugging, etc.).
      
      Results before this patch:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4410848     0.017   738.640
       Close        3240222     0.001     0.834
       Rename        186850     7.478  1272.476
       Unlink        890875     0.128   785.018
       Deltree          128     2.846    12.081
       Mkdir             64     0.002     0.003
       Qpathinfo    3997659     0.009    11.171
       Qfileinfo     701307     0.001     0.478
       Qfsinfo       733494     0.002     1.103
       Sfileinfo     359362     0.004     3.266
       Find         1546226     0.041     4.128
       WriteX       2202803     7.905  1376.989
       ReadX        6917775     0.003     3.887
       LockX          14392     0.002     0.043
       UnlockX        14392     0.001     0.085
       Flush         309225     0.128  1033.936
      
      Throughput 231.555 MB/sec (sync open)  32 clients  32 procs  max_latency=1376.993 ms
      
      Results after this patch:
      
 Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4603244     0.017   232.776
       Close        3381299     0.001     1.041
       Rename        194871     7.251  1073.165
       Unlink        929730     0.133   119.233
       Deltree          128     2.871    10.199
       Mkdir             64     0.002     0.004
       Qpathinfo    4171343     0.009    11.317
       Qfileinfo     731227     0.001     1.635
       Qfsinfo       765079     0.002     3.568
       Sfileinfo     374881     0.004     1.220
       Find         1612964     0.041     4.675
       WriteX       2296720     7.569  1178.204
       ReadX        7213633     0.003     3.075
       LockX          14976     0.002     0.076
       UnlockX        14976     0.001     0.061
       Flush         322635     0.102   579.505
      
      Throughput 241.4 MB/sec (sync open)  32 clients  32 procs  max_latency=1178.207 ms
      (+4.3% throughput, -14.4% max latency)
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5522a27e
    • David Sterba's avatar
      btrfs: send: remove indirect callback parameter for changed_cb · 1b51d6fc
      David Sterba authored
There's a custom callback passed to btrfs_compare_trees that happens to
be named exactly the same as the existing function implementing it. This
is confusing and the indirection is not necessary for our needs. The
compiler is clever enough to call it directly, so there's effectively no
change in behavior.
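
The pattern being removed looks like this reduced, self-contained
illustration (simplified types; the real signatures in send.c take the
compared paths, keys and a result code):

  typedef int (*changed_cb_t)(void *ctx);

  static int changed_cb(void *ctx)
  {
          /* ... process one changed item ... */
          return 0;
  }

  /* Before: the caller passed &changed_cb as @cb, invoked indirectly. */
  static int compare_trees_before(void *ctx, changed_cb_t cb)
  {
          return cb(ctx);
  }

  /* After: no indirection; the compiler could already devirtualize the
   * call above, so behavior is unchanged. */
  static int compare_trees_after(void *ctx)
  {
          return changed_cb(ctx);
  }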
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1b51d6fc
    • David Sterba's avatar
btrfs: scrub: rename ratelimit state variable to avoid shadowing · 8bb1cf1b
      David Sterba authored
The name _rs is already defined by the btrfs_printk_ratelimited macro in
ctree.h; local variables should not use a leading underscore, to avoid
such name clashes with macro-local variables.
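
The hazard is the usual macro-local shadowing problem, reduced here to a
self-contained example (my_printk_ratelimited is a stand-in, not the real
macro):

  #include <stdio.h>

  /* The macro declares its own local named _rs. */
  #define my_printk_ratelimited(fmt, ...)                 \
  do {                                                    \
          static int _rs;                                 \
          if (_rs++ == 0)                                 \
                  printf(fmt, ##__VA_ARGS__);             \
  } while (0)

  void report(int value)
  {
          int _rs = value;  /* clashes with the macro's _rs ... */

          /* ... because this argument expands inside the do-block,
           * where it names the macro's static _rs, not ours. */
          my_printk_ratelimited("rs=%d\n", _rs);
  }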
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8bb1cf1b
    • David Sterba's avatar
      btrfs: remove unnecessarily shadowed variables · 0af447d0
      David Sterba authored
In btrfs_orphan_cleanup, there's another instance of fs_info, but it's
the same as the one we already have in scope.

In btrfs_backref_finish_upper_links, rb_node is of the same type and is
used as a temporary cursor into the tree.
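
The orphan cleanup case amounted to this kind of redundant redeclaration
(a simplified sketch with hypothetical helpers, not the actual function):

  static int orphan_cleanup_sketch(struct btrfs_root *root)
  {
          struct btrfs_fs_info *fs_info = root->fs_info;

          while (have_orphans(root)) {            /* hypothetical */
                  /* Removed: an inner copy identical to the outer one,
                   * shadowing it for no benefit:
                   * struct btrfs_fs_info *fs_info = root->fs_info;
                   */
                  process_orphan(fs_info);        /* hypothetical */
          }
          return 0;
  }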
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0af447d0
    • David Sterba's avatar
      btrfs: compression: move declarations to header · cb4c9198
      David Sterba authored
The declarations of the compression algorithm callbacks are defined in
the .c file, as they're used from there. The compiler warns that there
are no declarations for these public functions when compiling
lzo.c/zlib.c/zstd.c. Fix that by moving the declarations to the header,
as it's the common place for all of them.
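
The fix is the standard C pattern: the prototype of a function defined in
one .c file lives in the shared header, so the definition and all callers
see the same declaration and missing-prototype warnings go away. The names
below are illustrative, not the exact prototypes that were moved:

  /* compression.h */
  int zlib_compress_sketch(const unsigned char *in, unsigned long in_len,
                           unsigned char *out, unsigned long *out_len);

  /* zlib.c */
  #include "compression.h"

  int zlib_compress_sketch(const unsigned char *in, unsigned long in_len,
                           unsigned char *out, unsigned long *out_len)
  {
          /* ... the compiler now sees a prior declaration ... */
          return 0;
  }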
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cb4c9198
    • David Sterba's avatar
      btrfs: remove const from btrfs_feature_set_name · 9e6df7ce
      David Sterba authored
The function btrfs_feature_set_name returns a const char pointer; the
second const is not necessary and is reported as a warning:
      
       In file included from fs/btrfs/space-info.c:6:
       fs/btrfs/sysfs.h:16:1: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
          16 | const char * const btrfs_feature_set_name(enum btrfs_feature_set set);
             | ^~~~~
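
Dropping the second const, which qualifies the returned pointer value
itself and is ignored by the compiler, silences the warning:

  /* Before: the outer const on the return value has no effect. */
  const char * const btrfs_feature_set_name(enum btrfs_feature_set set);

  /* After: */
  const char *btrfs_feature_set_name(enum btrfs_feature_set set);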
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9e6df7ce