1. 07 Oct, 2020 40 commits
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
      btrfs: convert btrfs_inode_sectorsize to take btrfs_inode · 6fee248d
      Nikolay Borisov authored
      It's counterintuitive to have a function named btrfs_inode_xxx which
      takes a generic inode. Also move the function to btrfs_inode.h so that
      it has access to the definition of struct btrfs_inode.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6fee248d
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Josef Bacik's avatar
      btrfs: use BTRFS_NESTED_NEW_ROOT for double splits · ca9d473a
      Josef Bacik authored
      I've made this change separate since it requires both of the newly added
      NESTED flags and I didn't want to slip it into one of those changes.
      
      If we do a double split of a node we can end up doing a
      BTRFS_NESTED_SPLIT on level 0, which throws lockdep off because it
      appears as a double lock.  Since we're maxed out on subclasses, use
      BTRFS_NESTED_NEW_ROOT if we had to do a double split.  This is OK
      because we won't have to do a double split if we had to insert a new
      root, and the new root would be at a higher level anyway.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ca9d473a
    • Josef Bacik's avatar
      btrfs: introduce BTRFS_NESTING_NEW_ROOT for adding new roots · cf6f34aa
      Josef Bacik authored
      The way we add new roots is confusing from a locking perspective for
      lockdep.  We generally have the rule that we lock things in order from
      highest level to lowest, but in the case of adding a new level to the
      tree we actually allocate a new block for the root, which makes the
      locking go in reverse.  A similar issue exists for snapshotting, we cow
      the original root for the root of a new tree, however they're at the
      same level.  Address this by using BTRFS_NESTING_NEW_ROOT for these
      operations.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cf6f34aa
    • Josef Bacik's avatar
      btrfs: introduce BTRFS_NESTING_SPLIT for split blocks · 4dff97e6
      Josef Bacik authored
      If we are splitting a leaf/node, we could do something like the
      following
      
      lock(leaf)  BTRFS_NESTING_NORMAL
        lock(left) BTRFS_NESTING_LEFT + BTRFS_NESTING_COW
          push from leaf -> left
            reset path to point to left
              split left
                allocate new block, lock block BTRFS_NESTING_SPLIT
      
      at the new block point we need to have a different nesting level,
      because we have already used either BTRFS_NESTING_LEFT or
      BTRFS_NESTING_RIGHT when pushing items from the original leaf into the
      adjacent leaves.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4dff97e6
    • Josef Bacik's avatar
      btrfs: introduce BTRFS_NESTING_LEFT/RIGHT_COW · bf59a5a2
      Josef Bacik authored
      For similar reasons as BTRFS_NESTING_COW, we need
      BTRFS_NESTING_LEFT/RIGHT_COW.  The pattern is this
      
      lock leaf -> BTRFS_NESTING_NORMAL
        cow leaf -> BTRFS_NESTING_COW
          split leaf
            lock left -> BTRFS_NESTING_LEFT
              cow left -> BTRFS_NESTING_LEFT_COW
      
      We need this in order to indicate to lockdep that these locks are
      discrete and are being taken in a safe order.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bf59a5a2
    • Josef Bacik's avatar
      btrfs: introduce BTRFS_NESTING_LEFT/BTRFS_NESTING_RIGHT · bf77467a
      Josef Bacik authored
      Our lockdep maps are based on rootid+level, however in some cases we
      will lock adjacent blocks on the same level, namely in searching forward
      or in split/balance.  Because of this lockdep will complain, so we need
      a separate subclass to indicate to lockdep that these are different
      locks.
      
      lock leaf -> BTRFS_NESTING_NORMAL
        cow leaf -> BTRFS_NESTING_COW
          split leaf
             lock left -> BTRFS_NESTING_LEFT
             lock right -> BTRFS_NESTING_RIGHT
      
      The above graph illustrates the need for this new nesting subclass.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bf77467a
    • Josef Bacik's avatar
      btrfs: introduce BTRFS_NESTING_COW for cow'ing blocks · 9631e4cc
      Josef Bacik authored
      When we COW a block we are holding a lock on the original block, and
      then we lock the new COW block.  Because our lockdep maps are based on
      root + level, this will make lockdep complain.  We need a way to
      indicate a subclass for locking the COW'ed block, so plumb through our
      btrfs_lock_nesting from btrfs_cow_block down to the btrfs_init_buffer,
      and then introduce BTRFS_NESTING_COW to be used for cow'ing blocks.
      
      The reason I've added all this extra infrastructure is because there
      will be need of different nesting classes in follow up patches.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9631e4cc
    • Josef Bacik's avatar
      btrfs: add nesting tags to the locking helpers · fd7ba1c1
      Josef Bacik authored
      We will need these when we switch to an rwsem, so plumb in the
      infrastructure here to use later on.  I violate the 80 character limit
      some here because it'll be cleaned up later.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fd7ba1c1
    • Josef Bacik's avatar
      btrfs: introduce btrfs_path::recurse · 51899412
      Josef Bacik authored
      Our current tree locking stuff allows us to recurse with read locks if
      we're already holding the write lock.  This is necessary for the space
      cache inode, as we could be holding a lock on the root_tree root when we
      need to cache a block group, and thus need to be able to read down the
      root_tree to read in the inode cache.
      
      We can get away with this in our current locking, but we won't be able
      to with a rwsem.  Handle this by purposefully annotating the places
      where we require recursion, so that in the future we can maybe come up
      with a way to avoid the recursion.  In the case of the free space inode,
      this will be superseded by the free space tree.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      51899412
    • Josef Bacik's avatar
      btrfs: rename extent_buffer::lock_nested to extent_buffer::lock_recursed · 329ced79
      Josef Bacik authored
      Nested locking with lockdep and everything else refers to lock hierarchy
      within the same lock map.  This is how we indicate the same locks for
      different objects are ok to take in a specific order, for our use case
      that would be to take the lock on a leaf and then take a lock on an
      adjacent leaf.
      
      What ->lock_nested _actually_ refers to is if we happen to already be
      holding the write lock on the extent buffer and we're allowing a read
      lock to be taken on that extent buffer, which is recursion.  Rename this
      so we don't get confused when we switch to a rwsem and have to start
      using the _nested helpers.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      329ced79
    • Nikolay Borisov's avatar
      btrfs: don't opencode sync_blockdev in btrfs_init_new_device · b9ba017f
      Nikolay Borisov authored
      Instead of opencoding filemap_write_and_wait simply call syncblockdev as
      it makes it abundantly clear what's going on and why this is used. No
      semantics changes.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b9ba017f
    • Nikolay Borisov's avatar
      btrfs: remove redundant code from btrfs_free_stale_devices · 4ae312e9
      Nikolay Borisov authored
      Following the refactor of btrfs_free_stale_devices in
      7bcb8164 ("btrfs: use device_list_mutex when removing stale devices")
      fs_devices are freed after they have been iterated by the inner
      list_for_each so the use-after-free fixed by introducing the break in
      fd649f10 ("btrfs: Fix use-after-free when cleaning up fs_devs with
      a single stale device") is no longer necessary. Just remove it
      altogether. No functional changes.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4ae312e9
    • Nikolay Borisov's avatar
      btrfs: refactor locked condition in btrfs_init_new_device · 44cab9ba
      Nikolay Borisov authored
      Invert unlocked to locked and exploit the fact it can only ever be
      modified if we are adding a new device to a seed filesystem. This allows
      to simplify the check in error: label. No semantics changes.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      44cab9ba
    • Nikolay Borisov's avatar
      btrfs: use RCU for quick device check in btrfs_init_new_device · f4cfa9bd
      Nikolay Borisov authored
      When adding a new device there's a mandatory check to see if a device is
      being duplicated to the filesystem it's added to. Since this is a
      read-only operations not necessary to take device_list_mutex and can simply
      make do with an rcu-readlock.
      
      Using just RCU is safe because there won't be another device add delete
      running in parallel as btrfs_init_new_device is called only from
      btrfs_ioctl_add_dev.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f4cfa9bd
    • Qu Wenruo's avatar
      btrfs: ctree: check key order before merging tree blocks · d16c702f
      Qu Wenruo authored
      [BUG]
      With a crafted image, btrfs can panic at btrfs_del_csums():
      
        kernel BUG at fs/btrfs/ctree.c:3188!
        invalid opcode: 0000 [#1] SMP PTI
        CPU: 0 PID: 1156 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9
        RIP: 0010:btrfs_set_item_key_safe+0x16c/0x180
        RSP: 0018:ffff976141257ab8 EFLAGS: 00010202
        RAX: 0000000000000001 RBX: ffff898a6b890930 RCX: 0000000004b70000
        RDX: 0000000000000000 RSI: ffff976141257bae RDI: ffff976141257acf
        RBP: ffff976141257b10 R08: 0000000000001000 R09: ffff9761412579a8
        R10: 0000000000000000 R11: 0000000000000000 R12: ffff976141257abe
        R13: 0000000000000003 R14: ffff898a6a8be578 R15: ffff976141257bae
        FS: 0000000000000000(0000) GS:ffff898a77a00000(0000) knlGS:0000000000000000
        CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f779d9cd624 CR3: 000000022b2b4006 CR4: 00000000000206f0
        Call Trace:
        truncate_one_csum+0xac/0xf0
        btrfs_del_csums+0x24f/0x3a0
        __btrfs_free_extent.isra.72+0x5a7/0xbe0
        __btrfs_run_delayed_refs+0x539/0x1120
        btrfs_run_delayed_refs+0xdb/0x1b0
        btrfs_commit_transaction+0x52/0x950
        ? start_transaction+0x94/0x450
        transaction_kthread+0x163/0x190
        kthread+0x105/0x140
        ? btrfs_cleanup_transaction+0x560/0x560
        ? kthread_destroy_worker+0x50/0x50
        ret_from_fork+0x35/0x40
        Modules linked in:
        ---[ end trace 93bf9db00e6c374e ]---
      
      [CAUSE]
      This crafted image has a tricky key order corruption:
      
        checksum tree key (CSUM_TREE ROOT_ITEM 0)
        node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE
                ...
                key (EXTENT_CSUM EXTENT_CSUM 73785344) block 29757440 gen 19
                key (EXTENT_CSUM EXTENT_CSUM 77594624) block 29753344 gen 19
                ...
      
        leaf 29757440 items 5 free space 150 generation 19 owner CSUM_TREE
                item 0 key (EXTENT_CSUM EXTENT_CSUM 73785344) itemoff 2323 itemsize 1672
                        range start 73785344 end 75497472 length 1712128
                item 1 key (EXTENT_CSUM EXTENT_CSUM 75497472) itemoff 2319 itemsize 4
                        range start 75497472 end 75501568 length 4096
                item 2 key (EXTENT_CSUM EXTENT_CSUM 75501568) itemoff 579 itemsize 1740
                        range start 75501568 end 77283328 length 1781760
                item 3 key (EXTENT_CSUM EXTENT_CSUM 77283328) itemoff 575 itemsize 4
                        range start 77283328 end 77287424 length 4096
                item 4 key (EXTENT_CSUM EXTENT_CSUM 4120596480) itemoff 275 itemsize 300 <<<
                        range start 4120596480 end 4120903680 length 307200
        leaf 29753344 items 3 free space 1936 generation 19 owner CSUM_TREE
                item 0 key (18446744073457893366 EXTENT_CSUM 77594624) itemoff 2323 itemsize 1672
                        range start 77594624 end 79306752 length 1712128
                ...
      
      Note the item 4 key of leaf 29757440, which is obviously too large, and
      even larger than the first key of the next leaf.
      
      However it still follows the key order in that tree block, thus tree
      checker is unable to detect it at read time, since tree checker can only
      work inside one leaf, thus such complex corruption can't be detected in
      advance.
      
      [FIX]
      The next time to detect such problem is at tree block merge time,
      which is in push_node_left(), balance_node_right(), push_leaf_left() or
      push_leaf_right().
      
      Now we check if the key order of the right-most key of the left node is
      larger than the left-most key of the right node.
      
      By this we don't need to call the full tree-checker, while still keeping
      the key order correct as key order in each node is already checked by
      tree checker thus we only need to check the above two slots.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202833Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d16c702f
    • Qu Wenruo's avatar
      btrfs: extent-tree: kill the BUG_ON() in insert_inline_extent_backref() · 07cce5cf
      Qu Wenruo authored
      [BUG]
      With a crafted image, btrfs can panic at insert_inline_extent_backref():
      
        kernel BUG at fs/btrfs/extent-tree.c:1857!
        invalid opcode: 0000 [#1] SMP PTI
        CPU: 0 PID: 1117 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9
        RIP: 0010:insert_inline_extent_backref+0xcc/0xe0
        RSP: 0018:ffffac4dc1287be8 EFLAGS: 00010293
        RAX: 0000000000000000 RBX: 0000000000000007 RCX: 0000000000000001
        RDX: 0000000000001000 RSI: 0000000000000000 RDI: 0000000000000000
        RBP: ffffac4dc1287c28 R08: ffffac4dc1287ab8 R09: ffffac4dc1287ac0
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
        R13: ffff8febef88a540 R14: ffff8febeaa7bc30 R15: 0000000000000000
        FS: 0000000000000000(0000) GS:ffff8febf7a00000(0000) knlGS:0000000000000000
        CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f663ace94c0 CR3: 0000000235698006 CR4: 00000000000206f0
        Call Trace:
        ? _cond_resched+0x1a/0x50
        __btrfs_inc_extent_ref.isra.64+0x7e/0x240
        ? btrfs_merge_delayed_refs+0xa5/0x330
        __btrfs_run_delayed_refs+0x653/0x1120
        btrfs_run_delayed_refs+0xdb/0x1b0
        btrfs_commit_transaction+0x52/0x950
        ? start_transaction+0x94/0x450
        transaction_kthread+0x163/0x190
        kthread+0x105/0x140
        ? btrfs_cleanup_transaction+0x560/0x560
        ? kthread_destroy_worker+0x50/0x50
        ret_from_fork+0x35/0x40
        Modules linked in:
        ---[ end trace 2ad8b3de903cf825 ]---
      
      [CAUSE]
      Due to extent tree corruption (still valid by itself, but bad cross
      ref), we can allocate an extent which is still in extent tree.  The
      offending tree block of that case is from csum tree.  The newly
      allocated tree block is also for csum tree.
      
      Then we will try to insert a tree block ref for the existing tree block
      ref.
      
      For a tree extent item, tree block can never be shared directly by the
      same tree twice.  We have such BUG_ON() to prevent such problem, but
      this is not a proper error handling.
      
      [FIX]
      Replace that BUG_ON() with proper error message and leaf dump for debug
      build.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202829Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      07cce5cf
    • Qu Wenruo's avatar
      btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent() · 1c2a07f5
      Qu Wenruo authored
      __btrfs_free_extent() is doing two things:
      
      1. Reduce the refs number of an extent backref
         Either it's an inline extent backref (inside EXTENT/METADATA item) or
         a keyed extent backref (SHARED_* item).
         We only need to locate that backref line, either reduce the number or
         remove the backref line completely.
      
      2. Update the refs count in EXTENT/METADATA_ITEM
      
      During step 1), we will try to locate the EXTENT/METADATA_ITEM without
      triggering another btrfs_search_slot() as fast path.
      
      Only when we fail to locate that item, we will trigger another
      btrfs_search_slot() to get that EXTENT/METADATA_ITEM after we
      updated/deleted the backref line.
      
      And we have a lot of strict checks on things like refs_to_drop against
      extent refs and special case checks for single ref extents.
      
      There are 7 BUG_ON()s, although they're doing correct checks, they can
      be triggered by crafted images.
      
      This patch improves the function:
      
      - Introduce two examples to show what __btrfs_free_extent() is doing
        One inline backref case and one keyed case.  Should cover most cases.
      
      - Kill all BUG_ON()s with proper error message and optional leaf dump
      
      - Add comment to show the overall flow
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202819
      [ The report triggers one BUG_ON() in __btrfs_free_extent() ]
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1c2a07f5
    • Qu Wenruo's avatar
      btrfs: extent_io: do extra check for extent buffer read write functions · f98b6215
      Qu Wenruo authored
      Although we have start, len check for extent buffer reader/write (e.g.
      read_extent_buffer()), these checks have limitations:
      
      - No overflow check
        Values like start = 1024 len = -1024 can still pass the basic
         (start + len) > eb->len check.
      
      - Checks are not consistent
        For read_extent_buffer() we only check (start + len) against eb->len.
        While for memcmp_extent_buffer() we also check start against eb->len.
      
      - Different error reporting mechanism
        We use WARN() in read_extent_buffer() but BUG() in
        memcpy_extent_buffer().
      
      - Still modify memory if the request is obviously wrong
        In read_extent_buffer() even we find (start + len) > eb->len, we still
        call memset(dst, 0, len), which can easily cause memory access error
        if start + len overflows.
      
      To address above problems, this patch creates a new common function to
      check such access, check_eb_range().
      
      - Add overflow check
        This function checks start, start + len against eb->len and overflow
        check.
      
      - Unified checks
      
      - Unified error reports
        Will call WARN() if CONFIG_BTRFS_DEBUG is configured.
        And also do btrfs_warn() message for non-debug build.
      
      - Exit ASAP if check fails
        No more possible memory corruption.
      
      - Add extra comment for @start @len used in those functions as it's
        sometimes confused with the logical addressing instead of a range
        inside the eb space
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202817
      [ Inspired by above report, the report itself is already addressed ]
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ use check_add_overflow ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f98b6215
    • Nikolay Borisov's avatar
      btrfs: rework error detection in init_tree_roots · 217f5004
      Nikolay Borisov authored
      To avoid duplicating 3 lines of code the error detection logic in
      init_tree_roots is somewhat quirky. It first checks for the presence of
      any error condition, then checks for the specific condition to perform
      any specific actions. That's spurious because directly checking for
      each respective error condition and doing the necessary steps is more
      obvious. While at it change the -EUCLEAN to -EIO in case the extent
      buffer is not read correctly, this is in line with other sites which
      return -EIO when the eb couldn't be read.
      
      Additionally it results in smaller code and the code reads
      more linearly:
      
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-95 (-95)
      Function                                     old     new   delta
      open_ctree                                 17243   17148     -95
      Total: Before=113104, After=113009, chg -0.08%
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      217f5004
    • Qu Wenruo's avatar
      btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations · e85fde51
      Qu Wenruo authored
      [BUG]
      When quota is enabled for TEST_DEV, generic/013 sometimes fails like this:
      
        generic/013 14s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//generic/013.dmesg)
      
      And with the following metadata leak:
      
        BTRFS warning (device dm-3): qgroup 0/1370 has unreleased space, type 2 rsv 49152
        ------------[ cut here ]------------
        WARNING: CPU: 2 PID: 47912 at fs/btrfs/disk-io.c:4078 close_ctree+0x1dc/0x323 [btrfs]
        Call Trace:
         btrfs_put_super+0x15/0x17 [btrfs]
         generic_shutdown_super+0x72/0x110
         kill_anon_super+0x18/0x30
         btrfs_kill_super+0x17/0x30 [btrfs]
         deactivate_locked_super+0x3b/0xa0
         deactivate_super+0x40/0x50
         cleanup_mnt+0x135/0x190
         __cleanup_mnt+0x12/0x20
         task_work_run+0x64/0xb0
         __prepare_exit_to_usermode+0x1bc/0x1c0
         __syscall_return_slowpath+0x47/0x230
         do_syscall_64+0x64/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ---[ end trace a6cfd45ba80e4e06 ]---
        BTRFS error (device dm-3): qgroup reserved space leaked
        BTRFS info (device dm-3): disk space caching is enabled
        BTRFS info (device dm-3): has skinny extents
      
      [CAUSE]
      The qgroup preallocated meta rsv operations of that offending root are:
      
        btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
        btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
        btrfs_subvolume_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=49152
        btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
        btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
      
      It's pretty obvious that, we reserve qgroup meta rsv in
      btrfs_subvolume_reserve_metadata(), but doesn't have corresponding
      release/convert calls in btrfs_subvolume_release_metadata().
      
      This leads to the leakage.
      
      [FIX]
      To fix this bug, we should follow what we're doing in
      btrfs_delalloc_reserve_metadata(), where we reserve qgroup space, and
      add it to block_rsv->qgroup_rsv_reserved.
      
      And free the qgroup reserved metadata space when releasing the
      block_rsv.
      
      To do this, we need to change the btrfs_subvolume_release_metadata() to
      accept btrfs_root, and record the qgroup_to_release number, and call
      btrfs_qgroup_convert_reserved_meta() for it.
      
      Fixes: 733e03a0 ("btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e85fde51
    • Qu Wenruo's avatar
      btrfs: qgroup: fix wrong qgroup metadata reserve for delayed inode · b4c5d8fd
      Qu Wenruo authored
      For delayed inode facility, qgroup metadata is reserved for it, and
      later freed.
      
      However we're freeing more bytes than we reserved.
      In btrfs_delayed_inode_reserve_metadata():
      
      	num_bytes = btrfs_calc_metadata_size(fs_info, 1);
      	...
      		ret = btrfs_qgroup_reserve_meta_prealloc(root,
      				fs_info->nodesize, true);
      		...
      		if (!ret) {
      			node->bytes_reserved = num_bytes;
      
      But in btrfs_delayed_inode_release_metadata():
      
      	if (qgroup_free)
      		btrfs_qgroup_free_meta_prealloc(node->root,
      				node->bytes_reserved);
      	else
      		btrfs_qgroup_convert_reserved_meta(node->root,
      				node->bytes_reserved);
      
      This means, we're always releasing more qgroup metadata rsv than we have
      reserved.
      
      This won't trigger selftest warning, as btrfs qgroup metadata rsv has
      extra protection against cases like quota enabled half-way.
      
      But we still need to fix this problem any way.
      
      This patch will use the same num_bytes for qgroup metadata rsv so we
      could handle it correctly.
      
      Fixes: f218ea6c ("btrfs: delayed-inode: Remove wrong qgroup meta reservation calls")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b4c5d8fd
    • Josef Bacik's avatar
      btrfs: do not hold device_list_mutex when closing devices · 425c6ed6
      Josef Bacik authored
      The following lockdep splat
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.8.0-rc7-00169-g87212851a027-dirty #929 Not tainted
      ------------------------------------------------------
      fsstress/8739 is trying to acquire lock:
      ffff88bfd0eb0c90 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x43/0x70
      
      but task is already holding lock:
      ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #10 (sb_pagefaults){.+.+}-{0:0}:
             __sb_start_write+0x129/0x210
             btrfs_page_mkwrite+0x6a/0x4a0
             do_page_mkwrite+0x4d/0xc0
             handle_mm_fault+0x103c/0x1730
             exc_page_fault+0x340/0x660
             asm_exc_page_fault+0x1e/0x30
      
      -> #9 (&mm->mmap_lock#2){++++}-{3:3}:
             __might_fault+0x68/0x90
             _copy_to_user+0x1e/0x80
             perf_read+0x141/0x2c0
             vfs_read+0xad/0x1b0
             ksys_read+0x5f/0xe0
             do_syscall_64+0x50/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #8 (&cpuctx_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             perf_event_init_cpu+0x88/0x150
             perf_event_init+0x1db/0x20b
             start_kernel+0x3ae/0x53c
             secondary_startup_64+0xa4/0xb0
      
      -> #7 (pmus_lock){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             perf_event_init_cpu+0x4f/0x150
             cpuhp_invoke_callback+0xb1/0x900
             _cpu_up.constprop.26+0x9f/0x130
             cpu_up+0x7b/0xc0
             bringup_nonboot_cpus+0x4f/0x60
             smp_init+0x26/0x71
             kernel_init_freeable+0x110/0x258
             kernel_init+0xa/0x103
             ret_from_fork+0x1f/0x30
      
      -> #6 (cpu_hotplug_lock){++++}-{0:0}:
             cpus_read_lock+0x39/0xb0
             kmem_cache_create_usercopy+0x28/0x230
             kmem_cache_create+0x12/0x20
             bioset_init+0x15e/0x2b0
             init_bio+0xa3/0xaa
             do_one_initcall+0x5a/0x2e0
             kernel_init_freeable+0x1f4/0x258
             kernel_init+0xa/0x103
             ret_from_fork+0x1f/0x30
      
      -> #5 (bio_slab_lock){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             bioset_init+0xbc/0x2b0
             __blk_alloc_queue+0x6f/0x2d0
             blk_mq_init_queue_data+0x1b/0x70
             loop_add+0x110/0x290 [loop]
             fq_codel_tcf_block+0x12/0x20 [sch_fq_codel]
             do_one_initcall+0x5a/0x2e0
             do_init_module+0x5a/0x220
             load_module+0x2459/0x26e0
             __do_sys_finit_module+0xba/0xe0
             do_syscall_64+0x50/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #4 (loop_ctl_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             lo_open+0x18/0x50 [loop]
             __blkdev_get+0xec/0x570
             blkdev_get+0xe8/0x150
             do_dentry_open+0x167/0x410
             path_openat+0x7c9/0xa80
             do_filp_open+0x93/0x100
             do_sys_openat2+0x22a/0x2e0
             do_sys_open+0x4b/0x80
             do_syscall_64+0x50/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             blkdev_put+0x1d/0x120
             close_fs_devices.part.31+0x84/0x130
             btrfs_close_devices+0x44/0xb0
             close_ctree+0x296/0x2b2
             generic_shutdown_super+0x69/0x100
             kill_anon_super+0xe/0x30
             btrfs_kill_super+0x12/0x20
             deactivate_locked_super+0x29/0x60
             cleanup_mnt+0xb8/0x140
             task_work_run+0x6d/0xb0
             __prepare_exit_to_usermode+0x1cc/0x1e0
             do_syscall_64+0x5c/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             btrfs_run_dev_stats+0x49/0x480
             commit_cowonly_roots+0xb5/0x2a0
             btrfs_commit_transaction+0x516/0xa60
             sync_filesystem+0x6b/0x90
             generic_shutdown_super+0x22/0x100
             kill_anon_super+0xe/0x30
             btrfs_kill_super+0x12/0x20
             deactivate_locked_super+0x29/0x60
             cleanup_mnt+0xb8/0x140
             task_work_run+0x6d/0xb0
             __prepare_exit_to_usermode+0x1cc/0x1e0
             do_syscall_64+0x5c/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             btrfs_commit_transaction+0x4bb/0xa60
             sync_filesystem+0x6b/0x90
             generic_shutdown_super+0x22/0x100
             kill_anon_super+0xe/0x30
             btrfs_kill_super+0x12/0x20
             deactivate_locked_super+0x29/0x60
             cleanup_mnt+0xb8/0x140
             task_work_run+0x6d/0xb0
             __prepare_exit_to_usermode+0x1cc/0x1e0
             do_syscall_64+0x5c/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
             __lock_acquire+0x1272/0x2310
             lock_acquire+0x9e/0x360
             __mutex_lock+0x9f/0x930
             btrfs_record_root_in_trans+0x43/0x70
             start_transaction+0xd1/0x5d0
             btrfs_dirty_inode+0x42/0xd0
             file_update_time+0xc8/0x110
             btrfs_page_mkwrite+0x10c/0x4a0
             do_page_mkwrite+0x4d/0xc0
             handle_mm_fault+0x103c/0x1730
             exc_page_fault+0x340/0x660
             asm_exc_page_fault+0x1e/0x30
      
      other info that might help us debug this:
      
      Chain exists of:
        &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(sb_pagefaults);
                                     lock(&mm->mmap_lock#2);
                                     lock(sb_pagefaults);
        lock(&fs_info->reloc_mutex);
      
       *** DEADLOCK ***
      
      3 locks held by fsstress/8739:
       #0: ffff88bee66eeb68 (&mm->mmap_lock#2){++++}-{3:3}, at: exc_page_fault+0x173/0x660
       #1: ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0
       #2: ffff88bfbd16e630 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3da/0x5d0
      
      stack backtrace:
      CPU: 17 PID: 8739 Comm: fsstress Kdump: loaded Not tainted 5.8.0-rc7-00169-g87212851a027-dirty #929
      Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
      Call Trace:
       dump_stack+0x78/0xa0
       check_noncircular+0x165/0x180
       __lock_acquire+0x1272/0x2310
       ? btrfs_get_alloc_profile+0x150/0x210
       lock_acquire+0x9e/0x360
       ? btrfs_record_root_in_trans+0x43/0x70
       __mutex_lock+0x9f/0x930
       ? btrfs_record_root_in_trans+0x43/0x70
       ? lock_acquire+0x9e/0x360
       ? join_transaction+0x5d/0x450
       ? find_held_lock+0x2d/0x90
       ? btrfs_record_root_in_trans+0x43/0x70
       ? join_transaction+0x3d5/0x450
       ? btrfs_record_root_in_trans+0x43/0x70
       btrfs_record_root_in_trans+0x43/0x70
       start_transaction+0xd1/0x5d0
       btrfs_dirty_inode+0x42/0xd0
       file_update_time+0xc8/0x110
       btrfs_page_mkwrite+0x10c/0x4a0
       ? handle_mm_fault+0x5e/0x1730
       do_page_mkwrite+0x4d/0xc0
       ? __do_fault+0x32/0x150
       handle_mm_fault+0x103c/0x1730
       exc_page_fault+0x340/0x660
       ? asm_exc_page_fault+0x8/0x30
       asm_exc_page_fault+0x1e/0x30
      RIP: 0033:0x7faa6c9969c4
      
      Was seen in testing.  The fix is similar to that of
      
        btrfs: open device without device_list_mutex
      
      where we're holding the device_list_mutex and then grab the bd_mutex,
      which pulls in a bunch of dependencies under the bd_mutex.  We only ever
      call btrfs_close_devices() on mount failure or unmount, so we're save to
      not have the device_list_mutex here.  We're already holding the
      uuid_mutex which keeps us safe from any external modification of the
      fs_devices.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      425c6ed6
    • Josef Bacik's avatar
      btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks · 62cf5391
      Josef Bacik authored
      When closing and freeing the source device we could end up doing our
      final blkdev_put() on the bdev, which will grab the bd_mutex.  As such
      we want to be holding as few locks as possible, so move this call
      outside of the dev_replace->lock_finishing_cancel_unmount lock.  Since
      we're modifying the fs_devices we need to make sure we're holding the
      uuid_mutex here, so take that as well.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      62cf5391
    • Nikolay Borisov's avatar
      btrfs: remove alloc_list splice in btrfs_prepare_sprout · 68abf360
      Nikolay Borisov authored
      btrfs_prepare_sprout is called when the first rw device is added to a
      seed filesystem. This means the filesystem can't have its alloc_list
      be non-empty, since seed filesystems are read only. Simply remove the
      code altogether.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      68abf360
    • Nikolay Borisov's avatar
      btrfs: document some invariants of seed code · 427c8fdd
      Nikolay Borisov authored
      Without good understanding of how seed devices works it's hard to grok
      some of what the code in open_seed_devices or btrfs_prepare_sprout does.
      
      Add comments hopefully reducing some of the cognitive load.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      427c8fdd
    • Nikolay Borisov's avatar
      btrfs: switch seed device to list api · 944d3f9f
      Nikolay Borisov authored
      While this patch touches a bunch of files the conversion is
      straighforward. Instead of using the implicit linked list anchored at
      btrfs_fs_devices::seed the code is switched to using
      list_for_each_entry.
      
      Previous patches in the series already factored out code that processed
      both main and seed devices so in those cases the factored out functions
      are called on the main fs_devices and then on every seed dev inside
      list_for_each_entry.
      
      Using list api also allows to simplify deletion from the seed dev list
      performed in btrfs_rm_device and btrfs_rm_dev_replace_free_srcdev by
      substituting a while() loop with a simple list_del_init.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      944d3f9f
    • Nikolay Borisov's avatar
      btrfs: simplify setting/clearing fs_info to btrfs_fs_devices · c4989c2f
      Nikolay Borisov authored
      It makes no sense to have sysfs-related routines be responsible for
      properly initialising the fs_info pointer of struct btrfs_fs_device.
      Instead this can be streamlined by making it the responsibility of
      btrfs_init_devices_late to initialize it. That function already
      initializes fs_info of every individual device in btrfs_fs_devices.
      
      As far as clearing it is concerned it makes sense to move it to
      close_fs_devices. That function is only called when struct
      btrfs_fs_devices is no longer in use - either for holding seeds or
      main devices for a mounted filesystem.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c4989c2f
    • Nikolay Borisov's avatar
      btrfs: make close_fs_devices return void · 54eed6ae
      Nikolay Borisov authored
      The return value of this function conveys absolutely no information.
      All callers already check the state of fs_devices->opened to decide how
      to proceed. So convert the function to returning void. While at it make
      btrfs_close_devices also return void.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      54eed6ae
    • Nikolay Borisov's avatar
      btrfs: factor out loop logic from btrfs_free_extra_devids · 3712ccb7
      Nikolay Borisov authored
      This prepares the code to switching seeds devices to a proper list.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3712ccb7
    • Nikolay Borisov's avatar
      btrfs: factor out reada loop in __reada_start_machine · dc0ab488
      Nikolay Borisov authored
      This is in preparation for moving fs_devices to proper lists.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dc0ab488
    • Nikolay Borisov's avatar
      btrfs: remove err variable from btrfs_get_extent · 1028d1c4
      Nikolay Borisov authored
      There's no practical reason too use 'err' as a variable to convey
      errors. In fact it's value is either set explicitly in the beginning of
      the function or it simply takes the value of 'ret'. Not conforming to
      the usual pattern of having ret be the only variable used to convey
      errors makes the code more error prone to bugs. In fact one such bug
      was introduced by 6bf9e4bd ("btrfs: inode: Verify inode mode toi
      avoid NULL pointer dereference") by assigning the error value to 'ret'
      and not 'err'.
      
      Let's fix that issue and make the function less tricky by leaving only
      ret to convey error values.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1028d1c4
    • Josef Bacik's avatar
      btrfs: dio iomap DSYNC workaround · 0eb79294
      Josef Bacik authored
      iomap dio will run generic_write_sync() for us if the iocb is DSYNC.
      This is problematic for us because of 2 reasons:
      
      1. we hold the inode_lock() during this operation, and we take it in
         generic_write_sync()
      2. we hold a read lock on the dio_sem but take the write lock in fsync
      
      Since we don't want to rip out this code right now, but reworking the
      locking is a bit much to do at this point, work around this problem with
      this masterpiece of a patch.
      
      First, we clear DSYNC on the iocb so that the iomap stuff doesn't know
      that it needs to handle the sync.  We save this fact in
      current->journal_info, because we need to see do special things once
      we're in iomap_begin, and we have no way to pass private information
      into iomap_dio_rw().
      
      Next we specify a separate iomap_dio_ops for sync, which implements an
      ->end_io() callback that gets called when the dio completes.  This is
      important for AIO, because we really do need to run generic_write_sync()
      if we complete asynchronously.  However if we're still in the submitting
      context when we enter ->end_io() we clear the flag so that the submitter
      knows they're the ones that needs to run generic_write_sync().
      
      This is meant to be temporary.  We need to work out how to eliminate the
      inode_lock() and the dio_sem in our fsync and use another mechanism to
      protect these operations.
      Tested-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0eb79294