1. 08 Feb, 2021 40 commits
    • Roman Anasal's avatar
      btrfs: send: use struct send_ctx *sctx for btrfs_compare_trees and changed_cb · 88980383
      Roman Anasal authored
      btrfs_compare_trees and changed_cb use a void *ctx parameter instead of
      struct send_ctx *sctx but when used in changed_cb it is immediately
      cast to `struct send_ctx *sctx = ctx;`.
      
      changed_cb is only ever called from btrfs_compare_trees and full_send_tree:
      - full_send_tree already passes a struct send_ctx *sctx
      - btrfs_compare_trees is only called by send_subvol with a struct send_ctx *sctx
      - void *ctx in btrfs_compare_trees is only used to be passed to changed_cb
      
      So casting to/from void *ctx seems unnecessary and directly using
      struct send_ctx *sctx instead provides better type-safety.
      
      The original reason for using void *ctx in the first place seems to have
      been dropped with 1b51d6fc ("btrfs: send: remove indirect callback
      parameter for changed_cb").
      Signed-off-by: default avatarRoman Anasal <roman.anasal@bdsu.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      88980383
    • Josef Bacik's avatar
      btrfs: run delayed refs less often in commit_cowonly_roots · 488bc2a2
      Josef Bacik authored
      We love running delayed refs in commit_cowonly_roots, but it is a bit
      excessive.  I was seeing cases of running 3 or 4 refs a few times in a
      row during this time.  Instead simply:
      
      - update all of the roots first
      - then run delayed refs
      - then handle the empty block groups case
      - and then if we have any more dirty roots do the whole thing again
      
      This allows us to be much more efficient with our delayed ref running,
      as we can batch a few more operations at once.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      488bc2a2
    • Josef Bacik's avatar
      btrfs: stop running all delayed refs during snapshot · dac348e9
      Josef Bacik authored
      This was added in commit 361048f5 ("Btrfs: fix full backref problem
      when inserting shared block reference") to address a problem where we
      hit the following BUG_ON() in alloc_reserved_tree_block
      
              if (node->type == BTRFS_SHARED_BLOCK_REF_KEY) {
                      BUG_ON(!(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF));
      
      However this BUG_ON() is bogus, and was removed by previous commit:
      
        btrfs: remove bogus BUG_ON in alloc_reserved_tree_block
      
      We no longer need to run delayed refs because of this, and can remove
      this flushing here.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dac348e9
    • Josef Bacik's avatar
      btrfs: remove bogus BUG_ON in alloc_reserved_tree_block · b7774425
      Josef Bacik authored
      The fix 361048f5 ("Btrfs: fix full backref problem when inserting
      shared block reference") added a delayed ref flushing at subvolume
      creation time in order to avoid hitting this particular BUG_ON().
      
      Before this fix, we were tripping the BUG_ON() by
      
      1. Modify snapshot A, which creates blocks with a normal reference for
         snapshot A, as A is the owner of these blocks.  We now have delayed
         refs for these blocks.
      2. Create a snapshot of A named B, which pushes references for the
         children blocks of the root node for the new root B, thus creating
         more delayed refs for newly allocated blocks.
      3. A is modified, and because the metadata blocks can now be shared, it
         must push FULL_BACKREF references to the children of any block that A
         COWs down it's path to its target key.
      4. Delayed refs are run.  Because these are newly allocated blocks, we
         have ->must_insert_reserved reserved set on the delayed ref head, we
         call into alloc_reserved_tree_block() to add the extent item, and
         then add our ref.  At the time of this fix, we were ordering
         FULL_BACKREF delayed ref operations first, so we'd go to add this
         reference and then BUG_ON() because we didn't have the FULL_BACKREF
         flag set.
      
      The patch fixed this problem by making sure we ran the delayed refs
      before we had the chance to modify A.  This meant that any *new* blocks
      would have had their extent items created _before_ we would ever
      actually COW down and generate FULL_BACKREF entries.  Thus the problem
      went away.
      
      However this BUG_ON() is actually completely bogus.  The existence of a
      full backref doesn't necessarily mean that FULL_BACKREF must be set on
      that block, it must only be set on the actual parent itself.  Consider
      the example provided above.  If we COW down one path from A, any nodes
      are going to have a FULL_BACKREF ref pushed down to _all_ of their
      children, but not all of the children are going to have FULL_BACKREF
      set.  It is completely valid to have an extent item with normal and full
      backrefs without FULL_BACKREF actually set on the block itself.
      
      As a final note, I have been testing with the patch (applied after this
      one)
      
        btrfs: stop running all delayed refs during snapshot
      
      which removed this flushing.  My test was a torture test which did a lot
      of operations while snapshotting and deleting snapshots as well as
      relocation, and I never tripped this BUG_ON().  This is actually because
      at the time of 361048f5, we ordered SHARED keys _before_ normal
      references, and thus they would get run first.  However currently they
      are ordered _after_ normal references, so we'd do the initial creation
      without having a shared reference, and thus not hit this BUG_ON(), which
      explains why I didn't start hitting this problem during my testing with
      my other patch applied.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b7774425
    • Josef Bacik's avatar
      btrfs: move delayed ref flushing for qgroup into qgroup helper · 2a4d84c1
      Josef Bacik authored
      The commit d6726335 ("btrfs: qgroup: Make snapshot accounting work
      with new extent-oriented qgroup.") added a flush of the delayed refs
      during snapshot creation in order to get the qgroup accounting properly.
      However this code has changed and been moved to it's own helper that is
      skipped if qgroups are turned off.  Move the flushing to the helper, as
      we do not need it when qgroups are turned off.
      
      Also add a comment explaining why it exists, and why it doesn't actually
      save us.  This will be helpful later when we try to fix qgroup
      accounting properly.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2a4d84c1
    • Josef Bacik's avatar
      btrfs: only run delayed refs once before committing · ad368f33
      Josef Bacik authored
      We try to pre-flush the delayed refs when committing, because we want to
      do as little work as possible in the critical section of the transaction
      commit.
      
      However doing this twice can lead to very long transaction commit delays
      as other threads are allowed to continue to generate more delayed refs,
      which potentially delays the commit by multiple minutes in very extreme
      cases.
      
      So simply stick to one pre-flush, and then continue the rest of the
      transaction commit.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ad368f33
    • Josef Bacik's avatar
      btrfs: delayed refs pre-flushing should only run the heads we have · 61a56a99
      Josef Bacik authored
      Previously our delayed ref running used the total number of items as the
      items to run.  However we changed that to number of heads to run with
      the delayed_refs_rsv, as generally we want to run all of the operations
      for one bytenr.
      
      But with btrfs_run_delayed_refs(trans, 0) we set our count to 2x the
      number of items that we have.  This is generally fine, but if we have
      some operation generation loads of delayed refs while we're doing this
      pre-flushing in the transaction commit, we'll just spin forever doing
      delayed refs.
      
      Fix this to simply pick the number of delayed refs we currently have,
      that way we do not end up doing a lot of extra work that's being
      generated in other threads.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      61a56a99
    • Josef Bacik's avatar
      btrfs: only let one thread pre-flush delayed refs in commit · e19eb11f
      Josef Bacik authored
      I've been running a stress test that runs 20 workers in their own
      subvolume, which are running an fsstress instance with 4 threads per
      worker, which is 80 total fsstress threads.  In addition to this I'm
      running balance in the background as well as creating and deleting
      snapshots.  This test takes around 12 hours to run normally, going
      slower and slower as the test goes on.
      
      The reason for this is because fsstress is running fsync sometimes, and
      because we're messing with block groups we often fall through to
      btrfs_commit_transaction, so will often have 20-30 threads all calling
      btrfs_commit_transaction at the same time.
      
      These all get stuck contending on the extent tree while they try to run
      delayed refs during the initial part of the commit.
      
      This is suboptimal, really because the extent tree is a single point of
      failure we only want one thread acting on that tree at once to reduce
      lock contention.
      
      Fix this by making the flushing mechanism a bit operation, to make it
      easy to use test_and_set_bit() in order to make sure only one task does
      this initial flush.
      
      Once we're into the transaction commit we only have one thread doing
      delayed ref running, it's just this initial pre-flush that is
      problematic.  With this patch my stress test takes around 90 minutes to
      run, instead of 12 hours.
      
      The memory barrier is not necessary for the flushing bit as it's
      ordered, unlike plain int. The transaction state accessed in
      btrfs_should_end_transaction could be affected by that too as it's not
      always used under transaction lock. Upon Nikolay's analysis in [1]
      it's not necessary:
      
        In should_end_transaction it's read without holding any locks. (U)
      
        It's modified in btrfs_cleanup_transaction without holding the
        fs_info->trans_lock (U), but the STATE_ERROR flag is going to be set.
      
        set in cleanup_transaction under fs_info->trans_lock (L)
        set in btrfs_commit_trans to COMMIT_START under fs_info->trans_lock.(L)
        set in btrfs_commit_trans to COMMIT_DOING under fs_info->trans_lock.(L)
        set in btrfs_commit_trans to COMMIT_UNBLOCK under
        fs_info->trans_lock.(L)
      
        set in btrfs_commit_trans to COMMIT_COMPLETED without locks but at this
        point the transaction is finished and fs_info->running_trans is NULL (U
        but irrelevant).
      
        So by the looks of it we can have a concurrent READ race with a WRITE,
        due to reads not taking a lock. In this case what we want to ensure is
        we either see new or old state. I consulted with Will Deacon and he said
        that in such a case we'd want to annotate the accesses to ->state with
        (READ|WRITE)_ONCE so as to avoid a theoretical tear, in this case I
        don't think this could happen but I imagine at some point KCSAN would
        flag such an access as racy (which it is).
      
      [1] https://lore.kernel.org/linux-btrfs/e1fd5cc1-0f28-f670-69f4-e9958b4964e6@suse.comReviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      [ add comments regarding memory barrier ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e19eb11f
    • Josef Bacik's avatar
      btrfs: do not block on deleted bgs mutex in the cleaner · ddfd08cb
      Josef Bacik authored
      While running some stress tests I started getting hung task messages.
      This is because the delete unused block groups code has to take the
      delete_unused_bgs_mutex to do it's work, which is taken by balance to
      make sure we don't delete block groups while we're balancing.
      
      The problem is that balance can take a while, and so we were getting
      hung task warnings.  We don't need to block and run these things, and
      the cleaner is needed to do other work, so trylock on this mutex and
      just bail if we can't acquire it right away.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ddfd08cb
    • Josef Bacik's avatar
      btrfs: abort the transaction if we fail to inc ref in btrfs_copy_root · 867ed321
      Josef Bacik authored
      While testing my error handling patches, I added a error injection site
      at btrfs_inc_extent_ref, to validate the error handling I added was
      doing the correct thing.  However I hit a pretty ugly corruption while
      doing this check, with the following error injection stack trace:
      
      btrfs_inc_extent_ref
        btrfs_copy_root
          create_reloc_root
            btrfs_init_reloc_root
      	btrfs_record_root_in_trans
      	  btrfs_start_transaction
      	    btrfs_update_inode
      	      btrfs_update_time
      		touch_atime
      		  file_accessed
      		    btrfs_file_mmap
      
      This is because we do not catch the error from btrfs_inc_extent_ref,
      which in practice would be ENOMEM, which means we lose the extent
      references for a root that has already been allocated and inserted,
      which is the problem.  Fix this by aborting the transaction if we fail
      to do the reference modification.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      867ed321
    • Josef Bacik's avatar
      btrfs: add asserts for deleting backref cache nodes · eddda68d
      Josef Bacik authored
      A weird KASAN problem that Zygo reported could have been easily caught
      if we checked for basic things in our backref freeing code.  We have two
      methods of freeing a backref node
      
      - btrfs_backref_free_node: this just is kfree() essentially.
      - btrfs_backref_drop_node: this actually unlinks the node and cleans up
        everything and then calls btrfs_backref_free_node().
      
      We should mostly be using btrfs_backref_drop_node(), to make sure the
      node is properly unlinked from the backref cache, and only use
      btrfs_backref_free_node() when we know the node isn't actually linked to
      the backref cache.  We made a mistake here and thus got the KASAN splat.
      
      Make this style of issue easier to find by adding some ASSERT()'s to
      btrfs_backref_free_node() and adjusting our deletion stuff to properly
      init the list so we can rely on list_empty() checks working properly.
      
        BUG: KASAN: use-after-free in btrfs_backref_cleanup_node+0x18a/0x420
        Read of size 8 at addr ffff888112402950 by task btrfs/28836
      
        CPU: 0 PID: 28836 Comm: btrfs Tainted: G        W         5.10.0-e35f27394290-for-next+ #23
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
        Call Trace:
         dump_stack+0xbc/0xf9
         ? btrfs_backref_cleanup_node+0x18a/0x420
         print_address_description.constprop.8+0x21/0x210
         ? record_print_text.cold.34+0x11/0x11
         ? btrfs_backref_cleanup_node+0x18a/0x420
         ? btrfs_backref_cleanup_node+0x18a/0x420
         kasan_report.cold.10+0x20/0x37
         ? btrfs_backref_cleanup_node+0x18a/0x420
         __asan_load8+0x69/0x90
         btrfs_backref_cleanup_node+0x18a/0x420
         btrfs_backref_release_cache+0x83/0x1b0
         relocate_block_group+0x394/0x780
         ? merge_reloc_roots+0x4a0/0x4a0
         btrfs_relocate_block_group+0x26e/0x4c0
         btrfs_relocate_chunk+0x52/0x120
         btrfs_balance+0xe2e/0x1900
         ? check_flags.part.50+0x6c/0x1e0
         ? btrfs_relocate_chunk+0x120/0x120
         ? kmem_cache_alloc_trace+0xa06/0xcb0
         ? _copy_from_user+0x83/0xc0
         btrfs_ioctl_balance+0x3a7/0x460
         btrfs_ioctl+0x24c8/0x4360
         ? __kasan_check_read+0x11/0x20
         ? check_chain_key+0x1f4/0x2f0
         ? __asan_loadN+0xf/0x20
         ? btrfs_ioctl_get_supported_features+0x30/0x30
         ? kvm_sched_clock_read+0x18/0x30
         ? check_chain_key+0x1f4/0x2f0
         ? lock_downgrade+0x3f0/0x3f0
         ? handle_mm_fault+0xad6/0x2150
         ? do_vfs_ioctl+0xfc/0x9d0
         ? ioctl_file_clone+0xe0/0xe0
         ? check_flags.part.50+0x6c/0x1e0
         ? check_flags.part.50+0x6c/0x1e0
         ? check_flags+0x26/0x30
         ? lock_is_held_type+0xc3/0xf0
         ? syscall_enter_from_user_mode+0x1b/0x60
         ? do_syscall_64+0x13/0x80
         ? rcu_read_lock_sched_held+0xa1/0xd0
         ? __kasan_check_read+0x11/0x20
         ? __fget_light+0xae/0x110
         __x64_sys_ioctl+0xc3/0x100
         do_syscall_64+0x37/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f4c4bdfe427
        RSP: 002b:00007fff33ee6df8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 00007fff33ee6e98 RCX: 00007f4c4bdfe427
        RDX: 00007fff33ee6e98 RSI: 00000000c4009420 RDI: 0000000000000003
        RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
        R10: fffffffffffff59d R11: 0000000000000202 R12: 0000000000000001
        R13: 0000000000000000 R14: 00007fff33ee8a34 R15: 0000000000000001
      
        Allocated by task 28836:
         kasan_save_stack+0x21/0x50
         __kasan_kmalloc.constprop.18+0xbe/0xd0
         kasan_kmalloc+0x9/0x10
         kmem_cache_alloc_trace+0x410/0xcb0
         btrfs_backref_alloc_node+0x46/0xf0
         btrfs_backref_add_tree_node+0x60d/0x11d0
         build_backref_tree+0xc5/0x700
         relocate_tree_blocks+0x2be/0xb90
         relocate_block_group+0x2eb/0x780
         btrfs_relocate_block_group+0x26e/0x4c0
         btrfs_relocate_chunk+0x52/0x120
         btrfs_balance+0xe2e/0x1900
         btrfs_ioctl_balance+0x3a7/0x460
         btrfs_ioctl+0x24c8/0x4360
         __x64_sys_ioctl+0xc3/0x100
         do_syscall_64+0x37/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Freed by task 28836:
         kasan_save_stack+0x21/0x50
         kasan_set_track+0x20/0x30
         kasan_set_free_info+0x1f/0x30
         __kasan_slab_free+0xf3/0x140
         kasan_slab_free+0xe/0x10
         kfree+0xde/0x200
         btrfs_backref_error_cleanup+0x452/0x530
         build_backref_tree+0x1a5/0x700
         relocate_tree_blocks+0x2be/0xb90
         relocate_block_group+0x2eb/0x780
         btrfs_relocate_block_group+0x26e/0x4c0
         btrfs_relocate_chunk+0x52/0x120
         btrfs_balance+0xe2e/0x1900
         btrfs_ioctl_balance+0x3a7/0x460
         btrfs_ioctl+0x24c8/0x4360
         __x64_sys_ioctl+0xc3/0x100
         do_syscall_64+0x37/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        The buggy address belongs to the object at ffff888112402900
         which belongs to the cache kmalloc-128 of size 128
        The buggy address is located 80 bytes inside of
         128-byte region [ffff888112402900, ffff888112402980)
        The buggy address belongs to the page:
        page:0000000028b1cd08 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888131c810c0 pfn:0x112402
        flags: 0x17ffe0000000200(slab)
        raw: 017ffe0000000200 ffffea000424f308 ffffea0007d572c8 ffff888100040440
        raw: ffff888131c810c0 ffff888112402000 0000000100000009 0000000000000000
        page dumped because: kasan: bad access detected
      
        Memory state around the buggy address:
         ffff888112402800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff888112402880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
        >ffff888112402900: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                         ^
         ffff888112402980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
         ffff888112402a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Link: https://lore.kernel.org/linux-btrfs/20201208194607.GI31381@hungrycats.org/
      CC: stable@vger.kernel.org # 5.10+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      eddda68d
    • Josef Bacik's avatar
      btrfs: do not warn if we can't find the reloc root when looking up backref · f78743fb
      Josef Bacik authored
      The backref code is looking for a reloc_root that corresponds to the
      given fs root.  However any number of things could have gone wrong while
      initializing that reloc_root, like ENOMEM while trying to allocate the
      root itself, or EIO while trying to write the root item.  This would
      result in no corresponding reloc_root being in the reloc root cache, and
      thus would return NULL when we do the find_reloc_root() call.
      
      Because of this we do not want to WARN_ON().  This presumably was meant
      to catch developer errors, cases where we messed up adding the reloc
      root.  However we can easily hit this case with error injection, and
      thus should not do a WARN_ON().
      
      CC: stable@vger.kernel.org # 5.10+
      Reported-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f78743fb
    • Josef Bacik's avatar
      btrfs: splice remaining dirty_bg's onto the transaction dirty bg list · 938fcbfb
      Josef Bacik authored
      While doing error injection testing with my relocation patches I hit the
      following assert:
      
        assertion failed: list_empty(&block_group->dirty_list), in fs/btrfs/block-group.c:3356
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.h:3357!
        invalid opcode: 0000 [#1] SMP NOPTI
        CPU: 0 PID: 24351 Comm: umount Tainted: G        W         5.10.0-rc3+ #193
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        RIP: 0010:assertfail.constprop.0+0x18/0x1a
        RSP: 0018:ffffa09b019c7e00 EFLAGS: 00010282
        RAX: 0000000000000056 RBX: ffff8f6492c18000 RCX: 0000000000000000
        RDX: ffff8f64fbc27c60 RSI: ffff8f64fbc19050 RDI: ffff8f64fbc19050
        RBP: ffff8f6483bbdc00 R08: 0000000000000000 R09: 0000000000000000
        R10: ffffa09b019c7c38 R11: ffffffff85d70928 R12: ffff8f6492c18100
        R13: ffff8f6492c18148 R14: ffff8f6483bbdd70 R15: dead000000000100
        FS:  00007fbfda4cdc40(0000) GS:ffff8f64fbc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fbfda666fd0 CR3: 000000013cf66002 CR4: 0000000000370ef0
        Call Trace:
         btrfs_free_block_groups.cold+0x55/0x55
         close_ctree+0x2c5/0x306
         ? fsnotify_destroy_marks+0x14/0x100
         generic_shutdown_super+0x6c/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20
         deactivate_locked_super+0x36/0xa0
         cleanup_mnt+0x12d/0x190
         task_work_run+0x5c/0xa0
         exit_to_user_mode_prepare+0x1b1/0x1d0
         syscall_exit_to_user_mode+0x54/0x280
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This happened because I injected an error in btrfs_cow_block() while
      running the dirty block groups.  When we run the dirty block groups, we
      splice the list onto a local list to process.  However if an error
      occurs, we only cleanup the transactions dirty block group list, not any
      pending block groups we have on our locally spliced list.
      
      In fact if we fail to allocate a path in this function we'll also fail
      to clean up the splice list.
      
      Fix this by splicing the list back onto the transaction dirty block
      group list so that the block groups are cleaned up.  Then add a 'out'
      label and have the error conditions jump to out so that the errors are
      handled properly.  This also has the side-effect of fixing a problem
      where we would clear 'ret' on error because we unconditionally ran
      btrfs_run_delayed_refs().
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      938fcbfb
    • Josef Bacik's avatar
      btrfs: fix reloc root leak with 0 ref reloc roots on recovery · c78a10ae
      Josef Bacik authored
      When recovering a relocation, if we run into a reloc root that has 0
      refs we simply add it to the reloc_control->reloc_roots list, and then
      clean it up later.  The problem with this is __del_reloc_root() doesn't
      do anything if the root isn't in the radix tree, which in this case it
      won't be because we never call __add_reloc_root() on the reloc_root.
      
      This exit condition simply isn't correct really.  During normal
      operation we can remove ourselves from the rb tree and then we're meant
      to clean up later at merge_reloc_roots() time, and this happens
      correctly.  During recovery we're depending on free_reloc_roots() to
      drop our references, but we're short-circuiting.
      
      Fix this by continuing to check if we're on the list and dropping
      ourselves from the reloc_control root list and dropping our reference
      appropriately.  Change the corresponding BUG_ON() to an ASSERT() that
      does the correct thing if we aren't in the rb tree.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c78a10ae
    • Nigel Christian's avatar
      btrfs: remove repeated word in struct member comment · 2e626e56
      Nigel Christian authored
      Comment for processed extent end of range has an unnecessary "in",
      remove it.
      Signed-off-by: default avatarNigel Christian <nigel.l.christian@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2e626e56
    • Josef Bacik's avatar
      btrfs: account for new extents being deleted in total_bytes_pinned · 81e75ac7
      Josef Bacik authored
      My recent patch set "A variety of lock contention fixes", found here
      
      https://lore.kernel.org/linux-btrfs/cover.1608319304.git.josef@toxicpanda.com/
      (Tracked in https://github.com/btrfs/linux/issues/86)
      
      that reduce lock contention on the extent root by running delayed refs
      less often resulted in a regression in generic/371.  This test
      fallocate()'s the fs until it's full, deletes all the files, and then
      tries to fallocate() until full again.
      
      Before these patches we would run all of the delayed refs during
      flushing, and then would commit the transaction because we had plenty of
      pinned space to recover in order to allocate.  However my patches made
      it so we weren't running the delayed refs as aggressively, which meant
      that we appeared to have less pinned space when we were deciding to
      commit the transaction.
      
      We use the space_info->total_bytes_pinned to approximate how much space
      we have pinned.  It's approximate because if we remove a reference to an
      extent we may free it, but there may be more references to it than we
      know of at that point, but we account it as pinned at the creation time,
      and then it's properly accounted when the delayed ref runs.
      
      The way we account for pinned space is if the
      delayed_ref_head->total_ref_mod is < 0, because that is clearly a
      freeing option.  However there is another case, and that is where
      ->total_ref_mod == 0 && ->must_insert_reserved == 1.
      
      When we allocate a new extent, we have ->total_ref_mod == 1 and we have
      ->must_insert_reserved == 1.  This is used to indicate that it is a
      brand new extent and will need to have its extent entry added before we
      modify any references on the delayed ref head.  But if we subsequently
      remove that extent reference, our ->total_ref_mod will be 0, and that
      space will be pinned and freed.  Accounting for this case properly
      allows for generic/371 to pass with my delayed refs patches applied.
      
      It's important to note that this problem exists without the referenced
      patches, it just was uncovered by them.
      
      CC: stable@vger.kernel.org # 5.10
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      81e75ac7
    • Josef Bacik's avatar
      btrfs: handle space_info::total_bytes_pinned inside the delayed ref itself · 2187374f
      Josef Bacik authored
      Currently we pass things around to figure out if we maybe freeing data
      based on the state of the delayed refs head.  This makes the accounting
      sort of confusing and hard to follow, as it's distinctly separate from
      the delayed ref heads stuff, but also depends on it entirely.
      
      Fix this by explicitly adjusting the space_info->total_bytes_pinned in
      the delayed refs code.  We now have two places where we modify this
      counter, once where we create the delayed and destroy the delayed refs,
      and once when we pin and unpin the extents.  This means there is a
      slight overlap between delayed refs and the pin/unpin mechanisms, but
      this is simply used by the ENOSPC infrastructure to determine if we need
      to commit the transaction, so there's no adverse affect from this, we
      might simply commit thinking it will give us enough space when it might
      not.
      
      CC: stable@vger.kernel.org # 5.10
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2187374f
    • Nikolay Borisov's avatar
      btrfs: enable W=1 checks for btrfs · e9aa7c28
      Nikolay Borisov authored
      Now that the btrfs' codebase is clean of almost all W=1 warnings let's
      enable those checks unconditionally for the entire fs/btrfs/ and its
      subdirectories to catch potential errors during development.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add some comments ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e9aa7c28
    • Nikolay Borisov's avatar
      lib/zstd: convert constants to defines · 71c36788
      Nikolay Borisov authored
      These constants are really used internally by zstd and including
      linux/zstd.h into users results in the following warnings:
      
      In file included from fs/btrfs/zstd.c:19:
      ./include/linux/zstd.h:798:21: warning: ‘ZSTD_skippableHeaderSize’ defined but not used [-Wunused-const-variable=]
        798 | static const size_t ZSTD_skippableHeaderSize = 8;
            |                     ^~~~~~~~~~~~~~~~~~~~~~~~
      ./include/linux/zstd.h:796:21: warning: ‘ZSTD_frameHeaderSize_max’ defined but not used [-Wunused-const-variable=]
        796 | static const size_t ZSTD_frameHeaderSize_max = ZSTD_FRAMEHEADERSIZE_MAX;
            |                     ^~~~~~~~~~~~~~~~~~~~~~~~
      ./include/linux/zstd.h:795:21: warning: ‘ZSTD_frameHeaderSize_min’ defined but not used [-Wunused-const-variable=]
        795 | static const size_t ZSTD_frameHeaderSize_min = ZSTD_FRAMEHEADERSIZE_MIN;
            |                     ^~~~~~~~~~~~~~~~~~~~~~~~
      ./include/linux/zstd.h:794:21: warning: ‘ZSTD_frameHeaderSize_prefix’ defined but not used [-Wunused-const-variable=]
        794 | static const size_t ZSTD_frameHeaderSize_prefix = 5;
      
      So fix those warnings by turning the constants into defines.
      Reviewed-by: default avatarNick Terrell <terrelln@fb.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      71c36788
    • Nikolay Borisov's avatar
      btrfs: zoned: remove unused variable in btrfs_sb_log_location_bdev · 8c31a3db
      Nikolay Borisov authored
      This fixes warning:
      
      fs/btrfs/zoned.c:491:6: warning: variable ‘zone_size’ set but not used [-Wunused-but-set-variable]
        491 |  u64 zone_size;
      
      which got introduced in 12659251 ("btrfs: implement log-structured
      superblock for ZONED mode"). We'll enable the warning by default and
      want clean build until the relevant zoned patches land.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8c31a3db
    • Nikolay Borisov's avatar
      btrfs: fix parameter description for functions in extent_io.c · 3bed2da1
      Nikolay Borisov authored
      This makes the file W=1 clean and fixes the following warnings:
      
      fs/btrfs/extent_io.c:414: warning: Function parameter or member 'tree' not described in '__etree_search'
      fs/btrfs/extent_io.c:414: warning: Function parameter or member 'offset' not described in '__etree_search'
      fs/btrfs/extent_io.c:414: warning: Function parameter or member 'next_ret' not described in '__etree_search'
      fs/btrfs/extent_io.c:414: warning: Function parameter or member 'prev_ret' not described in '__etree_search'
      fs/btrfs/extent_io.c:414: warning: Function parameter or member 'p_ret' not described in '__etree_search'
      fs/btrfs/extent_io.c:414: warning: Function parameter or member 'parent_ret' not described in '__etree_search'
      fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'tree' not described in 'find_contiguous_extent_bit'
      fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'start' not described in 'find_contiguous_extent_bit'
      fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'start_ret' not described in 'find_contiguous_extent_bit'
      fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'end_ret' not described in 'find_contiguous_extent_bit'
      fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'bits' not described in 'find_contiguous_extent_bit'
      fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'tree' not described in 'find_first_clear_extent_bit'
      fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'start' not described in 'find_first_clear_extent_bit'
      fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'start_ret' not described in 'find_first_clear_extent_bit'
      fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'end_ret' not described in 'find_first_clear_extent_bit'
      fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'bits' not described in 'find_first_clear_extent_bit'
      fs/btrfs/extent_io.c:4187: warning: Function parameter or member 'epd' not described in 'extent_write_cache_pages'
      fs/btrfs/extent_io.c:4187: warning: Excess function parameter 'data' description in 'extent_write_cache_pages'
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3bed2da1
    • Nikolay Borisov's avatar
      btrfs: fix parameter description in space-info.c · d98b188e
      Nikolay Borisov authored
      With these fixes space-info.c is clear for W=1 warnings, namely the
      following ones are fixed:
      
      fs/btrfs/space-info.c:575: warning: Function parameter or member 'fs_info' not described in 'may_commit_transaction'
      fs/btrfs/space-info.c:575: warning: Function parameter or member 'space_info' not described in 'may_commit_transaction'
      fs/btrfs/space-info.c:1231: warning: Function parameter or member 'fs_info' not described in 'handle_reserve_ticket'
      fs/btrfs/space-info.c:1231: warning: Function parameter or member 'space_info' not described in 'handle_reserve_ticket'
      fs/btrfs/space-info.c:1231: warning: Function parameter or member 'ticket' not described in 'handle_reserve_ticket'
      fs/btrfs/space-info.c:1231: warning: Function parameter or member 'flush' not described in 'handle_reserve_ticket'
      fs/btrfs/space-info.c:1315: warning: Function parameter or member 'fs_info' not described in '__reserve_bytes'
      fs/btrfs/space-info.c:1315: warning: Function parameter or member 'space_info' not described in '__reserve_bytes'
      fs/btrfs/space-info.c:1315: warning: Function parameter or member 'orig_bytes' not described in '__reserve_bytes'
      fs/btrfs/space-info.c:1315: warning: Function parameter or member 'flush' not described in '__reserve_bytes'
      fs/btrfs/space-info.c:1427: warning: Function parameter or member 'root' not described in 'btrfs_reserve_metadata_bytes'
      fs/btrfs/space-info.c:1427: warning: Function parameter or member 'block_rsv' not described in 'btrfs_reserve_metadata_bytes'
      fs/btrfs/space-info.c:1427: warning: Function parameter or member 'orig_bytes' not described in 'btrfs_reserve_metadata_bytes'
      fs/btrfs/space-info.c:1427: warning: Function parameter or member 'flush' not described in 'btrfs_reserve_metadata_bytes'
      fs/btrfs/space-info.c:1462: warning: Function parameter or member 'fs_info' not described in 'btrfs_reserve_data_bytes'
      fs/btrfs/space-info.c:1462: warning: Function parameter or member 'bytes' not described in 'btrfs_reserve_data_bytes'
      fs/btrfs/space-info.c:1462: warning: Function parameter or member 'flush' not described in 'btrfs_reserve_data_bytes'
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d98b188e
    • Nikolay Borisov's avatar
      btrfs: fix parameter description of btrfs_inode_rsv_release/btrfs_delalloc_release_space · b762d1d0
      Nikolay Borisov authored
      Fixes following warnings:
      
      fs/btrfs/delalloc-space.c:205: warning: Function parameter or member 'inode' not described in 'btrfs_inode_rsv_release'
      fs/btrfs/delalloc-space.c:205: warning: Function parameter or member 'qgroup_free' not described in 'btrfs_inode_rsv_release'
      fs/btrfs/delalloc-space.c:472: warning: Function parameter or member 'reserved' not described in 'btrfs_delalloc_release_space'
      fs/btrfs/delalloc-space.c:472: warning: Function parameter or member 'qgroup_free' not described in 'btrfs_delalloc_release_space'
      fs/btrfs/delalloc-space.c:472: warning: Excess function parameter 'release_bytes' description in 'btrfs_delalloc_release_space'
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b762d1d0
    • Nikolay Borisov's avatar
      6e353e3b
    • Nikolay Borisov's avatar
      btrfs: fix description format of fs_info of btrfs_wait_on_delayed_iputs · 2639631d
      Nikolay Borisov authored
      Fixes fs/btrfs/inode.c:3101: warning: Function parameter or member 'fs_info' not described in 'btrfs_wait_on_delayed_iputs'
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2639631d
    • Nikolay Borisov's avatar
      btrfs: document fs_info in btrfs_rmap_block · 9ee9b979
      Nikolay Borisov authored
      Fixes fs/btrfs/block-group.c:1570: warning: Function parameter or member 'fs_info' not described in 'btrfs_rmap_block'
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9ee9b979
    • Nikolay Borisov's avatar
      btrfs: document now parameter of peek_discard_list · 92419695
      Nikolay Borisov authored
      Fixes fs/btrfs/discard.c:203: warning: Function parameter or member 'now' not described in 'peek_discard_list'
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      92419695
    • Nikolay Borisov's avatar
      btrfs: improve parameter description for __btrfs_write_out_cache · f092cf3c
      Nikolay Borisov authored
      Fixes following W=1 warnings:
      fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'root' not described in '__btrfs_write_out_cache'
      fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'inode' not described in '__btrfs_write_out_cache'
      fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'ctl' not described in '__btrfs_write_out_cache'
      fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'block_group' not described in '__btrfs_write_out_cache'
      fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'io_ctl' not described in '__btrfs_write_out_cache'
      fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'trans' not described in '__btrfs_write_out_cache'
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f092cf3c
    • Nikolay Borisov's avatar
      btrfs: fix parameter description in delayed-ref.c functions · 696eb22b
      Nikolay Borisov authored
      This fixes the following warnings:
      
      fs/btrfs/delayed-ref.c:80: warning: Function parameter or member 'fs_info' not described in 'btrfs_delayed_refs_rsv_release'
      fs/btrfs/delayed-ref.c:80: warning: Function parameter or member 'nr' not described in 'btrfs_delayed_refs_rsv_release'
      fs/btrfs/delayed-ref.c:128: warning: Function parameter or member 'fs_info' not described in 'btrfs_migrate_to_delayed_refs_rsv'
      fs/btrfs/delayed-ref.c:128: warning: Function parameter or member 'src' not described in 'btrfs_migrate_to_delayed_refs_rsv'
      fs/btrfs/delayed-ref.c:128: warning: Function parameter or member 'num_bytes' not described in 'btrfs_migrate_to_delayed_refs_rsv'
      fs/btrfs/delayed-ref.c:174: warning: Function parameter or member 'fs_info' not described in 'btrfs_delayed_refs_rsv_refill'
      fs/btrfs/delayed-ref.c:174: warning: Function parameter or member 'flush' not described in 'btrfs_delayed_refs_rsv_refill'
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      696eb22b
    • Nikolay Borisov's avatar
      btrfs: fix function description formats in file-item.c · ca4207ae
      Nikolay Borisov authored
      This fixes following W=1 warnings:
      
      fs/btrfs/file-item.c:27: warning: Cannot understand  * @inode:  the inode we want to update the disk_i_size for
       on line 27 - I thought it was a doc line
      fs/btrfs/file-item.c:65: warning: Cannot understand  * @inode - the inode we're modifying
       on line 65 - I thought it was a doc line
      fs/btrfs/file-item.c:91: warning: Cannot understand  * @inode - the inode we're modifying
       on line 91 - I thought it was a doc line
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ca4207ae
    • Nikolay Borisov's avatar
      btrfs: fix parameter description of btrfs_add_extent_mapping · 9ad37bb3
      Nikolay Borisov authored
      This fixes the following compiler warnings:
      
      fs/btrfs/extent_map.c:601: warning: Function parameter or member 'fs_info' not described in 'btrfs_add_extent_mapping'
      fs/btrfs/extent_map.c:601: warning: Function parameter or member 'em_tree' not described in 'btrfs_add_extent_mapping'
      fs/btrfs/extent_map.c:601: warning: Function parameter or member 'em_in' not described in 'btrfs_add_extent_mapping'
      fs/btrfs/extent_map.c:601: warning: Function parameter or member 'start' not described in 'btrfs_add_extent_mapping'
      fs/btrfs/extent_map.c:601: warning: Function parameter or member 'len' not described in 'btrfs_add_extent_mapping'
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9ad37bb3
    • Nikolay Borisov's avatar
      btrfs: document modified parameter of add_extent_mapping · 401bd2dd
      Nikolay Borisov authored
      Fixes fs/btrfs/extent_map.c:399: warning: Function parameter or member
      'modified' not described in 'add_extent_mapping'
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      401bd2dd
    • Qu Wenruo's avatar
      btrfs: rework the order of btrfs_ordered_extent::flags · 3c198fe0
      Qu Wenruo authored
      [BUG]
      There is a long existing bug in the last parameter of
      btrfs_add_ordered_extent(), in commit 771ed689 ("Btrfs: Optimize
      compressed writeback and reads") back to 2008.
      
      In that ancient commit btrfs_add_ordered_extent() expects the @type
      parameter to be one of the following:
      
      - BTRFS_ORDERED_REGULAR
      - BTRFS_ORDERED_NOCOW
      - BTRFS_ORDERED_PREALLOC
      - BTRFS_ORDERED_COMPRESSED
      
      But we pass 0 in cow_file_range(), which means BTRFS_ORDERED_IO_DONE.
      
      Ironically extra check in __btrfs_add_ordered_extent() won't set the bit
      if we see (type == IO_DONE || type == IO_COMPLETE), and avoid any
      obvious bug.
      
      But this still leads to regular COW ordered extent having no bit to
      indicate its type in various trace events, rendering REGULAR bit
      useless.
      
      [FIX]
      Change the following aspects to avoid such problem:
      
      - Reorder btrfs_ordered_extent::flags
        Now the type bits go first (REGULAR/NOCOW/PREALLCO/COMPRESSED), then
        DIRECT bit, finally extra status bits like IO_DONE/COMPLETE/IOERR.
      
      - Add extra ASSERT() for btrfs_add_ordered_extent_*()
      
      - Remove @type parameter for btrfs_add_ordered_extent_compress()
        As the only valid @type here is BTRFS_ORDERED_COMPRESSED.
      
      - Remove the unnecessary special check for IO_DONE/COMPLETE in
        __btrfs_add_ordered_extent()
        This is just to make the code work, with extra ASSERT(), there are
        limited values can be passed in.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3c198fe0
    • Yang Li's avatar
      btrfs: remove redundant NULL check before kvfree · fe3b7bb0
      Yang Li authored
      Fix below warnings reported by coccicheck:
      ./fs/btrfs/raid56.c:237:2-8: WARNING: NULL check before some freeing
      functions is not needed.
      Reported-by: default avatarAbaci Robot <abaci@linux.alibaba.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarYang Li <abaci-bugfix@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fe3b7bb0
    • Josef Bacik's avatar
      btrfs: do not cleanup upper nodes in btrfs_backref_cleanup_node · 7e2a870a
      Josef Bacik authored
      Zygo reported the following panic when testing my error handling patches
      for relocation:
      
        kernel BUG at fs/btrfs/backref.c:2545!
        invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 3 PID: 8472 Comm: btrfs Tainted: G        W 14
        Hardware name: QEMU Standard PC (i440FX + PIIX,
      
        Call Trace:
         btrfs_backref_error_cleanup+0x4df/0x530
         build_backref_tree+0x1a5/0x700
         ? _raw_spin_unlock+0x22/0x30
         ? release_extent_buffer+0x225/0x280
         ? free_extent_buffer.part.52+0xd7/0x140
         relocate_tree_blocks+0x2a6/0xb60
         ? kasan_unpoison_shadow+0x35/0x50
         ? do_relocation+0xc10/0xc10
         ? kasan_kmalloc+0x9/0x10
         ? kmem_cache_alloc_trace+0x6a3/0xcb0
         ? free_extent_buffer.part.52+0xd7/0x140
         ? rb_insert_color+0x342/0x360
         ? add_tree_block.isra.36+0x236/0x2b0
         relocate_block_group+0x2eb/0x780
         ? merge_reloc_roots+0x470/0x470
         btrfs_relocate_block_group+0x26e/0x4c0
         btrfs_relocate_chunk+0x52/0x120
         btrfs_balance+0xe2e/0x18f0
         ? pvclock_clocksource_read+0xeb/0x190
         ? btrfs_relocate_chunk+0x120/0x120
         ? lock_contended+0x620/0x6e0
         ? do_raw_spin_lock+0x1e0/0x1e0
         ? do_raw_spin_unlock+0xa8/0x140
         btrfs_ioctl_balance+0x1f9/0x460
         btrfs_ioctl+0x24c8/0x4380
         ? __kasan_check_read+0x11/0x20
         ? check_chain_key+0x1f4/0x2f0
         ? __asan_loadN+0xf/0x20
         ? btrfs_ioctl_get_supported_features+0x30/0x30
         ? kvm_sched_clock_read+0x18/0x30
         ? check_chain_key+0x1f4/0x2f0
         ? lock_downgrade+0x3f0/0x3f0
         ? handle_mm_fault+0xad6/0x2150
         ? do_vfs_ioctl+0xfc/0x9d0
         ? ioctl_file_clone+0xe0/0xe0
         ? check_flags.part.50+0x6c/0x1e0
         ? check_flags.part.50+0x6c/0x1e0
         ? check_flags+0x26/0x30
         ? lock_is_held_type+0xc3/0xf0
         ? syscall_enter_from_user_mode+0x1b/0x60
         ? do_syscall_64+0x13/0x80
         ? rcu_read_lock_sched_held+0xa1/0xd0
         ? __kasan_check_read+0x11/0x20
         ? __fget_light+0xae/0x110
         __x64_sys_ioctl+0xc3/0x100
         do_syscall_64+0x37/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This occurs because of this check
      
        if (RB_EMPTY_NODE(&upper->rb_node))
      	  BUG_ON(!list_empty(&node->upper));
      
      As we are dropping the backref node, if we discover that our upper node
      in the edge we just cleaned up isn't linked into the cache that we are
      now done with this node, thus the BUG_ON().
      
      However this is an erroneous assumption, as we will look up all the
      references for a node first, and then process the pending edges.  All of
      the 'upper' nodes in our pending edges won't be in the cache's rb_tree
      yet, because they haven't been processed.  We could very well have many
      edges still left to cleanup on this node.
      
      The fact is we simply do not need this check, we can just process all of
      the edges only for this node, because below this check we do the
      following
      
        if (list_empty(&upper->lower)) {
      	  list_add_tail(&upper->lower, &cache->leaves);
      	  upper->lowest = 1;
        }
      
      If the upper node truly isn't used yet, then we add it to the
      cache->leaves list to be cleaned up later.  If it is still used then the
      last child node that has it linked into its node will add it to the
      leaves list and then it will be cleaned up.
      
      Fix this problem by dropping this logic altogether.  With this fix I no
      longer see the panic when testing with error injection in the backref
      code.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7e2a870a
    • Josef Bacik's avatar
      btrfs: keep track of the root owner for relocation reads · f7ba2d37
      Josef Bacik authored
      While testing the error paths in relocation, I hit the following lockdep
      splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.10.0-rc3+ #206 Not tainted
        ------------------------------------------------------
        btrfs-balance/1571 is trying to acquire lock:
        ffff8cdbcc8f77d0 (&head_ref->mutex){+.+.}-{3:3}, at: btrfs_lookup_extent_info+0x156/0x3b0
      
        but task is already holding lock:
        ffff8cdbc54adbf8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x100
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (btrfs-tree-00){++++}-{3:3}:
      	 down_write_nested+0x43/0x80
      	 __btrfs_tree_lock+0x27/0x100
      	 btrfs_search_slot+0x248/0x890
      	 relocate_tree_blocks+0x490/0x650
      	 relocate_block_group+0x1ba/0x5d0
      	 kretprobe_trampoline+0x0/0x50
      
        -> #1 (btrfs-csum-01){++++}-{3:3}:
      	 down_read_nested+0x43/0x130
      	 __btrfs_tree_read_lock+0x27/0x100
      	 btrfs_read_lock_root_node+0x31/0x40
      	 btrfs_search_slot+0x5ab/0x890
      	 btrfs_del_csums+0x10b/0x3c0
      	 __btrfs_free_extent+0x49d/0x8e0
      	 __btrfs_run_delayed_refs+0x283/0x11f0
      	 btrfs_run_delayed_refs+0x86/0x220
      	 btrfs_start_dirty_block_groups+0x2ba/0x520
      	 kretprobe_trampoline+0x0/0x50
      
        -> #0 (&head_ref->mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x1167/0x2150
      	 lock_acquire+0x116/0x3e0
      	 __mutex_lock+0x7e/0x7b0
      	 btrfs_lookup_extent_info+0x156/0x3b0
      	 walk_down_proc+0x1c3/0x280
      	 walk_down_tree+0x64/0xe0
      	 btrfs_drop_subtree+0x182/0x260
      	 do_relocation+0x52e/0x660
      	 relocate_tree_blocks+0x2ae/0x650
      	 relocate_block_group+0x1ba/0x5d0
      	 kretprobe_trampoline+0x0/0x50
      
        other info that might help us debug this:
      
        Chain exists of:
          &head_ref->mutex --> btrfs-csum-01 --> btrfs-tree-00
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-tree-00);
      				 lock(btrfs-csum-01);
      				 lock(btrfs-tree-00);
          lock(&head_ref->mutex);
      
         *** DEADLOCK ***
      
        5 locks held by btrfs-balance/1571:
         #0: ffff8cdb89749ff8 (&fs_info->delete_unused_bgs_mutex){+.+.}-{3:3}, at: btrfs_balance+0x563/0xf40
         #1: ffff8cdb89748838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x156/0x300
         #2: ffff8cdbc2c16650 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x413/0x5c0
         #3: ffff8cdbc135f538 (btrfs-treloc-01){+.+.}-{3:3}, at: __btrfs_tree_lock+0x27/0x100
         #4: ffff8cdbc54adbf8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x100
      
        stack backtrace:
        CPU: 1 PID: 1571 Comm: btrfs-balance Not tainted 5.10.0-rc3+ #206
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x8b/0xb0
         check_noncircular+0xcf/0xf0
         ? trace_call_bpf+0x139/0x260
         __lock_acquire+0x1167/0x2150
         lock_acquire+0x116/0x3e0
         ? btrfs_lookup_extent_info+0x156/0x3b0
         __mutex_lock+0x7e/0x7b0
         ? btrfs_lookup_extent_info+0x156/0x3b0
         ? btrfs_lookup_extent_info+0x156/0x3b0
         ? release_extent_buffer+0x124/0x170
         ? _raw_spin_unlock+0x1f/0x30
         ? release_extent_buffer+0x124/0x170
         btrfs_lookup_extent_info+0x156/0x3b0
         walk_down_proc+0x1c3/0x280
         walk_down_tree+0x64/0xe0
         btrfs_drop_subtree+0x182/0x260
         do_relocation+0x52e/0x660
         relocate_tree_blocks+0x2ae/0x650
         ? add_tree_block+0x149/0x1b0
         relocate_block_group+0x1ba/0x5d0
         elfcorehdr_read+0x40/0x40
         ? elfcorehdr_read+0x40/0x40
         ? btrfs_balance+0x796/0xf40
         ? __kthread_parkme+0x66/0x90
         ? btrfs_balance+0xf40/0xf40
         ? balance_kthread+0x37/0x50
         ? kthread+0x137/0x150
         ? __kthread_bind_mask+0x60/0x60
         ? ret_from_fork+0x1f/0x30
      
      As you can see this is bogus, we never take another tree's lock under
      the csum lock.  This happens because sometimes we have to read tree
      blocks from disk without knowing which root they belong to during
      relocation.  We defaulted to an owner of 0, which translates to an fs
      tree.  This is fine as all fs trees have the same class, but obviously
      isn't fine if the block belongs to a COW only tree.
      
      Thankfully COW only trees only have their owners root as a reference to
      them, and since we already look up the extent information during
      relocation, go ahead and check and see if this block might belong to a
      COW only tree, and if so save the owner in the tree_block struct.  This
      allows us to read_tree_block with the proper owner, which gets rid of
      this lockdep splat.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f7ba2d37
    • Qu Wenruo's avatar
      btrfs: introduce helper to grab an existing extent buffer from a page · c0f0a9e7
      Qu Wenruo authored
      This patch will extract the code to grab an extent buffer from a page
      into a helper, grab_extent_buffer_from_page().
      
      This reduces one indent level, and provides the work place for later
      expansion for subapge support.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c0f0a9e7
    • Qu Wenruo's avatar
      btrfs: update comment for btrfs_dirty_pages · c0fab480
      Qu Wenruo authored
      The original comment is from the initial merge, which has several
      problems:
      
      - No holes check any more
      - No inline decision is made
      
      Update the out-of-date comment with more correct one.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c0fab480
    • Qu Wenruo's avatar
      btrfs: refactor __extent_writepage_io() to improve readability · 6bc5636a
      Qu Wenruo authored
      The refactoring involves the following modifications:
      
      - iosize alignment
        In fact we don't really need to manually do alignment at all.
        All extent maps should already be aligned, thus basic ASSERT() check
        would be enough.
      
      - redundant variables
        We have extra variable like blocksize/pg_offset/end.
        They are all unnecessary.
      
        @blocksize can be replaced by sectorsize size directly, and it's only
        used to verify the em start/size is aligned.
      
        @pg_offset can be easily calculated using @cur and page_offset(page).
      
        @end is just assigned from @page_end and never modified, use
        "start + PAGE_SIZE - 1" directly and remove @page_end.
      
      - remove some BUG_ON()s
        The BUG_ON()s are for extent map, which we have tree-checker to check
        on-disk extent data item and runtime check.
        ASSERT() should be enough.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6bc5636a
    • Qu Wenruo's avatar
      btrfs: rename parameter offset to disk_bytenr in submit_extent_page · 0c64c33c
      Qu Wenruo authored
      The parameter offset is confusing, it's supposed to be the disk bytenr
      of metadata/data.  Rename it to disk_bytenr and update the comment.
      
      Also rename each offset passed to submit_extent_page() as @disk_bytenr
      so they're consistent.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0c64c33c