1. 30 Apr, 2020 1 commit
  2. 27 Apr, 2020 2 commits
    • Qu Wenruo's avatar
      btrfs: transaction: Avoid deadlock due to bad initialization timing of fs_info::journal_info · fcc99734
      Qu Wenruo authored
      [BUG]
      One run of btrfs/063 triggered the following lockdep warning:
        ============================================
        WARNING: possible recursive locking detected
        5.6.0-rc7-custom+ #48 Not tainted
        --------------------------------------------
        kworker/u24:0/7 is trying to acquire lock:
        ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]
      
        but task is already holding lock:
        ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(sb_internal#2);
          lock(sb_internal#2);
      
         *** DEADLOCK ***
      
         May be due to missing lock nesting notation
      
        4 locks held by kworker/u24:0/7:
         #0: ffff88817b495948 ((wq_completion)btrfs-endio-write){+.+.}, at: process_one_work+0x557/0xb80
         #1: ffff888189ea7db8 ((work_completion)(&work->normal_work)){+.+.}, at: process_one_work+0x557/0xb80
         #2: ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]
         #3: ffff888174ca4da8 (&fs_info->reloc_mutex){+.+.}, at: btrfs_record_root_in_trans+0x83/0xd0 [btrfs]
      
        stack backtrace:
        CPU: 0 PID: 7 Comm: kworker/u24:0 Not tainted 5.6.0-rc7-custom+ #48
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
        Call Trace:
         dump_stack+0xc2/0x11a
         __lock_acquire.cold+0xce/0x214
         lock_acquire+0xe6/0x210
         __sb_start_write+0x14e/0x290
         start_transaction+0x66c/0x890 [btrfs]
         btrfs_join_transaction+0x1d/0x20 [btrfs]
         find_free_extent+0x1504/0x1a50 [btrfs]
         btrfs_reserve_extent+0xd5/0x1f0 [btrfs]
         btrfs_alloc_tree_block+0x1ac/0x570 [btrfs]
         btrfs_copy_root+0x213/0x580 [btrfs]
         create_reloc_root+0x3bd/0x470 [btrfs]
         btrfs_init_reloc_root+0x2d2/0x310 [btrfs]
         record_root_in_trans+0x191/0x1d0 [btrfs]
         btrfs_record_root_in_trans+0x90/0xd0 [btrfs]
         start_transaction+0x16e/0x890 [btrfs]
         btrfs_join_transaction+0x1d/0x20 [btrfs]
         btrfs_finish_ordered_io+0x55d/0xcd0 [btrfs]
         finish_ordered_fn+0x15/0x20 [btrfs]
         btrfs_work_helper+0x116/0x9a0 [btrfs]
         process_one_work+0x632/0xb80
         worker_thread+0x80/0x690
         kthread+0x1a3/0x1f0
         ret_from_fork+0x27/0x50
      
      It's pretty hard to reproduce, only one hit so far.
      
      [CAUSE]
      This is because we're calling btrfs_join_transaction() without re-using
      the current running one:
      
      btrfs_finish_ordered_io()
      |- btrfs_join_transaction()		<<< Call #1
         |- btrfs_record_root_in_trans()
            |- btrfs_reserve_extent()
      	 |- btrfs_join_transaction()	<<< Call #2
      
      Normally such btrfs_join_transaction() call should re-use the existing
      one, without trying to re-start a transaction.
      
      But the problem is, in btrfs_join_transaction() call #1, we call
      btrfs_record_root_in_trans() before initializing current::journal_info.
      
      And in btrfs_join_transaction() call #2, we're relying on
      current::journal_info to avoid such deadlock.
      
      [FIX]
      Call btrfs_record_root_in_trans() after we have initialized
      current::journal_info.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fcc99734
    • Filipe Manana's avatar
      btrfs: fix partial loss of prealloc extent past i_size after fsync · f135cea3
      Filipe Manana authored
      When we have an inode with a prealloc extent that starts at an offset
      lower than the i_size and there is another prealloc extent that starts at
      an offset beyond i_size, we can end up losing part of the first prealloc
      extent (the part that starts at i_size) and have an implicit hole if we
      fsync the file and then have a power failure.
      
      Consider the following example with comments explaining how and why it
      happens.
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        # Create our test file with 2 consecutive prealloc extents, each with a
        # size of 128Kb, and covering the range from 0 to 256Kb, with a file
        # size of 0.
        $ xfs_io -f -c "falloc -k 0 128K" /mnt/foo
        $ xfs_io -c "falloc -k 128K 128K" /mnt/foo
      
        # Fsync the file to record both extents in the log tree.
        $ xfs_io -c "fsync" /mnt/foo
      
        # Now do a redudant extent allocation for the range from 0 to 64Kb.
        # This will merely increase the file size from 0 to 64Kb. Instead we
        # could also do a truncate to set the file size to 64Kb.
        $ xfs_io -c "falloc 0 64K" /mnt/foo
      
        # Fsync the file, so we update the inode item in the log tree with the
        # new file size (64Kb). This also ends up setting the number of bytes
        # for the first prealloc extent to 64Kb. This is done by the truncation
        # at btrfs_log_prealloc_extents().
        # This means that if a power failure happens after this, a write into
        # the file range 64Kb to 128Kb will not use the prealloc extent and
        # will result in allocation of a new extent.
        $ xfs_io -c "fsync" /mnt/foo
      
        # Now set the file size to 256K with a truncate and then fsync the file.
        # Since no changes happened to the extents, the fsync only updates the
        # i_size in the inode item at the log tree. This results in an implicit
        # hole for the file range from 64Kb to 128Kb, something which fsck will
        # complain when not using the NO_HOLES feature if we replay the log
        # after a power failure.
        $ xfs_io -c "truncate 256K" -c "fsync" /mnt/foo
      
      So instead of always truncating the log to the inode's current i_size at
      btrfs_log_prealloc_extents(), check first if there's a prealloc extent
      that starts at an offset lower than the i_size and with a length that
      crosses the i_size - if there is one, just make sure we truncate to a
      size that corresponds to the end offset of that prealloc extent, so
      that we don't lose the part of that extent that starts at i_size if a
      power failure happens.
      
      A test case for fstests follows soon.
      
      Fixes: 31d11b83 ("Btrfs: fix duplicate extents after fsync of file with prealloc extents")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f135cea3
  3. 23 Apr, 2020 4 commits
    • Xiyu Yang's avatar
      btrfs: fix transaction leak in btrfs_recover_relocation · 1402d17d
      Xiyu Yang authored
      btrfs_recover_relocation() invokes btrfs_join_transaction(), which joins
      a btrfs_trans_handle object into transactions and returns a reference of
      it with increased refcount to "trans".
      
      When btrfs_recover_relocation() returns, "trans" becomes invalid, so the
      refcount should be decreased to keep refcount balanced.
      
      The reference counting issue happens in one exception handling path of
      btrfs_recover_relocation(). When read_fs_root() failed, the refcnt
      increased by btrfs_join_transaction() is not decreased, causing a refcnt
      leak.
      
      Fix this issue by calling btrfs_end_transaction() on this error path
      when read_fs_root() failed.
      
      Fixes: 79787eaa ("btrfs: replace many BUG_ONs with proper error handling")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarXiyu Yang <xiyuyang19@fudan.edu.cn>
      Signed-off-by: default avatarXin Tan <tanxin.ctf@gmail.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1402d17d
    • Xiyu Yang's avatar
      btrfs: fix block group leak when removing fails · f6033c5e
      Xiyu Yang authored
      btrfs_remove_block_group() invokes btrfs_lookup_block_group(), which
      returns a local reference of the block group that contains the given
      bytenr to "block_group" with increased refcount.
      
      When btrfs_remove_block_group() returns, "block_group" becomes invalid,
      so the refcount should be decreased to keep refcount balanced.
      
      The reference counting issue happens in several exception handling paths
      of btrfs_remove_block_group(). When those error scenarios occur such as
      btrfs_alloc_path() returns NULL, the function forgets to decrease its
      refcnt increased by btrfs_lookup_block_group() and will cause a refcnt
      leak.
      
      Fix this issue by jumping to "out_put_group" label and calling
      btrfs_put_block_group() when those error scenarios occur.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarXiyu Yang <xiyuyang19@fudan.edu.cn>
      Signed-off-by: default avatarXin Tan <tanxin.ctf@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f6033c5e
    • Josef Bacik's avatar
      btrfs: drop logs when we've aborted a transaction · ef67963d
      Josef Bacik authored
      Dave reported a problem where we were panicing with generic/475 with
      misc-5.7.  This is because we were doing IO after we had stopped all of
      the worker threads, because we do the log tree cleanup on roots at drop
      time.  Cleaning up the log tree will always need to do reads if we
      happened to have evicted the blocks from memory.
      
      Because of this simply add a helper to btrfs_cleanup_transaction() that
      will go through and drop all of the log roots.  This gets run before we
      do the close_ctree() work, and thus we are allowed to do any reads that
      we would need.  I ran this through many iterations of generic/475 with
      constrained memory and I did not see the issue.
      
        general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        CPU: 2 PID: 12359 Comm: umount Tainted: G        W 5.6.0-rc7-btrfs-next-58 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_queue_work+0x33/0x1c0 [btrfs]
        RSP: 0018:ffff9cfb015937d8 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffff8eb5e339ed80 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffff8eb5eb33b770 RDI: ffff8eb5e37a0460
        RBP: ffff8eb5eb33b770 R08: 000000000000020c R09: ffffffff9fc09ac0
        R10: 0000000000000007 R11: 0000000000000000 R12: 6b6b6b6b6b6b6b6b
        R13: ffff9cfb00229040 R14: 0000000000000008 R15: ffff8eb5d3868000
        FS:  00007f167ea022c0(0000) GS:ffff8eb5fae00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f167e5e0cb1 CR3: 0000000138c18004 CR4: 00000000003606e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         btrfs_end_bio+0x81/0x130 [btrfs]
         __split_and_process_bio+0xaf/0x4e0 [dm_mod]
         ? percpu_counter_add_batch+0xa3/0x120
         dm_process_bio+0x98/0x290 [dm_mod]
         ? generic_make_request+0xfb/0x410
         dm_make_request+0x4d/0x120 [dm_mod]
         ? generic_make_request+0xfb/0x410
         generic_make_request+0x12a/0x410
         ? submit_bio+0x38/0x160
         submit_bio+0x38/0x160
         ? percpu_counter_add_batch+0xa3/0x120
         btrfs_map_bio+0x289/0x570 [btrfs]
         ? kmem_cache_alloc+0x24d/0x300
         btree_submit_bio_hook+0x79/0xc0 [btrfs]
         submit_one_bio+0x31/0x50 [btrfs]
         read_extent_buffer_pages+0x2fe/0x450 [btrfs]
         btree_read_extent_buffer_pages+0x7e/0x170 [btrfs]
         walk_down_log_tree+0x343/0x690 [btrfs]
         ? walk_log_tree+0x3d/0x380 [btrfs]
         walk_log_tree+0xf7/0x380 [btrfs]
         ? plist_requeue+0xf0/0xf0
         ? delete_node+0x4b/0x230
         free_log_tree+0x4c/0x130 [btrfs]
         ? wait_log_commit+0x140/0x140 [btrfs]
         btrfs_free_log+0x17/0x30 [btrfs]
         btrfs_drop_and_free_fs_root+0xb0/0xd0 [btrfs]
         btrfs_free_fs_roots+0x10c/0x190 [btrfs]
         ? do_raw_spin_unlock+0x49/0xc0
         ? _raw_spin_unlock+0x29/0x40
         ? release_extent_buffer+0x121/0x170 [btrfs]
         close_ctree+0x289/0x2e6 [btrfs]
         generic_shutdown_super+0x6c/0x110
         kill_anon_super+0xe/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x3a/0x70
      Reported-by: default avatarDavid Sterba <dsterba@suse.com>
      Fixes: 8c38938c ("btrfs: move the root freeing stuff into btrfs_put_root")
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ef67963d
    • Filipe Manana's avatar
      btrfs: fix memory leak of transaction when deleting unused block group · 5150bf19
      Filipe Manana authored
      When cleaning pinned extents right before deleting an unused block group,
      we check if there's still a previous transaction running and if so we
      increment its reference count before using it for cleaning pinned ranges
      in its pinned extents iotree. However we ended up never decrementing the
      reference count after using the transaction, resulting in a memory leak.
      
      Fix it by decrementing the reference count.
      
      Fixes: fe119a6e ("btrfs: switch to per-transaction pinned extents")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5150bf19
  4. 20 Apr, 2020 1 commit
  5. 17 Apr, 2020 1 commit
    • Josef Bacik's avatar
      btrfs: fix setting last_trans for reloc roots · aec7db3b
      Josef Bacik authored
      I made a mistake with my previous fix, I assumed that we didn't need to
      mess with the reloc roots once we were out of the part of relocation where
      we are actually moving the extents.
      
      The subtle thing that I missed is that btrfs_init_reloc_root() also
      updates the last_trans for the reloc root when we do
      btrfs_record_root_in_trans() for the corresponding fs_root.  I've added a
      comment to make sure future me doesn't make this mistake again.
      
      This showed up as a WARN_ON() in btrfs_copy_root() because our
      last_trans didn't == the current transid.  This could happen if we
      snapshotted a fs root with a reloc root after we set
      rc->create_reloc_tree = 0, but before we actually merge the reloc root.
      
      Worth mentioning that the regression produced the following warning
      when running snapshot creation and balance in parallel:
      
        BTRFS info (device sdc): relocating block group 30408704 flags metadata|dup
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 12823 at fs/btrfs/ctree.c:191 btrfs_copy_root+0x26f/0x430 [btrfs]
        CPU: 0 PID: 12823 Comm: btrfs Tainted: G        W 5.6.0-rc7-btrfs-next-58 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_copy_root+0x26f/0x430 [btrfs]
        RSP: 0018:ffffb96e044279b8 EFLAGS: 00010202
        RAX: 0000000000000009 RBX: ffff9da70bf61000 RCX: ffffb96e04427a48
        RDX: ffff9da733a770c8 RSI: ffff9da70bf61000 RDI: ffff9da694163818
        RBP: ffff9da733a770c8 R08: fffffffffffffff8 R09: 0000000000000002
        R10: ffffb96e044279a0 R11: 0000000000000000 R12: ffff9da694163818
        R13: fffffffffffffff8 R14: ffff9da6d2512000 R15: ffff9da714cdac00
        FS:  00007fdeacf328c0(0000) GS:ffff9da735e00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000055a2a5b8a118 CR3: 00000001eed78002 CR4: 00000000003606f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         ? create_reloc_root+0x49/0x2b0 [btrfs]
         ? kmem_cache_alloc_trace+0xe5/0x200
         create_reloc_root+0x8b/0x2b0 [btrfs]
         btrfs_reloc_post_snapshot+0x96/0x5b0 [btrfs]
         create_pending_snapshot+0x610/0x1010 [btrfs]
         create_pending_snapshots+0xa8/0xd0 [btrfs]
         btrfs_commit_transaction+0x4c7/0xc50 [btrfs]
         ? btrfs_mksubvol+0x3cd/0x560 [btrfs]
         btrfs_mksubvol+0x455/0x560 [btrfs]
         __btrfs_ioctl_snap_create+0x15f/0x190 [btrfs]
         btrfs_ioctl_snap_create_v2+0xa4/0xf0 [btrfs]
         ? mem_cgroup_commit_charge+0x6e/0x540
         btrfs_ioctl+0x12d8/0x3760 [btrfs]
         ? do_raw_spin_unlock+0x49/0xc0
         ? _raw_spin_unlock+0x29/0x40
         ? __handle_mm_fault+0x11b3/0x14b0
         ? ksys_ioctl+0x92/0xb0
         ksys_ioctl+0x92/0xb0
         ? trace_hardirqs_off_thunk+0x1a/0x1c
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x5c/0x280
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x7fdeabd3bdd7
      
      Fixes: 2abc726a ("btrfs: do not init a reloc root if we aren't relocating")
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      aec7db3b
  6. 10 Apr, 2020 1 commit
  7. 08 Apr, 2020 4 commits
    • Filipe Manana's avatar
      btrfs: fix reclaim counter leak of space_info objects · d611add4
      Filipe Manana authored
      Whenever we add a ticket to a space_info object we increment the object's
      reclaim_size counter witht the ticket's bytes, and we decrement it with
      the corresponding amount only when we are able to grant the requested
      space to the ticket. When we are not able to grant the space to a ticket,
      or when the ticket is removed due to a signal (e.g. an application has
      received sigterm from the terminal) we never decrement the counter with
      the corresponding bytes from the ticket. This leak can result in the
      space reclaim code to later do much more work than necessary. So fix it
      by decrementing the counter when those two cases happen as well.
      
      Fixes: db161806 ("btrfs: account ticket size at add/delete time")
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d611add4
    • Filipe Manana's avatar
      btrfs: make full fsyncs always operate on the entire file again · 7af59743
      Filipe Manana authored
      This is a revert of commit 0a8068a3 ("btrfs: make ranged full
      fsyncs more efficient"), with updated comment in btrfs_sync_file.
      
      Commit 0a8068a3 ("btrfs: make ranged full fsyncs more efficient")
      made full fsyncs operate on the given range only as it assumed it was safe
      when using the NO_HOLES feature, since the hole detection was simplified
      some time ago and no longer was a source for races with ordered extent
      completion of adjacent file ranges.
      
      However it's still not safe to have a full fsync only operate on the given
      range, because extent maps for new extents might not be present in memory
      due to inode eviction or extent cloning. Consider the following example:
      
      1) We are currently at transaction N;
      
      2) We write to the file range [0, 1MiB);
      
      3) Writeback finishes for the whole range and ordered extents complete,
         while we are still at transaction N;
      
      4) The inode is evicted;
      
      5) We open the file for writing, causing the inode to be loaded to
         memory again, which sets the 'full sync' bit on its flags. At this
         point the inode's list of modified extent maps is empty (figuring
         out which extents were created in the current transaction and were
         not yet logged by an fsync is expensive, that's why we set the
         'full sync' bit when loading an inode);
      
      6) We write to the file range [512KiB, 768KiB);
      
      7) We do a ranged fsync (such as msync()) for file range [512KiB, 768KiB).
         This correctly flushes this range and logs its extent into the log
         tree. When the writeback started an extent map for range [512KiB, 768KiB)
         was added to the inode's list of modified extents, and when the fsync()
         finishes logging it removes that extent map from the list of modified
         extent maps. This fsync also clears the 'full sync' bit;
      
      8) We do a regular fsync() (full ranged). This fsync() ends up doing
         nothing because the inode's list of modified extents is empty and
         no other changes happened since the previous ranged fsync(), so
         it just returns success (0) and we end up never logging extents for
         the file ranges [0, 512KiB) and [768KiB, 1MiB).
      
      Another scenario where this can happen is if we replace steps 2 to 4 with
      cloning from another file into our test file, as that sets the 'full sync'
      bit in our inode's flags and does not populate its list of modified extent
      maps.
      
      This was causing test case generic/457 to fail sporadically when using the
      NO_HOLES feature, as it exercised this later case where the inode has the
      'full sync' bit set and has no extent maps in memory to represent the new
      extents due to extent cloning.
      
      Fix this by reverting commit 0a8068a3 ("btrfs: make ranged full fsyncs
      more efficient") since there is no easy way to work around it.
      
      Fixes: 0a8068a3 ("btrfs: make ranged full fsyncs more efficient")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7af59743
    • Filipe Manana's avatar
      btrfs: fix lost i_size update after cloning inline extent · 4fdb688c
      Filipe Manana authored
      When not using the NO_HOLES feature we were not marking the destination's
      file range as written after cloning an inline extent into it. This can
      lead to a data loss if the current destination file size is smaller than
      the source file's size.
      
      Example:
      
        $ mkfs.btrfs -f -O ^no-holes /dev/sdc
        $ mount /mnt/sdc /mnt
      
        $ echo "hello world" > /mnt/foo
        $ cp --reflink=always /mnt/foo /mnt/bar
        $ rm -f /mnt/foo
        $ umount /mnt
      
        $ mount /mnt/sdc /mnt
        $ cat /mnt/bar
        $
        $ stat -c %s /mnt/bar
        0
      
        # -> the file is empty, since we deleted foo, the data lost is forever
      
      Fix that by calling btrfs_inode_set_file_extent_range() after cloning an
      inline extent.
      
      A test case for fstests will follow soon.
      
      Link: https://lore.kernel.org/linux-btrfs/20200404193846.GA432065@latitude/Reported-by: default avatarJohannes Hirte <johannes.hirte@datenkhaos.de>
      Fixes: 9ddc959e ("btrfs: use the file extent tree infrastructure")
      Tested-by: default avatarJohannes Hirte <johannes.hirte@datenkhaos.de>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4fdb688c
    • Josef Bacik's avatar
      btrfs: check commit root generation in should_ignore_root · 4d4225fc
      Josef Bacik authored
      Previously we would set the reloc root's last snapshot to transid - 1.
      However there was a problem with doing this, and we changed it to
      setting the last snapshot to the generation of the commit node of the fs
      root.
      
      This however broke should_ignore_root().  The assumption is that if we
      are in a generation newer than when the reloc root was created, then we
      would find the reloc root through normal backref lookups, and thus can
      ignore any fs roots we find with an old enough reloc root.
      
      Now that the last snapshot could be considerably further in the past
      than before, we'd end up incorrectly ignoring an fs root.  Thus we'd
      find no nodes for the bytenr we were searching for, and we'd fail to
      relocate anything.  We'd loop through the relocate code again and see
      that there were still used space in that block group, attempt to
      relocate those bytenr's again, fail in the same way, and just loop like
      this forever.  This is tricky in that we have to not modify the fs root
      at all during this time, so we need to have a block group that has data
      in this fs root that is not shared by any other root, which is why this
      has been difficult to reproduce.
      
      Fixes: 054570a1 ("Btrfs: fix relocation incorrectly dropping data references")
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4d4225fc
  8. 25 Mar, 2020 2 commits
    • Robbie Ko's avatar
      btrfs: fix missing semaphore unlock in btrfs_sync_file · 6ff06729
      Robbie Ko authored
      Ordered ops are started twice in sync file, once outside of inode mutex
      and once inside, taking the dio semaphore. There was one error path
      missing the semaphore unlock.
      
      Fixes: aab15e8e ("Btrfs: fix rare chances for data loss when doing a fast fsync")
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarRobbie Ko <robbieko@synology.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      [ add changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6ff06729
    • Josef Bacik's avatar
      btrfs: use nofs allocations for running delayed items · 351cbf6e
      Josef Bacik authored
      Zygo reported the following lockdep splat while testing the balance
      patches
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.6.0-c6f0579d496a+ #53 Not tainted
      ------------------------------------------------------
      kswapd0/1133 is trying to acquire lock:
      ffff888092f622c0 (&delayed_node->mutex){+.+.}, at: __btrfs_release_delayed_node+0x7c/0x5b0
      
      but task is already holding lock:
      ffffffff8fc5f860 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (fs_reclaim){+.+.}:
             fs_reclaim_acquire.part.91+0x29/0x30
             fs_reclaim_acquire+0x19/0x20
             kmem_cache_alloc_trace+0x32/0x740
             add_block_entry+0x45/0x260
             btrfs_ref_tree_mod+0x6e2/0x8b0
             btrfs_alloc_tree_block+0x789/0x880
             alloc_tree_block_no_bg_flush+0xc6/0xf0
             __btrfs_cow_block+0x270/0x940
             btrfs_cow_block+0x1ba/0x3a0
             btrfs_search_slot+0x999/0x1030
             btrfs_insert_empty_items+0x81/0xe0
             btrfs_insert_delayed_items+0x128/0x7d0
             __btrfs_run_delayed_items+0xf4/0x2a0
             btrfs_run_delayed_items+0x13/0x20
             btrfs_commit_transaction+0x5cc/0x1390
             insert_balance_item.isra.39+0x6b2/0x6e0
             btrfs_balance+0x72d/0x18d0
             btrfs_ioctl_balance+0x3de/0x4c0
             btrfs_ioctl+0x30ab/0x44a0
             ksys_ioctl+0xa1/0xe0
             __x64_sys_ioctl+0x43/0x50
             do_syscall_64+0x77/0x2c0
             entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      -> #0 (&delayed_node->mutex){+.+.}:
             __lock_acquire+0x197e/0x2550
             lock_acquire+0x103/0x220
             __mutex_lock+0x13d/0xce0
             mutex_lock_nested+0x1b/0x20
             __btrfs_release_delayed_node+0x7c/0x5b0
             btrfs_remove_delayed_node+0x49/0x50
             btrfs_evict_inode+0x6fc/0x900
             evict+0x19a/0x2c0
             dispose_list+0xa0/0xe0
             prune_icache_sb+0xbd/0xf0
             super_cache_scan+0x1b5/0x250
             do_shrink_slab+0x1f6/0x530
             shrink_slab+0x32e/0x410
             shrink_node+0x2a5/0xba0
             balance_pgdat+0x4bd/0x8a0
             kswapd+0x35a/0x800
             kthread+0x1e9/0x210
             ret_from_fork+0x3a/0x50
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(fs_reclaim);
                                     lock(&delayed_node->mutex);
                                     lock(fs_reclaim);
        lock(&delayed_node->mutex);
      
       *** DEADLOCK ***
      
      3 locks held by kswapd0/1133:
       #0: ffffffff8fc5f860 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
       #1: ffffffff8fc380d8 (shrinker_rwsem){++++}, at: shrink_slab+0x1e8/0x410
       #2: ffff8881e0e6c0e8 (&type->s_umount_key#42){++++}, at: trylock_super+0x1b/0x70
      
      stack backtrace:
      CPU: 2 PID: 1133 Comm: kswapd0 Not tainted 5.6.0-c6f0579d496a+ #53
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
      Call Trace:
       dump_stack+0xc1/0x11a
       print_circular_bug.isra.38.cold.57+0x145/0x14a
       check_noncircular+0x2a9/0x2f0
       ? print_circular_bug.isra.38+0x130/0x130
       ? stack_trace_consume_entry+0x90/0x90
       ? save_trace+0x3cc/0x420
       __lock_acquire+0x197e/0x2550
       ? btrfs_inode_clear_file_extent_range+0x9b/0xb0
       ? register_lock_class+0x960/0x960
       lock_acquire+0x103/0x220
       ? __btrfs_release_delayed_node+0x7c/0x5b0
       __mutex_lock+0x13d/0xce0
       ? __btrfs_release_delayed_node+0x7c/0x5b0
       ? __asan_loadN+0xf/0x20
       ? pvclock_clocksource_read+0xeb/0x190
       ? __btrfs_release_delayed_node+0x7c/0x5b0
       ? mutex_lock_io_nested+0xc20/0xc20
       ? __kasan_check_read+0x11/0x20
       ? check_chain_key+0x1e6/0x2e0
       mutex_lock_nested+0x1b/0x20
       ? mutex_lock_nested+0x1b/0x20
       __btrfs_release_delayed_node+0x7c/0x5b0
       btrfs_remove_delayed_node+0x49/0x50
       btrfs_evict_inode+0x6fc/0x900
       ? btrfs_setattr+0x840/0x840
       ? do_raw_spin_unlock+0xa8/0x140
       evict+0x19a/0x2c0
       dispose_list+0xa0/0xe0
       prune_icache_sb+0xbd/0xf0
       ? invalidate_inodes+0x310/0x310
       super_cache_scan+0x1b5/0x250
       do_shrink_slab+0x1f6/0x530
       shrink_slab+0x32e/0x410
       ? do_shrink_slab+0x530/0x530
       ? do_shrink_slab+0x530/0x530
       ? __kasan_check_read+0x11/0x20
       ? mem_cgroup_protected+0x13d/0x260
       shrink_node+0x2a5/0xba0
       balance_pgdat+0x4bd/0x8a0
       ? mem_cgroup_shrink_node+0x490/0x490
       ? _raw_spin_unlock_irq+0x27/0x40
       ? finish_task_switch+0xce/0x390
       ? rcu_read_lock_bh_held+0xb0/0xb0
       kswapd+0x35a/0x800
       ? _raw_spin_unlock_irqrestore+0x4c/0x60
       ? balance_pgdat+0x8a0/0x8a0
       ? finish_wait+0x110/0x110
       ? __kasan_check_read+0x11/0x20
       ? __kthread_parkme+0xc6/0xe0
       ? balance_pgdat+0x8a0/0x8a0
       kthread+0x1e9/0x210
       ? kthread_create_worker_on_cpu+0xc0/0xc0
       ret_from_fork+0x3a/0x50
      
      This is because we hold that delayed node's mutex while doing tree
      operations.  Fix this by just wrapping the searches in nofs.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      351cbf6e
  9. 23 Mar, 2020 24 commits
    • Takashi Iwai's avatar
      btrfs: sysfs: Use scnprintf() instead of snprintf() · abdd9feb
      Takashi Iwai authored
      snprintf() is a hard-to-use function, and it's especially difficult to
      use it properly for concatenating substrings in a buffer with a limited
      size.  Since snprintf() returns the would-be-output size, not the actual
      size, the subsequent use of snprintf() may point to the incorrect
      position easily.  Also, returning the value from snprintf() directly to
      sysfs show function would pass a bogus value that is higher than the
      actually truncated string.
      
      That said, although the current code doesn't actually overflow the
      buffer with PAGE_SIZE, it's a usage that shouldn't be done.  Or it's
      worse; this gives a wrong confidence as if it were doing safe
      operations.
      
      This patch replaces such snprintf() calls with a safer version,
      scnprintf().  It returns the actual output size, hence it's more
      intuitive and the code does what's expected.
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      abdd9feb
    • Josef Bacik's avatar
      btrfs: do not resolve backrefs for roots that are being deleted · 39dba873
      Josef Bacik authored
      Zygo reported a deadlock where a task was stuck in the inode logical
      resolve code.  The deadlock looks like this
      
        Task 1
        btrfs_ioctl_logical_to_ino
        ->iterate_inodes_from_logical
         ->iterate_extent_inodes
          ->path->search_commit_root isn't set, so a transaction is started
            ->resolve_indirect_ref for a root that's being deleted
      	->search for our key, attempt to lock a node, DEADLOCK
      
        Task 2
        btrfs_drop_snapshot
        ->walk down to a leaf, lock it, walk up, lock node
         ->end transaction
          ->start transaction
            -> wait_cur_trans
      
        Task 3
        btrfs_commit_transaction
        ->wait_event(cur_trans->write_wait, num_writers == 1) DEADLOCK
      
      We are holding a transaction open in btrfs_ioctl_logical_to_ino while we
      try to resolve our references.  btrfs_drop_snapshot() holds onto its
      locks while it stops and starts transaction handles, because it assumes
      nobody is going to touch the root now.  Commit just does what commit
      does, waiting for the writers to finish, blocking any new trans handles
      from starting.
      
      Fix this by making the backref code not try to resolve backrefs of roots
      that are currently being deleted.  This will keep us from walking into a
      snapshot that's currently being deleted.
      
      This problem was harder to hit before because we rarely broke out of the
      snapshot delete halfway through, but with my delayed ref throttling code
      it happened much more often.  However we've always been able to do this,
      so it's not a new problem.
      
      Fixes: 8da6d581 ("Btrfs: added btrfs_find_all_roots()")
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      39dba873
    • Josef Bacik's avatar
      btrfs: track reloc roots based on their commit root bytenr · ea287ab1
      Josef Bacik authored
      We always search the commit root of the extent tree for looking up back
      references, however we track the reloc roots based on their current
      bytenr.
      
      This is wrong, if we commit the transaction between relocating tree
      blocks we could end up in this code in build_backref_tree
      
        if (key.objectid == key.offset) {
      	  /*
      	   * Only root blocks of reloc trees use backref
      	   * pointing to itself.
      	   */
      	  root = find_reloc_root(rc, cur->bytenr);
      	  ASSERT(root);
      	  cur->root = root;
      	  break;
        }
      
      find_reloc_root() is looking based on the bytenr we had in the commit
      root, but if we've COWed this reloc root we will not find that bytenr,
      and we will trip over the ASSERT(root).
      
      Fix this by using the commit_root->start bytenr for indexing the commit
      root.  Then we change the __update_reloc_root() caller to be used when
      we switch the commit root for the reloc root during commit.
      
      This fixes the panic I was seeing when we started throttling relocation
      for delayed refs.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ea287ab1
    • Josef Bacik's avatar
      btrfs: restart relocate_tree_blocks properly · 50dbbb71
      Josef Bacik authored
      There are two bugs here, but fixing them independently would just result
      in pain if you happened to bisect between the two patches.
      
      First is how we handle the -EAGAIN from relocate_tree_block().  We don't
      set error, unless we happen to be the first node, which makes no sense,
      I have no idea what the code was trying to accomplish here.
      
      We in fact _do_ want err set here so that we know we need to restart in
      relocate_block_group().  Also we need finish_pending_nodes() to not
      actually call link_to_upper(), because we didn't actually relocate the
      block.
      
      And then if we do get -EAGAIN we do not want to set our backref cache
      last_trans to the one before ours.  This would force us to update our
      backref cache if we didn't cross transaction ids, which would mean we'd
      have some nodes updated to their new_bytenr, but still able to find
      their old bytenr because we're searching the same commit root as the
      last time we went through relocate_tree_blocks.
      
      Fixing these two things keeps us from panicing when we start breaking
      out of relocate_tree_blocks() either for delayed ref flushing or enospc.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      50dbbb71
    • Josef Bacik's avatar
      btrfs: reloc: reorder reservation before root selection · 5f6b2e5c
      Josef Bacik authored
      Since we're not only checking for metadata reservations but also if we
      need to throttle our delayed ref generation, reorder
      reserve_metadata_space() above the select_one_root() call in
      relocate_tree_block().
      
      The reason we want this is because select_reloc_root() will mess with
      the backref cache, and if we're going to bail we want to be able to
      cleanly remove this node from the backref cache and come back along to
      regenerate it.  Move it up so this is the first thing we do to make
      restarting cleaner.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5f6b2e5c
    • Josef Bacik's avatar
      btrfs: do not readahead in build_backref_tree · d7ff00f6
      Josef Bacik authored
      Here we are just searching down to the bytenr we're building the backref
      tree for, and all of it's paths to the roots.  These bytenrs are not
      guaranteed to be anywhere near each other, so readahead just generates
      extra latency.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d7ff00f6
    • Josef Bacik's avatar
      btrfs: do not use readahead for running delayed refs · cd22a51c
      Josef Bacik authored
      Readahead will generate a lot of extra reads for adjacent nodes, but
      when running delayed refs we have no idea if the next ref is going to be
      adjacent or not, so this potentially just generates a lot of extra IO.
      To make matters worse each ref is truly just looking for one item, it
      doesn't generally search forward, so we simply don't need it here.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cd22a51c
    • Nikolay Borisov's avatar
      btrfs: Remove async_transid from btrfs_mksubvol/create_subvol/create_snapshot · 9babda9f
      Nikolay Borisov authored
      With BTRFS_SUBVOL_CREATE_ASYNC support remove it's no longer required to
      pass the async_transid parameter so remove it and any code using it.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9babda9f
    • Nikolay Borisov's avatar
      btrfs: Remove transid argument from btrfs_ioctl_snap_create_transid · 5d54c67e
      Nikolay Borisov authored
      btrfs_ioctl_snap_create_transid no longer takes a transid argument, so
      remove it and rename the function to __btrfs_ioctl_snap_create to
      reflect it's an internal, worker function.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5d54c67e
    • Nikolay Borisov's avatar
      btrfs: Remove BTRFS_SUBVOL_CREATE_ASYNC support · 9c1036fd
      Nikolay Borisov authored
      This functionality was deprecated in kernel 5.4. Since no one has
      complained of the impending removal it's time we did so.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add comment ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9c1036fd
    • Josef Bacik's avatar
      btrfs: kill the subvol_srcu · c75e8394
      Josef Bacik authored
      Now that we have proper root ref counting everywhere we can kill the
      subvol_srcu.
      
      * removal of fs_info::subvol_srcu reduces size of fs_info by 1176 bytes
      
      * the refcount_t used for the references checks for accidental 0->1
        in cases where the root lifetime would not be properly protected
      
      * there's a leak detector for roots to catch unfreed roots at umount
        time
      
      * SRCU served us well over the years but is was not a proper
        synchronization mechanism for some cases
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c75e8394
    • Josef Bacik's avatar
      btrfs: make btrfs_cleanup_fs_roots use the radix tree lock · efc34534
      Josef Bacik authored
      The radix root is primarily protected by the fs_roots_radix_lock, so use
      that to lookup and get a ref on all of our fs roots in
      btrfs_cleanup_fs_roots. The tree reference is taken in the protected
      section as before.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      efc34534
    • Josef Bacik's avatar
      btrfs: don't take an extra root ref at allocation time · 4785e24f
      Josef Bacik authored
      Now that all the users of roots take references for them we can drop the
      extra root ref we've been taking.  Before we had roots at 2 refs for the
      life of the file system, one for the radix tree, and one simply for
      existing.  Now that we have proper ref accounting in all places that use
      roots we can drop this extra ref simply for existing as we no longer
      need it.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4785e24f
    • Josef Bacik's avatar
      btrfs: hold a ref on the root on the dead roots list · dc9492c1
      Josef Bacik authored
      At the point we add a root to the dead roots list we have no open inodes
      for that root, so we need to hold a ref on that root to keep it from
      disappearing.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dc9492c1
    • Josef Bacik's avatar
      btrfs: make inodes hold a ref on their roots · 5c8fd99f
      Josef Bacik authored
      If we make sure all the inodes have refs on their root we don't have to
      worry about the root disappearing while we have open inodes.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5c8fd99f
    • Josef Bacik's avatar
      btrfs: move the root freeing stuff into btrfs_put_root · 8c38938c
      Josef Bacik authored
      There are a few different ways to free roots, either you allocated them
      yourself and you just do
      
      free_extent_buffer(root->node);
      free_extent_buffer(root->commit_node);
      btrfs_put_root(root);
      
      Which is the pattern for log roots.  Or for snapshots/subvolumes that
      are being dropped you simply call btrfs_free_fs_root() which does all
      the cleanup for you.
      
      Unify this all into btrfs_put_root(), so that we don't free up things
      associated with the root until the last reference is dropped.  This
      makes the root freeing code much more significant.
      
      The only caveat is at close_ctree() time we have to free the extent
      buffers for all of our main roots (extent_root, chunk_root, etc) because
      we have to drop the btree_inode and we'll run into issues if we hold
      onto those nodes until ->kill_sb() time.  This will be addressed in the
      future when we kill the btree_inode.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8c38938c
    • Josef Bacik's avatar
      btrfs: move ino_cache_inode dropping out of btrfs_free_fs_root · 0e996e7f
      Josef Bacik authored
      We are going to make root life be controlled soley by refcounting, and
      inodes will be one of the things that hold a ref on the root.  This
      means we need to handle dropping the ino_cache_inode outside of the root
      freeing logic, so move it into btrfs_drop_and_free_fs_root() so it is
      cleaned up properly on unmount.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0e996e7f
    • Josef Bacik's avatar
      btrfs: make the extent buffer leak check per fs info · 3fd63727
      Josef Bacik authored
      I'm going to make the entire destruction of btrfs_root's controlled by
      their refcount, so it will be helpful to notice if we're leaking their
      eb's on umount.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3fd63727
    • Josef Bacik's avatar
      btrfs: remove a BUG_ON() from merge_reloc_roots() · 7b7b7431
      Josef Bacik authored
      This was pretty subtle, we default to reloc roots having 0 root refs, so
      if we crash in the middle of the relocation they can just be deleted.
      If we successfully complete the relocation operations we'll set our root
      refs to 1 in prepare_to_merge() and then go on to merge_reloc_roots().
      
      At prepare_to_merge() time if any of the reloc roots have a 0 reference
      still, we will remove that reloc root from our reloc root rb tree, and
      then clean it up later.
      
      However this only happens if we successfully start a transaction.  If
      we've aborted previously we will skip this step completely, and only
      have reloc roots with a reference count of 0, but were never properly
      removed from the reloc control's rb tree.
      
      This isn't a problem per-se, our references are held by the list the
      reloc roots are on, and by the original root the reloc root belongs to.
      If we end up in this situation all the reloc roots will be added to the
      dirty_reloc_list, and then properly dropped at that point.  The reloc
      control will be free'd and the rb tree is no longer used.
      
      There were two options when fixing this, one was to remove the BUG_ON(),
      the other was to make prepare_to_merge() handle the case where we
      couldn't start a trans handle.
      
      IMO this is the cleaner solution.  I started with handling the error in
      prepare_to_merge(), but it turned out super ugly.  And in the end this
      BUG_ON() simply doesn't matter, the cleanup was happening properly, we
      were just panicing because this BUG_ON() only matters in the success
      case.  So I've opted to just remove it and add a comment where it was.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7b7b7431
    • Josef Bacik's avatar
      btrfs: hold a ref on the root->reloc_root · f44deb74
      Josef Bacik authored
      We previously were relying on root->reloc_root to be cleaned up by the
      drop snapshot, or the error handling.  However if btrfs_drop_snapshot()
      failed it wouldn't drop the ref for the root.  Also we sort of depend on
      the right thing to happen with moving reloc roots between lists and the
      fs root they belong to, which makes it hard to figure out who owns the
      reference.
      
      Fix this by explicitly holding a reference on the reloc root for
      roo->reloc_root.  This means that we hold two references on reloc roots,
      one for whichever reloc_roots list it's attached to, and the
      root->reloc_root we're on.
      
      This makes it easier to reason out who owns a reference on the root, and
      when it needs to be dropped.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f44deb74
    • Josef Bacik's avatar
      btrfs: clear DEAD_RELOC_TREE before dropping the reloc root · f28de8d8
      Josef Bacik authored
      The DEAD_RELOC_TREE flag is in place in order to avoid a use after free
      in init_reloc_root, tracking the presence of reloc_root.  However adding
      the explicit tree references in previous patches makes the use after
      free impossible because at this point we no longer have a reloc_control
      set on the fs_info and thus cannot enter the function.
      
      So move this to be coupled with clearing the root->reloc_root so we're
      consistent with all other operations of the reloc root.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f28de8d8
    • Josef Bacik's avatar
      btrfs: free the reloc_control in a consistent way · 1a0afa0e
      Josef Bacik authored
      If we have an error while processing the reloc roots we could leak roots
      that were added to rc->reloc_roots before we hit the error.  We could
      have also not removed the reloc tree mapping from our rb_tree, so clean
      up any remaining nodes in the reloc root rb_tree.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ use rbtree_postorder_for_each_entry_safe ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1a0afa0e
    • Josef Bacik's avatar
      btrfs: do not init a reloc root if we aren't relocating · 2abc726a
      Josef Bacik authored
      We previously were checking if the root had a dead root before accessing
      root->reloc_root in order to avoid a use-after-free type bug.  However
      this scenario happens after we've unset the reloc control, so we would
      have been saved if we'd simply checked for fs_info->reloc_control.  At
      this point during relocation we no longer need to be creating new reloc
      roots, so simply move this check above the reloc_root checks to avoid
      any future races and confusion.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2abc726a
    • Josef Bacik's avatar
      btrfs: reloc: clean dirty subvols if we fail to start a transaction · 6217b0fa
      Josef Bacik authored
      If we do merge_reloc_roots() we could insert a few roots onto the dirty
      subvol roots list, where we hold a ref on them.  If we fail to start the
      transaction we need to run clean_dirty_subvols() in order to cleanup the
      refs.
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6217b0fa