1. 25 Sep, 2020 1 commit
  2. 22 Sep, 2020 1 commit
    • Anand Jain's avatar
      btrfs: fix put of uninitialized kobject after seed device delete · b5ddcffa
      Anand Jain authored
      The following test case leads to NULL kobject free error:
      
        mount seed /mnt
        add sprout to /mnt
        umount /mnt
        mount sprout to /mnt
        delete seed
      
        kobject: '(null)' (00000000dd2b87e4): is not initialized, yet kobject_put() is being called.
        WARNING: CPU: 1 PID: 15784 at lib/kobject.c:736 kobject_put+0x80/0x350
        RIP: 0010:kobject_put+0x80/0x350
        ::
        Call Trace:
        btrfs_sysfs_remove_devices_dir+0x6e/0x160 [btrfs]
        btrfs_rm_device.cold+0xa8/0x298 [btrfs]
        btrfs_ioctl+0x206c/0x22a0 [btrfs]
        ksys_ioctl+0xe2/0x140
        __x64_sys_ioctl+0x1e/0x29
        do_syscall_64+0x96/0x150
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f4047c6288b
        ::
      
      This is because, at the end of the seed device-delete, we try to remove
      the seed's devid sysfs entry. But for the seed devices under the sprout
      fs, we don't initialize the devid kobject yet. So add a kobject state
      check, which takes care of the bug.
      
      Fixes: 668e48af ("btrfs: sysfs, add devid/dev_state kobject and device attributes")
      CC: stable@vger.kernel.org # 5.6+
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b5ddcffa
  3. 21 Sep, 2020 1 commit
    • Johannes Thumshirn's avatar
      btrfs: fix overflow when copying corrupt csums for a message · 35be8851
      Johannes Thumshirn authored
      Syzkaller reported a buffer overflow in btree_readpage_end_io_hook()
      when loop mounting a crafted image:
      
        detected buffer overflow in memcpy
        ------------[ cut here ]------------
        kernel BUG at lib/string.c:1129!
        invalid opcode: 0000 [#1] PREEMPT SMP KASAN
        CPU: 1 PID: 26 Comm: kworker/u4:2 Not tainted 5.9.0-rc4-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Workqueue: btrfs-endio-meta btrfs_work_helper
        RIP: 0010:fortify_panic+0xf/0x20 lib/string.c:1129
        RSP: 0018:ffffc90000e27980 EFLAGS: 00010286
        RAX: 0000000000000022 RBX: ffff8880a80dca64 RCX: 0000000000000000
        RDX: ffff8880a90860c0 RSI: ffffffff815dba07 RDI: fffff520001c4f22
        RBP: ffff8880a80dca00 R08: 0000000000000022 R09: ffff8880ae7318e7
        R10: 0000000000000000 R11: 0000000000077578 R12: 00000000ffffff6e
        R13: 0000000000000008 R14: ffffc90000e27a40 R15: 1ffff920001c4f3c
        FS:  0000000000000000(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000557335f440d0 CR3: 000000009647d000 CR4: 00000000001506e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         memcpy include/linux/string.h:405 [inline]
         btree_readpage_end_io_hook.cold+0x206/0x221 fs/btrfs/disk-io.c:642
         end_bio_extent_readpage+0x4de/0x10c0 fs/btrfs/extent_io.c:2854
         bio_endio+0x3cf/0x7f0 block/bio.c:1449
         end_workqueue_fn+0x114/0x170 fs/btrfs/disk-io.c:1695
         btrfs_work_helper+0x221/0xe20 fs/btrfs/async-thread.c:318
         process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
         worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
         kthread+0x3b5/0x4a0 kernel/kthread.c:292
         ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
        Modules linked in:
        ---[ end trace b68924293169feef ]---
        RIP: 0010:fortify_panic+0xf/0x20 lib/string.c:1129
        RSP: 0018:ffffc90000e27980 EFLAGS: 00010286
        RAX: 0000000000000022 RBX: ffff8880a80dca64 RCX: 0000000000000000
        RDX: ffff8880a90860c0 RSI: ffffffff815dba07 RDI: fffff520001c4f22
        RBP: ffff8880a80dca00 R08: 0000000000000022 R09: ffff8880ae7318e7
        R10: 0000000000000000 R11: 0000000000077578 R12: 00000000ffffff6e
        R13: 0000000000000008 R14: ffffc90000e27a40 R15: 1ffff920001c4f3c
        FS:  0000000000000000(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f95b7c4d008 CR3: 000000009647d000 CR4: 00000000001506e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      The overflow happens, because in btree_readpage_end_io_hook() we assume
      that we have found a 4 byte checksum instead of the real possible 32
      bytes we have for the checksums.
      
      With the fix applied:
      
      [   35.726623] BTRFS: device fsid 815caf9a-dc43-4d2a-ac54-764b8333d765 devid 1 transid 5 /dev/loop0 scanned by syz-repro (215)
      [   35.738994] BTRFS info (device loop0): disk space caching is enabled
      [   35.738998] BTRFS info (device loop0): has skinny extents
      [   35.743337] BTRFS warning (device loop0): loop0 checksum verify failed on 1052672 wanted 0xf9c035fc8d239a54 found 0x67a25c14b7eabcf9 level 0
      [   35.743420] BTRFS error (device loop0): failed to read chunk root
      [   35.745899] BTRFS error (device loop0): open_ctree failed
      
      Reported-by: syzbot+e864a35d361e1d4e29a5@syzkaller.appspotmail.com
      Fixes: d5178578 ("btrfs: directly call into crypto framework for checksumming")
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      35be8851
  4. 14 Sep, 2020 1 commit
  5. 07 Sep, 2020 4 commits
    • Filipe Manana's avatar
      btrfs: fix NULL pointer dereference after failure to create snapshot · 2d892ccd
      Filipe Manana authored
      When trying to get a new fs root for a snapshot during the transaction
      at transaction.c:create_pending_snapshot(), if btrfs_get_new_fs_root()
      fails we leave "pending->snap" pointing to an error pointer, and then
      later at ioctl.c:create_snapshot() we dereference that pointer, resulting
      in a crash:
      
        [12264.614689] BUG: kernel NULL pointer dereference, address: 00000000000007c4
        [12264.615650] #PF: supervisor write access in kernel mode
        [12264.616487] #PF: error_code(0x0002) - not-present page
        [12264.617436] PGD 0 P4D 0
        [12264.618328] Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        [12264.619150] CPU: 0 PID: 2310635 Comm: fsstress Tainted: G        W         5.9.0-rc3-btrfs-next-67 #1
        [12264.619960] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        [12264.621769] RIP: 0010:btrfs_mksubvol+0x438/0x4a0 [btrfs]
        [12264.622528] Code: bc ef ff ff (...)
        [12264.624092] RSP: 0018:ffffaa6fc7277cd8 EFLAGS: 00010282
        [12264.624669] RAX: 00000000fffffff4 RBX: ffff9d3e8f151a60 RCX: 0000000000000000
        [12264.625249] RDX: 0000000000000001 RSI: ffffffff9d56c9be RDI: fffffffffffffff4
        [12264.625830] RBP: ffff9d3e8f151b48 R08: 0000000000000000 R09: 0000000000000000
        [12264.626413] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffff4
        [12264.626994] R13: ffff9d3ede380538 R14: ffff9d3ede380500 R15: ffff9d3f61b2eeb8
        [12264.627582] FS:  00007f140d5d8200(0000) GS:ffff9d3fb5e00000(0000) knlGS:0000000000000000
        [12264.628176] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [12264.628773] CR2: 00000000000007c4 CR3: 000000020f8e8004 CR4: 00000000003706f0
        [12264.629379] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [12264.629994] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [12264.630594] Call Trace:
        [12264.631227]  btrfs_mksnapshot+0x7b/0xb0 [btrfs]
        [12264.631840]  __btrfs_ioctl_snap_create+0x16f/0x1a0 [btrfs]
        [12264.632458]  btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
        [12264.633078]  btrfs_ioctl+0x1864/0x3130 [btrfs]
        [12264.633689]  ? do_sys_openat2+0x1a7/0x2d0
        [12264.634295]  ? kmem_cache_free+0x147/0x3a0
        [12264.634899]  ? __x64_sys_ioctl+0x83/0xb0
        [12264.635488]  __x64_sys_ioctl+0x83/0xb0
        [12264.636058]  do_syscall_64+0x33/0x80
        [12264.636616]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        (gdb) list *(btrfs_mksubvol+0x438)
        0x7c7b8 is in btrfs_mksubvol (fs/btrfs/ioctl.c:858).
        853		ret = 0;
        854		pending_snapshot->anon_dev = 0;
        855	fail:
        856		/* Prevent double freeing of anon_dev */
        857		if (ret && pending_snapshot->snap)
        858			pending_snapshot->snap->anon_dev = 0;
        859		btrfs_put_root(pending_snapshot->snap);
        860		btrfs_subvolume_release_metadata(root, &pending_snapshot->block_rsv);
        861	free_pending:
        862		if (pending_snapshot->anon_dev)
      
      So fix this by setting "pending->snap" to NULL if we get an error from the
      call to btrfs_get_new_fs_root() at transaction.c:create_pending_snapshot().
      
      Fixes: 2dfb1e43 ("btrfs: preallocate anon block device at first phase of snapshot creation")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2d892ccd
    • Josef Bacik's avatar
      btrfs: free data reloc tree on failed mount · 9e3aa805
      Josef Bacik authored
      While testing a weird problem with -o degraded, I noticed I was getting
      leaked root errors
      
        BTRFS warning (device loop0): writable mount is not allowed due to too many missing devices
        BTRFS error (device loop0): open_ctree failed
        BTRFS error (device loop0): leaked root -9-0 refcount 1
      
      This is the DATA_RELOC root, which gets read before the other fs roots,
      but is included in the fs roots radix tree.  Handle this by adding a
      btrfs_drop_and_free_fs_root() on the data reloc root if it exists.  This
      is ok to do here if we fail further up because we will only drop the ref
      if we delete the root from the radix tree, and all other cleanup won't
      be duplicated.
      
      CC: stable@vger.kernel.org # 5.8+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9e3aa805
    • Qu Wenruo's avatar
      btrfs: require only sector size alignment for parent eb bytenr · ea57788e
      Qu Wenruo authored
      [BUG]
      A completely sane converted fs will cause kernel warning at balance
      time:
      
        [ 1557.188633] BTRFS info (device sda7): relocating block group 8162107392 flags data
        [ 1563.358078] BTRFS info (device sda7): found 11722 extents
        [ 1563.358277] BTRFS info (device sda7): leaf 7989321728 gen 95 total ptrs 213 free space 3458 owner 2
        [ 1563.358280] 	item 0 key (7984947200 169 0) itemoff 16250 itemsize 33
        [ 1563.358281] 		extent refs 1 gen 90 flags 2
        [ 1563.358282] 		ref#0: tree block backref root 4
        [ 1563.358285] 	item 1 key (7985602560 169 0) itemoff 16217 itemsize 33
        [ 1563.358286] 		extent refs 1 gen 93 flags 258
        [ 1563.358287] 		ref#0: shared block backref parent 7985602560
        [ 1563.358288] 			(parent 7985602560 is NOT ALIGNED to nodesize 16384)
        [ 1563.358290] 	item 2 key (7985635328 169 0) itemoff 16184 itemsize 33
        ...
        [ 1563.358995] BTRFS error (device sda7): eb 7989321728 invalid extent inline ref type 182
        [ 1563.358996] ------------[ cut here ]------------
        [ 1563.359005] WARNING: CPU: 14 PID: 2930 at 0xffffffff9f231766
      
      Then with transaction abort, and obviously failed to balance the fs.
      
      [CAUSE]
      That mentioned inline ref type 182 is completely sane, it's
      BTRFS_SHARED_BLOCK_REF_KEY, it's some extra check making kernel to
      believe it's invalid.
      
      Commit 64ecdb64 ("Btrfs: add one more sanity check for shared ref
      type") introduced extra checks for backref type.
      
      One of the requirement is, parent bytenr must be aligned to node size,
      which is not correct.
      
      One example is like this:
      
      0	1G  1G+4K		2G 2G+4K
      	|   |///////////////////|//|  <- A chunk starts at 1G+4K
                  |   |	<- A tree block get reserved at bytenr 1G+4K
      
      Then we have a valid tree block at bytenr 1G+4K, but not aligned to
      nodesize (16K).
      
      Such chunk is not ideal, but current kernel can handle it pretty well.
      We may warn about such tree block in the future, but should not reject
      them.
      
      [FIX]
      Change the alignment requirement from node size alignment to sector size
      alignment.
      
      Also, to make our lives a little easier, also output @iref when
      btrfs_get_extent_inline_ref_type() failed, so we can locate the item
      easier.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205475
      Fixes: 64ecdb64 ("Btrfs: add one more sanity check for shared ref type")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      [ update comments and messages ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ea57788e
    • Josef Bacik's avatar
      btrfs: fix lockdep splat in add_missing_dev · fccc0007
      Josef Bacik authored
      Nikolay reported a lockdep splat in generic/476 that I could reproduce
      with btrfs/187.
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0-rc2+ #1 Tainted: G        W
        ------------------------------------------------------
        kswapd0/100 is trying to acquire lock:
        ffff9e8ef38b6268 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
      
        but task is already holding lock:
        ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (fs_reclaim){+.+.}-{0:0}:
      	 fs_reclaim_acquire+0x65/0x80
      	 slab_pre_alloc_hook.constprop.0+0x20/0x200
      	 kmem_cache_alloc_trace+0x3a/0x1a0
      	 btrfs_alloc_device+0x43/0x210
      	 add_missing_dev+0x20/0x90
      	 read_one_chunk+0x301/0x430
      	 btrfs_read_sys_array+0x17b/0x1b0
      	 open_ctree+0xa62/0x1896
      	 btrfs_mount_root.cold+0x12/0xea
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 vfs_kern_mount.part.0+0x71/0xb0
      	 btrfs_mount+0x10d/0x379
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 path_mount+0x434/0xc00
      	 __x64_sys_mount+0xe3/0x120
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 btrfs_chunk_alloc+0x125/0x3a0
      	 find_free_extent+0xdf6/0x1210
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb0/0x310
      	 alloc_tree_block_no_bg_flush+0x4a/0x60
      	 __btrfs_cow_block+0x11a/0x530
      	 btrfs_cow_block+0x104/0x220
      	 btrfs_search_slot+0x52e/0x9d0
      	 btrfs_lookup_inode+0x2a/0x8f
      	 __btrfs_update_delayed_inode+0x80/0x240
      	 btrfs_commit_inode_delayed_inode+0x119/0x120
      	 btrfs_evict_inode+0x357/0x500
      	 evict+0xcf/0x1f0
      	 vfs_rmdir.part.0+0x149/0x160
      	 do_rmdir+0x136/0x1a0
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x1184/0x1fa0
      	 lock_acquire+0xa4/0x3d0
      	 __mutex_lock+0x7e/0x7e0
      	 __btrfs_release_delayed_node.part.0+0x3f/0x330
      	 btrfs_evict_inode+0x24c/0x500
      	 evict+0xcf/0x1f0
      	 dispose_list+0x48/0x70
      	 prune_icache_sb+0x44/0x50
      	 super_cache_scan+0x161/0x1e0
      	 do_shrink_slab+0x178/0x3c0
      	 shrink_slab+0x17c/0x290
      	 shrink_node+0x2b2/0x6d0
      	 balance_pgdat+0x30a/0x670
      	 kswapd+0x213/0x4c0
      	 kthread+0x138/0x160
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> &fs_info->chunk_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(&fs_info->chunk_mutex);
      				 lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/100:
         #0: ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
         #1: ffffffffa9d65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
         #2: ffff9e8e9da260e0 (&type->s_umount_key#48){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
      
        stack backtrace:
        CPU: 1 PID: 100 Comm: kswapd0 Tainted: G        W         5.9.0-rc2+ #1
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x92/0xc8
         check_noncircular+0x12d/0x150
         __lock_acquire+0x1184/0x1fa0
         lock_acquire+0xa4/0x3d0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         __mutex_lock+0x7e/0x7e0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? lock_acquire+0xa4/0x3d0
         ? btrfs_evict_inode+0x11e/0x500
         ? find_held_lock+0x2b/0x80
         __btrfs_release_delayed_node.part.0+0x3f/0x330
         btrfs_evict_inode+0x24c/0x500
         evict+0xcf/0x1f0
         dispose_list+0x48/0x70
         prune_icache_sb+0x44/0x50
         super_cache_scan+0x161/0x1e0
         do_shrink_slab+0x178/0x3c0
         shrink_slab+0x17c/0x290
         shrink_node+0x2b2/0x6d0
         balance_pgdat+0x30a/0x670
         kswapd+0x213/0x4c0
         ? _raw_spin_unlock_irqrestore+0x46/0x60
         ? add_wait_queue_exclusive+0x70/0x70
         ? balance_pgdat+0x670/0x670
         kthread+0x138/0x160
         ? kthread_create_worker_on_cpu+0x40/0x40
         ret_from_fork+0x1f/0x30
      
      This is because we are holding the chunk_mutex when we call
      btrfs_alloc_device, which does a GFP_KERNEL allocation.  We don't want
      to switch that to a GFP_NOFS lock because this is the only place where
      it matters.  So instead use memalloc_nofs_save() around the allocation
      in order to avoid the lockdep splat.
      Reported-by: default avatarNikolay Borisov <nborisov@suse.com>
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fccc0007
  6. 27 Aug, 2020 7 commits
    • Qu Wenruo's avatar
      btrfs: tree-checker: fix the error message for transid error · f96d6960
      Qu Wenruo authored
      The error message for inode transid is the same as for inode generation,
      which makes us unable to detect the real problem.
      Reported-by: default avatarTyler Richmond <t.d.richmond@gmail.com>
      Fixes: 496245ca ("btrfs: tree-checker: Verify inode item")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f96d6960
    • Josef Bacik's avatar
      btrfs: set the lockdep class for log tree extent buffers · d3beaa25
      Josef Bacik authored
      These are special extent buffers that get rewound in order to lookup
      the state of the tree at a specific point in time.  As such they do not
      go through the normal initialization paths that set their lockdep class,
      so handle them appropriately when they are created and before they are
      locked.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d3beaa25
    • Josef Bacik's avatar
      btrfs: set the correct lockdep class for new nodes · ad244665
      Josef Bacik authored
      When flipping over to the rw_semaphore I noticed I'd get a lockdep splat
      in replace_path(), which is weird because we're swapping the reloc root
      with the actual target root.  Turns out this is because we're using the
      root->root_key.objectid as the root id for the newly allocated tree
      block when setting the lockdep class, however we need to be using the
      actual owner of this new block, which is saved in owner.
      
      The affected path is through btrfs_copy_root as all other callers of
      btrfs_alloc_tree_block (which calls init_new_buffer) have root_objectid
      == root->root_key.objectid .
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ad244665
    • Josef Bacik's avatar
      btrfs: allocate scrub workqueues outside of locks · e89c4a9c
      Josef Bacik authored
      I got the following lockdep splat while testing:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc7-00172-g021118712e59 #932 Not tainted
        ------------------------------------------------------
        btrfs/229626 is trying to acquire lock:
        ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
      
        but task is already holding lock:
        ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_scrub_dev+0x11c/0x630
      	 btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
      	 btrfs_ioctl+0x2799/0x30a0
      	 ksys_ioctl+0x83/0xc0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_run_dev_stats+0x49/0x480
      	 commit_cowonly_roots+0xb5/0x2a0
      	 btrfs_commit_transaction+0x516/0xa60
      	 sync_filesystem+0x6b/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0xe/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x29/0x60
      	 cleanup_mnt+0xb8/0x140
      	 task_work_run+0x6d/0xb0
      	 __prepare_exit_to_usermode+0x1cc/0x1e0
      	 do_syscall_64+0x5c/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_commit_transaction+0x4bb/0xa60
      	 sync_filesystem+0x6b/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0xe/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x29/0x60
      	 cleanup_mnt+0xb8/0x140
      	 task_work_run+0x6d/0xb0
      	 __prepare_exit_to_usermode+0x1cc/0x1e0
      	 do_syscall_64+0x5c/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_record_root_in_trans+0x43/0x70
      	 start_transaction+0xd1/0x5d0
      	 btrfs_dirty_inode+0x42/0xd0
      	 touch_atime+0xa1/0xd0
      	 btrfs_file_mmap+0x3f/0x60
      	 mmap_region+0x3a4/0x640
      	 do_mmap+0x376/0x580
      	 vm_mmap_pgoff+0xd5/0x120
      	 ksys_mmap_pgoff+0x193/0x230
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #3 (&mm->mmap_lock#2){++++}-{3:3}:
      	 __might_fault+0x68/0x90
      	 _copy_to_user+0x1e/0x80
      	 perf_read+0x141/0x2c0
      	 vfs_read+0xad/0x1b0
      	 ksys_read+0x5f/0xe0
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (&cpuctx_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 perf_event_init_cpu+0x88/0x150
      	 perf_event_init+0x1db/0x20b
      	 start_kernel+0x3ae/0x53c
      	 secondary_startup_64+0xa4/0xb0
      
        -> #1 (pmus_lock){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 perf_event_init_cpu+0x4f/0x150
      	 cpuhp_invoke_callback+0xb1/0x900
      	 _cpu_up.constprop.26+0x9f/0x130
      	 cpu_up+0x7b/0xc0
      	 bringup_nonboot_cpus+0x4f/0x60
      	 smp_init+0x26/0x71
      	 kernel_init_freeable+0x110/0x258
      	 kernel_init+0xa/0x103
      	 ret_from_fork+0x1f/0x30
      
        -> #0 (cpu_hotplug_lock){++++}-{0:0}:
      	 __lock_acquire+0x1272/0x2310
      	 lock_acquire+0x9e/0x360
      	 cpus_read_lock+0x39/0xb0
      	 alloc_workqueue+0x378/0x450
      	 __btrfs_alloc_workqueue+0x15d/0x200
      	 btrfs_alloc_workqueue+0x51/0x160
      	 scrub_workers_get+0x5a/0x170
      	 btrfs_scrub_dev+0x18c/0x630
      	 btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
      	 btrfs_ioctl+0x2799/0x30a0
      	 ksys_ioctl+0x83/0xc0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
        Chain exists of:
          cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&fs_info->scrub_lock);
      				 lock(&fs_devs->device_list_mutex);
      				 lock(&fs_info->scrub_lock);
          lock(cpu_hotplug_lock);
      
         *** DEADLOCK ***
      
        2 locks held by btrfs/229626:
         #0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
         #1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
      
        stack backtrace:
        CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
        Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
        Call Trace:
         dump_stack+0x78/0xa0
         check_noncircular+0x165/0x180
         __lock_acquire+0x1272/0x2310
         lock_acquire+0x9e/0x360
         ? alloc_workqueue+0x378/0x450
         cpus_read_lock+0x39/0xb0
         ? alloc_workqueue+0x378/0x450
         alloc_workqueue+0x378/0x450
         ? rcu_read_lock_sched_held+0x52/0x80
         __btrfs_alloc_workqueue+0x15d/0x200
         btrfs_alloc_workqueue+0x51/0x160
         scrub_workers_get+0x5a/0x170
         btrfs_scrub_dev+0x18c/0x630
         ? start_transaction+0xd1/0x5d0
         btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
         btrfs_ioctl+0x2799/0x30a0
         ? do_sigaction+0x102/0x250
         ? lockdep_hardirqs_on_prepare+0xca/0x160
         ? _raw_spin_unlock_irq+0x24/0x30
         ? trace_hardirqs_on+0x1c/0xe0
         ? _raw_spin_unlock_irq+0x24/0x30
         ? do_sigaction+0x102/0x250
         ? ksys_ioctl+0x83/0xc0
         ksys_ioctl+0x83/0xc0
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x50/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This happens because we're allocating the scrub workqueues under the
      scrub and device list mutex, which brings in a whole host of other
      dependencies.
      
      Because the work queue allocation is done with GFP_KERNEL, it can
      trigger reclaim, which can lead to a transaction commit, which in turns
      needs the device_list_mutex, it can lead to a deadlock. A different
      problem for which this fix is a solution.
      
      Fix this by moving the actual allocation outside of the
      scrub lock, and then only take the lock once we're ready to actually
      assign them to the fs_info.  We'll now have to cleanup the workqueues in
      a few more places, so I've added a helper to do the refcount dance to
      safely free the workqueues.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e89c4a9c
    • Josef Bacik's avatar
      btrfs: fix potential deadlock in the search ioctl · a48b73ec
      Josef Bacik authored
      With the conversion of the tree locks to rwsem I got the following
      lockdep splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Not tainted
        ------------------------------------------------------
        compsize/11122 is trying to acquire lock:
        ffff889fabca8768 (&mm->mmap_lock#2){++++}-{3:3}, at: __might_fault+0x3e/0x90
      
        but task is already holding lock:
        ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (btrfs-fs-00){++++}-{3:3}:
      	 down_write_nested+0x3b/0x70
      	 __btrfs_tree_lock+0x24/0x120
      	 btrfs_search_slot+0x756/0x990
      	 btrfs_lookup_inode+0x3a/0xb4
      	 __btrfs_update_delayed_inode+0x93/0x270
      	 btrfs_async_run_delayed_root+0x168/0x230
      	 btrfs_work_helper+0xd4/0x570
      	 process_one_work+0x2ad/0x5f0
      	 worker_thread+0x3a/0x3d0
      	 kthread+0x133/0x150
      	 ret_from_fork+0x1f/0x30
      
        -> #1 (&delayed_node->mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_delayed_update_inode+0x50/0x440
      	 btrfs_update_inode+0x8a/0xf0
      	 btrfs_dirty_inode+0x5b/0xd0
      	 touch_atime+0xa1/0xd0
      	 btrfs_file_mmap+0x3f/0x60
      	 mmap_region+0x3a4/0x640
      	 do_mmap+0x376/0x580
      	 vm_mmap_pgoff+0xd5/0x120
      	 ksys_mmap_pgoff+0x193/0x230
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&mm->mmap_lock#2){++++}-{3:3}:
      	 __lock_acquire+0x1272/0x2310
      	 lock_acquire+0x9e/0x360
      	 __might_fault+0x68/0x90
      	 _copy_to_user+0x1e/0x80
      	 copy_to_sk.isra.32+0x121/0x300
      	 search_ioctl+0x106/0x200
      	 btrfs_ioctl_tree_search_v2+0x7b/0xf0
      	 btrfs_ioctl+0x106f/0x30a0
      	 ksys_ioctl+0x83/0xc0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
        Chain exists of:
          &mm->mmap_lock#2 --> &delayed_node->mutex --> btrfs-fs-00
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-fs-00);
      				 lock(&delayed_node->mutex);
      				 lock(btrfs-fs-00);
          lock(&mm->mmap_lock#2);
      
         *** DEADLOCK ***
      
        1 lock held by compsize/11122:
         #0: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
      
        stack backtrace:
        CPU: 17 PID: 11122 Comm: compsize Kdump: loaded Not tainted 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922
        Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
        Call Trace:
         dump_stack+0x78/0xa0
         check_noncircular+0x165/0x180
         __lock_acquire+0x1272/0x2310
         lock_acquire+0x9e/0x360
         ? __might_fault+0x3e/0x90
         ? find_held_lock+0x72/0x90
         __might_fault+0x68/0x90
         ? __might_fault+0x3e/0x90
         _copy_to_user+0x1e/0x80
         copy_to_sk.isra.32+0x121/0x300
         ? btrfs_search_forward+0x2a6/0x360
         search_ioctl+0x106/0x200
         btrfs_ioctl_tree_search_v2+0x7b/0xf0
         btrfs_ioctl+0x106f/0x30a0
         ? __do_sys_newfstat+0x5a/0x70
         ? ksys_ioctl+0x83/0xc0
         ksys_ioctl+0x83/0xc0
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x50/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The problem is we're doing a copy_to_user() while holding tree locks,
      which can deadlock if we have to do a page fault for the copy_to_user().
      This exists even without my locking changes, so it needs to be fixed.
      Rework the search ioctl to do the pre-fault and then
      copy_to_user_nofault for the copying.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a48b73ec
    • Josef Bacik's avatar
      btrfs: drop path before adding new uuid tree entry · 9771a5cf
      Josef Bacik authored
      With the conversion of the tree locks to rwsem I got the following
      lockdep splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925 Not tainted
        ------------------------------------------------------
        btrfs-uuid/7955 is trying to acquire lock:
        ffff88bfbafec0f8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
      
        but task is already holding lock:
        ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (btrfs-uuid-00){++++}-{3:3}:
      	 down_read_nested+0x3e/0x140
      	 __btrfs_tree_read_lock+0x39/0x180
      	 __btrfs_read_lock_root_node+0x3a/0x50
      	 btrfs_search_slot+0x4bd/0x990
      	 btrfs_uuid_tree_add+0x89/0x2d0
      	 btrfs_uuid_scan_kthread+0x330/0x390
      	 kthread+0x133/0x150
      	 ret_from_fork+0x1f/0x30
      
        -> #0 (btrfs-root-00){++++}-{3:3}:
      	 __lock_acquire+0x1272/0x2310
      	 lock_acquire+0x9e/0x360
      	 down_read_nested+0x3e/0x140
      	 __btrfs_tree_read_lock+0x39/0x180
      	 __btrfs_read_lock_root_node+0x3a/0x50
      	 btrfs_search_slot+0x4bd/0x990
      	 btrfs_find_root+0x45/0x1b0
      	 btrfs_read_tree_root+0x61/0x100
      	 btrfs_get_root_ref.part.50+0x143/0x630
      	 btrfs_uuid_tree_iterate+0x207/0x314
      	 btrfs_uuid_rescan_kthread+0x12/0x50
      	 kthread+0x133/0x150
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-uuid-00);
      				 lock(btrfs-root-00);
      				 lock(btrfs-uuid-00);
          lock(btrfs-root-00);
      
         *** DEADLOCK ***
      
        1 lock held by btrfs-uuid/7955:
         #0: ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
      
        stack backtrace:
        CPU: 73 PID: 7955 Comm: btrfs-uuid Kdump: loaded Not tainted 5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925
        Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
        Call Trace:
         dump_stack+0x78/0xa0
         check_noncircular+0x165/0x180
         __lock_acquire+0x1272/0x2310
         lock_acquire+0x9e/0x360
         ? __btrfs_tree_read_lock+0x39/0x180
         ? btrfs_root_node+0x1c/0x1d0
         down_read_nested+0x3e/0x140
         ? __btrfs_tree_read_lock+0x39/0x180
         __btrfs_tree_read_lock+0x39/0x180
         __btrfs_read_lock_root_node+0x3a/0x50
         btrfs_search_slot+0x4bd/0x990
         btrfs_find_root+0x45/0x1b0
         btrfs_read_tree_root+0x61/0x100
         btrfs_get_root_ref.part.50+0x143/0x630
         btrfs_uuid_tree_iterate+0x207/0x314
         ? btree_readpage+0x20/0x20
         btrfs_uuid_rescan_kthread+0x12/0x50
         kthread+0x133/0x150
         ? kthread_create_on_node+0x60/0x60
         ret_from_fork+0x1f/0x30
      
      This problem exists because we have two different rescan threads,
      btrfs_uuid_scan_kthread which creates the uuid tree, and
      btrfs_uuid_tree_iterate that goes through and updates or deletes any out
      of date roots.  The problem is they both do things in different order.
      btrfs_uuid_scan_kthread() reads the tree_root, and then inserts entries
      into the uuid_root.  btrfs_uuid_tree_iterate() scans the uuid_root, but
      then does a btrfs_get_fs_root() which can read from the tree_root.
      
      It's actually easy enough to not be holding the path in
      btrfs_uuid_scan_kthread() when we add a uuid entry, as we already drop
      it further down and re-start the search when we loop.  So simply move
      the path release before we add our entry to the uuid tree.
      
      This also fixes a problem where we're holding a path open after we do
      btrfs_end_transaction(), which has it's own problems.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9771a5cf
    • Marcos Paulo de Souza's avatar
      btrfs: block-group: fix free-space bitmap threshold · e3e39c72
      Marcos Paulo de Souza authored
      [BUG]
      After commit 9afc6649 ("btrfs: block-group: refactor how we read one
      block group item"), cache->length is being assigned after calling
      btrfs_create_block_group_cache. This causes a problem since
      set_free_space_tree_thresholds calculates the free-space threshold to
      decide if the free-space tree should convert from extents to bitmaps.
      
      The current code calls set_free_space_tree_thresholds with cache->length
      being 0, which then makes cache->bitmap_high_thresh zero. This implies
      the system will always use bitmap instead of extents, which is not
      desired if the block group is not fragmented.
      
      This behavior can be seen by a test that expects to repair systems
      with FREE_SPACE_EXTENT and FREE_SPACE_BITMAP, but the current code only
      created FREE_SPACE_BITMAP.
      
      [FIX]
      Call set_free_space_tree_thresholds after setting cache->length. There
      is now a WARN_ON in set_free_space_tree_thresholds to help preventing
      the same mistake to happen again in the future.
      
      Link: https://github.com/kdave/btrfs-progs/issues/251
      Fixes: 9afc6649 ("btrfs: block-group: refactor how we read one block group item")
      CC: stable@vger.kernel.org # 5.8+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e3e39c72
  7. 21 Aug, 2020 2 commits
    • Boris Burkov's avatar
      btrfs: detect nocow for swap after snapshot delete · a84d5d42
      Boris Burkov authored
      can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic for
      detecting a must cow condition which is not exactly accurate, but saves
      unnecessary tree traversal. The incorrect assumption is that if the
      extent was created in a generation smaller than the last snapshot
      generation, it must be referenced by that snapshot. That is true, except
      the snapshot could have since been deleted, without affecting the last
      snapshot generation.
      
      The original patch claimed a performance win from this check, but it
      also leads to a bug where you are unable to use a swapfile if you ever
      snapshotted the subvolume it's in. Make the check slower and more strict
      for the swapon case, without modifying the general cow checks as a
      compromise. Turning swap on does not seem to be a particularly
      performance sensitive operation, so incurring a possibly unnecessary
      btrfs_search_slot seems worthwhile for the added usability.
      
      Note: Until the snapshot is competely cleaned after deletion,
      check_committed_refs will still cause the logic to think that cow is
      necessary, so the user must until 'btrfs subvolu sync' finished before
      activating the swapfile swapon.
      
      CC: stable@vger.kernel.org # 5.4+
      Suggested-by: default avatarOmar Sandoval <osandov@osandov.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a84d5d42
    • Josef Bacik's avatar
      btrfs: check the right error variable in btrfs_del_dir_entries_in_log · fb2fecba
      Josef Bacik authored
      With my new locking code dbench is so much faster that I tripped over a
      transaction abort from ENOSPC.  This turned out to be because
      btrfs_del_dir_entries_in_log was checking for ret == -ENOSPC, but this
      function sets err on error, and returns err.  So instead of properly
      marking the inode as needing a full commit, we were returning -ENOSPC
      and aborting in __btrfs_unlink_inode.  Fix this by checking the proper
      variable so that we return the correct thing in the case of ENOSPC.
      
      The ENOENT needs to be checked, because btrfs_lookup_dir_item_index()
      can return -ENOENT if the dir item isn't in the tree log (which would
      happen if we hadn't fsync'ed this guy).  We actually handle that case in
      __btrfs_unlink_inode, so it's an expected error to get back.
      
      Fixes: 4a500fd1 ("Btrfs: Metadata ENOSPC handling for tree log")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add note and comment about ENOENT ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fb2fecba
  8. 19 Aug, 2020 4 commits
    • Filipe Manana's avatar
      btrfs: fix space cache memory leak after transaction abort · bbc37d6e
      Filipe Manana authored
      If a transaction aborts it can cause a memory leak of the pages array of
      a block group's io_ctl structure. The following steps explain how that can
      happen:
      
      1) Transaction N is committing, currently in state TRANS_STATE_UNBLOCKED
         and it's about to start writing out dirty extent buffers;
      
      2) Transaction N + 1 already started and another task, task A, just called
         btrfs_commit_transaction() on it;
      
      3) Block group B was dirtied (extents allocated from it) by transaction
         N + 1, so when task A calls btrfs_start_dirty_block_groups(), at the
         very beginning of the transaction commit, it starts writeback for the
         block group's space cache by calling btrfs_write_out_cache(), which
         allocates the pages array for the block group's io_ctl with a call to
         io_ctl_init(). Block group A is added to the io_list of transaction
         N + 1 by btrfs_start_dirty_block_groups();
      
      4) While transaction N's commit is writing out the extent buffers, it gets
         an IO error and aborts transaction N, also setting the file system to
         RO mode;
      
      5) Task A has already returned from btrfs_start_dirty_block_groups(), is at
         btrfs_commit_transaction() and has set transaction N + 1 state to
         TRANS_STATE_COMMIT_START. Immediately after that it checks that the
         filesystem was turned to RO mode, due to transaction N's abort, and
         jumps to the "cleanup_transaction" label. After that we end up at
         btrfs_cleanup_one_transaction() which calls btrfs_cleanup_dirty_bgs().
         That helper finds block group B in the transaction's io_list but it
         never releases the pages array of the block group's io_ctl, resulting in
         a memory leak.
      
      In fact at the point when we are at btrfs_cleanup_dirty_bgs(), the pages
      array points to pages that were already released by us at
      __btrfs_write_out_cache() through the call to io_ctl_drop_pages(). We end
      up freeing the pages array only after waiting for the ordered extent to
      complete through btrfs_wait_cache_io(), which calls io_ctl_free() to do
      that. But in the transaction abort case we don't wait for the space cache's
      ordered extent to complete through a call to btrfs_wait_cache_io(), so
      that's why we end up with a memory leak - we wait for the ordered extent
      to complete indirectly by shutting down the work queues and waiting for
      any jobs in them to complete before returning from close_ctree().
      
      We can solve the leak simply by freeing the pages array right after
      releasing the pages (with the call to io_ctl_drop_pages()) at
      __btrfs_write_out_cache(), since we will never use it anymore after that
      and the pages array points to already released pages at that point, which
      is currently not a problem since no one will use it after that, but not a
      good practice anyway since it can easily lead to use-after-free issues.
      
      So fix this by freeing the pages array right after releasing the pages at
      __btrfs_write_out_cache().
      
      This issue can often be reproduced with test case generic/475 from fstests
      and kmemleak can detect it and reports it with the following trace:
      
      unreferenced object 0xffff9bbf009fa600 (size 512):
        comm "fsstress", pid 38807, jiffies 4298504428 (age 22.028s)
        hex dump (first 32 bytes):
          00 a0 7c 4d 3d ed ff ff 40 a0 7c 4d 3d ed ff ff  ..|M=...@.|M=...
          80 a0 7c 4d 3d ed ff ff c0 a0 7c 4d 3d ed ff ff  ..|M=.....|M=...
        backtrace:
          [<00000000f4b5cfe2>] __kmalloc+0x1a8/0x3e0
          [<0000000028665e7f>] io_ctl_init+0xa7/0x120 [btrfs]
          [<00000000a1f95b2d>] __btrfs_write_out_cache+0x86/0x4a0 [btrfs]
          [<00000000207ea1b0>] btrfs_write_out_cache+0x7f/0xf0 [btrfs]
          [<00000000af21f534>] btrfs_start_dirty_block_groups+0x27b/0x580 [btrfs]
          [<00000000c3c23d44>] btrfs_commit_transaction+0xa6f/0xe70 [btrfs]
          [<000000009588930c>] create_subvol+0x581/0x9a0 [btrfs]
          [<000000009ef2fd7f>] btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
          [<00000000474e5187>] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
          [<00000000708ee349>] btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
          [<00000000ea60106f>] btrfs_ioctl+0x12c/0x3130 [btrfs]
          [<000000005c923d6d>] __x64_sys_ioctl+0x83/0xb0
          [<0000000043ace2c9>] do_syscall_64+0x33/0x80
          [<00000000904efbce>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bbc37d6e
    • David Sterba's avatar
      btrfs: use the correct const function attribute for btrfs_get_num_csums · 604997b4
      David Sterba authored
      The build robot reports
      
      compiler: h8300-linux-gcc (GCC) 9.3.0
         In file included from fs/btrfs/tests/extent-map-tests.c:8:
      >> fs/btrfs/tests/../ctree.h:2166:8: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
          2166 | size_t __const btrfs_get_num_csums(void);
               |        ^~~~~~~
      
      The function attribute for const does not follow the expected scheme and
      in this case is confused with a const type qualifier.
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      604997b4
    • Marcos Paulo de Souza's avatar
      btrfs: reset compression level for lzo on remount · 282dd7d7
      Marcos Paulo de Souza authored
      Currently a user can set mount "-o compress" which will set the
      compression algorithm to zlib, and use the default compress level for
      zlib (3):
      
        relatime,compress=zlib:3,space_cache
      
      If the user remounts the fs using "-o compress=lzo", then the old
      compress_level is used:
      
        relatime,compress=lzo:3,space_cache
      
      But lzo does not expose any tunable compression level. The same happens
      if we set any compress argument with different level, also with zstd.
      
      Fix this by resetting the compress_level when compress=lzo is
      specified.  With the fix applied, lzo is shown without compress level:
      
        relatime,compress=lzo,space_cache
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      282dd7d7
    • Johannes Thumshirn's avatar
      btrfs: handle errors from async submission · c965d640
      Johannes Thumshirn authored
      Btrfs' async submit mechanism is able to handle errors in the submission
      path and the meta-data async submit function correctly passes the error
      code to the caller.
      
      In btrfs_submit_bio_start() and btrfs_submit_bio_start_direct_io() we're
      not handling the errors returned by btrfs_csum_one_bio() correctly though
      and simply call BUG_ON(). This is unnecessary as the caller of these two
      functions - run_one_async_start - correctly checks for the return values
      and sets the status of the async_submit_bio. The actual bio submission
      will be handled later on by run_one_async_done only if
      async_submit_bio::status is 0, so the data won't be written if we
      encountered an error in the checksum process.
      
      Simply return the error from btrfs_csum_one_bio() to the async submitters,
      like it's done in btree_submit_bio_start().
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c965d640
  9. 12 Aug, 2020 1 commit
    • Qu Wenruo's avatar
      btrfs: trim: fix underflow in trim length to prevent access beyond device boundary · c57dd1f2
      Qu Wenruo authored
      [BUG]
      The following script can lead to tons of beyond device boundary access:
      
        mkfs.btrfs -f $dev -b 10G
        mount $dev $mnt
        trimfs $mnt
        btrfs filesystem resize 1:-1G $mnt
        trimfs $mnt
      
      [CAUSE]
      Since commit 929be17a ("btrfs: Switch btrfs_trim_free_extents to
      find_first_clear_extent_bit"), we try to avoid trimming ranges that's
      already trimmed.
      
      So we check device->alloc_state by finding the first range which doesn't
      have CHUNK_TRIMMED and CHUNK_ALLOCATED not set.
      
      But if we shrunk the device, that bits are not cleared, thus we could
      easily got a range starts beyond the shrunk device size.
      
      This results the returned @start and @end are all beyond device size,
      then we call "end = min(end, device->total_bytes -1);" making @end
      smaller than device size.
      
      Then finally we goes "len = end - start + 1", totally underflow the
      result, and lead to the beyond-device-boundary access.
      
      [FIX]
      This patch will fix the problem in two ways:
      
      - Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device
        This is the root fix
      
      - Add extra safety check when trimming free device extents
        We check and warn if the returned range is already beyond current
        device.
      
      Link: https://github.com/kdave/btrfs-progs/issues/282
      Fixes: 929be17a ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit")
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c57dd1f2
  10. 11 Aug, 2020 1 commit
  11. 10 Aug, 2020 8 commits
    • Qu Wenruo's avatar
      btrfs: sysfs: fix NULL pointer dereference at btrfs_sysfs_del_qgroups() · 62ab2cc0
      Qu Wenruo authored
      [BUG]
      Unmounting a btrfs filesystem with quota disabled will cause the
      following NULL pointer dereference:
      
        BTRFS info (device dm-5): has skinny extents
        BUG: kernel NULL pointer dereference, address: 0000000000000018
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        CPU: 7 PID: 637 Comm: umount Not tainted 5.8.0-rc7-next-20200731-custom #76
        RIP: 0010:kobject_del+0x6/0x20
        Call Trace:
         btrfs_sysfs_del_qgroups+0xac/0xf0 [btrfs]
         btrfs_free_qgroup_config+0x63/0x70 [btrfs]
         close_ctree+0x1f5/0x323 [btrfs]
         btrfs_put_super+0x15/0x17 [btrfs]
         generic_shutdown_super+0x72/0x110
         kill_anon_super+0x18/0x30
         btrfs_kill_super+0x17/0x30 [btrfs]
         deactivate_locked_super+0x3b/0xa0
         deactivate_super+0x40/0x50
         cleanup_mnt+0x135/0x190
         __cleanup_mnt+0x12/0x20
         task_work_run+0x64/0xb0
         exit_to_user_mode_prepare+0x18a/0x190
         syscall_exit_to_user_mode+0x4f/0x270
         do_syscall_64+0x45/0x50
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ---[ end trace 37b7adca5c1d5c5d ]---
      
      [CAUSE]
      Commit 079ad2fb ("kobject: Avoid premature parent object freeing in
      kobject_cleanup()") changed kobject_del() that it no longer accepts NULL
      pointer.
      
      Before that commit, kobject_del() and kobject_put() all accept NULL
      pointers and just ignore such NULL pointers.
      
      But that mentioned commit needs to access the parent node, killing the
      old NULL pointer behavior.
      
      Unfortunately btrfs is relying on that hidden feature thus we will
      trigger such NULL pointer dereference.
      
      [FIX]
      Instead of just saving several lines, do proper fs_info->qgroups_kobj
      check before calling kobject_del() and kobject_put().
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      62ab2cc0
    • Boleyn Su's avatar
      btrfs: check correct variable after allocation in btrfs_backref_iter_alloc · c15c2ec0
      Boleyn Su authored
      The `if (!ret)` check will always be false and it may result in
      ret->path being dereferenced while it is a NULL pointer.
      
      Fixes: a37f232b ("btrfs: backref: introduce the skeleton of btrfs_backref_iter")
      CC: stable@vger.kernel.org # 5.8+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarBoleyn Su <boleynsu@google.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c15c2ec0
    • Josef Bacik's avatar
      btrfs: make sure SB_I_VERSION doesn't get unset by remount · faa00889
      Josef Bacik authored
      There's some inconsistency around SB_I_VERSION handling with mount and
      remount.  Since we don't really want it to be off ever just work around
      this by making sure we don't get the flag cleared on remount.
      
      There's a tiny cpu cost of setting the bit, otherwise all changes to
      i_version also change some of the times (ctime/mtime) so the inode needs
      to be synced. We wouldn't save anything by disabling it.
      Reported-by: default avatarEric Sandeen <sandeen@redhat.com>
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add perf impact analysis ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      faa00889
    • Filipe Manana's avatar
      btrfs: fix memory leaks after failure to lookup checksums during inode logging · 4f26433e
      Filipe Manana authored
      While logging an inode, at copy_items(), if we fail to lookup the checksums
      for an extent we release the destination path, free the ins_data array and
      then return immediately. However a previous iteration of the for loop may
      have added checksums to the ordered_sums list, in which case we leak the
      memory used by them.
      
      So fix this by making sure we iterate the ordered_sums list and free all
      its checksums before returning.
      
      Fixes: 3650860b ("Btrfs: remove almost all of the BUG()'s from tree-log.c")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4f26433e
    • Josef Bacik's avatar
      btrfs: don't show full path of bind mounts in subvol= · 3ef3959b
      Josef Bacik authored
      Chris Murphy reported a problem where rpm ostree will bind mount a bunch
      of things for whatever voodoo it's doing.  But when it does this
      /proc/mounts shows something like
      
        /dev/sda /mnt/test btrfs rw,relatime,subvolid=256,subvol=/foo 0 0
        /dev/sda /mnt/test/baz btrfs rw,relatime,subvolid=256,subvol=/foo/bar 0 0
      
      Despite subvolid=256 being subvol=/foo.  This is because we're just
      spitting out the dentry of the mount point, which in the case of bind
      mounts is the source path for the mountpoint.  Instead we should spit
      out the path to the actual subvol.  Fix this by looking up the name for
      the subvolid we have mounted.  With this fix the same test looks like
      this
      
        /dev/sda /mnt/test btrfs rw,relatime,subvolid=256,subvol=/foo 0 0
        /dev/sda /mnt/test/baz btrfs rw,relatime,subvolid=256,subvol=/foo 0 0
      Reported-by: default avatarChris Murphy <chris@colorremedies.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3ef3959b
    • David Sterba's avatar
      btrfs: fix messages after changing compression level by remount · 27942c99
      David Sterba authored
      Reported by Forza on IRC that remounting with compression options does
      not reflect the change in level, or at least it does not appear to do so
      according to the messages:
      
        mount -o compress=zstd:1 /dev/sda /mnt
        mount -o remount,compress=zstd:15 /mnt
      
      does not print the change to the level to syslog:
      
        [   41.366060] BTRFS info (device vda): use zstd compression, level 1
        [   41.368254] BTRFS info (device vda): disk space caching is enabled
        [   41.390429] BTRFS info (device vda): disk space caching is enabled
      
      What really happens is that the message is lost but the level is actualy
      changed.
      
      There's another weird output, if compression is reset to 'no':
      
        [   45.413776] BTRFS info (device vda): use no compression, level 4
      
      To fix that, save the previous compression level and print the message
      in that case too and use separate message for 'no' compression.
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      27942c99
    • Josef Bacik's avatar
      btrfs: only search for left_info if there is no right_info in try_merge_free_space · bf53d468
      Josef Bacik authored
      In try_to_merge_free_space we attempt to find entries to the left and
      right of the entry we are adding to see if they can be merged.  We
      search for an entry past our current info (saved into right_info), and
      then if right_info exists and it has a rb_prev() we save the rb_prev()
      into left_info.
      
      However there's a slight problem in the case that we have a right_info,
      but no entry previous to that entry.  At that point we will search for
      an entry just before the info we're attempting to insert.  This will
      simply find right_info again, and assign it to left_info, making them
      both the same pointer.
      
      Now if right_info _can_ be merged with the range we're inserting, we'll
      add it to the info and free right_info.  However further down we'll
      access left_info, which was right_info, and thus get a use-after-free.
      
      Fix this by only searching for the left entry if we don't find a right
      entry at all.
      
      The CVE referenced had a specially crafted file system that could
      trigger this use-after-free. However with the tree checker improvements
      we no longer trigger the conditions for the UAF.  But the original
      conditions still apply, hence this fix.
      
      Reference: CVE-2019-19448
      Fixes: 96303081 ("Btrfs: use hybrid extents+bitmap rb tree for free space")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bf53d468
    • Qu Wenruo's avatar
      btrfs: inode: fix NULL pointer dereference if inode doesn't need compression · 1e6e238c
      Qu Wenruo authored
      [BUG]
      There is a bug report of NULL pointer dereference caused in
      compress_file_extent():
      
        Oops: Kernel access of bad area, sig: 11 [#1]
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs]
        NIP [c008000006dd4d34] compress_file_range.constprop.41+0x75c/0x8a0 [btrfs]
        LR [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs]
        Call Trace:
        [c000000c69093b00] [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs] (unreliable)
        [c000000c69093bd0] [c008000006dd4ebc] async_cow_start+0x44/0xa0 [btrfs]
        [c000000c69093c10] [c008000006e14824] normal_work_helper+0xdc/0x598 [btrfs]
        [c000000c69093c80] [c0000000001608c0] process_one_work+0x2c0/0x5b0
        [c000000c69093d10] [c000000000160c38] worker_thread+0x88/0x660
        [c000000c69093db0] [c00000000016b55c] kthread+0x1ac/0x1c0
        [c000000c69093e20] [c00000000000b660] ret_from_kernel_thread+0x5c/0x7c
        ---[ end trace f16954aa20d822f6 ]---
      
      [CAUSE]
      For the following execution route of compress_file_range(), it's
      possible to hit NULL pointer dereference:
      
       compress_file_extent()
       |- pages = NULL;
       |- start = async_chunk->start = 0;
       |- end = async_chunk = 4095;
       |- nr_pages = 1;
       |- inode_need_compress() == false; <<< Possible, see later explanation
       |  Now, we have nr_pages = 1, pages = NULL
       |- cont:
       |- 		ret = cow_file_range_inline();
       |- 		if (ret <= 0) {
       |-		for (i = 0; i < nr_pages; i++) {
       |-			WARN_ON(pages[i]->mapping);	<<< Crash
      
      To enter above call execution branch, we need the following race:
      
          Thread 1 (chattr)     |            Thread 2 (writeback)
      --------------------------+------------------------------
                                | btrfs_run_delalloc_range
                                | |- inode_need_compress = true
                                | |- cow_file_range_async()
      btrfs_ioctl_set_flag()    |
      |- binode_flags |=        |
         BTRFS_INODE_NOCOMPRESS |
                                | compress_file_range()
                                | |- inode_need_compress = false
                                | |- nr_page = 1 while pages = NULL
                                | |  Then hit the crash
      
      [FIX]
      This patch will fix it by checking @pages before doing accessing it.
      This patch is only designed as a hot fix and easy to backport.
      
      More elegant fix may make btrfs only check inode_need_compress() once to
      avoid such race, but that would be another story.
      Reported-by: default avatarLuciano Chavez <chavez@us.ibm.com>
      Fixes: 4d3a800e ("btrfs: merge nr_pages input and output parameter in compress_pages")
      CC: stable@vger.kernel.org # 4.14.x: cecc8d90: btrfs: Move free_pages_out label in inline extent handling branch in compress_file_range
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1e6e238c
  12. 27 Jul, 2020 9 commits
    • Filipe Manana's avatar
      btrfs: do not set the full sync flag on the inode during page release · 5e548b32
      Filipe Manana authored
      When removing an extent map at try_release_extent_mapping(), called through
      the page release callback (btrfs_releasepage()), we always set the full
      sync flag on the inode, which forces the next fsync to use a slower code
      path.
      
      This hurts performance for workloads that dirty an amount of data that
      exceeds or is very close to the system's RAM memory and do frequent fsync
      operations (like database servers can for example). In particular if there
      are concurrent fsyncs against different files, by falling back to a full
      fsync we do a lot more checksum lookups in the checksums btree, as we do
      it for all the extents created in the current transaction, instead of only
      the new ones since the last fsync. These checksums lookups not only take
      some time but, more importantly, they also cause contention on the
      checksums btree locks due to the concurrency with checksum insertions in
      the btree by ordered extents from other inodes.
      
      We actually don't need to set the full sync flag on the inode, because we
      only remove extent maps that are in the list of modified extents if they
      were created in a past transaction, in which case an fsync skips them as
      it's pointless to log them. So stop setting the full fsync flag on the
      inode whenever we remove an extent map.
      
      This patch is part of a patchset that consists of 3 patches, which have
      the following subjects:
      
      1/3 btrfs: fix race between page release and a fast fsync
      2/3 btrfs: release old extent maps during page release
      3/3 btrfs: do not set the full sync flag on the inode during page release
      
      Performance tests were ran against a branch (misc-next) containing the
      whole patchset. The test exercises a workload where there are multiple
      processes writing to files and fsyncing them (each writing and fsyncing
      its own file), and in total the amount of data dirtied ranges from 2x to
      4x the system's RAM memory (16GiB), so that the page release callback is
      invoked frequently.
      
      The following script, using fio, was used to perform the tests:
      
        $ cat test-fsync.sh
        #!/bin/bash
      
        DEV=/dev/sdk
        MNT=/mnt/sdk
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 3 ]; then
            echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
            exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=write
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=64k
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        thread
        EOF
      
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
        fio /tmp/fio-job.ini
        umount $MNT
      
      The tests were performed for different numbers of jobs, file sizes and
      fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has
      12 cores, with cpu governance set to performance mode on all cores), 16GiB
      of ram (the host has 64GiB) and using a NVMe device directly (without an
      intermediary filesystem in the host). While running the tests, the host
      was not used for anything else, to avoid disturbing the tests.
      
      The obtained results were the following, and the last line printed by
      fio is pasted (includes aggregated throughput and test run time).
      
          *****************************************************
          ****     1 job, 32GiB file, fsync frequency 1     ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=29.1MiB/s (30.5MB/s), 29.1MiB/s-29.1MiB/s (30.5MB/s-30.5MB/s), io=32.0GiB (34.4GB), run=1127557-1127557msec
      
      After patchset:
      
      WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=32.0GiB (34.4GB), run=1119042-1119042msec
      (+0.7% throughput, -0.8% run time)
      
          *****************************************************
          ****     2 jobs, 16GiB files, fsync frequency 1   ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=33.5MiB/s (35.1MB/s), 33.5MiB/s-33.5MiB/s (35.1MB/s-35.1MB/s), io=32.0GiB (34.4GB), run=979000-979000msec
      
      After patchset:
      
      WRITE: bw=39.9MiB/s (41.8MB/s), 39.9MiB/s-39.9MiB/s (41.8MB/s-41.8MB/s), io=32.0GiB (34.4GB), run=821283-821283msec
      (+19.1% throughput, -16.1% runtime)
      
          *****************************************************
          ****     4 jobs, 8GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=52.1MiB/s (54.6MB/s), 52.1MiB/s-52.1MiB/s (54.6MB/s-54.6MB/s), io=32.0GiB (34.4GB), run=629130-629130msec
      
      After patchset:
      
      WRITE: bw=71.8MiB/s (75.3MB/s), 71.8MiB/s-71.8MiB/s (75.3MB/s-75.3MB/s), io=32.0GiB (34.4GB), run=456357-456357msec
      (+37.8% throughput, -27.5% runtime)
      
          *****************************************************
          ****     8 jobs, 4GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430708-430708msec
      
      After patchset:
      
      WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=245458-245458msec
      (+74.7% throughput, -43.0% run time)
      
          *****************************************************
          ****    16 jobs, 2GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=74.7MiB/s (78.3MB/s), 74.7MiB/s-74.7MiB/s (78.3MB/s-78.3MB/s), io=32.0GiB (34.4GB), run=438625-438625msec
      
      After patchset:
      
      WRITE: bw=184MiB/s (193MB/s), 184MiB/s-184MiB/s (193MB/s-193MB/s), io=32.0GiB (34.4GB), run=177864-177864msec
      (+146.3% throughput, -59.5% run time)
      
          *****************************************************
          ****    32 jobs, 2GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=72.6MiB/s (76.1MB/s), 72.6MiB/s-72.6MiB/s (76.1MB/s-76.1MB/s), io=64.0GiB (68.7GB), run=902615-902615msec
      
      After patchset:
      
      WRITE: bw=227MiB/s (238MB/s), 227MiB/s-227MiB/s (238MB/s-238MB/s), io=64.0GiB (68.7GB), run=288936-288936msec
      (+212.7% throughput, -68.0% run time)
      
          *****************************************************
          ****    64 jobs, 1GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=98.8MiB/s (104MB/s), 98.8MiB/s-98.8MiB/s (104MB/s-104MB/s), io=64.0GiB (68.7GB), run=663126-663126msec
      
      After patchset:
      
      WRITE: bw=294MiB/s (308MB/s), 294MiB/s-294MiB/s (308MB/s-308MB/s), io=64.0GiB (68.7GB), run=222940-222940msec
      (+197.6% throughput, -66.4% run time)
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5e548b32
    • Filipe Manana's avatar
      btrfs: release old extent maps during page release · fbc2bd7e
      Filipe Manana authored
      When removing an extent map at try_release_extent_mapping(), called through
      the page release callback (btrfs_releasepage()), we never release an extent
      map that is in the list of modified extents. This is to prevent races with
      a concurrent fsync using the fast path, which could lead to not logging an
      extent created in the current transaction.
      
      However we can safely remove an extent map created in a past transaction
      that is still in the list of modified extents (because no one fsynced yet
      the inode after that transaction got commited), because such extents are
      skipped during an fsync as it is pointless to log them. This change does
      that.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fbc2bd7e
    • Filipe Manana's avatar
      btrfs: fix race between page release and a fast fsync · 3d6448e6
      Filipe Manana authored
      When releasing an extent map, done through the page release callback, we
      can race with an ongoing fast fsync and cause the fsync to miss a new
      extent and not log it. The steps for this to happen are the following:
      
      1) A page is dirtied for some inode I;
      
      2) Writeback for that page is triggered by a path other than fsync, for
         example by the system due to memory pressure;
      
      3) When the ordered extent for the extent (a single 4K page) finishes,
         we unpin the corresponding extent map and set its generation to N,
         the current transaction's generation;
      
      4) The btrfs_releasepage() callback is invoked by the system due to
         memory pressure for that no longer dirty page of inode I;
      
      5) At the same time, some task calls fsync on inode I, joins transaction
         N, and at btrfs_log_inode() it sees that the inode does not have the
         full sync flag set, so we proceed with a fast fsync. But before we get
         into btrfs_log_changed_extents() and lock the inode's extent map tree:
      
      6) Through btrfs_releasepage() we end up at try_release_extent_mapping()
         and we remove the extent map for the new 4Kb extent, because it is
         neither pinned anymore nor locked. By calling remove_extent_mapping(),
         we remove the extent map from the list of modified extents, since the
         extent map does not have the logging flag set. We unlock the inode's
         extent map tree;
      
      7) The task doing the fast fsync now enters btrfs_log_changed_extents(),
         locks the inode's extent map tree and iterates its list of modified
         extents, which no longer has the 4Kb extent in it, so it does not log
         the extent;
      
      8) The fsync finishes;
      
      9) Before transaction N is committed, a power failure happens. After
         replaying the log, the 4K extent of inode I will be missing, since
         it was not logged due to the race with try_release_extent_mapping().
      
      So fix this by teaching try_release_extent_mapping() to not remove an
      extent map if it's still in the list of modified extents.
      
      Fixes: ff44c6e3 ("Btrfs: do not hold the write_lock on the extent tree while logging")
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3d6448e6
    • Johannes Thumshirn's avatar
      btrfs: open-code remount flag setting in btrfs_remount · 88c4703f
      Johannes Thumshirn authored
      When we're (re)mounting a btrfs filesystem we set the
      BTRFS_FS_STATE_REMOUNTING state in fs_info to serialize against async
      reclaim or defrags.
      
      This flag is set in btrfs_remount_prepare() called by btrfs_remount().
      As btrfs_remount_prepare() does nothing but setting this flag and
      doesn't have a second caller, we can just open-code the flag setting in
      btrfs_remount().
      
      Similarly do for so clearing of the flag by moving it out of
      btrfs_remount_cleanup() into btrfs_remount() to be symmetrical.
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      88c4703f
    • Josef Bacik's avatar
      btrfs: if we're restriping, use the target restripe profile · 162e0a16
      Josef Bacik authored
      Previously we depended on some weird behavior in our chunk allocator to
      force the allocation of new stripes, so by the time we got to doing the
      reduce we would usually already have a chunk with the proper target.
      
      However that behavior causes other problems and needs to be removed.
      First however we need to remove this check to only restripe if we
      already have those available profiles, because if we're allocating our
      first chunk it obviously will not be available.  Simply use the target
      as specified, and if that fails it'll be because we're out of space.
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      162e0a16
    • Josef Bacik's avatar
      btrfs: don't adjust bg flags and use default allocation profiles · 349e120e
      Josef Bacik authored
      btrfs/061 has been failing consistently for me recently with a
      transaction abort.  We run out of space in the system chunk array, which
      means we've allocated way too many system chunks than we need.
      
      Chris added this a long time ago for balance as a poor mans restriping.
      If you had a single disk and then added another disk and then did a
      balance, update_block_group_flags would then figure out which RAID level
      you needed.
      
      Fast forward to today and we have restriping behavior, so we can
      explicitly tell the fs that we're trying to change the raid level.  This
      is accomplished through the normal get_alloc_profile path.
      
      Furthermore this code actually causes btrfs/061 to fail, because we do
      things like mkfs -m dup -d single with multiple devices.  This trips
      this check
      
      alloc_flags = update_block_group_flags(fs_info, cache->flags);
      if (alloc_flags != cache->flags) {
      	ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
      
      in btrfs_inc_block_group_ro.  Because we're balancing and scrubbing, but
      not actually restriping, we keep forcing chunk allocation of RAID1
      chunks.  This eventually causes us to run out of system space and the
      file system aborts and flips read only.
      
      We don't need this poor mans restriping any more, simply use the normal
      get_alloc_profile helper, which will get the correct alloc_flags and
      thus make the right decision for chunk allocation.  This keeps us from
      allocating a billion system chunks and falling over.
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      349e120e
    • Josef Bacik's avatar
      btrfs: fix lockdep splat from btrfs_dump_space_info · ab0db043
      Josef Bacik authored
      When running with -o enospc_debug you can get the following splat if one
      of the dump_space_info's trip
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc5+ #20 Tainted: G           OE
        ------------------------------------------------------
        dd/563090 is trying to acquire lock:
        ffff9e7dbf4f1e18 (&ctl->tree_lock){+.+.}-{2:2}, at: btrfs_dump_free_space+0x2b/0xa0 [btrfs]
      
        but task is already holding lock:
        ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs]
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (&cache->lock){+.+.}-{2:2}:
      	 _raw_spin_lock+0x25/0x30
      	 btrfs_add_reserved_bytes+0x3c/0x3c0 [btrfs]
      	 find_free_extent+0x7ef/0x13b0 [btrfs]
      	 btrfs_reserve_extent+0x9b/0x180 [btrfs]
      	 btrfs_alloc_tree_block+0xc1/0x340 [btrfs]
      	 alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
      	 __btrfs_cow_block+0x122/0x530 [btrfs]
      	 btrfs_cow_block+0x106/0x210 [btrfs]
      	 commit_cowonly_roots+0x55/0x300 [btrfs]
      	 btrfs_commit_transaction+0x4ed/0xac0 [btrfs]
      	 sync_filesystem+0x74/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0x14/0x30
      	 btrfs_kill_super+0x12/0x20 [btrfs]
      	 deactivate_locked_super+0x36/0x70
      	 cleanup_mnt+0x104/0x160
      	 task_work_run+0x5f/0x90
      	 __prepare_exit_to_usermode+0x1bd/0x1c0
      	 do_syscall_64+0x5e/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (&space_info->lock){+.+.}-{2:2}:
      	 _raw_spin_lock+0x25/0x30
      	 btrfs_block_rsv_release+0x1a6/0x3f0 [btrfs]
      	 btrfs_inode_rsv_release+0x4f/0x170 [btrfs]
      	 btrfs_clear_delalloc_extent+0x155/0x480 [btrfs]
      	 clear_state_bit+0x81/0x1a0 [btrfs]
      	 __clear_extent_bit+0x25c/0x5d0 [btrfs]
      	 clear_extent_bit+0x15/0x20 [btrfs]
      	 btrfs_invalidatepage+0x2b7/0x3c0 [btrfs]
      	 truncate_cleanup_page+0x47/0xe0
      	 truncate_inode_pages_range+0x238/0x840
      	 truncate_pagecache+0x44/0x60
      	 btrfs_setattr+0x202/0x5e0 [btrfs]
      	 notify_change+0x33b/0x490
      	 do_truncate+0x76/0xd0
      	 path_openat+0x687/0xa10
      	 do_filp_open+0x91/0x100
      	 do_sys_openat2+0x215/0x2d0
      	 do_sys_open+0x44/0x80
      	 do_syscall_64+0x52/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&tree->lock#2){+.+.}-{2:2}:
      	 _raw_spin_lock+0x25/0x30
      	 find_first_extent_bit+0x32/0x150 [btrfs]
      	 write_pinned_extent_entries.isra.0+0xc5/0x100 [btrfs]
      	 __btrfs_write_out_cache+0x172/0x480 [btrfs]
      	 btrfs_write_out_cache+0x7a/0xf0 [btrfs]
      	 btrfs_write_dirty_block_groups+0x286/0x3b0 [btrfs]
      	 commit_cowonly_roots+0x245/0x300 [btrfs]
      	 btrfs_commit_transaction+0x4ed/0xac0 [btrfs]
      	 close_ctree+0xf9/0x2f5 [btrfs]
      	 generic_shutdown_super+0x6c/0x100
      	 kill_anon_super+0x14/0x30
      	 btrfs_kill_super+0x12/0x20 [btrfs]
      	 deactivate_locked_super+0x36/0x70
      	 cleanup_mnt+0x104/0x160
      	 task_work_run+0x5f/0x90
      	 __prepare_exit_to_usermode+0x1bd/0x1c0
      	 do_syscall_64+0x5e/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&ctl->tree_lock){+.+.}-{2:2}:
      	 __lock_acquire+0x1240/0x2460
      	 lock_acquire+0xab/0x360
      	 _raw_spin_lock+0x25/0x30
      	 btrfs_dump_free_space+0x2b/0xa0 [btrfs]
      	 btrfs_dump_space_info+0xf4/0x120 [btrfs]
      	 btrfs_reserve_extent+0x176/0x180 [btrfs]
      	 __btrfs_prealloc_file_range+0x145/0x550 [btrfs]
      	 cache_save_setup+0x28d/0x3b0 [btrfs]
      	 btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs]
      	 btrfs_commit_transaction+0xcc/0xac0 [btrfs]
      	 btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs]
      	 btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
      	 btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs]
      	 btrfs_file_write_iter+0x3cf/0x610 [btrfs]
      	 new_sync_write+0x11e/0x1b0
      	 vfs_write+0x1c9/0x200
      	 ksys_write+0x68/0xe0
      	 do_syscall_64+0x52/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
        Chain exists of:
          &ctl->tree_lock --> &space_info->lock --> &cache->lock
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&cache->lock);
      				 lock(&space_info->lock);
      				 lock(&cache->lock);
          lock(&ctl->tree_lock);
      
         *** DEADLOCK ***
      
        6 locks held by dd/563090:
         #0: ffff9e7e21d18448 (sb_writers#14){.+.+}-{0:0}, at: vfs_write+0x195/0x200
         #1: ffff9e7dd0410ed8 (&sb->s_type->i_mutex_key#19){++++}-{3:3}, at: btrfs_file_write_iter+0x86/0x610 [btrfs]
         #2: ffff9e7e21d18638 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40b/0x5b0 [btrfs]
         #3: ffff9e7e1f05d688 (&cur_trans->cache_write_mutex){+.+.}-{3:3}, at: btrfs_start_dirty_block_groups+0x158/0x4f0 [btrfs]
         #4: ffff9e7e2284ddb8 (&space_info->groups_sem){++++}-{3:3}, at: btrfs_dump_space_info+0x69/0x120 [btrfs]
         #5: ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs]
      
        stack backtrace:
        CPU: 3 PID: 563090 Comm: dd Tainted: G           OE     5.8.0-rc5+ #20
        Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
        Call Trace:
         dump_stack+0x96/0xd0
         check_noncircular+0x162/0x180
         __lock_acquire+0x1240/0x2460
         ? wake_up_klogd.part.0+0x30/0x40
         lock_acquire+0xab/0x360
         ? btrfs_dump_free_space+0x2b/0xa0 [btrfs]
         _raw_spin_lock+0x25/0x30
         ? btrfs_dump_free_space+0x2b/0xa0 [btrfs]
         btrfs_dump_free_space+0x2b/0xa0 [btrfs]
         btrfs_dump_space_info+0xf4/0x120 [btrfs]
         btrfs_reserve_extent+0x176/0x180 [btrfs]
         __btrfs_prealloc_file_range+0x145/0x550 [btrfs]
         ? btrfs_qgroup_reserve_data+0x1d/0x60 [btrfs]
         cache_save_setup+0x28d/0x3b0 [btrfs]
         btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs]
         btrfs_commit_transaction+0xcc/0xac0 [btrfs]
         ? start_transaction+0xe0/0x5b0 [btrfs]
         btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs]
         btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
         btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs]
         ? ktime_get_coarse_real_ts64+0xa8/0xd0
         ? trace_hardirqs_on+0x1c/0xe0
         btrfs_file_write_iter+0x3cf/0x610 [btrfs]
         new_sync_write+0x11e/0x1b0
         vfs_write+0x1c9/0x200
         ksys_write+0x68/0xe0
         do_syscall_64+0x52/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This is because we're holding the block_group->lock while trying to dump
      the free space cache.  However we don't need this lock, we just need it
      to read the values for the printk, so move the free space cache dumping
      outside of the block group lock.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ab0db043
    • Josef Bacik's avatar
      btrfs: move the chunk_mutex in btrfs_read_chunk_tree · 01d01caf
      Josef Bacik authored
      We are currently getting this lockdep splat in btrfs/161:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc5+ #20 Tainted: G            E
        ------------------------------------------------------
        mount/678048 is trying to acquire lock:
        ffff9b769f15b6e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: clone_fs_devices+0x4d/0x170 [btrfs]
      
        but task is already holding lock:
        ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x8b/0x8f0
      	 btrfs_init_new_device+0x2d2/0x1240 [btrfs]
      	 btrfs_ioctl+0x1de/0x2d20 [btrfs]
      	 ksys_ioctl+0x87/0xc0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x52/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x1240/0x2460
      	 lock_acquire+0xab/0x360
      	 __mutex_lock+0x8b/0x8f0
      	 clone_fs_devices+0x4d/0x170 [btrfs]
      	 btrfs_read_chunk_tree+0x330/0x800 [btrfs]
      	 open_ctree+0xb7c/0x18ce [btrfs]
      	 btrfs_mount_root.cold+0x13/0xfa [btrfs]
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 fc_mount+0xe/0x40
      	 vfs_kern_mount.part.0+0x71/0x90
      	 btrfs_mount+0x13b/0x3e0 [btrfs]
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 do_mount+0x7de/0xb30
      	 __x64_sys_mount+0x8e/0xd0
      	 do_syscall_64+0x52/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&fs_info->chunk_mutex);
      				 lock(&fs_devs->device_list_mutex);
      				 lock(&fs_info->chunk_mutex);
          lock(&fs_devs->device_list_mutex);
      
         *** DEADLOCK ***
      
        3 locks held by mount/678048:
         #0: ffff9b75ff5fb0e0 (&type->s_umount_key#63/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
         #1: ffffffffc0c2fbc8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x54/0x800 [btrfs]
         #2: ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
      
        stack backtrace:
        CPU: 2 PID: 678048 Comm: mount Tainted: G            E     5.8.0-rc5+ #20
        Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
        Call Trace:
         dump_stack+0x96/0xd0
         check_noncircular+0x162/0x180
         __lock_acquire+0x1240/0x2460
         ? asm_sysvec_apic_timer_interrupt+0x12/0x20
         lock_acquire+0xab/0x360
         ? clone_fs_devices+0x4d/0x170 [btrfs]
         __mutex_lock+0x8b/0x8f0
         ? clone_fs_devices+0x4d/0x170 [btrfs]
         ? rcu_read_lock_sched_held+0x52/0x60
         ? cpumask_next+0x16/0x20
         ? module_assert_mutex_or_preempt+0x14/0x40
         ? __module_address+0x28/0xf0
         ? clone_fs_devices+0x4d/0x170 [btrfs]
         ? static_obj+0x4f/0x60
         ? lockdep_init_map_waits+0x43/0x200
         ? clone_fs_devices+0x4d/0x170 [btrfs]
         clone_fs_devices+0x4d/0x170 [btrfs]
         btrfs_read_chunk_tree+0x330/0x800 [btrfs]
         open_ctree+0xb7c/0x18ce [btrfs]
         ? super_setup_bdi_name+0x79/0xd0
         btrfs_mount_root.cold+0x13/0xfa [btrfs]
         ? vfs_parse_fs_string+0x84/0xb0
         ? rcu_read_lock_sched_held+0x52/0x60
         ? kfree+0x2b5/0x310
         legacy_get_tree+0x30/0x50
         vfs_get_tree+0x28/0xc0
         fc_mount+0xe/0x40
         vfs_kern_mount.part.0+0x71/0x90
         btrfs_mount+0x13b/0x3e0 [btrfs]
         ? cred_has_capability+0x7c/0x120
         ? rcu_read_lock_sched_held+0x52/0x60
         ? legacy_get_tree+0x30/0x50
         legacy_get_tree+0x30/0x50
         vfs_get_tree+0x28/0xc0
         do_mount+0x7de/0xb30
         ? memdup_user+0x4e/0x90
         __x64_sys_mount+0x8e/0xd0
         do_syscall_64+0x52/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This is because btrfs_read_chunk_tree() can come upon DEV_EXTENT's and
      then read the device, which takes the device_list_mutex.  The
      device_list_mutex needs to be taken before the chunk_mutex, so this is a
      problem.  We only really need the chunk mutex around adding the chunk,
      so move the mutex around read_one_chunk.
      
      An argument could be made that we don't even need the chunk_mutex here
      as it's during mount, and we are protected by various other locks.
      However we already have special rules for ->device_list_mutex, and I'd
      rather not have another special case for ->chunk_mutex.
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      01d01caf
    • Josef Bacik's avatar
      btrfs: open device without device_list_mutex · 18c850fd
      Josef Bacik authored
      There's long existed a lockdep splat because we open our bdev's under
      the ->device_list_mutex at mount time, which acquires the bd_mutex.
      Usually this goes unnoticed, but if you do loopback devices at all
      suddenly the bd_mutex comes with a whole host of other dependencies,
      which results in the splat when you mount a btrfs file system.
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
      ------------------------------------------------------
      systemd-journal/509 is trying to acquire lock:
      ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]
      
      but task is already holding lock:
      ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
       -> #6 (sb_pagefaults){.+.+}-{0:0}:
             __sb_start_write+0x13e/0x220
             btrfs_page_mkwrite+0x59/0x560 [btrfs]
             do_page_mkwrite+0x4f/0x130
             do_wp_page+0x3b0/0x4f0
             handle_mm_fault+0xf47/0x1850
             do_user_addr_fault+0x1fc/0x4b0
             exc_page_fault+0x88/0x300
             asm_exc_page_fault+0x1e/0x30
      
       -> #5 (&mm->mmap_lock#2){++++}-{3:3}:
             __might_fault+0x60/0x80
             _copy_from_user+0x20/0xb0
             get_sg_io_hdr+0x9a/0xb0
             scsi_cmd_ioctl+0x1ea/0x2f0
             cdrom_ioctl+0x3c/0x12b4
             sr_block_ioctl+0xa4/0xd0
             block_ioctl+0x3f/0x50
             ksys_ioctl+0x82/0xc0
             __x64_sys_ioctl+0x16/0x20
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #4 (&cd->lock){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             sr_block_open+0xa2/0x180
             __blkdev_get+0xdd/0x550
             blkdev_get+0x38/0x150
             do_dentry_open+0x16b/0x3e0
             path_openat+0x3c9/0xa00
             do_filp_open+0x75/0x100
             do_sys_openat2+0x8a/0x140
             __x64_sys_openat+0x46/0x70
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             __blkdev_get+0x6a/0x550
             blkdev_get+0x85/0x150
             blkdev_get_by_path+0x2c/0x70
             btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
             open_fs_devices+0x88/0x240 [btrfs]
             btrfs_open_devices+0x92/0xa0 [btrfs]
             btrfs_mount_root+0x250/0x490 [btrfs]
             legacy_get_tree+0x30/0x50
             vfs_get_tree+0x28/0xc0
             vfs_kern_mount.part.0+0x71/0xb0
             btrfs_mount+0x119/0x380 [btrfs]
             legacy_get_tree+0x30/0x50
             vfs_get_tree+0x28/0xc0
             do_mount+0x8c6/0xca0
             __x64_sys_mount+0x8e/0xd0
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             btrfs_run_dev_stats+0x36/0x420 [btrfs]
             commit_cowonly_roots+0x91/0x2d0 [btrfs]
             btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
             btrfs_sync_file+0x38a/0x480 [btrfs]
             __x64_sys_fdatasync+0x47/0x80
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
             btrfs_sync_file+0x38a/0x480 [btrfs]
             __x64_sys_fdatasync+0x47/0x80
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
             __lock_acquire+0x1241/0x20c0
             lock_acquire+0xb0/0x400
             __mutex_lock+0x7b/0x820
             btrfs_record_root_in_trans+0x44/0x70 [btrfs]
             start_transaction+0xd2/0x500 [btrfs]
             btrfs_dirty_inode+0x44/0xd0 [btrfs]
             file_update_time+0xc6/0x120
             btrfs_page_mkwrite+0xda/0x560 [btrfs]
             do_page_mkwrite+0x4f/0x130
             do_wp_page+0x3b0/0x4f0
             handle_mm_fault+0xf47/0x1850
             do_user_addr_fault+0x1fc/0x4b0
             exc_page_fault+0x88/0x300
             asm_exc_page_fault+0x1e/0x30
      
      other info that might help us debug this:
      
      Chain exists of:
        &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
      
      Possible unsafe locking scenario:
      
           CPU0                    CPU1
           ----                    ----
       lock(sb_pagefaults);
                                   lock(&mm->mmap_lock#2);
                                   lock(sb_pagefaults);
       lock(&fs_info->reloc_mutex);
      
       *** DEADLOCK ***
      
      3 locks held by systemd-journal/509:
       #0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
       #1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
       #2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]
      
      stack backtrace:
      CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      Call Trace:
       dump_stack+0x92/0xc8
       check_noncircular+0x134/0x150
       __lock_acquire+0x1241/0x20c0
       lock_acquire+0xb0/0x400
       ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       ? lock_acquire+0xb0/0x400
       ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       __mutex_lock+0x7b/0x820
       ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       ? kvm_sched_clock_read+0x14/0x30
       ? sched_clock+0x5/0x10
       ? sched_clock_cpu+0xc/0xb0
       btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       start_transaction+0xd2/0x500 [btrfs]
       btrfs_dirty_inode+0x44/0xd0 [btrfs]
       file_update_time+0xc6/0x120
       btrfs_page_mkwrite+0xda/0x560 [btrfs]
       ? sched_clock+0x5/0x10
       do_page_mkwrite+0x4f/0x130
       do_wp_page+0x3b0/0x4f0
       handle_mm_fault+0xf47/0x1850
       do_user_addr_fault+0x1fc/0x4b0
       exc_page_fault+0x88/0x300
       ? asm_exc_page_fault+0x8/0x30
       asm_exc_page_fault+0x1e/0x30
      RIP: 0033:0x7fa3972fdbfe
      Code: Bad RIP value.
      
      Fix this by not holding the ->device_list_mutex at this point.  The
      device_list_mutex exists to protect us from modifying the device list
      while the file system is running.
      
      However it can also be modified by doing a scan on a device.  But this
      action is specifically protected by the uuid_mutex, which we are holding
      here.  We cannot race with opening at this point because we have the
      ->s_mount lock held during the mount.  Not having the
      ->device_list_mutex here is perfectly safe as we're not going to change
      the devices at this point.
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add some comments ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      18c850fd