1. 11 Jul, 2024 3 commits
    • Filipe Manana's avatar
      btrfs: avoid races when tracking progress for extent map shrinking · 44849405
      Filipe Manana authored
      We store the progress (root and inode numbers) of the extent map shrinker
      in fs_info without any synchronization but we can have multiple tasks
      calling into the shrinker during memory allocations when there's enough
      memory pressure for example.
      
      This can result in a task A reading fs_info->extent_map_shrinker_last_ino
      after another task B updates it, and task A reading
      fs_info->extent_map_shrinker_last_root before task B updates it, making
      task A see an odd state that isn't necessarily harmful but may make it
      skip certain inode ranges or do more work than necessary by going over
      the same inodes again. These unprotected accesses would also trigger
      warnings from tools like KCSAN.
      
      So add a lock to protect access to these progress fields.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      44849405
    • Filipe Manana's avatar
      btrfs: stop extent map shrinker if reschedule is needed · b3ebb9b7
      Filipe Manana authored
      The extent map shrinker can be called in a variety of contexts where we
      are under memory pressure, and of them is when a task is trying to
      allocate memory. For this reason the shrinker is typically called with a
      value of struct shrink_control::nr_to_scan that is much smaller than what
      we return in the nr_cached_objects callback of struct super_operations
      (fs/btrfs/super.c:btrfs_nr_cached_objects()), so that the shrinker does
      not take a long time and cause high latencies. However we can still take
      a lot of time in the shrinker even for a limited amount of nr_to_scan:
      
      1) When traversing the red black tree that tracks open inodes in a root,
         as for example with millions of open inodes we get a deep tree which
         takes time searching for an inode;
      
      2) Iterating over the extent map tree, which is a red black tree, of an
         inode when doing the rb_next() calls and when removing an extent map
         from the tree, since often that requires rebalancing the red black
         tree;
      
      3) When trying to write lock an inode's extent map tree we may wait for a
         significant amount of time, because there's either another task about
         to do IO and searching for an extent map in the tree or inserting an
         extent map in the tree, and we can have thousands or even millions of
         extent maps for an inode. Furthermore, there can be concurrent calls
         to the shrinker so the lock might be busy simply because there is
         already another task shrinking extent maps for the same inode;
      
      4) We often reschedule if we need to, which further increases latency.
      
      So improve on this by stopping the extent map shrinking code whenever we
      need to reschedule and make it skip an inode if we can't immediately lock
      its extent map tree.
      Reported-by: default avatarMikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Reported-by: default avatarAndrea Gelmini <andrea.gelmini@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CABXGCsMmmb36ym8hVNGTiU8yfUS_cGvoUmGCcBrGWq9OxTrs+A@mail.gmail.com/Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b3ebb9b7
    • Filipe Manana's avatar
      btrfs: use delayed iput during extent map shrinking · 68a3ebd1
      Filipe Manana authored
      When putting an inode during extent map shrinking we're doing a standard
      iput() but that may take a long time in case the inode is dirty and we are
      doing the final iput that triggers eviction - the VFS will have to wait
      for writeback before calling the btrfs evict callback (see
      fs/inode.c:evict()).
      
      This slows down the task running the shrinker which may have been
      triggered while updating some tree for example, meaning locks are held
      as well as an open transaction handle.
      
      Also if the iput() ends up triggering eviction and the inode has no links
      anymore, then we trigger item truncation which requires flushing delayed
      items, space reservation to start a transaction and that may trigger the
      space reclaim task and wait for it, resulting in deadlocks in case the
      reclaim task needs for example to commit a transaction and the shrinker
      is being triggered from a path holding a transaction handle.
      
      Syzbot reported such a case with the following stack traces:
      
         ======================================================
         WARNING: possible circular locking dependency detected
         6.10.0-rc2-syzkaller-00010-g2ab79514 #0 Not tainted
         ------------------------------------------------------
         kswapd0/111 is trying to acquire lock:
         ffff88801eae4610 (sb_internal#3){.+.+}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275
      
         but task is already holding lock:
         ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924
      
         which lock already depends on the new lock.
      
         the existing dependency chain (in reverse order) is:
      
         -> #3 (fs_reclaim){+.+.}-{0:0}:
                __fs_reclaim_acquire mm/page_alloc.c:3783 [inline]
                fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3797
                might_alloc include/linux/sched/mm.h:334 [inline]
                slab_pre_alloc_hook mm/slub.c:3890 [inline]
                slab_alloc_node mm/slub.c:3980 [inline]
                kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4019
                btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411
                alloc_inode+0x5d/0x230 fs/inode.c:261
                iget5_locked fs/inode.c:1235 [inline]
                iget5_locked+0x1c9/0x2c0 fs/inode.c:1228
                btrfs_iget_locked fs/btrfs/inode.c:5590 [inline]
                btrfs_iget_path fs/btrfs/inode.c:5607 [inline]
                btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636
                create_reloc_inode+0x403/0x820 fs/btrfs/relocation.c:3911
                btrfs_relocate_block_group+0x471/0xe60 fs/btrfs/relocation.c:4114
                btrfs_relocate_chunk+0x143/0x450 fs/btrfs/volumes.c:3373
                __btrfs_balance fs/btrfs/volumes.c:4157 [inline]
                btrfs_balance+0x211a/0x3f00 fs/btrfs/volumes.c:4534
                btrfs_ioctl_balance fs/btrfs/ioctl.c:3675 [inline]
                btrfs_ioctl+0x12ed/0x8290 fs/btrfs/ioctl.c:4742
                __do_compat_sys_ioctl+0x2c3/0x330 fs/ioctl.c:1007
                do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
                __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
                do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
                entry_SYSENTER_compat_after_hwframe+0x84/0x8e
      
         -> #2 (btrfs_trans_num_extwriters){++++}-{0:0}:
                join_transaction+0x164/0xf40 fs/btrfs/transaction.c:315
                start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700
                btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323
                btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999
                open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554
                btrfs_fill_super fs/btrfs/super.c:946 [inline]
                btrfs_get_tree_super fs/btrfs/super.c:1863 [inline]
                btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089
                vfs_get_tree+0x8f/0x380 fs/super.c:1780
                fc_mount+0x16/0xc0 fs/namespace.c:1125
                btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline]
                btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090
                vfs_get_tree+0x8f/0x380 fs/super.c:1780
                do_new_mount fs/namespace.c:3352 [inline]
                path_mount+0x6e1/0x1f10 fs/namespace.c:3679
                do_mount fs/namespace.c:3692 [inline]
                __do_sys_mount fs/namespace.c:3898 [inline]
                __se_sys_mount fs/namespace.c:3875 [inline]
                __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875
                do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
                __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
                do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
                entry_SYSENTER_compat_after_hwframe+0x84/0x8e
      
         -> #1 (btrfs_trans_num_writers){++++}-{0:0}:
                join_transaction+0x148/0xf40 fs/btrfs/transaction.c:314
                start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700
                btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323
                btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999
                open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554
                btrfs_fill_super fs/btrfs/super.c:946 [inline]
                btrfs_get_tree_super fs/btrfs/super.c:1863 [inline]
                btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089
                vfs_get_tree+0x8f/0x380 fs/super.c:1780
                fc_mount+0x16/0xc0 fs/namespace.c:1125
                btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline]
                btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090
                vfs_get_tree+0x8f/0x380 fs/super.c:1780
                do_new_mount fs/namespace.c:3352 [inline]
                path_mount+0x6e1/0x1f10 fs/namespace.c:3679
                do_mount fs/namespace.c:3692 [inline]
                __do_sys_mount fs/namespace.c:3898 [inline]
                __se_sys_mount fs/namespace.c:3875 [inline]
                __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875
                do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
                __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
                do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
                entry_SYSENTER_compat_after_hwframe+0x84/0x8e
      
         -> #0 (sb_internal#3){.+.+}-{0:0}:
                check_prev_add kernel/locking/lockdep.c:3134 [inline]
                check_prevs_add kernel/locking/lockdep.c:3253 [inline]
                validate_chain kernel/locking/lockdep.c:3869 [inline]
                __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137
                lock_acquire kernel/locking/lockdep.c:5754 [inline]
                lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719
                percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
                __sb_start_write include/linux/fs.h:1655 [inline]
                sb_start_intwrite include/linux/fs.h:1838 [inline]
                start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694
                btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275
                btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291
                evict+0x2ed/0x6c0 fs/inode.c:667
                iput_final fs/inode.c:1741 [inline]
                iput.part.0+0x5a8/0x7f0 fs/inode.c:1767
                iput+0x5c/0x80 fs/inode.c:1757
                btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline]
                btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189
                super_cache_scan+0x409/0x550 fs/super.c:227
                do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435
                shrink_slab+0x18a/0x1310 mm/shrinker.c:662
                shrink_one+0x493/0x7c0 mm/vmscan.c:4790
                shrink_many mm/vmscan.c:4851 [inline]
                lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951
                shrink_node mm/vmscan.c:5910 [inline]
                kswapd_shrink_node mm/vmscan.c:6720 [inline]
                balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911
                kswapd+0x5ea/0xbf0 mm/vmscan.c:7180
                kthread+0x2c1/0x3a0 kernel/kthread.c:389
                ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
                ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
      
         other info that might help us debug this:
      
         Chain exists of:
           sb_internal#3 --> btrfs_trans_num_extwriters --> fs_reclaim
      
          Possible unsafe locking scenario:
      
                CPU0                    CPU1
                ----                    ----
           lock(fs_reclaim);
                                        lock(btrfs_trans_num_extwriters);
                                        lock(fs_reclaim);
           rlock(sb_internal#3);
      
          *** DEADLOCK ***
      
         2 locks held by kswapd0/111:
          #0: ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924
          #1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_trylock_shared fs/super.c:562 [inline]
          #1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_cache_scan+0x96/0x550 fs/super.c:196
      
         stack backtrace:
         CPU: 0 PID: 111 Comm: kswapd0 Not tainted 6.10.0-rc2-syzkaller-00010-g2ab79514 #0
         Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
         Call Trace:
          <TASK>
          __dump_stack lib/dump_stack.c:88 [inline]
          dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:114
          check_noncircular+0x31a/0x400 kernel/locking/lockdep.c:2187
          check_prev_add kernel/locking/lockdep.c:3134 [inline]
          check_prevs_add kernel/locking/lockdep.c:3253 [inline]
          validate_chain kernel/locking/lockdep.c:3869 [inline]
          __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137
          lock_acquire kernel/locking/lockdep.c:5754 [inline]
          lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719
          percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
          __sb_start_write include/linux/fs.h:1655 [inline]
          sb_start_intwrite include/linux/fs.h:1838 [inline]
          start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694
          btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275
          btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291
          evict+0x2ed/0x6c0 fs/inode.c:667
          iput_final fs/inode.c:1741 [inline]
          iput.part.0+0x5a8/0x7f0 fs/inode.c:1767
          iput+0x5c/0x80 fs/inode.c:1757
          btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline]
          btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189
          super_cache_scan+0x409/0x550 fs/super.c:227
          do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435
          shrink_slab+0x18a/0x1310 mm/shrinker.c:662
          shrink_one+0x493/0x7c0 mm/vmscan.c:4790
          shrink_many mm/vmscan.c:4851 [inline]
          lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951
          shrink_node mm/vmscan.c:5910 [inline]
          kswapd_shrink_node mm/vmscan.c:6720 [inline]
          balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911
          kswapd+0x5ea/0xbf0 mm/vmscan.c:7180
          kthread+0x2c1/0x3a0 kernel/kthread.c:389
          ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
          ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
          </TASK>
      
      So fix this by using btrfs_add_delayed_iput() so that the final iput is
      delegated to the cleaner kthread.
      
      Link: https://lore.kernel.org/linux-btrfs/000000000000892280061a344581@google.com/
      Reported-by: syzbot+3dad89b3993a4b275e72@syzkaller.appspotmail.com
      Fixes: 956a17d9 ("btrfs: add a shrinker for extent maps")
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      68a3ebd1
  2. 04 Jul, 2024 2 commits
    • Boris Burkov's avatar
      btrfs: fix folio refcount in __alloc_dummy_extent_buffer() · a56c85fa
      Boris Burkov authored
      Another improper use of __folio_put() in an error path after freshly
      allocating pages/folios which returns them with the refcount initialized
      to 1. The refactor from __free_pages() -> __folio_put() (instead of
      folio_put) removed a refcount decrement found in __free_pages() and
      folio_put but absent from __folio_put().
      
      Fixes: 13df3775 ("btrfs: cleanup metadata page pointer usage")
      CC: stable@vger.kernel.org # 6.8+
      Tested-by: default avatarEd Tomlinson <edtoml@gmail.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a56c85fa
    • Boris Burkov's avatar
      btrfs: fix folio refcount in btrfs_do_encoded_write() · da0386c1
      Boris Burkov authored
      The conversion to folios switched __free_page() to __folio_put() in the
      error path in btrfs_do_encoded_write().
      
      However, this gets the page refcounting wrong. If we do hit that error
      path (I reproduced by modifying btrfs_do_encoded_write to pretend to
      always fail in a way that jumps to out_folios and running the fstests
      case btrfs/281), then we always hit the following BUG freeing the folio:
      
        BUG: Bad page state in process btrfs  pfn:40ab0b
        page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x61be5 pfn:0x40ab0b
         flags: 0x5ffff0000000000(node=0|zone=2|lastcpupid=0x1ffff)
        raw: 05ffff0000000000 0000000000000000 dead000000000122 0000000000000000
        raw: 0000000000061be5 0000000000000000 00000001ffffffff 0000000000000000
        page dumped because: nonzero _refcount
        Call Trace:
        <TASK>
        dump_stack_lvl+0x3d/0xe0
        bad_page+0xea/0xf0
        free_unref_page+0x8e1/0x900
        ? __mem_cgroup_uncharge+0x69/0x90
        __folio_put+0xe6/0x190
        btrfs_do_encoded_write+0x445/0x780
        ? current_time+0x25/0xd0
        btrfs_do_write_iter+0x2cc/0x4b0
        btrfs_ioctl_encoded_write+0x2b6/0x340
      
      It turns out __free_page() decreases the page reference count while
      __folio_put() does not. Switch __folio_put() to folio_put() which
      decreases the folio reference count first.
      
      Fixes: 400b172b ("btrfs: compression: migrate compression/decompression paths to folios")
      Tested-by: default avatarEd Tomlinson <edtoml@gmail.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      da0386c1
  3. 02 Jul, 2024 3 commits
    • Filipe Manana's avatar
      btrfs: fix uninitialized return value in the ref-verify tool · 9da45c88
      Filipe Manana authored
      In the ref-verify tool, when processing the inline references of an extent
      item, we may end up returning with uninitialized return value, because:
      
      1) The 'ret' variable is not initialized if there are no inline extent
         references ('ptr' == 'end' before the while loop starts);
      
      2) If we find an extent owner inline reference we don't initialize 'ret'.
      
      So fix these cases by initializing 'ret' to 0 when declaring the variable
      and set it to -EINVAL if we find an extent owner inline references and
      simple quotas are not enabled (as well as print an error message).
      Reported-by: default avatarMirsad Todorovac <mtodorovac69@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/59b40ebe-c824-457d-8b24-0bbca69d472b@gmail.com/Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9da45c88
    • Qu Wenruo's avatar
      btrfs: always do the basic checks for btrfs_qgroup_inherit structure · 724d8042
      Qu Wenruo authored
      [BUG]
      Syzbot reports the following regression detected by KASAN:
      
        BUG: KASAN: slab-out-of-bounds in btrfs_qgroup_inherit+0x42e/0x2e20 fs/btrfs/qgroup.c:3277
        Read of size 8 at addr ffff88814628ca50 by task syz-executor318/5171
      
        CPU: 0 PID: 5171 Comm: syz-executor318 Not tainted 6.10.0-rc2-syzkaller-00010-g2ab79514 #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024
        Call Trace:
         <TASK>
         __dump_stack lib/dump_stack.c:88 [inline]
         dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
         print_address_description mm/kasan/report.c:377 [inline]
         print_report+0x169/0x550 mm/kasan/report.c:488
         kasan_report+0x143/0x180 mm/kasan/report.c:601
         btrfs_qgroup_inherit+0x42e/0x2e20 fs/btrfs/qgroup.c:3277
         create_pending_snapshot+0x1359/0x29b0 fs/btrfs/transaction.c:1854
         create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1922
         btrfs_commit_transaction+0xf20/0x3740 fs/btrfs/transaction.c:2382
         create_snapshot+0x6a1/0x9e0 fs/btrfs/ioctl.c:875
         btrfs_mksubvol+0x58f/0x710 fs/btrfs/ioctl.c:1029
         btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1075
         __btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1340
         btrfs_ioctl_snap_create_v2+0x1f2/0x3a0 fs/btrfs/ioctl.c:1422
         btrfs_ioctl+0x99e/0xc60
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:907 [inline]
         __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:893
         do_syscall_x64 arch/x86/entry/common.c:52 [inline]
         do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
         entry_SYSCALL_64_after_hwframe+0x77/0x7f
        RIP: 0033:0x7fcbf1992509
        RSP: 002b:00007fcbf1928218 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 00007fcbf1a1f618 RCX: 00007fcbf1992509
        RDX: 0000000020000280 RSI: 0000000050009417 RDI: 0000000000000003
        RBP: 00007fcbf1a1f610 R08: 00007ffea1298e97 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000246 R12: 00007fcbf19eb660
        R13: 00000000200002b8 R14: 00007fcbf19e60c0 R15: 0030656c69662f2e
         </TASK>
      
      And it also pinned it down to commit b5357cb2 ("btrfs: qgroup: do not
      check qgroup inherit if qgroup is disabled").
      
      [CAUSE]
      That offending commit skips the whole qgroup inherit check if qgroup is
      not enabled.
      
      But that also skips the very basic checks like
      num_ref_copies/num_excl_copies and the structure size checks.
      
      Meaning if a qgroup enable/disable race is happening at the background,
      and we pass a btrfs_qgroup_inherit structure when the qgroup is
      disabled, the check would be completely skipped.
      
      Then at the time of transaction commitment, qgroup is re-enabled and
      btrfs_qgroup_inherit() is going to use the incorrect structure and
      causing the above KASAN error.
      
      [FIX]
      Make btrfs_qgroup_check_inherit() only skip the source qgroup checks.
      So that even if invalid btrfs_qgroup_inherit structure is passed in, we
      can still reject invalid ones no matter if qgroup is enabled or not.
      
      Furthermore we do already have an extra safety inside
      btrfs_qgroup_inherit(), which would just ignore invalid qgroup sources,
      so even if we only skip the qgroup source check we're still safe.
      
      Reported-by: syzbot+a0d1f7e26910be4dc171@syzkaller.appspotmail.com
      Fixes: b5357cb2 ("btrfs: qgroup: do not check qgroup inherit if qgroup is disabled")
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarJeongjun Park <aha310510@gmail.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      724d8042
    • Naohiro Aota's avatar
      btrfs: zoned: fix calc_available_free_space() for zoned mode · 64d2c847
      Naohiro Aota authored
      calc_available_free_space() returns the total size of metadata (or
      system) block groups, which can be allocated from unallocated disk
      space. The logic is wrong on zoned mode in two places.
      
      First, the calculation of data_chunk_size is wrong. We always allocate
      one zone as one chunk, and no partial allocation of a zone. So, we
      should use zone_size (= data_sinfo->chunk_size) as it is.
      
      Second, the result "avail" may not be zone aligned. Since we always
      allocate one zone as one chunk on zoned mode, returning non-zone size
      aligned bytes will result in less pressure on the async metadata reclaim
      process.
      
      This is serious for the nearly full state with a large zone size device.
      Allowing over-commit too much will result in less async reclaim work and
      end up in ENOSPC. We can align down to the zone size to avoid that.
      
      Fixes: cb6cbab7 ("btrfs: adjust overcommit logic when very close to full")
      CC: stable@vger.kernel.org # 6.9
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      64d2c847
  4. 01 Jul, 2024 1 commit
    • Naohiro Aota's avatar
      btrfs: fix adding block group to a reclaim list and the unused list during reclaim · 48f091fd
      Naohiro Aota authored
      There is a potential parallel list adding for retrying in
      btrfs_reclaim_bgs_work and adding to the unused list. Since the block
      group is removed from the reclaim list and it is on a relocation work,
      it can be added into the unused list in parallel. When that happens,
      adding it to the reclaim list will corrupt the list head and trigger
      list corruption like below.
      
      Fix it by taking fs_info->unused_bgs_lock.
      
        [177.504][T2585409] BTRFS error (device nullb1): error relocating ch= unk 2415919104
        [177.514][T2585409] list_del corruption. next->prev should be ff1100= 0344b119c0, but was ff11000377e87c70. (next=3Dff110002390cd9c0)
        [177.529][T2585409] ------------[ cut here ]------------
        [177.537][T2585409] kernel BUG at lib/list_debug.c:65!
        [177.545][T2585409] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
        [177.555][T2585409] CPU: 9 PID: 2585409 Comm: kworker/u128:2 Tainted: G        W          6.10.0-rc5-kts #1
        [177.568][T2585409] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022
        [177.579][T2585409] Workqueue: events_unbound btrfs_reclaim_bgs_work[btrfs]
        [177.589][T2585409] RIP: 0010:__list_del_entry_valid_or_report.cold+0x70/0x72
        [177.624][T2585409] RSP: 0018:ff11000377e87a70 EFLAGS: 00010286
        [177.633][T2585409] RAX: 000000000000006d RBX: ff11000344b119c0 RCX:0000000000000000
        [177.644][T2585409] RDX: 000000000000006d RSI: 0000000000000008 RDI:ffe21c006efd0f40
        [177.655][T2585409] RBP: ff110002e0509f78 R08: 0000000000000001 R09:ffe21c006efd0f08
        [177.665][T2585409] R10: ff11000377e87847 R11: 0000000000000000 R12:ff110002390cd9c0
        [177.676][T2585409] R13: ff11000344b119c0 R14: ff110002e0508000 R15:dffffc0000000000
        [177.687][T2585409] FS:  0000000000000000(0000) GS:ff11000fec880000(0000) knlGS:0000000000000000
        [177.700][T2585409] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [177.709][T2585409] CR2: 00007f06bc7b1978 CR3: 0000001021e86005 CR4:0000000000771ef0
        [177.720][T2585409] DR0: 0000000000000000 DR1: 0000000000000000 DR2:0000000000000000
        [177.731][T2585409] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:0000000000000400
        [177.742][T2585409] PKRU: 55555554
        [177.748][T2585409] Call Trace:
        [177.753][T2585409]  <TASK>
        [177.759][T2585409]  ? __die_body.cold+0x19/0x27
        [177.766][T2585409]  ? die+0x2e/0x50
        [177.772][T2585409]  ? do_trap+0x1ea/0x2d0
        [177.779][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
        [177.788][T2585409]  ? do_error_trap+0xa3/0x160
        [177.795][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
        [177.805][T2585409]  ? handle_invalid_op+0x2c/0x40
        [177.812][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
        [177.820][T2585409]  ? exc_invalid_op+0x2d/0x40
        [177.827][T2585409]  ? asm_exc_invalid_op+0x1a/0x20
        [177.834][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
        [177.843][T2585409]  btrfs_delete_unused_bgs+0x3d9/0x14c0 [btrfs]
      
      There is a similar retry_list code in btrfs_delete_unused_bgs(), but it is
      safe, AFAICS. Since the block group was in the unused list, the used bytes
      should be 0 when it was added to the unused list. Then, it checks
      block_group->{used,reserved,pinned} are still 0 under the
      block_group->lock. So, they should be still eligible for the unused list,
      not the reclaim list.
      
      The reason it is safe there it's because because we're holding
      space_info->groups_sem in write mode.
      
      That means no other task can allocate from the block group, so while we
      are at deleted_unused_bgs() it's not possible for other tasks to
      allocate and deallocate extents from the block group, so it can't be
      added to the unused list or the reclaim list by anyone else.
      
      The bug can be reproduced by btrfs/166 after a few rounds. In practice
      this can be hit when relocation cannot find more chunk space and ends
      with ENOSPC.
      Reported-by: default avatarShinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Suggested-by: default avatarJohannes Thumshirn <Johannes.Thumshirn@wdc.com>
      Fixes: 4eb4e85c ("btrfs: retry block group reclaim without infinite loop")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      48f091fd
  5. 24 Jun, 2024 4 commits
    • Filipe Manana's avatar
      btrfs: qgroup: fix quota root leak after quota disable failure · a7e4c6a3
      Filipe Manana authored
      If during the quota disable we fail when cleaning the quota tree or when
      deleting the root from the root tree, we jump to the 'out' label without
      ever dropping the reference on the quota root, resulting in a leak of the
      root since fs_info->quota_root is no longer pointing to the root (we have
      set it to NULL just before those steps).
      
      Fix this by always doing a btrfs_put_root() call under the 'out' label.
      This is a problem that exists since qgroups were first added in 2012 by
      commit bed92eae ("Btrfs: qgroup implementation and prototypes"), but
      back then we missed a kfree on the quota root and free_extent_buffer()
      calls on its root and commit root nodes, since back then roots were not
      yet reference counted.
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a7e4c6a3
    • Qu Wenruo's avatar
      btrfs: scrub: handle RST lookup error correctly · 2c499086
      Qu Wenruo authored
      [BUG]
      When running btrfs/060 with forced RST feature, it would crash the
      following ASSERT() inside scrub_read_endio():
      
      	ASSERT(sector_nr < stripe->nr_sectors);
      
      Before that, we would have tree dump from
      btrfs_get_raid_extent_offset(), as we failed to find the RST entry for
      the range.
      
      [CAUSE]
      Inside scrub_submit_extent_sector_read() every time we allocated a new
      bbio we immediately called btrfs_map_block() to make sure there was some
      RST range covering the scrub target.
      
      But if btrfs_map_block() fails, we immediately call endio for the bbio,
      while the bbio is newly allocated, it's completely empty.
      
      Then inside scrub_read_endio(), we go through the bvecs to find
      the sector number (as bi_sector is no longer reliable if the bio is
      submitted to lower layers).
      
      And since the bio is empty, such bvecs iteration would not find any
      sector matching the sector, and return sector_nr == stripe->nr_sectors,
      triggering the ASSERT().
      
      [FIX]
      Instead of calling btrfs_map_block() after allocating a new bbio, call
      btrfs_map_block() first.
      
      Since our only objective of calling btrfs_map_block() is only to update
      stripe_len, there is really no need to do that after btrfs_alloc_bio().
      
      This new timing would avoid the problem of handling empty bbio
      completely, and in fact fixes a possible race window for the old code,
      where if the submission thread is the only owner of the pending_io, the
      scrub would never finish (since we didn't decrease the pending_io
      counter).
      
      Although the root cause of RST lookup failure still needs to be
      addressed.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2c499086
    • Naohiro Aota's avatar
      btrfs: zoned: fix initial free space detection · b9fd2aff
      Naohiro Aota authored
      When creating a new block group, it calls btrfs_add_new_free_space() to add
      the entire block group range into the free space accounting.
      __btrfs_add_free_space_zoned() checks if size == block_group->length to
      detect the initial free space adding, and proceed that case properly.
      
      However, if the zone_capacity == zone_size and the over-write speed is fast
      enough, the entire zone can be over-written within one transaction. That
      confuses __btrfs_add_free_space_zoned() to handle it as an initial free
      space accounting. As a result, that block group becomes a strange state: 0
      used bytes, 0 zone_unusable bytes, but alloc_offset == zone_capacity (no
      allocation anymore).
      
      The initial free space accounting can properly be checked by checking
      alloc_offset too.
      
      Fixes: 98173255 ("btrfs: zoned: calculate free space from zone capacity")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b9fd2aff
    • Filipe Manana's avatar
      btrfs: use NOFS context when getting inodes during logging and log replay · d1825752
      Filipe Manana authored
      During inode logging (and log replay too), we are holding a transaction
      handle and we often need to call btrfs_iget(), which will read an inode
      from its subvolume btree if it's not loaded in memory and that results in
      allocating an inode with GFP_KERNEL semantics at the btrfs_alloc_inode()
      callback - and this may recurse into the filesystem in case we are under
      memory pressure and attempt to commit the current transaction, resulting
      in a deadlock since the logging (or log replay) task is holding a
      transaction handle open.
      
      Syzbot reported this with the following stack traces:
      
        WARNING: possible circular locking dependency detected
        6.10.0-rc2-syzkaller-00361-g061d1af7 #0 Not tainted
        ------------------------------------------------------
        syz-executor.1/9919 is trying to acquire lock:
        ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:334 [inline]
        ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:3891 [inline]
        ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:3981 [inline]
        ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4020
      
        but task is already holding lock:
        ffff88804b569358 (&ei->log_mutex){+.+.}-{3:3}, at: btrfs_log_inode+0x39c/0x4660 fs/btrfs/tree-log.c:6481
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (&ei->log_mutex){+.+.}-{3:3}:
               __mutex_lock_common kernel/locking/mutex.c:608 [inline]
               __mutex_lock+0x175/0x9c0 kernel/locking/mutex.c:752
               btrfs_log_inode+0x39c/0x4660 fs/btrfs/tree-log.c:6481
               btrfs_log_inode_parent+0x8cb/0x2a90 fs/btrfs/tree-log.c:7079
               btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7180
               btrfs_sync_file+0x9c1/0xe10 fs/btrfs/file.c:1959
               vfs_fsync_range+0x141/0x230 fs/sync.c:188
               generic_write_sync include/linux/fs.h:2794 [inline]
               btrfs_do_write_iter+0x584/0x10c0 fs/btrfs/file.c:1705
               new_sync_write fs/read_write.c:497 [inline]
               vfs_write+0x6b6/0x1140 fs/read_write.c:590
               ksys_write+0x12f/0x260 fs/read_write.c:643
               do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
               __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
               do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
               entry_SYSENTER_compat_after_hwframe+0x84/0x8e
      
        -> #2 (btrfs_trans_num_extwriters){++++}-{0:0}:
               join_transaction+0x164/0xf40 fs/btrfs/transaction.c:315
               start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700
               btrfs_commit_super+0xa1/0x110 fs/btrfs/disk-io.c:4170
               close_ctree+0xcb0/0xf90 fs/btrfs/disk-io.c:4324
               generic_shutdown_super+0x159/0x3d0 fs/super.c:642
               kill_anon_super+0x3a/0x60 fs/super.c:1226
               btrfs_kill_super+0x3b/0x50 fs/btrfs/super.c:2096
               deactivate_locked_super+0xbe/0x1a0 fs/super.c:473
               deactivate_super+0xde/0x100 fs/super.c:506
               cleanup_mnt+0x222/0x450 fs/namespace.c:1267
               task_work_run+0x14e/0x250 kernel/task_work.c:180
               resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
               exit_to_user_mode_loop kernel/entry/common.c:114 [inline]
               exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
               __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
               syscall_exit_to_user_mode+0x278/0x2a0 kernel/entry/common.c:218
               __do_fast_syscall_32+0x80/0x120 arch/x86/entry/common.c:389
               do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
               entry_SYSENTER_compat_after_hwframe+0x84/0x8e
      
        -> #1 (btrfs_trans_num_writers){++++}-{0:0}:
               __lock_release kernel/locking/lockdep.c:5468 [inline]
               lock_release+0x33e/0x6c0 kernel/locking/lockdep.c:5774
               percpu_up_read include/linux/percpu-rwsem.h:99 [inline]
               __sb_end_write include/linux/fs.h:1650 [inline]
               sb_end_intwrite include/linux/fs.h:1767 [inline]
               __btrfs_end_transaction+0x5ca/0x920 fs/btrfs/transaction.c:1071
               btrfs_commit_inode_delayed_inode+0x228/0x330 fs/btrfs/delayed-inode.c:1301
               btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291
               evict+0x2ed/0x6c0 fs/inode.c:667
               iput_final fs/inode.c:1741 [inline]
               iput.part.0+0x5a8/0x7f0 fs/inode.c:1767
               iput+0x5c/0x80 fs/inode.c:1757
               dentry_unlink_inode+0x295/0x480 fs/dcache.c:400
               __dentry_kill+0x1d0/0x600 fs/dcache.c:603
               dput.part.0+0x4b1/0x9b0 fs/dcache.c:845
               dput+0x1f/0x30 fs/dcache.c:835
               ovl_stack_put+0x60/0x90 fs/overlayfs/util.c:132
               ovl_destroy_inode+0xc6/0x190 fs/overlayfs/super.c:182
               destroy_inode+0xc4/0x1b0 fs/inode.c:311
               iput_final fs/inode.c:1741 [inline]
               iput.part.0+0x5a8/0x7f0 fs/inode.c:1767
               iput+0x5c/0x80 fs/inode.c:1757
               dentry_unlink_inode+0x295/0x480 fs/dcache.c:400
               __dentry_kill+0x1d0/0x600 fs/dcache.c:603
               shrink_kill fs/dcache.c:1048 [inline]
               shrink_dentry_list+0x140/0x5d0 fs/dcache.c:1075
               prune_dcache_sb+0xeb/0x150 fs/dcache.c:1156
               super_cache_scan+0x32a/0x550 fs/super.c:221
               do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435
               shrink_slab_memcg mm/shrinker.c:548 [inline]
               shrink_slab+0xa87/0x1310 mm/shrinker.c:626
               shrink_one+0x493/0x7c0 mm/vmscan.c:4790
               shrink_many mm/vmscan.c:4851 [inline]
               lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951
               shrink_node mm/vmscan.c:5910 [inline]
               kswapd_shrink_node mm/vmscan.c:6720 [inline]
               balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911
               kswapd+0x5ea/0xbf0 mm/vmscan.c:7180
               kthread+0x2c1/0x3a0 kernel/kthread.c:389
               ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
               ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
      
        -> #0 (fs_reclaim){+.+.}-{0:0}:
               check_prev_add kernel/locking/lockdep.c:3134 [inline]
               check_prevs_add kernel/locking/lockdep.c:3253 [inline]
               validate_chain kernel/locking/lockdep.c:3869 [inline]
               __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137
               lock_acquire kernel/locking/lockdep.c:5754 [inline]
               lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719
               __fs_reclaim_acquire mm/page_alloc.c:3801 [inline]
               fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3815
               might_alloc include/linux/sched/mm.h:334 [inline]
               slab_pre_alloc_hook mm/slub.c:3891 [inline]
               slab_alloc_node mm/slub.c:3981 [inline]
               kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4020
               btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411
               alloc_inode+0x5d/0x230 fs/inode.c:261
               iget5_locked fs/inode.c:1235 [inline]
               iget5_locked+0x1c9/0x2c0 fs/inode.c:1228
               btrfs_iget_locked fs/btrfs/inode.c:5590 [inline]
               btrfs_iget_path fs/btrfs/inode.c:5607 [inline]
               btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636
               add_conflicting_inode fs/btrfs/tree-log.c:5657 [inline]
               copy_inode_items_to_log+0x1039/0x1e30 fs/btrfs/tree-log.c:5928
               btrfs_log_inode+0xa48/0x4660 fs/btrfs/tree-log.c:6592
               log_new_delayed_dentries fs/btrfs/tree-log.c:6363 [inline]
               btrfs_log_inode+0x27dd/0x4660 fs/btrfs/tree-log.c:6718
               btrfs_log_all_parents fs/btrfs/tree-log.c:6833 [inline]
               btrfs_log_inode_parent+0x22ba/0x2a90 fs/btrfs/tree-log.c:7141
               btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7180
               btrfs_sync_file+0x9c1/0xe10 fs/btrfs/file.c:1959
               vfs_fsync_range+0x141/0x230 fs/sync.c:188
               generic_write_sync include/linux/fs.h:2794 [inline]
               btrfs_do_write_iter+0x584/0x10c0 fs/btrfs/file.c:1705
               do_iter_readv_writev+0x504/0x780 fs/read_write.c:741
               vfs_writev+0x36f/0xde0 fs/read_write.c:971
               do_pwritev+0x1b2/0x260 fs/read_write.c:1072
               __do_compat_sys_pwritev2 fs/read_write.c:1218 [inline]
               __se_compat_sys_pwritev2 fs/read_write.c:1210 [inline]
               __ia32_compat_sys_pwritev2+0x121/0x1b0 fs/read_write.c:1210
               do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
               __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
               do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
               entry_SYSENTER_compat_after_hwframe+0x84/0x8e
      
        other info that might help us debug this:
      
        Chain exists of:
          fs_reclaim --> btrfs_trans_num_extwriters --> &ei->log_mutex
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(&ei->log_mutex);
                                       lock(btrfs_trans_num_extwriters);
                                       lock(&ei->log_mutex);
          lock(fs_reclaim);
      
         *** DEADLOCK ***
      
        7 locks held by syz-executor.1/9919:
         #0: ffff88802be20420 (sb_writers#23){.+.+}-{0:0}, at: do_pwritev+0x1b2/0x260 fs/read_write.c:1072
         #1: ffff888065c0f8f0 (&sb->s_type->i_mutex_key#33){++++}-{3:3}, at: inode_lock include/linux/fs.h:791 [inline]
         #1: ffff888065c0f8f0 (&sb->s_type->i_mutex_key#33){++++}-{3:3}, at: btrfs_inode_lock+0xc8/0x110 fs/btrfs/inode.c:385
         #2: ffff888065c0f778 (&ei->i_mmap_lock){++++}-{3:3}, at: btrfs_inode_lock+0xee/0x110 fs/btrfs/inode.c:388
         #3: ffff88802be20610 (sb_internal#4){.+.+}-{0:0}, at: btrfs_sync_file+0x95b/0xe10 fs/btrfs/file.c:1952
         #4: ffff8880546323f0 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x430/0xf40 fs/btrfs/transaction.c:290
         #5: ffff888054632418 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x430/0xf40 fs/btrfs/transaction.c:290
         #6: ffff88804b569358 (&ei->log_mutex){+.+.}-{3:3}, at: btrfs_log_inode+0x39c/0x4660 fs/btrfs/tree-log.c:6481
      
        stack backtrace:
        CPU: 2 PID: 9919 Comm: syz-executor.1 Not tainted 6.10.0-rc2-syzkaller-00361-g061d1af7 #0
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
        Call Trace:
         <TASK>
         __dump_stack lib/dump_stack.c:88 [inline]
         dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:114
         check_noncircular+0x31a/0x400 kernel/locking/lockdep.c:2187
         check_prev_add kernel/locking/lockdep.c:3134 [inline]
         check_prevs_add kernel/locking/lockdep.c:3253 [inline]
         validate_chain kernel/locking/lockdep.c:3869 [inline]
         __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137
         lock_acquire kernel/locking/lockdep.c:5754 [inline]
         lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719
         __fs_reclaim_acquire mm/page_alloc.c:3801 [inline]
         fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3815
         might_alloc include/linux/sched/mm.h:334 [inline]
         slab_pre_alloc_hook mm/slub.c:3891 [inline]
         slab_alloc_node mm/slub.c:3981 [inline]
         kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4020
         btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411
         alloc_inode+0x5d/0x230 fs/inode.c:261
         iget5_locked fs/inode.c:1235 [inline]
         iget5_locked+0x1c9/0x2c0 fs/inode.c:1228
         btrfs_iget_locked fs/btrfs/inode.c:5590 [inline]
         btrfs_iget_path fs/btrfs/inode.c:5607 [inline]
         btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636
         add_conflicting_inode fs/btrfs/tree-log.c:5657 [inline]
         copy_inode_items_to_log+0x1039/0x1e30 fs/btrfs/tree-log.c:5928
         btrfs_log_inode+0xa48/0x4660 fs/btrfs/tree-log.c:6592
         log_new_delayed_dentries fs/btrfs/tree-log.c:6363 [inline]
         btrfs_log_inode+0x27dd/0x4660 fs/btrfs/tree-log.c:6718
         btrfs_log_all_parents fs/btrfs/tree-log.c:6833 [inline]
         btrfs_log_inode_parent+0x22ba/0x2a90 fs/btrfs/tree-log.c:7141
         btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7180
         btrfs_sync_file+0x9c1/0xe10 fs/btrfs/file.c:1959
         vfs_fsync_range+0x141/0x230 fs/sync.c:188
         generic_write_sync include/linux/fs.h:2794 [inline]
         btrfs_do_write_iter+0x584/0x10c0 fs/btrfs/file.c:1705
         do_iter_readv_writev+0x504/0x780 fs/read_write.c:741
         vfs_writev+0x36f/0xde0 fs/read_write.c:971
         do_pwritev+0x1b2/0x260 fs/read_write.c:1072
         __do_compat_sys_pwritev2 fs/read_write.c:1218 [inline]
         __se_compat_sys_pwritev2 fs/read_write.c:1210 [inline]
         __ia32_compat_sys_pwritev2+0x121/0x1b0 fs/read_write.c:1210
         do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
         __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
         do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
         entry_SYSENTER_compat_after_hwframe+0x84/0x8e
        RIP: 0023:0xf7334579
        Code: b8 01 10 06 03 (...)
        RSP: 002b:00000000f5f265ac EFLAGS: 00000292 ORIG_RAX: 000000000000017b
        RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00000000200002c0
        RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
        RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000292 R12: 0000000000000000
        R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      
      Fix this by ensuring we are under a NOFS scope whenever we call
      btrfs_iget() during inode logging and log replay.
      
      Reported-by: syzbot+8576cfa84070dce4d59b@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-btrfs/000000000000274a3a061abbd928@google.com/
      Fixes: 712e36c5 ("btrfs: use GFP_KERNEL in btrfs_alloc_inode")
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d1825752
  6. 13 Jun, 2024 2 commits
    • Johannes Thumshirn's avatar
      btrfs: zoned: allocate dummy checksums for zoned NODATASUM writes · cebae292
      Johannes Thumshirn authored
      Shin'ichiro reported that when he's running fstests' test-case
      btrfs/167 on emulated zoned devices, he's seeing the following NULL
      pointer dereference in 'btrfs_zone_finish_endio()':
      
        Oops: general protection fault, probably for non-canonical address 0xdffffc0000000011: 0000 [#1] PREEMPT SMP KASAN NOPTI
        KASAN: null-ptr-deref in range [0x0000000000000088-0x000000000000008f]
        CPU: 4 PID: 23324407 Comm: kworker/u80:15 Tainted: G        W          6.10.0-rc2-kts+ #4
        Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
        Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
        RIP: 0010:btrfs_zone_finish_endio.part.0+0x34/0x160 [btrfs]
      
        RSP: 0018:ffff88867f107a90 EFLAGS: 00010206
        RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff893e5534
        RDX: 0000000000000011 RSI: 0000000000000004 RDI: 0000000000000088
        RBP: 0000000000000002 R08: 0000000000000001 R09: ffffed1081696028
        R10: ffff88840b4b0143 R11: ffff88834dfff600 R12: ffff88840b4b0000
        R13: 0000000000020000 R14: 0000000000000000 R15: ffff888530ad5210
        FS:  0000000000000000(0000) GS:ffff888e3f800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f87223fff38 CR3: 00000007a7c6a002 CR4: 00000000007706f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        PKRU: 55555554
        Call Trace:
         <TASK>
         ? __die_body.cold+0x19/0x27
         ? die_addr+0x46/0x70
         ? exc_general_protection+0x14f/0x250
         ? asm_exc_general_protection+0x26/0x30
         ? do_raw_read_unlock+0x44/0x70
         ? btrfs_zone_finish_endio.part.0+0x34/0x160 [btrfs]
         btrfs_finish_one_ordered+0x5d9/0x19a0 [btrfs]
         ? __pfx_lock_release+0x10/0x10
         ? do_raw_write_lock+0x90/0x260
         ? __pfx_do_raw_write_lock+0x10/0x10
         ? __pfx_btrfs_finish_one_ordered+0x10/0x10 [btrfs]
         ? _raw_write_unlock+0x23/0x40
         ? btrfs_finish_ordered_zoned+0x5a9/0x850 [btrfs]
         ? lock_acquire+0x435/0x500
         btrfs_work_helper+0x1b1/0xa70 [btrfs]
         ? __schedule+0x10a8/0x60b0
         ? __pfx___might_resched+0x10/0x10
         process_one_work+0x862/0x1410
         ? __pfx_lock_acquire+0x10/0x10
         ? __pfx_process_one_work+0x10/0x10
         ? assign_work+0x16c/0x240
         worker_thread+0x5e6/0x1010
         ? __pfx_worker_thread+0x10/0x10
         kthread+0x2c3/0x3a0
         ? trace_irq_enable.constprop.0+0xce/0x110
         ? __pfx_kthread+0x10/0x10
         ret_from_fork+0x31/0x70
         ? __pfx_kthread+0x10/0x10
         ret_from_fork_asm+0x1a/0x30
         </TASK>
      
      Enabling CONFIG_BTRFS_ASSERT revealed the following assertion to
      trigger:
      
        assertion failed: !list_empty(&ordered->list), in fs/btrfs/zoned.c:1815
      
      This indicates, that we're missing the checksums list on the
      ordered_extent. As btrfs/167 is doing a NOCOW write this is to be
      expected.
      
      Further analysis with drgn confirmed the assumption:
      
        >>> inode = prog.crashed_thread().stack_trace()[11]['ordered'].inode
        >>> btrfs_inode = drgn.container_of(inode, "struct btrfs_inode", \
               				"vfs_inode")
        >>> print(btrfs_inode.flags)
        (u32)1
      
      As zoned emulation mode simulates conventional zones on regular devices,
      we cannot use zone-append for writing. But we're only attaching dummy
      checksums if we're doing a zone-append write.
      
      So for NOCOW zoned data writes on conventional zones, also attach a
      dummy checksum.
      Reported-by: default avatarShinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Fixes: cbfce4c7 ("btrfs: optimize the logical to physical mapping for zoned writes")
      CC: Naohiro Aota <Naohiro.Aota@wdc.com> # 6.6+
      Tested-by: default avatarShin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cebae292
    • Boris Burkov's avatar
      btrfs: retry block group reclaim without infinite loop · 4eb4e85c
      Boris Burkov authored
      If inc_block_group_ro systematically fails (e.g. due to ETXTBUSY from
      swap) or btrfs_relocate_chunk systematically fails (from lack of
      space), then this worker becomes an infinite loop.
      
      At the very least, this strands the cleaner thread, but can also result
      in hung tasks/RCU stalls on PREEMPT_NONE kernels and if the
      reclaim_bgs_lock mutex is not contended.
      
      I believe the best long term fix is to manage reclaim via work queue,
      where we queue up a relocation on the triggering condition and re-queue
      on failure. In the meantime, this is an easy fix to apply to avoid the
      immediate pain.
      
      Fixes: 7e271809 ("btrfs: reinsert BGs failed to reclaim")
      CC: stable@vger.kernel.org # 6.6+
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4eb4e85c
  7. 06 Jun, 2024 1 commit
    • Qu Wenruo's avatar
      btrfs: protect folio::private when attaching extent buffer folios · f3a5367c
      Qu Wenruo authored
      [BUG]
      Since v6.8 there are rare kernel crashes reported by various people,
      the common factor is bad page status error messages like this:
      
        BUG: Bad page state in process kswapd0  pfn:d6e840
        page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c
        pfn:0xd6e840
        aops:btree_aops ino:1
        flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff)
        page_type: 0xffffffff()
        raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0
        raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000
        page dumped because: non-NULL mapping
      
      [CAUSE]
      Commit 09e6cef1 ("btrfs: refactor alloc_extent_buffer() to
      allocate-then-attach method") changes the sequence when allocating a new
      extent buffer.
      
      Previously we always called grab_extent_buffer() under
      mapping->i_private_lock, to ensure the safety on modification on
      folio::private (which is a pointer to extent buffer for regular
      sectorsize).
      
      This can lead to the following race:
      
      Thread A is trying to allocate an extent buffer at bytenr X, with 4
      4K pages, meanwhile thread B is trying to release the page at X + 4K
      (the second page of the extent buffer at X).
      
                 Thread A                |                 Thread B
      -----------------------------------+-------------------------------------
                                         | btree_release_folio()
      				   | | This is for the page at X + 4K,
      				   | | Not page X.
      				   | |
      alloc_extent_buffer()              | |- release_extent_buffer()
      |- filemap_add_folio() for the     | |  |- atomic_dec_and_test(eb->refs)
      |  page at bytenr X (the first     | |  |
      |  page).                          | |  |
      |  Which returned -EEXIST.         | |  |
      |                                  | |  |
      |- filemap_lock_folio()            | |  |
      |  Returned the first page locked. | |  |
      |                                  | |  |
      |- grab_extent_buffer()            | |  |
      |  |- atomic_inc_not_zero()        | |  |
      |  |  Returned false               | |  |
      |  |- folio_detach_private()       | |  |- folio_detach_private() for X
      |     |- folio_test_private()      | |     |- folio_test_private()
            |  Returned true             | |     |  Returned true
            |- folio_put()               |       |- folio_put()
      
      Now there are two puts on the same folio at folio X, leading to refcount
      underflow of the folio X, and eventually causing the BUG_ON() on the
      page->mapping.
      
      The condition is not that easy to hit:
      
      - The release must be triggered for the middle page of an eb
        If the release is on the same first page of an eb, page lock would kick
        in and prevent the race.
      
      - folio_detach_private() has a very small race window
        It's only between folio_test_private() and folio_clear_private().
      
      That's exactly when mapping->i_private_lock is used to prevent such race,
      and commit 09e6cef1 ("btrfs: refactor alloc_extent_buffer() to
      allocate-then-attach method") screwed that up.
      
      At that time, I thought the page lock would kick in as
      filemap_release_folio() also requires the page to be locked, but forgot
      the filemap_release_folio() only locks one page, not all pages of an
      extent buffer.
      
      [FIX]
      Move all the code requiring i_private_lock into
      attach_eb_folio_to_filemap(), so that everything is done with proper
      lock protection.
      
      Furthermore to prevent future problems, add an extra
      lockdep_assert_locked() to ensure we're holding the proper lock.
      
      To reproducer that is able to hit the race (takes a few minutes with
      instrumented code inserting delays to alloc_extent_buffer()):
      
        #!/bin/sh
        drop_caches () {
      	  while(true); do
      		  echo 3 > /proc/sys/vm/drop_caches
      		  echo 1 > /proc/sys/vm/compact_memory
      	  done
        }
      
        run_tar () {
      	  while(true); do
      		  for x in `seq 1 80` ; do
      			  tar cf /dev/zero /mnt > /dev/null &
      		  done
      		  wait
      	  done
        }
      
        mkfs.btrfs -f -d single -m single /dev/vda
        mount -o noatime /dev/vda /mnt
        # create 200,000 files, 1K each
        ./simoop -n 200000 -E -f 1k /mnt
        drop_caches &
        (run_tar)
      Reported-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/Reported-by: default avatarMikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/Reported-by: default avatarToralf Förster <toralf.foerster@gmx.de>
      Link: https://lore.kernel.org/linux-btrfs/e8b3311c-9a75-4903-907f-fc0f7a3fe423@gmx.de/
      Reported-by: syzbot+f80b066392366b4af85e@syzkaller.appspotmail.com
      Fixes: 09e6cef1 ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method")
      CC: stable@vger.kernel.org # 6.8+
      CC: Chris Mason <clm@fb.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f3a5367c
  8. 05 Jun, 2024 2 commits
    • Filipe Manana's avatar
      btrfs: fix leak of qgroup extent records after transaction abort · fb33eb2e
      Filipe Manana authored
      Qgroup extent records are created when delayed ref heads are created and
      then released after accounting extents at btrfs_qgroup_account_extents(),
      called during the transaction commit path.
      
      If a transaction is aborted we free the qgroup records by calling
      btrfs_qgroup_destroy_extent_records() at btrfs_destroy_delayed_refs(),
      unless we don't have delayed references. We are incorrectly assuming
      that no delayed references means we don't have qgroup extents records.
      
      We can currently have no delayed references because we ran them all
      during a transaction commit and the transaction was aborted after that
      due to some error in the commit path.
      
      So fix this by ensuring we btrfs_qgroup_destroy_extent_records() at
      btrfs_destroy_delayed_refs() even if we don't have any delayed references.
      
      Reported-by: syzbot+0fecc032fa134afd49df@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-btrfs/0000000000004e7f980619f91835@google.com/
      Fixes: 81f7eb00 ("btrfs: destroy qgroup extent records on transaction abort")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fb33eb2e
    • Omar Sandoval's avatar
      btrfs: fix crash on racing fsync and size-extending write into prealloc · 9d274c19
      Omar Sandoval authored
      We have been seeing crashes on duplicate keys in
      btrfs_set_item_key_safe():
      
        BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192)
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.c:2620!
        invalid opcode: 0000 [#1] PREEMPT SMP PTI
        CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
        RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs]
      
      With the following stack trace:
      
        #0  btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4)
        #1  btrfs_drop_extents (fs/btrfs/file.c:411:4)
        #2  log_one_extent (fs/btrfs/tree-log.c:4732:9)
        #3  btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9)
        #4  btrfs_log_inode (fs/btrfs/tree-log.c:6626:9)
        #5  btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8)
        #6  btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8)
        #7  btrfs_sync_file (fs/btrfs/file.c:1933:8)
        #8  vfs_fsync_range (fs/sync.c:188:9)
        #9  vfs_fsync (fs/sync.c:202:9)
        #10 do_fsync (fs/sync.c:212:9)
        #11 __do_sys_fdatasync (fs/sync.c:225:9)
        #12 __se_sys_fdatasync (fs/sync.c:223:1)
        #13 __x64_sys_fdatasync (fs/sync.c:223:1)
        #14 do_syscall_x64 (arch/x86/entry/common.c:52:14)
        #15 do_syscall_64 (arch/x86/entry/common.c:83:7)
        #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121)
      
      So we're logging a changed extent from fsync, which is splitting an
      extent in the log tree. But this split part already exists in the tree,
      triggering the BUG().
      
      This is the state of the log tree at the time of the crash, dumped with
      drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py)
      to get more details than btrfs_print_leaf() gives us:
      
        >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"])
        leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610
        leaf 33439744 flags 0x100000000000000
        fs uuid e5bd3946-400c-4223-8923-190ef1f18677
        chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
                item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160
                        generation 7 transid 9 size 8192 nbytes 8473563889606862198
                        block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                        sequence 204 flags 0x10(PREALLOC)
                        atime 1716417703.220000000 (2024-05-22 15:41:43)
                        ctime 1716417704.983333333 (2024-05-22 15:41:44)
                        mtime 1716417704.983333333 (2024-05-22 15:41:44)
                        otime 17592186044416.000000000 (559444-03-08 01:40:16)
                item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13
                        index 195 namelen 3 name: 193
                item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37
                        location key (0 UNKNOWN.0 0) type XATTR
                        transid 7 data_len 1 name_len 6
                        name: user.a
                        data a
                item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53
                        generation 9 type 1 (regular)
                        extent data disk byte 303144960 nr 12288
                        extent data offset 0 nr 4096 ram 12288
                        extent compression 0 (none)
                item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53
                        generation 9 type 2 (prealloc)
                        prealloc data disk byte 303144960 nr 12288
                        prealloc data offset 4096 nr 8192
                item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53
                        generation 9 type 2 (prealloc)
                        prealloc data disk byte 303144960 nr 12288
                        prealloc data offset 8192 nr 4096
        ...
      
      So the real problem happened earlier: notice that items 4 (4k-12k) and 5
      (8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and
      item 5 starts at i_size.
      
      Here is the state of the filesystem tree at the time of the crash:
      
        >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root
        >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0))
        >>> print_extent_buffer(nodes[0])
        leaf 30425088 level 0 items 184 generation 9 owner 5
        leaf 30425088 flags 0x100000000000000
        fs uuid e5bd3946-400c-4223-8923-190ef1f18677
        chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
        	...
                item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160
                        generation 7 transid 7 size 4096 nbytes 12288
                        block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                        sequence 6 flags 0x10(PREALLOC)
                        atime 1716417703.220000000 (2024-05-22 15:41:43)
                        ctime 1716417703.220000000 (2024-05-22 15:41:43)
                        mtime 1716417703.220000000 (2024-05-22 15:41:43)
                        otime 1716417703.220000000 (2024-05-22 15:41:43)
                item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13
                        index 195 namelen 3 name: 193
                item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37
                        location key (0 UNKNOWN.0 0) type XATTR
                        transid 7 data_len 1 name_len 6
                        name: user.a
                        data a
                item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53
                        generation 9 type 1 (regular)
                        extent data disk byte 303144960 nr 12288
                        extent data offset 0 nr 8192 ram 12288
                        extent compression 0 (none)
                item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53
                        generation 9 type 2 (prealloc)
                        prealloc data disk byte 303144960 nr 12288
                        prealloc data offset 8192 nr 4096
      
      Item 5 in the log tree corresponds to item 183 in the filesystem tree,
      but nothing matches item 4. Furthermore, item 183 is the last item in
      the leaf.
      
      btrfs_log_prealloc_extents() is responsible for logging prealloc extents
      beyond i_size. It first truncates any previously logged prealloc extents
      that start beyond i_size. Then, it walks the filesystem tree and copies
      the prealloc extent items to the log tree.
      
      If it hits the end of a leaf, then it calls btrfs_next_leaf(), which
      unlocks the tree and does another search. However, while the filesystem
      tree is unlocked, an ordered extent completion may modify the tree. In
      particular, it may insert an extent item that overlaps with an extent
      item that was already copied to the log tree.
      
      This may manifest in several ways depending on the exact scenario,
      including an EEXIST error that is silently translated to a full sync,
      overlapping items in the log tree, or this crash. This particular crash
      is triggered by the following sequence of events:
      
      - Initially, the file has i_size=4k, a regular extent from 0-4k, and a
        prealloc extent beyond i_size from 4k-12k. The prealloc extent item is
        the last item in its B-tree leaf.
      - The file is fsync'd, which copies its inode item and both extent items
        to the log tree.
      - An xattr is set on the file, which sets the
        BTRFS_INODE_COPY_EVERYTHING flag.
      - The range 4k-8k in the file is written using direct I/O. i_size is
        extended to 8k, but the ordered extent is still in flight.
      - The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this
        calls copy_inode_items_to_log(), which calls
        btrfs_log_prealloc_extents().
      - btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the
        filesystem tree. Since it starts before i_size, it skips it. Since it
        is the last item in its B-tree leaf, it calls btrfs_next_leaf().
      - btrfs_next_leaf() unlocks the path.
      - The ordered extent completion runs, which converts the 4k-8k part of
        the prealloc extent to written and inserts the remaining prealloc part
        from 8k-12k.
      - btrfs_next_leaf() does a search and finds the new prealloc extent
        8k-12k.
      - btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into
        the log tree. Note that it overlaps with the 4k-12k prealloc extent
        that was copied to the log tree by the first fsync.
      - fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k
        extent that was written.
      - This tries to drop the range 4k-8k in the log tree, which requires
        adjusting the start of the 4k-12k prealloc extent in the log tree to
        8k.
      - btrfs_set_item_key_safe() sees that there is already an extent
        starting at 8k in the log tree and calls BUG().
      
      Fix this by detecting when we're about to insert an overlapping file
      extent item in the log tree and truncating the part that would overlap.
      
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9d274c19
  9. 28 May, 2024 1 commit
    • Filipe Manana's avatar
      btrfs: ensure fast fsync waits for ordered extents after a write failure · f13e01b8
      Filipe Manana authored
      If a write path in COW mode fails, either before submitting a bio for the
      new extents or an actual IO error happens, we can end up allowing a fast
      fsync to log file extent items that point to unwritten extents.
      
      This is because dropping the extent maps happens when completing ordered
      extents, at btrfs_finish_one_ordered(), and the completion of an ordered
      extent is executed in a work queue.
      
      This can result in a fast fsync to start logging file extent items based
      on existing extent maps before the ordered extents complete, therefore
      resulting in a log that has file extent items that point to unwritten
      extents, resulting in a corrupt file if a crash happens after and the log
      tree is replayed the next time the fs is mounted.
      
      This can happen for both direct IO writes and buffered writes.
      
      For example consider a direct IO write, in COW mode, that fails at
      btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an
      error:
      
      1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter
         set to false, meaning an error happened;
      
      2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR
         flag;
      
      3) btrfs_finish_ordered_extent() queues the completion of the ordered
         extent - so that btrfs_finish_one_ordered() will be executed later in
         a work queue. That function will drop extent maps in the range when
         it's executed, since the extent maps point to unwritten locations
         (signaled by the BTRFS_ORDERED_IOERR flag);
      
      4) After calling btrfs_finish_ordered_extent() we keep going down the
         write path and unlock the inode;
      
      5) After that a fast fsync starts and locks the inode;
      
      6) Before the work queue executes btrfs_finish_one_ordered(), the fsync
         task sees the extent maps that point to the unwritten locations and
         logs file extent items based on them - it does not know they are
         unwritten, and the fast fsync path does not wait for ordered extents
         to complete, which is an intentional behaviour in order to reduce
         latency.
      
      For the buffered write case, here's one example:
      
      1) A fast fsync begins, and it starts by flushing delalloc and waiting for
         the writeback to complete by calling filemap_fdatawait_range();
      
      2) Flushing the dellaloc created a new extent map X;
      
      3) During the writeback some IO error happened, and at the end io callback
         (end_bbio_data_write()) we call btrfs_finish_ordered_extent(), which
         sets the BTRFS_ORDERED_IOERR flag in the ordered extent and queues its
         completion;
      
      4) After queuing the ordered extent completion, the end io callback clears
         the writeback flag from all pages (or folios), and from that moment the
         fast fsync can proceed;
      
      5) The fast fsync proceeds sees extent map X and logs a file extent item
         based on extent map X, resulting in a log that points to an unwritten
         data extent - because the ordered extent completion hasn't run yet, it
         happens only after the logging.
      
      To fix this make btrfs_finish_ordered_extent() set the inode flag
      BTRFS_INODE_NEEDS_FULL_SYNC in case an error happened for a COW write,
      so that a fast fsync will wait for ordered extent completion.
      
      Note that this issues of using extent maps that point to unwritten
      locations can not happen for reads, because in read paths we start by
      locking the extent range and wait for any ordered extents in the range
      to complete before looking for extent maps.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f13e01b8
  10. 21 May, 2024 1 commit
  11. 15 May, 2024 5 commits
    • Filipe Manana's avatar
      btrfs: fix end of tree detection when searching for data extent ref · dddff821
      Filipe Manana authored
      At lookup_extent_data_ref() we are incorrectly checking if we are at the
      last slot of the last leaf in the extent tree. We are returning -ENOENT
      if btrfs_next_leaf() returns a value greater than 1, but btrfs_next_leaf()
      never returns anything greater than 1:
      
      1) It returns < 0 on error;
      
      2) 0 if there is a next leaf (or a new item was added to the end of the
         current leaf after releasing the path);
      
      3) 1 if there are no more leaves (and no new items were added to the last
         leaf after releasing the path).
      
      So fix this by checking if the return value is greater than zero instead
      of being greater than one.
      
      Fixes: 1618aa3c ("btrfs: simplify return variables in lookup_extent_data_ref()")
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dddff821
    • Lu Yao's avatar
      btrfs: scrub: initialize ret in scrub_simple_mirror() to fix compilation warning · b4e585ff
      Lu Yao authored
      The following error message is displayed:
        ../fs/btrfs/scrub.c:2152:9: error: ‘ret’ may be used uninitialized
        in this function [-Werror=maybe-uninitialized]"
      
      Compiler version: gcc version: (Debian 10.2.1-6) 10.2.1 20210110
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarLu Yao <yaolu@kylinos.cn>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b4e585ff
    • Filipe Manana's avatar
      btrfs: zoned: fix use-after-free due to race with dev replace · 0090d6e1
      Filipe Manana authored
      While loading a zone's info during creation of a block group, we can race
      with a device replace operation and then trigger a use-after-free on the
      device that was just replaced (source device of the replace operation).
      
      This happens because at btrfs_load_zone_info() we extract a device from
      the chunk map into a local variable and then use the device while not
      under the protection of the device replace rwsem. So if there's a device
      replace operation happening when we extract the device and that device
      is the source of the replace operation, we will trigger a use-after-free
      if before we finish using the device the replace operation finishes and
      frees the device.
      
      Fix this by enlarging the critical section under the protection of the
      device replace rwsem so that all uses of the device are done inside the
      critical section.
      
      CC: stable@vger.kernel.org # 6.1.x: 15c12fcc: btrfs: zoned: introduce a zone_info struct in btrfs_load_block_group_zone_info
      CC: stable@vger.kernel.org # 6.1.x: 09a46725: btrfs: zoned: factor out per-zone logic from btrfs_load_block_group_zone_info
      CC: stable@vger.kernel.org # 6.1.x: 9e0e3e74: btrfs: zoned: factor out single bg handling from btrfs_load_block_group_zone_info
      CC: stable@vger.kernel.org # 6.1.x: 87463f7e: btrfs: zoned: factor out DUP bg handling from btrfs_load_block_group_zone_info
      CC: stable@vger.kernel.org # 6.1.x
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0090d6e1
    • Boris Burkov's avatar
      btrfs: qgroup: fix qgroup id collision across mounts · 2b8aa78c
      Boris Burkov authored
      If we delete subvolumes whose ID is the largest in the filesystem, then
      unmount and mount again, then btrfs_init_root_free_objectid on the
      tree_root will select a subvolid smaller than that one and thus allow
      reusing it.
      
      If we are also using qgroups (and particularly squotas) it is possible
      to delete the subvol without deleting the qgroup. In that case, we will
      be able to create a new subvol whose id already has a level 0 qgroup.
      This will result in re-using that qgroup which would then lead to
      incorrect accounting.
      
      Fixes: 6ed05643 ("btrfs: create qgroup earlier in snapshot creation")
      CC: stable@vger.kernel.org # 6.7+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2b8aa78c
    • David Sterba's avatar
      btrfs: qgroup: update rescan message levels and error codes · 1fa7603d
      David Sterba authored
      On filesystems without enabled quotas there's still a warning message in
      the logs when rescan is called. In that case it's not a problem that
      should be reported, rescan can be called unconditionally.  Change the
      error code to ENOTCONN which is used for 'quotas not enabled' elsewhere.
      
      Remove message (also a warning) when rescan is called during an ongoing
      rescan, this brings no useful information and the error code is
      sufficient.
      
      Change message levels to debug for now, they can be removed eventually.
      
      CC: stable@vger.kernel.org # 6.6+
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1fa7603d
  12. 07 May, 2024 15 commits