1. 07 Oct, 2021 5 commits
    • Filipe Manana's avatar
      btrfs: deal with errors when adding inode reference during log replay · 52db7779
      Filipe Manana authored
      At __inode_add_ref(), we treating any error returned from
      btrfs_lookup_dir_item() or from btrfs_lookup_dir_index_item() as meaning
      that there is no existing directory entry in the fs/subvolume tree.
      This is not correct since we can get errors such as, for example, -EIO
      when reading extent buffers while searching the fs/subvolume's btree.
      
      So fix that and return the error to the caller when it is not -ENOENT.
      
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      52db7779
    • Filipe Manana's avatar
      btrfs: deal with errors when replaying dir entry during log replay · e15ac641
      Filipe Manana authored
      At replay_one_one(), we are treating any error returned from
      btrfs_lookup_dir_item() or from btrfs_lookup_dir_index_item() as meaning
      that there is no existing directory entry in the fs/subvolume tree.
      This is not correct since we can get errors such as, for example, -EIO
      when reading extent buffers while searching the fs/subvolume's btree.
      
      So fix that and return the error to the caller when it is not -ENOENT.
      
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e15ac641
    • Filipe Manana's avatar
      btrfs: deal with errors when checking if a dir entry exists during log replay · 77a5b9e3
      Filipe Manana authored
      Currently inode_in_dir() ignores errors returned from
      btrfs_lookup_dir_index_item() and from btrfs_lookup_dir_item(), treating
      any errors as if the directory entry does not exists in the fs/subvolume
      tree, which is obviously not correct, as we can get errors such as -EIO
      when reading extent buffers while searching the fs/subvolume's tree.
      
      Fix that by making inode_in_dir() return the errors and making its only
      caller, add_inode_ref(), deal with returned errors as well.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      77a5b9e3
    • Josef Bacik's avatar
      btrfs: update refs for any root except tree log roots · d175209b
      Josef Bacik authored
      I hit a stuck relocation on btrfs/061 during my overnight testing.  This
      turned out to be because we had left over extent entries in our extent
      root for a data reloc inode that no longer existed.  This happened
      because in btrfs_drop_extents() we only update refs if we have SHAREABLE
      set or we are the tree_root.  This regression was introduced by
      aeb935a4 ("btrfs: don't set SHAREABLE flag for data reloc tree")
      where we stopped setting SHAREABLE for the data reloc tree.
      
      The problem here is we actually do want to update extent references for
      data extents in the data reloc tree, in fact we only don't want to
      update extent references if the file extents are in the log tree.
      Update this check to only skip updating references in the case of the
      log tree.
      
      This is relatively rare, because you have to be running scrub at the
      same time, which is what btrfs/061 does.  The data reloc inode has its
      extents pre-allocated, and then we copy the extent into the
      pre-allocated chunks.  We theoretically should never be calling
      btrfs_drop_extents() on a data reloc inode.  The exception of course is
      with scrub, if our pre-allocated extent falls inside of the block group
      we are scrubbing, then the block group will be marked read only and we
      will be forced to cow that extent.  This means we will call
      btrfs_drop_extents() on that range when we COW that file extent.
      
      This isn't really problematic if we do this, the data reloc inode
      requires that our extent lengths match exactly with the extent we are
      copying, thankfully we validate the extent is correct with
      get_new_location(), so if we happen to COW only part of the extent we
      won't link it in when we do the relocation, so we are safe from any
      other shenanigans that arise because of this interaction with scrub.
      
      Fixes: aeb935a4 ("btrfs: don't set SHAREABLE flag for data reloc tree")
      CC: stable@vger.kernel.org # 5.8+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d175209b
    • Qu Wenruo's avatar
      btrfs: unlock newly allocated extent buffer after error · 19ea40dd
      Qu Wenruo authored
      [BUG]
      There is a bug report that injected ENOMEM error could leave a tree
      block locked while we return to user-space:
      
        BTRFS info (device loop0): enabling ssd optimizations
        FAULT_INJECTION: forcing a failure.
        name failslab, interval 1, probability 0, space 0, times 0
        CPU: 0 PID: 7579 Comm: syz-executor Not tainted 5.15.0-rc1 #16
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
        rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
        Call Trace:
         __dump_stack lib/dump_stack.c:88 [inline]
         dump_stack_lvl+0x8d/0xcf lib/dump_stack.c:106
         fail_dump lib/fault-inject.c:52 [inline]
         should_fail+0x13c/0x160 lib/fault-inject.c:146
         should_failslab+0x5/0x10 mm/slab_common.c:1328
         slab_pre_alloc_hook.constprop.99+0x4e/0xc0 mm/slab.h:494
         slab_alloc_node mm/slub.c:3120 [inline]
         slab_alloc mm/slub.c:3214 [inline]
         kmem_cache_alloc+0x44/0x280 mm/slub.c:3219
         btrfs_alloc_delayed_extent_op fs/btrfs/delayed-ref.h:299 [inline]
         btrfs_alloc_tree_block+0x38c/0x670 fs/btrfs/extent-tree.c:4833
         __btrfs_cow_block+0x16f/0x7d0 fs/btrfs/ctree.c:415
         btrfs_cow_block+0x12a/0x300 fs/btrfs/ctree.c:570
         btrfs_search_slot+0x6b0/0xee0 fs/btrfs/ctree.c:1768
         btrfs_insert_empty_items+0x80/0xf0 fs/btrfs/ctree.c:3905
         btrfs_new_inode+0x311/0xa60 fs/btrfs/inode.c:6530
         btrfs_create+0x12b/0x270 fs/btrfs/inode.c:6783
         lookup_open+0x660/0x780 fs/namei.c:3282
         open_last_lookups fs/namei.c:3352 [inline]
         path_openat+0x465/0xe20 fs/namei.c:3557
         do_filp_open+0xe3/0x170 fs/namei.c:3588
         do_sys_openat2+0x357/0x4a0 fs/open.c:1200
         do_sys_open+0x87/0xd0 fs/open.c:1216
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x34/0xb0 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x46ae99
        Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48
        89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
        01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
        RSP: 002b:00007f46711b9c48 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
        RAX: ffffffffffffffda RBX: 000000000078c0a0 RCX: 000000000046ae99
        RDX: 0000000000000000 RSI: 00000000000000a1 RDI: 0000000020005800
        RBP: 00007f46711b9c80 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000017
        R13: 0000000000000000 R14: 000000000078c0a0 R15: 00007ffc129da6e0
      
        ================================================
        WARNING: lock held when returning to user space!
        5.15.0-rc1 #16 Not tainted
        ------------------------------------------------
        syz-executor/7579 is leaving the kernel with locks still held!
        1 lock held by syz-executor/7579:
         #0: ffff888104b73da8 (btrfs-tree-01/1){+.+.}-{3:3}, at:
        __btrfs_tree_lock+0x2e/0x1a0 fs/btrfs/locking.c:112
      
      [CAUSE]
      In btrfs_alloc_tree_block(), after btrfs_init_new_buffer(), the new
      extent buffer @buf is locked, but if later operations like adding
      delayed tree ref fail, we just free @buf without unlocking it,
      resulting above warning.
      
      [FIX]
      Unlock @buf in out_free_buf: label.
      Reported-by: default avatarHao Sun <sunhao.th@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CACkBjsZ9O6Zr0KK1yGn=1rQi6Crh1yeCRdTSBxx9R99L4xdn-Q@mail.gmail.com/
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      19ea40dd
  2. 17 Sep, 2021 4 commits
    • Qu Wenruo's avatar
      btrfs: prevent __btrfs_dump_space_info() to underflow its free space · 0619b790
      Qu Wenruo authored
      It's not uncommon where __btrfs_dump_space_info() gets called
      under over-commit situations.
      
      In that case free space would underflow as total allocated space is not
      enough to handle all the over-committed space.
      
      Such underflow values can sometimes cause confusion for users enabled
      enospc_debug mount option, and takes some seconds for developers to
      convert the underflow value to signed result.
      
      Just output the free space as s64 to avoid such problem.
      Reported-by: default avatarEli V <eliventer@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CAJtFHUSy4zgyhf-4d9T+KdJp9w=UgzC2A0V=VtmaeEpcGgm1-Q@mail.gmail.com/
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0619b790
    • Filipe Manana's avatar
      btrfs: fix mount failure due to past and transient device flush error · 6b225baa
      Filipe Manana authored
      When we get an error flushing one device, during a super block commit, we
      record the error in the device structure, in the field 'last_flush_error'.
      This is used to later check if we should error out the super block commit,
      depending on whether the number of flush errors is greater than or equals
      to the maximum tolerated device failures for a raid profile.
      
      However if we get a transient device flush error, unmount the filesystem
      and later try to mount it, we can fail the mount because we treat that
      past error as critical and consider the device is missing. Even if it's
      very likely that the error will happen again, as it's probably due to a
      hardware related problem, there may be cases where the error might not
      happen again. One example is during testing, and a test case like the
      new generic/648 from fstests always triggers this. The test cases
      generic/019 and generic/475 also trigger this scenario, but very
      sporadically.
      
      When this happens we get an error like this:
      
        $ mount /dev/sdc /mnt
        mount: /mnt wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error.
      
        $ dmesg
        (...)
        [12918.886926] BTRFS warning (device sdc): chunk 13631488 missing 1 devices, max tolerance is 0 for writable mount
        [12918.888293] BTRFS warning (device sdc): writable mount is not allowed due to too many missing devices
        [12918.890853] BTRFS error (device sdc): open_ctree failed
      
      The failure happens because when btrfs_check_rw_degradable() is called at
      mount time, or at remount from RO to RW time, is sees a non zero value in
      a device's ->last_flush_error attribute, and therefore considers that the
      device is 'missing'.
      
      Fix this by setting a device's ->last_flush_error to zero when we close a
      device, making sure the error is not seen on the next mount attempt. We
      only need to track flush errors during the current mount, so that we never
      commit a super block if such errors happened.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6b225baa
    • Filipe Manana's avatar
      btrfs: fix transaction handle leak after verity rollback failure · acbee9af
      Filipe Manana authored
      During a verity rollback, if we fail to update the inode or delete the
      orphan, we abort the transaction and return without releasing our
      transaction handle. Fix that by releasing the handle.
      
      Fixes: 14605409 ("btrfs: initial fsverity support")
      Fixes: 70524253 ("btrfs: verity metadata orphan items")
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      acbee9af
    • Qu Wenruo's avatar
      btrfs: replace BUG_ON() in btrfs_csum_one_bio() with proper error handling · bbc9a6eb
      Qu Wenruo authored
      There is a BUG_ON() in btrfs_csum_one_bio() to catch code logic error.
      It has indeed caught several bugs during subpage development.
      But the BUG_ON() itself will bring down the whole system which is
      an overkill.
      
      Replace it with a WARN() and exit gracefully, so that it won't crash the
      whole system while we can still catch the code logic error.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bbc9a6eb
  3. 07 Sep, 2021 6 commits
    • Naohiro Aota's avatar
      btrfs: zoned: fix double counting of split ordered extent · f79645df
      Naohiro Aota authored
      btrfs_add_ordered_extent_*() add num_bytes to fs_info->ordered_bytes.
      Then, splitting an ordered extent will call btrfs_add_ordered_extent_*()
      again for split extents, leading to double counting of the region of
      a split extent. These leaked bytes are finally reported at unmount time
      as follow:
      
        BTRFS info (device dm-1): at unmount dio bytes count 364544
      
      Fix the double counting by subtracting split extent's size from
      fs_info->ordered_bytes.
      
      Fixes: d22002fd ("btrfs: zoned: split ordered extent when bio is sent")
      CC: stable@vger.kernel.org # 5.12+
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f79645df
    • Anand Jain's avatar
      btrfs: fix lockdep warning while mounting sprout fs · c1247069
      Anand Jain authored
      Following test case reproduces lockdep warning.
      
        Test case:
      
        $ mkfs.btrfs -f <dev1>
        $ btrfstune -S 1 <dev1>
        $ mount <dev1> <mnt>
        $ btrfs device add <dev2> <mnt> -f
        $ umount <mnt>
        $ mount <dev2> <mnt>
        $ umount <mnt>
      
      The warning claims a possible ABBA deadlock between the threads
      initiated by [#1] btrfs device add and [#0] the mount.
      
        [ 540.743122] WARNING: possible circular locking dependency detected
        [ 540.743129] 5.11.0-rc7+ #5 Not tainted
        [ 540.743135] ------------------------------------------------------
        [ 540.743142] mount/2515 is trying to acquire lock:
        [ 540.743149] ffffa0c5544c2ce0 (&fs_devs->device_list_mutex){+.+.}-{4:4}, at: clone_fs_devices+0x6d/0x210 [btrfs]
        [ 540.743458] but task is already holding lock:
        [ 540.743461] ffffa0c54a7932b8 (btrfs-chunk-00){++++}-{4:4}, at: __btrfs_tree_read_lock+0x32/0x200 [btrfs]
        [ 540.743541] which lock already depends on the new lock.
        [ 540.743543] the existing dependency chain (in reverse order) is:
      
        [ 540.743546] -> #1 (btrfs-chunk-00){++++}-{4:4}:
        [ 540.743566] down_read_nested+0x48/0x2b0
        [ 540.743585] __btrfs_tree_read_lock+0x32/0x200 [btrfs]
        [ 540.743650] btrfs_read_lock_root_node+0x70/0x200 [btrfs]
        [ 540.743733] btrfs_search_slot+0x6c6/0xe00 [btrfs]
        [ 540.743785] btrfs_update_device+0x83/0x260 [btrfs]
        [ 540.743849] btrfs_finish_chunk_alloc+0x13f/0x660 [btrfs] <--- device_list_mutex
        [ 540.743911] btrfs_create_pending_block_groups+0x18d/0x3f0 [btrfs]
        [ 540.743982] btrfs_commit_transaction+0x86/0x1260 [btrfs]
        [ 540.744037] btrfs_init_new_device+0x1600/0x1dd0 [btrfs]
        [ 540.744101] btrfs_ioctl+0x1c77/0x24c0 [btrfs]
        [ 540.744166] __x64_sys_ioctl+0xe4/0x140
        [ 540.744170] do_syscall_64+0x4b/0x80
        [ 540.744174] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        [ 540.744180] -> #0 (&fs_devs->device_list_mutex){+.+.}-{4:4}:
        [ 540.744184] __lock_acquire+0x155f/0x2360
        [ 540.744188] lock_acquire+0x10b/0x5c0
        [ 540.744190] __mutex_lock+0xb1/0xf80
        [ 540.744193] mutex_lock_nested+0x27/0x30
        [ 540.744196] clone_fs_devices+0x6d/0x210 [btrfs]
        [ 540.744270] btrfs_read_chunk_tree+0x3c7/0xbb0 [btrfs]
        [ 540.744336] open_ctree+0xf6e/0x2074 [btrfs]
        [ 540.744406] btrfs_mount_root.cold.72+0x16/0x127 [btrfs]
        [ 540.744472] legacy_get_tree+0x38/0x90
        [ 540.744475] vfs_get_tree+0x30/0x140
        [ 540.744478] fc_mount+0x16/0x60
        [ 540.744482] vfs_kern_mount+0x91/0x100
        [ 540.744484] btrfs_mount+0x1e6/0x670 [btrfs]
        [ 540.744536] legacy_get_tree+0x38/0x90
        [ 540.744537] vfs_get_tree+0x30/0x140
        [ 540.744539] path_mount+0x8d8/0x1070
        [ 540.744541] do_mount+0x8d/0xc0
        [ 540.744543] __x64_sys_mount+0x125/0x160
        [ 540.744545] do_syscall_64+0x4b/0x80
        [ 540.744547] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        [ 540.744551] other info that might help us debug this:
        [ 540.744552] Possible unsafe locking scenario:
      
        [ 540.744553] CPU0 				CPU1
        [ 540.744554] ---- 				----
        [ 540.744555] lock(btrfs-chunk-00);
        [ 540.744557] 					lock(&fs_devs->device_list_mutex);
        [ 540.744560] 					lock(btrfs-chunk-00);
        [ 540.744562] lock(&fs_devs->device_list_mutex);
        [ 540.744564]
         *** DEADLOCK ***
      
        [ 540.744565] 3 locks held by mount/2515:
        [ 540.744567] #0: ffffa0c56bf7a0e0 (&type->s_umount_key#42/1){+.+.}-{4:4}, at: alloc_super.isra.16+0xdf/0x450
        [ 540.744574] #1: ffffffffc05a9628 (uuid_mutex){+.+.}-{4:4}, at: btrfs_read_chunk_tree+0x63/0xbb0 [btrfs]
        [ 540.744640] #2: ffffa0c54a7932b8 (btrfs-chunk-00){++++}-{4:4}, at: __btrfs_tree_read_lock+0x32/0x200 [btrfs]
        [ 540.744708]
         stack backtrace:
        [ 540.744712] CPU: 2 PID: 2515 Comm: mount Not tainted 5.11.0-rc7+ #5
      
      But the device_list_mutex in clone_fs_devices() is redundant, as
      explained below.  Two threads [1]  and [2] (below) could lead to
      clone_fs_device().
      
        [1]
        open_ctree <== mount sprout fs
         btrfs_read_chunk_tree()
          mutex_lock(&uuid_mutex) <== global lock
          read_one_dev()
           open_seed_devices()
            clone_fs_devices() <== seed fs_devices
             mutex_lock(&orig->device_list_mutex) <== seed fs_devices
      
        [2]
        btrfs_init_new_device() <== sprouting
         mutex_lock(&uuid_mutex); <== global lock
         btrfs_prepare_sprout()
           lockdep_assert_held(&uuid_mutex)
           clone_fs_devices(seed_fs_device) <== seed fs_devices
      
      Both of these threads hold uuid_mutex which is sufficient to protect
      getting the seed device(s) freed while we are trying to clone it for
      sprouting [2] or mounting a sprout [1] (as above). A mounted seed device
      can not free/write/replace because it is read-only. An unmounted seed
      device can be freed by btrfs_free_stale_devices(), but it needs
      uuid_mutex.  So this patch removes the unnecessary device_list_mutex in
      clone_fs_devices().  And adds a lockdep_assert_held(&uuid_mutex) in
      clone_fs_devices().
      Reported-by: default avatarSu Yue <l@damenly.su>
      Tested-by: default avatarSu Yue <l@damenly.su>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c1247069
    • Josef Bacik's avatar
      btrfs: delay blkdev_put until after the device remove · 3fa421de
      Josef Bacik authored
      When removing the device we call blkdev_put() on the device once we've
      removed it, and because we have an EXCL open we need to take the
      ->open_mutex on the block device to clean it up.  Unfortunately during
      device remove we are holding the sb writers lock, which results in the
      following lockdep splat:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.14.0-rc2+ #407 Not tainted
      ------------------------------------------------------
      losetup/11595 is trying to acquire lock:
      ffff973ac35dd138 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0
      
      but task is already holding lock:
      ffff973ac9812c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #4 (&lo->lo_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7d/0x750
             lo_open+0x28/0x60 [loop]
             blkdev_get_whole+0x25/0xf0
             blkdev_get_by_dev.part.0+0x168/0x3c0
             blkdev_open+0xd2/0xe0
             do_dentry_open+0x161/0x390
             path_openat+0x3cc/0xa20
             do_filp_open+0x96/0x120
             do_sys_openat2+0x7b/0x130
             __x64_sys_openat+0x46/0x70
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #3 (&disk->open_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7d/0x750
             blkdev_put+0x3a/0x220
             btrfs_rm_device.cold+0x62/0xe5
             btrfs_ioctl+0x2a31/0x2e70
             __x64_sys_ioctl+0x80/0xb0
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #2 (sb_writers#12){.+.+}-{0:0}:
             lo_write_bvec+0xc2/0x240 [loop]
             loop_process_work+0x238/0xd00 [loop]
             process_one_work+0x26b/0x560
             worker_thread+0x55/0x3c0
             kthread+0x140/0x160
             ret_from_fork+0x1f/0x30
      
      -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
             process_one_work+0x245/0x560
             worker_thread+0x55/0x3c0
             kthread+0x140/0x160
             ret_from_fork+0x1f/0x30
      
      -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
             __lock_acquire+0x10ea/0x1d90
             lock_acquire+0xb5/0x2b0
             flush_workqueue+0x91/0x5e0
             drain_workqueue+0xa0/0x110
             destroy_workqueue+0x36/0x250
             __loop_clr_fd+0x9a/0x660 [loop]
             block_ioctl+0x3f/0x50
             __x64_sys_ioctl+0x80/0xb0
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      other info that might help us debug this:
      
      Chain exists of:
        (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&lo->lo_mutex);
                                     lock(&disk->open_mutex);
                                     lock(&lo->lo_mutex);
        lock((wq_completion)loop0);
      
       *** DEADLOCK ***
      
      1 lock held by losetup/11595:
       #0: ffff973ac9812c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
      
      stack backtrace:
      CPU: 0 PID: 11595 Comm: losetup Not tainted 5.14.0-rc2+ #407
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
      Call Trace:
       dump_stack_lvl+0x57/0x72
       check_noncircular+0xcf/0xf0
       ? stack_trace_save+0x3b/0x50
       __lock_acquire+0x10ea/0x1d90
       lock_acquire+0xb5/0x2b0
       ? flush_workqueue+0x67/0x5e0
       ? lockdep_init_map_type+0x47/0x220
       flush_workqueue+0x91/0x5e0
       ? flush_workqueue+0x67/0x5e0
       ? verify_cpu+0xf0/0x100
       drain_workqueue+0xa0/0x110
       destroy_workqueue+0x36/0x250
       __loop_clr_fd+0x9a/0x660 [loop]
       ? blkdev_ioctl+0x8d/0x2a0
       block_ioctl+0x3f/0x50
       __x64_sys_ioctl+0x80/0xb0
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7fc21255d4cb
      
      So instead save the bdev and do the put once we've dropped the sb
      writers lock in order to avoid the lockdep recursion.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3fa421de
    • Josef Bacik's avatar
      btrfs: update the bdev time directly when closing · 8f96a5bf
      Josef Bacik authored
      We update the ctime/mtime of a block device when we remove it so that
      blkid knows the device changed.  However we do this by re-opening the
      block device and calling filp_update_time.  This is more correct because
      it'll call the inode->i_op->update_time if it exists, but the block dev
      inodes do not do this.  Instead call generic_update_time() on the
      bd_inode in order to avoid the blkdev_open path and get rid of the
      following lockdep splat:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.14.0-rc2+ #406 Not tainted
      ------------------------------------------------------
      losetup/11596 is trying to acquire lock:
      ffff939640d2f538 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0
      
      but task is already holding lock:
      ffff939655510c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #4 (&lo->lo_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7d/0x750
             lo_open+0x28/0x60 [loop]
             blkdev_get_whole+0x25/0xf0
             blkdev_get_by_dev.part.0+0x168/0x3c0
             blkdev_open+0xd2/0xe0
             do_dentry_open+0x161/0x390
             path_openat+0x3cc/0xa20
             do_filp_open+0x96/0x120
             do_sys_openat2+0x7b/0x130
             __x64_sys_openat+0x46/0x70
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #3 (&disk->open_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7d/0x750
             blkdev_get_by_dev.part.0+0x56/0x3c0
             blkdev_open+0xd2/0xe0
             do_dentry_open+0x161/0x390
             path_openat+0x3cc/0xa20
             do_filp_open+0x96/0x120
             file_open_name+0xc7/0x170
             filp_open+0x2c/0x50
             btrfs_scratch_superblocks.part.0+0x10f/0x170
             btrfs_rm_device.cold+0xe8/0xed
             btrfs_ioctl+0x2a31/0x2e70
             __x64_sys_ioctl+0x80/0xb0
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #2 (sb_writers#12){.+.+}-{0:0}:
             lo_write_bvec+0xc2/0x240 [loop]
             loop_process_work+0x238/0xd00 [loop]
             process_one_work+0x26b/0x560
             worker_thread+0x55/0x3c0
             kthread+0x140/0x160
             ret_from_fork+0x1f/0x30
      
      -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
             process_one_work+0x245/0x560
             worker_thread+0x55/0x3c0
             kthread+0x140/0x160
             ret_from_fork+0x1f/0x30
      
      -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
             __lock_acquire+0x10ea/0x1d90
             lock_acquire+0xb5/0x2b0
             flush_workqueue+0x91/0x5e0
             drain_workqueue+0xa0/0x110
             destroy_workqueue+0x36/0x250
             __loop_clr_fd+0x9a/0x660 [loop]
             block_ioctl+0x3f/0x50
             __x64_sys_ioctl+0x80/0xb0
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      other info that might help us debug this:
      
      Chain exists of:
        (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&lo->lo_mutex);
                                     lock(&disk->open_mutex);
                                     lock(&lo->lo_mutex);
        lock((wq_completion)loop0);
      
       *** DEADLOCK ***
      
      1 lock held by losetup/11596:
       #0: ffff939655510c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
      
      stack backtrace:
      CPU: 1 PID: 11596 Comm: losetup Not tainted 5.14.0-rc2+ #406
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
      Call Trace:
       dump_stack_lvl+0x57/0x72
       check_noncircular+0xcf/0xf0
       ? stack_trace_save+0x3b/0x50
       __lock_acquire+0x10ea/0x1d90
       lock_acquire+0xb5/0x2b0
       ? flush_workqueue+0x67/0x5e0
       ? lockdep_init_map_type+0x47/0x220
       flush_workqueue+0x91/0x5e0
       ? flush_workqueue+0x67/0x5e0
       ? verify_cpu+0xf0/0x100
       drain_workqueue+0xa0/0x110
       destroy_workqueue+0x36/0x250
       __loop_clr_fd+0x9a/0x660 [loop]
       ? blkdev_ioctl+0x8d/0x2a0
       block_ioctl+0x3f/0x50
       __x64_sys_ioctl+0x80/0xb0
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8f96a5bf
    • Kari Argillander's avatar
      btrfs: use correct header for div_u64 in misc.h · cde7417c
      Kari Argillander authored
      asm/do_div.h is for div_u64, but it is found in math64.h. This change
      will make compiler job easier and prevent compiler errors in situation
      where compiler will not find math64.h from another paths.
      Signed-off-by: default avatarKari Argillander <kari.argillander@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cde7417c
    • Anand Jain's avatar
      btrfs: fix upper limit for max_inline for page size 64K · 6f93e834
      Anand Jain authored
      The mount option max_inline ranges from 0 to the sectorsize (which is
      now equal to page size). But we parse the mount options too early and
      before the actual sectorsize is read from the superblock. So the upper
      limit of max_inline is unaware of the actual sectorsize and is limited
      by the temporary sectorsize 4096, even on a system where the default
      sectorsize is 64K.
      
      Fix this by reading the superblock sectorsize before the mount option
      parse.
      Reported-by: default avatarAlexander Tsvetkov <alexander.tsvetkov@oracle.com>
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6f93e834
  4. 23 Aug, 2021 25 commits
    • Desmond Cheong Zhi Xi's avatar
      btrfs: reset replace target device to allocation state on close · 0d977e0e
      Desmond Cheong Zhi Xi authored
      This crash was observed with a failed assertion on device close:
      
        BTRFS: Transaction aborted (error -28)
        WARNING: CPU: 1 PID: 3902 at fs/btrfs/extent-tree.c:2150 btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
        Modules linked in: btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
        CPU: 1 PID: 3902 Comm: kworker/u8:4 Not tainted 5.14.0-rc5-default+ #1532
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
        RIP: 0010:btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
        RSP: 0018:ffffb7a5452d7d80 EFLAGS: 00010282
        RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
        RBP: ffff97834176a378 R08: 0000000000000001 R09: 0000000000000001
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff97835195d388
        R13: 0000000005b08000 R14: ffff978385484000 R15: 000000000000016c
        FS:  0000000000000000(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000056190d003fe8 CR3: 000000002a81e005 CR4: 0000000000170ea0
        Call Trace:
         flush_space+0x197/0x2f0 [btrfs]
         btrfs_async_reclaim_metadata_space+0x139/0x300 [btrfs]
         process_one_work+0x262/0x5e0
         worker_thread+0x4c/0x320
         ? process_one_work+0x5e0/0x5e0
         kthread+0x144/0x170
         ? set_kthread_struct+0x40/0x40
         ret_from_fork+0x1f/0x30
        irq event stamp: 19334989
        hardirqs last  enabled at (19334997): [<ffffffffab0e0c87>] console_unlock+0x2b7/0x400
        hardirqs last disabled at (19335006): [<ffffffffab0e0d0d>] console_unlock+0x33d/0x400
        softirqs last  enabled at (19334900): [<ffffffffaba0030d>] __do_softirq+0x30d/0x574
        softirqs last disabled at (19334893): [<ffffffffab0721ec>] irq_exit_rcu+0x12c/0x140
        ---[ end trace 45939e308e0dd3c7 ]---
        BTRFS: error (device vdd) in btrfs_run_delayed_refs:2150: errno=-28 No space left
        BTRFS info (device vdd): forced readonly
        BTRFS warning (device vdd): failed setting block group ro: -30
        BTRFS info (device vdd): suspending dev_replace for unmount
        assertion failed: !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state), in fs/btrfs/volumes.c:1150
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.h:3431!
        invalid opcode: 0000 [#1] PREEMPT SMP
        CPU: 1 PID: 3982 Comm: umount Tainted: G        W         5.14.0-rc5-default+ #1532
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
        RSP: 0018:ffffb7a5454c7db8 EFLAGS: 00010246
        RAX: 0000000000000068 RBX: ffff978364b91c00 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
        RBP: ffff9783523a4c00 R08: 0000000000000001 R09: 0000000000000001
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff9783523a4d18
        R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000003
        FS:  00007f61c8f42800(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000056190cffa810 CR3: 0000000030b96002 CR4: 0000000000170ea0
        Call Trace:
         btrfs_close_one_device.cold+0x11/0x55 [btrfs]
         close_fs_devices+0x44/0xb0 [btrfs]
         btrfs_close_devices+0x48/0x160 [btrfs]
         generic_shutdown_super+0x69/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x2c/0xa0
         cleanup_mnt+0x144/0x1b0
         task_work_run+0x59/0xa0
         exit_to_user_mode_loop+0xe7/0xf0
         exit_to_user_mode_prepare+0xaf/0xf0
         syscall_exit_to_user_mode+0x19/0x50
         do_syscall_64+0x4a/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      This happens when close_ctree is called while a dev_replace hasn't
      completed. In close_ctree, we suspend the dev_replace, but keep the
      replace target around so that we can resume the dev_replace procedure
      when we mount the root again. This is the call trace:
      
        close_ctree():
          btrfs_dev_replace_suspend_for_unmount();
          btrfs_close_devices():
            btrfs_close_fs_devices():
              btrfs_close_one_device():
                ASSERT(!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
                       &device->dev_state));
      
      However, since the replace target sticks around, there is a device
      with BTRFS_DEV_STATE_REPLACE_TGT set on close, and we fail the
      assertion in btrfs_close_one_device.
      
      To fix this, if we come across the replace target device when
      closing, we should properly reset it back to allocation state. This
      fix also ensures that if a non-target device has a corrupted state and
      has the BTRFS_DEV_STATE_REPLACE_TGT bit set, the assertion will still
      catch the error.
      Reported-by: default avatarDavid Sterba <dsterba@suse.com>
      Fixes: b2a61667 ("btrfs: fix rw device counting in __btrfs_free_extra_devids")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDesmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0d977e0e
    • Naohiro Aota's avatar
      btrfs: zoned: fix ordered extent boundary calculation · 939c7feb
      Naohiro Aota authored
      btrfs_lookup_ordered_extent() is supposed to query the offset in a file
      instead of the logical address. Pass the file offset from
      submit_extent_page() to calc_bio_boundaries().
      
      Also, calc_bio_boundaries() relies on the bio's operation flag, so move
      the call site after setting it.
      
      Fixes: 390ed29b ("btrfs: refactor submit_extent_page() to make bio and its flag tracing easier")
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      939c7feb
    • Josef Bacik's avatar
      btrfs: do not do preemptive flushing if the majority is global rsv · 11462397
      Josef Bacik authored
      A common characteristic of the bug report where preemptive flushing was
      going full tilt was the fact that the vast majority of the free metadata
      space was used up by the global reserve.  The hard 90% threshold would
      cover the majority of these cases, but to be even smarter we should take
      into account how much of the outstanding reservations are covered by the
      global block reserve.  If the global block reserve accounts for the vast
      majority of outstanding reservations, skip preemptive flushing, as it
      will likely just cause churn and pain.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      11462397
    • Josef Bacik's avatar
      btrfs: reduce the preemptive flushing threshold to 90% · 93c60b17
      Josef Bacik authored
      The preemptive flushing code was added in order to avoid needing to
      synchronously wait for ENOSPC flushing to recover space.  Once we're
      almost full however we can essentially flush constantly.  We were using
      98% as a threshold to determine if we were simply full, however in
      practice this is a really high bar to hit.  For example reports of
      systems running into this problem had around 94% usage and thus
      continued to flush.  Fix this by lowering the threshold to 90%, which is
      a more sane value, especially for smaller file systems.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185
      CC: stable@vger.kernel.org # 5.12+
      Fixes: 576fa348 ("btrfs: improve preemptive background space flushing")
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      93c60b17
    • Marcos Paulo de Souza's avatar
      btrfs: tree-log: check btrfs_lookup_data_extent return value · 3736127a
      Marcos Paulo de Souza authored
      Function btrfs_lookup_data_extent calls btrfs_search_slot to verify if
      the EXTENT_ITEM exists in the extent tree. btrfs_search_slot can return
      values bellow zero if an error happened.
      
      Function replay_one_extent currently checks if the search found
      something (0 returned) and increments the reference, and if not, it
      seems to evaluate as 'not found'.
      
      Fix the condition by checking if the value was bellow zero and return
      early.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3736127a
    • Filipe Manana's avatar
      btrfs: avoid unnecessarily logging directories that had no changes · 8be2ba2e
      Filipe Manana authored
      There are several cases where when logging an inode we need to log its
      parent directories or logging subdirectories when logging a directory.
      
      There are cases however where we end up logging a directory even if it was
      not changed in the current transaction, no dentries added or removed since
      the last transaction. While this is harmless from a functional point of
      view, it is a waste time as it brings no advantage.
      
      One example where this is triggered is the following:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        $ mkdir /mnt/A
        $ mkdir /mnt/B
        $ mkdir /mnt/C
      
        $ touch /mnt/A/foo
        $ ln /mnt/A/foo /mnt/B/bar
        $ ln /mnt/A/foo /mnt/C/baz
      
        $ sync
      
        $ rm -f /mnt/A/foo
        $ xfs_io -c "fsync" /mnt/B/bar
      
      This last fsync ends up logging directories A, B and C, however we only
      need to log directory A, as B and C were not changed since the last
      transaction commit.
      
      So fix this by changing need_log_inode(), to return false in case the
      given inode is a directory and has a ->last_trans value smaller than the
      current transaction's ID.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8be2ba2e
    • Christian Brauner's avatar
      btrfs: allow idmapped mount · 5b9b26f5
      Christian Brauner authored
      Now that we converted btrfs internally to account for idmapped mounts
      allow the creation of idmapped mounts on by setting the FS_ALLOW_IDMAP
      flag.  We only need to raise this flag on the btrfs_root_fs_type
      filesystem since btrfs_mount_root() is ultimately responsible for
      allocating the superblock and is called into from btrfs_mount()
      associated with btrfs_fs_type.
      
      The conversion of the btrfs inode operations was straightforward.
      Regarding btrfs specific ioctls that perform checks based on inode
      permissions only those have been allowed that are not filesystem wide
      operations and hence can be reasonably charged against a specific mount.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5b9b26f5
    • Christian Brauner's avatar
      btrfs: handle ACLs on idmapped mounts · 4a8b34af
      Christian Brauner authored
      Make the ACL code idmapped mount aware. The POSIX default and POSIX
      access ACLs are the only ACLs other than some specific xattrs that take
      DAC permissions into account. On an idmapped mount they need to be
      translated according to the mount's userns. The main change is done to
      __btrfs_set_acl() which is responsible for translating POSIX ACLs to
      their final on-disk representation.
      
      The btrfs_init_acl() helper does not need to take the idmapped mount
      into account since it is called in the context of file creation
      operations (mknod, create, mkdir, symlink, tmpfile) and is used for
      btrfs_init_inode_security() to copy POSIX default and POSIX access
      permissions from the parent directory. These ACLs need to be inherited
      unmodified from the parent directory. This is identical to what we do
      for ext4 and xfs.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4a8b34af
    • Christian Brauner's avatar
      btrfs: allow idmapped INO_LOOKUP_USER ioctl · 6623d9a0
      Christian Brauner authored
      The INO_LOOKUP_USER is an unprivileged version of the INO_LOOKUP ioctl
      and has the following restrictions. The main difference between the two
      is that INO_LOOKUP is filesystem wide operation wheres INO_LOOKUP_USER
      is scoped beneath the file descriptor passed with the ioctl.
      Specifically, INO_LOOKUP_USER must adhere to the following restrictions:
      
      - The caller must be privileged over each inode of each path component
        for the path they are trying to lookup.
      
      - The path for the subvolume the caller is trying to lookup must be reachable
        from the inode associated with the file descriptor passed with the ioctl.
      
      The second condition makes it possible to scope the lookup of the path
      to the mount identified by the file descriptor passed with the ioctl.
      This allows us to enable this ioctl on idmapped mounts.
      
      Specifically, this is possible because all child subvolumes of a parent
      subvolume are reachable when the parent subvolume is mounted. So if the
      user had access to open the parent subvolume or has been given the fd
      then they can lookup the path if they had access to it provided they
      were privileged over each path component.
      
      Note, the INO_LOOKUP_USER ioctl allows a user to learn the path and name
      of a subvolume even though they would otherwise be restricted from doing
      so via regular VFS-based lookup.
      
      So think about a parent subvolume with multiple child subvolumes.
      Someone could mount he parent subvolume and restrict access to the child
      subvolumes by overmounting them with empty directories. At this point
      the user can't traverse the child subvolumes and they can't open files
      in the child subvolumes.  However, they can still learn the path of
      child subvolumes as long as they have access to the parent subvolume by
      using the INO_LOOKUP_USER ioctl.
      
      The underlying assumption here is that it's ok that the lookup ioctls
      can't really take mounts into account other than the original mount the
      fd belongs to during lookup. Since this assumption is baked into the
      original INO_LOOKUP_USER ioctl we can extend it to idmapped mounts.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6623d9a0
    • Christian Brauner's avatar
      btrfs: allow idmapped SUBVOL_SETFLAGS ioctl · 39e1674f
      Christian Brauner authored
      Setting flags on subvolumes or snapshots are core features of btrfs. The
      SUBVOL_SETFLAGS ioctl is especially important as it allows to make
      subvolumes and snapshots read-only or read-write. Allow setting flags on
      btrfs subvolumes and snapshots on idmapped mounts. This is a fairly
      straightforward operation since all the permission checking helpers are
      already capable of handling idmapped mounts. So we just need to pass
      down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      39e1674f
    • Christian Brauner's avatar
      btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls · e4fed17a
      Christian Brauner authored
      The SET_RECEIVED_SUBVOL ioctls are used to set information about
      a received subvolume. Make it possible to set information about a
      received subvolume on idmapped mounts. This is a fairly straightforward
      operation since all the permission checking helpers are already capable
      of handling idmapped mounts. So we just need to pass down the mount's
      userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e4fed17a
    • Christian Brauner's avatar
      btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids · aabb34e7
      Christian Brauner authored
      So far we prevented the deletion of subvolumes and snapshots using
      subvolume ids possible with the BTRFS_SUBVOL_SPEC_BY_ID flag.
      
      This restriction is necessary on idmapped mounts as this allows
      filesystem wide subvolume and snapshot deletions and thus can escape the
      scope of what's exposed under the mount identified by the fd passed with
      the ioctl.
      
      Deletion by subvolume id works by looking for an alias of the parent of
      the subvolume or snapshot to be deleted. The parent alias can be
      anywhere in the filesystem. However, as long as the alias of the parent
      that is found is the same as the one identified by the file descriptor
      passed through the ioctl we can allow the deletion.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      aabb34e7
    • Christian Brauner's avatar
      btrfs: allow idmapped SNAP_DESTROY ioctls · c4ed533b
      Christian Brauner authored
      Destroying subvolumes and snapshots are important features of btrfs.
      Both operations are available to unprivileged users if the filesystem
      has been mounted with the "user_subvol_rm_allowed" mount option. Allow
      subvolume and snapshot deletion on idmapped mounts. This is a fairly
      straightforward operation since all the permission checking helpers are
      already capable of handling idmapped mounts. So we just need to pass
      down the mount's userns.
      
      Subvolumes and snapshots can either be deleted by specifying their name
      or - if BTRFS_IOC_SNAP_DESTROY_V2 is used - by their subvolume or
      snapshot id if the BTRFS_SUBVOL_SPEC_BY_ID is set.
      
      This feature is blocked on idmapped mounts as this allows filesystem
      wide subvolume deletions and thus can escape the scope of what's exposed
      under the mount identified by the fd passed with the ioctl.
      
      This means that even the root or CAP_SYS_ADMIN capable user can't delete
      a subvolume via BTRFS_SUBVOL_SPEC_BY_ID. This is intentional.
      
      The root user is currently already subject to permission checks in
      btrfs_may_delete() including whether the inode's i_uid/i_gid of the
      directory the subvolume is located in have a mapping in the caller's
      idmapping. For this to fail isn't currently possible since a btrfs
      filesystem can't be mounted with a non-initial idmapping but it shows
      that even the root user would fail to delete a subvolume if the relevant
      inode isn't mapped in their idmapping. The idmapped mount case is the
      same in principle.
      
      This isn't a huge problem a root user wanting to delete arbitrary
      subvolumes can just always create another (even detached) mount without
      an idmapping attached.
      
      In addition, we will allow BTRFS_SUBVOL_SPEC_BY_ID for cases where the
      subvolume to delete is directly located under inode referenced by the fd
      passed for the ioctl() in a follow-up commit.
      
      Here is an example where a btrfs subvolume is deleted through a
      subvolume mount that does not expose the subvolume to be delete but it
      can still be deleted by using the subvolume id:
      
        /* Compile the following program as "delete_by_spec". */
      
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <inttypes.h>
        #include <linux/btrfs.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/ioctl.h>
        #include <sys/stat.h>
        #include <sys/types.h>
        #include <unistd.h>
      
        static int rm_subvolume_by_id(int fd, uint64_t subvolid)
        {
      	 struct btrfs_ioctl_vol_args_v2 args = {};
      	 int ret;
      
      	 args.flags = BTRFS_SUBVOL_SPEC_BY_ID;
      	 args.subvolid = subvolid;
      
      	 ret = ioctl(fd, BTRFS_IOC_SNAP_DESTROY_V2, &args);
      	 if (ret < 0)
      		 return -1;
      
      	 return 0;
        }
      
        int main(int argc, char *argv[])
        {
      	 int subvolid = 0;
      
      	 if (argc < 3)
      		 exit(1);
      
      	 fprintf(stderr, "Opening %s\n", argv[1]);
      	 int fd = open(argv[1], O_CLOEXEC | O_DIRECTORY);
      	 if (fd < 0)
      		 exit(2);
      
      	 subvolid = atoi(argv[2]);
      
      	 fprintf(stderr, "Deleting subvolume with subvolid %d\n", subvolid);
      	 int ret = rm_subvolume_by_id(fd, subvolid);
      	 if (ret < 0)
      		 exit(3);
      
      	 exit(0);
        }
        #include <stdio.h>"
        #include <stdlib.h>"
        #include <linux/btrfs.h"
      
        truncate -s 10G btrfs.img
        mkfs.btrfs btrfs.img
        export LOOPDEV=$(sudo losetup -f --show btrfs.img)
        mount ${LOOPDEV} /mnt
        sudo chown $(id -u):$(id -g) /mnt
        btrfs subvolume create /mnt/A
        btrfs subvolume create /mnt/B/C
        # Get subvolume id via:
        sudo btrfs subvolume show /mnt/A
        # Save subvolid
        SUBVOLID=<nr>
        sudo umount /mnt
        sudo mount ${LOOPDEV} -o subvol=B/C,user_subvol_rm_allowed /mnt
        ./delete_by_spec /mnt ${SUBVOLID}
      
      With idmapped mounts this can potentially be used by users to delete
      subvolumes/snapshots they would otherwise not have access to as the
      idmapping would be applied to an inode that is not exposed in the mount
      of the subvolume.
      
      The fact that this is a filesystem wide operation suggests it might be a
      good idea to expose this under a separate ioctl that clearly indicates
      this. In essence, the file descriptor passed with the ioctl is merely
      used to identify the filesystem on which to operate when
      BTRFS_SUBVOL_SPEC_BY_ID is used.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c4ed533b
    • Christian Brauner's avatar
      btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls · 4d4340c9
      Christian Brauner authored
      Creating subvolumes and snapshots is one of the core features of btrfs
      and is even available to unprivileged users. Make it possible to use
      subvolume and snapshot creation on idmapped mounts. This is a fairly
      straightforward operation since all the permission checking helpers are
      already capable of handling idmapped mounts. So we just need to pass
      down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4d4340c9
    • Christian Brauner's avatar
      btrfs: check whether fsgid/fsuid are mapped during subvolume creation · 5474bf40
      Christian Brauner authored
      When a new subvolume is created btrfs currently doesn't check whether
      the fsgid/fsuid of the caller actually have a mapping in the user
      namespace attached to the filesystem. The VFS always checks this to make
      sure that the caller's fsgid/fsuid can be represented on-disk. This is
      most relevant for filesystems that can be mounted inside user namespaces
      but it is in general a good hardening measure to prevent unrepresentable
      gid/uid from being written to disk.
      
      Since we want to support idmapped mounts for btrfs ioctls to create
      subvolumes in follow-up patches this becomes important since we want to
      make sure the fsgid/fsuid of the caller as mapped according to the
      idmapped mount can be represented on-disk. Simply add the missing
      fsuidgid_has_mapping() line from the VFS may_create() version to
      btrfs_may_create().
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5474bf40
    • Christian Brauner's avatar
      btrfs: allow idmapped permission inode op · 3bc71ba0
      Christian Brauner authored
      Enable btrfs_permission() to handle idmapped mounts. This is just a
      matter of passing down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3bc71ba0
    • Christian Brauner's avatar
      btrfs: allow idmapped setattr inode op · d4d09464
      Christian Brauner authored
      Enable btrfs_setattr() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d4d09464
    • Christian Brauner's avatar
      btrfs: allow idmapped tmpfile inode op · 98b6ab5f
      Christian Brauner authored
      Enable btrfs_tmpfile() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      98b6ab5f
    • Christian Brauner's avatar
      btrfs: allow idmapped symlink inode op · 5a052108
      Christian Brauner authored
      Enable btrfs_symlink() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5a052108
    • Christian Brauner's avatar
      btrfs: allow idmapped mkdir inode op · b0b3e44d
      Christian Brauner authored
      Enable btrfs_mkdir() to handle idmapped mounts. This is just a matter of
      passing down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b0b3e44d
    • Christian Brauner's avatar
      btrfs: allow idmapped create inode op · e93ca491
      Christian Brauner authored
      Enable btrfs_create() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e93ca491
    • Christian Brauner's avatar
      btrfs: allow idmapped mknod inode op · 72105277
      Christian Brauner authored
      Enable btrfs_mknod() to handle idmapped mounts. This is just a matter of
      passing down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      72105277
    • Christian Brauner's avatar
      btrfs: allow idmapped getattr inode op · c020d2ea
      Christian Brauner authored
      Enable btrfs_getattr() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c020d2ea
    • Christian Brauner's avatar
      btrfs: allow idmapped rename inode op · ca07274c
      Christian Brauner authored
      Enable btrfs_rename() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ca07274c
    • Christian Brauner's avatar
      btrfs: handle idmaps in btrfs_new_inode() · b3b6f5b9
      Christian Brauner authored
      Extend btrfs_new_inode() to take the idmapped mount into account when
      initializing a new inode. This is just a matter of passing down the
      mount's userns. The rest is taken care of in inode_init_owner(). This is
      a preliminary patch to make the individual btrfs inode operations
      idmapped mount aware.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b3b6f5b9