1. 23 Aug, 2021 40 commits
    • Desmond Cheong Zhi Xi's avatar
      btrfs: reset replace target device to allocation state on close · 0d977e0e
      Desmond Cheong Zhi Xi authored
      This crash was observed with a failed assertion on device close:
      
        BTRFS: Transaction aborted (error -28)
        WARNING: CPU: 1 PID: 3902 at fs/btrfs/extent-tree.c:2150 btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
        Modules linked in: btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
        CPU: 1 PID: 3902 Comm: kworker/u8:4 Not tainted 5.14.0-rc5-default+ #1532
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
        RIP: 0010:btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
        RSP: 0018:ffffb7a5452d7d80 EFLAGS: 00010282
        RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
        RBP: ffff97834176a378 R08: 0000000000000001 R09: 0000000000000001
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff97835195d388
        R13: 0000000005b08000 R14: ffff978385484000 R15: 000000000000016c
        FS:  0000000000000000(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000056190d003fe8 CR3: 000000002a81e005 CR4: 0000000000170ea0
        Call Trace:
         flush_space+0x197/0x2f0 [btrfs]
         btrfs_async_reclaim_metadata_space+0x139/0x300 [btrfs]
         process_one_work+0x262/0x5e0
         worker_thread+0x4c/0x320
         ? process_one_work+0x5e0/0x5e0
         kthread+0x144/0x170
         ? set_kthread_struct+0x40/0x40
         ret_from_fork+0x1f/0x30
        irq event stamp: 19334989
        hardirqs last  enabled at (19334997): [<ffffffffab0e0c87>] console_unlock+0x2b7/0x400
        hardirqs last disabled at (19335006): [<ffffffffab0e0d0d>] console_unlock+0x33d/0x400
        softirqs last  enabled at (19334900): [<ffffffffaba0030d>] __do_softirq+0x30d/0x574
        softirqs last disabled at (19334893): [<ffffffffab0721ec>] irq_exit_rcu+0x12c/0x140
        ---[ end trace 45939e308e0dd3c7 ]---
        BTRFS: error (device vdd) in btrfs_run_delayed_refs:2150: errno=-28 No space left
        BTRFS info (device vdd): forced readonly
        BTRFS warning (device vdd): failed setting block group ro: -30
        BTRFS info (device vdd): suspending dev_replace for unmount
        assertion failed: !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state), in fs/btrfs/volumes.c:1150
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.h:3431!
        invalid opcode: 0000 [#1] PREEMPT SMP
        CPU: 1 PID: 3982 Comm: umount Tainted: G        W         5.14.0-rc5-default+ #1532
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
        RSP: 0018:ffffb7a5454c7db8 EFLAGS: 00010246
        RAX: 0000000000000068 RBX: ffff978364b91c00 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
        RBP: ffff9783523a4c00 R08: 0000000000000001 R09: 0000000000000001
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff9783523a4d18
        R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000003
        FS:  00007f61c8f42800(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000056190cffa810 CR3: 0000000030b96002 CR4: 0000000000170ea0
        Call Trace:
         btrfs_close_one_device.cold+0x11/0x55 [btrfs]
         close_fs_devices+0x44/0xb0 [btrfs]
         btrfs_close_devices+0x48/0x160 [btrfs]
         generic_shutdown_super+0x69/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x2c/0xa0
         cleanup_mnt+0x144/0x1b0
         task_work_run+0x59/0xa0
         exit_to_user_mode_loop+0xe7/0xf0
         exit_to_user_mode_prepare+0xaf/0xf0
         syscall_exit_to_user_mode+0x19/0x50
         do_syscall_64+0x4a/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      This happens when close_ctree is called while a dev_replace hasn't
      completed. In close_ctree, we suspend the dev_replace, but keep the
      replace target around so that we can resume the dev_replace procedure
      when we mount the root again. This is the call trace:
      
        close_ctree():
          btrfs_dev_replace_suspend_for_unmount();
          btrfs_close_devices():
            btrfs_close_fs_devices():
              btrfs_close_one_device():
                ASSERT(!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
                       &device->dev_state));
      
      However, since the replace target sticks around, there is a device
      with BTRFS_DEV_STATE_REPLACE_TGT set on close, and we fail the
      assertion in btrfs_close_one_device.
      
      To fix this, if we come across the replace target device when
      closing, we should properly reset it back to allocation state. This
      fix also ensures that if a non-target device has a corrupted state and
      has the BTRFS_DEV_STATE_REPLACE_TGT bit set, the assertion will still
      catch the error.
      Reported-by: default avatarDavid Sterba <dsterba@suse.com>
      Fixes: b2a61667 ("btrfs: fix rw device counting in __btrfs_free_extra_devids")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDesmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0d977e0e
    • Naohiro Aota's avatar
      btrfs: zoned: fix ordered extent boundary calculation · 939c7feb
      Naohiro Aota authored
      btrfs_lookup_ordered_extent() is supposed to query the offset in a file
      instead of the logical address. Pass the file offset from
      submit_extent_page() to calc_bio_boundaries().
      
      Also, calc_bio_boundaries() relies on the bio's operation flag, so move
      the call site after setting it.
      
      Fixes: 390ed29b ("btrfs: refactor submit_extent_page() to make bio and its flag tracing easier")
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      939c7feb
    • Josef Bacik's avatar
      btrfs: do not do preemptive flushing if the majority is global rsv · 11462397
      Josef Bacik authored
      A common characteristic of the bug report where preemptive flushing was
      going full tilt was the fact that the vast majority of the free metadata
      space was used up by the global reserve.  The hard 90% threshold would
      cover the majority of these cases, but to be even smarter we should take
      into account how much of the outstanding reservations are covered by the
      global block reserve.  If the global block reserve accounts for the vast
      majority of outstanding reservations, skip preemptive flushing, as it
      will likely just cause churn and pain.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      11462397
    • Josef Bacik's avatar
      btrfs: reduce the preemptive flushing threshold to 90% · 93c60b17
      Josef Bacik authored
      The preemptive flushing code was added in order to avoid needing to
      synchronously wait for ENOSPC flushing to recover space.  Once we're
      almost full however we can essentially flush constantly.  We were using
      98% as a threshold to determine if we were simply full, however in
      practice this is a really high bar to hit.  For example reports of
      systems running into this problem had around 94% usage and thus
      continued to flush.  Fix this by lowering the threshold to 90%, which is
      a more sane value, especially for smaller file systems.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185
      CC: stable@vger.kernel.org # 5.12+
      Fixes: 576fa348 ("btrfs: improve preemptive background space flushing")
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      93c60b17
    • Marcos Paulo de Souza's avatar
      btrfs: tree-log: check btrfs_lookup_data_extent return value · 3736127a
      Marcos Paulo de Souza authored
      Function btrfs_lookup_data_extent calls btrfs_search_slot to verify if
      the EXTENT_ITEM exists in the extent tree. btrfs_search_slot can return
      values bellow zero if an error happened.
      
      Function replay_one_extent currently checks if the search found
      something (0 returned) and increments the reference, and if not, it
      seems to evaluate as 'not found'.
      
      Fix the condition by checking if the value was bellow zero and return
      early.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3736127a
    • Filipe Manana's avatar
      btrfs: avoid unnecessarily logging directories that had no changes · 8be2ba2e
      Filipe Manana authored
      There are several cases where when logging an inode we need to log its
      parent directories or logging subdirectories when logging a directory.
      
      There are cases however where we end up logging a directory even if it was
      not changed in the current transaction, no dentries added or removed since
      the last transaction. While this is harmless from a functional point of
      view, it is a waste time as it brings no advantage.
      
      One example where this is triggered is the following:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        $ mkdir /mnt/A
        $ mkdir /mnt/B
        $ mkdir /mnt/C
      
        $ touch /mnt/A/foo
        $ ln /mnt/A/foo /mnt/B/bar
        $ ln /mnt/A/foo /mnt/C/baz
      
        $ sync
      
        $ rm -f /mnt/A/foo
        $ xfs_io -c "fsync" /mnt/B/bar
      
      This last fsync ends up logging directories A, B and C, however we only
      need to log directory A, as B and C were not changed since the last
      transaction commit.
      
      So fix this by changing need_log_inode(), to return false in case the
      given inode is a directory and has a ->last_trans value smaller than the
      current transaction's ID.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8be2ba2e
    • Christian Brauner's avatar
      btrfs: allow idmapped mount · 5b9b26f5
      Christian Brauner authored
      Now that we converted btrfs internally to account for idmapped mounts
      allow the creation of idmapped mounts on by setting the FS_ALLOW_IDMAP
      flag.  We only need to raise this flag on the btrfs_root_fs_type
      filesystem since btrfs_mount_root() is ultimately responsible for
      allocating the superblock and is called into from btrfs_mount()
      associated with btrfs_fs_type.
      
      The conversion of the btrfs inode operations was straightforward.
      Regarding btrfs specific ioctls that perform checks based on inode
      permissions only those have been allowed that are not filesystem wide
      operations and hence can be reasonably charged against a specific mount.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5b9b26f5
    • Christian Brauner's avatar
      btrfs: handle ACLs on idmapped mounts · 4a8b34af
      Christian Brauner authored
      Make the ACL code idmapped mount aware. The POSIX default and POSIX
      access ACLs are the only ACLs other than some specific xattrs that take
      DAC permissions into account. On an idmapped mount they need to be
      translated according to the mount's userns. The main change is done to
      __btrfs_set_acl() which is responsible for translating POSIX ACLs to
      their final on-disk representation.
      
      The btrfs_init_acl() helper does not need to take the idmapped mount
      into account since it is called in the context of file creation
      operations (mknod, create, mkdir, symlink, tmpfile) and is used for
      btrfs_init_inode_security() to copy POSIX default and POSIX access
      permissions from the parent directory. These ACLs need to be inherited
      unmodified from the parent directory. This is identical to what we do
      for ext4 and xfs.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4a8b34af
    • Christian Brauner's avatar
      btrfs: allow idmapped INO_LOOKUP_USER ioctl · 6623d9a0
      Christian Brauner authored
      The INO_LOOKUP_USER is an unprivileged version of the INO_LOOKUP ioctl
      and has the following restrictions. The main difference between the two
      is that INO_LOOKUP is filesystem wide operation wheres INO_LOOKUP_USER
      is scoped beneath the file descriptor passed with the ioctl.
      Specifically, INO_LOOKUP_USER must adhere to the following restrictions:
      
      - The caller must be privileged over each inode of each path component
        for the path they are trying to lookup.
      
      - The path for the subvolume the caller is trying to lookup must be reachable
        from the inode associated with the file descriptor passed with the ioctl.
      
      The second condition makes it possible to scope the lookup of the path
      to the mount identified by the file descriptor passed with the ioctl.
      This allows us to enable this ioctl on idmapped mounts.
      
      Specifically, this is possible because all child subvolumes of a parent
      subvolume are reachable when the parent subvolume is mounted. So if the
      user had access to open the parent subvolume or has been given the fd
      then they can lookup the path if they had access to it provided they
      were privileged over each path component.
      
      Note, the INO_LOOKUP_USER ioctl allows a user to learn the path and name
      of a subvolume even though they would otherwise be restricted from doing
      so via regular VFS-based lookup.
      
      So think about a parent subvolume with multiple child subvolumes.
      Someone could mount he parent subvolume and restrict access to the child
      subvolumes by overmounting them with empty directories. At this point
      the user can't traverse the child subvolumes and they can't open files
      in the child subvolumes.  However, they can still learn the path of
      child subvolumes as long as they have access to the parent subvolume by
      using the INO_LOOKUP_USER ioctl.
      
      The underlying assumption here is that it's ok that the lookup ioctls
      can't really take mounts into account other than the original mount the
      fd belongs to during lookup. Since this assumption is baked into the
      original INO_LOOKUP_USER ioctl we can extend it to idmapped mounts.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6623d9a0
    • Christian Brauner's avatar
      btrfs: allow idmapped SUBVOL_SETFLAGS ioctl · 39e1674f
      Christian Brauner authored
      Setting flags on subvolumes or snapshots are core features of btrfs. The
      SUBVOL_SETFLAGS ioctl is especially important as it allows to make
      subvolumes and snapshots read-only or read-write. Allow setting flags on
      btrfs subvolumes and snapshots on idmapped mounts. This is a fairly
      straightforward operation since all the permission checking helpers are
      already capable of handling idmapped mounts. So we just need to pass
      down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      39e1674f
    • Christian Brauner's avatar
      btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls · e4fed17a
      Christian Brauner authored
      The SET_RECEIVED_SUBVOL ioctls are used to set information about
      a received subvolume. Make it possible to set information about a
      received subvolume on idmapped mounts. This is a fairly straightforward
      operation since all the permission checking helpers are already capable
      of handling idmapped mounts. So we just need to pass down the mount's
      userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e4fed17a
    • Christian Brauner's avatar
      btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids · aabb34e7
      Christian Brauner authored
      So far we prevented the deletion of subvolumes and snapshots using
      subvolume ids possible with the BTRFS_SUBVOL_SPEC_BY_ID flag.
      
      This restriction is necessary on idmapped mounts as this allows
      filesystem wide subvolume and snapshot deletions and thus can escape the
      scope of what's exposed under the mount identified by the fd passed with
      the ioctl.
      
      Deletion by subvolume id works by looking for an alias of the parent of
      the subvolume or snapshot to be deleted. The parent alias can be
      anywhere in the filesystem. However, as long as the alias of the parent
      that is found is the same as the one identified by the file descriptor
      passed through the ioctl we can allow the deletion.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      aabb34e7
    • Christian Brauner's avatar
      btrfs: allow idmapped SNAP_DESTROY ioctls · c4ed533b
      Christian Brauner authored
      Destroying subvolumes and snapshots are important features of btrfs.
      Both operations are available to unprivileged users if the filesystem
      has been mounted with the "user_subvol_rm_allowed" mount option. Allow
      subvolume and snapshot deletion on idmapped mounts. This is a fairly
      straightforward operation since all the permission checking helpers are
      already capable of handling idmapped mounts. So we just need to pass
      down the mount's userns.
      
      Subvolumes and snapshots can either be deleted by specifying their name
      or - if BTRFS_IOC_SNAP_DESTROY_V2 is used - by their subvolume or
      snapshot id if the BTRFS_SUBVOL_SPEC_BY_ID is set.
      
      This feature is blocked on idmapped mounts as this allows filesystem
      wide subvolume deletions and thus can escape the scope of what's exposed
      under the mount identified by the fd passed with the ioctl.
      
      This means that even the root or CAP_SYS_ADMIN capable user can't delete
      a subvolume via BTRFS_SUBVOL_SPEC_BY_ID. This is intentional.
      
      The root user is currently already subject to permission checks in
      btrfs_may_delete() including whether the inode's i_uid/i_gid of the
      directory the subvolume is located in have a mapping in the caller's
      idmapping. For this to fail isn't currently possible since a btrfs
      filesystem can't be mounted with a non-initial idmapping but it shows
      that even the root user would fail to delete a subvolume if the relevant
      inode isn't mapped in their idmapping. The idmapped mount case is the
      same in principle.
      
      This isn't a huge problem a root user wanting to delete arbitrary
      subvolumes can just always create another (even detached) mount without
      an idmapping attached.
      
      In addition, we will allow BTRFS_SUBVOL_SPEC_BY_ID for cases where the
      subvolume to delete is directly located under inode referenced by the fd
      passed for the ioctl() in a follow-up commit.
      
      Here is an example where a btrfs subvolume is deleted through a
      subvolume mount that does not expose the subvolume to be delete but it
      can still be deleted by using the subvolume id:
      
        /* Compile the following program as "delete_by_spec". */
      
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <inttypes.h>
        #include <linux/btrfs.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/ioctl.h>
        #include <sys/stat.h>
        #include <sys/types.h>
        #include <unistd.h>
      
        static int rm_subvolume_by_id(int fd, uint64_t subvolid)
        {
      	 struct btrfs_ioctl_vol_args_v2 args = {};
      	 int ret;
      
      	 args.flags = BTRFS_SUBVOL_SPEC_BY_ID;
      	 args.subvolid = subvolid;
      
      	 ret = ioctl(fd, BTRFS_IOC_SNAP_DESTROY_V2, &args);
      	 if (ret < 0)
      		 return -1;
      
      	 return 0;
        }
      
        int main(int argc, char *argv[])
        {
      	 int subvolid = 0;
      
      	 if (argc < 3)
      		 exit(1);
      
      	 fprintf(stderr, "Opening %s\n", argv[1]);
      	 int fd = open(argv[1], O_CLOEXEC | O_DIRECTORY);
      	 if (fd < 0)
      		 exit(2);
      
      	 subvolid = atoi(argv[2]);
      
      	 fprintf(stderr, "Deleting subvolume with subvolid %d\n", subvolid);
      	 int ret = rm_subvolume_by_id(fd, subvolid);
      	 if (ret < 0)
      		 exit(3);
      
      	 exit(0);
        }
        #include <stdio.h>"
        #include <stdlib.h>"
        #include <linux/btrfs.h"
      
        truncate -s 10G btrfs.img
        mkfs.btrfs btrfs.img
        export LOOPDEV=$(sudo losetup -f --show btrfs.img)
        mount ${LOOPDEV} /mnt
        sudo chown $(id -u):$(id -g) /mnt
        btrfs subvolume create /mnt/A
        btrfs subvolume create /mnt/B/C
        # Get subvolume id via:
        sudo btrfs subvolume show /mnt/A
        # Save subvolid
        SUBVOLID=<nr>
        sudo umount /mnt
        sudo mount ${LOOPDEV} -o subvol=B/C,user_subvol_rm_allowed /mnt
        ./delete_by_spec /mnt ${SUBVOLID}
      
      With idmapped mounts this can potentially be used by users to delete
      subvolumes/snapshots they would otherwise not have access to as the
      idmapping would be applied to an inode that is not exposed in the mount
      of the subvolume.
      
      The fact that this is a filesystem wide operation suggests it might be a
      good idea to expose this under a separate ioctl that clearly indicates
      this. In essence, the file descriptor passed with the ioctl is merely
      used to identify the filesystem on which to operate when
      BTRFS_SUBVOL_SPEC_BY_ID is used.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c4ed533b
    • Christian Brauner's avatar
      btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls · 4d4340c9
      Christian Brauner authored
      Creating subvolumes and snapshots is one of the core features of btrfs
      and is even available to unprivileged users. Make it possible to use
      subvolume and snapshot creation on idmapped mounts. This is a fairly
      straightforward operation since all the permission checking helpers are
      already capable of handling idmapped mounts. So we just need to pass
      down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4d4340c9
    • Christian Brauner's avatar
      btrfs: check whether fsgid/fsuid are mapped during subvolume creation · 5474bf40
      Christian Brauner authored
      When a new subvolume is created btrfs currently doesn't check whether
      the fsgid/fsuid of the caller actually have a mapping in the user
      namespace attached to the filesystem. The VFS always checks this to make
      sure that the caller's fsgid/fsuid can be represented on-disk. This is
      most relevant for filesystems that can be mounted inside user namespaces
      but it is in general a good hardening measure to prevent unrepresentable
      gid/uid from being written to disk.
      
      Since we want to support idmapped mounts for btrfs ioctls to create
      subvolumes in follow-up patches this becomes important since we want to
      make sure the fsgid/fsuid of the caller as mapped according to the
      idmapped mount can be represented on-disk. Simply add the missing
      fsuidgid_has_mapping() line from the VFS may_create() version to
      btrfs_may_create().
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5474bf40
    • Christian Brauner's avatar
      btrfs: allow idmapped permission inode op · 3bc71ba0
      Christian Brauner authored
      Enable btrfs_permission() to handle idmapped mounts. This is just a
      matter of passing down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3bc71ba0
    • Christian Brauner's avatar
      btrfs: allow idmapped setattr inode op · d4d09464
      Christian Brauner authored
      Enable btrfs_setattr() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d4d09464
    • Christian Brauner's avatar
      btrfs: allow idmapped tmpfile inode op · 98b6ab5f
      Christian Brauner authored
      Enable btrfs_tmpfile() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      98b6ab5f
    • Christian Brauner's avatar
      btrfs: allow idmapped symlink inode op · 5a052108
      Christian Brauner authored
      Enable btrfs_symlink() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5a052108
    • Christian Brauner's avatar
      btrfs: allow idmapped mkdir inode op · b0b3e44d
      Christian Brauner authored
      Enable btrfs_mkdir() to handle idmapped mounts. This is just a matter of
      passing down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b0b3e44d
    • Christian Brauner's avatar
      btrfs: allow idmapped create inode op · e93ca491
      Christian Brauner authored
      Enable btrfs_create() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e93ca491
    • Christian Brauner's avatar
      btrfs: allow idmapped mknod inode op · 72105277
      Christian Brauner authored
      Enable btrfs_mknod() to handle idmapped mounts. This is just a matter of
      passing down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      72105277
    • Christian Brauner's avatar
      btrfs: allow idmapped getattr inode op · c020d2ea
      Christian Brauner authored
      Enable btrfs_getattr() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c020d2ea
    • Christian Brauner's avatar
      btrfs: allow idmapped rename inode op · ca07274c
      Christian Brauner authored
      Enable btrfs_rename() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ca07274c
    • Christian Brauner's avatar
      btrfs: handle idmaps in btrfs_new_inode() · b3b6f5b9
      Christian Brauner authored
      Extend btrfs_new_inode() to take the idmapped mount into account when
      initializing a new inode. This is just a matter of passing down the
      mount's userns. The rest is taken care of in inode_init_owner(). This is
      a preliminary patch to make the individual btrfs inode operations
      idmapped mount aware.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b3b6f5b9
    • Christian Brauner's avatar
      namei: add mapping aware lookup helper · c2fd68b6
      Christian Brauner authored
      Various filesystems rely on the lookup_one_len() helper to lookup a
      single path component relative to a well-known starting point. Allow
      such filesystems to support idmapped mounts by adding a version of this
      helper to take the idmap into account when calling inode_permission().
      This change is a required to let btrfs (and other filesystems) support
      idmapped mounts.
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c2fd68b6
    • Anand Jain's avatar
      btrfs: sysfs: document structures and their associated files · e7849e33
      Anand Jain authored
      Sysfs file has grown big. It takes some time to locate the correct
      struct attribute to add new files. Create a table and map the struct
      attribute to its sysfs path.
      
      Also, fix the comment about the debug sysfs path.  And add the comments
      to the attributes instead of attribute group, where sysfs file names are
      defined.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e7849e33
    • Qu Wenruo's avatar
      btrfs: fix NULL pointer dereference when deleting device by invalid id · e4571b8c
      Qu Wenruo authored
      [BUG]
      It's easy to trigger NULL pointer dereference, just by removing a
      non-existing device id:
      
       # mkfs.btrfs -f -m single -d single /dev/test/scratch1 \
      				     /dev/test/scratch2
       # mount /dev/test/scratch1 /mnt/btrfs
       # btrfs device remove 3 /mnt/btrfs
      
      Then we have the following kernel NULL pointer dereference:
      
       BUG: kernel NULL pointer dereference, address: 0000000000000000
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] PREEMPT SMP NOPTI
       CPU: 9 PID: 649 Comm: btrfs Not tainted 5.14.0-rc3-custom+ #35
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
       RIP: 0010:btrfs_rm_device+0x4de/0x6b0 [btrfs]
        btrfs_ioctl+0x18bb/0x3190 [btrfs]
        ? lock_is_held_type+0xa5/0x120
        ? find_held_lock.constprop.0+0x2b/0x80
        ? do_user_addr_fault+0x201/0x6a0
        ? lock_release+0xd2/0x2d0
        ? __x64_sys_ioctl+0x83/0xb0
        __x64_sys_ioctl+0x83/0xb0
        do_syscall_64+0x3b/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      [CAUSE]
      Commit a27a94c2 ("btrfs: Make btrfs_find_device_by_devspec return
      btrfs_device directly") moves the "missing" device path check into
      btrfs_rm_device().
      
      But btrfs_rm_device() itself can have case where it only receives
      @devid, with NULL as @device_path.
      
      In that case, calling strcmp() on NULL will trigger the NULL pointer
      dereference.
      
      Before that commit, we handle the "missing" case inside
      btrfs_find_device_by_devspec(), which will not check @device_path at all
      if @devid is provided, thus no way to trigger the bug.
      
      [FIX]
      Before calling strcmp(), also make sure @device_path is not NULL.
      
      Fixes: a27a94c2 ("btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly")
      CC: stable@vger.kernel.org # 5.4+
      Reported-by: default avatarbutt3rflyh4ck <butterflyhuangxx@gmail.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e4571b8c
    • Naohiro Aota's avatar
      btrfs: zoned: add asserts on splitting extent_map · 63fb5879
      Naohiro Aota authored
      We call split_zoned_em() on an extent_map on submitting a bio for it. Thus,
      we can assume the extent_map is PINNED, not LOGGING, and in the modified
      list. Add ASSERT()s to ensure the extent_maps after the split also has the
      proper flags set and are in the modified list.
      Suggested-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      63fb5879
    • Naohiro Aota's avatar
      btrfs: zoned: fix block group alloc_offset calculation · 0ae79c6f
      Naohiro Aota authored
      alloc_offset is offset from the start of a block group and @offset is
      actually an address in logical space. Thus, we need to consider
      block_group->start when calculating them.
      
      Fixes: 011b41bf ("btrfs: zoned: advance allocation pointer after tree log node")
      CC: stable@vger.kernel.org # 5.12+
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0ae79c6f
    • Naohiro Aota's avatar
      btrfs: zoned: suppress reclaim error message on EAGAIN · ba86dd9f
      Naohiro Aota authored
      btrfs_relocate_chunk() can fail with -EAGAIN when e.g. send operations are
      running. The message can fail btrfs/187 and it's unnecessary because we
      anyway add it back to the reclaim list.
      
      btrfs_reclaim_bgs_work()
      `-> btrfs_relocate_chunk()
          `-> btrfs_relocate_block_group()
              `-> reloc_chunk_start()
                  `-> if (fs_info->send_in_progress)
                      `-> return -EAGAIN
      
      CC: stable@vger.kernel.org # 5.13+
      Fixes: 18bb8bbf ("btrfs: zoned: automatically reclaim zones")
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ba86dd9f
    • Johannes Thumshirn's avatar
      btrfs: zoned: allow disabling of zone auto reclaim · 77233c2d
      Johannes Thumshirn authored
      Automatically reclaiming dirty zones might not always be desired for all
      workloads, especially as there are currently still some rough edges with
      the relocation code on zoned filesystems.
      
      Allow disabling zone auto reclaim on a per filesystem basis by writing 0
      as the threshold value.
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      77233c2d
    • Filipe Manana's avatar
      btrfs: update comment at log_conflicting_inodes() · 1f295373
      Filipe Manana authored
      A comment at log_conflicting_inodes() mentions that we check the inode's
      logged_trans field instead of using btrfs_inode_in_log() because the field
      last_log_commit is not updated when we log that an inode exists and the
      inode has the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) set. The part
      about the full sync flag is not true anymore since commit 9acc8103
      ("btrfs: fix unpersisted i_size on fsync after expanding truncate"), so
      update the comment to not mention that part anymore.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1f295373
    • Filipe Manana's avatar
      btrfs: remove no longer needed full sync flag check at inode_logged() · d135a533
      Filipe Manana authored
      Now that we are checking if the inode's logged_trans is 0 to detect the
      possibility of the inode having been evicted and reloaded, the test for
      the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) is no longer needed at
      tree-log.c:inode_logged(). Its purpose was to detect the possibility
      of a previous eviction as well, since when an inode is loaded the full
      sync flag is always set on it (and only cleared after the inode is
      logged).
      
      So just remove the check and update the comment. The check for the inode's
      logged_trans being 0 was added recently by the patch with the subject
      "btrfs: eliminate some false positives when checking if inode was logged".
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d135a533
    • Filipe Manana's avatar
      btrfs: remove unnecessary NULL check for the new inode during rename exchange · 1c167b87
      Filipe Manana authored
      At the very end of btrfs_rename_exchange(), in case an error happened, we
      are checking if 'new_inode' is NULL, but that is not needed since during a
      rename exchange, unlike regular renames, 'new_inode' can never be NULL,
      and if it were, we would have a crashed much earlier when we dereference it
      multiple times.
      
      So remove the check because it is not necessary and because it is causing
      static checkers to emit a warning. I probably introduced the check by
      copy-pasting similar code from btrfs_rename(), where 'new_inode' can be
      NULL, in commit 86e8aa0e ("Btrfs: unpin logs if rename exchange
      operation fails").
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1c167b87
    • Goldwyn Rodrigues's avatar
      btrfs: allocate backref_ctx on stack in find_extent_clone · dce28150
      Goldwyn Rodrigues authored
      Instead of using kmalloc() to allocate backref_ctx, allocate backref_ctx
      on stack. The size is reasonably small.
      
      sizeof(backref_ctx) = 48
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dce28150
    • Goldwyn Rodrigues's avatar
      btrfs: allocate btrfs_ioctl_defrag_range_args on stack · c853a578
      Goldwyn Rodrigues authored
      Instead of using kmalloc() to allocate btrfs_ioctl_defrag_range_args,
      allocate btrfs_ioctl_defrag_range_args on stack, the size is reasonably
      small and ioctls are called in process context.
      
      sizeof(btrfs_ioctl_defrag_range_args) = 48
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c853a578
    • Goldwyn Rodrigues's avatar
      btrfs: allocate btrfs_ioctl_quota_rescan_args on stack · 0afb603a
      Goldwyn Rodrigues authored
      Instead of using kmalloc() to allocate btrfs_ioctl_quota_rescan_args,
      allocate btrfs_ioctl_quota_rescan_args on stack, the size is reasonably
      small and ioctls are called in process context.
      
      sizeof(btrfs_ioctl_quota_rescan_args) = 64
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0afb603a
    • Goldwyn Rodrigues's avatar
      btrfs: allocate file_ra_state on stack in readahead_cache · 98caf953
      Goldwyn Rodrigues authored
      Instead of allocating file_ra_state using kmalloc, allocate on stack.
      sizeof(struct readahead) = 32 bytes.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      98caf953
    • Marcos Paulo de Souza's avatar
      btrfs: introduce btrfs_search_backwards function · 0ff40a91
      Marcos Paulo de Souza authored
      It's a common practice to start a search using offset (u64)-1, which is
      the u64 maximum value, meaning that we want the search_slot function to
      be set in the last item with the same objectid and type.
      
      Once we are in this position, it's a matter to start a search backwards
      by calling btrfs_previous_item, which will check if we'll need to go to
      a previous leaf and other necessary checks, only to be sure that we are
      in last offset of the same object and type.
      
      The new btrfs_search_backwards function does the all these steps when
      necessary, and can be used to avoid code duplication.
      Signed-off-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0ff40a91