1. 27 Apr, 2022 5 commits
    • Filipe Manana's avatar
      btrfs: skip compression property for anything other than files and dirs · 4b73c55f
      Filipe Manana authored
      The compression property only has effect on regular files and directories
      (so that it's propagated to files and subdirectories created inside a
      directory). For any other inode type (symlink, fifo, device, socket),
      it's pointless to set the compression property because it does nothing
      and ends up unnecessarily wasting leaf space due to the pointless xattr
      (75 or 76 bytes, depending on the compression value). Symlinks in
      particular are very common (for example, I have almost 10k symlinks under
      /etc, /usr and /var alone) and therefore it's worth to avoid wasting
      leaf space with the compression xattr.
      
      For example, the compression property can end up on a symlink or character
      device implicitly, through inheritance from a parent directory
      
        $ mkdir /mnt/testdir
        $ btrfs property set /mnt/testdir compression lzo
      
        $ ln -s yadayada /mnt/testdir/lnk
        $ mknod /mnt/testdir/dev c 0 0
      
      Or explicitly like this:
      
        $ ln -s yadayda /mnt/lnk
        $ setfattr -h -n btrfs.compression -v lzo /mnt/lnk
      
      So skip the compression property on inodes that are neither a regular
      file nor a directory.
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4b73c55f
    • Filipe Manana's avatar
      btrfs: do not BUG_ON() on failure to update inode when setting xattr · 193b4e83
      Filipe Manana authored
      We are doing a BUG_ON() if we fail to update an inode after setting (or
      clearing) a xattr, but there's really no reason to not instead simply
      abort the transaction and return the error to the caller. This should be
      a rare error because we have previously reserved enough metadata space to
      update the inode and the delayed inode should have already been setup, so
      an -ENOSPC or -ENOMEM, which are the possible errors, are very unlikely to
      happen.
      
      So replace the BUG_ON()s with a transaction abort.
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      193b4e83
    • Filipe Manana's avatar
      btrfs: always log symlinks in full mode · d0e64a98
      Filipe Manana authored
      On Linux, empty symlinks are invalid, and attempting to create one with
      the system call symlink(2) results in an -ENOENT error and this is
      explicitly documented in the man page.
      
      If we rename a symlink that was created in the current transaction and its
      parent directory was logged before, we actually end up logging the symlink
      without logging its content, which is stored in an inline extent. That
      means that after a power failure we can end up with an empty symlink,
      having no content and an i_size of 0 bytes.
      
      It can be easily reproduced like this:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        $ mkdir /mnt/testdir
        $ sync
      
        # Create a file inside the directory and fsync the directory.
        $ touch /mnt/testdir/foo
        $ xfs_io -c "fsync" /mnt/testdir
      
        # Create a symlink inside the directory and then rename the symlink.
        $ ln -s /mnt/testdir/foo /mnt/testdir/bar
        $ mv /mnt/testdir/bar /mnt/testdir/baz
      
        # Now fsync again the directory, this persist the log tree.
        $ xfs_io -c "fsync" /mnt/testdir
      
        <power failure>
      
        $ mount /dev/sdc /mnt
        $ stat -c %s /mnt/testdir/baz
        0
        $ readlink /mnt/testdir/baz
        $
      
      Fix this by always logging symlinks in full mode (LOG_INODE_ALL), so that
      their content is also logged.
      
      A test case for fstests will follow.
      
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d0e64a98
    • Chung-Chiang Cheng's avatar
      btrfs: do not allow compression on nodatacow files · 0e852ab8
      Chung-Chiang Cheng authored
      Compression and nodatacow are mutually exclusive. A similar issue was
      fixed by commit f37c563b ("btrfs: add missing check for nocow and
      compression inode flags"). Besides ioctl, there is another way to
      enable/disable/reset compression directly via xattr. The following
      steps will result in a invalid combination.
      
        $ touch bar
        $ chattr +C bar
        $ lsattr bar
        ---------------C-- bar
        $ setfattr -n btrfs.compression -v zstd bar
        $ lsattr bar
        --------c------C-- bar
      
      To align with the logic in check_fsflags, nocompress will also be
      unacceptable after this patch, to prevent mix any compression-related
      options with nodatacow.
      
        $ touch bar
        $ chattr +C bar
        $ lsattr bar
        ---------------C-- bar
        $ setfattr -n btrfs.compression -v zstd bar
        setfattr: bar: Invalid argument
        $ setfattr -n btrfs.compression -v no bar
        setfattr: bar: Invalid argument
      
      When both compression and nodatacow are enabled, then
      btrfs_run_delalloc_range prefers nodatacow and no compression happens.
      Reported-by: default avatarJayce Lin <jaycelin@synology.com>
      CC: stable@vger.kernel.org # 5.10.x: e6f9d696: btrfs: export a helper for compression hard check
      CC: stable@vger.kernel.org # 5.10.x
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChung-Chiang Cheng <cccheng@synology.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0e852ab8
    • Chung-Chiang Cheng's avatar
      btrfs: export a helper for compression hard check · e6f9d696
      Chung-Chiang Cheng authored
      inode_can_compress will be used outside of inode.c to check the
      availability of setting compression flag by xattr. This patch moves
      this function as an internal helper and renames it to
      btrfs_inode_can_compress.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChung-Chiang Cheng <cccheng@synology.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e6f9d696
  2. 21 Apr, 2022 2 commits
    • Naohiro Aota's avatar
      btrfs: zoned: use dedicated lock for data relocation · 5f0addf7
      Naohiro Aota authored
      Currently, we use btrfs_inode_{lock,unlock}() to grant an exclusive
      writeback of the relocation data inode in
      btrfs_zoned_data_reloc_{lock,unlock}(). However, that can cause a deadlock
      in the following path.
      
      Thread A takes btrfs_inode_lock() and waits for metadata reservation by
      e.g, waiting for writeback:
      
      prealloc_file_extent_cluster()
        - btrfs_inode_lock(&inode->vfs_inode, 0);
        - btrfs_prealloc_file_range()
        ...
          - btrfs_replace_file_extents()
            - btrfs_start_transaction
            ...
              - btrfs_reserve_metadata_bytes()
      
      Thread B (e.g, doing a writeback work) needs to wait for the inode lock to
      continue writeback process:
      
      do_writepages
        - btrfs_writepages
          - extent_writpages
            - btrfs_zoned_data_reloc_lock(BTRFS_I(inode));
              - btrfs_inode_lock()
      
      The deadlock is caused by relying on the vfs_inode's lock. By using it, we
      introduced unnecessary exclusion of writeback and
      btrfs_prealloc_file_range(). Also, the lock at this point is useless as we
      don't have any dirty pages in the inode yet.
      
      Introduce fs_info->zoned_data_reloc_io_lock and use it for the exclusive
      writeback.
      
      Fixes: 35156d85 ("btrfs: zoned: only allow one process to add pages to a relocation inode")
      CC: stable@vger.kernel.org # 5.16.x: 869f4cdc: btrfs: zoned: encapsulate inode locking for zoned relocation
      CC: stable@vger.kernel.org # 5.16.x
      CC: stable@vger.kernel.org # 5.17
      Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5f0addf7
    • Filipe Manana's avatar
      btrfs: fix assertion failure during scrub due to block group reallocation · a692e13d
      Filipe Manana authored
      During a scrub, or device replace, we can race with block group removal
      and allocation and trigger the following assertion failure:
      
      [7526.385524] assertion failed: cache->start == chunk_offset, in fs/btrfs/scrub.c:3817
      [7526.387351] ------------[ cut here ]------------
      [7526.387373] kernel BUG at fs/btrfs/ctree.h:3599!
      [7526.388001] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
      [7526.388970] CPU: 2 PID: 1158150 Comm: btrfs Not tainted 5.17.0-rc8-btrfs-next-114 #4
      [7526.390279] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [7526.392430] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
      [7526.393520] Code: f3 48 c7 c7 20 (...)
      [7526.396926] RSP: 0018:ffffb9154176bc40 EFLAGS: 00010246
      [7526.397690] RAX: 0000000000000048 RBX: ffffa0db8a910000 RCX: 0000000000000000
      [7526.398732] RDX: 0000000000000000 RSI: ffffffff9d7239a2 RDI: 00000000ffffffff
      [7526.399766] RBP: ffffa0db8a911e10 R08: ffffffffa71a3ca0 R09: 0000000000000001
      [7526.400793] R10: 0000000000000001 R11: 0000000000000000 R12: ffffa0db4b170800
      [7526.401839] R13: 00000003494b0000 R14: ffffa0db7c55b488 R15: ffffa0db8b19a000
      [7526.402874] FS:  00007f6c99c40640(0000) GS:ffffa0de6d200000(0000) knlGS:0000000000000000
      [7526.404038] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [7526.405040] CR2: 00007f31b0882160 CR3: 000000014b38c004 CR4: 0000000000370ee0
      [7526.406112] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [7526.407148] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [7526.408169] Call Trace:
      [7526.408529]  <TASK>
      [7526.408839]  scrub_enumerate_chunks.cold+0x11/0x79 [btrfs]
      [7526.409690]  ? do_wait_intr_irq+0xb0/0xb0
      [7526.410276]  btrfs_scrub_dev+0x226/0x620 [btrfs]
      [7526.410995]  ? preempt_count_add+0x49/0xa0
      [7526.411592]  btrfs_ioctl+0x1ab5/0x36d0 [btrfs]
      [7526.412278]  ? __fget_files+0xc9/0x1b0
      [7526.412825]  ? kvm_sched_clock_read+0x14/0x40
      [7526.413459]  ? lock_release+0x155/0x4a0
      [7526.414022]  ? __x64_sys_ioctl+0x83/0xb0
      [7526.414601]  __x64_sys_ioctl+0x83/0xb0
      [7526.415150]  do_syscall_64+0x3b/0xc0
      [7526.415675]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [7526.416408] RIP: 0033:0x7f6c99d34397
      [7526.416931] Code: 3c 1c e8 1c ff (...)
      [7526.419641] RSP: 002b:00007f6c99c3fca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      [7526.420735] RAX: ffffffffffffffda RBX: 00005624e1e007b0 RCX: 00007f6c99d34397
      [7526.421779] RDX: 00005624e1e007b0 RSI: 00000000c400941b RDI: 0000000000000003
      [7526.422820] RBP: 0000000000000000 R08: 00007f6c99c40640 R09: 0000000000000000
      [7526.423906] R10: 00007f6c99c40640 R11: 0000000000000246 R12: 00007fff746755de
      [7526.424924] R13: 00007fff746755df R14: 0000000000000000 R15: 00007f6c99c40640
      [7526.425950]  </TASK>
      
      That assertion is relatively new, introduced with commit d04fbe19
      ("btrfs: scrub: cleanup the argument list of scrub_chunk()").
      
      The block group we get at scrub_enumerate_chunks() can actually have a
      start address that is smaller then the chunk offset we extracted from a
      device extent item we got from the commit root of the device tree.
      This is very rare, but it can happen due to a race with block group
      removal and allocation. For example, the following steps show how this
      can happen:
      
      1) We are at transaction T, and we have the following blocks groups,
         sorted by their logical start address:
      
         [ bg A, start address A, length 1G (data) ]
         [ bg B, start address B, length 1G (data) ]
         (...)
         [ bg W, start address W, length 1G (data) ]
      
           --> logical address space hole of 256M,
               there used to be a 256M metadata block group here
      
         [ bg Y, start address Y, length 256M (metadata) ]
      
            --> Y matches W's end offset + 256M
      
         Block group Y is the block group with the highest logical address in
         the whole filesystem;
      
      2) Block group Y is deleted and its extent mapping is removed by the call
         to remove_extent_mapping() made from btrfs_remove_block_group().
      
         So after this point, the last element of the mapping red black tree,
         its rightmost node, is the mapping for block group W;
      
      3) While still at transaction T, a new data block group is allocated,
         with a length of 1G. When creating the block group we do a call to
         find_next_chunk(), which returns the logical start address for the
         new block group. This calls returns X, which corresponds to the
         end offset of the last block group, the rightmost node in the mapping
         red black tree (fs_info->mapping_tree), plus one.
      
         So we get a new block group that starts at logical address X and with
         a length of 1G. It spans over the whole logical range of the old block
         group Y, that was previously removed in the same transaction.
      
         However the device extent allocated to block group X is not the same
         device extent that was used by block group Y, and it also does not
         overlap that extent, which must be always the case because we allocate
         extents by searching through the commit root of the device tree
         (otherwise it could corrupt a filesystem after a power failure or
         an unclean shutdown in general), so the extent allocator is behaving
         as expected;
      
      4) We have a task running scrub, currently at scrub_enumerate_chunks().
         There it searches for device extent items in the device tree, using
         its commit root. It finds a device extent item that was used by
         block group Y, and it extracts the value Y from that item into the
         local variable 'chunk_offset', using btrfs_dev_extent_chunk_offset();
      
         It then calls btrfs_lookup_block_group() to find block group for
         the logical address Y - since there's currently no block group that
         starts at that logical address, it returns block group X, because
         its range contains Y.
      
         This results in triggering the assertion:
      
            ASSERT(cache->start == chunk_offset);
      
         right before calling scrub_chunk(), as cache->start is X and
         chunk_offset is Y.
      
      This is more likely to happen of filesystems not larger than 50G, because
      for these filesystems we use a 256M size for metadata block groups and
      a 1G size for data block groups, while for filesystems larger than 50G,
      we use a 1G size for both data and metadata block groups (except for
      zoned filesystems). It could also happen on any filesystem size due to
      the fact that system block groups are always smaller (32M) than both
      data and metadata block groups, but these are not frequently deleted, so
      much less likely to trigger the race.
      
      So make scrub skip any block group with a start offset that is less than
      the value we expect, as that means it's a new block group that was created
      in the current transaction. It's pointless to continue and try to scrub
      its extents, because scrub searches for extents using the commit root, so
      it won't find any. For a device replace, skip it as well for the same
      reasons, and we don't need to worry about the possibility of extents of
      the new block group not being to the new device, because we have the write
      duplication setup done through btrfs_map_block().
      
      Fixes: d04fbe19 ("btrfs: scrub: cleanup the argument list of scrub_chunk()")
      CC: stable@vger.kernel.org # 5.17
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a692e13d
  3. 19 Apr, 2022 4 commits
  4. 05 Apr, 2022 9 commits
  5. 24 Mar, 2022 5 commits
    • Kaiwen Hu's avatar
      btrfs: prevent subvol with swapfile from being deleted · 60021bd7
      Kaiwen Hu authored
      A subvolume with an active swapfile must not be deleted otherwise it
      would not be possible to deactivate it.
      
      After the subvolume is deleted, we cannot swapoff the swapfile in this
      deleted subvolume because the path is unreachable.  The swapfile is
      still active and holding references, the filesystem cannot be unmounted.
      
      The test looks like this:
      
        mkfs.btrfs -f $dev > /dev/null
        mount $dev $mnt
      
        btrfs sub create $mnt/subvol
        touch $mnt/subvol/swapfile
        chmod 600 $mnt/subvol/swapfile
        chattr +C $mnt/subvol/swapfile
        dd if=/dev/zero of=$mnt/subvol/swapfile bs=1K count=4096
        mkswap $mnt/subvol/swapfile
        swapon $mnt/subvol/swapfile
      
        btrfs sub delete $mnt/subvol
        swapoff $mnt/subvol/swapfile  # failed: No such file or directory
        swapoff --all
      
        unmount $mnt                  # target is busy.
      
      To prevent above issue, we simply check that whether the subvolume
      contains any active swapfile, and stop the deleting process.  This
      behavior is like snapshot ioctl dealing with a swapfile.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarRobbie Ko <robbieko@synology.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarKaiwen Hu <kevinhu@synology.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      60021bd7
    • Josef Bacik's avatar
      btrfs: do not warn for free space inode in cow_file_range · a7d16d9a
      Josef Bacik authored
      This is a long time leftover from when I originally added the free space
      inode, the point was to catch cases where we weren't honoring the NOCOW
      flag.  However there exists a race with relocation, if we allocate our
      free space inode in a block group that is about to be relocated, we
      could trigger the COW path before the relocation has the opportunity to
      find the extents and delete the free space cache.  In production where
      we have auto-relocation enabled we're seeing this WARN_ON_ONCE() around
      5k times in a 2 week period, so not super common but enough that it's at
      the top of our metrics.
      
      We're properly handling the error here, and with us phasing out v1 space
      cache anyway just drop the WARN_ON_ONCE.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a7d16d9a
    • Qu Wenruo's avatar
      btrfs: avoid defragging extents whose next extents are not targets · 75a36a7d
      Qu Wenruo authored
      [BUG]
      There is a report that autodefrag is defragging single sector, which
      is completely waste of IO, and no help for defragging:
      
         btrfs-cleaner-808 defrag_one_locked_range: root=256 ino=651122 start=0 len=4096
      
      [CAUSE]
      In defrag_collect_targets(), we check if the current range (A) can be merged
      with next one (B).
      
      If mergeable, we will add range A into target for defrag.
      
      However there is a catch for autodefrag, when checking mergeability
      against range B, we intentionally pass 0 as @newer_than, hoping to get a
      higher chance to merge with the next extent.
      
      But in the next iteration, range B will looked up by defrag_lookup_extent(),
      with non-zero @newer_than.
      
      And if range B is not really newer, it will rejected directly, causing
      only range A being defragged, while we expect to defrag both range A and
      B.
      
      [FIX]
      Since the root cause is the difference in check condition of
      defrag_check_next_extent() and defrag_collect_targets(), we fix it by:
      
      1. Pass @newer_than to defrag_check_next_extent()
      2. Pass @extent_thresh to defrag_check_next_extent()
      
      This makes the check between defrag_collect_targets() and
      defrag_check_next_extent() more consistent.
      
      While there is still some minor difference, the remaining checks are
      focus on runtime flags like writeback/delalloc, which are mostly
      transient and safe to be checked only in defrag_collect_targets().
      
      Link: https://github.com/btrfs/linux/issues/423#issuecomment-1066981856
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      75a36a7d
    • Darrick J. Wong's avatar
      btrfs: fix fallocate to use file_modified to update permissions consistently · 05fd9564
      Darrick J. Wong authored
      Since the initial introduction of (posix) fallocate back at the turn of
      the century, it has been possible to use this syscall to change the
      user-visible contents of files.  This can happen by extending the file
      size during a preallocation, or through any of the newer modes (punch,
      zero range).  Because the call can be used to change file contents, we
      should treat it like we do any other modification to a file -- update
      the mtime, and drop set[ug]id privileges/capabilities.
      
      The VFS function file_modified() does all this for us if pass it a
      locked inode, so let's make fallocate drop permissions correctly.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      05fd9564
    • Qu Wenruo's avatar
      btrfs: remove device item and update super block in the same transaction · bbac5869
      Qu Wenruo authored
      [BUG]
      There is a report that a btrfs has a bad super block num devices.
      
      This makes btrfs to reject the fs completely.
      
        BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here
        BTRFS error (device sdd3): failed to read chunk tree: -22
        BTRFS error (device sdd3): open_ctree failed
      
      [CAUSE]
      During btrfs device removal, chunk tree and super block num devs are
      updated in two different transactions:
      
        btrfs_rm_device()
        |- btrfs_rm_dev_item(device)
        |  |- trans = btrfs_start_transaction()
        |  |  Now we got transaction X
        |  |
        |  |- btrfs_del_item()
        |  |  Now device item is removed from chunk tree
        |  |
        |  |- btrfs_commit_transaction()
        |     Transaction X got committed, super num devs untouched,
        |     but device item removed from chunk tree.
        |     (AKA, super num devs is already incorrect)
        |
        |- cur_devices->num_devices--;
        |- cur_devices->total_devices--;
        |- btrfs_set_super_num_devices()
           All those operations are not in transaction X, thus it will
           only be written back to disk in next transaction.
      
      So after the transaction X in btrfs_rm_dev_item() committed, but before
      transaction X+1 (which can be minutes away), a power loss happen, then
      we got the super num mismatch.
      
      [FIX]
      Instead of starting and committing a transaction inside
      btrfs_rm_dev_item(), start a transaction in side btrfs_rm_device() and
      pass it to btrfs_rm_dev_item().
      
      And only commit the transaction after everything is done.
      Reported-by: default avatarLuca Béla Palkovics <luca.bela.palkovics@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46CKmp3igr44okQ@mail.gmail.com/
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bbac5869
  6. 23 Mar, 2022 3 commits
    • Ethan Lien's avatar
      btrfs: fix qgroup reserve overflow the qgroup limit · b642b52d
      Ethan Lien authored
      We use extent_changeset->bytes_changed in qgroup_reserve_data() to record
      how many bytes we set for EXTENT_QGROUP_RESERVED state. Currently the
      bytes_changed is set as "unsigned int", and it will overflow if we try to
      fallocate a range larger than 4GiB. The result is we reserve less bytes
      and eventually break the qgroup limit.
      
      Unlike regular buffered/direct write, which we use one changeset for
      each ordered extent, which can never be larger than 256M.  For
      fallocate, we use one changeset for the whole range, thus it no longer
      respects the 256M per extent limit, and caused the problem.
      
      The following example test script reproduces the problem:
      
        $ cat qgroup-overflow.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        # Set qgroup limit to 2GiB.
        btrfs quota enable $MNT
        btrfs qgroup limit 2G $MNT
      
        # Try to fallocate a 3GiB file. This should fail.
        echo
        echo "Try to fallocate a 3GiB file..."
        fallocate -l 3G $MNT/3G.file
      
        # Try to fallocate a 5GiB file.
        echo
        echo "Try to fallocate a 5GiB file..."
        fallocate -l 5G $MNT/5G.file
      
        # See we break the qgroup limit.
        echo
        sync
        btrfs qgroup show -r $MNT
      
        umount $MNT
      
      When running the test:
      
        $ ./qgroup-overflow.sh
        (...)
      
        Try to fallocate a 3GiB file...
        fallocate: fallocate failed: Disk quota exceeded
      
        Try to fallocate a 5GiB file...
      
        qgroupid         rfer         excl     max_rfer
        --------         ----         ----     --------
        0/5           5.00GiB      5.00GiB      2.00GiB
      
      Since we have no control of how bytes_changed is used, it's better to
      set it to u64.
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarEthan Lien <ethanlien@synology.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b642b52d
    • Johannes Thumshirn's avatar
      btrfs: zoned: remove left over ASSERT checking for single profile · 62ed0bf7
      Johannes Thumshirn authored
      With commit dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block
      groups") we started allowing DUP on metadata block groups, so the
      ASSERT()s in btrfs_can_activate_zone() and btrfs_zoned_get_device() are
      no longer valid and in fact even harmful.
      
      Fixes: dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block groups")
      CC: stable@vger.kernel.org # 5.17
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      62ed0bf7
    • Johannes Thumshirn's avatar
      btrfs: zoned: traverse devices under chunk_mutex in btrfs_can_activate_zone · 0b9e6676
      Johannes Thumshirn authored
      btrfs_can_activate_zone() can be called with the device_list_mutex already
      held, which will lead to a deadlock:
      
      insert_dev_extents() // Takes device_list_mutex
      `-> insert_dev_extent()
       `-> btrfs_insert_empty_item()
        `-> btrfs_insert_empty_items()
         `-> btrfs_search_slot()
          `-> btrfs_cow_block()
           `-> __btrfs_cow_block()
            `-> btrfs_alloc_tree_block()
             `-> btrfs_reserve_extent()
              `-> find_free_extent()
               `-> find_free_extent_update_loop()
                `-> can_allocate_chunk()
                 `-> btrfs_can_activate_zone() // Takes device_list_mutex again
      
      Instead of using the RCU on fs_devices->device_list we
      can use fs_devices->alloc_list, protected by the chunk_mutex to traverse
      the list of active devices.
      
      We are in the chunk allocation thread. The newer chunk allocation
      happens from the devices in the fs_device->alloc_list protected by the
      chunk_mutex.
      
        btrfs_create_chunk()
          lockdep_assert_held(&info->chunk_mutex);
          gather_device_info
            list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
      
      Also, a device that reappears after the mount won't join the alloc_list
      yet and, it will be in the dev_list, which we don't want to consider in
      the context of the chunk alloc.
      
        [15.166572] WARNING: possible recursive locking detected
        [15.167117] 5.17.0-rc6-dennis #79 Not tainted
        [15.167487] --------------------------------------------
        [15.167733] kworker/u8:3/146 is trying to acquire lock:
        [15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: find_free_extent+0x15a/0x14f0 [btrfs]
        [15.167733]
        [15.167733] but task is already holding lock:
        [15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs]
        [15.167733]
        [15.167733] other info that might help us debug this:
        [15.167733]  Possible unsafe locking scenario:
        [15.167733]
        [15.171834]        CPU0
        [15.171834]        ----
        [15.171834]   lock(&fs_devs->device_list_mutex);
        [15.171834]   lock(&fs_devs->device_list_mutex);
        [15.171834]
        [15.171834]  *** DEADLOCK ***
        [15.171834]
        [15.171834]  May be due to missing lock nesting notation
        [15.171834]
        [15.171834] 5 locks held by kworker/u8:3/146:
        [15.171834]  #0: ffff888100050938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0
        [15.171834]  #1: ffffc9000067be80 ((work_completion)(&fs_info->async_data_reclaim_work)){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0
        [15.176244]  #2: ffff88810521e620 (sb_internal){.+.+}-{0:0}, at: flush_space+0x335/0x600 [btrfs]
        [15.176244]  #3: ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs]
        [15.176244]  #4: ffff8881152e4b78 (btrfs-dev-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x130 [btrfs]
        [15.179641]
        [15.179641] stack backtrace:
        [15.179641] CPU: 1 PID: 146 Comm: kworker/u8:3 Not tainted 5.17.0-rc6-dennis #79
        [15.179641] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
        [15.179641] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
        [15.179641] Call Trace:
        [15.179641]  <TASK>
        [15.179641]  dump_stack_lvl+0x45/0x59
        [15.179641]  __lock_acquire.cold+0x217/0x2b2
        [15.179641]  lock_acquire+0xbf/0x2b0
        [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
        [15.183838]  __mutex_lock+0x8e/0x970
        [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
        [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
        [15.183838]  ? lock_is_held_type+0xd7/0x130
        [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
        [15.183838]  find_free_extent+0x15a/0x14f0 [btrfs]
        [15.183838]  ? _raw_spin_unlock+0x24/0x40
        [15.183838]  ? btrfs_get_alloc_profile+0x106/0x230 [btrfs]
        [15.187601]  btrfs_reserve_extent+0x131/0x260 [btrfs]
        [15.187601]  btrfs_alloc_tree_block+0xb5/0x3b0 [btrfs]
        [15.187601]  __btrfs_cow_block+0x138/0x600 [btrfs]
        [15.187601]  btrfs_cow_block+0x10f/0x230 [btrfs]
        [15.187601]  btrfs_search_slot+0x55f/0xbc0 [btrfs]
        [15.187601]  ? lock_is_held_type+0xd7/0x130
        [15.187601]  btrfs_insert_empty_items+0x2d/0x60 [btrfs]
        [15.187601]  btrfs_create_pending_block_groups+0x2b3/0x560 [btrfs]
        [15.187601]  __btrfs_end_transaction+0x36/0x2a0 [btrfs]
        [15.192037]  flush_space+0x374/0x600 [btrfs]
        [15.192037]  ? find_held_lock+0x2b/0x80
        [15.192037]  ? btrfs_async_reclaim_data_space+0x49/0x180 [btrfs]
        [15.192037]  ? lock_release+0x131/0x2b0
        [15.192037]  btrfs_async_reclaim_data_space+0x70/0x180 [btrfs]
        [15.192037]  process_one_work+0x24c/0x5a0
        [15.192037]  worker_thread+0x4a/0x3d0
      
      Fixes: a85f05e5 ("btrfs: zoned: avoid chunk allocation if active block group has enough space")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0b9e6676
  7. 14 Mar, 2022 12 commits