1. 12 Jan, 2021 2 commits
    • btrfs: send: fix invalid clone operations when cloning from the same file and root · 518837e6
      Filipe Manana authored
      When an incremental send finds an extent that is shared, it checks which
      file extent items in the range refer to that extent, and for those it
      emits clone operations, while for others it emits regular write operations
      to avoid corruption at the destination (as described and fixed by commit
      d906d49f ("Btrfs: send, fix file corruption due to incorrect cloning
      operations")).
      
      However when the root we are cloning from is the send root, we are cloning
      from the inode currently being processed and the source file range has
      several extent items that partially point to the desired extent, with an
      offset smaller than the offset in the file extent item for the range we
      want to clone into, it can cause the algorithm to issue a clone operation
      that starts at the current eof of the file being processed in the receiver
      side, in which case the receiver will fail, with EINVAL, when attempting
      to execute the clone operation.
      
      Example reproducer:
      
        $ cat test-send-clone.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        mkfs.btrfs -f $DEV >/dev/null
        mount $DEV $MNT
      
        # Create our test file with a single and large extent (1M) and with
        # different content for different file ranges that will be reflinked
        # later.
        xfs_io -f \
               -c "pwrite -S 0xab 0 128K" \
               -c "pwrite -S 0xcd 128K 128K" \
               -c "pwrite -S 0xef 256K 256K" \
               -c "pwrite -S 0x1a 512K 512K" \
               $MNT/foobar
      
        btrfs subvolume snapshot -r $MNT $MNT/snap1
        btrfs send -f /tmp/snap1.send $MNT/snap1
      
        # Now do a series of changes to our file such that we end up with
        # different parts of the extent reflinked into different file offsets
        # and we overwrite a large part of the extent too, so no file extent
        # items refer to that part that was overwritten. This used to confuse
        # the algorithm used by the kernel to figure out which file ranges to
        # clone, making it attempt to clone from a source range starting at
        # the current eof of the file, causing the receiver to fail since
        # it is an invalid clone operation.
        #
        xfs_io -c "reflink $MNT/foobar 64K 1M 960K" \
               -c "reflink $MNT/foobar 0K 512K 256K" \
               -c "reflink $MNT/foobar 512K 128K 256K" \
               -c "pwrite -S 0x73 384K 640K" \
               $MNT/foobar
      
        btrfs subvolume snapshot -r $MNT $MNT/snap2
        btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2
      
        echo -e "\nFile digest in the original filesystem:"
        md5sum $MNT/snap2/foobar
      
        # Now unmount the filesystem, create a new one, mount it and try to
        # apply both send streams to recreate both snapshots.
        umount $DEV
      
        mkfs.btrfs -f $DEV >/dev/null
        mount $DEV $MNT
      
        btrfs receive -f /tmp/snap1.send $MNT
        btrfs receive -f /tmp/snap2.send $MNT
      
        # Must match what we got in the original filesystem of course.
        echo -e "\nFile digest in the new filesystem:"
        md5sum $MNT/snap2/foobar
      
        umount $MNT
      
      When running the reproducer, the incremental send operation fails due to
      an invalid clone operation:
      
        $ ./test-send-clone.sh
        wrote 131072/131072 bytes at offset 0
        128 KiB, 32 ops; 0.0015 sec (80.906 MiB/sec and 20711.9741 ops/sec)
        wrote 131072/131072 bytes at offset 131072
        128 KiB, 32 ops; 0.0013 sec (90.514 MiB/sec and 23171.6148 ops/sec)
        wrote 262144/262144 bytes at offset 262144
        256 KiB, 64 ops; 0.0025 sec (98.270 MiB/sec and 25157.2327 ops/sec)
        wrote 524288/524288 bytes at offset 524288
        512 KiB, 128 ops; 0.0052 sec (95.730 MiB/sec and 24506.9883 ops/sec)
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        linked 983040/983040 bytes at offset 1048576
        960 KiB, 1 ops; 0.0006 sec (1.419 GiB/sec and 1550.3876 ops/sec)
        linked 262144/262144 bytes at offset 524288
        256 KiB, 1 ops; 0.0020 sec (120.192 MiB/sec and 480.7692 ops/sec)
        linked 262144/262144 bytes at offset 131072
        256 KiB, 1 ops; 0.0018 sec (133.833 MiB/sec and 535.3319 ops/sec)
        wrote 655360/655360 bytes at offset 393216
        640 KiB, 160 ops; 0.0093 sec (66.781 MiB/sec and 17095.8436 ops/sec)
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
      
        File digest in the original filesystem:
        9c13c61cb0b9f5abf45344375cb04dfa  /mnt/sdi/snap2/foobar
        At subvol snap1
        At snapshot snap2
        ERROR: failed to clone extents to foobar: Invalid argument
      
        File digest in the new filesystem:
        132f0396da8f48d2e667196bff882cfc  /mnt/sdi/snap2/foobar
      
      The clone operation is invalid because its source range starts at the
      current eof of the file in the receiver, causing the receiver to get
      an EINVAL error from the clone operation when attempting it.
      
      For the example above, what happens is the following:
      
      1) When processing the extent at file offset 1M, the algorithm checks that
         the extent is shared and can be (fully or partially) found at file
         offset 0.
      
         At this point the file has a size (and eof) of 1M at the receiver;
      
      2) It finds that our extent item at file offset 1M has a data offset of
         64K and, since the file extent item at file offset 0 has a data offset
         of 0, it issues a clone operation, from the same file and root, that
         has a source range offset of 64K, destination offset of 1M and a length
         of 64K, since the extent item at file offset 0 refers only to the first
         128K of the shared extent.
      
         After this clone operation, the file size (and eof) at the receiver is
         increased from 1M to 1088K (1M + 64K);
      
      3) Now there's still 896K (960K - 64K) of data left to clone or write, so
         it checks for the next file extent item, which starts at file offset
         128K. This file extent item has a data offset of 0 and a length of
         256K, so a clone operation with a source range offset of 256K, a
         destination offset of 1088K (1M + 64K) and length of 128K is issued.
      
         After this operation the file size (and eof) at the receiver increases
         from 1088K to 1216K (1088K + 128K);
      
      4) Now there's still 768K (896K - 128K) of data left to clone or write, so
         it checks for the next file extent item, located at file offset 384K.
         This file extent item points to a different extent, not the one we want
         to clone, with a length of 640K. So we issue a write operation at
         file offset 1216K (1088K + 128K, the end of the last clone operation),
         with a length of 640K and with data matching what we find for that
         range in the send root.
      
         After this operation, the file size (and eof) at the receiver increases
         from 1216K to 1856K (1216K + 640K);
      
      5) Now there's still 128K (768K - 640K) of data left to clone or write, so
         we look at the next file extent item, which is at file offset 1M and
         points to the extent we want to clone, with a data offset of 64K and a
         length of 960K.
      
         However this matches the file offset we started with, the start of the
         range to clone into. So from here onwards we certainly cannot find any
         file extent item with the rest of the data we want to clone, yet we
         proceed and since the file extent item points to the shared extent,
         with a data offset of 64K, we issue a clone operation with a source
         range starting at file offset 1856K, which matches the file extent
         item's offset, 1M, plus the amount of data cloned and written so far,
         which is 64K (step 2) + 128K (step 3) + 640K (step 4). This clone
         operation is invalid since the source range offset matches the current
         eof of the file in the receiver. We should have stopped looking for
         extents to clone at this point and instead fallen back to a write,
         which would simply contain the data in the file range from 1856K to
         1856K + 128K.
      
      So fix this by stopping the loop that looks for file ranges to clone at
      clone_range() when we reach the current eof of the file being processed,
      if we are cloning from the same file and using the send root as the clone
      root. This ensures any data not yet cloned will be sent to the receiver
      through a write operation.
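
The walkthrough and the fix can be replayed with a small Python model. This is only a sketch, not the kernel's clone_range(): the `Op` record, `replay()` and `apply_fix()` are hypothetical names, and the rule that a clone fails when its source starts at or past the current eof is taken from the description above (the "past" case is an assumption).

```python
from dataclasses import dataclass

K = 1024


@dataclass
class Op:
    kind: str       # "clone" or "write"
    dst: int        # destination file offset
    length: int
    src: int = 0    # source file offset (clones only)


def replay(ops, eof):
    """Apply ops to a modelled receiver file.

    Returns (eof, failed_op); failed_op is None when every op succeeded.
    """
    for op in ops:
        # Rule from the commit message: a clone whose source range starts at
        # the current eof fails with EINVAL (past eof is assumed to fail too).
        if op.kind == "clone" and op.src >= eof:
            return eof, op
        eof = max(eof, op.dst + op.length)
    return eof, None


def apply_fix(ops, eof):
    """Model of the fix: once a clone source reaches eof, stop cloning and
    send that range, and any later clone, as a write instead."""
    fixed, stopped = [], False
    for op in ops:
        if op.kind == "clone" and (stopped or op.src >= eof):
            stopped = True
            op = Op("write", op.dst, op.length)
        fixed.append(op)
        eof = max(eof, op.dst + op.length)
    return fixed


# Steps 2 to 5 of the walkthrough; the receiver's file starts with eof = 1M.
walkthrough = [
    Op("clone", dst=1024 * K, length=64 * K, src=64 * K),
    Op("clone", dst=1088 * K, length=128 * K, src=256 * K),
    Op("write", dst=1216 * K, length=640 * K),
    Op("clone", dst=1856 * K, length=128 * K, src=1856 * K),
]

eof, failed = replay(walkthrough, 1024 * K)
print("unfixed: eof", eof // K, "K, failed op:", failed)

eof, failed = replay(apply_fix(walkthrough, 1024 * K), 1024 * K)
print("fixed:   eof", eof // K, "K, failed op:", failed)
```

In this model the unfixed sequence stops at the last clone with eof at 1856K, while the fixed sequence finishes with eof at 1984K (1M + 960K) because the final 128K is sent as a write.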
      
      A test case for fstests will follow soon.

      Reported-by: Massimo B. <massimo.b@gmx.net>
      Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
      Fixes: 11f2069c ("Btrfs: send, allow clone operations within the same file")
      CC: stable@vger.kernel.org # 5.5+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: no need to run delayed refs after commit_fs_roots during commit · 14ff8e19
      David Sterba authored
      The inode number cache has been removed in this dev cycle, but there's one
      more leftover. We don't need to run the delayed refs again after
      commit_fs_roots as stated in the comment, because btrfs_save_ino_cache
      is no more since 5297199a ("btrfs: remove inode number cache
      feature").
      
      Nothing else between commit_fs_roots and btrfs_qgroup_account_extents
      could create new delayed refs so the qgroup consistency should be safe.

      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 08 Jan, 2021 1 commit
    • btrfs: shrink delalloc pages instead of full inodes · e076ab2a
      Josef Bacik authored
      Commit 38d715f4 ("btrfs: use btrfs_start_delalloc_roots in
      shrink_delalloc") cleaned up how we do delalloc shrinking by utilizing
      some infrastructure we have in place to flush inodes that we use for
      device replace and snapshot.  However this introduced a pretty serious
      performance regression.  To reproduce, the user untarred the source
      tarball of Firefox (360MiB xz compressed/1.5GiB uncompressed), and saw
      it take anywhere from 5 to 20 times as long to untar in 5.10
      compared to 5.9. This was observed on fast devices (SSD and better) and
      not on HDD.
      
      The root cause is that previously we would generally use the normal
      writeback path to reclaim delalloc space, and for this we would provide
      it with the number of pages we wanted to flush.  The referenced commit
      changed this to flush that many inodes, which drastically increased the
      amount of space we were flushing in certain cases, which severely
      affected performance.
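
The difference between the two limits can be illustrated with a toy Python model. It is a sketch only: `flush_by_inodes()` and `flush_by_pages()` are hypothetical names, the per-inode dirty page counts are invented for illustration, and nothing here models the real writeback path.

```python
def flush_by_inodes(dirty_pages, nr):
    """Flush every dirty page of the first `nr` inodes; return pages written."""
    return sum(dirty_pages[:nr])


def flush_by_pages(dirty_pages, nr):
    """Flush inodes in order, but stop once `nr` pages have been written."""
    written = 0
    for pages in dirty_pages:
        if written >= nr:
            break
        written += min(pages, nr - written)
    return written


# Invented workload: 64 inodes with 1000 dirty pages each, and a request
# to reclaim 32 units.
dirty = [1000] * 64
print("treating 32 as inodes:", flush_by_inodes(dirty, 32), "pages written")
print("treating 32 as pages: ", flush_by_pages(dirty, 32), "pages written")
```

With these made-up numbers, interpreting the same request as an inode count writes 32000 pages instead of 32, which is the kind of over-flushing the paragraph above describes.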
      
      We cannot revert this patch unfortunately because of 3d45f221
      ("btrfs: fix deadlock when cloning inline extent and low on free
      metadata space") which requires the ability to skip flushing inodes that
      are being cloned in certain scenarios, which means we need to keep using
      our flushing infrastructure or risk re-introducing the deadlock.
      
      Instead, to fix this problem, we can go back to providing
      btrfs_start_delalloc_roots with a number of pages to flush, and then set
      up a writeback_control and utilize sync_inode() to handle the flushing
      for us.  This gives us the same behavior we had prior to the fix, while
      still allowing us to avoid the deadlock that was fixed by Filipe.  I
      redid the user's original test and got the following results on one of
      our test machines (256GiB of RAM, 56 cores, 2TiB Intel NVMe drive):
      
        5.9               0m54.258s
        5.10              1m26.212s
        5.10+patch        0m38.800s
      
      5.10+patch is significantly faster than plain 5.9 because of my patch
      series "Change data reservations to use the ticketing infra" which
      contained the patch that introduced the regression, but generally
      improved the overall ENOSPC flushing mechanisms.
      
      Additional testing on a consumer-grade SSD (8GiB RAM, 8 CPUs) confirms
      the results:

        5.10.5            4m00s
        5.10.5+patch      1m08s
        5.11-rc2          5m14s
        5.11-rc2+patch    1m30s

      Reported-by: René Rebe <rene@exactcode.de>
      Fixes: 38d715f4 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc")
      CC: stable@vger.kernel.org # 5.10
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Tested-by: David Sterba <dsterba@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add my test results ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 07 Jan, 2021 4 commits
    • btrfs: reloc: fix wrong file extent type check to avoid false ENOENT · 50e31ef4
      Qu Wenruo authored
      [BUG]
      There are several bug reports about recent kernels being unable to
      relocate certain data block groups.
      
      Sometimes the error just goes away, but there is one reporter who can
      reproduce it reliably.
      
      The dmesg would look like:
      
        [438.260483] BTRFS info (device dm-10): balance: start -dvrange=34625344765952..34625344765953
        [438.269018] BTRFS info (device dm-10): relocating block group 34625344765952 flags data|raid1
        [450.439609] BTRFS info (device dm-10): found 167 extents, stage: move data extents
        [463.501781] BTRFS info (device dm-10): balance: ended with status: -2
      
      [CAUSE]
      The ENOENT error is returned from the following call chain:
      
        add_data_references()
        |- delete_v1_space_cache();
           |- if (!found)
                  return -ENOENT;
      
      The variable @found is set to true if we find a data extent whose
      disk bytenr matches parameter @data_bytenr.
      
      With extra debugging, the offending tree block looks like this:
      
        leaf bytenr = 42676709441536, data_bytenr = 34626327621632
      
                      ctime 1567904822.739884119 (2019-09-08 03:07:02)
                      mtime 0.0 (1970-01-01 01:00:00)
                      otime 0.0 (1970-01-01 01:00:00)
              item 27 key (51933 EXTENT_DATA 0) itemoff 9854 itemsize 53
                      generation 1517381 type 2 (prealloc)
                      prealloc data disk byte 34626327621632 nr 262144 <<<
                      prealloc data offset 0 nr 262144
              item 28 key (52262 ROOT_ITEM 0) itemoff 9415 itemsize 439
                      generation 2618893 root_dirid 256 bytenr 42677048360960 level 3 refs 1
                      lastsnap 2618893 byte_limit 0 bytes_used 5557338112 flags 0x0(none)
                      uuid d0d4361f-d231-6d40-8901-fe506e4b2b53
      
      Although item 27 has disk bytenr 34626327621632, which matches the
      data_bytenr, its type is prealloc, not reg.
      This makes the existing code skip that item, and return ENOENT.
      
      [FIX]
      The code was modified in commit 19b546d7 ("btrfs: relocation: Use
      btrfs_find_all_leafs to locate data extent parent tree leaves"). Before
      that commit, we used something like

        "if (type == BTRFS_FILE_EXTENT_INLINE) continue;"

      But in that offending commit, the check became
      (type == BTRFS_FILE_EXTENT_REG), ignoring BTRFS_FILE_EXTENT_PREALLOC.
      
      Fix it by also checking BTRFS_FILE_EXTENT_PREALLOC.
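
The effect of the type check can be modelled in a few lines of Python. This is a sketch under assumptions, not the kernel's delete_v1_space_cache(): `found_buggy()` and `found_fixed()` are hypothetical names, and a leaf is reduced to a list of (type, disk_bytenr) pairs. The enum values mirror the "type 2 (prealloc)" shown in the dump above.

```python
from enum import Enum


class ExtentType(Enum):
    INLINE = 0
    REG = 1
    PREALLOC = 2


def found_buggy(items, data_bytenr):
    """Check as described for the offending commit: only REG items count."""
    return any(t is ExtentType.REG and bytenr == data_bytenr
               for t, bytenr in items)


def found_fixed(items, data_bytenr):
    """Check after the fix: REG and PREALLOC items both count."""
    wanted = (ExtentType.REG, ExtentType.PREALLOC)
    return any(t in wanted and bytenr == data_bytenr
               for t, bytenr in items)


# Item 27 from the dump: a prealloc extent at the bytenr being relocated.
# Inline extents carry no disk bytenr, modelled here as None.
leaf = [(ExtentType.INLINE, None), (ExtentType.PREALLOC, 34626327621632)]
data_bytenr = 34626327621632

print("buggy check finds extent:", found_buggy(leaf, data_bytenr))
print("fixed check finds extent:", found_fixed(leaf, data_bytenr))
```

In the model, the buggy check misses the prealloc item, which corresponds to the false ENOENT, while the fixed check finds it.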

      Reported-by: Stéphane Lesimple <stephane_btrfs2@lesimple.fr>
      Link: https://lore.kernel.org/linux-btrfs/505cabfa88575ed6dbe7cb922d8914fb@lesimple.fr
      Fixes: 19b546d7 ("btrfs: relocation: Use btrfs_find_all_leafs to locate data extent parent tree leaves")
      CC: stable@vger.kernel.org # 5.6+
      Tested-by: Stéphane Lesimple <stephane_btrfs2@lesimple.fr>
      Reviewed-by: Su Yue <l@damenly.su>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tree-checker: check if chunk item end overflows · 347fb0cf
      Su Yue authored
      While mounting a crafted image provided by a user, the kernel panics due to
      an invalid chunk item whose end is less than its start.
      
        [66.387422] loop: module loaded
        [66.389773] loop0: detected capacity change from 262144 to 0
        [66.427708] BTRFS: device fsid a62e00e8-e94e-4200-8217-12444de93c2e devid 1 transid 12 /dev/loop0 scanned by mount (613)
        [66.431061] BTRFS info (device loop0): disk space caching is enabled
        [66.431078] BTRFS info (device loop0): has skinny extents
        [66.437101] BTRFS error: insert state: end < start 29360127 37748736
        [66.437136] ------------[ cut here ]------------
        [66.437140] WARNING: CPU: 16 PID: 613 at fs/btrfs/extent_io.c:557 insert_state.cold+0x1a/0x46 [btrfs]
        [66.437369] CPU: 16 PID: 613 Comm: mount Tainted: G           O      5.11.0-rc1-custom #45
        [66.437374] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
        [66.437378] RIP: 0010:insert_state.cold+0x1a/0x46 [btrfs]
        [66.437420] RSP: 0018:ffff93e5414c3908 EFLAGS: 00010286
        [66.437427] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
        [66.437431] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
        [66.437434] RBP: ffff93e5414c3938 R08: 0000000000000001 R09: 0000000000000001
        [66.437438] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d72aa0
        [66.437441] R13: ffff8ec78bc71628 R14: 0000000000000000 R15: 0000000002400000
        [66.437447] FS:  00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
        [66.437451] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [66.437455] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
        [66.437460] PKRU: 55555554
        [66.437464] Call Trace:
        [66.437475]  set_extent_bit+0x652/0x740 [btrfs]
        [66.437539]  set_extent_bits_nowait+0x1d/0x20 [btrfs]
        [66.437576]  add_extent_mapping+0x1e0/0x2f0 [btrfs]
        [66.437621]  read_one_chunk+0x33c/0x420 [btrfs]
        [66.437674]  btrfs_read_chunk_tree+0x6a4/0x870 [btrfs]
        [66.437708]  ? kvm_sched_clock_read+0x18/0x40
        [66.437739]  open_ctree+0xb32/0x1734 [btrfs]
        [66.437781]  ? bdi_register_va+0x1b/0x20
        [66.437788]  ? super_setup_bdi_name+0x79/0xd0
        [66.437810]  btrfs_mount_root.cold+0x12/0xeb [btrfs]
        [66.437854]  ? __kmalloc_track_caller+0x217/0x3b0
        [66.437873]  legacy_get_tree+0x34/0x60
        [66.437880]  vfs_get_tree+0x2d/0xc0
        [66.437888]  vfs_kern_mount.part.0+0x78/0xc0
        [66.437897]  vfs_kern_mount+0x13/0x20
        [66.437902]  btrfs_mount+0x11f/0x3c0 [btrfs]
        [66.437940]  ? kfree+0x5ff/0x670
        [66.437944]  ? __kmalloc_track_caller+0x217/0x3b0
        [66.437962]  legacy_get_tree+0x34/0x60
        [66.437974]  vfs_get_tree+0x2d/0xc0
        [66.437983]  path_mount+0x48c/0xd30
        [66.437998]  __x64_sys_mount+0x108/0x140
        [66.438011]  do_syscall_64+0x38/0x50
        [66.438018]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [66.438023] RIP: 0033:0x7f0138827f6e
        [66.438033] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
        [66.438040] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e
        [66.438044] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0
        [66.438047] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001
        [66.438050] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
        [66.438054] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440
        [66.438078] irq event stamp: 18169
        [66.438082] hardirqs last  enabled at (18175): [<ffffffffb81154bf>] console_unlock+0x4ff/0x5f0
        [66.438088] hardirqs last disabled at (18180): [<ffffffffb8115427>] console_unlock+0x467/0x5f0
        [66.438092] softirqs last  enabled at (16910): [<ffffffffb8a00fe2>] asm_call_irq_on_stack+0x12/0x20
        [66.438097] softirqs last disabled at (16905): [<ffffffffb8a00fe2>] asm_call_irq_on_stack+0x12/0x20
        [66.438103] ---[ end trace e114b111db64298b ]---
        [66.438107] BTRFS error: found node 12582912 29360127 on insert of 37748736 29360127
        [66.438127] BTRFS critical: panic in extent_io_tree_panic:679: locking error: extent tree was modified by another thread while locked (errno=-17 Object already exists)
        [66.441069] ------------[ cut here ]------------
        [66.441072] kernel BUG at fs/btrfs/extent_io.c:679!
        [66.442064] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        [66.443018] CPU: 16 PID: 613 Comm: mount Tainted: G        W  O      5.11.0-rc1-custom #45
        [66.444538] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
        [66.446223] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs]
        [66.450878] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246
        [66.451840] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
        [66.453141] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
        [66.454445] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001
        [66.455743] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0
        [66.457055] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000
        [66.458356] FS:  00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
        [66.459841] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [66.460895] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
        [66.462196] PKRU: 55555554
        [66.462692] Call Trace:
        [66.463139]  set_extent_bit.cold+0x30/0x98 [btrfs]
        [66.464049]  set_extent_bits_nowait+0x1d/0x20 [btrfs]
        [66.490466]  add_extent_mapping+0x1e0/0x2f0 [btrfs]
        [66.514097]  read_one_chunk+0x33c/0x420 [btrfs]
        [66.534976]  btrfs_read_chunk_tree+0x6a4/0x870 [btrfs]
        [66.555718]  ? kvm_sched_clock_read+0x18/0x40
        [66.575758]  open_ctree+0xb32/0x1734 [btrfs]
        [66.595272]  ? bdi_register_va+0x1b/0x20
        [66.614638]  ? super_setup_bdi_name+0x79/0xd0
        [66.633809]  btrfs_mount_root.cold+0x12/0xeb [btrfs]
        [66.652938]  ? __kmalloc_track_caller+0x217/0x3b0
        [66.671925]  legacy_get_tree+0x34/0x60
        [66.690300]  vfs_get_tree+0x2d/0xc0
        [66.708221]  vfs_kern_mount.part.0+0x78/0xc0
        [66.725808]  vfs_kern_mount+0x13/0x20
        [66.742730]  btrfs_mount+0x11f/0x3c0 [btrfs]
        [66.759350]  ? kfree+0x5ff/0x670
        [66.775441]  ? __kmalloc_track_caller+0x217/0x3b0
        [66.791750]  legacy_get_tree+0x34/0x60
        [66.807494]  vfs_get_tree+0x2d/0xc0
        [66.823349]  path_mount+0x48c/0xd30
        [66.838753]  __x64_sys_mount+0x108/0x140
        [66.854412]  do_syscall_64+0x38/0x50
        [66.869673]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [66.885093] RIP: 0033:0x7f0138827f6e
        [66.945613] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
        [66.977214] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e
        [66.994266] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0
        [67.011544] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001
        [67.028836] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
        [67.045812] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440
        [67.216138] ---[ end trace e114b111db64298c ]---
        [67.237089] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs]
        [67.325317] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246
        [67.347946] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
        [67.371343] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
        [67.394757] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001
        [67.418409] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0
        [67.441906] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000
        [67.465436] FS:  00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
        [67.511660] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [67.535047] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
        [67.558449] PKRU: 55555554
        [67.581146] note: mount[613] exited with preempt_count 2
      
      The image has a chunk item which has a logical start of 37748736 and a
      length of 18446744073701163008 (-8M). The calculated end, 29360127,
      overflows. EEXIST was caught by insert_state() because of the duplicate
      end and extent_io_tree_panic() was called.

      Add an overflow check of the chunk item end to the tree checker so it can
      be detected early at mount time.
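
The wrap-around can be reproduced with plain integer arithmetic. The Python sketch below masks to 64 bits to imitate unsigned arithmetic in C; `chunk_end()` and `chunk_end_overflows()` are hypothetical helper names and do not correspond to the tree checker's actual code.

```python
U64_MAX = (1 << 64) - 1


def chunk_end(start, length):
    """Inclusive end of a chunk, wrapping like unsigned 64-bit arithmetic."""
    return (start + length - 1) & U64_MAX


def chunk_end_overflows(start, length):
    """True when start + length does not fit in 64 bits."""
    return start + length > U64_MAX


# Values from the crafted image described above.
start = 37748736
length = 18446744073701163008   # 2**64 - 8M

end = chunk_end(start, length)
print("end:", end, "end < start:", end < start)
print("overflow detected:", chunk_end_overflows(start, length))
```

The computed end is 29360127, matching the "end < start 29360127 37748736" message in the log.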
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Su Yue <l@damenly.su>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: prevent NULL pointer dereference in extent_io_tree_panic · 29b665cc
      Su Yue authored
      Some extent io trees are initialized with a NULL private member (e.g.
      btrfs_device::alloc_state and btrfs_fs_info::excluded_extents).
      Dereferencing a NULL tree->private as an inode pointer will cause a panic.
      
      Pass tree->fs_info as it's known to be valid in all cases.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
      Fixes: 05912a3c ("btrfs: drop extent_io_ops::tree_fs_info callback")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Su Yue <l@damenly.su>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: print the actual offset in btrfs_root_name · 71008734
      Josef Bacik authored
      We're supposed to print the root_key.offset in btrfs_root_name in the
      case of a reloc root, not the objectid.  Fix this helper to take the key
      so we have access to the offset when we need it.
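
Why the helper needs the whole key can be shown with a small Python model. Everything here is an assumption made for illustration: the `Key` tuple, the `root_name()` function, the output format, and the numeric constants, which are meant to mirror BTRFS_TREE_RELOC_OBJECTID (-8 as an unsigned 64-bit value) and the root item key type.

```python
from collections import namedtuple

Key = namedtuple("Key", "objectid type offset")

# Assumed constants, not taken from the commit message.
TREE_RELOC_OBJECTID = (1 << 64) - 8
ROOT_ITEM_KEY = 132


def root_name(key):
    """Return a printable name for a root, given its full key."""
    if key.objectid == TREE_RELOC_OBJECTID:
        # Every reloc root shares the same objectid, so only the offset
        # tells them apart.
        return f"TREE_RELOC offset={key.offset}"
    return str(key.objectid)


print(root_name(Key(TREE_RELOC_OBJECTID, ROOT_ITEM_KEY, 257)))
print(root_name(Key(5, ROOT_ITEM_KEY, 0)))
```

A helper that only received the objectid could not print a distinguishing value for a reloc root, which is why the key is passed in this sketch.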
      
      Fixes: 457f1864 ("btrfs: pretty print leaked root name")
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 18 Dec, 2020 13 commits
    • btrfs: run delayed iputs when remounting RO to avoid leaking them · a8cc263e
      Filipe Manana authored
      When remounting RO, after setting the superblock with the RO flag, the
      cleaner task will start sleeping and do nothing, since the call to
      btrfs_need_cleaner_sleep() keeps returning 'true'. However, when the
      cleaner task goes to sleep, the list of delayed iputs may not be empty.
      
      As long as we are in RO mode, the cleaner task will keep sleeping and
      never run the delayed iputs. This means that if a filesystem unmount
      is started, we get into close_ctree() with a non-empty list of delayed
      iputs, and because the filesystem is in RO mode and is not in an error
      state (or a transaction aborted), btrfs_error_commit_super() and
      btrfs_commit_super(), which run the delayed iputs, are never called,
      and later we fail the assertion that checks if the delayed iputs list
      is empty:
      
        assertion failed: list_empty(&fs_info->delayed_iputs), in fs/btrfs/disk-io.c:4049
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.h:3153!
        invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        CPU: 1 PID: 3780621 Comm: umount Tainted: G             L    5.6.0-rc2-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        RIP: 0010:assertfail.constprop.0+0x18/0x26 [btrfs]
        Code: 8b 7b 58 48 85 ff 74 (...)
        RSP: 0018:ffffb748c89bbdf8 EFLAGS: 00010246
        RAX: 0000000000000051 RBX: ffff9608f2584000 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: ffffffff91998988 RDI: 00000000ffffffff
        RBP: ffff9608f25870d8 R08: 0000000000000000 R09: 0000000000000001
        R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffc0cbc500
        R13: ffffffff92411750 R14: 0000000000000000 R15: ffff9608f2aab250
        FS:  00007fcbfaa66c80(0000) GS:ffff960936c80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fffc2c2dd38 CR3: 0000000235e54002 CR4: 00000000003606e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         close_ctree+0x1a2/0x2e6 [btrfs]
         generic_shutdown_super+0x6c/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x31/0x70
         cleanup_mnt+0x100/0x160
         task_work_run+0x93/0xc0
         exit_to_usermode_loop+0xf9/0x100
         do_syscall_64+0x20d/0x260
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x7fcbfaca6307
        Code: eb 0b 00 f7 d8 64 89 (...)
        RSP: 002b:00007fffc2c2ed68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 0000558203b559b0 RCX: 00007fcbfaca6307
        RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000558203b55bc0
        RBP: 0000000000000000 R08: 0000000000000001 R09: 00007fffc2c2dad0
        R10: 0000558203b55bf0 R11: 0000000000000246 R12: 0000558203b55bc0
        R13: 00007fcbfadcc204 R14: 0000558203b55aa8 R15: 0000000000000000
        Modules linked in: btrfs dm_flakey dm_log_writes (...)
        ---[ end trace d44d303790049ef6 ]---
      
      So fix this by making the remount RO path run any remaining delayed iputs
      after waiting for the cleaner to become inactive.
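
The sequence can be mimicked with a toy state machine in Python. This is a sketch only: the `FsModel` class and its methods are hypothetical stand-ins for the cleaner task, the RO remount path, and the list_empty() assertion in close_ctree(), and they ignore all locking and ordering details of the real code.

```python
class FsModel:
    def __init__(self):
        self.read_only = False
        self.delayed_iputs = []

    def cleaner_tick(self):
        # Models the cleaner sleeping while the filesystem is RO, as when
        # btrfs_need_cleaner_sleep() keeps returning true.
        if self.read_only:
            return
        self.delayed_iputs.clear()

    def remount_ro(self, run_delayed_iputs):
        self.read_only = True
        if run_delayed_iputs:
            # Behaviour added by the fix in this model.
            self.delayed_iputs.clear()

    def unmount_is_clean(self):
        # Stand-in for the assertion that the delayed iputs list is empty.
        return not self.delayed_iputs


for fixed in (False, True):
    fs = FsModel()
    fs.delayed_iputs.append("inode 257")   # queued before the remount
    fs.remount_ro(run_delayed_iputs=fixed)
    fs.cleaner_tick()                      # does nothing while RO
    print("fix applied:", fixed, "-> clean unmount:", fs.unmount_is_clean())
```

Without the extra step in `remount_ro()`, the queued entry survives until unmount in this model, which corresponds to the failed assertion shown above.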

      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add assertion for empty list of transactions at late stage of umount · 0a31daa4
      Filipe Manana authored
      Add an assertion to close_ctree(), after destroying all the work queues,
      to verify we do not have any transaction still open or committing at
      that point. If we have any, it means something is seriously wrong and
      that can cause memory leaks and use-after-free problems. This is motivated
      by the previous patches that fixed bugs where we ended up leaking an open
      transaction after unmounting the filesystem.
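
A minimal userspace sketch of the kind of check described above, assuming the set of open or committing transactions is kept on a circular doubly linked list. The type `fs_model`, the list helpers and `late_umount_check()` are hypothetical stand-ins for illustration and are not the btrfs definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel-style list head (not the kernel type). */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

/* Hypothetical model of the filesystem state relevant to the check. */
struct fs_model {
    struct list_node trans_list; /* open or committing transactions */
    int workers_destroyed;       /* set once the work queues are torn down */
};

void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

int list_is_empty(const struct list_node *head)
{
    return head->next == head;
}

void list_add_tail_node(struct list_node *node, struct list_node *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

void list_del_node(struct list_node *node)
{
    node->prev->next = node->next;
    node->next->prev = node->prev;
    node->next = node;
    node->prev = node;
}

/*
 * Models the late stage of unmount: the work queues are gone, so any
 * transaction still on the list can no longer be committed.
 * Returns 1 if the list is empty, 0 if a transaction was leaked.
 */
int late_umount_check(struct fs_model *fs)
{
    fs->workers_destroyed = 1;
    return list_is_empty(&fs->trans_list);
}
```

In this sketch a caller would wrap the result in `assert(late_umount_check(&fs))`, so that a leaked transaction is reported at the point where it is detected rather than later as a memory leak or use-after-free.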
      Tested-by: Fabian Vogt <fvogt@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0a31daa4
    • btrfs: fix race between RO remount and the cleaner task · a0a1db70
      Filipe Manana authored
      When we are remounting a filesystem in RO mode we can race with the cleaner
      task and result in leaking a transaction if the filesystem is unmounted
      shortly after, before the transaction kthread had a chance to commit that
      transaction. That also results in a crash during unmount, due to a
      use-after-free, if hardware acceleration is not available for crc32c.
      
      The following sequence of steps explains how the race happens.
      
      1) The filesystem is mounted in RW mode and the cleaner task is running.
         This means that currently BTRFS_FS_CLEANER_RUNNING is set at
         fs_info->flags;
      
      2) The cleaner task is currently running, for example, delayed iputs;
      
      3) A filesystem RO remount operation starts;
      
      4) The RO remount task calls btrfs_commit_super(), which commits any
         currently open transaction, and it finishes;
      
      5) At this point the cleaner task is still running and it creates a new
         transaction by doing one of the following things:
      
         * When running the delayed iput() for an inode with a 0 link count,
           in which case at btrfs_evict_inode() we start a transaction through
           the call to evict_refill_and_join(), use it and then release its
           handle through btrfs_end_transaction();
      
         * When deleting a dead root through btrfs_clean_one_deleted_snapshot(),
           a transaction is started at btrfs_drop_snapshot() and then its handle
           is released through a call to btrfs_end_transaction_throttle();
      
         * When the remount task was still running, and before the remount task
           called btrfs_delete_unused_bgs(), the cleaner task also called
           btrfs_delete_unused_bgs() and it picked and removed one block group
           from the list of unused block groups. Before the cleaner task started
           a transaction, through btrfs_start_trans_remove_block_group() at
           btrfs_delete_unused_bgs(), the remount task had already called
           btrfs_commit_super();
      
      6) So at this point the filesystem is in RO mode and we have an open
         transaction that was started by the cleaner task;
      
      7) Shortly after, a filesystem unmount operation starts. At close_ctree()
         we stop the transaction kthread before it had a chance to commit the
         transaction, since less than 30 seconds (the default commit interval)
         have elapsed since the last transaction was committed;
      
      8) We end up calling iput() against the btree inode at close_ctree() while
         there is an open transaction, and since that transaction was used to
         update btrees by the cleaner, we have dirty pages in the btree inode
         due to COW operations on metadata extents, and therefore writeback is
         triggered for the btree inode.
      
         So btree_write_cache_pages() is invoked to flush those dirty pages
         during the final iput() on the btree inode. This results in creating a
         bio and submitting it, which makes us end up at
         btrfs_submit_metadata_bio();
      
      9) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
         that calls btrfs_wq_submit_bio(), because check_async_write() returned
         a value of 1. This value of 1 is because we did not have hardware
         acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
         set in fs_info->flags;
      
      10) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
          workqueue at fs_info->workers, which was already freed before by the
          call to btrfs_stop_all_workers() at close_ctree(). This results in an
          invalid memory access due to a use-after-free, leading to a crash.
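
The sequence of steps above can be replayed as a small, single-threaded userspace model. This is only a sketch under stated assumptions: `fs_model`, `replay()` and the other names are hypothetical, each function collapses a whole kernel code path into one state change, and the interleaving is chosen by a parameter instead of by real concurrency.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state touched by steps 1 to 10. */
struct fs_model {
    bool read_only;         /* set by the RO remount */
    bool cleaner_running;   /* models BTRFS_FS_CLEANER_RUNNING */
    bool csum_impl_fast;    /* models BTRFS_FS_CSUM_IMPL_FAST */
    int open_transactions;  /* started but not yet committed */
    bool workers_alive;     /* false once the work queues are freed */
};

enum umount_result {
    UMOUNT_CLEAN,
    UMOUNT_LEAKED_TRANS,
    UMOUNT_USE_AFTER_FREE
};

/* Step 4: commit whatever transaction is open at this moment. */
void commit_super(struct fs_model *fs)
{
    fs->open_transactions = 0;
}

/* Step 5: the cleaner starts a transaction and only releases its handle. */
void cleaner_start_transaction(struct fs_model *fs)
{
    if (fs->cleaner_running)
        fs->open_transactions++;
}

/* Steps 7 to 10: unmount shortly after the remount. */
enum umount_result umount_fs(struct fs_model *fs)
{
    /* Step 7: the transaction kthread stops without committing,
     * and the work queues are freed. */
    fs->workers_alive = false;

    if (fs->open_transactions == 0)
        return UMOUNT_CLEAN;

    /* Steps 8 to 10: dirty btree pages need writeback. Without a fast
     * crc32c implementation the bio is queued to the freed workers. */
    if (!fs->csum_impl_fast && !fs->workers_alive)
        return UMOUNT_USE_AFTER_FREE;

    return UMOUNT_LEAKED_TRANS;
}

/* Replays remount then unmount with the cleaner acting either before
 * or after the commit done by the remount task. */
enum umount_result replay(bool cleaner_after_commit, bool csum_impl_fast)
{
    struct fs_model fs = {
        .cleaner_running = true,
        .csum_impl_fast = csum_impl_fast,
        .workers_alive = true,
    };

    if (!cleaner_after_commit)
        cleaner_start_transaction(&fs);
    commit_super(&fs);      /* step 4 */
    fs.read_only = true;    /* step 6 */
    if (cleaner_after_commit)
        cleaner_start_transaction(&fs); /* step 5, the racy ordering */

    return umount_fs(&fs);
}
```

In this model `replay(true, false)` yields `UMOUNT_USE_AFTER_FREE` and `replay(true, true)` yields `UMOUNT_LEAKED_TRANS`, while any ordering in which the cleaner acts before the commit yields `UMOUNT_CLEAN`.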
      
      When this happens, before the crash there are several warnings triggered,
      since we have reserved metadata space in a block group, the delayed refs
      reservation, etc:
      
        ------------[ cut here ]------------
        WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
        Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
        CPU: 4 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
        Code: f0 01 00 00 48 39 c2 75 (...)
        RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
        RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
        RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
        RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
        R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
        FS:  00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
         close_ctree+0x2ba/0x2fa [btrfs]
         generic_shutdown_super+0x6c/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x31/0x70
         cleanup_mnt+0x100/0x160
         task_work_run+0x68/0xb0
         exit_to_user_mode_prepare+0x1bb/0x1c0
         syscall_exit_to_user_mode+0x4b/0x260
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f15ee221ee7
        Code: ff 0b 00 f7 d8 64 89 01 48 (...)
        RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
        RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
        RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
        R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
        R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace dd74718fef1ed5c6 ]---
        ------------[ cut here ]------------
        WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
        Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
        CPU: 2 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
        Code: 48 83 bb b0 03 00 00 00 (...)
        RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
        RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
        RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
        R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
        FS:  00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
         close_ctree+0x2ba/0x2fa [btrfs]
         generic_shutdown_super+0x6c/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x31/0x70
         cleanup_mnt+0x100/0x160
         task_work_run+0x68/0xb0
         exit_to_user_mode_prepare+0x1bb/0x1c0
         syscall_exit_to_user_mode+0x4b/0x260
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f15ee221ee7
        Code: ff 0b 00 f7 d8 64 89 01 (...)
        RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
        RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
        RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
        R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
        R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace dd74718fef1ed5c7 ]---
        ------------[ cut here ]------------
        WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
        Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
        CPU: 5 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
        Code: ad de 49 be 22 01 00 (...)
        RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
        RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
        RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
        R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
        FS:  00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         close_ctree+0x2ba/0x2fa [btrfs]
         generic_shutdown_super+0x6c/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x31/0x70
         cleanup_mnt+0x100/0x160
         task_work_run+0x68/0xb0
         exit_to_user_mode_prepare+0x1bb/0x1c0
         syscall_exit_to_user_mode+0x4b/0x260
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f15ee221ee7
        Code: ff 0b 00 f7 d8 64 89 (...)
        RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
        RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
        RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
        R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
        R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace dd74718fef1ed5c8 ]---
        BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
        BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
        BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
        BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
        BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
        BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
        BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0
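
The space_info figures in the output above are consistent with the free amount being the total minus the sum of the other counters. The sketch below checks that arithmetic. The struct and function names are hypothetical and only mirror the fields printed in the log.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the counters printed in the log line. */
struct space_info_model {
    uint64_t total;
    uint64_t used;
    uint64_t pinned;
    uint64_t reserved;
    uint64_t may_use;
    uint64_t readonly;
};

/* Free bytes, assuming free = total - (sum of all other counters). */
uint64_t space_info_free(const struct space_info_model *s)
{
    return s->total - (s->used + s->pinned + s->reserved +
                       s->may_use + s->readonly);
}
```

With the logged values (total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536) this gives 268238848, matching the "has 268238848 free" line. The nonzero reserved value and the nonzero delayed_refs_rsv size are the leftovers that the warnings report.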
      
      And the crash, which only happens when we do not have crc32c hardware
      acceleration, produces the following trace immediately after those
      warnings:
      
        stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        CPU: 2 PID: 1749129 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
        Code: 54 55 53 48 89 f3 (...)
        RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
        RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
        RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
        R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
        FS:  00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
         btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
         submit_one_bio+0x61/0x70 [btrfs]
         btree_write_cache_pages+0x414/0x450 [btrfs]
         ? kobject_put+0x9a/0x1d0
         ? trace_hardirqs_on+0x1b/0xf0
         ? _raw_spin_unlock_irqrestore+0x3c/0x60
         ? free_debug_processing+0x1e1/0x2b0
         do_writepages+0x43/0xe0
         ? lock_acquired+0x199/0x490
         __writeback_single_inode+0x59/0x650
         writeback_single_inode+0xaf/0x120
         write_inode_now+0x94/0xd0
         iput+0x187/0x2b0
         close_ctree+0x2c6/0x2fa [btrfs]
         generic_shutdown_super+0x6c/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x31/0x70
         cleanup_mnt+0x100/0x160
         task_work_run+0x68/0xb0
         exit_to_user_mode_prepare+0x1bb/0x1c0
         syscall_exit_to_user_mode+0x4b/0x260
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f3cfebabee7
        Code: ff 0b 00 f7 d8 64 89 01 (...)
        RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
        RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
        RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
        R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
        R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
        Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
        ---[ end trace dd74718fef1ed5cc ]---
      
      Finally when we remove the btrfs module (rmmod btrfs), there are several
      warnings about objects that were allocated from our slabs but were never
      freed, a consequence of the transaction that was never committed and got
      leaked:
      
        =============================================================================
        BUG btrfs_delayed_ref_head (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
        -----------------------------------------------------------------------------
      
        INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
        CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         slab_err+0xb7/0xdc
         ? lock_acquired+0x199/0x490
         __kmem_cache_shutdown+0x1ac/0x3c0
         ? lock_release+0x20e/0x4c0
         kmem_cache_destroy+0x55/0x120
         btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 f5 (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        INFO: Object 0x0000000050cbdd61 @offset=12104
        INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
              __slab_alloc.isra.0+0x109/0x1c0
              kmem_cache_alloc+0x7bb/0x830
              btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
              btrfs_free_tree_block+0x128/0x360 [btrfs]
              __btrfs_cow_block+0x489/0x5f0 [btrfs]
              btrfs_cow_block+0xf7/0x220 [btrfs]
              btrfs_search_slot+0x62a/0xc40 [btrfs]
              btrfs_del_orphan_item+0x65/0xd0 [btrfs]
              btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
              open_ctree+0x125a/0x18a0 [btrfs]
              btrfs_mount_root.cold+0x13/0xed [btrfs]
              legacy_get_tree+0x30/0x60
              vfs_get_tree+0x28/0xe0
              fc_mount+0xe/0x40
              vfs_kern_mount.part.0+0x71/0x90
              btrfs_mount+0x13b/0x3e0 [btrfs]
        INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
              kmem_cache_free+0x34c/0x3c0
              __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
              btrfs_run_delayed_refs+0x81/0x210 [btrfs]
              commit_cowonly_roots+0xfb/0x300 [btrfs]
              btrfs_commit_transaction+0x367/0xc40 [btrfs]
              sync_filesystem+0x74/0x90
              generic_shutdown_super+0x22/0x100
              kill_anon_super+0x14/0x30
              btrfs_kill_super+0x12/0x20 [btrfs]
              deactivate_locked_super+0x31/0x70
              cleanup_mnt+0x100/0x160
              task_work_run+0x68/0xb0
              exit_to_user_mode_prepare+0x1bb/0x1c0
              syscall_exit_to_user_mode+0x4b/0x260
              entry_SYSCALL_64_after_hwframe+0x44/0xa9
        INFO: Object 0x0000000086e9b0ff @offset=12776
        INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
              __slab_alloc.isra.0+0x109/0x1c0
              kmem_cache_alloc+0x7bb/0x830
              btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
              btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
              alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
              __btrfs_cow_block+0x12d/0x5f0 [btrfs]
              btrfs_cow_block+0xf7/0x220 [btrfs]
              btrfs_search_slot+0x62a/0xc40 [btrfs]
              btrfs_del_orphan_item+0x65/0xd0 [btrfs]
              btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
              open_ctree+0x125a/0x18a0 [btrfs]
              btrfs_mount_root.cold+0x13/0xed [btrfs]
              legacy_get_tree+0x30/0x60
              vfs_get_tree+0x28/0xe0
              fc_mount+0xe/0x40
              vfs_kern_mount.part.0+0x71/0x90
        INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
              kmem_cache_free+0x34c/0x3c0
              __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
              btrfs_run_delayed_refs+0x81/0x210 [btrfs]
              btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
              commit_cowonly_roots+0x248/0x300 [btrfs]
              btrfs_commit_transaction+0x367/0xc40 [btrfs]
              close_ctree+0x113/0x2fa [btrfs]
              generic_shutdown_super+0x6c/0x100
              kill_anon_super+0x14/0x30
              btrfs_kill_super+0x12/0x20 [btrfs]
              deactivate_locked_super+0x31/0x70
              cleanup_mnt+0x100/0x160
              task_work_run+0x68/0xb0
              exit_to_user_mode_prepare+0x1bb/0x1c0
              syscall_exit_to_user_mode+0x4b/0x260
              entry_SYSCALL_64_after_hwframe+0x44/0xa9
        kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
        CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         kmem_cache_destroy+0x119/0x120
         btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        =============================================================================
        BUG btrfs_delayed_tree_ref (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
        -----------------------------------------------------------------------------
      
        INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
        CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         slab_err+0xb7/0xdc
         ? lock_acquired+0x199/0x490
         __kmem_cache_shutdown+0x1ac/0x3c0
         ? lock_release+0x20e/0x4c0
         kmem_cache_destroy+0x55/0x120
         btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 f5 (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        INFO: Object 0x000000001a340018 @offset=4408
        INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
              __slab_alloc.isra.0+0x109/0x1c0
              kmem_cache_alloc+0x7bb/0x830
              btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
              btrfs_free_tree_block+0x128/0x360 [btrfs]
              __btrfs_cow_block+0x489/0x5f0 [btrfs]
              btrfs_cow_block+0xf7/0x220 [btrfs]
              btrfs_search_slot+0x62a/0xc40 [btrfs]
              btrfs_del_orphan_item+0x65/0xd0 [btrfs]
              btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
              open_ctree+0x125a/0x18a0 [btrfs]
              btrfs_mount_root.cold+0x13/0xed [btrfs]
              legacy_get_tree+0x30/0x60
              vfs_get_tree+0x28/0xe0
              fc_mount+0xe/0x40
              vfs_kern_mount.part.0+0x71/0x90
              btrfs_mount+0x13b/0x3e0 [btrfs]
        INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
              kmem_cache_free+0x34c/0x3c0
              __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
              btrfs_run_delayed_refs+0x81/0x210 [btrfs]
              btrfs_commit_transaction+0x60/0xc40 [btrfs]
              create_subvol+0x56a/0x990 [btrfs]
              btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
              __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
              btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
              btrfs_ioctl+0x1a92/0x36f0 [btrfs]
              __x64_sys_ioctl+0x83/0xb0
              do_syscall_64+0x33/0x80
              entry_SYSCALL_64_after_hwframe+0x44/0xa9
        INFO: Object 0x000000002b46292a @offset=13648
        INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
              __slab_alloc.isra.0+0x109/0x1c0
              kmem_cache_alloc+0x7bb/0x830
              btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
              btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
              alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
              __btrfs_cow_block+0x12d/0x5f0 [btrfs]
              btrfs_cow_block+0xf7/0x220 [btrfs]
              btrfs_search_slot+0x62a/0xc40 [btrfs]
              btrfs_del_orphan_item+0x65/0xd0 [btrfs]
              btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
              open_ctree+0x125a/0x18a0 [btrfs]
              btrfs_mount_root.cold+0x13/0xed [btrfs]
              legacy_get_tree+0x30/0x60
              vfs_get_tree+0x28/0xe0
              fc_mount+0xe/0x40
              vfs_kern_mount.part.0+0x71/0x90
        INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
              kmem_cache_free+0x34c/0x3c0
              __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
              btrfs_run_delayed_refs+0x81/0x210 [btrfs]
              commit_cowonly_roots+0xfb/0x300 [btrfs]
              btrfs_commit_transaction+0x367/0xc40 [btrfs]
              close_ctree+0x113/0x2fa [btrfs]
              generic_shutdown_super+0x6c/0x100
              kill_anon_super+0x14/0x30
              btrfs_kill_super+0x12/0x20 [btrfs]
              deactivate_locked_super+0x31/0x70
              cleanup_mnt+0x100/0x160
              task_work_run+0x68/0xb0
              exit_to_user_mode_prepare+0x1bb/0x1c0
              syscall_exit_to_user_mode+0x4b/0x260
              entry_SYSCALL_64_after_hwframe+0x44/0xa9
        kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
        CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         kmem_cache_destroy+0x119/0x120
         btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 f5 (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        =============================================================================
        BUG btrfs_delayed_extent_op (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
        -----------------------------------------------------------------------------
        INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
        CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         slab_err+0xb7/0xdc
         ? lock_acquired+0x199/0x490
         __kmem_cache_shutdown+0x1ac/0x3c0
         ? __mutex_unlock_slowpath+0x45/0x2a0
         kmem_cache_destroy+0x55/0x120
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 f5 (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        INFO: Object 0x000000004cf95ea8 @offset=6264
        INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
              __slab_alloc.isra.0+0x109/0x1c0
              kmem_cache_alloc+0x7bb/0x830
              btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
              alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
              __btrfs_cow_block+0x12d/0x5f0 [btrfs]
              btrfs_cow_block+0xf7/0x220 [btrfs]
              btrfs_search_slot+0x62a/0xc40 [btrfs]
              btrfs_del_orphan_item+0x65/0xd0 [btrfs]
              btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
              open_ctree+0x125a/0x18a0 [btrfs]
              btrfs_mount_root.cold+0x13/0xed [btrfs]
              legacy_get_tree+0x30/0x60
              vfs_get_tree+0x28/0xe0
              fc_mount+0xe/0x40
              vfs_kern_mount.part.0+0x71/0x90
              btrfs_mount+0x13b/0x3e0 [btrfs]
        INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
              kmem_cache_free+0x34c/0x3c0
              __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
              btrfs_run_delayed_refs+0x81/0x210 [btrfs]
              commit_cowonly_roots+0xfb/0x300 [btrfs]
              btrfs_commit_transaction+0x367/0xc40 [btrfs]
              close_ctree+0x113/0x2fa [btrfs]
              generic_shutdown_super+0x6c/0x100
              kill_anon_super+0x14/0x30
              btrfs_kill_super+0x12/0x20 [btrfs]
              deactivate_locked_super+0x31/0x70
              cleanup_mnt+0x100/0x160
              task_work_run+0x68/0xb0
              exit_to_user_mode_prepare+0x1bb/0x1c0
              syscall_exit_to_user_mode+0x4b/0x260
              entry_SYSCALL_64_after_hwframe+0x44/0xa9
        kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
        CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         kmem_cache_destroy+0x119/0x120
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1
      
      So fix this by making the remount path wait for the cleaner task before
      calling btrfs_commit_super(). The remount path now waits for the bit
      BTRFS_FS_CLEANER_RUNNING to be cleared from fs_info->flags before calling
      btrfs_commit_super(), which ensures the cleaner cannot start a
      transaction after that, because it sleeps when the filesystem is in RO
      mode and we have already flagged the filesystem as RO before waiting for
      BTRFS_FS_CLEANER_RUNNING to be cleared.
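
The ordering described above (flag RO first, then wait for the cleaner, then commit) can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: `FsInfo`, `cleaner_pass()`, `remount_ro()` and `demo()` are hypothetical names, and a Python condition variable stands in for waiting on the BTRFS_FS_CLEANER_RUNNING bit.

```python
import threading


class FsInfo:
    """Toy stand-in for the state the commit message talks about."""

    def __init__(self):
        self.cond = threading.Condition()
        self.state_ro = False          # models BTRFS_FS_STATE_RO
        self.cleaner_running = False   # models BTRFS_FS_CLEANER_RUNNING
        self.open_transactions = 0


def cleaner_pass(fs, may_proceed):
    """One cleaner iteration; returns False if it slept because of RO."""
    with fs.cond:
        if fs.state_ro:
            return False
        fs.cleaner_running = True
        fs.cond.notify_all()
    may_proceed.wait()                 # pretend to be busy cleaning
    with fs.cond:
        fs.open_transactions += 1      # the cleaning work started a transaction
        fs.cleaner_running = False
        fs.cond.notify_all()
    return True


def remount_ro(fs):
    """Flag RO first, then wait for the cleaner, then commit."""
    with fs.cond:
        fs.state_ro = True
        fs.cond.wait_for(lambda: not fs.cleaner_running)
        fs.open_transactions = 0       # models btrfs_commit_super()


def demo():
    fs = FsInfo()
    go = threading.Event()
    cleaner = threading.Thread(target=cleaner_pass, args=(fs, go))
    cleaner.start()
    with fs.cond:                      # make sure the cleaner is mid-pass
        fs.cond.wait_for(lambda: fs.cleaner_running)
    remount = threading.Thread(target=remount_ro, args=(fs,))
    remount.start()
    go.set()                           # let the cleaner finish its pass
    cleaner.join()
    remount.join()
    # A later cleaner pass must see RO and do nothing.
    return fs.open_transactions, cleaner_pass(fs, go)
```

In this model `demo()` ends with no open transaction, and the later cleaner pass refuses to run. Because the RO flag is set before the wait, any pass that starts afterwards sees it and sleeps, and any pass already in flight is finished before the commit.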
      
      This also introduces a new flag BTRFS_FS_STATE_RO to be used for
      fs_info->fs_state when the filesystem is in RO mode. This is because we
      were doing the RO check using the flags of the superblock and setting the
      RO mode simply by ORing into the superblock's flags - those operations are
      not atomic and could result in the cleaner not seeing the update from the
      remount task after it clears BTRFS_FS_CLEANER_RUNNING.
      Tested-by: Fabian Vogt <fvogt@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a0a1db70
    • Filipe Manana's avatar
      btrfs: fix transaction leak and crash after cleaning up orphans on RO mount · 638331fa
      Filipe Manana authored
      When we delete a root (subvolume or snapshot), at the very end of the
      operation, we attempt to remove the root's orphan item from the root tree,
      at btrfs_drop_snapshot(), by calling btrfs_del_orphan_item(). We ignore any
      error from btrfs_del_orphan_item() since it is not a serious problem and
      the next time the filesystem is mounted we remove such stray orphan items
      at btrfs_find_orphan_roots().
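
How a stray orphan item survives a failed deletion and is picked up at the next mount can be sketched with a toy model. The class and function names below only mirror the kernel function names for readability; their signatures and the `fail_with` parameter are assumptions made for illustration, and the model ignores roots that still exist and are queued as dead roots.

```python
import errno


class RootTree:
    """Toy root tree: the existing roots plus their orphan items."""

    def __init__(self):
        self.roots = set()
        self.orphan_items = set()


def del_orphan_item(tree, root_id, fail_with=0):
    """Remove an orphan item; return a negative errno on simulated failure."""
    if fail_with:
        return -fail_with
    tree.orphan_items.discard(root_id)
    return 0


def drop_snapshot(tree, root_id, fail_with=0):
    """Delete the root, then try to delete its orphan item."""
    tree.roots.discard(root_id)
    # The return value is deliberately ignored, as the message describes.
    del_orphan_item(tree, root_id, fail_with)


def find_orphan_roots(tree):
    """Mount-time pass: remove orphan items whose root no longer exists."""
    stray = sorted(tree.orphan_items - tree.roots)
    for root_id in stray:
        del_orphan_item(tree, root_id)
    return stray


tree = RootTree()
tree.roots.add(256)
tree.orphan_items.add(256)             # subvolume 256 marked for deletion
drop_snapshot(tree, 256, fail_with=errno.ENOMEM)
leftover = set(tree.orphan_items)      # the orphan item outlived its root
cleaned = find_orphan_roots(tree)      # the next mount removes it
```

The point of the sketch is that the cleanup at mount time is itself a modification of the root tree, which is why it needs a transaction.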
      
      However if the filesystem is mounted RO and we have stray orphan items for
      any previously deleted root, we can end up leaking a transaction and other
      data structures when unmounting the filesystem, as well as crashing if we
      do not have hardware acceleration for crc32c available.
      
      The steps that lead to the transaction leak are the following:
      
      1) The filesystem is mounted in RW mode;
      
      2) A subvolume is deleted;
      
      3) When the cleaner kthread runs btrfs_drop_snapshot() to delete the root,
         it gets a failure at btrfs_del_orphan_item(), which is ignored, due to
         an ENOMEM when allocating a path for example. So the orphan item for
         the root remains in the root tree;
      
      4) The filesystem is unmounted;
      
      5) The filesystem is mounted RO (-o ro). During the mount path we call
         btrfs_find_orphan_roots(), which iterates the root tree searching for
         orphan items. It finds the orphan item for our deleted root, and since
         it can not find the root, it starts a transaction to delete the orphan
         item (by calling btrfs_del_orphan_item());
      
      6) The RO mount completes;
      
      7) Before the transaction kthread commits the transaction created for
         deleting the orphan item (i.e. less than 30 seconds elapsed since the
         mount, the default commit interval), a filesystem unmount operation is
         started;
      
      8) At close_ctree(), we stop the transaction kthread, but we still have a
         transaction open with at least one dirty extent buffer, a leaf for the
         tree root which was COWed when deleting the orphan item;
      
      9) We then proceed to destroy the work queues, free the roots and block
         groups, etc. After that we drop the last reference on the btree inode by
         calling iput() on it. Since there are dirty pages for the btree inode,
         corresponding to the COWed extent buffer, btree_write_cache_pages() is
         invoked to flush those dirty pages. This results in creating a bio and
         submitting it, which makes us end up at btrfs_submit_metadata_bio();
      
      10) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
          that calls btrfs_wq_submit_bio(), because check_async_write()
          returned 1. It returned 1 because hardware acceleration for crc32c
          was not available, so BTRFS_FS_CSUM_IMPL_FAST was not set in
          fs_info->flags;
      
      11) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
          workqueue at fs_info->workers, which was already freed before by the
          call to btrfs_stop_all_workers() at close_ctree(). This results in an
          invalid memory access due to a use-after-free, leading to a crash.
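
Steps 8 to 11 above can be condensed into a toy model of the unmount path. This is a simplified sketch, not the kernel implementation: `ToyFs` and `FreedWorkqueue` are hypothetical, the function names are borrowed from the steps only for readability, and an exception stands in for the use-after-free.

```python
class FreedWorkqueue(Exception):
    """Raised where the kernel would touch the already freed workqueue."""


class ToyFs:
    def __init__(self, csum_impl_fast, dirty_buffers):
        self.csum_impl_fast = csum_impl_fast   # models BTRFS_FS_CSUM_IMPL_FAST
        self.dirty_buffers = list(dirty_buffers)
        self.workers_alive = True
        self.queued = []                       # bios handed to the workqueue
        self.written = []                      # bios submitted directly


def check_async_write(fs):
    """Return 1 when checksumming should be offloaded to a worker."""
    return 0 if fs.csum_impl_fast else 1


def submit_metadata_bio(fs, buf):
    if check_async_write(fs):
        if not fs.workers_alive:
            raise FreedWorkqueue(buf)          # step 11: use-after-free
        fs.queued.append(buf)
    else:
        fs.written.append(buf)


def close_ctree(fs):
    fs.workers_alive = False                   # step 9: work queues destroyed
    for buf in fs.dirty_buffers:               # step 9: iput() flushes dirty pages
        submit_metadata_bio(fs, buf)
    fs.dirty_buffers.clear()


def unmount_crashes(fs):
    try:
        close_ctree(fs)
    except FreedWorkqueue:
        return True
    return False
```

In this model the crash needs both ingredients: a dirty extent buffer left behind by the uncommitted transaction, and the absence of fast crc32c. With `csum_impl_fast=True` the buffer is written directly, which matches the observation that the crash only happens without hardware acceleration, although the transaction is leaked either way.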
      
      When this happens, several warnings are triggered before the crash,
      since we still have reserved metadata space in a block group, in the
      delayed refs reservation, etc:
      
       ------------[ cut here ]------------
       WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
       Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
       CPU: 4 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
       Code: f0 01 00 00 48 39 c2 75 (...)
       RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
       RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
       RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
       RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
       R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
       FS:  00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
        close_ctree+0x2ba/0x2fa [btrfs]
        generic_shutdown_super+0x6c/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f15ee221ee7
       Code: ff 0b 00 f7 d8 64 89 01 48 (...)
       RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
       RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
       RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
       RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
       R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
       R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
       softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace dd74718fef1ed5c6 ]---
       ------------[ cut here ]------------
       WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
       Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
       CPU: 2 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
       Code: 48 83 bb b0 03 00 00 00 (...)
       RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
       RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
       RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
       RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
       R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
       FS:  00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
        close_ctree+0x2ba/0x2fa [btrfs]
        generic_shutdown_super+0x6c/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f15ee221ee7
       Code: ff 0b 00 f7 d8 64 89 01 (...)
       RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
       RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
       RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
       RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
       R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
       R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
       softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace dd74718fef1ed5c7 ]---
       ------------[ cut here ]------------
       WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
       Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
       CPU: 5 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
       Code: ad de 49 be 22 01 00 (...)
       RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
       RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
       RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
       RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
       R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
       FS:  00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        close_ctree+0x2ba/0x2fa [btrfs]
        generic_shutdown_super+0x6c/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f15ee221ee7
       Code: ff 0b 00 f7 d8 64 89 (...)
       RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
       RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
       RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
       RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
       R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
       R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
       softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace dd74718fef1ed5c8 ]---
       BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
       BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
       BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
       BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
       BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
       BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
       BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0
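
The space_info numbers in the dump above are self-consistent, which makes the leaked amounts easy to read off. The helper below is a hypothetical illustration that assumes the reported free space is the total minus every accounted counter; it is not a kernel API.

```python
def space_info_free(total, used, pinned, reserved, may_use, readonly):
    """Free space, assuming it is the total minus all accounted counters."""
    return total - (used + pinned + reserved + may_use + readonly)


# Values copied from the "space_info total=..." line of the dump.
free = space_info_free(
    total=268435456,
    used=114688,
    pinned=0,
    reserved=16384,
    may_use=0,
    readonly=65536,
)
```

Here `free` comes out as 268238848, the value printed in the "space_info 4 has 268238848 free" line. The nonzero `reserved=16384` is 16 KiB, the default btrfs metadata node size, so it is presumably the single COWed leaf of the leaked transaction.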
      
      And the crash, which only happens when we do not have crc32c hardware
      acceleration, produces the following trace immediately after those
      warnings:
      
       stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
       CPU: 2 PID: 1749129 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
       Code: 54 55 53 48 89 f3 (...)
       RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
       RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
       RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
       RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
       R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
       FS:  00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
        btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
        submit_one_bio+0x61/0x70 [btrfs]
        btree_write_cache_pages+0x414/0x450 [btrfs]
        ? kobject_put+0x9a/0x1d0
        ? trace_hardirqs_on+0x1b/0xf0
        ? _raw_spin_unlock_irqrestore+0x3c/0x60
        ? free_debug_processing+0x1e1/0x2b0
        do_writepages+0x43/0xe0
        ? lock_acquired+0x199/0x490
        __writeback_single_inode+0x59/0x650
        writeback_single_inode+0xaf/0x120
        write_inode_now+0x94/0xd0
        iput+0x187/0x2b0
        close_ctree+0x2c6/0x2fa [btrfs]
        generic_shutdown_super+0x6c/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f3cfebabee7
       Code: ff 0b 00 f7 d8 64 89 01 (...)
       RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
       RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
       RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
       RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
       R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
       R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
       Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
       ---[ end trace dd74718fef1ed5cc ]---
      
      Finally, when we remove the btrfs module (rmmod btrfs), there are several
      warnings about objects that were allocated from our slabs but were never
      freed, a consequence of the transaction that was never committed and got
      leaked:

       =============================================================================
       BUG btrfs_delayed_ref_head (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
       -----------------------------------------------------------------------------
      
       INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
       CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       Call Trace:
        dump_stack+0x8d/0xb5
        slab_err+0xb7/0xdc
        ? lock_acquired+0x199/0x490
        __kmem_cache_shutdown+0x1ac/0x3c0
        ? lock_release+0x20e/0x4c0
        kmem_cache_destroy+0x55/0x120
        btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
        exit_btrfs_fs+0xa/0x59 [btrfs]
        __x64_sys_delete_module+0x194/0x260
        ? fpregs_assert_state_consistent+0x1e/0x40
        ? exit_to_user_mode_prepare+0x55/0x1c0
        ? trace_hardirqs_on+0x1b/0xf0
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f693e305897
       Code: 73 01 c3 48 8b 0d f9 f5 (...)
       RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
       RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
       RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
       RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
       R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
       R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
       INFO: Object 0x0000000050cbdd61 @offset=12104
       INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
              __slab_alloc.isra.0+0x109/0x1c0
              kmem_cache_alloc+0x7bb/0x830
              btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
              btrfs_free_tree_block+0x128/0x360 [btrfs]
              __btrfs_cow_block+0x489/0x5f0 [btrfs]
              btrfs_cow_block+0xf7/0x220 [btrfs]
              btrfs_search_slot+0x62a/0xc40 [btrfs]
              btrfs_del_orphan_item+0x65/0xd0 [btrfs]
              btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
              open_ctree+0x125a/0x18a0 [btrfs]
              btrfs_mount_root.cold+0x13/0xed [btrfs]
              legacy_get_tree+0x30/0x60
              vfs_get_tree+0x28/0xe0
              fc_mount+0xe/0x40
              vfs_kern_mount.part.0+0x71/0x90
              btrfs_mount+0x13b/0x3e0 [btrfs]
       INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
              kmem_cache_free+0x34c/0x3c0
              __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
              btrfs_run_delayed_refs+0x81/0x210 [btrfs]
              commit_cowonly_roots+0xfb/0x300 [btrfs]
              btrfs_commit_transaction+0x367/0xc40 [btrfs]
              sync_filesystem+0x74/0x90
              generic_shutdown_super+0x22/0x100
              kill_anon_super+0x14/0x30
              btrfs_kill_super+0x12/0x20 [btrfs]
              deactivate_locked_super+0x31/0x70
              cleanup_mnt+0x100/0x160
              task_work_run+0x68/0xb0
              exit_to_user_mode_prepare+0x1bb/0x1c0
              syscall_exit_to_user_mode+0x4b/0x260
              entry_SYSCALL_64_after_hwframe+0x44/0xa9
       INFO: Object 0x0000000086e9b0ff @offset=12776
       INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
              __slab_alloc.isra.0+0x109/0x1c0
              kmem_cache_alloc+0x7bb/0x830
              btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
              btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
              alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
              __btrfs_cow_block+0x12d/0x5f0 [btrfs]
              btrfs_cow_block+0xf7/0x220 [btrfs]
              btrfs_search_slot+0x62a/0xc40 [btrfs]
              btrfs_del_orphan_item+0x65/0xd0 [btrfs]
              btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
              open_ctree+0x125a/0x18a0 [btrfs]
              btrfs_mount_root.cold+0x13/0xed [btrfs]
              legacy_get_tree+0x30/0x60
              vfs_get_tree+0x28/0xe0
              fc_mount+0xe/0x40
              vfs_kern_mount.part.0+0x71/0x90
       INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
              kmem_cache_free+0x34c/0x3c0
              __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
              btrfs_run_delayed_refs+0x81/0x210 [btrfs]
              btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
              commit_cowonly_roots+0x248/0x300 [btrfs]
              btrfs_commit_transaction+0x367/0xc40 [btrfs]
              close_ctree+0x113/0x2fa [btrfs]
              generic_shutdown_super+0x6c/0x100
              kill_anon_super+0x14/0x30
              btrfs_kill_super+0x12/0x20 [btrfs]
              deactivate_locked_super+0x31/0x70
              cleanup_mnt+0x100/0x160
              task_work_run+0x68/0xb0
              exit_to_user_mode_prepare+0x1bb/0x1c0
              syscall_exit_to_user_mode+0x4b/0x260
              entry_SYSCALL_64_after_hwframe+0x44/0xa9
       kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
       CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       Call Trace:
        dump_stack+0x8d/0xb5
        kmem_cache_destroy+0x119/0x120
        btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
        exit_btrfs_fs+0xa/0x59 [btrfs]
        __x64_sys_delete_module+0x194/0x260
        ? fpregs_assert_state_consistent+0x1e/0x40
        ? exit_to_user_mode_prepare+0x55/0x1c0
        ? trace_hardirqs_on+0x1b/0xf0
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f693e305897
       Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
       RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
       RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
       RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
       RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
       R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
       R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
       =============================================================================
       BUG btrfs_delayed_tree_ref (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
       -----------------------------------------------------------------------------
      
       INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
       CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       Call Trace:
        dump_stack+0x8d/0xb5
        slab_err+0xb7/0xdc
        ? lock_acquired+0x199/0x490
        __kmem_cache_shutdown+0x1ac/0x3c0
        ? lock_release+0x20e/0x4c0
        kmem_cache_destroy+0x55/0x120
        btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
        exit_btrfs_fs+0xa/0x59 [btrfs]
        __x64_sys_delete_module+0x194/0x260
        ? fpregs_assert_state_consistent+0x1e/0x40
        ? exit_to_user_mode_prepare+0x55/0x1c0
        ? trace_hardirqs_on+0x1b/0xf0
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f693e305897
       Code: 73 01 c3 48 8b 0d f9 f5 (...)
       RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
       RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
       RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
       RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
       R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
       R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
       INFO: Object 0x000000001a340018 @offset=4408
       INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
              __slab_alloc.isra.0+0x109/0x1c0
              kmem_cache_alloc+0x7bb/0x830
              btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
              btrfs_free_tree_block+0x128/0x360 [btrfs]
              __btrfs_cow_block+0x489/0x5f0 [btrfs]
              btrfs_cow_block+0xf7/0x220 [btrfs]
              btrfs_search_slot+0x62a/0xc40 [btrfs]
              btrfs_del_orphan_item+0x65/0xd0 [btrfs]
              btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
              open_ctree+0x125a/0x18a0 [btrfs]
              btrfs_mount_root.cold+0x13/0xed [btrfs]
              legacy_get_tree+0x30/0x60
              vfs_get_tree+0x28/0xe0
              fc_mount+0xe/0x40
              vfs_kern_mount.part.0+0x71/0x90
              btrfs_mount+0x13b/0x3e0 [btrfs]
       INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
              kmem_cache_free+0x34c/0x3c0
              __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
              btrfs_run_delayed_refs+0x81/0x210 [btrfs]
              btrfs_commit_transaction+0x60/0xc40 [btrfs]
              create_subvol+0x56a/0x990 [btrfs]
              btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
              __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
              btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
              btrfs_ioctl+0x1a92/0x36f0 [btrfs]
              __x64_sys_ioctl+0x83/0xb0
              do_syscall_64+0x33/0x80
              entry_SYSCALL_64_after_hwframe+0x44/0xa9
       INFO: Object 0x000000002b46292a @offset=13648
       INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
              __slab_alloc.isra.0+0x109/0x1c0
              kmem_cache_alloc+0x7bb/0x830
              btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
              btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
              alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
              __btrfs_cow_block+0x12d/0x5f0 [btrfs]
              btrfs_cow_block+0xf7/0x220 [btrfs]
              btrfs_search_slot+0x62a/0xc40 [btrfs]
              btrfs_del_orphan_item+0x65/0xd0 [btrfs]
              btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
              open_ctree+0x125a/0x18a0 [btrfs]
              btrfs_mount_root.cold+0x13/0xed [btrfs]
              legacy_get_tree+0x30/0x60
              vfs_get_tree+0x28/0xe0
              fc_mount+0xe/0x40
              vfs_kern_mount.part.0+0x71/0x90
       INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
              kmem_cache_free+0x34c/0x3c0
              __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
              btrfs_run_delayed_refs+0x81/0x210 [btrfs]
              commit_cowonly_roots+0xfb/0x300 [btrfs]
              btrfs_commit_transaction+0x367/0xc40 [btrfs]
              close_ctree+0x113/0x2fa [btrfs]
              generic_shutdown_super+0x6c/0x100
              kill_anon_super+0x14/0x30
              btrfs_kill_super+0x12/0x20 [btrfs]
              deactivate_locked_super+0x31/0x70
              cleanup_mnt+0x100/0x160
              task_work_run+0x68/0xb0
              exit_to_user_mode_prepare+0x1bb/0x1c0
              syscall_exit_to_user_mode+0x4b/0x260
              entry_SYSCALL_64_after_hwframe+0x44/0xa9
       kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
       CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       Call Trace:
        dump_stack+0x8d/0xb5
        kmem_cache_destroy+0x119/0x120
        btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
        exit_btrfs_fs+0xa/0x59 [btrfs]
        __x64_sys_delete_module+0x194/0x260
        ? fpregs_assert_state_consistent+0x1e/0x40
        ? exit_to_user_mode_prepare+0x55/0x1c0
        ? trace_hardirqs_on+0x1b/0xf0
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f693e305897
       Code: 73 01 c3 48 8b 0d f9 f5 (...)
       RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
       RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
       RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
       RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
       R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
       R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
       =============================================================================
       BUG btrfs_delayed_extent_op (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
       -----------------------------------------------------------------------------
      
       INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
       CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       Call Trace:
        dump_stack+0x8d/0xb5
        slab_err+0xb7/0xdc
        ? lock_acquired+0x199/0x490
        __kmem_cache_shutdown+0x1ac/0x3c0
        ? __mutex_unlock_slowpath+0x45/0x2a0
        kmem_cache_destroy+0x55/0x120
        exit_btrfs_fs+0xa/0x59 [btrfs]
        __x64_sys_delete_module+0x194/0x260
        ? fpregs_assert_state_consistent+0x1e/0x40
        ? exit_to_user_mode_prepare+0x55/0x1c0
        ? trace_hardirqs_on+0x1b/0xf0
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f693e305897
       Code: 73 01 c3 48 8b 0d f9 f5 (...)
       RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
       RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
       RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
       RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
       R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
       R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
       INFO: Object 0x000000004cf95ea8 @offset=6264
       INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
              __slab_alloc.isra.0+0x109/0x1c0
              kmem_cache_alloc+0x7bb/0x830
              btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
              alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
              __btrfs_cow_block+0x12d/0x5f0 [btrfs]
              btrfs_cow_block+0xf7/0x220 [btrfs]
              btrfs_search_slot+0x62a/0xc40 [btrfs]
              btrfs_del_orphan_item+0x65/0xd0 [btrfs]
              btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
              open_ctree+0x125a/0x18a0 [btrfs]
              btrfs_mount_root.cold+0x13/0xed [btrfs]
              legacy_get_tree+0x30/0x60
              vfs_get_tree+0x28/0xe0
              fc_mount+0xe/0x40
              vfs_kern_mount.part.0+0x71/0x90
              btrfs_mount+0x13b/0x3e0 [btrfs]
       INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
              kmem_cache_free+0x34c/0x3c0
              __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
              btrfs_run_delayed_refs+0x81/0x210 [btrfs]
              commit_cowonly_roots+0xfb/0x300 [btrfs]
              btrfs_commit_transaction+0x367/0xc40 [btrfs]
              close_ctree+0x113/0x2fa [btrfs]
              generic_shutdown_super+0x6c/0x100
              kill_anon_super+0x14/0x30
              btrfs_kill_super+0x12/0x20 [btrfs]
              deactivate_locked_super+0x31/0x70
              cleanup_mnt+0x100/0x160
              task_work_run+0x68/0xb0
              exit_to_user_mode_prepare+0x1bb/0x1c0
              syscall_exit_to_user_mode+0x4b/0x260
              entry_SYSCALL_64_after_hwframe+0x44/0xa9
       kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
       CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       Call Trace:
        dump_stack+0x8d/0xb5
        kmem_cache_destroy+0x119/0x120
        exit_btrfs_fs+0xa/0x59 [btrfs]
        __x64_sys_delete_module+0x194/0x260
        ? fpregs_assert_state_consistent+0x1e/0x40
        ? exit_to_user_mode_prepare+0x55/0x1c0
        ? trace_hardirqs_on+0x1b/0xf0
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f693e305897
       Code: 73 01 c3 48 8b 0d f9 (...)
       RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
       RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
       RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
       RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
       R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
       R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
       BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1
      
      So fix this by calling btrfs_find_orphan_roots() in the mount path only if
      we are mounting the filesystem in RW mode. It's pointless to have it called
      for RO mounts anyway, since even though it adds any deleted roots to the
      list of dead roots, the roots are never deleted until the filesystem is
      remounted in RW mode, as the cleaner kthread does nothing when we are
      mounted in RO - btrfs_need_cleaner_sleep() always returns true and the
      cleaner spends all its time sleeping, never cleaning dead roots.
      
      This is accomplished by moving the call to btrfs_find_orphan_roots() from
      open_ctree() to btrfs_start_pre_rw_mount(), which also guarantees that
      if later the filesystem is remounted RW, we populate the list of dead
      roots and have the cleaner task delete the dead roots.
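      
      The effect of moving the call can be sketched with a small userspace
      model. This is only an illustration under stated assumptions, not the
      kernel code: struct toy_mount and all toy_* functions are hypothetical
      stand-ins, and the cow_operations counter merely represents the tree
      modifications that the traces above show btrfs_find_orphan_roots()
      triggering through btrfs_del_orphan_item().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the mounted filesystem state; not a kernel structure. */
struct toy_mount {
	bool read_only;
	bool dead_roots_populated;
	int cow_operations;	/* tree modifications made while scanning */
};

/* Hypothetical stand-in for btrfs_find_orphan_roots(). */
void toy_find_orphan_roots(struct toy_mount *m)
{
	assert(!m->read_only);	/* in this model, never reached on RO mounts */
	m->cow_operations++;
	m->dead_roots_populated = true;
}

/* Hypothetical stand-in for btrfs_start_pre_rw_mount(). */
void toy_start_pre_rw_mount(struct toy_mount *m)
{
	toy_find_orphan_roots(m);
}

/* Hypothetical stand-in for open_ctree(): the scan is not called directly. */
void toy_open_ctree(struct toy_mount *m, bool read_only)
{
	m->read_only = read_only;
	if (!read_only)
		toy_start_pre_rw_mount(m);
}

/* Hypothetical stand-in for a remount from RO to RW. */
void toy_remount_rw(struct toy_mount *m)
{
	if (!m->read_only)
		return;
	m->read_only = false;
	toy_start_pre_rw_mount(m);
}
```

      In this model an RO mount leaves cow_operations at 0 and the dead roots
      list empty, while a later toy_remount_rw() populates the list, which
      mirrors the behaviour described above.
      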
      Tested-by: Fabian Vogt <fvogt@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      638331fa
    • btrfs: fix transaction leak and crash after RO remount caused by qgroup rescan · cb13eea3
      Filipe Manana authored
      If we remount a filesystem in RO mode while the qgroup rescan worker is
      running, the worker can still be running after the remount is done, and
      at unmount time we may be left with an open transaction that never gets
      committed. If that happens we get several memory leaks and can crash
      when hardware acceleration is unavailable for crc32c. It can possibly
      lead to other nasty surprises too, due to use-after-free issues.
      
      The following steps explain how the problem happens.
      
      1) We have a filesystem mounted in RW mode and the qgroup rescan worker is
         running;
      
      2) We remount the filesystem in RO mode, and never stop/pause the rescan
         worker, so after the remount the rescan worker is still running. The
         important detail here is that the rescan task is still running after
         the remount operation committed any ongoing transaction through its
         call to btrfs_commit_super();
      
      3) After the remount completes, the rescan worker, still running,
         finishes iterating all leaves of the extent tree and then starts a
         transaction to update the qgroup status item in the quotas tree. It
         does not commit the transaction, it only releases its handle on the
         transaction;
      
      4) A filesystem unmount operation starts shortly after;
      
      5) The unmount task, at close_ctree(), stops the transaction kthread,
         which had not had a chance to commit the open transaction since it was
         sleeping and the commit interval (default of 30 seconds) has not yet
         elapsed since the last time it committed a transaction;
      
      6) So after stopping the transaction kthread we still have the transaction
         used to update the qgroup status item open. At close_ctree(), when the
         filesystem is in RO mode and neither a transaction abort happened nor
         the filesystem is in error mode, we do not expect to have any
         transaction open, so we do not call btrfs_commit_super();
      
      7) We then proceed to destroy the work queues, free the roots and block
         groups, etc. After that we drop the last reference on the btree inode
         by calling iput() on it. Since there are dirty pages for the btree
         inode, corresponding to the COWed extent buffer for the quotas btree,
         btree_write_cache_pages() is invoked to flush those dirty pages. This
         results in creating a bio and submitting it, which makes us end up at
         btrfs_submit_metadata_bio();
      
      8) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
         that calls btrfs_wq_submit_bio(), because check_async_write() returned
         a value of 1. This value of 1 is because we did not have hardware
         acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
         set in fs_info->flags;
      
      9) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
         workqueue at fs_info->workers, which was already freed before by the
         call to btrfs_stop_all_workers() at close_ctree(). This results in an
         invalid memory access due to a use-after-free, leading to a crash.
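      
      The sequence of steps above can be condensed into a small userspace
      model. This is a sketch under stated assumptions, not kernel code:
      struct toy_fs, enum toy_outcome and the toy_* functions are hypothetical
      names, and each boolean only stands for the state the corresponding step
      describes (csum_impl_fast models BTRFS_FS_CSUM_IMPL_FAST being set).

```c
#include <assert.h>
#include <stdbool.h>

enum toy_outcome { TOY_CLEAN, TOY_LEAK, TOY_LEAK_AND_UAF };

/* Toy filesystem state; every field is an assumption of this model. */
struct toy_fs {
	bool read_only;
	bool rescan_running;	/* qgroup rescan worker active (step 1) */
	bool csum_impl_fast;	/* crc32c hardware acceleration available */
	bool transaction_open;
	bool workers_freed;
};

/* Step 2: the remount commits the current transaction but leaves the
 * rescan worker running. */
void toy_remount_ro(struct toy_fs *fs)
{
	assert(!fs->read_only);
	fs->transaction_open = false;	/* models btrfs_commit_super() */
	fs->read_only = true;
}

/* Step 3: the worker starts a transaction and never commits it. */
void toy_rescan_finish(struct toy_fs *fs)
{
	if (!fs->rescan_running)
		return;
	fs->transaction_open = true;
	fs->rescan_running = false;
}

/* Steps 5 to 9: unmount does not commit when RO, frees the workers, then
 * flushes any dirty btree pages when the btree inode is released. */
enum toy_outcome toy_close_ctree(struct toy_fs *fs)
{
	if (!fs->read_only)
		fs->transaction_open = false;	/* commit only when RW */
	fs->workers_freed = true;
	if (!fs->transaction_open)
		return TOY_CLEAN;
	/* Without fast crc32c the write is queued on the freed workqueue. */
	if (!fs->csum_impl_fast && fs->workers_freed)
		return TOY_LEAK_AND_UAF;
	return TOY_LEAK;
}
```

      In this model, calling toy_remount_ro(), toy_rescan_finish() and then
      toy_close_ctree() with rescan_running set yields TOY_LEAK_AND_UAF when
      csum_impl_fast is false and TOY_LEAK when it is true, matching the
      statement that the crash only happens without crc32c hardware
      acceleration.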
      
      When this happens, before the crash there are several warnings triggered,
      since we have reserved metadata space in a block group, the delayed refs
      reservation, etc:
      
        ------------[ cut here ]------------
        WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
        Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
        CPU: 4 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
        Code: f0 01 00 00 48 39 c2 75 (...)
        RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
        RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
        RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
        RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
        R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
        FS:  00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
         close_ctree+0x2ba/0x2fa [btrfs]
         generic_shutdown_super+0x6c/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x31/0x70
         cleanup_mnt+0x100/0x160
         task_work_run+0x68/0xb0
         exit_to_user_mode_prepare+0x1bb/0x1c0
         syscall_exit_to_user_mode+0x4b/0x260
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f15ee221ee7
        Code: ff 0b 00 f7 d8 64 89 01 48 (...)
        RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
        RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
        RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
        R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
        R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace dd74718fef1ed5c6 ]---
        ------------[ cut here ]------------
        WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
        Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
        CPU: 2 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
        Code: 48 83 bb b0 03 00 00 00 (...)
        RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
        RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
        RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
        R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
        FS:  00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
         close_ctree+0x2ba/0x2fa [btrfs]
         generic_shutdown_super+0x6c/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x31/0x70
         cleanup_mnt+0x100/0x160
         task_work_run+0x68/0xb0
         exit_to_user_mode_prepare+0x1bb/0x1c0
         syscall_exit_to_user_mode+0x4b/0x260
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f15ee221ee7
        Code: ff 0b 00 f7 d8 64 89 01 (...)
        RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
        RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
        RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
        R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
        R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace dd74718fef1ed5c7 ]---
        ------------[ cut here ]------------
        WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
        Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
        CPU: 5 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
        Code: ad de 49 be 22 01 00 (...)
        RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
        RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
        RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
        R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
        FS:  00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         close_ctree+0x2ba/0x2fa [btrfs]
         generic_shutdown_super+0x6c/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x31/0x70
         cleanup_mnt+0x100/0x160
         task_work_run+0x68/0xb0
         exit_to_user_mode_prepare+0x1bb/0x1c0
         syscall_exit_to_user_mode+0x4b/0x260
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f15ee221ee7
        Code: ff 0b 00 f7 d8 64 89 (...)
        RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
        RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
        RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
        R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
        R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace dd74718fef1ed5c8 ]---
        BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
        BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
        BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
        BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
        BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
        BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
        BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0
      
      And the crash, which only happens when we do not have crc32c hardware
      acceleration, produces the following trace immediately after those
      warnings:
      
        stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        CPU: 2 PID: 1749129 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
        Code: 54 55 53 48 89 f3 (...)
        RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
        RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
        RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
        R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
        FS:  00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
         btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
         submit_one_bio+0x61/0x70 [btrfs]
         btree_write_cache_pages+0x414/0x450 [btrfs]
         ? kobject_put+0x9a/0x1d0
         ? trace_hardirqs_on+0x1b/0xf0
         ? _raw_spin_unlock_irqrestore+0x3c/0x60
         ? free_debug_processing+0x1e1/0x2b0
         do_writepages+0x43/0xe0
         ? lock_acquired+0x199/0x490
         __writeback_single_inode+0x59/0x650
         writeback_single_inode+0xaf/0x120
         write_inode_now+0x94/0xd0
         iput+0x187/0x2b0
         close_ctree+0x2c6/0x2fa [btrfs]
         generic_shutdown_super+0x6c/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x31/0x70
         cleanup_mnt+0x100/0x160
         task_work_run+0x68/0xb0
         exit_to_user_mode_prepare+0x1bb/0x1c0
         syscall_exit_to_user_mode+0x4b/0x260
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f3cfebabee7
        Code: ff 0b 00 f7 d8 64 89 01 (...)
        RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
        RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
        RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
        R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
        R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
        Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
        ---[ end trace dd74718fef1ed5cc ]---
      
      Finally when we remove the btrfs module (rmmod btrfs), there are several
      warnings about objects that were allocated from our slabs but were never
      freed, a consequence of the transaction that was never committed and got
      leaked:
      
        =============================================================================
        BUG btrfs_delayed_ref_head (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
        -----------------------------------------------------------------------------
      
        INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
        CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         slab_err+0xb7/0xdc
         ? lock_acquired+0x199/0x490
         __kmem_cache_shutdown+0x1ac/0x3c0
         ? lock_release+0x20e/0x4c0
         kmem_cache_destroy+0x55/0x120
         btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 f5 (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        INFO: Object 0x0000000050cbdd61 @offset=12104
        INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
      	__slab_alloc.isra.0+0x109/0x1c0
      	kmem_cache_alloc+0x7bb/0x830
      	btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
      	btrfs_free_tree_block+0x128/0x360 [btrfs]
      	__btrfs_cow_block+0x489/0x5f0 [btrfs]
      	btrfs_cow_block+0xf7/0x220 [btrfs]
      	btrfs_search_slot+0x62a/0xc40 [btrfs]
      	btrfs_del_orphan_item+0x65/0xd0 [btrfs]
      	btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
      	open_ctree+0x125a/0x18a0 [btrfs]
      	btrfs_mount_root.cold+0x13/0xed [btrfs]
      	legacy_get_tree+0x30/0x60
      	vfs_get_tree+0x28/0xe0
      	fc_mount+0xe/0x40
      	vfs_kern_mount.part.0+0x71/0x90
      	btrfs_mount+0x13b/0x3e0 [btrfs]
        INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
      	kmem_cache_free+0x34c/0x3c0
      	__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
      	btrfs_run_delayed_refs+0x81/0x210 [btrfs]
      	commit_cowonly_roots+0xfb/0x300 [btrfs]
      	btrfs_commit_transaction+0x367/0xc40 [btrfs]
      	sync_filesystem+0x74/0x90
      	generic_shutdown_super+0x22/0x100
      	kill_anon_super+0x14/0x30
      	btrfs_kill_super+0x12/0x20 [btrfs]
      	deactivate_locked_super+0x31/0x70
      	cleanup_mnt+0x100/0x160
      	task_work_run+0x68/0xb0
      	exit_to_user_mode_prepare+0x1bb/0x1c0
      	syscall_exit_to_user_mode+0x4b/0x260
      	entry_SYSCALL_64_after_hwframe+0x44/0xa9
        INFO: Object 0x0000000086e9b0ff @offset=12776
        INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
      	__slab_alloc.isra.0+0x109/0x1c0
      	kmem_cache_alloc+0x7bb/0x830
      	btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
      	btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
      	alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
      	__btrfs_cow_block+0x12d/0x5f0 [btrfs]
      	btrfs_cow_block+0xf7/0x220 [btrfs]
      	btrfs_search_slot+0x62a/0xc40 [btrfs]
      	btrfs_del_orphan_item+0x65/0xd0 [btrfs]
      	btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
      	open_ctree+0x125a/0x18a0 [btrfs]
      	btrfs_mount_root.cold+0x13/0xed [btrfs]
      	legacy_get_tree+0x30/0x60
      	vfs_get_tree+0x28/0xe0
      	fc_mount+0xe/0x40
      	vfs_kern_mount.part.0+0x71/0x90
        INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
      	kmem_cache_free+0x34c/0x3c0
      	__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
      	btrfs_run_delayed_refs+0x81/0x210 [btrfs]
      	btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
      	commit_cowonly_roots+0x248/0x300 [btrfs]
      	btrfs_commit_transaction+0x367/0xc40 [btrfs]
      	close_ctree+0x113/0x2fa [btrfs]
      	generic_shutdown_super+0x6c/0x100
      	kill_anon_super+0x14/0x30
      	btrfs_kill_super+0x12/0x20 [btrfs]
      	deactivate_locked_super+0x31/0x70
      	cleanup_mnt+0x100/0x160
      	task_work_run+0x68/0xb0
      	exit_to_user_mode_prepare+0x1bb/0x1c0
      	syscall_exit_to_user_mode+0x4b/0x260
      	entry_SYSCALL_64_after_hwframe+0x44/0xa9
        kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
        CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         kmem_cache_destroy+0x119/0x120
         btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        =============================================================================
        BUG btrfs_delayed_tree_ref (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
        -----------------------------------------------------------------------------
      
        INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
        CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         slab_err+0xb7/0xdc
         ? lock_acquired+0x199/0x490
         __kmem_cache_shutdown+0x1ac/0x3c0
         ? lock_release+0x20e/0x4c0
         kmem_cache_destroy+0x55/0x120
         btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 f5 (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        INFO: Object 0x000000001a340018 @offset=4408
        INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
      	__slab_alloc.isra.0+0x109/0x1c0
      	kmem_cache_alloc+0x7bb/0x830
      	btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
      	btrfs_free_tree_block+0x128/0x360 [btrfs]
      	__btrfs_cow_block+0x489/0x5f0 [btrfs]
      	btrfs_cow_block+0xf7/0x220 [btrfs]
      	btrfs_search_slot+0x62a/0xc40 [btrfs]
      	btrfs_del_orphan_item+0x65/0xd0 [btrfs]
      	btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
      	open_ctree+0x125a/0x18a0 [btrfs]
      	btrfs_mount_root.cold+0x13/0xed [btrfs]
      	legacy_get_tree+0x30/0x60
      	vfs_get_tree+0x28/0xe0
      	fc_mount+0xe/0x40
      	vfs_kern_mount.part.0+0x71/0x90
      	btrfs_mount+0x13b/0x3e0 [btrfs]
        INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
      	kmem_cache_free+0x34c/0x3c0
      	__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
      	btrfs_run_delayed_refs+0x81/0x210 [btrfs]
      	btrfs_commit_transaction+0x60/0xc40 [btrfs]
      	create_subvol+0x56a/0x990 [btrfs]
      	btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
      	__btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
      	btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
      	btrfs_ioctl+0x1a92/0x36f0 [btrfs]
      	__x64_sys_ioctl+0x83/0xb0
      	do_syscall_64+0x33/0x80
      	entry_SYSCALL_64_after_hwframe+0x44/0xa9
        INFO: Object 0x000000002b46292a @offset=13648
        INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
      	__slab_alloc.isra.0+0x109/0x1c0
      	kmem_cache_alloc+0x7bb/0x830
      	btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
      	btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
      	alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
      	__btrfs_cow_block+0x12d/0x5f0 [btrfs]
      	btrfs_cow_block+0xf7/0x220 [btrfs]
      	btrfs_search_slot+0x62a/0xc40 [btrfs]
      	btrfs_del_orphan_item+0x65/0xd0 [btrfs]
      	btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
      	open_ctree+0x125a/0x18a0 [btrfs]
      	btrfs_mount_root.cold+0x13/0xed [btrfs]
      	legacy_get_tree+0x30/0x60
      	vfs_get_tree+0x28/0xe0
      	fc_mount+0xe/0x40
      	vfs_kern_mount.part.0+0x71/0x90
        INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
      	kmem_cache_free+0x34c/0x3c0
      	__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
      	btrfs_run_delayed_refs+0x81/0x210 [btrfs]
      	commit_cowonly_roots+0xfb/0x300 [btrfs]
      	btrfs_commit_transaction+0x367/0xc40 [btrfs]
      	close_ctree+0x113/0x2fa [btrfs]
      	generic_shutdown_super+0x6c/0x100
      	kill_anon_super+0x14/0x30
      	btrfs_kill_super+0x12/0x20 [btrfs]
      	deactivate_locked_super+0x31/0x70
      	cleanup_mnt+0x100/0x160
      	task_work_run+0x68/0xb0
      	exit_to_user_mode_prepare+0x1bb/0x1c0
      	syscall_exit_to_user_mode+0x4b/0x260
      	entry_SYSCALL_64_after_hwframe+0x44/0xa9
        kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
        CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         kmem_cache_destroy+0x119/0x120
         btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 f5 (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        =============================================================================
        BUG btrfs_delayed_extent_op (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
        -----------------------------------------------------------------------------
      
        INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
        CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         slab_err+0xb7/0xdc
         ? lock_acquired+0x199/0x490
         __kmem_cache_shutdown+0x1ac/0x3c0
         ? __mutex_unlock_slowpath+0x45/0x2a0
         kmem_cache_destroy+0x55/0x120
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 f5 (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        INFO: Object 0x000000004cf95ea8 @offset=6264
        INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
      	__slab_alloc.isra.0+0x109/0x1c0
      	kmem_cache_alloc+0x7bb/0x830
      	btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
      	alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
      	__btrfs_cow_block+0x12d/0x5f0 [btrfs]
      	btrfs_cow_block+0xf7/0x220 [btrfs]
      	btrfs_search_slot+0x62a/0xc40 [btrfs]
      	btrfs_del_orphan_item+0x65/0xd0 [btrfs]
      	btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
      	open_ctree+0x125a/0x18a0 [btrfs]
      	btrfs_mount_root.cold+0x13/0xed [btrfs]
      	legacy_get_tree+0x30/0x60
      	vfs_get_tree+0x28/0xe0
      	fc_mount+0xe/0x40
      	vfs_kern_mount.part.0+0x71/0x90
      	btrfs_mount+0x13b/0x3e0 [btrfs]
        INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
      	kmem_cache_free+0x34c/0x3c0
      	__btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
      	btrfs_run_delayed_refs+0x81/0x210 [btrfs]
      	commit_cowonly_roots+0xfb/0x300 [btrfs]
      	btrfs_commit_transaction+0x367/0xc40 [btrfs]
      	close_ctree+0x113/0x2fa [btrfs]
      	generic_shutdown_super+0x6c/0x100
      	kill_anon_super+0x14/0x30
      	btrfs_kill_super+0x12/0x20 [btrfs]
      	deactivate_locked_super+0x31/0x70
      	cleanup_mnt+0x100/0x160
      	task_work_run+0x68/0xb0
      	exit_to_user_mode_prepare+0x1bb/0x1c0
      	syscall_exit_to_user_mode+0x4b/0x260
      	entry_SYSCALL_64_after_hwframe+0x44/0xa9
        kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
        CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         kmem_cache_destroy+0x119/0x120
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1
      
      Fix this issue by having the remount path stop the qgroup rescan worker
      when we are remounting RO and teach the rescan worker to stop when a
      remount is in progress. If later a remount in RW mode happens, we are
      already resuming the qgroup rescan worker through the call to
      btrfs_qgroup_rescan_resume(), so we do not need to worry about that.
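      
      A minimal sketch of the stop condition described above, written as a
      user-space Python model rather than kernel code. The names FsState,
      remounting and rescan_worker are hypothetical stand-ins, not the
      kernel's identifiers.

```python
class FsState:
    """Hypothetical stand-in for the filesystem state checked by the worker."""

    def __init__(self):
        self.remounting = False  # assumed flag: set while a remount is in progress


def rescan_worker(fs, leaves, on_leaf=None):
    """Process leaves one by one; stop early once a remount is in progress.

    Returns (number of leaves processed, True if the rescan completed).
    """
    processed = 0
    for leaf in leaves:
        if fs.remounting:
            # Bail out so the worker does not keep running across the remount.
            return processed, False
        if on_leaf is not None:
            on_leaf(fs, leaf)
        processed += 1
    return processed, True


def remount_after_third_leaf(fs, leaf):
    # Simulates a remount to RO starting while leaf 2 is being processed.
    if leaf == 2:
        fs.remounting = True


if __name__ == "__main__":
    print(rescan_worker(FsState(), range(5), remount_after_third_leaf))
```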
      Tested-by: Fabian Vogt <fvogt@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cb13eea3
    • btrfs: merge critical sections of discard lock in workfn · 8fc05859
      Pavel Begunkov authored
      btrfs_discard_workfn() drops discard_ctl->lock just to take it again in
      a moment in btrfs_discard_schedule_work(). Avoid that and also reuse
      ktime.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix racy access to discard_ctl data · 1ea2872f
      Pavel Begunkov authored
      Because only one discard worker may be running at any given point, it
      could have been safe to modify ->prev_discard, etc. without
      synchronization, if not for @override flag in
      btrfs_discard_schedule_work() and delayed_work_pending() returning false
      while workfn is running.
      
      That may lead to torn reads of u64 for some architectures, but that's
      not a big problem as it only slightly affects the discard rate.
      Suggested-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix async discard stall · ea9ed87c
      Pavel Begunkov authored
      It might happen that bg->discard_eligible_time was changed without
      rescheduling, so btrfs_discard_workfn() wakes up earlier than that new
      time, peek_discard_list() returns NULL, and all work halts and goes to
      sleep without further rescheduling even if there are block groups to
      discard.
      
      It happens pretty often, but it is not very visible from userspace
      because after some time the work usually gets kicked off anyway by
      someone else calling btrfs_discard_reschedule_work().
      
      Fix it by continuing to reschedule if the block group discard lists are
      not empty.
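      
      The stall and the fix can be modelled roughly as follows. This is a
      simplified Python sketch under stated assumptions, not the kernel
      logic: next_wakeup and its keep_rescheduling parameter are
      hypothetical, and times are plain integers.

```python
def peek_discard_list(now, eligible_times):
    """Return an eligible time that is already due, or None (simplified)."""
    due = [t for t in eligible_times if t <= now]
    return min(due) if due else None


def next_wakeup(now, eligible_times, keep_rescheduling):
    """Return when the worker runs next, or None if it goes idle."""
    if peek_discard_list(now, eligible_times) is not None:
        return now  # something is due: discard it and run again (simplified)
    if keep_rescheduling and eligible_times:
        # Model of the fix: the lists are not empty, so re-arm the work
        # for the earliest eligible time instead of going to sleep.
        return min(eligible_times)
    return None  # pre-fix behaviour: sleep with no further rescheduling


if __name__ == "__main__":
    # The worker woke up at time 10, but the only block group is eligible at 15.
    print(next_wakeup(10, [15], keep_rescheduling=False))
    print(next_wakeup(10, [15], keep_rescheduling=True))
```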
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tests: initialize test inodes location · 675a4fc8
      Josef Bacik authored
      I noticed that sometimes the module failed to load because the self
      tests failed like this:
      
        BTRFS: selftest: fs/btrfs/tests/inode-tests.c:963 miscount, wanted 1, got 0
      
      This turned out to be because sometimes the btrfs ino would be the btree
      inode number, and thus we'd skip calling the set extent delalloc bit
      helper, and thus not adjust ->outstanding_extents.
      
      Fix this by making sure we initialize test inodes with a valid inode
      number so that we don't get random failures during self tests.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: fix wrong file path when there is an inode with a pending rmdir · 0b3f407e
      Filipe Manana authored
      When doing an incremental send, if we have a new inode that happens to
      have the same number that an old directory inode had in the base snapshot
      and that old directory has a pending rmdir operation, we end up computing
      a wrong path for the new inode, causing the receiver to fail.
      
      Example reproducer:
      
        $ cat test-send-rmdir.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        mkfs.btrfs -f $DEV >/dev/null
        mount $DEV $MNT
      
        mkdir $MNT/dir
        touch $MNT/dir/file1
        touch $MNT/dir/file2
        touch $MNT/dir/file3
      
        # Filesystem looks like:
        #
        # .                                     (ino 256)
        # |----- dir/                           (ino 257)
        #         |----- file1                  (ino 258)
        #         |----- file2                  (ino 259)
        #         |----- file3                  (ino 260)
        #
      
        btrfs subvolume snapshot -r $MNT $MNT/snap1
        btrfs send -f /tmp/snap1.send $MNT/snap1
      
        # Now remove our directory and all its files.
        rm -fr $MNT/dir
      
        # Unmount the filesystem and mount it again. This is to ensure that
        # the next inode that is created ends up with the same inode number
        # that our directory "dir" had, 257, which is the first free "objectid"
        # available after mounting again the filesystem.
        umount $MNT
        mount $DEV $MNT
      
        # Now create a new file (it could be a directory as well).
        touch $MNT/newfile
      
        # Filesystem now looks like:
        #
        # .                                     (ino 256)
        # |----- newfile                        (ino 257)
        #
      
        btrfs subvolume snapshot -r $MNT $MNT/snap2
        btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2
      
        # Now unmount the filesystem, create a new one, mount it and try to apply
        # both send streams to recreate both snapshots.
        umount $DEV
      
        mkfs.btrfs -f $DEV >/dev/null
      
        mount $DEV $MNT
      
        btrfs receive -f /tmp/snap1.send $MNT
        btrfs receive -f /tmp/snap2.send $MNT
      
        umount $MNT
      
      When running the test, the receive operation for the incremental stream
      fails:
      
        $ ./test-send-rmdir.sh
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
        At subvol snap1
        At snapshot snap2
        ERROR: chown o257-9-0 failed: No such file or directory
      
      So fix this by tracking directories that have a pending rmdir by inode
      number and generation number, instead of only inode number.
      
      A test case for fstests follows soon.
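      
      Why keying by inode number alone is ambiguous can be shown with a
      small Python model. The helper names are hypothetical, generation 7
      for the old directory is a made-up value, and generation 9 for the new
      file is read from the name o257-9-0 in the error above, assuming that
      name encodes inode number and generation.

```python
def is_waiting_by_ino(pending, ino):
    """Lookup keyed by inode number only (the behaviour being fixed)."""
    return ino in pending


def is_waiting_by_ino_gen(pending, ino, gen):
    """Lookup keyed by inode number and generation."""
    return (ino, gen) in pending


# Directory "dir" from snap1: inode 257, generation 7 (made-up value).
OLD_DIR = (257, 7)
# "newfile" in snap2 reuses inode number 257 with a different generation.
NEW_FILE = (257, 9)

if __name__ == "__main__":
    # Keyed by inode number, the new file is mistaken for the old directory.
    print(is_waiting_by_ino({OLD_DIR[0]}, NEW_FILE[0]))
    # Keyed by (inode, generation), the two are told apart.
    print(is_waiting_by_ino_gen({OLD_DIR}, *NEW_FILE))
```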
      Reported-by: Massimo B. <massimo.b@gmx.net>
      Tested-by: Massimo B. <massimo.b@gmx.net>
      Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: don't try to wait flushing if we're already holding a transaction · ae5e070e
      Qu Wenruo authored
      There is a chance of racing for qgroup flushing which may lead to
      deadlock:
      
      	Thread A		|	Thread B
         (not holding trans handle)	|  (holding a trans handle)
      --------------------------------+--------------------------------
      __btrfs_qgroup_reserve_meta()   | __btrfs_qgroup_reserve_meta()
      |- try_flush_qgroup()		| |- try_flush_qgroup()
         |- QGROUP_FLUSHING bit set   |    |
         |				|    |- test_and_set_bit()
         |				|    |- wait_event()
         |- btrfs_join_transaction()	|
         |- btrfs_commit_transaction()|
      
      			!!! DEAD LOCK !!!
      
      Since thread A wants to commit transaction, but thread B is holding a
      transaction handle, blocking the commit.
      At the same time, thread B is waiting for thread A to finish its commit.
      
      This is just a hot fix, and would lead to more EDQUOT when we're near
      the qgroup limit.
      
      The proper fix would be to make all metadata/data reservations happen
      without holding a transaction handle.
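      
      The behaviour of the hot fix can be sketched as a simplified Python
      model. This is an assumption-laden illustration, not the kernel code:
      reserve_meta, its parameters and the flush callback are hypothetical.
      It only shows that a caller holding a transaction handle gets EDQUOT
      instead of waiting for a flush.

```python
import errno


def reserve_meta(free, want, holding_trans, flush):
    """Return 0 on success or -EDQUOT; `flush` returns the bytes it released."""
    if want <= free:
        return 0
    if holding_trans:
        # Flushing needs a transaction commit, which our own handle would
        # block, so give up instead of waiting.
        return -errno.EDQUOT
    free += flush()
    return 0 if want <= free else -errno.EDQUOT


if __name__ == "__main__":
    print(reserve_meta(10, 50, holding_trans=True, flush=lambda: 100))
    print(reserve_meta(10, 50, holding_trans=False, flush=lambda: 100))
```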
      
      CC: stable@vger.kernel.org # 5.9+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: correctly calculate item size used when item key collision happens · 9a664971
      ethanwu authored
      Item key collision is allowed for some item types, like dir item and
      inode refs, but the overall item size is limited by the nodesize.
      
      The item size (ins_len) passed from btrfs_insert_empty_items to
      btrfs_search_slot already contains the size of btrfs_item.
      
      When btrfs_search_slot reaches a leaf, it checks whether the leaf needs
      to be split. The check incorrectly reports that a leaf split is
      required, because it treats the space required by the newly inserted
      item as btrfs_item + item data. But in the item key collision case only
      the item data is actually needed, since the newly inserted item can be
      merged into the existing one. No new btrfs_item will be inserted.
      
      split_leaf then returns EOVERFLOW from the following code:
      
        if (extend && data_size + btrfs_item_size_nr(l, slot) +
            sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(fs_info))
            return -EOVERFLOW;
      
      In most cases, when callers receive EOVERFLOW, they either return
      this error or handle it in different ways. For example, in normal dir
      item creation userspace will get errno EOVERFLOW; in the inode ref case
      INODE_EXTREF is used instead.
      
      However, this is not the case for rename. To avoid the unrecoverable
      situation in rename, btrfs_check_dir_item_collision is called in the
      early phase of rename. In this function, when an item key collision is
      detected, the leaf space is checked:
      
        data_size = sizeof(*di) + name_len;
        if (data_size + btrfs_item_size_nr(leaf, slot) +
            sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(root->fs_info))
      
      Here sizeof(struct btrfs_item) + btrfs_item_size_nr(leaf, slot) refers
      to the existing item size, so this condition correctly calculates the
      needed size for the collision case, unlike the wrong check above.
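      
      The gap between the two checks can be illustrated with a small Python
      calculation. The constants are assumptions made for illustration: a
      25-byte struct btrfs_item, a 30-byte struct btrfs_dir_item, and a leaf
      data size of a 16K node minus a 101-byte header. The function names
      are hypothetical models of the two conditions quoted above.

```python
SIZEOF_BTRFS_ITEM = 25         # assumed: 17-byte key plus two 32-bit fields
SIZEOF_DIR_ITEM = 30           # assumed size of struct btrfs_dir_item
LEAF_DATA_SIZE = 16384 - 101   # assumed: 16K nodesize minus the header


def rename_precheck_overflows(existing_size, name_len):
    """Model of the condition in btrfs_check_dir_item_collision."""
    data_size = SIZEOF_DIR_ITEM + name_len
    return data_size + existing_size + SIZEOF_BTRFS_ITEM > LEAF_DATA_SIZE


def split_leaf_overflows(ins_len, existing_size):
    """Model of the split_leaf condition, where the size passed in
    (ins_len) already includes sizeof(struct btrfs_item)."""
    return ins_len + existing_size + SIZEOF_BTRFS_ITEM > LEAF_DATA_SIZE


name_len = 20
data_size = SIZEOF_DIR_ITEM + name_len
# Largest existing item that still passes the rename pre-check.
existing = LEAF_DATA_SIZE - SIZEOF_BTRFS_ITEM - data_size

if __name__ == "__main__":
    print(rename_precheck_overflows(existing, name_len))
    print(split_leaf_overflows(data_size + SIZEOF_BTRFS_ITEM, existing))
```

      In this model the pre-check passes while the later check overflows,
      because sizeof(struct btrfs_item) is counted twice in the second one.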
      
      The consequence of the inconsistent condition checks between
      btrfs_check_dir_item_collision and btrfs_search_slot when an item key
      collision happens is that we might pass the check here but fail
      later at btrfs_search_slot. The rename fails and the volume is forced
      read-only:
      
        [436149.586170] ------------[ cut here ]------------
        [436149.586173] BTRFS: Transaction aborted (error -75)
        [436149.586196] WARNING: CPU: 0 PID: 16733 at fs/btrfs/inode.c:9870 btrfs_rename2+0x1938/0x1b70 [btrfs]
        [436149.586227] CPU: 0 PID: 16733 Comm: python Tainted: G      D           4.18.0-rc5+ #1
        [436149.586228] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
        [436149.586238] RIP: 0010:btrfs_rename2+0x1938/0x1b70 [btrfs]
        [436149.586254] RSP: 0018:ffffa327043a7ce0 EFLAGS: 00010286
        [436149.586255] RAX: 0000000000000000 RBX: ffff8d8a17d13340 RCX: 0000000000000006
        [436149.586256] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8d8a7fc164b0
        [436149.586257] RBP: ffffa327043a7da0 R08: 0000000000000560 R09: 7265282064657472
        [436149.586258] R10: 0000000000000000 R11: 6361736e61725420 R12: ffff8d8a0d4c8b08
        [436149.586258] R13: ffff8d8a17d13340 R14: ffff8d8a33e0a540 R15: 00000000000001fe
        [436149.586260] FS:  00007fa313933740(0000) GS:ffff8d8a7fc00000(0000) knlGS:0000000000000000
        [436149.586261] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [436149.586262] CR2: 000055d8d9c9a720 CR3: 000000007aae0003 CR4: 00000000003606f0
        [436149.586295] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [436149.586296] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [436149.586296] Call Trace:
        [436149.586311]  vfs_rename+0x383/0x920
        [436149.586313]  ? vfs_rename+0x383/0x920
        [436149.586315]  do_renameat2+0x4ca/0x590
        [436149.586317]  __x64_sys_rename+0x20/0x30
        [436149.586324]  do_syscall_64+0x5a/0x120
        [436149.586330]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [436149.586332] RIP: 0033:0x7fa3133b1d37
        [436149.586348] RSP: 002b:00007fffd3e43908 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
        [436149.586349] RAX: ffffffffffffffda RBX: 00007fa3133b1d30 RCX: 00007fa3133b1d37
        [436149.586350] RDX: 000055d8da06b5e0 RSI: 000055d8da225d60 RDI: 000055d8da2c4da0
        [436149.586351] RBP: 000055d8da2252f0 R08: 00007fa313782000 R09: 00000000000177e0
        [436149.586351] R10: 000055d8da010680 R11: 0000000000000246 R12: 00007fa313840b00
      
      Thanks to Hans van Kranenburg for information about crc32 hash collision
      tools, I was able to reproduce the dir item collision with the following
      python script:
      https://github.com/wutzuchieh/misc_tools/blob/master/crc32_forge.py
      Running it on a btrfs volume will trigger the transaction abort. It
      simply creates files and renames them to forged names that lead to a
      hash collision.
      
      There are two ways to fix this. One is to simply revert the patch
      878f2d2c ("Btrfs: fix max dir item size calculation") to make the
      condition consistent, although that patch is correct about the size.
      
      The other way is to handle the leaf space check correctly when a
      collision happens. I prefer the second one since it corrects the leaf
      space check in the collision case. This fix will not account for
      sizeof(struct btrfs_item) when the item already exists.
      There are two places where ins_len doesn't contain
      sizeof(struct btrfs_item), however.
      
        1. extent-tree.c: lookup_inline_extent_backref
        2. file-item.c: btrfs_csum_file_blocks
      
      To make the logic of btrfs_search_slot clearer, we add a flag
      search_for_extension in btrfs_path.
      
      This flag indicates that ins_len passed to btrfs_search_slot doesn't
      contain sizeof(struct btrfs_item). When key exists, btrfs_search_slot
      will use the actual size needed to calculate the required leaf space.
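      
      A minimal Python sketch of the adjusted space calculation, under the
      assumption that struct btrfs_item is 25 bytes. leaf_space_needed is a
      hypothetical helper that mirrors the rule described above; it is not
      the kernel function.

```python
SIZEOF_BTRFS_ITEM = 25  # assumed size of struct btrfs_item


def leaf_space_needed(ins_len, key_exists, search_for_extension):
    """Space the leaf must have free for this insertion.

    ins_len includes sizeof(struct btrfs_item) unless search_for_extension
    is set. When the key already exists, no new item header is inserted,
    so only the item data counts.
    """
    if key_exists and not search_for_extension:
        return ins_len - SIZEOF_BTRFS_ITEM
    return ins_len


if __name__ == "__main__":
    data_size = 50
    # New key: header plus data are needed.
    print(leaf_space_needed(data_size + SIZEOF_BTRFS_ITEM, False, False))
    # Key collision: only the data is needed.
    print(leaf_space_needed(data_size + SIZEOF_BTRFS_ITEM, True, False))
    # Caller already passed the data size only.
    print(leaf_space_needed(data_size, True, True))
```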
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: ethanwu <ethanwu@synology.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix deadlock when cloning inline extent and low on free metadata space · 3d45f221
      Filipe Manana authored
      When cloning an inline extent there are cases where we can not just copy
      the inline extent from the source range to the target range (e.g. when the
      target range starts at an offset greater than zero). In such cases we copy
      the inline extent's data into a page of the destination inode and then
      dirty that page. However, after that we will need to start a transaction
      for each processed extent and, if we are ever low on available metadata
      space, we may need to flush existing delalloc for all dirty inodes in an
      attempt to release metadata space - if that happens we may deadlock:
      
      * the async reclaim task queued a delalloc work to flush delalloc for
        the destination inode of the clone operation;
      
      * the task executing that delalloc work gets blocked waiting for the
        range with the dirty page to be unlocked, which is currently locked
        by the task doing the clone operation;
      
      * the async reclaim task blocks waiting for the delalloc work to complete;
      
      * the cloning task is waiting on the waitqueue of its reservation ticket
        while holding the range with the dirty page locked in the inode's
        io_tree;
      
      * if metadata space is not released by some other task (like delalloc for
        some other inode completing for example), the clone task waits forever
        and as a consequence the delalloc work and async reclaim tasks will hang
        forever as well. Releasing more space on the other hand may require
        starting a transaction, which will hang as well when trying to reserve
        metadata space, resulting in a deadlock between all these tasks.
      
      When this happens, traces like the following show up in dmesg/syslog:
      
        [87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
        [87452.323644]       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        [87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [87452.324852] task:kworker/u16:11  state:D stack:    0 pid:1810830 ppid:     2 flags:0x00004000
        [87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
        [87452.326136] Call Trace:
        [87452.326737]  __schedule+0x5d1/0xcf0
        [87452.327390]  schedule+0x45/0xe0
        [87452.328174]  lock_extent_bits+0x1e6/0x2d0 [btrfs]
        [87452.328894]  ? finish_wait+0x90/0x90
        [87452.329474]  btrfs_invalidatepage+0x32c/0x390 [btrfs]
        [87452.330133]  ? __mod_memcg_state+0x8e/0x160
        [87452.330738]  __extent_writepage+0x2d4/0x400 [btrfs]
        [87452.331405]  extent_write_cache_pages+0x2b2/0x500 [btrfs]
        [87452.332007]  ? lock_release+0x20e/0x4c0
        [87452.332557]  ? trace_hardirqs_on+0x1b/0xf0
        [87452.333127]  extent_writepages+0x43/0x90 [btrfs]
        [87452.333653]  ? lock_acquire+0x1a3/0x490
        [87452.334177]  do_writepages+0x43/0xe0
        [87452.334699]  ? __filemap_fdatawrite_range+0xa4/0x100
        [87452.335720]  __filemap_fdatawrite_range+0xc5/0x100
        [87452.336500]  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
        [87452.337216]  btrfs_work_helper+0xf1/0x600 [btrfs]
        [87452.337838]  process_one_work+0x24e/0x5e0
        [87452.338437]  worker_thread+0x50/0x3b0
        [87452.339137]  ? process_one_work+0x5e0/0x5e0
        [87452.339884]  kthread+0x153/0x170
        [87452.340507]  ? kthread_mod_delayed_work+0xc0/0xc0
        [87452.341153]  ret_from_fork+0x22/0x30
        [87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
        [87452.342487]       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        [87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [87452.344049] task:kworker/u16:1   state:D stack:    0 pid:2426217 ppid:     2 flags:0x00004000
        [87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
        [87452.345655] Call Trace:
        [87452.346305]  __schedule+0x5d1/0xcf0
        [87452.346947]  ? kvm_clock_read+0x14/0x30
        [87452.347676]  ? wait_for_completion+0x81/0x110
        [87452.348389]  schedule+0x45/0xe0
        [87452.349077]  schedule_timeout+0x30c/0x580
        [87452.349718]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [87452.350340]  ? lock_acquire+0x1a3/0x490
        [87452.351006]  ? try_to_wake_up+0x7a/0xa20
        [87452.351541]  ? lock_release+0x20e/0x4c0
        [87452.352040]  ? lock_acquired+0x199/0x490
        [87452.352517]  ? wait_for_completion+0x81/0x110
        [87452.353000]  wait_for_completion+0xab/0x110
        [87452.353490]  start_delalloc_inodes+0x2af/0x390 [btrfs]
        [87452.353973]  btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
        [87452.354455]  flush_space+0x24f/0x660 [btrfs]
        [87452.355063]  btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
        [87452.355565]  process_one_work+0x24e/0x5e0
        [87452.356024]  worker_thread+0x20f/0x3b0
        [87452.356487]  ? process_one_work+0x5e0/0x5e0
        [87452.356973]  kthread+0x153/0x170
        [87452.357434]  ? kthread_mod_delayed_work+0xc0/0xc0
        [87452.357880]  ret_from_fork+0x22/0x30
        (...)
        < stack traces of several tasks waiting for the locks of the inodes of the
          clone operation >
        (...)
        [92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
        [92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
        [92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
        [92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
        [92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
        [92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
        [92867.447361] task:fsstress        state:D stack:    0 pid:2508238 ppid:2508153 flags:0x00004000
        [92867.447920] Call Trace:
        [92867.448435]  __schedule+0x5d1/0xcf0
        [92867.448934]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [92867.449423]  schedule+0x45/0xe0
        [92867.449916]  __reserve_bytes+0x4a4/0xb10 [btrfs]
        [92867.450576]  ? finish_wait+0x90/0x90
        [92867.451202]  btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
        [92867.451815]  btrfs_block_rsv_add+0x1f/0x50 [btrfs]
        [92867.452412]  start_transaction+0x2d1/0x760 [btrfs]
        [92867.453216]  clone_copy_inline_extent+0x333/0x490 [btrfs]
        [92867.453848]  ? lock_release+0x20e/0x4c0
        [92867.454539]  ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
        [92867.455218]  btrfs_clone+0x569/0x7e0 [btrfs]
        [92867.455952]  btrfs_clone_files+0xf6/0x150 [btrfs]
        [92867.456588]  btrfs_remap_file_range+0x324/0x3d0 [btrfs]
        [92867.457213]  do_clone_file_range+0xd4/0x1f0
        [92867.457828]  vfs_clone_file_range+0x4d/0x230
        [92867.458355]  ? lock_release+0x20e/0x4c0
        [92867.458890]  ioctl_file_clone+0x8f/0xc0
        [92867.459377]  do_vfs_ioctl+0x342/0x750
        [92867.459913]  __x64_sys_ioctl+0x62/0xb0
        [92867.460377]  do_syscall_64+0x33/0x80
        [92867.460842]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        (...)
        < stack traces of more tasks blocked on metadata reservation like the clone
          task above, because the async reclaim task has deadlocked >
        (...)
      
      Another thing to notice is that the worker task that is deadlocked when
      trying to flush the destination inode of the clone operation is at
      btrfs_invalidatepage(). This is simply because the clone operation has a
      destination offset greater than the i_size and we only update the i_size
      of the destination file after cloning an extent (just like we do in the
      buffered write path).
      
      Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
      the flushing of delalloc for all inodes that have delalloc, add a runtime
      flag to an inode to signal it should not be flushed, and for inodes with
      that flag set, start_delalloc_inodes() will simply skip them. When the
      cloning code needs to dirty a page to copy an inline extent, set that flag
      on the inode and then clear it when the clone operation finishes.
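      
      The skip logic can be sketched in Python as follows. This is a
      user-space model with hypothetical names (Inode, no_delalloc_flush,
      cloning_inline_extent); only start_delalloc_inodes is named after the
      kernel function it loosely imitates.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class Inode:
    ino: int
    has_delalloc: bool = False
    no_delalloc_flush: bool = False  # hypothetical name for the runtime flag


def start_delalloc_inodes(inodes):
    """Return the inode numbers that would be flushed; flagged ones are skipped."""
    return [i.ino for i in inodes if i.has_delalloc and not i.no_delalloc_flush]


@contextmanager
def cloning_inline_extent(dst):
    """Flag dst while the clone holds its range locked, then clear the flag."""
    dst.no_delalloc_flush = True
    dst.has_delalloc = True  # the copied inline data dirtied a page
    try:
        yield dst
    finally:
        dst.no_delalloc_flush = False


if __name__ == "__main__":
    other = Inode(257, has_delalloc=True)
    dst = Inode(258)
    with cloning_inline_extent(dst):
        print(start_delalloc_inodes([other, dst]))  # dst is skipped
    print(start_delalloc_inodes([other, dst]))      # dst is flushed again
```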
      
      This could be sporadically triggered with test case generic/269 from
      fstests, which exercises many fsstress processes running in parallel with
      several dd processes filling up the entire filesystem.
      
      CC: stable@vger.kernel.org # 5.9+
      Fixes: 05a5a762 ("Btrfs: implement full reflink support for inline extents")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 09 Dec, 2020 20 commits
    • btrfs: scrub: allow scrub to work with subpage sectorsize · b42fe98c
      Qu Wenruo authored
      Since btrfs scrub is utilizing its own infrastructure to submit
      read/write, scrub is independent from all other routines.
      
      This brings one very neat feature: it allows us to read 4K data into
      offset 0 of a 64K page. The same applies to the writeback routine.
      
      This makes scrub on subpage sector size much easier to implement, and
      thanks to previous commits which just changed the implementation to
      always do scrub based on sector size, now scrub can handle subpage
      filesystem without any problem.
      
      This patch will just remove the restriction on
      (sectorsize != PAGE_SIZE), to make scrub finally work on subpage
      filesystems.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b42fe98c
    • btrfs: scrub: support subpage data scrub · b29dca44
      Qu Wenruo authored
      Btrfs scrub is more flexible than the buffered data write path, as we
      can read unaligned subpage data into page offset 0.
      
      This ability makes subpage support much easier: we just need to check
      each scrub_page::page_len and ensure we only calculate the hash for
      [0, page_len) of a page.
      
      There is a small thing to notice: for the subpage case, we still scrub
      sector by sector.  This means we will submit a read bio for each sector
      to scrub, resulting in the same number of read bios as on 4K page
      systems.
      
      This behavior can be considered a good thing if we want everything
      to be the same as on 4K page systems.  But it also means we're wasting
      the possibility to submit larger bios using the 64K page size.  This is
      another problem to consider in the future.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b29dca44
    • btrfs: scrub: support subpage tree block scrub · 53f3251d
      Qu Wenruo authored
      To support subpage tree block scrub, scrub_checksum_tree_block() only
      needs to learn 2 new tricks:
      
      - Follow sector size
        Now scrub_page only represents one sector, we need to follow it
        properly.
      
      - Run checksum on all sectors
        Since scrub_page only represents one sector, we need to run checksum
        on all sectors, not only (nodesize >> PAGE_SHIFT).
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      53f3251d
    • btrfs: scrub: always allocate one full page for one sector for RAID56 · d0a7a9c0
      Qu Wenruo authored
      For scrub_pages() and scrub_pages_for_parity(), we currently allocate
      one scrub_page structure for one page.
      
      This is fine if we only read/write one sector at a time.  But for cases
      like scrubbing RAID56, we need to read/write the full stripe, which is
      64K in size for now.
      
      For subpage size, we would submit the read in just one page, which is
      normally a good thing, but the RAID56 endio function only expects to
      see one sector, not the full stripe.
      This could lead to a wrong parity checksum for RAID56 on subpage.
      
      To make the existing code work well for the subpage case, here we take a
      shortcut by always allocating a full page for one sector.
      
      This should provide the base to make RAID56 work for the subpage case.
      
      The cost is pretty obvious: for one RAID56 stripe we now always need
      16 pages. For the supported subpage situation (64K page size, 4K sector
      size), this means we need a full megabyte to scrub just one RAID56 stripe.
      
      And for data scrub, each 4K sector will also need one 64K page.
      
      This is mostly just a workaround. The proper fix is a much larger
      project: using scrub_block to replace scrub_page, and allowing
      scrub_block to handle multiple pages, csums, and a csum_bitmap to avoid
      allocating one page for each sector.
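
      The memory cost described in this message can be checked with a little
      arithmetic. The helper below is only an illustrative sketch:
      scrub_stripe_bytes() and its parameters are hypothetical names and do
      not exist in the kernel.

```c
#include <stdint.h>

/*
 * Hypothetical helper: bytes of page memory needed to scrub one stripe
 * when every sector gets its own full page.
 */
static uint64_t scrub_stripe_bytes(uint32_t stripe_len, uint32_t sectorsize,
				   uint32_t page_size)
{
	uint32_t nr_sectors = stripe_len / sectorsize;	/* one page per sector */

	return (uint64_t)nr_sectors * page_size;
}
```

      For a 64K stripe with 4K sectors this is 16 pages, so 64K of memory on
      a 4K page system and 1M on a 64K page system.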
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d0a7a9c0
    • btrfs: scrub: reduce width of extent_len/stripe_len from 64 to 32 bits · fa485d21
      Qu Wenruo authored
      Btrfs on-disk format chose to use u64 for almost everything, but there
      are other restrictions that won't let us use more than u32 for things
      like extent length (the maximum length is 128MiB for non-hole extents),
      or stripe length (we have a device number limit).
      
      This means if we don't have extra handling to convert u64 to u32, we
      will always have some questionable operations like
      "u32 = u64 >> sectorsize_bits" in the code.
      
      This patch will try to address the problem by reducing the width for the
      following members/parameters:
      
      - scrub_parity::stripe_len
      - @len of scrub_pages()
      - @extent_len of scrub_remap_extent()
      - @len of scrub_parity_mark_sectors_error()
      - @len of scrub_parity_mark_sectors_data()
      - @len of scrub_extent()
      - @len of scrub_pages_for_parity()
      - @len of scrub_extent_for_parity()
      
      Members extracted from on-disk structures, like map->stripe_len, will
      be kept as is, since that modification would require an on-disk format
      change.
      
      There will be cases like "u32 = u64 - u64" or "u32 = u64"; for such call
      sites, an extra ASSERT() is added to be extra safe on debug builds.
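
      The narrowing pattern described here can be sketched in user space as
      follows. This is a minimal illustration, not the kernel code:
      narrow_len() and range_to_sectors() are hypothetical names, and the
      standard assert() stands in for the kernel's ASSERT().

```c
#include <assert.h>
#include <stdint.h>

/* Narrow a 64-bit length to 32 bits, checking the range on debug builds. */
static uint32_t narrow_len(uint64_t len)
{
	assert(len <= UINT32_MAX);
	return (uint32_t)len;
}

/* "u32 = u64 - u64" call site: the difference is checked before narrowing. */
static uint32_t range_to_sectors(uint64_t start, uint64_t end,
				 uint32_t sectorsize_bits)
{
	assert(end >= start);
	return narrow_len(end - start) >> sectorsize_bits;
}
```

      Once the value is a checked u32, the shift by sectorsize_bits no longer
      mixes widths.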
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fa485d21
    • btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs · 6275193e
      Qu Wenruo authored
      Refactor btrfs_lookup_bio_sums() by:
      
      - Remove the @file_offset parameter
        There are two factors making the @file_offset parameter useless:
      
        * For csum lookup in csum tree, file offset makes no sense
          We only need disk_bytenr, which is unrelated to file_offset
      
        * page_offset (file offset) of each bvec is not contiguous.
          Pages can be added to the same bio as long as their on-disk bytenr
          is contiguous, meaning we could have pages at different file offsets
          in the same bio.
      
        Thus passing file_offset makes no sense any more.
        The only user of file_offset is for data reloc inode, we will use
        a new function, search_file_offset_in_bio(), to handle it.
      
      - Extract the csum tree lookup into search_csum_tree()
        The new function will handle the csum search in csum tree.
        The return value is the same as btrfs_find_ordered_sum(), returning
        the number of found sectors which have checksum.
      
      - Change how we do the main loop
        The only needed info from bio is:
        * the on-disk bytenr
        * the length
      
        After extracting the above info, we can do the search without bio
        at all, which makes the main loop much simpler:
      
      	for (cur_disk_bytenr = orig_disk_bytenr;
      	     cur_disk_bytenr < orig_disk_bytenr + orig_len;
      	     cur_disk_bytenr += count * sectorsize) {
      
      		/* Lookup csum tree */
      		count = search_csum_tree(fs_info, path, cur_disk_bytenr,
      					 search_len, csum_dst);
      		if (!count) {
      			/* Csum hole handling */
      		}
      	}
      
      - Use a single variable as the source to calculate all other offsets
        Instead of several variables of different types, we use only one main
        variable, cur_disk_bytenr, which represents the current disk bytenr.
      
        All involved values can be calculated from that variable, and
        all those variables will only be visible in the inner loop.
      
      The above refactoring makes btrfs_lookup_bio_sums() way more robust than
      it used to be, especially regarding the file offset lookup.  Now the
      file_offset lookup is only needed for the data reloc inode; otherwise
      we don't need to bother with file_offset at all.
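
      To make the loop shape concrete, here is a small user-space model of it.
      Everything in it is an assumption made for illustration: struct
      csum_item_sketch, search_csum_sketch() and lookup_sums_sketch() are
      hypothetical stand-ins for the csum tree and its search, the checksums
      are fake fill bytes, and treating a hole as one zero-filled sector is a
      choice of this sketch rather than something stated in this message.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_SECTORSIZE 4096u
#define SKETCH_CSUM_SIZE  4u

/* Hypothetical stand-in for one csum item covering a run of sectors. */
struct csum_item_sketch {
	uint64_t bytenr;	/* disk bytenr of the first covered sector */
	uint32_t nr_sectors;	/* number of covered sectors */
	uint8_t fill;		/* fake checksum byte for every sector */
};

/* Return how many sectors starting at disk_bytenr have csums, 0 on a hole. */
static uint32_t search_csum_sketch(const struct csum_item_sketch *items,
				   size_t nr_items, uint64_t disk_bytenr,
				   uint64_t len, uint8_t *dst)
{
	for (size_t i = 0; i < nr_items; i++) {
		uint64_t end = items[i].bytenr +
			       (uint64_t)items[i].nr_sectors * SKETCH_SECTORSIZE;

		if (disk_bytenr < items[i].bytenr || disk_bytenr >= end)
			continue;

		uint64_t avail = (end - disk_bytenr) / SKETCH_SECTORSIZE;
		uint64_t want = len / SKETCH_SECTORSIZE;
		uint32_t count = (uint32_t)(avail < want ? avail : want);

		memset(dst, items[i].fill, (size_t)count * SKETCH_CSUM_SIZE);
		return count;
	}
	return 0;
}

/* Fill csums for [orig_disk_bytenr, +orig_len); return the hole count. */
static uint32_t lookup_sums_sketch(const struct csum_item_sketch *items,
				   size_t nr_items, uint64_t orig_disk_bytenr,
				   uint32_t orig_len, uint8_t *csums)
{
	uint32_t holes = 0;
	uint32_t count = 0;

	for (uint64_t cur = orig_disk_bytenr;
	     cur < orig_disk_bytenr + orig_len;
	     cur += (uint64_t)count * SKETCH_SECTORSIZE) {
		/* Everything else is derived from the current disk bytenr. */
		uint64_t search_len = orig_disk_bytenr + orig_len - cur;
		uint8_t *dst = csums + (cur - orig_disk_bytenr) /
				       SKETCH_SECTORSIZE * SKETCH_CSUM_SIZE;

		count = search_csum_sketch(items, nr_items, cur, search_len, dst);
		if (!count) {
			/* Csum hole: zero one sector's csum and move on. */
			memset(dst, 0, SKETCH_CSUM_SIZE);
			count = 1;
			holes++;
		}
	}
	return holes;
}
```

      No bio or file offset appears anywhere in the loop; only the disk
      bytenr and the length are needed.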
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6275193e
    • btrfs: remove btrfs_find_ordered_sum call from btrfs_lookup_bio_sums · 9e46458a
      Qu Wenruo authored
      The function btrfs_lookup_bio_sums() is only called for read bios,
      while btrfs_find_ordered_sum() searches ordered extent sums, which
      only exist in the write path.
      
      This means that when reading a page, either:
      
      - We submit a read bio because the page is not uptodate
        This means we only need to search csum tree for checksums.
      
      - The page is already uptodate
        It was marked uptodate either by a previous read or when it was
        marked dirty, as we always mark a dirty page uptodate.
        In that case, we don't need to submit read bio at all, thus no need
        to search any checksums.
      
      Remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums().
      And since btrfs_lookup_bio_sums() is the only caller for
      btrfs_find_ordered_sum(), also remove the implementation.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9e46458a
    • btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors · 884b07d0
      Qu Wenruo authored
      To support sectorsize < PAGE_SIZE case, we need to take extra care of
      extent buffer accessors.
      
      Since sectorsize is smaller than PAGE_SIZE, one page can contain
      multiple tree blocks, we must use eb->start to determine the real offset
      to read/write for extent buffer accessors.
      
      This patch introduces two helpers to do this:
      
      - get_eb_page_index()
        This is to calculate the index to access extent_buffer::pages.
        It's just a simple wrapper around "start >> PAGE_SHIFT".
      
        For sectorsize == PAGE_SIZE case, nothing is changed.
        For sectorsize < PAGE_SIZE case, we always get index as 0, and
        the existing page shift also works.
      
      - get_eb_offset_in_page()
        This is to calculate the offset to access extent_buffer::pages.
        This needs to take extent_buffer::start into consideration.
      
        For sectorsize == PAGE_SIZE case, extent_buffer::start is always
        aligned to PAGE_SIZE, thus adding extent_buffer::start to
        offset_in_page() won't change the result.
        For sectorsize < PAGE_SIZE case, adding extent_buffer::start gives
        us the correct offset to access.
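
      The two calculations can be sketched in user space like this. It is an
      illustration under stated assumptions, not the kernel helpers
      themselves: struct eb_sketch is a hypothetical stand-in for
      extent_buffer, and the page shift is passed as a parameter so that both
      page sizes can be tried in one program.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two extent_buffer fields used here. */
struct eb_sketch {
	uint64_t start;	/* logical bytenr of the tree block */
	uint32_t len;	/* nodesize */
};

/* Index into the page array: a plain shift by the page shift. */
static unsigned long eb_page_index(unsigned long offset, unsigned int page_shift)
{
	return offset >> page_shift;
}

/* Offset inside that page, taking the tree block's start into account. */
static size_t eb_offset_in_page(const struct eb_sketch *eb,
				unsigned long offset, unsigned int page_shift)
{
	uint64_t page_mask = (UINT64_C(1) << page_shift) - 1;

	return (size_t)((eb->start + offset) & page_mask);
}
```

      For a 16K tree block that starts 16K into a 64K page, offset 5000 maps
      to page index 0 and in-page offset 16384 + 5000. With 4K pages the
      same offset maps to page index 1 and in-page offset 904, exactly as it
      would without adding the start.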
      
      This patch will touch the following parts to cover all extent buffer
      accessors:
      
      - BTRFS_SETGET_HEADER_FUNCS()
      - read_extent_buffer()
      - read_extent_buffer_to_user()
      - memcmp_extent_buffer()
      - write_extent_buffer_chunk_tree_uuid()
      - write_extent_buffer_fsid()
      - write_extent_buffer()
      - memzero_extent_buffer()
      - copy_extent_buffer_full()
      - copy_extent_buffer()
      - memcpy_extent_buffer()
      - memmove_extent_buffer()
      - btrfs_get_token_##bits()
      - btrfs_get_##bits()
      - btrfs_set_token_##bits()
      - btrfs_set_##bits()
      - generic_bin_search()
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      884b07d0
    • btrfs: update num_extent_pages to support subpage sized extent buffer · 4a3dc938
      Qu Wenruo authored
      For subpage sized extent buffer, we have ensured no extent buffer will
      cross page boundary, thus we would only need one page for any extent
      buffer.
      
      Update function num_extent_pages to handle such case.  Now
      num_extent_pages() returns 1 for subpage sized extent buffer.
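
      A minimal sketch of the described behavior, assuming the page count is
      derived by shifting the buffer length; eb_num_pages() is a hypothetical
      user-space name and the page shift is a parameter only so both cases
      can be shown.

```c
#include <stdint.h>

/* Number of pages backing an extent buffer of eb_len bytes. */
static unsigned int eb_num_pages(uint32_t eb_len, unsigned int page_shift)
{
	unsigned int nr = eb_len >> page_shift;

	/* A subpage sized buffer shifts down to 0 but still lives in one page. */
	return nr ? nr : 1;
}
```

      A 16K buffer therefore needs 4 pages with 4K pages and 1 page with 64K
      pages.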
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4a3dc938
    • btrfs: don't allow tree block to cross page boundary for subpage support · 1aaac38c
      Qu Wenruo authored
      As a preparation for subpage sector size support (allowing filesystem
      with sector size smaller than page size to be mounted) if the sector
      size is smaller than page size, we don't allow tree block to be read if
      it crosses 64K(*) boundary.
      
      The 64K is selected because:
      
      - we are only going to support 64K page size for subpage for now
      - 64K is also the maximum supported node size
      
      This ensures that tree blocks are always contained in one page for a
      system with 64K page size, which can greatly simplify the handling.
      
      Otherwise we would have to do complex multi-page handling of tree
      blocks.  Currently there is no way to create such tree blocks.
      
      In the kernel we have avoided allocating such tree blocks even on 4K
      page size, as they can cause problems for RAID56 stripe scrubbing.
      
      btrfs-progs has also fixed its chunk allocator for convert since 2016,
      and has extra checks to match the kernel behavior.
      
      So just add these checks to fail gracefully on an ancient filesystem.
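
      One way to express the boundary test is shown below. It is a sketch of
      the idea only: tree_block_crosses_64k() is a hypothetical name, and the
      actual check in the patch may be written differently.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_BOUNDARY 65536u	/* 64K, as chosen in the commit message */

/* True if [start, start + nodesize) spans more than one 64K unit. */
static bool tree_block_crosses_64k(uint64_t start, uint32_t nodesize)
{
	uint64_t first = start / SKETCH_BOUNDARY;
	uint64_t last = (start + nodesize - 1) / SKETCH_BOUNDARY;

	return first != last;
}
```

      A 16K tree block at 48K fits exactly in the first 64K unit, while the
      same block at 56K straddles the boundary and would be rejected.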
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1aaac38c
    • btrfs: calculate inline extent buffer page size based on page size · deb67895
      Qu Wenruo authored
      Btrfs only supports 64K as the maximum node size, thus on a 4K page
      system we would have at most 16 pages for one extent buffer.
      
      On a system using 64K page size, we would only ever have one page.
      
      But we always use 16 pages for extent_buffer::pages, which means that
      on systems using 64K pages we are wasting memory for 15 page pointers
      which will never be used.
      
      Calculate the array size based on page size and the node size maximum.
      
      - for systems using 4K page size, it will stay 16 pages
      - for systems using 64K page size, it will be 1 page
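
      The calculation is a single division. The sketch below uses a
      hypothetical function, inline_eb_pages(), and a local constant for the
      64K maximum; in the kernel this would be a compile-time constant rather
      than a function.

```c
#include <stdint.h>

#define SKETCH_MAX_METADATA_BLOCKSIZE 65536u	/* 64K maximum node size */

/* Page pointers needed to hold the largest possible extent buffer. */
static unsigned int inline_eb_pages(uint32_t page_size)
{
	return SKETCH_MAX_METADATA_BLOCKSIZE / page_size;
}
```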
      
      Move the definition of BTRFS_MAX_METADATA_BLOCKSIZE to btrfs_tree.h, to
      avoid circular inclusion of ctree.h.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      deb67895
    • btrfs: factor out btree page submission code to a helper · f91e0d0c
      Qu Wenruo authored
      In btree_write_cache_pages() we have a btree page submission routine
      buried deeply in a nested loop.
      
      This patch will extract that part of code into a helper function,
      submit_eb_page(), to do the same work.
      
      Since submit_eb_page() now can return >0 for successful extent
      buffer submission, remove the "ASSERT(ret <= 0);" line.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f91e0d0c
    • btrfs: make btrfs_verify_data_csum follow sector size · f44cf410
      Qu Wenruo authored
      Currently btrfs_verify_data_csum() just passes the whole page to
      check_data_csum(), which is fine since we only support sectorsize ==
      PAGE_SIZE.
      
      To support subpage, we need to properly honor per-sector
      checksum verification, just like what we did in the dio read path.
      
      This patch does the csum verification in a for loop that starts
      with pg_off == start - page_offset(page) and advances pg_off by
      sectorsize on each iteration.
      
      For the sectorsize == PAGE_SIZE case, pg_off will always be 0, and we
      will only loop once.
      
      For the subpage case, we iterate over each sector, and if we find any
      error, we return an error.
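
      The loop can be modeled in user space as below. This is a sketch under
      stated assumptions: toy_csum() is a made-up checksum (not crc32c),
      verify_page_range() is a hypothetical name, and the page is a plain
      byte buffer whose file offset is passed in as page_off.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy checksum for illustration only, NOT the one btrfs uses. */
static uint32_t toy_csum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31u + data[i];
	return sum;
}

/*
 * Verify every sector of the file range [start, end] (end inclusive) that
 * lives in the page at file offset page_off.  expected[] holds one csum
 * per sector, starting with the sector at start.
 * Returns 0 if all sectors match, -1 on the first mismatch.
 */
static int verify_page_range(const uint8_t *page, uint64_t page_off,
			     uint64_t start, uint64_t end,
			     uint32_t sectorsize, const uint32_t *expected)
{
	uint32_t i = 0;

	for (uint64_t pg_off = start - page_off;
	     pg_off < end + 1 - page_off;
	     pg_off += sectorsize, i++) {
		if (toy_csum(page + pg_off, sectorsize) != expected[i])
			return -1;
	}
	return 0;
}
```

      When sectorsize equals the page size, the range covers the whole page
      and the loop body runs exactly once with pg_off == 0.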
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f44cf410
    • btrfs: pass bio_offset to check_data_csum() directly · 7ffd27e3
      Qu Wenruo authored
      Parameter icsum for check_data_csum() is a little hard to understand.
      So is the phy_offset for btrfs_verify_data_csum().
      
      Both parameters are calculated values for csum lookup.
      
      Instead of some calculated value, just pass bio_offset and let the
      final and only user, check_data_csum(), calculate whatever it needs.
      
      While we are here, also change the bio_offset parameter and some
      related variables to u32 (unsigned int).
      The bio size is limited by bi_size, which is an unsigned int, and
      there are extra size limit checks during various bio operations.
      Thus we can be sure that bio_offset won't overflow u32.
      
      Thus for all involved functions, not only rename the parameter from
      @phy_offset to @bio_offset, but also reduce its width to u32, so we
      won't have suspicious "u32 = u64 >> sector_bits;" lines anymore.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7ffd27e3
    • btrfs: rename bio_offset of extent_submit_bio_start_t to dio_file_offset · 1941b64b
      Qu Wenruo authored
      The parameter bio_offset of extent_submit_bio_start_t is very confusing.
      If it's really bio_offset (offset to bio), then it should be u32.  But
      in fact, it's only utilized by dio read, and that member is used as file
      offset, which must be u64.
      
      Rename it to dio_file_offset since the only user uses it as file offset,
      and add comment for who is using it.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1941b64b
    • btrfs: fix lockdep warning when creating free space tree · 8a6a87cd
      Boris Burkov authored
      A lock dependency loop exists between the root tree lock, the extent tree
      lock, and the free space tree lock.
      
      The root tree lock depends on the free space tree lock because
      btrfs_create_tree holds the new tree's lock while adding it to the root
      tree.
      
      The extent tree lock depends on the root tree lock because during
      umount, we write out space cache v1, which writes inodes in the root
      tree, which results in holding the root tree lock while doing a lookup
      in the extent tree.
      
      Finally, the free space tree depends on the extent tree because
      populate_free_space_tree holds a locked path in the extent tree and then
      does a lookup in the free space tree to add the new item.
      
      The simplest of the three to break is the one during tree creation: we
      unlock the leaf before inserting the tree node into the root tree, which
      fixes the lockdep warning.
      
        [30.480136] ======================================================
        [30.480830] WARNING: possible circular locking dependency detected
        [30.481457] 5.9.0-rc8+ #76 Not tainted
        [30.481897] ------------------------------------------------------
        [30.482500] mount/520 is trying to acquire lock:
        [30.483064] ffff9babebe03908 (btrfs-free-space-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
        [30.484054]
      	      but task is already holding lock:
        [30.484637] ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
        [30.485581]
      	      which lock already depends on the new lock.
      
        [30.486397]
      	      the existing dependency chain (in reverse order) is:
        [30.487205]
      	      -> #2 (btrfs-extent-01#2){++++}-{3:3}:
        [30.487825]        down_read_nested+0x43/0x150
        [30.488306]        __btrfs_tree_read_lock+0x39/0x180
        [30.488868]        __btrfs_read_lock_root_node+0x3a/0x50
        [30.489477]        btrfs_search_slot+0x464/0x9b0
        [30.490009]        check_committed_ref+0x59/0x1d0
        [30.490603]        btrfs_cross_ref_exist+0x65/0xb0
        [30.491108]        run_delalloc_nocow+0x405/0x930
        [30.491651]        btrfs_run_delalloc_range+0x60/0x6b0
        [30.492203]        writepage_delalloc+0xd4/0x150
        [30.492688]        __extent_writepage+0x18d/0x3a0
        [30.493199]        extent_write_cache_pages+0x2af/0x450
        [30.493743]        extent_writepages+0x34/0x70
        [30.494231]        do_writepages+0x31/0xd0
        [30.494642]        __filemap_fdatawrite_range+0xad/0xe0
        [30.495194]        btrfs_fdatawrite_range+0x1b/0x50
        [30.495677]        __btrfs_write_out_cache+0x40d/0x460
        [30.496227]        btrfs_write_out_cache+0x8b/0x110
        [30.496716]        btrfs_start_dirty_block_groups+0x211/0x4e0
        [30.497317]        btrfs_commit_transaction+0xc0/0xba0
        [30.497861]        sync_filesystem+0x71/0x90
        [30.498303]        btrfs_remount+0x81/0x433
        [30.498767]        reconfigure_super+0x9f/0x210
        [30.499261]        path_mount+0x9d1/0xa30
        [30.499722]        do_mount+0x55/0x70
        [30.500158]        __x64_sys_mount+0xc4/0xe0
        [30.500616]        do_syscall_64+0x33/0x40
        [30.501091]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [30.501629]
      	      -> #1 (btrfs-root-00){++++}-{3:3}:
        [30.502241]        down_read_nested+0x43/0x150
        [30.502727]        __btrfs_tree_read_lock+0x39/0x180
        [30.503291]        __btrfs_read_lock_root_node+0x3a/0x50
        [30.503903]        btrfs_search_slot+0x464/0x9b0
        [30.504405]        btrfs_insert_empty_items+0x60/0xa0
        [30.504973]        btrfs_insert_item+0x60/0xd0
        [30.505412]        btrfs_create_tree+0x1b6/0x210
        [30.505913]        btrfs_create_free_space_tree+0x54/0x110
        [30.506460]        btrfs_mount_rw+0x15d/0x20f
        [30.506937]        btrfs_remount+0x356/0x433
        [30.507369]        reconfigure_super+0x9f/0x210
        [30.507868]        path_mount+0x9d1/0xa30
        [30.508264]        do_mount+0x55/0x70
        [30.508668]        __x64_sys_mount+0xc4/0xe0
        [30.509186]        do_syscall_64+0x33/0x40
        [30.509652]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [30.510271]
      	      -> #0 (btrfs-free-space-00){++++}-{3:3}:
        [30.510972]        __lock_acquire+0x11ad/0x1b60
        [30.511432]        lock_acquire+0xa2/0x360
        [30.511917]        down_read_nested+0x43/0x150
        [30.512383]        __btrfs_tree_read_lock+0x39/0x180
        [30.512947]        __btrfs_read_lock_root_node+0x3a/0x50
        [30.513455]        btrfs_search_slot+0x464/0x9b0
        [30.513947]        search_free_space_info+0x45/0x90
        [30.514465]        __add_to_free_space_tree+0x92/0x39d
        [30.515010]        btrfs_create_free_space_tree.cold.22+0x1ee/0x45d
        [30.515639]        btrfs_mount_rw+0x15d/0x20f
        [30.516142]        btrfs_remount+0x356/0x433
        [30.516538]        reconfigure_super+0x9f/0x210
        [30.517065]        path_mount+0x9d1/0xa30
        [30.517438]        do_mount+0x55/0x70
        [30.517824]        __x64_sys_mount+0xc4/0xe0
        [30.518293]        do_syscall_64+0x33/0x40
        [30.518776]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [30.519335]
      	      other info that might help us debug this:
      
        [30.520210] Chain exists of:
      		btrfs-free-space-00 --> btrfs-root-00 --> btrfs-extent-01#2
      
        [30.521407]  Possible unsafe locking scenario:
      
        [30.522037]        CPU0                    CPU1
        [30.522456]        ----                    ----
        [30.522941]   lock(btrfs-extent-01#2);
        [30.523311]                                lock(btrfs-root-00);
        [30.523952]                                lock(btrfs-extent-01#2);
        [30.524620]   lock(btrfs-free-space-00);
        [30.525068]
      	       *** DEADLOCK ***
      
        [30.525669] 5 locks held by mount/520:
        [30.526116]  #0: ffff9babebc520e0 (&type->s_umount_key#37){+.+.}-{3:3}, at: path_mount+0x7ef/0xa30
        [30.527056]  #1: ffff9babebc52640 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x3d5/0x5c0
        [30.527960]  #2: ffff9babeae8f2e8 (&cache->free_space_lock#2){+.+.}-{3:3}, at: btrfs_create_free_space_tree.cold.22+0x101/0x45d
        [30.529118]  #3: ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
        [30.530113]  #4: ffff9babebd52eb8 (btrfs-extent-00){++++}-{3:3}, at: btrfs_try_tree_read_lock+0x16/0x100
        [30.531124]
      	      stack backtrace:
        [30.531528] CPU: 0 PID: 520 Comm: mount Not tainted 5.9.0-rc8+ #76
        [30.532166] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-4.module_el8.1.0+248+298dec18 04/01/2014
        [30.533215] Call Trace:
        [30.533452]  dump_stack+0x8d/0xc0
        [30.533797]  check_noncircular+0x13c/0x150
        [30.534233]  __lock_acquire+0x11ad/0x1b60
        [30.534667]  lock_acquire+0xa2/0x360
        [30.535063]  ? __btrfs_tree_read_lock+0x39/0x180
        [30.535525]  down_read_nested+0x43/0x150
        [30.535939]  ? __btrfs_tree_read_lock+0x39/0x180
        [30.536400]  __btrfs_tree_read_lock+0x39/0x180
        [30.536862]  __btrfs_read_lock_root_node+0x3a/0x50
        [30.537304]  btrfs_search_slot+0x464/0x9b0
        [30.537713]  ? trace_hardirqs_on+0x1c/0xf0
        [30.538148]  search_free_space_info+0x45/0x90
        [30.538572]  __add_to_free_space_tree+0x92/0x39d
        [30.539071]  ? printk+0x48/0x4a
        [30.539367]  btrfs_create_free_space_tree.cold.22+0x1ee/0x45d
        [30.539972]  btrfs_mount_rw+0x15d/0x20f
        [30.540350]  btrfs_remount+0x356/0x433
        [30.540773]  ? shrink_dcache_sb+0xd9/0x100
        [30.541203]  reconfigure_super+0x9f/0x210
        [30.541642]  path_mount+0x9d1/0xa30
        [30.542040]  do_mount+0x55/0x70
        [30.542366]  __x64_sys_mount+0xc4/0xe0
        [30.542822]  do_syscall_64+0x33/0x40
        [30.543197]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [30.543691] RIP: 0033:0x7f109f7ab93a
        [30.546042] RSP: 002b:00007ffc47c4f858 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
        [30.546770] RAX: ffffffffffffffda RBX: 00007f109f8cf264 RCX: 00007f109f7ab93a
        [30.547485] RDX: 0000557e6fc10770 RSI: 0000557e6fc19cf0 RDI: 0000557e6fc19cd0
        [30.548185] RBP: 0000557e6fc10520 R08: 0000557e6fc18e30 R09: 0000557e6fc18cb0
        [30.548911] R10: 0000000000200020 R11: 0000000000000246 R12: 0000000000000000
        [30.549606] R13: 0000557e6fc19cd0 R14: 0000557e6fc10770 R15: 0000557e6fc10520
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8a6a87cd
    • btrfs: skip space_cache v1 setup when not using it · af456a2c
      Boris Burkov authored
      If we are not using space cache v1, we should not create the free space
      object or free space inodes. This comes up when we delete the existing
      free space objects/inodes when migrating to v2, only to see them get
      recreated for every dirtied block group.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      af456a2c
    • btrfs: remove free space items when disabling space cache v1 · 36b216c8
      Boris Burkov authored
      When the filesystem transitions from space cache v1 to v2 or to
      nospace_cache, it removes the old cached data, but does not remove
      the FREE_SPACE items nor the free space inodes they point to. This
      doesn't cause any issues besides being a bit inefficient, since these
      items no longer do anything useful.
      
      To fix it, when we are mounting, and plan to disable the space cache,
      destroy each block group's free space item and free space inode.
      The code to remove the items is lifted from the existing use case of
      removing the block group, with a light adaptation to handle whether or
      not we have already looked up the free space inode.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      36b216c8
    • btrfs: warn when remount will not change the free space tree · 2838d255
      Boris Burkov authored
      If the remount is ro->ro, rw->ro, or rw->rw, we will not create or
      clear the free space tree. This can be surprising, so print a warning
      to dmesg to make the failure more visible. It is also important to
      ensure that the space cache options (SPACE_CACHE, FREE_SPACE_TREE) are
      consistent, so ensure those are set to properly match the current on
      disk state (which won't be changing).
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2838d255
    • btrfs: use superblock state to print space_cache mount option · 04c41559
      Boris Burkov authored
      To make the contents of /proc/mounts better match the actual state of
      the filesystem, base the display of the space cache mount options off
      the contents of the super block rather than the last mount options
      passed in. Since there are many scenarios where the mount will ignore a
      space cache option, simply showing the passed in option is misleading.
      
      For example, if we mount with -o remount,space_cache=v2 on a read-write
      file system without an existing free space tree, we won't build a free
      space tree, but /proc/mounts will read space_cache=v2 (until we mount
      again and it goes away).
      
      cache_generation is set iff space_cache=v1, FREE_SPACE_TREE is set iff
      space_cache=v2, and if neither is the case, we print nospace_cache.
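
      The decision described in the last paragraph can be sketched as a small
      function. The enum, the function name and the two boolean inputs are
      hypothetical; they only model "cache_generation is set" and "the
      FREE_SPACE_TREE bit is set" as read from the super block. Checking v1
      first is a choice of this sketch, and both being set at once is not
      expected.

```c
#include <stdbool.h>

enum space_cache_state {
	SKETCH_NOSPACE_CACHE,
	SKETCH_SPACE_CACHE_V1,
	SKETCH_SPACE_CACHE_V2,
};

/* Pick what to display from on-disk state, not from the passed options. */
static enum space_cache_state
space_cache_from_super(bool cache_generation_set, bool has_free_space_tree)
{
	if (cache_generation_set)
		return SKETCH_SPACE_CACHE_V1;
	if (has_free_space_tree)
		return SKETCH_SPACE_CACHE_V2;
	return SKETCH_NOSPACE_CACHE;
}
```

      In the remount example above, neither input is set on disk, so the
      result is nospace_cache regardless of the option that was passed in.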
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      04c41559