1. 13 May, 2021 4 commits
    • Filipe Manana's avatar
      btrfs: fix removed dentries still existing after log is synced · 54a40fc3
      Filipe Manana authored
      When we move one inode from one directory to another and both the inode
      and its previous parent directory were logged before, we are not supposed
      to have the dentry for the old parent if we have a power failure after the
      log is synced. Only the new dentry is supposed to exist.
      
      Generally this works correctly, however there is a scenario where this is
      not currently working, because the old parent of the file/directory that
      was moved is not authoritative for a range that includes the dir index and
      dir item keys of the old dentry. This case is better explained with the
      following example and reproducer:
      
        # The test requires a very specific layout of keys and items in the
        # fs/subvolume btree to trigger the bug. So we want to make sure that
        # on whatever platform we are, we have the same leaf/node size.
        #
        # Currently in btrfs the node/leaf size can not be smaller than the page
        # size (but it can be greater than the page size). So use the largest
        # supported node/leaf size (64K).
      
        $ mkfs.btrfs -f -n 65536 /dev/sdc
        $ mount /dev/sdc /mnt
      
        # "testdir" is inode 257.
        $ mkdir /mnt/testdir
        $ chmod 755 /mnt/testdir
      
        # Create several empty files to have the directory "testdir" with its
        # items spread over several leaves (7 in this case).
        $ for ((i = 1; i <= 1200; i++)); do
             echo -n > /mnt/testdir/file$i
          done
      
        # Create our test directory "dira", inode number 1458, which gets all
        # its items in leaf 7.
        #
        # The BTRFS_DIR_ITEM_KEY item for inode 257 ("testdir") that points to
        # the entry named "dira" is in leaf 2, while the BTRFS_DIR_INDEX_KEY
        # item that points to that entry is in leaf 3.
        #
        # For this particular filesystem node size (64K), file count and file
        # names, we endup with the directory entry items from inode 257 in
        # leaves 2 and 3, as previously mentioned - what matters for triggering
        # the bug exercised by this test case is that those items are not placed
        # in leaf 1, they must be placed in a leaf different from the one
        # containing the inode item for inode 257.
        #
        # The corresponding BTRFS_DIR_ITEM_KEY and BTRFS_DIR_INDEX_KEY items for
        # the parent inode (257) are the following:
        #
        #    item 460 key (257 DIR_ITEM 3724298081) itemoff 48344 itemsize 34
        #         location key (1458 INODE_ITEM 0) type DIR
        #         transid 6 data_len 0 name_len 4
        #         name: dira
        #
        # and:
        #
        #    item 771 key (257 DIR_INDEX 1202) itemoff 36673 itemsize 34
        #         location key (1458 INODE_ITEM 0) type DIR
        #         transid 6 data_len 0 name_len 4
        #         name: dira
      
        $ mkdir /mnt/testdir/dira
      
        # Make sure everything done so far is durably persisted.
        $ sync
      
        # Now do a change to inode 257 ("testdir") that does not result in
        # COWing leaves 2 and 3 - the leaves that contain the directory items
        # pointing to inode 1458 (directory "dira").
        #
        # Changing permissions, the owner/group, updating or adding a xattr,
        # etc, will not change (COW) leaves 2 and 3. So for the sake of
        # simplicity change the permissions of inode 257, which results in
        # updating its inode item and therefore change (COW) only leaf 1.
      
        $ chmod 700 /mnt/testdir
      
        # Now fsync directory inode 257.
        #
        # Since only the first leaf was changed/COWed, we log the inode item of
        # inode 257 and only the dentries found in the first leaf, all have a
        # key type of BTRFS_DIR_ITEM_KEY, and no keys of type
        # BTRFS_DIR_INDEX_KEY, because they sort after the former type and none
        # exist in the first leaf.
        #
        # We also log 3 items that represent ranges for dir items and dir
        # indexes for which the log is authoritative:
        #
        # 1) a key of type BTRFS_DIR_LOG_ITEM_KEY, which indicates the log is
        #    authoritative for all BTRFS_DIR_ITEM_KEY keys that have an offset
        #    in the range [0, 2285968570] (the offset here is the crc32c of the
        #    dentry's name). The value 2285968570 corresponds to the offset of
        #    the first key of leaf 2 (which is of type BTRFS_DIR_ITEM_KEY);
        #
        # 2) a key of type BTRFS_DIR_LOG_ITEM_KEY, which indicates the log is
        #    authoritative for all BTRFS_DIR_ITEM_KEY keys that have an offset
        #    in the range [4293818216, (u64)-1] (the offset here is the crc32c
        #    of the dentry's name). The value 4293818216 corresponds to the
        #    offset of the highest key of type BTRFS_DIR_ITEM_KEY plus 1
        #    (4293818215 + 1), which is located in leaf 2;
        #
        # 3) a key of type BTRFS_DIR_LOG_INDEX_KEY, with an offset of 1203,
        #    which indicates the log is authoritative for all keys of type
        #    BTRFS_DIR_INDEX_KEY that have an offset in the range
        #    [1203, (u64)-1]. The value 1203 corresponds to the offset of the
        #    last key of type BTRFS_DIR_INDEX_KEY plus 1 (1202 + 1), which is
        #    located in leaf 3;
        #
        # Also, because "testdir" is a directory and inode 1458 ("dira") is a
        # child directory, we log inode 1458 too.
      
        $ xfs_io -c "fsync" /mnt/testdir
      
        # Now move "dira", inode 1458, to be a child of the root directory
        # (inode 256).
        #
        # Because this inode was previously logged, when "testdir" was fsynced,
        # the log is updated so that the old inode reference, referring to inode
        # 257 as the parent, is deleted and the new inode reference, referring
        # to inode 256 as the parent, is added to the log.
      
        $ mv /mnt/testdir/dira /mnt
      
        # Now change some file and fsync it. This guarantees the log changes
        # made by the previous move/rename operation are persisted. We do not
        # need to do any special modification to the file, just any change to
        # any file and sync the log.
      
        $ xfs_io -c "pwrite -S 0xab 0 64K" -c "fsync" /mnt/testdir/file1
      
        # Simulate a power failure and then mount again the filesystem to
        # replay the log tree. We want to verify that we are able to mount the
        # filesystem, meaning log replay was successful, and that directory
        # inode 1458 ("dira") only has inode 256 (the filesystem's root) as
        # its parent (and no longer a child of inode 257).
        #
        # It used to happen that during log replay we would end up having
        # inode 1458 (directory "dira") with 2 hard links, being a child of
        # inode 257 ("testdir") and inode 256 (the filesystem's root). This
        # resulted in the tree checker detecting the issue and causing the
        # mount operation to fail (with -EIO).
        #
        # This happened because in the log we have the new name/parent for
        # inode 1458, which results in adding the new dentry with inode 256
        # as the parent, but the previous dentry, under inode 257 was never
        # removed - this is because the ranges for dir items and dir indexes
        # of inode 257 for which the log is authoritative do not include the
        # old dir item and dir index for the dentry of inode 257 referring to
        # inode 1458:
        #
        # - for dir items, the log is authoritative for the ranges
        #   [0, 2285968570] and [4293818216, (u64)-1]. The dir item at inode 257
        #   pointing to inode 1458 has a key of (257 DIR_ITEM 3724298081), as
        #   previously mentioned, so the dir item is not deleted when the log
        #   replay procedure processes the authoritative ranges, as 3724298081
        #   is outside both ranges;
        #
        # - for dir indexes, the log is authoritative for the range
        #   [1203, (u64)-1], and the dir index item of inode 257 pointing to
        #   inode 1458 has a key of (257 DIR_INDEX 1202), as previously
        #   mentioned, so the dir index item is not deleted when the log
        #   replay procedure processes the authoritative range.
      
        <power failure>
      
        $ mount /dev/sdc /mnt
        mount: /mnt: can't read superblock on /dev/sdc.
      
        $ dmesg
        (...)
        [87849.840509] BTRFS info (device sdc): start tree-log replay
        [87849.875719] BTRFS critical (device sdc): corrupt leaf: root=5 block=30539776 slot=554 ino=1458, invalid nlink: has 2 expect no more than 1 for dir
        [87849.878084] BTRFS info (device sdc): leaf 30539776 gen 7 total ptrs 557 free space 2092 owner 5
        [87849.879516] BTRFS info (device sdc): refs 1 lock_owner 0 current 2099108
        [87849.880613] 	item 0 key (1181 1 0) itemoff 65275 itemsize 160
        [87849.881544] 		inode generation 6 size 0 mode 100644
        [87849.882692] 	item 1 key (1181 12 257) itemoff 65258 itemsize 17
        (...)
        [87850.562549] 	item 556 key (1458 12 257) itemoff 16017 itemsize 14
        [87850.563349] BTRFS error (device dm-0): block=30539776 write time tree block corruption detected
        [87850.564386] ------------[ cut here ]------------
        [87850.564920] WARNING: CPU: 3 PID: 2099108 at fs/btrfs/disk-io.c:465 csum_one_extent_buffer+0xed/0x100 [btrfs]
        [87850.566129] Modules linked in: btrfs dm_zero dm_snapshot (...)
        [87850.573789] CPU: 3 PID: 2099108 Comm: mount Not tainted 5.12.0-rc8-btrfs-next-86 #1
        (...)
        [87850.587481] Call Trace:
        [87850.587768]  btree_csum_one_bio+0x244/0x2b0 [btrfs]
        [87850.588354]  ? btrfs_bio_fits_in_stripe+0xd8/0x110 [btrfs]
        [87850.589003]  btrfs_submit_metadata_bio+0xb7/0x100 [btrfs]
        [87850.589654]  submit_one_bio+0x61/0x70 [btrfs]
        [87850.590248]  submit_extent_page+0x91/0x2f0 [btrfs]
        [87850.590842]  write_one_eb+0x175/0x440 [btrfs]
        [87850.591370]  ? find_extent_buffer_nolock+0x1c0/0x1c0 [btrfs]
        [87850.592036]  btree_write_cache_pages+0x1e6/0x610 [btrfs]
        [87850.592665]  ? free_debug_processing+0x1d5/0x240
        [87850.593209]  do_writepages+0x43/0xf0
        [87850.593798]  ? __filemap_fdatawrite_range+0xa4/0x100
        [87850.594391]  __filemap_fdatawrite_range+0xc5/0x100
        [87850.595196]  btrfs_write_marked_extents+0x68/0x160 [btrfs]
        [87850.596202]  btrfs_write_and_wait_transaction.isra.0+0x4d/0xd0 [btrfs]
        [87850.597377]  btrfs_commit_transaction+0x794/0xca0 [btrfs]
        [87850.598455]  ? _raw_spin_unlock_irqrestore+0x32/0x60
        [87850.599305]  ? kmem_cache_free+0x15a/0x3d0
        [87850.600029]  btrfs_recover_log_trees+0x346/0x380 [btrfs]
        [87850.601021]  ? replay_one_extent+0x7d0/0x7d0 [btrfs]
        [87850.601988]  open_ctree+0x13c9/0x1698 [btrfs]
        [87850.602846]  btrfs_mount_root.cold+0x13/0xed [btrfs]
        [87850.603771]  ? kmem_cache_alloc_trace+0x7c9/0x930
        [87850.604576]  ? vfs_parse_fs_string+0x5d/0xb0
        [87850.605293]  ? kfree+0x276/0x3f0
        [87850.605857]  legacy_get_tree+0x30/0x50
        [87850.606540]  vfs_get_tree+0x28/0xc0
        [87850.607163]  fc_mount+0xe/0x40
        [87850.607695]  vfs_kern_mount.part.0+0x71/0x90
        [87850.608440]  btrfs_mount+0x13b/0x3e0 [btrfs]
        (...)
        [87850.629477] ---[ end trace 68802022b99a1ea0 ]---
        [87850.630849] BTRFS: error (device sdc) in btrfs_commit_transaction:2381: errno=-5 IO failure (Error while writing out transaction)
        [87850.632422] BTRFS warning (device sdc): Skipping commit of aborted transaction.
        [87850.633416] BTRFS: error (device sdc) in cleanup_transaction:1978: errno=-5 IO failure
        [87850.634553] BTRFS: error (device sdc) in btrfs_replay_log:2431: errno=-5 IO failure (Failed to recover log tree)
        [87850.637529] BTRFS error (device sdc): open_ctree failed
      
      In this example the inode we moved was a directory, so it was easy to
      detect the problem because directories can only have one hard link and
      the tree checker immediately detects that. If the moved inode was a file,
      then the log replay would succeed and we would end up having both the
      new hard link (/mnt/foo) and the old hard link (/mnt/testdir/foo) present,
      but only the new one should be present.
      
      Fix this by forcing re-logging of the old parent directory when logging
      the new name during a rename operation. This ensures we end up with a log
      that is authoritative for a range covering the keys for the old dentry,
      therefore causing the old dentry do be deleted when replaying the log.
      
      A test case for fstests will follow up soon.
      
      Fixes: 64d6b281 ("btrfs: remove unnecessary check_parent_dirs_for_sync()")
      CC: stable@vger.kernel.org # 5.12+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      54a40fc3
    • Boris Burkov's avatar
      btrfs: return whole extents in fiemap · 15c7745c
      Boris Burkov authored
        `xfs_io -c 'fiemap <off> <len>' <file>`
      
      can give surprising results on btrfs that differ from xfs.
      
      btrfs prints out extents trimmed to fit the user input. If the user's
      fiemap request has an offset, then rather than returning each whole
      extent which intersects that range, we also trim the start extent to not
      have start < off.
      
      Documentation in filesystems/fiemap.txt and the xfs_io man page suggests
      that returning the whole extent is expected.
      
      Some cases which all yield the same fiemap in xfs, but not btrfs:
        dd if=/dev/zero of=$f bs=4k count=1
        sudo xfs_io -c 'fiemap 0 1024' $f
          0: [0..7]: 26624..26631
        sudo xfs_io -c 'fiemap 2048 1024' $f
          0: [4..7]: 26628..26631
        sudo xfs_io -c 'fiemap 2048 4096' $f
          0: [4..7]: 26628..26631
        sudo xfs_io -c 'fiemap 3584 512' $f
          0: [7..7]: 26631..26631
        sudo xfs_io -c 'fiemap 4091 5' $f
          0: [7..6]: 26631..26630
      
      I believe this is a consequence of the logic for merging contiguous
      extents represented by separate extent items. That logic needs to track
      the last offset as it loops through the extent items, which happens to
      pick up the start offset on the first iteration, and trim off the
      beginning of the full extent. To fix it, start `off` at 0 rather than
      `start` so that we keep the iteration/merging intact without cutting off
      the start of the extent.
      
      after the fix, all the above commands give:
      
        0: [0..7]: 26624..26631
      
      The merging logic is exercised by fstest generic/483, and I have written
      a new fstest for checking we don't have backwards or zero-length fiemaps
      for cases like those above.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      15c7745c
    • Josef Bacik's avatar
      btrfs: avoid RCU stalls while running delayed iputs · 71795ee5
      Josef Bacik authored
      Generally a delayed iput is added when we might do the final iput, so
      usually we'll end up sleeping while processing the delayed iputs
      naturally.  However there's no guarantee of this, especially for small
      files.  In production we noticed 5 instances of RCU stalls while testing
      a kernel release overnight across 1000 machines, so this is relatively
      common:
      
        host count: 5
        rcu: INFO: rcu_sched self-detected stall on CPU
        rcu: ....: (20998 ticks this GP) idle=59e/1/0x4000000000000002 softirq=12333372/12333372 fqs=3208
         	(t=21031 jiffies g=27810193 q=41075) NMI backtrace for cpu 1
        CPU: 1 PID: 1713 Comm: btrfs-cleaner Kdump: loaded Not tainted 5.6.13-0_fbk12_rc1_5520_gec92bffc1ec9 #1
        Call Trace:
          <IRQ> dump_stack+0x50/0x70
          nmi_cpu_backtrace.cold.6+0x30/0x65
          ? lapic_can_unplug_cpu.cold.30+0x40/0x40
          nmi_trigger_cpumask_backtrace+0xba/0xca
          rcu_dump_cpu_stacks+0x99/0xc7
          rcu_sched_clock_irq.cold.90+0x1b2/0x3a3
          ? trigger_load_balance+0x5c/0x200
          ? tick_sched_do_timer+0x60/0x60
          ? tick_sched_do_timer+0x60/0x60
          update_process_times+0x24/0x50
          tick_sched_timer+0x37/0x70
          __hrtimer_run_queues+0xfe/0x270
          hrtimer_interrupt+0xf4/0x210
          smp_apic_timer_interrupt+0x5e/0x120
          apic_timer_interrupt+0xf/0x20 </IRQ>
         RIP: 0010:queued_spin_lock_slowpath+0x17d/0x1b0
         RSP: 0018:ffffc9000da5fe48 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
         RAX: 0000000000000000 RBX: ffff889fa81d0cd8 RCX: 0000000000000029
         RDX: ffff889fff86c0c0 RSI: 0000000000080000 RDI: ffff88bfc2da7200
         RBP: ffff888f2dcdd768 R08: 0000000001040000 R09: 0000000000000000
         R10: 0000000000000001 R11: ffffffff82a55560 R12: ffff88bfc2da7200
         R13: 0000000000000000 R14: ffff88bff6c2a360 R15: ffffffff814bd870
         ? kzalloc.constprop.57+0x30/0x30
         list_lru_add+0x5a/0x100
         inode_lru_list_add+0x20/0x40
         iput+0x1c1/0x1f0
         run_delayed_iput_locked+0x46/0x90
         btrfs_run_delayed_iputs+0x3f/0x60
         cleaner_kthread+0xf2/0x120
         kthread+0x10b/0x130
      
      Fix this by adding a cond_resched_lock() to the loop processing delayed
      iputs so we can avoid these sort of stalls.
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: default avatarRik van Riel <riel@surriel.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      71795ee5
    • Johannes Thumshirn's avatar
      btrfs: return 0 for dev_extent_hole_check_zoned hole_start in case of error · d6f67afb
      Johannes Thumshirn authored
      Commit 7000babd ("btrfs: assign proper values to a bool variable in
      dev_extent_hole_check_zoned") assigned false to the hole_start parameter
      of dev_extent_hole_check_zoned().
      
      The hole_start parameter is not boolean and returns the start location of
      the found hole.
      
      Fixes: 7000babd ("btrfs: assign proper values to a bool variable in dev_extent_hole_check_zoned")
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d6f67afb
  2. 04 May, 2021 3 commits
    • Tom Rix's avatar
      btrfs: initialize return variable in cleanup_free_space_cache_v1 · 77364faf
      Tom Rix authored
      Static analysis reports this problem
      
        free-space-cache.c:3965:2: warning: Undefined or garbage value returned
          return ret;
          ^~~~~~~~~~
      
      ret is set in the node handling loop.  Treat doing nothing as a success
      and initialize ret to 0, although it's unlikely the loop would be
      skipped. We always have block groups, but as it could lead to
      transaction abort in the caller it's better to be safe.
      
      CC: stable@vger.kernel.org # 5.12+
      Signed-off-by: default avatarTom Rix <trix@redhat.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      77364faf
    • Naohiro Aota's avatar
      btrfs: zoned: sanity check zone type · 784daf2b
      Naohiro Aota authored
      The fstests test case generic/475 creates a dm-linear device that gets
      changed to a dm-error device. This leads to errors in loading the block
      group's zone information when running on a zoned file system, ultimately
      resulting in a list corruption. When running on a kernel with list
      debugging enabled this leads to the following crash.
      
       BTRFS: error (device dm-2) in cleanup_transaction:1953: errno=-5 IO failure
       kernel BUG at lib/list_debug.c:54!
       invalid opcode: 0000 [#1] SMP PTI
       CPU: 1 PID: 2433 Comm: umount Tainted: G        W         5.12.0+ #1018
       RIP: 0010:__list_del_entry_valid.cold+0x1d/0x47
       RSP: 0018:ffffc90001473df0 EFLAGS: 00010296
       RAX: 0000000000000054 RBX: ffff8881038fd000 RCX: ffffc90001473c90
       RDX: 0000000100001a31 RSI: 0000000000000003 RDI: 0000000000000003
       RBP: ffff888308871108 R08: 0000000000000003 R09: 0000000000000001
       R10: 3961373532383838 R11: 6666666620736177 R12: ffff888308871000
       R13: ffff8881038fd088 R14: ffff8881038fdc78 R15: dead000000000100
       FS:  00007f353c9b1540(0000) GS:ffff888627d00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f353cc2c710 CR3: 000000018e13c000 CR4: 00000000000006a0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        btrfs_free_block_groups+0xc9/0x310 [btrfs]
        close_ctree+0x2ee/0x31a [btrfs]
        ? call_rcu+0x8f/0x270
        ? mutex_lock+0x1c/0x40
        generic_shutdown_super+0x67/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x90
        cleanup_mnt+0x13e/0x1b0
        task_work_run+0x63/0xb0
        exit_to_user_mode_loop+0xd9/0xe0
        exit_to_user_mode_prepare+0x3e/0x60
        syscall_exit_to_user_mode+0x1d/0x50
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      As dm-error has no support for zones, btrfs will run it's zone emulation
      mode on this device. The zone emulation mode emulates conventional zones,
      so bail out if the zone bitmap that gets populated on mount sees the zone
      as sequential while we're thinking it's a conventional zone when creating
      a block group.
      
      Note: this scenario is unlikely in a real wold application and can only
      happen by this (ab)use of device-mapper targets.
      
      CC: stable@vger.kernel.org # 5.12+
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      784daf2b
    • Anand Jain's avatar
      btrfs: fix unmountable seed device after fstrim · 5e753a81
      Anand Jain authored
      The following test case reproduces an issue of wrongly freeing in-use
      blocks on the readonly seed device when fstrim is called on the rw sprout
      device. As shown below.
      
      Create a seed device and add a sprout device to it:
      
        $ mkfs.btrfs -fq -dsingle -msingle /dev/loop0
        $ btrfstune -S 1 /dev/loop0
        $ mount /dev/loop0 /btrfs
        $ btrfs dev add -f /dev/loop1 /btrfs
        BTRFS info (device loop0): relocating block group 290455552 flags system
        BTRFS info (device loop0): relocating block group 1048576 flags system
        BTRFS info (device loop0): disk added /dev/loop1
        $ umount /btrfs
      
      Mount the sprout device and run fstrim:
      
        $ mount /dev/loop1 /btrfs
        $ fstrim /btrfs
        $ umount /btrfs
      
      Now try to mount the seed device, and it fails:
      
        $ mount /dev/loop0 /btrfs
        mount: /btrfs: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.
      
      Block 5292032 is missing on the readonly seed device:
      
       $ dmesg -kt | tail
       <snip>
       BTRFS error (device loop0): bad tree block start, want 5292032 have 0
       BTRFS warning (device loop0): couldn't read-tree root
       BTRFS error (device loop0): open_ctree failed
      
      From the dump-tree of the seed device (taken before the fstrim). Block
      5292032 belonged to the block group starting at 5242880:
      
        $ btrfs inspect dump-tree -e /dev/loop0 | grep -A1 BLOCK_GROUP
        <snip>
        item 3 key (5242880 BLOCK_GROUP_ITEM 8388608) itemoff 16169 itemsize 24
        	block group used 114688 chunk_objectid 256 flags METADATA
        <snip>
      
      From the dump-tree of the sprout device (taken before the fstrim).
      fstrim used block-group 5242880 to find the related free space to free:
      
        $ btrfs inspect dump-tree -e /dev/loop1 | grep -A1 BLOCK_GROUP
        <snip>
        item 1 key (5242880 BLOCK_GROUP_ITEM 8388608) itemoff 16226 itemsize 24
        	block group used 32768 chunk_objectid 256 flags METADATA
        <snip>
      
      BPF kernel tracing the fstrim command finds the missing block 5292032
      within the range of the discarded blocks as below:
      
        kprobe:btrfs_discard_extent {
        	printf("freeing start %llu end %llu num_bytes %llu:\n",
        		arg1, arg1+arg2, arg2);
        }
      
        freeing start 5259264 end 5406720 num_bytes 147456
        <snip>
      
      Fix this by avoiding the discard command to the readonly seed device.
      Reported-by: default avatarChris Murphy <lists@colorremedies.com>
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5e753a81
  3. 28 Apr, 2021 4 commits
    • Filipe Manana's avatar
      btrfs: fix deadlock when cloning inline extents and using qgroups · f9baa501
      Filipe Manana authored
      There are a few exceptional cases where cloning an inline extent needs to
      copy the inline extent data into a page of the destination inode.
      
      When this happens, we end up starting a transaction while having a dirty
      page for the destination inode and while having the range locked in the
      destination's inode iotree too. Because when reserving metadata space
      for a transaction we may need to flush existing delalloc in case there is
      not enough free space, we have a mechanism in place to prevent a deadlock,
      which was introduced in commit 3d45f221 ("btrfs: fix deadlock when
      cloning inline extent and low on free metadata space").
      
      However when using qgroups, a transaction also reserves metadata qgroup
      space, which can also result in flushing delalloc in case there is not
      enough available space at the moment. When this happens we deadlock, since
      flushing delalloc requires locking the file range in the inode's iotree
      and the range was already locked at the very beginning of the clone
      operation, before attempting to start the transaction.
      
      When this issue happens, stack traces like the following are reported:
      
        [72747.556262] task:kworker/u81:9   state:D stack:    0 pid:  225 ppid:     2 flags:0x00004000
        [72747.556268] Workqueue: writeback wb_workfn (flush-btrfs-1142)
        [72747.556271] Call Trace:
        [72747.556273]  __schedule+0x296/0x760
        [72747.556277]  schedule+0x3c/0xa0
        [72747.556279]  io_schedule+0x12/0x40
        [72747.556284]  __lock_page+0x13c/0x280
        [72747.556287]  ? generic_file_readonly_mmap+0x70/0x70
        [72747.556325]  extent_write_cache_pages+0x22a/0x440 [btrfs]
        [72747.556331]  ? __set_page_dirty_nobuffers+0xe7/0x160
        [72747.556358]  ? set_extent_buffer_dirty+0x5e/0x80 [btrfs]
        [72747.556362]  ? update_group_capacity+0x25/0x210
        [72747.556366]  ? cpumask_next_and+0x1a/0x20
        [72747.556391]  extent_writepages+0x44/0xa0 [btrfs]
        [72747.556394]  do_writepages+0x41/0xd0
        [72747.556398]  __writeback_single_inode+0x39/0x2a0
        [72747.556403]  writeback_sb_inodes+0x1ea/0x440
        [72747.556407]  __writeback_inodes_wb+0x5f/0xc0
        [72747.556410]  wb_writeback+0x235/0x2b0
        [72747.556414]  ? get_nr_inodes+0x35/0x50
        [72747.556417]  wb_workfn+0x354/0x490
        [72747.556420]  ? newidle_balance+0x2c5/0x3e0
        [72747.556424]  process_one_work+0x1aa/0x340
        [72747.556426]  worker_thread+0x30/0x390
        [72747.556429]  ? create_worker+0x1a0/0x1a0
        [72747.556432]  kthread+0x116/0x130
        [72747.556435]  ? kthread_park+0x80/0x80
        [72747.556438]  ret_from_fork+0x1f/0x30
      
        [72747.566958] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
        [72747.566961] Call Trace:
        [72747.566964]  __schedule+0x296/0x760
        [72747.566968]  ? finish_wait+0x80/0x80
        [72747.566970]  schedule+0x3c/0xa0
        [72747.566995]  wait_extent_bit.constprop.68+0x13b/0x1c0 [btrfs]
        [72747.566999]  ? finish_wait+0x80/0x80
        [72747.567024]  lock_extent_bits+0x37/0x90 [btrfs]
        [72747.567047]  btrfs_invalidatepage+0x299/0x2c0 [btrfs]
        [72747.567051]  ? find_get_pages_range_tag+0x2cd/0x380
        [72747.567076]  __extent_writepage+0x203/0x320 [btrfs]
        [72747.567102]  extent_write_cache_pages+0x2bb/0x440 [btrfs]
        [72747.567106]  ? update_load_avg+0x7e/0x5f0
        [72747.567109]  ? enqueue_entity+0xf4/0x6f0
        [72747.567134]  extent_writepages+0x44/0xa0 [btrfs]
        [72747.567137]  ? enqueue_task_fair+0x93/0x6f0
        [72747.567140]  do_writepages+0x41/0xd0
        [72747.567144]  __filemap_fdatawrite_range+0xc7/0x100
        [72747.567167]  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
        [72747.567195]  btrfs_work_helper+0xc2/0x300 [btrfs]
        [72747.567200]  process_one_work+0x1aa/0x340
        [72747.567202]  worker_thread+0x30/0x390
        [72747.567205]  ? create_worker+0x1a0/0x1a0
        [72747.567208]  kthread+0x116/0x130
        [72747.567211]  ? kthread_park+0x80/0x80
        [72747.567214]  ret_from_fork+0x1f/0x30
      
        [72747.569686] task:fsstress        state:D stack:    0 pid:841421 ppid:841417 flags:0x00000000
        [72747.569689] Call Trace:
        [72747.569691]  __schedule+0x296/0x760
        [72747.569694]  schedule+0x3c/0xa0
        [72747.569721]  try_flush_qgroup+0x95/0x140 [btrfs]
        [72747.569725]  ? finish_wait+0x80/0x80
        [72747.569753]  btrfs_qgroup_reserve_data+0x34/0x50 [btrfs]
        [72747.569781]  btrfs_check_data_free_space+0x5f/0xa0 [btrfs]
        [72747.569804]  btrfs_buffered_write+0x1f7/0x7f0 [btrfs]
        [72747.569810]  ? path_lookupat.isra.48+0x97/0x140
        [72747.569833]  btrfs_file_write_iter+0x81/0x410 [btrfs]
        [72747.569836]  ? __kmalloc+0x16a/0x2c0
        [72747.569839]  do_iter_readv_writev+0x160/0x1c0
        [72747.569843]  do_iter_write+0x80/0x1b0
        [72747.569847]  vfs_writev+0x84/0x140
        [72747.569869]  ? btrfs_file_llseek+0x38/0x270 [btrfs]
        [72747.569873]  do_writev+0x65/0x100
        [72747.569876]  do_syscall_64+0x33/0x40
        [72747.569879]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        [72747.569899] task:fsstress        state:D stack:    0 pid:841424 ppid:841417 flags:0x00004000
        [72747.569903] Call Trace:
        [72747.569906]  __schedule+0x296/0x760
        [72747.569909]  schedule+0x3c/0xa0
        [72747.569936]  try_flush_qgroup+0x95/0x140 [btrfs]
        [72747.569940]  ? finish_wait+0x80/0x80
        [72747.569967]  __btrfs_qgroup_reserve_meta+0x36/0x50 [btrfs]
        [72747.569989]  start_transaction+0x279/0x580 [btrfs]
        [72747.570014]  clone_copy_inline_extent+0x332/0x490 [btrfs]
        [72747.570041]  btrfs_clone+0x5b7/0x7a0 [btrfs]
        [72747.570068]  ? lock_extent_bits+0x64/0x90 [btrfs]
        [72747.570095]  btrfs_clone_files+0xfc/0x150 [btrfs]
        [72747.570122]  btrfs_remap_file_range+0x3d8/0x4a0 [btrfs]
        [72747.570126]  do_clone_file_range+0xed/0x200
        [72747.570131]  vfs_clone_file_range+0x37/0x110
        [72747.570134]  ioctl_file_clone+0x7d/0xb0
        [72747.570137]  do_vfs_ioctl+0x138/0x630
        [72747.570140]  __x64_sys_ioctl+0x62/0xc0
        [72747.570143]  do_syscall_64+0x33/0x40
        [72747.570146]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      So fix this by skipping the flush of delalloc for an inode that is
      flagged with BTRFS_INODE_NO_DELALLOC_FLUSH, meaning it is currently under
      such a special case of cloning an inline extent, when flushing delalloc
      during qgroup metadata reservation.
      
      The special cases for cloning inline extents were added in kernel 5.7 by
      by commit 05a5a762 ("Btrfs: implement full reflink support for
      inline extents"), while having qgroup metadata space reservation flushing
      delalloc when low on space was added in kernel 5.9 by commit
      c53e9653 ("btrfs: qgroup: try to flush qgroup space when we get
      -EDQUOT"). So use a "Fixes:" tag for the later commit to ease stable
      kernel backports.
      Reported-by: default avatarWang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20210421083137.31E3.409509F4@e16-tech.com/
      Fixes: c53e9653 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
      CC: stable@vger.kernel.org # 5.9+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f9baa501
    • Filipe Manana's avatar
      btrfs: fix race leading to unpersisted data and metadata on fsync · 626e9f41
      Filipe Manana authored
      When doing a fast fsync on a file, there is a race which can result in the
      fsync returning success to user space without logging the inode and without
      durably persisting new data.
      
      The following example shows one possible scenario for this:
      
         $ mkfs.btrfs -f /dev/sdc
         $ mount /dev/sdc /mnt
      
         $ touch /mnt/bar
         $ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/baz
      
         # Now we have:
         # file bar == inode 257
         # file baz == inode 258
      
         $ mv /mnt/baz /mnt/foo
      
         # Now we have:
         # file bar == inode 257
         # file foo == inode 258
      
         $ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo
      
         # fsync bar before foo, it is important to trigger the race.
         $ xfs_io -c "fsync" /mnt/bar
         $ xfs_io -c "fsync" /mnt/foo
      
         # After this:
         # inode 257, file bar, is empty
         # inode 258, file foo, has 1M filled with 0xcd
      
         <power failure>
      
         # Replay the log:
         $ mount /dev/sdc /mnt
      
         # After this point file foo should have 1M filled with 0xcd and not 0xab
      
      The following steps explain how the race happens:
      
      1) Before the first fsync of inode 258, when it has the "baz" name, its
         ->logged_trans is 0, ->last_sub_trans is 0 and ->last_log_commit is -1.
         The inode also has the full sync flag set;
      
      2) After the first fsync, we set inode 258 ->logged_trans to 6, which is
         the generation of the current transaction, and set ->last_log_commit
         to 0, which is the current value of ->last_sub_trans (done at
         btrfs_log_inode()).
      
         The full sync flag is cleared from the inode during the fsync.
      
         The log sub transaction that was committed had an ID of 0 and when we
         synced the log, at btrfs_sync_log(), we incremented root->log_transid
         from 0 to 1;
      
      3) During the rename:
      
         We update inode 258, through btrfs_update_inode(), and that causes its
         ->last_sub_trans to be set to 1 (the current log transaction ID), and
         ->last_log_commit remains with a value of 0.
      
         After updating inode 258, because we have previously logged the inode
         in the previous fsync, we log again the inode through the call to
         btrfs_log_new_name(). This results in updating the inode's
         ->last_log_commit from 0 to 1 (the current value of its
         ->last_sub_trans).
      
         The ->last_sub_trans of inode 257 is updated to 1, which is the ID of
         the next log transaction;
      
      4) Then a buffered write against inode 258 is made. This leaves the value
         of ->last_sub_trans as 1 (the ID of the current log transaction, stored
         at root->log_transid);
      
      5) Then an fsync against inode 257 (or any other inode other than 258),
         happens. This results in committing the log transaction with ID 1,
         which results in updating root->last_log_commit to 1 and bumping
         root->log_transid from 1 to 2;
      
      6) Then an fsync against inode 258 starts. We flush delalloc and wait only
         for writeback to complete, since the full sync flag is not set in the
         inode's runtime flags - we do not wait for ordered extents to complete.
      
         Then, at btrfs_sync_file(), we call btrfs_inode_in_log() before the
         ordered extent completes. The call returns true:
      
           static inline bool btrfs_inode_in_log(...)
           {
               bool ret = false;
      
               spin_lock(&inode->lock);
               if (inode->logged_trans == generation &&
                   inode->last_sub_trans <= inode->last_log_commit &&
                   inode->last_sub_trans <= inode->root->last_log_commit)
                       ret = true;
               spin_unlock(&inode->lock);
               return ret;
           }
      
         generation has a value of 6 (fs_info->generation), ->logged_trans also
         has a value of 6 (set when we logged the inode during the first fsync
         and when logging it during the rename), ->last_sub_trans has a value
         of 1, set during the rename (step 3), ->last_log_commit also has a
         value of 1 (set in step 3) and root->last_log_commit has a value of 1,
         which was set in step 5 when fsyncing inode 257.
      
         As a consequence we don't log the inode, any new extents and do not
         sync the log, resulting in a data loss if a power failure happens
         after the fsync and before the current transaction commits.
         Also, because we do not log the inode, after a power failure the mtime
         and ctime of the inode do not match those we had before.
      
         When the ordered extent completes before we call btrfs_inode_in_log(),
         then the call returns false and we log the inode and sync the log,
         since at the end of ordered extent completion we update the inode and
         set ->last_sub_trans to 2 (the value of root->log_transid) and
         ->last_log_commit to 1.
      
      This problem is found after removing the check for the emptiness of the
      inode's list of modified extents in the recent commit 209ecbb8
      ("btrfs: remove stale comment and logic from btrfs_inode_in_log()"),
      added in the 5.13 merge window. However checking the emptiness of the
      list is not really the way to solve this problem, and was never intended
      to, because while that solves the problem for COW writes, the problem
      persists for NOCOW writes because in that case the list is always empty.
      
      In the case of NOCOW writes, even though we wait for the writeback to
      complete before returning from btrfs_sync_file(), we end up not logging
      the inode, which has a new mtime/ctime, and because we don't sync the log,
      we never issue disk barriers (send REQ_PREFLUSH to the device) since that
      only happens when we sync the log (when we write super blocks at
      btrfs_sync_log()). So effectively, for a NOCOW case, when we return from
      btrfs_sync_file() to user space, we are not guaranteeing that the data is
      durably persisted on disk.
      
      Also, while the example above uses a rename exchange to show how the
      problem happens, it is not the only way to trigger it. An alternative
      could be adding a new hard link to inode 258, since that also results
      in calling btrfs_log_new_name() and updating the inode in the log.
      An example reproducer using the addition of a hard link instead of a
      rename operation:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        $ touch /mnt/bar
        $ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/foo
      
        $ ln /mnt/foo /mnt/foo_link
        $ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo
      
        $ xfs_io -c "fsync" /mnt/bar
        $ xfs_io -c "fsync" /mnt/foo
      
        <power failure>
      
        # Replay the log:
        $ mount /dev/sdc /mnt
      
        # After this point file foo often has 1M filled with 0xab and not 0xcd
      
      The reasons leading to the final fsync of file foo, inode 258, not
      persisting the new data are the same as for the previous example with
      a rename operation.
      
      So fix by never skipping logging and log syncing when there are still any
      ordered extents in flight. To avoid making the conditional if statement
      that checks if logging an inode is needed harder to read, place all the
      logic into an helper function with separate if statements to make it more
      manageable and easier to read.
      
      A test case for fstests will follow soon.
      
      For NOCOW writes, the problem existed before commit b5e6c3e1
      ("btrfs: always wait on ordered extents at fsync time"), introduced in
      kernel 4.19, then it went away with that commit since we started to always
      wait for ordered extent completion before logging.
      
      The problem came back again once the fast fsync path was changed again to
      avoid waiting for ordered extent completion, in commit 48778179
      ("btrfs: make fast fsyncs wait only for writeback"), added in kernel 5.10.
      
      However, for COW writes, the race only happens after the recent
      commit 209ecbb8 ("btrfs: remove stale comment and logic from
      btrfs_inode_in_log()"), introduced in the 5.13 merge window. For NOCOW
      writes, the bug existed before that commit. So tag 5.10+ as the release
      for stable backports.
      
      CC: stable@vger.kernel.org # 5.10+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      626e9f41
    • Filipe Manana's avatar
      btrfs: do not consider send context as valid when trying to flush qgroups · ffb7c2e9
      Filipe Manana authored
      At qgroup.c:try_flush_qgroup() we are asserting that current->journal_info
      is either NULL or has the value BTRFS_SEND_TRANS_STUB.
      
      However allowing for BTRFS_SEND_TRANS_STUB makes no sense because:
      
      1) It is misleading, because send operations are read-only and do not
         ever need to reserve qgroup space;
      
      2) We already assert that current->journal_info != BTRFS_SEND_TRANS_STUB
         at transaction.c:start_transaction();
      
      3) On a kernel without CONFIG_BTRFS_ASSERT=y set, it would result in
         a crash if try_flush_qgroup() is ever called in a send context, because
         at transaction.c:start_transaction we cast current->journal_info into
         a struct btrfs_trans_handle pointer and then dereference it.
      
      So just do allow a send context at try_flush_qgroup() and update the
      comment about it.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ffb7c2e9
    • Filipe Manana's avatar
      btrfs: zoned: fix silent data loss after failure splitting ordered extent · adbd914d
      Filipe Manana authored
      On a zoned filesystem, sometimes we need to split an ordered extent into 3
      different ordered extents. The original ordered extent is shortened, at
      the front and at the rear, and we create two other new ordered extents to
      represent the trimmed parts of the original ordered extent.
      
      After adjusting the original ordered extent, we create an ordered extent
      to represent the pre-range, and that may fail with ENOMEM for example.
      After that we always try to create the ordered extent for the post-range,
      and if that happens to succeed we end up returning success to the caller
      as we overwrite the 'ret' variable which contained the previous error.
      
      This means we end up with a file range for which there is no ordered
      extent, which results in the range never getting a new file extent item
      pointing to the new data location. And since the split operation did
      not return an error, writeback does not fail and the inode's mapping is
      not flagged with an error, resulting in a subsequent fsync not reporting
      an error either.
      
      It's possibly very unlikely to have the creation of the post-range ordered
      extent succeed after the creation of the pre-range ordered extent failed,
      but it's not impossible.
      
      So fix this by making sure we only create the post-range ordered extent
      if there was no error creating the ordered extent for the pre-range.
      
      Fixes: d22002fd ("btrfs: zoned: split ordered extent when bio is sent")
      CC: stable@vger.kernel.org # 5.12+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      adbd914d
  4. 20 Apr, 2021 7 commits
    • Johannes Thumshirn's avatar
      btrfs: zoned: automatically reclaim zones · 18bb8bbf
      Johannes Thumshirn authored
      When a file gets deleted on a zoned file system, the space freed is not
      returned back into the block group's free space, but is migrated to
      zone_unusable.
      
      As this zone_unusable space is behind the current write pointer it is not
      possible to use it for new allocations. In the current implementation a
      zone is reset once all of the block group's space is accounted as zone
      unusable.
      
      This behaviour can lead to premature ENOSPC errors on a busy file system.
      
      Instead of only reclaiming the zone once it is completely unusable,
      kick off a reclaim job once the amount of unusable bytes exceeds a user
      configurable threshold between 51% and 100%. It can be set per mounted
      filesystem via the sysfs tunable bg_reclaim_threshold which is set to 75%
      by default.
      
      Similar to reclaiming unused block groups, these dirty block groups are
      added to a to_reclaim list and then on a transaction commit, the reclaim
      process is triggered but after we deleted unused block groups, which will
      free space for the relocation process.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      18bb8bbf
    • Johannes Thumshirn's avatar
      btrfs: rename delete_unused_bgs_mutex to reclaim_bgs_lock · f3372065
      Johannes Thumshirn authored
      As a preparation for extending the block group deletion use case, rename
      the unused_bgs_mutex to reclaim_bgs_lock.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f3372065
    • Johannes Thumshirn's avatar
      btrfs: zoned: reset zones of relocated block groups · 01e86008
      Johannes Thumshirn authored
      When relocating a block group the freed up space is not discarded in one
      big block, but each extent is discarded on its own with -odisard=sync.
      
      For a zoned filesystem we need to discard the whole block group at once,
      so btrfs_discard_extent() will translate the discard into a
      REQ_OP_ZONE_RESET operation, which then resets the device's zone.
      Failure to reset the zone is not fatal error.
      
      Discussion about the approach and regarding transaction blocking:
      https://lore.kernel.org/linux-btrfs/CAL3q7H4SjS_d5rBepfTMhU8Th3bJzdmyYd0g4Z60yUgC_rC_ZA@mail.gmail.com/Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      01e86008
    • Qu Wenruo's avatar
      btrfs: more graceful errors/warnings on 32bit systems when reaching limits · e9306ad4
      Qu Wenruo authored
      Btrfs uses internally mapped u64 address space for all its metadata.
      Due to the page cache limit on 32bit systems, btrfs can't access
      metadata at or beyond (ULONG_MAX + 1) << PAGE_SHIFT. See
      how MAX_LFS_FILESIZE and page::index are defined.  This is 16T for 4K
      page size while 256T for 64K page size.
      
      Users can have a filesystem which doesn't have metadata beyond the
      boundary at mount time, but later balance can cause it to create
      metadata beyond the boundary.
      
      And modification to MM layer is unrealistic just for such minor use
      case. We can't do more than to prevent mounting such filesystem or warn
      early when the numbers are still within the limits.
      
      To address such problem, this patch will introduce the following checks:
      
      - Mount time rejection
        This will reject any fs which has metadata chunk at or beyond the
        boundary.
      
      - Mount time early warning
        If there is any metadata chunk beyond 5/8th of the boundary, we do an
        early warning and hope the end user will see it.
      
      - Runtime extent buffer rejection
        If we're going to allocate an extent buffer at or beyond the boundary,
        reject such request with EOVERFLOW.
        This is definitely going to cause problems like transaction abort, but
        we have no better ways.
      
      - Runtime extent buffer early warning
        If an extent buffer beyond 5/8th of the max file size is allocated, do
        an early warning.
      
      Above error/warning message will only be printed once for each fs to
      reduce dmesg flood.
      
      If the mount is rejected, the filesystem will be mountable only on a
      64bit host.
      
      Link: https://lore.kernel.org/linux-btrfs/1783f16d-7a28-80e6-4c32-fdf19b705ed0@gmx.com/Reported-by: default avatarErik Jensen <erikjensen@rkjnsn.net>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e9306ad4
    • Filipe Manana's avatar
      btrfs: zoned: fix unpaired block group unfreeze during device replace · 0dc16ef4
      Filipe Manana authored
      When doing a device replace on a zoned filesystem, if we find a block
      group with ->to_copy == 0, we jump to the label 'done', which will result
      in later calling btrfs_unfreeze_block_group(), even though at this point
      we never called btrfs_freeze_block_group().
      
      Since at this point we have neither turned the block group to RO mode nor
      made any progress, we don't need to jump to the label 'done'. So fix this
      by jumping instead to the label 'skip' and dropping our reference on the
      block group before the jump.
      
      Fixes: 78ce9fc2 ("btrfs: zoned: mark block groups to copy for device-replace")
      CC: stable@vger.kernel.org # 5.12
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0dc16ef4
    • Filipe Manana's avatar
      btrfs: fix race when picking most recent mod log operation for an old root · f9690f42
      Filipe Manana authored
      Commit dbcc7d57 ("btrfs: fix race when cloning extent buffer during
      rewind of an old root"), fixed a race when we need to rewind the extent
      buffer of an old root. It was caused by picking a new mod log operation
      for the extent buffer while getting a cloned extent buffer with an outdated
      number of items (off by -1), because we cloned the extent buffer without
      locking it first.
      
      However there is still another similar race, but in the opposite direction.
      The cloned extent buffer has a number of items that does not match the
      number of tree mod log operations that are going to be replayed. This is
      because right after we got the last (most recent) tree mod log operation to
      replay and before locking and cloning the extent buffer, another task adds
      a new pointer to the extent buffer, which results in adding a new tree mod
      log operation and incrementing the number of items in the extent buffer.
      So after cloning we have mismatch between the number of items in the extent
      buffer and the number of mod log operations we are going to apply to it.
      This results in hitting a BUG_ON() that produces the following stack trace:
      
         ------------[ cut here ]------------
         kernel BUG at fs/btrfs/tree-mod-log.c:675!
         invalid opcode: 0000 [#1] SMP KASAN PTI
         CPU: 3 PID: 4811 Comm: crawl_1215 Tainted: G        W         5.12.0-7d1efdf501f8-misc-next+ #99
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
         RIP: 0010:tree_mod_log_rewind+0x3b1/0x3c0
         Code: 05 48 8d 74 10 (...)
         RSP: 0018:ffffc90001027090 EFLAGS: 00010293
         RAX: 0000000000000000 RBX: ffff8880a8514600 RCX: ffffffffaa9e59b6
         RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8880a851462c
         RBP: ffffc900010270e0 R08: 00000000000000c0 R09: ffffed1004333417
         R10: ffff88802199a0b7 R11: ffffed1004333416 R12: 000000000000000e
         R13: ffff888135af8748 R14: ffff88818766ff00 R15: ffff8880a851462c
         FS:  00007f29acf62700(0000) GS:ffff8881f2200000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 00007f0e6013f718 CR3: 000000010d42e003 CR4: 0000000000170ee0
         Call Trace:
          btrfs_get_old_root+0x16a/0x5c0
          ? lock_downgrade+0x400/0x400
          btrfs_search_old_slot+0x192/0x520
          ? btrfs_search_slot+0x1090/0x1090
          ? free_extent_buffer.part.61+0xd7/0x140
          ? free_extent_buffer+0x13/0x20
          resolve_indirect_refs+0x3e9/0xfc0
          ? lock_downgrade+0x400/0x400
          ? __kasan_check_read+0x11/0x20
          ? add_prelim_ref.part.11+0x150/0x150
          ? lock_downgrade+0x400/0x400
          ? __kasan_check_read+0x11/0x20
          ? lock_acquired+0xbb/0x620
          ? __kasan_check_write+0x14/0x20
          ? do_raw_spin_unlock+0xa8/0x140
          ? rb_insert_color+0x340/0x360
          ? prelim_ref_insert+0x12d/0x430
          find_parent_nodes+0x5c3/0x1830
          ? stack_trace_save+0x87/0xb0
          ? resolve_indirect_refs+0xfc0/0xfc0
          ? fs_reclaim_acquire+0x67/0xf0
          ? __kasan_check_read+0x11/0x20
          ? lockdep_hardirqs_on_prepare+0x210/0x210
          ? fs_reclaim_acquire+0x67/0xf0
          ? __kasan_check_read+0x11/0x20
          ? ___might_sleep+0x10f/0x1e0
          ? __kasan_kmalloc+0x9d/0xd0
          ? trace_hardirqs_on+0x55/0x120
          btrfs_find_all_roots_safe+0x142/0x1e0
          ? find_parent_nodes+0x1830/0x1830
          ? trace_hardirqs_on+0x55/0x120
          ? ulist_free+0x1f/0x30
          ? btrfs_inode_flags_to_xflags+0x50/0x50
          iterate_extent_inodes+0x20e/0x580
          ? tree_backref_for_extent+0x230/0x230
          ? release_extent_buffer+0x225/0x280
          ? read_extent_buffer+0xdd/0x110
          ? lock_downgrade+0x400/0x400
          ? __kasan_check_read+0x11/0x20
          ? lock_acquired+0xbb/0x620
          ? __kasan_check_write+0x14/0x20
          ? do_raw_spin_unlock+0xa8/0x140
          ? _raw_spin_unlock+0x22/0x30
          ? release_extent_buffer+0x225/0x280
          iterate_inodes_from_logical+0x129/0x170
          ? iterate_inodes_from_logical+0x129/0x170
          ? btrfs_inode_flags_to_xflags+0x50/0x50
          ? iterate_extent_inodes+0x580/0x580
          ? __vmalloc_node+0x92/0xb0
          ? init_data_container+0x34/0xb0
          ? init_data_container+0x34/0xb0
          ? kvmalloc_node+0x60/0x80
          btrfs_ioctl_logical_to_ino+0x158/0x230
          btrfs_ioctl+0x2038/0x4360
          ? __kasan_check_write+0x14/0x20
          ? mmput+0x3b/0x220
          ? btrfs_ioctl_get_supported_features+0x30/0x30
          ? __kasan_check_read+0x11/0x20
          ? __kasan_check_read+0x11/0x20
          ? lock_release+0xc8/0x650
          ? __might_fault+0x64/0xd0
          ? __kasan_check_read+0x11/0x20
          ? lock_downgrade+0x400/0x400
          ? lockdep_hardirqs_on_prepare+0x210/0x210
          ? lockdep_hardirqs_on_prepare+0x13/0x210
          ? _raw_spin_unlock_irqrestore+0x51/0x63
          ? __kasan_check_read+0x11/0x20
          ? do_vfs_ioctl+0xfc/0x9d0
          ? ioctl_file_clone+0xe0/0xe0
          ? lock_downgrade+0x400/0x400
          ? lockdep_hardirqs_on_prepare+0x210/0x210
          ? __kasan_check_read+0x11/0x20
          ? lock_release+0xc8/0x650
          ? __task_pid_nr_ns+0xd3/0x250
          ? __kasan_check_read+0x11/0x20
          ? __fget_files+0x160/0x230
          ? __fget_light+0xf2/0x110
          __x64_sys_ioctl+0xc3/0x100
          do_syscall_64+0x37/0x80
          entry_SYSCALL_64_after_hwframe+0x44/0xae
         RIP: 0033:0x7f29ae85b427
         Code: 00 00 90 48 8b (...)
         RSP: 002b:00007f29acf5fcf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
         RAX: ffffffffffffffda RBX: 00007f29acf5ff40 RCX: 00007f29ae85b427
         RDX: 00007f29acf5ff48 RSI: 00000000c038943b RDI: 0000000000000003
         RBP: 0000000001000000 R08: 0000000000000000 R09: 00007f29acf60120
         R10: 00005640d5fc7b00 R11: 0000000000000246 R12: 0000000000000003
         R13: 00007f29acf5ff48 R14: 00007f29acf5ff40 R15: 00007f29acf5fef8
         Modules linked in:
         ---[ end trace 85e5fce078dfbe04 ]---
      
        (gdb) l *(tree_mod_log_rewind+0x3b1)
        0xffffffff819e5b21 is in tree_mod_log_rewind (fs/btrfs/tree-mod-log.c:675).
        670                      * the modification. As we're going backwards, we do the
        671                      * opposite of each operation here.
        672                      */
        673                     switch (tm->op) {
        674                     case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
        675                             BUG_ON(tm->slot < n);
        676                             fallthrough;
        677                     case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING:
        678                     case BTRFS_MOD_LOG_KEY_REMOVE:
        679                             btrfs_set_node_key(eb, &tm->key, tm->slot);
        (gdb) quit
      
      The following steps explain in more detail how it happens:
      
      1) We have one tree mod log user (through fiemap or the logical ino ioctl),
         with a sequence number of 1, so we have fs_info->tree_mod_seq == 1.
         This is task A;
      
      2) Another task is at ctree.c:balance_level() and we have eb X currently as
         the root of the tree, and we promote its single child, eb Y, as the new
         root.
      
         Then, at ctree.c:balance_level(), we call:
      
            ret = btrfs_tree_mod_log_insert_root(root->node, child, true);
      
      3) At btrfs_tree_mod_log_insert_root() we create a tree mod log operation
         of type BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING, with a ->logical field
         pointing to ebX->start. We only have one item in eb X, so we create
         only one tree mod log operation, and store in the "tm_list" array;
      
      4) Then, still at btrfs_tree_mod_log_insert_root(), we create a tree mod
         log element of operation type BTRFS_MOD_LOG_ROOT_REPLACE, ->logical set
         to ebY->start, ->old_root.logical set to ebX->start, ->old_root.level
         set to the level of eb X and ->generation set to the generation of eb X;
      
      5) Then btrfs_tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
         "tm_list" as argument. After that, tree_mod_log_free_eb() calls
         tree_mod_log_insert(). This inserts the mod log operation of type
         BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING from step 3 into the rbtree
         with a sequence number of 2 (and fs_info->tree_mod_seq set to 2);
      
      6) Then, after inserting the "tm_list" single element into the tree mod
         log rbtree, the BTRFS_MOD_LOG_ROOT_REPLACE element is inserted, which
         gets the sequence number 3 (and fs_info->tree_mod_seq set to 3);
      
      7) Back to ctree.c:balance_level(), we free eb X by calling
         btrfs_free_tree_block() on it. Because eb X was created in the current
         transaction, has no other references and writeback did not happen for
         it, we add it back to the free space cache/tree;
      
      8) Later some other task B allocates the metadata extent from eb X, since
         it is marked as free space in the space cache/tree, and uses it as a
         node for some other btree;
      
      9) The tree mod log user task calls btrfs_search_old_slot(), which calls
         btrfs_get_old_root(), and finally that calls tree_mod_log_oldest_root()
         with time_seq == 1 and eb_root == eb Y;
      
      10) The first iteration of the while loop finds the tree mod log element
          with sequence number 3, for the logical address of eb Y and of type
          BTRFS_MOD_LOG_ROOT_REPLACE;
      
      11) Because the operation type is BTRFS_MOD_LOG_ROOT_REPLACE, we don't
          break out of the loop, and set root_logical to point to
          tm->old_root.logical, which corresponds to the logical address of
          eb X;
      
      12) On the next iteration of the while loop, the call to
          tree_mod_log_search_oldest() returns the smallest tree mod log element
          for the logical address of eb X, which has a sequence number of 2, an
          operation type of BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and
          corresponds to the old slot 0 of eb X (eb X had only 1 item in it
          before being freed at step 7);
      
      13) We then break out of the while loop and return the tree mod log
          operation of type BTRFS_MOD_LOG_ROOT_REPLACE (eb Y), and not the one
          for slot 0 of eb X, to btrfs_get_old_root();
      
      14) At btrfs_get_old_root(), we process the BTRFS_MOD_LOG_ROOT_REPLACE
          operation and set "logical" to the logical address of eb X, which was
          the old root. We then call tree_mod_log_search() passing it the logical
          address of eb X and time_seq == 1;
      
      15) But before calling tree_mod_log_search(), task B locks eb X, adds a
          key to eb X, which results in adding a tree mod log operation of type
          BTRFS_MOD_LOG_KEY_ADD, with a sequence number of 4, to the tree mod
          log, and increments the number of items in eb X from 0 to 1.
          Now fs_info->tree_mod_seq has a value of 4;
      
      16) Task A then calls tree_mod_log_search(), which returns the most recent
          tree mod log operation for eb X, which is the one just added by task B
          at the previous step, with a sequence number of 4, a type of
          BTRFS_MOD_LOG_KEY_ADD and for slot 0;
      
      17) Before task A locks and clones eb X, task A adds another key to eb X,
          which results in adding a new BTRFS_MOD_LOG_KEY_ADD mod log operation,
          with a sequence number of 5, for slot 1 of eb X, increments the
          number of items in eb X from 1 to 2, and unlocks eb X.
          Now fs_info->tree_mod_seq has a value of 5;
      
      18) Task A then locks eb X and clones it. The clone has a value of 2 for
          the number of items and the pointer "tm" points to the tree mod log
          operation with sequence number 4, not the most recent one with a
          sequence number of 5, so there is mismatch between the number of
          mod log operations that are going to be applied to the cloned version
          of eb X and the number of items in the clone;
      
      19) Task A then calls tree_mod_log_rewind() with the clone of eb X, the
          tree mod log operation with sequence number 4 and a type of
          BTRFS_MOD_LOG_KEY_ADD, and time_seq == 1;
      
      20) At tree_mod_log_rewind(), we set the local variable "n" with a value
          of 2, which is the number of items in the clone of eb X.
      
          Then in the first iteration of the while loop, we process the mod log
          operation with sequence number 4, which is targeted at slot 0 and has
          a type of BTRFS_MOD_LOG_KEY_ADD. This results in decrementing "n" from
          2 to 1.
      
          Then we pick the next tree mod log operation for eb X, which is the
          tree mod log operation with a sequence number of 2, a type of
          BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and for slot 0, it is the one
          added in step 5 to the tree mod log tree.
      
          We go back to the top of the loop to process this mod log operation,
          and because its slot is 0 and "n" has a value of 1, we hit the BUG_ON:
      
              (...)
              switch (tm->op) {
              case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
                      BUG_ON(tm->slot < n);
                      fallthrough;
      	(...)
      
      Fix this by checking for a more recent tree mod log operation after locking
      and cloning the extent buffer of the old root node, and use it as the first
      operation to apply to the cloned extent buffer when rewinding it.
      
      Stable backport notes: due to moved code and renames, in =< 5.11 the
      change should be applied to ctree.c:get_old_root.
      Reported-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Link: https://lore.kernel.org/linux-btrfs/20210404040732.GZ32440@hungrycats.org/
      Fixes: 834328a8 ("Btrfs: tree mod log's old roots could still be part of the tree")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f9690f42
    • Filipe Manana's avatar
      btrfs: fix metadata extent leak after failure to create subvolume · 67addf29
      Filipe Manana authored
      When creating a subvolume we allocate an extent buffer for its root node
      after starting a transaction. We setup a root item for the subvolume that
      points to that extent buffer and then attempt to insert the root item into
      the root tree - however if that fails, due to ENOMEM for example, we do
      not free the extent buffer previously allocated and we do not abort the
      transaction (as at that point we did nothing that can not be undone).
      
      This means that we effectively do not return the metadata extent back to
      the free space cache/tree and we leave a delayed reference for it which
      causes a metadata extent item to be added to the extent tree, in the next
      transaction commit, without having backreferences. When this happens
      'btrfs check' reports the following:
      
        $ btrfs check /dev/sdi
        Opening filesystem to check...
        Checking filesystem on /dev/sdi
        UUID: dce2cb9d-025f-4b05-a4bf-cee0ad3785eb
        [1/7] checking root items
        [2/7] checking extents
        ref mismatch on [30425088 16384] extent item 1, found 0
        backref 30425088 root 256 not referenced back 0x564a91c23d70
        incorrect global backref count on 30425088 found 1 wanted 0
        backpointer mismatch on [30425088 16384]
        owner ref check failed [30425088 16384]
        ERROR: errors found in extent allocation tree or chunk allocation
        [3/7] checking free space cache
        [4/7] checking fs roots
        [5/7] checking only csums items (without verifying data)
        [6/7] checking root refs
        [7/7] checking quota groups skipped (not enabled on this FS)
        found 212992 bytes used, error(s) found
        total csum bytes: 0
        total tree bytes: 131072
        total fs tree bytes: 32768
        total extent tree bytes: 16384
        btree space waste bytes: 124669
        file data blocks allocated: 65536
         referenced 65536
      
      So fix this by freeing the metadata extent if btrfs_insert_root() returns
      an error.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      67addf29
  5. 19 Apr, 2021 22 commits
    • Qu Wenruo's avatar
      btrfs: handle remount to no compress during compression · 1d8ba9e7
      Qu Wenruo authored
      [BUG]
      When running btrfs/071 with inode_need_compress() removed from
      compress_file_range(), we got the following crash:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000018
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
        RIP: 0010:compress_file_range+0x476/0x7b0 [btrfs]
        Call Trace:
         ? submit_compressed_extents+0x450/0x450 [btrfs]
         async_cow_start+0x16/0x40 [btrfs]
         btrfs_work_helper+0xf2/0x3e0 [btrfs]
         process_one_work+0x278/0x5e0
         worker_thread+0x55/0x400
         ? process_one_work+0x5e0/0x5e0
         kthread+0x168/0x190
         ? kthread_create_worker_on_cpu+0x70/0x70
         ret_from_fork+0x22/0x30
        ---[ end trace 65faf4eae941fa7d ]---
      
      This is already after the patch "btrfs: inode: fix NULL pointer
      dereference if inode doesn't need compression."
      
      [CAUSE]
      @pages is firstly created by kcalloc() in compress_file_extent():
                      pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
      
      Then passed to btrfs_compress_pages() to be utilized there:
      
                      ret = btrfs_compress_pages(...
                                                 pages,
                                                 &nr_pages,
                                                 ...);
      
      btrfs_compress_pages() will initialize each page as output, in
      zlib_compress_pages() we have:
      
                              pages[nr_pages] = out_page;
                              nr_pages++;
      
      Normally this is completely fine, but there is a special case which
      is in btrfs_compress_pages() itself:
      
              switch (type) {
              default:
                      return -E2BIG;
              }
      
      In this case, we didn't modify @pages nor @out_pages, leaving them
      untouched, then when we cleanup pages, the we can hit NULL pointer
      dereference again:
      
              if (pages) {
                      for (i = 0; i < nr_pages; i++) {
                              WARN_ON(pages[i]->mapping);
                              put_page(pages[i]);
                      }
              ...
              }
      
      Since pages[i] are all initialized to zero, and btrfs_compress_pages()
      doesn't change them at all, accessing pages[i]->mapping would lead to
      NULL pointer dereference.
      
      This is not possible for current kernel, as we check
      inode_need_compress() before doing pages allocation.
      But if we're going to remove that inode_need_compress() in
      compress_file_extent(), then it's going to be a problem.
      
      [FIX]
      When btrfs_compress_pages() hits its default case, modify @out_pages to
      0 to prevent such problem from happening.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212331
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1d8ba9e7
    • Johannes Thumshirn's avatar
      btrfs: zoned: fail mount if the device does not support zone append · 1d68128c
      Johannes Thumshirn authored
      For zoned btrfs, zone append is mandatory to write to a sequential write
      only zone, otherwise parallel writes to the same zone could result in
      unaligned write errors.
      
      If a zoned block device does not support zone append (e.g. a dm-crypt
      zoned device using a non-NULL IV cypher), fail to mount.
      
      CC: stable@vger.kernel.org # 5.12
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1d68128c
    • Filipe Manana's avatar
      btrfs: fix race between transaction aborts and fsyncs leading to use-after-free · 061dde82
      Filipe Manana authored
      There is a race between a task aborting a transaction during a commit,
      a task doing an fsync and the transaction kthread, which leads to an
      use-after-free of the log root tree. When this happens, it results in a
      stack trace like the following:
      
        BTRFS info (device dm-0): forced readonly
        BTRFS warning (device dm-0): Skipping commit of aborted transaction.
        BTRFS: error (device dm-0) in cleanup_transaction:1958: errno=-5 IO failure
        BTRFS warning (device dm-0): lost page write due to IO error on /dev/mapper/error-test (-5)
        BTRFS warning (device dm-0): Skipping commit of aborted transaction.
        BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0xa4e8 len 4096 err no 10
        BTRFS error (device dm-0): error writing primary super block to device 1
        BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e000 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e008 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e010 len 4096 err no 10
        BTRFS: error (device dm-0) in write_all_supers:4110: errno=-5 IO failure (1 errors while writing supers)
        BTRFS: error (device dm-0) in btrfs_sync_log:3308: errno=-5 IO failure
        general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b68: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        CPU: 2 PID: 2458471 Comm: fsstress Not tainted 5.12.0-rc5-btrfs-next-84 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        RIP: 0010:__mutex_lock+0x139/0xa40
        Code: c0 74 19 (...)
        RSP: 0018:ffff9f18830d7b00 EFLAGS: 00010202
        RAX: 6b6b6b6b6b6b6b68 RBX: 0000000000000001 RCX: 0000000000000002
        RDX: ffffffffb9c54d13 RSI: 0000000000000000 RDI: 0000000000000000
        RBP: ffff9f18830d7bc0 R08: 0000000000000000 R09: 0000000000000000
        R10: ffff9f18830d7be0 R11: 0000000000000001 R12: ffff8c6cd199c040
        R13: ffff8c6c95821358 R14: 00000000fffffffb R15: ffff8c6cbcf01358
        FS:  00007fa9140c2b80(0000) GS:ffff8c6fac600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fa913d52000 CR3: 000000013d2b4003 CR4: 0000000000370ee0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         ? __btrfs_handle_fs_error+0xde/0x146 [btrfs]
         ? btrfs_sync_log+0x7c1/0xf20 [btrfs]
         ? btrfs_sync_log+0x7c1/0xf20 [btrfs]
         btrfs_sync_log+0x7c1/0xf20 [btrfs]
         btrfs_sync_file+0x40c/0x580 [btrfs]
         do_fsync+0x38/0x70
         __x64_sys_fsync+0x10/0x20
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x7fa9142a55c3
        Code: 8b 15 09 (...)
        RSP: 002b:00007fff26278d48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
        RAX: ffffffffffffffda RBX: 0000563c83cb4560 RCX: 00007fa9142a55c3
        RDX: 00007fff26278cb0 RSI: 00007fff26278cb0 RDI: 0000000000000005
        RBP: 0000000000000005 R08: 0000000000000001 R09: 00007fff26278d5c
        R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000340
        R13: 00007fff26278de0 R14: 00007fff26278d96 R15: 0000563c83ca57c0
        Modules linked in: btrfs dm_zero dm_snapshot dm_thin_pool (...)
        ---[ end trace ee2f1b19327d791d ]---
      
      The steps that lead to this crash are the following:
      
      1) We are at transaction N;
      
      2) We have two tasks with a transaction handle attached to transaction N.
         Task A and Task B. Task B is doing an fsync;
      
      3) Task B is at btrfs_sync_log(), and has saved fs_info->log_root_tree
         into a local variable named 'log_root_tree' at the top of
         btrfs_sync_log(). Task B is about to call write_all_supers(), but
         before that...
      
      4) Task A calls btrfs_commit_transaction(), and after it sets the
         transaction state to TRANS_STATE_COMMIT_START, an error happens before
         it waits for the transaction's 'num_writers' counter to reach a value
         of 1 (no one else attached to the transaction), so it jumps to the
         label "cleanup_transaction";
      
      5) Task A then calls cleanup_transaction(), where it aborts the
         transaction, setting BTRFS_FS_STATE_TRANS_ABORTED on fs_info->fs_state,
         setting the ->aborted field of the transaction and the handle to an
         errno value and also setting BTRFS_FS_STATE_ERROR on fs_info->fs_state.
      
         After that, at cleanup_transaction(), it deletes the transaction from
         the list of transactions (fs_info->trans_list), sets the transaction
         to the state TRANS_STATE_COMMIT_DOING and then waits for the number
         of writers to go down to 1, as it's currently 2 (1 for task A and 1
         for task B);
      
      6) The transaction kthread is running and sees that BTRFS_FS_STATE_ERROR
         is set in fs_info->fs_state, so it calls btrfs_cleanup_transaction().
      
         There it sees the list fs_info->trans_list is empty, and then proceeds
         into calling btrfs_drop_all_logs(), which frees the log root tree with
         a call to btrfs_free_log_root_tree();
      
      7) Task B calls write_all_supers() and, shortly after, under the label
         'out_wake_log_root', it deferences the pointer stored in
         'log_root_tree', which was already freed in the previous step by the
         transaction kthread. This results in a use-after-free leading to a
         crash.
      
      Fix this by deleting the transaction from the list of transactions at
      cleanup_transaction() only after setting the transaction state to
      TRANS_STATE_COMMIT_DOING and waiting for all existing tasks that are
      attached to the transaction to release their transaction handles.
      This makes the transaction kthread wait for all the tasks attached to
      the transaction to be done with the transaction before dropping the
      log roots and doing other cleanups.
      
      Fixes: ef67963d ("btrfs: drop logs when we've aborted a transaction")
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      061dde82
    • Qu Wenruo's avatar
      btrfs: introduce submit_eb_subpage() to submit a subpage metadata page · c4aec299
      Qu Wenruo authored
      The new function, submit_eb_subpage(), will submit all the dirty extent
      buffers in the page.
      
      The major difference between submit_eb_page() and submit_eb_subpage()
      is:
      - How to grab extent buffer
        Now we use find_extent_buffer_nospinlock() other than using
        page::private.
      
      All other different handling is already done in functions like
      lock_extent_buffer_for_io() and write_one_eb().
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c4aec299
    • Qu Wenruo's avatar
      btrfs: make lock_extent_buffer_for_io() to be subpage compatible · f3156df9
      Qu Wenruo authored
      For subpage metadata, we don't use page locking at all.  So just skip
      the page locking part for subpage.  The rest of the function can be
      reused.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f3156df9
    • Qu Wenruo's avatar
      btrfs: introduce write_one_subpage_eb() function · 35b6ddfa
      Qu Wenruo authored
      The new function, write_one_subpage_eb(), as a subroutine for subpage
      metadata write, will handle the extent buffer bio submission.
      
      The major differences between the new write_one_subpage_eb() and
      write_one_eb() is:
      
      - No page locking
        When entering write_one_subpage_eb() the page is no longer locked.
        We only lock the page for its status update, and unlock immediately.
        Now we completely rely on extent io tree locking.
      
      - Extra bitmap update along with page status update
        Now page dirty and writeback is controlled by
        btrfs_subpage::dirty_bitmap and btrfs_subpage::writeback_bitmap.
        They both follow the schema that any sector is dirty/writeback, then
        the full page gets dirty/writeback.
      
      - When to update the nr_written number
        Now we take a shortcut, if we have cleared the last dirty bit of the
        page, we update nr_written.
        This is not completely perfect, but should emulate the old behavior
        well enough.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      35b6ddfa
    • Qu Wenruo's avatar
      btrfs: introduce end_bio_subpage_eb_writepage() function · 2f3186d8
      Qu Wenruo authored
      The new function, end_bio_subpage_eb_writepage(), will handle the
      metadata writeback endio.
      
      The major differences involved are:
      
      - How to grab extent buffer
        Now page::private is a pointer to btrfs_subpage, we can no longer grab
        extent buffer directly.
        Thus we need to use the bv_offset to locate the extent buffer manually
        and iterate through the whole range.
      
      - Use btrfs_subpage_end_writeback() caller
        This helper will handle the subpage writeback for us.
      
      Since this function is executed under endio context, when grabbing
      extent buffers it can't grab eb->refs_lock as that lock is not designed
      to be grabbed under hardirq context.
      
      So here introduce a helper, find_extent_buffer_nolock(), for such
      situation, and convert find_extent_buffer() to use that helper.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2f3186d8
    • Josef Bacik's avatar
      btrfs: check return value of btrfs_commit_transaction in relocation · fb686c68
      Josef Bacik authored
      There are a few places where we don't check the return value of
      btrfs_commit_transaction in relocation.c.  Thankfully all these places
      have straightforward error handling, so simply change all of the sites
      at once.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fb686c68
    • Josef Bacik's avatar
      btrfs: do proper error handling in merge_reloc_roots · 24213fa4
      Josef Bacik authored
      We have a BUG_ON() if we get an error back from btrfs_get_fs_root().
      This honestly should never fail, as at this point we have a solid
      coordination of fs root to reloc root, and these roots will all be in
      memory.  But in the name of killing BUG_ON()'s remove these and handle
      the error condition properly, ASSERT()'ing for developers.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      24213fa4
    • Josef Bacik's avatar
      btrfs: handle extent corruption with select_one_root properly · 8717cf44
      Josef Bacik authored
      In corruption cases we could have paths from a block up to no root at
      all, and thus we'll BUG_ON(!root) in select_one_root.  Handle this by
      adding an ASSERT() for developers, and returning an error for normal
      users.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8717cf44
    • Josef Bacik's avatar
      btrfs: cleanup error handling in prepare_to_merge · e0b085b0
      Josef Bacik authored
      This probably can't happen even with a corrupt file system, because we
      would have failed much earlier on than here.  However there's no reason
      we can't just check and bail out as appropriate, so do that and convert
      the correctness BUG_ON() to an ASSERT().
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add comment ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e0b085b0
    • Josef Bacik's avatar
      btrfs: do not panic in __add_reloc_root · 57a304cf
      Josef Bacik authored
      If we have a duplicate entry for a reloc root then we could have fs
      corruption that resulted in a double allocation.  Since this shouldn't
      happen unless there is corruption, add an ASSERT(ret != -EEXIST) to all
      of the callers of __add_reloc_root() to catch any logic mistakes for
      developers, otherwise normal error handling will happen for normal
      users.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      57a304cf
    • Josef Bacik's avatar
      btrfs: handle __add_reloc_root failures in btrfs_recover_relocation · 3c925863
      Josef Bacik authored
      We can already handle errors appropriately from this function, deal with
      an error coming from __add_reloc_root appropriately.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add comment ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3c925863
    • Josef Bacik's avatar
      btrfs: do proper error handling in create_reloc_inode · 790c1b8c
      Josef Bacik authored
      We already handle some errors in this function, and the callers do the
      correct error handling, so clean up the rest of the function to do the
      appropriate error handling.
      
      There's a little extra work that needs to be done here, as we create the
      inode item before we create the orphan item.  We could potentially add
      the orphan item, but if we failed to create the inode item we would have
      to abort the transaction.
      
      Instead add a helper to delete the inode item we created in the case
      that we're unable to look up the inode (this would likely be caused by
      an ENOMEM), which if it succeeds means we can avoid a transaction abort
      in this particular error case.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      790c1b8c
    • Josef Bacik's avatar
      btrfs: remove the extent item sanity checks in relocate_block_group · 24cd6389
      Josef Bacik authored
      These checks are all taken care of for us by the tree checker code:
      
      - the flags don't change or are updated consistently
      - the v0 extent item format is invalid and caught in many other places
        too
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      24cd6389
    • Josef Bacik's avatar
      btrfs: tree-checker: check for BTRFS_BLOCK_FLAG_FULL_BACKREF being set improperly · 0ebb6bbb
      Josef Bacik authored
      We need to validate that a data extent item does not have the
      FULL_BACKREF flag set on its flags.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0ebb6bbb
    • Josef Bacik's avatar
      btrfs: handle extent reference errors in do_relocation · eb6b7fb4
      Josef Bacik authored
      We can already deal with errors appropriately from do_relocation, simply
      handle any errors that come from changing the refs at this point
      cleanly.  We have to abort the transaction if we fail here as we've
      modified metadata at this point.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      eb6b7fb4
    • Josef Bacik's avatar
      btrfs: handle errors in reference count manipulation in replace_path · 253e258c
      Josef Bacik authored
      If any of the reference count manipulation stuff fails in replace_path
      we need to abort the transaction, as we've modified the blocks already.
      We can simply break at this point and everything will be cleaned up.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      253e258c
    • Josef Bacik's avatar
      btrfs: handle btrfs_search_slot failure in replace_path · 0e9873e2
      Josef Bacik authored
      The search can fail for various reasons, in case of errors there's no
      cleanup to be done so we can pass the error to the caller, adjusting for
      the case where the key is not found and search slot returns 1.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0e9873e2
    • Josef Bacik's avatar
      btrfs: handle btrfs_cow_block errors in replace_path · 45b87c5d
      Josef Bacik authored
      If we error out COWing the root node when doing a replace_path then we
      simply unlock and free the buffer and return the error.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      45b87c5d
    • Josef Bacik's avatar
      btrfs: convert logic BUG_ON()'s in replace_path to ASSERT()'s · 7a9213a9
      Josef Bacik authored
      A few BUG_ON()'s in replace_path are purely to keep us from making
      logical mistakes, so replace them with ASSERT()'s.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7a9213a9
    • Josef Bacik's avatar
      btrfs: do proper error handling in btrfs_update_reloc_root · 592fbcd5
      Josef Bacik authored
      We call btrfs_update_root in btrfs_update_reloc_root, which can fail for
      all sorts of reasons, including IO errors.  Instead of panicing the box
      lets return the error, now that all callers properly handle those
      errors.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      592fbcd5