    btrfs: fix race between ordered extent completion and fiemap · a1a4a9ca
    Filipe Manana authored
    For fiemap we recently stopped locking the target extent range for the
    whole duration of the fiemap call, in order to avoid a deadlock in a
    scenario where the fiemap buffer happens to be a memory mapped range of
    the same file. This use case is very unlikely to be useful in practice but
    it may be triggered by fuzz testing (syzbot, etc).
    
    However, by not locking the target extent range for the whole duration of
    the fiemap call we can race with an ordered extent. This happens like
    this:
    
    1) The fiemap task finishes processing a file extent item that covers
       the file range [512K, 1M[, and that file extent item is the last item
       in the leaf currently being processed;
    
    2) An ordered extent for the file range [768K, 2M[, in COW mode,
       completes (btrfs_finish_one_ordered()) and the file extent item
       covering the range [512K, 1M[ is trimmed to cover the range
       [512K, 768K[ and then a new file extent item for the range [768K, 2M[
       is inserted in the inode's subvolume tree;
    
    3) The fiemap task calls fiemap_next_leaf_item(), which then calls
       btrfs_next_leaf() to find the next leaf / item. This finds that the
       next key following the one we previously processed (its type is
       BTRFS_EXTENT_DATA_KEY and its offset is 512K), is the key corresponding
       to the new file extent item inserted by the ordered extent, which has
       a type of BTRFS_EXTENT_DATA_KEY and an offset of 768K;
    
    4) Later the fiemap code ends up at emit_fiemap_extent() and triggers
       the warning:
    
          if (cache->offset + cache->len > offset) {
                  WARN_ON(1);
                  return -EINVAL;
          }
    
       The check fires because 1M > 768K: the previously emitted entry for the
       old extent covering the file range [512K, 1M[ ends at an offset that
       is greater than the new extent's start offset (768K). This makes fiemap
       fail with -EINVAL besides triggering the warning that produces a stack
       trace like the following:
    
         [1621.677651] ------------[ cut here ]------------
         [1621.677656] WARNING: CPU: 1 PID: 204366 at fs/btrfs/extent_io.c:2492 emit_fiemap_extent+0x84/0x90 [btrfs]
         [1621.677899] Modules linked in: btrfs blake2b_generic (...)
         [1621.677951] CPU: 1 PID: 204366 Comm: pool Not tainted 6.8.0-rc5-btrfs-next-151+ #1
         [1621.677954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
         [1621.677956] RIP: 0010:emit_fiemap_extent+0x84/0x90 [btrfs]
         [1621.678033] Code: 2b 4c 89 63 (...)
         [1621.678035] RSP: 0018:ffffab16089ffd20 EFLAGS: 00010206
         [1621.678037] RAX: 00000000004fa000 RBX: ffffab16089ffe08 RCX: 0000000000009000
         [1621.678039] RDX: 00000000004f9000 RSI: 00000000004f1000 RDI: ffffab16089ffe90
         [1621.678040] RBP: 00000000004f9000 R08: 0000000000001000 R09: 0000000000000000
         [1621.678041] R10: 0000000000000000 R11: 0000000000001000 R12: 0000000041d78000
         [1621.678043] R13: 0000000000001000 R14: 0000000000000000 R15: ffff9434f0b17850
         [1621.678044] FS:  00007fa6e20006c0(0000) GS:ffff943bdfa40000(0000) knlGS:0000000000000000
         [1621.678046] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         [1621.678048] CR2: 00007fa6b0801000 CR3: 000000012d404002 CR4: 0000000000370ef0
         [1621.678053] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         [1621.678055] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         [1621.678056] Call Trace:
         [1621.678074]  <TASK>
         [1621.678076]  ? __warn+0x80/0x130
         [1621.678082]  ? emit_fiemap_extent+0x84/0x90 [btrfs]
         [1621.678159]  ? report_bug+0x1f4/0x200
         [1621.678164]  ? handle_bug+0x42/0x70
         [1621.678167]  ? exc_invalid_op+0x14/0x70
         [1621.678170]  ? asm_exc_invalid_op+0x16/0x20
         [1621.678178]  ? emit_fiemap_extent+0x84/0x90 [btrfs]
         [1621.678253]  extent_fiemap+0x766/0xa30 [btrfs]
         [1621.678339]  btrfs_fiemap+0x45/0x80 [btrfs]
         [1621.678420]  do_vfs_ioctl+0x1e4/0x870
         [1621.678431]  __x64_sys_ioctl+0x6a/0xc0
         [1621.678434]  do_syscall_64+0x52/0x120
         [1621.678445]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
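
    The arithmetic behind step 4 can be reproduced in user space. The
    following is only a sketch of the quoted condition: struct cached_extent,
    overlaps_next() and the KiB() macro are hypothetical names used for
    illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KiB(x) ((uint64_t)(x) * 1024)

/* Hypothetical, simplified stand-in for the cached fiemap entry. */
struct cached_extent {
	uint64_t offset; /* start of the cached range, in bytes */
	uint64_t len;    /* length of the cached range, in bytes */
};

/*
 * Mirrors the condition quoted above: returns true when the cached
 * range ends past the start offset of the extent found next.
 */
static bool overlaps_next(const struct cached_extent *cache, uint64_t offset)
{
	assert(cache->len > 0);
	return cache->offset + cache->len > offset;
}
```

    With the cached range [512K, 1M[ and a next extent starting at 768K the
    helper returns true, which corresponds to the path that hits WARN_ON(1).
    With a next extent starting at 1M it returns false.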
    
    There's also another case where before calling btrfs_next_leaf() we are
    processing a hole or a prealloc extent and we had several delalloc ranges
    within that hole or prealloc extent. In that case if the ordered extents
    complete before we find the next key, we may end up finding an extent item
    with an offset smaller than (or equal to) the offset in cache->offset.
    
    So fix this by changing emit_fiemap_extent() to address these three
    scenarios like this:
    
    1) For the first case, steps listed above, adjust the length of the
       previously cached extent so that it does not overlap with the current
       extent, emit the previous one and cache the current file extent item;
    
    2) For the second case where we had a hole or prealloc extent with
       multiple delalloc ranges inside the hole or prealloc extent's range,
       and the current file extent item has an offset that matches the offset
       in the fiemap cache, just discard what we have in the fiemap cache and
       assign the current file extent item to the cache, since it's more up
       to date;
    
    3) For the third case where we had a hole or prealloc extent with
       multiple delalloc ranges inside the hole or prealloc extent's range
       and the offset of the file extent item we just found is smaller than
       what we have in the cache, just skip the current file extent item
       if its range ends at or before the cached extent's end, because we may
       have emitted (to the fiemap user space buffer) delalloc ranges that
       overlap with the current file extent item's range. If the file extent
       item's range goes beyond the end offset of the cached extent, just
       emit the cached extent and cache a subrange of the file extent item,
       that goes from the end offset of the cached extent to the end offset
       of the file extent item.
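
    The three cases above can be modelled in user space as follows. This is
    a simplified sketch and not the actual patch: it tracks logical ranges
    only (no physical addresses, flags or merging of adjacent extents), and
    struct range, struct fiemap_model and model_emit_extent() are
    hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_EMITTED 16

/* Hypothetical simplified extent: logical range only. */
struct range {
	uint64_t offset;
	uint64_t len;
};

/* Hypothetical stand-in for the fiemap cache plus the user space buffer. */
struct fiemap_model {
	struct range cache;   /* pending entry, not yet emitted */
	int cached;           /* non-zero when 'cache' holds an entry */
	struct range emitted[MAX_EMITTED];
	size_t nr_emitted;
};

static void emit_cached(struct fiemap_model *m)
{
	assert(m->nr_emitted < MAX_EMITTED);
	m->emitted[m->nr_emitted++] = m->cache;
	m->cached = 0;
}

static void set_cache(struct fiemap_model *m, uint64_t offset, uint64_t len)
{
	m->cache.offset = offset;
	m->cache.len = len;
	m->cached = 1;
}

/* Feed one file extent item [offset, offset + len[ into the model. */
static void model_emit_extent(struct fiemap_model *m, uint64_t offset,
			      uint64_t len)
{
	uint64_t cache_end;

	if (!m->cached) {
		set_cache(m, offset, len);
		return;
	}

	cache_end = m->cache.offset + m->cache.len;
	if (cache_end > offset) {
		if (m->cache.offset == offset) {
			/* Case 2: same start, the item is newer, replace the cache. */
			set_cache(m, offset, len);
			return;
		}
		if (m->cache.offset > offset) {
			/* Case 3: the item starts before the cached extent. */
			uint64_t range_end = offset + len;

			if (range_end <= cache_end)
				return; /* fully covered already, skip it */
			/* Keep only the part past the cached extent's end. */
			offset = cache_end;
			len = range_end - cache_end;
		} else {
			/* Case 1: trim the cached extent so it ends at 'offset'. */
			m->cache.len = offset - m->cache.offset;
		}
	}
	emit_cached(m);
	set_cache(m, offset, len);
}
```

    Feeding [512K, 1M[ followed by [768K, 2M[ into a zero-initialized model
    emits [512K, 768K[ and leaves [768K, 2M[ in the cache, which matches the
    outcome described for the first case.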
    
    Dealing with those cases in those ways makes everything consistent by
    reflecting the current state of file extent items in the btree and
    without emitting extents that have overlapping ranges (which would be
    confusing and would violate expectations).
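
    The non-overlap expectation can be checked from user space over a list
    of returned extents. The helper below is a sketch under the assumption
    that extents are reported in increasing offset order; struct
    emitted_extent and extents_are_disjoint() are hypothetical names, not
    part of the fiemap API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified view of one reported extent. */
struct emitted_extent {
	uint64_t offset; /* logical start, in bytes */
	uint64_t len;    /* length, in bytes */
};

/*
 * Returns true when every extent starts at or after the end of the
 * previous one, that is, the list is sorted and has no overlaps.
 */
static bool extents_are_disjoint(const struct emitted_extent *ext, size_t nr)
{
	assert(ext != NULL || nr == 0);

	for (size_t i = 1; i < nr; i++) {
		uint64_t prev_end = ext[i - 1].offset + ext[i - 1].len;

		if (ext[i].offset < prev_end)
			return false;
	}
	return true;
}
```

    For the race described earlier, the pair [512K, 1M[ and [768K, 2M[ fails
    this check, while [512K, 768K[ and [768K, 2M[ passes it.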
    
    This issue could be triggered often with test case generic/561, and was
    also hit and reported by Wang Yugui.
    Reported-by: Wang Yugui <wangyugui@e16-tech.com>
    Link: https://lore.kernel.org/linux-btrfs/20240223104619.701F.409509F4@e16-tech.com/
    Fixes: b0ad381f ("btrfs: fix deadlock with fiemap and extent locking")
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
extent_io.c 141 KB