    btrfs: fix processing of delayed tree block refs during backref walking · 943553ef
    Filipe Manana authored
    During backref walking, when processing a delayed reference with a type of
    BTRFS_TREE_BLOCK_REF_KEY, there are two bugs:
    
    1) We are accessing the delayed ref head's extent_op, and its key, without
       the protection of the delayed ref head's lock;
    
    2) If there's no extent op for the delayed ref head, we end up with an
       uninitialized key in the stack, variable 'tmp_op_key', and then pass
       it to add_indirect_ref(), which adds the reference to the indirect
       refs rb tree.
    
       This is wrong, because indirect references should have a NULL key
       when we don't have access to the key, and in that case they should be
       added to the indirect_missing_keys rb tree and not to the indirect rb
       tree.
    
       This means that if we have a BTRFS_TREE_BLOCK_REF_KEY delayed ref resulting
       from freeing an extent buffer, therefore with a count of -1, it will
       not cancel out the corresponding reference we have in the extent tree
       (with a count of 1), since both references end up in different rb
       trees.
    
       When using fiemap, where we often need to check if extents are shared
       through shared subtrees resulting from snapshots, it means we can
       incorrectly report an extent as shared when it's no longer shared.
       However this is temporary because after the transaction is committed
       the extent is no longer reported as shared, as running the delayed
       reference results in deleting the tree block reference from the extent
       tree.
    
       Outside the fiemap context, the result is unpredictable, as the key was
       not initialized but it's used when navigating the rb trees to insert
       and search for references (prelim_ref_compare()), and we expect all
       references in the indirect rb tree to have valid keys.
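
    The cancellation problem described in 2) can be illustrated with a toy
    model. This is only a sketch and not kernel code: the function name
    leftover_refs and the "<tree>:<count>" notation are made up for the
    illustration, and it assumes that the extent tree reference (count 1)
    lands in the indirect_missing_keys tree, that references are merged by
    summing counts within a single tree, and that any positive leftover
    count means the tree block is still seen as referenced.

```shell
#!/bin/sh
# Toy model, NOT kernel code: references to one tree block are given as
# "<tree>:<count>" and merged per tree by summing their counts.
# "missing_keys" stands for the indirect_missing_keys rb tree.

# leftover_refs REF...: print how many trees end up with a count > 0.
leftover_refs() {
    indirect=0
    missing_keys=0
    for ref in "$@"; do
        count=${ref#*:}
        case ${ref%%:*} in
            indirect)     indirect=$((indirect + count)) ;;
            missing_keys) missing_keys=$((missing_keys + count)) ;;
        esac
    done
    live=0
    [ "$indirect" -gt 0 ] && live=$((live + 1))
    [ "$missing_keys" -gt 0 ] && live=$((live + 1))
    echo "$live"
}

# Buggy placement: the -1 delayed ref goes to the indirect tree, so the
# +1 reference in the other tree is never cancelled.
leftover_refs missing_keys:1 indirect:-1       # prints 1

# Expected placement: both references meet in the same tree and cancel.
leftover_refs missing_keys:1 missing_keys:-1   # prints 0
```

    With the buggy placement one reference survives the merge, which is
    how an extent can still look shared until the transaction commits.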
    
    The following reproducer triggers the second bug:
    
       $ cat test.sh
       #!/bin/bash
    
       DEV=/dev/sdj
       MNT=/mnt/sdj
    
       mkfs.btrfs -f $DEV
       mount -o compress $DEV $MNT
    
       # With a compressed 128M file we get a tree height of 2 (level 1 root).
       xfs_io -f -c "pwrite -b 1M 0 128M" $MNT/foo
    
       btrfs subvolume snapshot $MNT $MNT/snap
    
       # Fiemap should output 0x2008 in the flags column.
       # 0x2000 means shared extent
       # 0x8 means encoded extent (because it's compressed)
       echo
       echo "fiemap after snapshot, range [120M, 120M + 128K):"
       xfs_io -c "fiemap -v 120M 128K" $MNT/foo
       echo
    
       # Overwrite one extent and fsync to flush delalloc and COW a new path
       # in the snapshot's tree.
       #
       # After this we have a BTRFS_DROP_DELAYED_REF delayed ref of type
       # BTRFS_TREE_BLOCK_REF_KEY with a count of -1 for every COWed extent
       # buffer in the path.
       #
       # In the extent tree we have inline references of type
       # BTRFS_TREE_BLOCK_REF_KEY, with a count of 1, for the same extent
       # buffers, so they should cancel each other, and the extent buffers in
       # the fs tree should no longer be considered as shared.
       #
       echo "Overwriting file range [120M, 120M + 128K)..."
       xfs_io -c "pwrite -b 128K 120M 128K" $MNT/snap/foo
       xfs_io -c "fsync" $MNT/snap/foo
    
       # Fiemap should output 0x8 in the flags column. The extent in the range
       # [120M, 120M + 128K) is no longer shared, it's now exclusive to the fs
       # tree.
       echo
       echo "fiemap after overwrite range [120M, 120M + 128K):"
       xfs_io -c "fiemap -v 120M 128K" $MNT/foo
       echo
    
       umount $MNT
    
    Running it before this patch:
    
       $ ./test.sh
       (...)
       wrote 134217728/134217728 bytes at offset 0
       128 MiB, 128 ops; 0.1152 sec (1.085 GiB/sec and 1110.5809 ops/sec)
       Create a snapshot of '/mnt/sdj' in '/mnt/sdj/snap'
    
       fiemap after snapshot, range [120M, 120M + 128K):
       /mnt/sdj/foo:
        EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
          0: [245760..246015]: 34304..34559       256 0x2008
    
       Overwriting file range [120M, 120M + 128K)...
       wrote 131072/131072 bytes at offset 125829120
       128 KiB, 1 ops; 0.0001 sec (683.060 MiB/sec and 5464.4809 ops/sec)
    
       fiemap after overwrite range [120M, 120M + 128K):
       /mnt/sdj/foo:
        EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
          0: [245760..246015]: 34304..34559       256 0x2008
    
    The extent in the range [120M, 120M + 128K) is still reported as shared
    (0x2000 bit set) after overwriting that range and flushing delalloc, which
    is not correct - an entire path was COWed in the snapshot's tree and the
    extent is now only referenced by the original fs tree.
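
    To make the flags column easier to read, the two bits used in this
    reproducer can be decoded with a small helper. This is a minimal
    sketch: the function name describe_flags is hypothetical, and it only
    knows about the two flag values mentioned in the script's comments
    (0x2000 for shared, 0x8 for encoded).

```shell
#!/bin/sh
# Flag values as described in the reproducer's comments.
FIEMAP_EXTENT_ENCODED=0x8      # extent is encoded (compressed here)
FIEMAP_EXTENT_SHARED=0x2000    # extent is shared

# describe_flags FLAGS: print which of the two flags above are set.
describe_flags() {
    out=""
    [ $(( $1 & $FIEMAP_EXTENT_SHARED )) -ne 0 ] && out="shared"
    [ $(( $1 & $FIEMAP_EXTENT_ENCODED )) -ne 0 ] && out="${out:+$out }encoded"
    echo "${out:-none}"
}

describe_flags 0x2008   # prints "shared encoded"
describe_flags 0x8      # prints "encoded"
```

    Applied to the output above, both fiemap calls decode to
    "shared encoded" before this patch.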
    
    Running it after this patch:
    
       $ ./test.sh
       (...)
       wrote 134217728/134217728 bytes at offset 0
       128 MiB, 128 ops; 0.1198 sec (1.043 GiB/sec and 1068.2067 ops/sec)
       Create a snapshot of '/mnt/sdj' in '/mnt/sdj/snap'
    
       fiemap after snapshot, range [120M, 120M + 128K):
       /mnt/sdj/foo:
        EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
          0: [245760..246015]: 34304..34559       256 0x2008
    
       Overwriting file range [120M, 120M + 128K)...
       wrote 131072/131072 bytes at offset 125829120
       128 KiB, 1 ops; 0.0001 sec (694.444 MiB/sec and 5555.5556 ops/sec)
    
       fiemap after overwrite range [120M, 120M + 128K):
       /mnt/sdj/foo:
        EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
          0: [245760..246015]: 34304..34559       256   0x8
    
    Now the extent is not reported as shared anymore.
    
    So fix this by passing a NULL key pointer to add_indirect_ref() when
    processing a delayed reference for a tree block if there's no extent op
    for our delayed ref head with a defined key. Also access the extent op
    only after taking the delayed ref head's lock.
    
    The reproducer will be converted later to a test case for fstests.
    
    Fixes: 86d5f994 ("btrfs: convert prelimary reference tracking to use rbtrees")
    Fixes: a6dbceaf ("btrfs: Remove unused op_key var from add_delayed_refs")
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>