    btrfs: Flush before reflinking any extent to prevent NOCOW write falling back... · a94d1d0c
    Qu Wenruo authored
    btrfs: Flush before reflinking any extent to prevent NOCOW write falling back to COW without data reservation
    
    [BUG]
    The following script can cause an unexpected fsync failure:
    
      #!/bin/bash
    
      dev=/dev/test/test
      mnt=/mnt/btrfs
    
      mkfs.btrfs -f $dev -b 512M > /dev/null
      mount $dev $mnt -o nospace_cache
    
      # Prealloc one extent
      xfs_io -f -c "falloc 8k 64m" $mnt/file1
      # Fill the remaining data space
      xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding
      sync
    
      # Write into the prealloc extent
      xfs_io -c "pwrite 1m 16m" $mnt/file1
    
      # Reflink then fsync, fsync would fail due to ENOSPC
      xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1
      umount $dev
    
    The fsync fails with ENOSPC, and the last page of the buffered write is
    lost.
    
    [CAUSE]
    This is caused by:
    - Btrfs back references only have extent-level granularity
      So a write into a shared extent must be COWed even if only part of
      the extent is shared.
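
    The rule above can be sketched as a toy decision function for a write
    into a preallocated extent. This is an illustration only, not the
    kernel logic: the function names, the reference count argument, and
    the byte counts are all assumptions made for the example.

```shell
# Toy model only: names and numbers are illustrative, not btrfs internals.

# extent_is_shared REFS
# Back references are tracked per extent, so one extra reference (from a
# reflink of any sub-range) marks the whole extent as shared.
extent_is_shared() {
    [ "$1" -gt 1 ]
}

# writeback_mode REFS FREE_BYTES WRITE_BYTES
# Prints "nocow" if the write can go in place, "cow" if a new extent
# must be allocated and there is room for it, "enospc" if COW is
# required but there is no room.
writeback_mode() {
    if ! extent_is_shared "$1"; then
        echo nocow
    elif [ "$2" -ge "$3" ]; then
        echo cow
    else
        echo enospc
    fi
}

writeback_mode 1 0 4096      # unshared extent, full disk
writeback_mode 2 0 4096      # shared extent, full disk
```

    In this model the first call prints "nocow" and the second prints
    "enospc": the only thing that changed between them is the reference
    count of the extent.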
    
    So for the above script we have:
    - fallocate
      Create a preallocated extent where we can do NOCOW write.
    
    - fill all the remaining data and unallocated space
    
    - buffered write into preallocated space
      As there is not enough data space available and the extent is not
      shared (yet), we fall back to NOCOW mode.
    
    - reflink
      Now part of the large preallocated extent is shared, so any later
      write into that extent must be COWed.
    
    - fsync triggers writeback
      But now the extent is shared, so we must fall back to COW mode,
      which fails with ENOSPC since no data space was reserved and
      there's not enough space to allocate new data extents.
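
    The sequence above can be replayed with a small state model. This is
    a sketch under stated assumptions, not btrfs code: the variables
    (refs, free_data, reserved, dirty, status) and the three step
    functions are hypothetical names that only mirror the steps listed
    above.

```shell
# Toy model of the failing sequence. All names and numbers are
# illustrative assumptions, not btrfs internals.

refs=1        # references to the preallocated extent (1 = not shared)
free_data=0   # free data space after the padding file filled the disk
reserved=0    # data space reserved for pending COW writes
dirty=0       # bytes in the page cache that are not yet written back
status=ok     # result of the most recent step

buffered_write() {            # usage: buffered_write LEN
    if [ "$free_data" -ge "$1" ]; then
        free_data=$((free_data - $1))
        reserved=$((reserved + $1))
        dirty=$1; status=ok   # normal path: reserve space for COW
    elif [ "$refs" -eq 1 ]; then
        dirty=$1; status=ok   # no space, but unshared: NOCOW, no reservation
    else
        status=ENOSPC
    fi
}

reflink() {                   # a reflink of any sub-range shares the extent
    refs=$((refs + 1))
    status=ok
}

writeback() {                 # what fsync triggers
    if [ "$dirty" -eq 0 ] || [ "$refs" -eq 1 ]; then
        dirty=0; status=ok    # nothing to do, or still safe to write in place
    elif [ "$reserved" -ge "$dirty" ]; then
        reserved=$((reserved - dirty)); dirty=0; status=ok
    else
        status=ENOSPC         # must COW, but nothing was reserved
    fi
}

buffered_write $((16 * 1024 * 1024))
reflink
writeback
echo "fsync: $status"         # prints "fsync: ENOSPC"
```

    Calling writeback before reflink in this model leaves status as "ok",
    because the dirty data is written in place while the extent is still
    unshared.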
    
    [WORKAROUND]
    The workaround is to ensure that any buffered writes in the related
    extents (not just the reflink source range) get flushed before
    reflink/dedupe, so that NOCOW writes that happened before reflinking
    succeed.
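
    To see why flushing only the reflink source range is not enough, the
    byte ranges from the reproducer can be compared directly. This is a
    plain arithmetic sketch; the overlaps helper and the variable names
    are made up for the example.

```shell
# Byte ranges from the reproducer, as half-open intervals [start, end).
K=1024
M=$((1024 * 1024))
extent_start=$((8 * K)); extent_end=$((8 * K + 64 * M))  # falloc 8k 64m
dirty_start=$((1 * M));  dirty_end=$((1 * M + 16 * M))   # pwrite 1m 16m
src_start=$((8 * K));    src_end=$((8 * K + 4 * K))      # reflink source, 4k at 8k

# overlaps A_START A_END B_START B_END: succeeds if the ranges intersect.
overlaps() {
    [ "$1" -lt "$4" ] && [ "$3" -lt "$2" ]
}

src_hit=no
extent_hit=no
if overlaps "$dirty_start" "$dirty_end" "$src_start" "$src_end"; then
    src_hit=yes
fi
if overlaps "$dirty_start" "$dirty_end" "$extent_start" "$extent_end"; then
    extent_hit=yes
fi

echo "dirty pages inside reflink source range: $src_hit"
echo "dirty pages inside the preallocated extent: $extent_hit"
```

    The dirty pages at 1m..17m lie outside the 4k reflink source range
    but inside the same preallocated extent, so a flush limited to the
    source range would leave them unwritten when the extent becomes
    shared.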
    
    The workaround is expensive. We could do better by only flushing the
    NOCOW range, but that needs extra accounting for NOCOW ranges.
    For now, fix the possible data loss first.

    Reviewed-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
ioctl.c 141 KB