    Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writes · 609e804d
    Filipe Manana authored
    When we are mixing buffered writes with direct IO writes against the same
    file and snapshotting is happening concurrently, we can end up with
    corrupt file content in the snapshot. Example:
    
    1) Inode/file is empty.
    
    2) Snapshotting starts.
    
    3) Buffered write at offset 0 length 256Kb. This updates the i_size of the
       inode to 256Kb, disk_i_size remains zero. This happens after the task
       doing the snapshot flushes all existing delalloc.
    
    4) DIO write at offset 256Kb length 768Kb. Once the ordered extent
       completes it sets the inode's disk_i_size to 1Mb (256Kb + 768Kb) and
       updates the inode item in the fs tree with a size of 1Mb (which is
       the value of disk_i_size).
    
    5) The delalloc for the range [0, 256Kb[ did not start yet.
    
    6) The transaction used in the DIO ordered extent completion, which updated
       the inode item, is committed by the snapshotting task.
    
    7) Snapshot creation completes.
    
    8) Delalloc for the range [0, 256Kb[ is flushed.
    
    After that when reading the file from the snapshot we always get zeroes for
    the range [0, 256Kb[, the file has a size of 1Mb and the data written by
    the direct IO write is found. From an application's point of view this is
    a corruption, since in the source subvolume it could never read a version
    of the file that included the data from the direct IO write without the
    data from the buffered write included as well. In the snapshot's tree,
    file extent items are missing for the range [0, 256Kb[.
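    
    The ordering above can be replayed with a small toy model. This is only
    an illustrative sketch, not kernel code: ToyInode, read_snapshot and the
    byte patterns are hypothetical names introduced here, and the model
    assumes a snapshot sees exactly the disk_i_size and the file extent
    items present in the tree at commit time.

```python
from dataclasses import dataclass, field

KB = 1024


@dataclass
class ToyInode:
    """Hypothetical stand-in for an inode; not the btrfs data structure."""
    i_size: int = 0                                # in-memory file size
    disk_i_size: int = 0                           # size stored in the inode item
    delalloc: dict = field(default_factory=dict)   # offset -> dirty data not yet written
    extents: dict = field(default_factory=dict)    # offset -> data with a file extent item

    def buffered_write(self, offset, data):
        # Only dirties the page cache: i_size grows, disk_i_size does not.
        self.delalloc[offset] = data
        self.i_size = max(self.i_size, offset + len(data))

    def dio_write(self, offset, data):
        # Models ordered extent completion as described in the example:
        # the extent item is inserted and disk_i_size is updated.
        self.extents[offset] = data
        end = offset + len(data)
        self.i_size = max(self.i_size, end)
        self.disk_i_size = max(self.disk_i_size, end)

    def flush_delalloc(self):
        # Writes back all dirty ranges and records their extent items.
        for offset, data in self.delalloc.items():
            self.extents[offset] = data
            self.disk_i_size = max(self.disk_i_size, offset + len(data))
        self.delalloc.clear()

    def snapshot(self):
        # The snapshot captures only what is in the tree right now.
        return self.disk_i_size, dict(self.extents)


def read_snapshot(snap):
    """Reads the whole file; ranges without an extent item read as zeroes."""
    size, extents = snap
    buf = bytearray(size)
    for offset, data in extents.items():
        buf[offset:offset + len(data)] = data
    return bytes(buf)


inode = ToyInode()                                 # the file is empty
inode.flush_delalloc()                             # snapshot task flushes existing delalloc
inode.buffered_write(0, b"B" * (256 * KB))         # buffered write, not flushed yet
inode.dio_write(256 * KB, b"D" * (768 * KB))       # DIO write completes
snap = inode.snapshot()                            # transaction commit + snapshot creation
inode.flush_delalloc()                             # delalloc for [0, 256Kb[ flushed too late

snapshot_data = read_snapshot(snap)
source_data = read_snapshot(inode.snapshot())
```

    In this model snapshot_data is 1Mb long with zeroes in [0, 256Kb[ and
    the DIO data after it, while source_data has the buffered data in that
    range.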
    
    The issue, obviously, does not happen when using the -o flushoncommit
    mount option.
    
    Fix this by flushing delalloc for all the roots that are about to be
    snapshotted when committing a transaction. This guarantees total ordering
    when updating the disk_i_size of an inode since the flush for delalloc is
    done when a transaction is in the TRANS_STATE_COMMIT_START state and the
    wait is done once no more external writers exist. This is similar to what
    we do when using the flushoncommit mount option, but we do it only if the
    transaction has snapshots to create and only for the roots of the
    subvolumes to be snapshotted. The bulk of the delalloc is flushed in the
    snapshot creation ioctl, so the flush work we do inside the transaction
    is minimized.
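    
    The selective flush can be pictured with the following sketch. It is a
    simplified model under stated assumptions, not the code in
    transaction.c: ToyRoot, commit_transaction and the root ids are
    hypothetical, and ranges are plain (offset, length) tuples.

```python
from dataclasses import dataclass, field

KB = 1024


@dataclass
class ToyRoot:
    """Hypothetical stand-in for a subvolume root."""
    delalloc: list = field(default_factory=list)   # (offset, length) dirty ranges
    tree: list = field(default_factory=list)       # (offset, length) ranges with extent items

    def flush_delalloc(self):
        # Move every dirty range into the tree.
        self.tree.extend(self.delalloc)
        self.delalloc.clear()


def commit_transaction(roots, pending_snapshot_ids):
    """Flushes delalloc only for roots about to be snapshotted, then snapshots them."""
    # Flush phase: roots without a pending snapshot are left alone.
    for root_id in pending_snapshot_ids:
        roots[root_id].flush_delalloc()
    # Snapshot phase: each snapshot is a copy of the now complete tree.
    return {root_id: sorted(roots[root_id].tree) for root_id in pending_snapshot_ids}


roots = {
    # Root being snapshotted: DIO extent in the tree, buffered write still dirty.
    257: ToyRoot(delalloc=[(0, 256 * KB)], tree=[(256 * KB, 768 * KB)]),
    # Unrelated root with dirty data and no pending snapshot.
    258: ToyRoot(delalloc=[(0, 4 * KB)]),
}
snapshots = commit_transaction(roots, [257])
```

    Here the snapshot of root 257 contains both ranges, and the delalloc of
    root 258 is not flushed by the commit, which is the difference from
    flushoncommit.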
    
    This issue, involving buffered and direct IO writes with snapshotting, is
    often triggered by fstest btrfs/078, and got reported by fsck when not
    using the NO_HOLES feature, for example:
    
      $ cat results/btrfs/078.full
      (...)
      _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
      *** fsck.btrfs output ***
      [1/7] checking root items
      [2/7] checking extents
      [3/7] checking free space cache
      [4/7] checking fs roots
      root 258 inode 264 errors 100, file extent discount
      Found file extent holes:
            start: 524288, len: 65536
      ERROR: errors found in fs roots
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
transaction.c 66 KB