• Filipe Manana's avatar
    Btrfs: fix data loss in the fast fsync path · 0ab92529
    Filipe Manana authored
    commit 3a8b36f3 upstream.
    
    When using the fast file fsync code path we can miss the fact that new
    writes happened since the last file fsync and therefore return without
    waiting for the IO to finish and write the new extents to the fsync log.
    
    Here's an example scenario where the fsync will miss the fact that new
    file data exists that wasn't yet durably persisted:
    
    1. fs_info->last_trans_committed == N - 1 and current transaction is
       transaction N (fs_info->generation == N);
    
    2. do a buffered write;
    
    3. fsync our inode, this clears our inode's full sync flag, starts
       an ordered extent and waits for it to complete - when it completes
       at btrfs_finish_ordered_io(), the inode's last_trans is set to the
       value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
       btrfs_set_inode_last_trans);
    
    4. transaction N is committed, so fs_info->last_trans_committed is now
       set to the value N and fs_info->generation remains with the value N;
    
    5. do another buffered write, when this happens btrfs_file_write_iter
       sets our inode's last_trans to the value N + 1 (that is
       fs_info->generation + 1 == N + 1);
    
    6. transaction N + 1 is started and fs_info->generation now has the
       value N + 1;
    
    7. transaction N + 1 is committed, so fs_info->last_trans_committed
       is set to the value N + 1;
    
    8. fsync our inode - because it doesn't have the full sync flag set,
       we only start the ordered extent, we don't wait for it to complete
       (only in a later phase) therefore its last_trans field has the
       value N + 1 set previously by btrfs_file_write_iter(), and so we
       have:
    
           inode->last_trans <= fs_info->last_trans_committed
               (N + 1)              (N + 1)
    
       Which made us not log the last buffered write and exit the fsync
       handler immediately, returning success (0) to user space and resulting
       in data loss after a crash.
    
    This can actually be triggered deterministically and the following excerpt
    from a testcase I made for xfstests triggers the issue. It moves a dummy
    file across directories and then fsyncs the old parent directory - this
    is just to trigger a transaction commit, so moving files around isn't
    directly related to the issue but it was chosen because running 'sync' for
    example does more than just committing the current transaction, as it
    flushes/waits for all file data to be persisted. The issue can also happen
    at random periods, since the transaction kthread periodicaly commits the
    current transaction (about every 30 seconds by default).
    The body of the test is:
    
      _scratch_mkfs >> $seqres.full 2>&1
      _init_flakey
      _mount_flakey
    
      # Create our main test file 'foo', the one we check for data loss.
      # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
      # bit from its flags (btrfs inode specific flags).
      $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
                      -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io
    
      # Now create one other file and 2 directories. We will move this second file
      # from one directory to the other later because it forces btrfs to commit its
      # currently open transaction if we fsync the old parent directory. This is
      # necessary to trigger the data loss bug that affected btrfs.
      mkdir $SCRATCH_MNT/testdir_1
      touch $SCRATCH_MNT/testdir_1/bar
      mkdir $SCRATCH_MNT/testdir_2
    
      # Make sure everything is durably persisted.
      sync
    
      # Write more 8Kb of data to our file.
      $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io
    
      # Move our 'bar' file into a new directory.
      mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar
    
      # Fsync our first directory. Because it had a file moved into some other
      # directory, this made btrfs commit the currently open transaction. This is
      # a condition necessary to trigger the data loss bug.
      $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1
    
      # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
      # data we wrote previously to be persisted and available if a crash happens.
      # This did not happen with btrfs, because of the transaction commit that
      # happened when we fsynced the parent directory.
      $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
    
      # Simulate a crash/power loss.
      _load_flakey_table $FLAKEY_DROP_WRITES
      _unmount_flakey
    
      _load_flakey_table $FLAKEY_ALLOW_WRITES
      _mount_flakey
    
      # Now check that all data we wrote before are available.
      echo "File content after log replay:"
      od -t x1 $SCRATCH_MNT/foo
    
      status=0
      exit
    
    The expected golden output for the test, which is what we get with this
    fix applied (or when running against ext3/4 and xfs), is:
    
      wrote 8192/8192 bytes at offset 0
      XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
      wrote 8192/8192 bytes at offset 8192
      XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
      File content after log replay:
      0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
      *
      0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
      *
      0040000
    
    Without this fix applied, the output shows the test file does not have
    the second 8Kb extent that we successfully fsynced:
    
      wrote 8192/8192 bytes at offset 0
      XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
      wrote 8192/8192 bytes at offset 8192
      XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
      File content after log replay:
      0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
      *
      0020000
    
    So fix this by skipping the fsync only if we're doing a full sync and
    if the inode's last_trans is <= fs_info->last_trans_committed, or if
    the inode is already in the log. Also remove setting the inode's
    last_trans in btrfs_file_write_iter since it's useless/unreliable.
    
    Also because btrfs_file_write_iter no longer sets inode->last_trans to
    fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
    bail out if last_trans is 0, otherwise something as simple as the following
    example wouldn't log the second write on the last fsync:
    
      1. write to file
    
      2. fsync file
    
      3. fsync file
           |--> btrfs_inode_in_log() returns true and it set last_trans to 0
    
      4. write to file
           |--> btrfs_file_write_iter() no longers sets last_trans, so it
                remained with a value of 0
      5. fsync
           |--> inode->last_trans == 0, so it bails out without logging the
                second write
    
    A test case for xfstests will be sent soon.
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarChris Mason <clm@fb.com>
    Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
    0ab92529
file.c 74.2 KB