    Btrfs: ensure ordered extent errors aren't missed on fsync · b38ef71c
    Filipe Manana authored
    When doing an fsync through the fast path, there is a time window in which we
    can miss the fact that writeback of some file data failed, and we therefore end
    up returning success (0) from fsync when we should return an error.
    The steps that lead to this are the following:
    
    1) We start all ordered extents by calling filemap_fdatawrite_range();
    
    2) We do some other work like locking the inode's i_mutex, start a transaction,
       start a log transaction, etc;
    
    3) We enter btrfs_log_inode(), acquire the inode's log_mutex and collect all the
       ordered extents from inode's ordered tree into a list;
    
    4) But by the time we collect the ordered extents, some of the ones we started
       at step 1) might have already completed with an error, so we don't find them
       in the ordered tree and have no idea they finished with an error. This makes
       our fsync return success (0) to userspace, but has no bad effects on the log,
       such as the insertion of file extent items into the log that point to unwritten
       extents, because the invalid extent maps were removed before the ordered extent
       completed (in inode.c:btrfs_finish_ordered_io).
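    
    The window described in the steps above can be sketched with a small userspace
    model. This is only an illustration, not kernel code: the model_* types and
    functions are hypothetical, and the ordered tree is reduced to a plain array.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these are real kernel structures. */
#define MODEL_MAX_ORDERED 8

struct model_ordered_extent {
	bool io_error;	/* writeback for this extent failed */
};

struct model_inode {
	struct model_ordered_extent *ordered_tree[MODEL_MAX_ORDERED];
	size_t nr_ordered;
};

/* Step 1: starting writeback adds an ordered extent to the inode's tree. */
static void model_start_ordered(struct model_inode *inode,
				struct model_ordered_extent *oe)
{
	if (inode->nr_ordered < MODEL_MAX_ORDERED)
		inode->ordered_tree[inode->nr_ordered++] = oe;
}

/* Completion removes the extent from the tree, whether or not it failed. */
static void model_finish_ordered(struct model_inode *inode,
				 struct model_ordered_extent *oe, bool failed)
{
	oe->io_error = failed;
	for (size_t i = 0; i < inode->nr_ordered; i++) {
		if (inode->ordered_tree[i] == oe) {
			/* Swap-remove: move the last entry into the freed slot. */
			inode->ordered_tree[i] =
				inode->ordered_tree[--inode->nr_ordered];
			return;
		}
	}
}

/*
 * Steps 3 and 4: look only at what is still in the tree. An extent that
 * already completed with an error is no longer there, so it goes unseen.
 */
static int model_fast_fsync_collect(const struct model_inode *inode)
{
	for (size_t i = 0; i < inode->nr_ordered; i++) {
		if (inode->ordered_tree[i]->io_error)
			return -EIO;
	}
	return 0;
}
```

    In this model, starting two extents and finishing one of them with failed set
    to true before calling model_fast_fsync_collect() yields 0, even though one
    extent recorded an I/O error.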
    
    So after collecting the ordered extents just check if the inode's i_mapping has any
    error flags set (AS_EIO or AS_ENOSPC) and leave with an error if it does. Whenever
    writeback fails for a page of an ordered extent, we call mapping_set_error (done in
    extent_io.c:end_extent_writepage, called by extent_io.c:end_bio_extent_writepage)
    that sets one of those error flags in the inode's i_mapping flags.
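    
    A minimal userspace sketch of that check follows. It is an assumption about
    the shape of the logic, not the code in tree-log.c: the MODEL_AS_* bits stand
    in for AS_EIO and AS_ENOSPC, the model_* names are hypothetical, and plain bit
    operations replace the atomic bit helpers the kernel would use.

```c
#include <errno.h>

/* Hypothetical stand-ins for the kernel's AS_EIO / AS_ENOSPC mapping flags. */
#define MODEL_AS_EIO	(1UL << 0)
#define MODEL_AS_ENOSPC	(1UL << 1)

struct model_mapping {
	unsigned long flags;
};

/* Models a failed page writeback recording its error on the mapping. */
static void model_mapping_set_error(struct model_mapping *mapping, int error)
{
	if (error == -ENOSPC)
		mapping->flags |= MODEL_AS_ENOSPC;
	else if (error)
		mapping->flags |= MODEL_AS_EIO;
}

/*
 * Check both error flags, clear whichever is set and return the matching
 * negative errno. Returns 0 when no writeback error was recorded.
 */
static int model_check_and_clear_errors(struct model_mapping *mapping)
{
	int ret = 0;

	if (mapping->flags & MODEL_AS_ENOSPC) {
		mapping->flags &= ~MODEL_AS_ENOSPC;
		ret = -ENOSPC;
	}
	if (mapping->flags & MODEL_AS_EIO) {
		mapping->flags &= ~MODEL_AS_EIO;
		ret = -EIO;
	}
	return ret;
}
```

    Calling model_check_and_clear_errors() right after the ordered extents are
    collected reports an error set by model_mapping_set_error() once. A second
    call returns 0, so the error is not carried over to a later fsync.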
    
    This change also has the side effect of fixing the issue where, for fast fsyncs,
    we never checked/cleared the error flags in the inode's i_mapping flags. This
    meant that a full fsync performed after a fast fsync could get errors that
    belonged to the fast fsync, because the full fsync calls
    btrfs_wait_ordered_range(), which calls filemap_fdatawait_range(), and the
    latter checks for and clears those flags. Fast fsyncs never called
    filemap_fdatawait_range() or anything else that checks for and clears the error
    flags in the inode's i_mapping.
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Chris Mason <clm@fb.com>
tree-log.c 120 KB