• Filipe Manana's avatar
    btrfs: fix race between marking inode needs to be logged and log syncing · bc0939fc
    Filipe Manana authored
    We have a race between marking that an inode needs to be logged, either
    at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and between
    btrfs_sync_log(). The following steps describe how the race happens.
    
    1) We are at transaction N;
    
    2) Inode I was previously fsynced in the current transaction so it has:
    
        inode->logged_trans set to N;
    
    3) The inode's root currently has:
    
       root->log_transid set to 1
       root->last_log_commit set to 0
    
       Which means only one log transaction was committed to far, log
       transaction 0. When a log tree is created we set ->log_transid and
       ->last_log_commit of its parent root to 0 (at btrfs_add_log_tree());
    
    4) One more range of pages is dirtied in inode I;
    
    5) Some task A starts an fsync against some other inode J (same root), and
       so it joins log transaction 1.
    
       Before task A calls btrfs_sync_log()...
    
    6) Task B starts an fsync against inode I, which currently has the full
       sync flag set, so it starts delalloc and waits for the ordered extent
       to complete before calling btrfs_inode_in_log() at btrfs_sync_file();
    
    7) During ordered extent completion we have btrfs_update_inode() called
       against inode I, which in turn calls btrfs_set_inode_last_trans(),
       which does the following:
    
         spin_lock(&inode->lock);
         inode->last_trans = trans->transaction->transid;
         inode->last_sub_trans = inode->root->log_transid;
         inode->last_log_commit = inode->root->last_log_commit;
         spin_unlock(&inode->lock);
    
       So ->last_trans is set to N and ->last_sub_trans set to 1.
       But before setting ->last_log_commit...
    
    8) Task A is at btrfs_sync_log():
    
       - it increments root->log_transid to 2
       - starts writeback for all log tree extent buffers
       - waits for the writeback to complete
       - writes the super blocks
       - updates root->last_log_commit to 1
    
       It's a lot of slow steps between updating root->log_transid and
       root->last_log_commit;
    
    9) The task doing the ordered extent completion, currently at
       btrfs_set_inode_last_trans(), then finally runs:
    
         inode->last_log_commit = inode->root->last_log_commit;
         spin_unlock(&inode->lock);
    
       Which results in inode->last_log_commit being set to 1.
       The ordered extent completes;
    
    10) Task B is resumed, and it calls btrfs_inode_in_log() which returns
        true because we have all the following conditions met:
    
        inode->logged_trans == N which matches fs_info->generation &&
        inode->last_subtrans (1) <= inode->last_log_commit (1) &&
        inode->last_subtrans (1) <= root->last_log_commit (1) &&
        list inode->extent_tree.modified_extents is empty
    
        And as a consequence we return without logging the inode, so the
        existing logged version of the inode does not point to the extent
        that was written after the previous fsync.
    
    It should be impossible in practice for one task be able to do so much
    progress in btrfs_sync_log() while another task is at
    btrfs_set_inode_last_trans() right after it reads root->log_transid and
    before it reads root->last_log_commit. Even if kernel preemption is enabled
    we know the task at btrfs_set_inode_last_trans() can not be preempted
    because it is holding the inode's spinlock.
    
    However there is another place where we do the same without holding the
    spinlock, which is in the memory mapped write path at:
    
      vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
      {
         (...)
         BTRFS_I(inode)->last_trans = fs_info->generation;
         BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
         BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
         (...)
    
    So with preemption happening after setting ->last_sub_trans and before
    setting ->last_log_commit, it is less of a stretch to have another task
    do enough progress at btrfs_sync_log() such that the task doing the memory
    mapped write ends up with ->last_sub_trans and ->last_log_commit set to
    the same value. It is still a big stretch to get there, as the task doing
    btrfs_sync_log() has to start writeback, wait for its completion and write
    the super blocks.
    
    So fix this in two different ways:
    
    1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the
       value of ->last_sub_trans minus 1;
    
    2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just
       like we do for buffered and direct writes at btrfs_file_write_iter(),
       which is all we need to make sure multiple writes and fsyncs to an
       inode in the same transaction never result in an fsync missing that
       the inode changed and needs to be logged. Turn this into a helper
       function and use it both at btrfs_page_mkwrite() and at
       btrfs_file_write_iter() - this also fixes the problem that at
       btrfs_page_mkwrite() we were setting those fields without the
       protection of the inode's spinlock.
    
    This is an extremely unlikely race to happen in practice.
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    bc0939fc
btrfs_inode.h 10.8 KB