• Filipe Manana's avatar
    btrfs: fix race that makes inode logging fallback to transaction commit · 47d3db41
    Filipe Manana authored
    When logging an inode and the previous transaction is still committing, we
    have a time window where we can end up incorrectly think an inode has its
    last_unlink_trans field with a value greater than the last transaction
    committed, which results in the logging to fallback to a full transaction
    commit, which is usually much more expensive than doing a log commit.
    
    The race is described by the following steps:
    
    1) We are at transaction 1000;
    
    2) We modify an inode X (a directory) using transaction 1000 and set its
       last_unlink_trans field to 1000, because for example we removed one
       of its subdirectories;
    
    3) We create a new inode Y with a dentry in inode X using transaction 1000,
       so its generation field is set to 1000;
    
    4) The commit for transaction 1000 is started by task A;
    
    5) The task committing transaction 1000 sets the transaction state to
       unblocked, writes the dirty extent buffers and the super blocks, then
       unlocks tree_log_mutex;
    
    6) Some task starts a new transaction with a generation of 1001;
    
    7) We do some modification to inode Y (using transaction 1001);
    
    8) The transaction 1000 commit starts unpinning extents. At this point
       fs_info->last_trans_committed still has a value of 999;
    
    9) Task B starts an fsync on inode Y, and gets a handle for transaction
       1001. When it gets to check_parent_dirs_for_sync() it does the checking
       of the ancestor dentries because the following check does not evaluate
       to true:
    
           if (S_ISREG(inode->vfs_inode.i_mode) &&
               inode->generation <= last_committed &&
               inode->last_unlink_trans <= last_committed)
                   goto out;
    
       The generation value for inode Y is 1000 and last_committed, which has
       the value read from fs_info->last_trans_committed, has a value of 999,
       so that check evaluates to false and we proceed to check the ancestor
       inodes.
    
       Once we get to the first ancestor, inode X, we call
       btrfs_must_commit_transaction() on it, which evaluates to true:
    
       static bool btrfs_must_commit_transaction(...)
       {
           struct btrfs_fs_info *fs_info = inode->root->fs_info;
           bool ret = false;
    
           mutex_lock(&inode->log_mutex);
           if (inode->last_unlink_trans > fs_info->last_trans_committed) {
               /*
                * Make sure any commits to the log are forced to be full
                * commits.
                */
                btrfs_set_log_full_commit(trans);
                ret = true;
           }
        (...)
    
        because inode's X last_unlink_trans has a value of 1000 and
        fs_info->last_trans_committed still has a value of 999, it returns
        true to check_parent_dirs_for_sync(), making it return 1 which is
        propagated up to btrfs_sync_file(), causing it to fallback to a full
        transaction commit of transaction 1001.
    
        We should have not fallen back to commit transaction 1001, since inode
        X had last_unlink_trans set to 1000 and the super blocks for
        transaction 1000 were already written. So while not resulting in a
        functional problem, it leads to a lot more work and higher latencies
        for a fsync since committing a transaction is usually more expensive
        than committing a log (if other filesystem changes happened under that
        transaction).
    
    Similar problem happens when logging directories, for the same reason as
    btrfs_must_commit_transaction() returns true on an inode with its
    last_unlink_trans having the generation of the previous transaction and
    that transaction is still committing, unpinning its freed extents.
    
    So fix this by comparing last_unlink_trans with the id of the current
    transaction instead of fs_info->last_trans_committed.
    
    This case is often hit when running dbench for a long enough duration, as
    it does lots of rename and rmdir operations (both update the field
    last_unlink_trans of an inode) and fsyncs of files and directories.
    
    This patch belongs to a patch set that is comprised of the following
    patches:
    
      btrfs: fix race causing unnecessary inode logging during link and rename
      btrfs: fix race that results in logging old extents during a fast fsync
      btrfs: fix race that causes unnecessary logging of ancestor inodes
      btrfs: fix race that makes inode logging fallback to transaction commit
      btrfs: fix race leading to unnecessary transaction commit when logging inode
      btrfs: do not block inode logging for so long during transaction commit
    
    Performance results are mentioned in the change log of the last patch.
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    47d3db41
tree-log.c 173 KB