    btrfs: do not block inode logging for so long during transaction commit · 47876f7c
    Filipe Manana authored
    Early on during a transaction commit we acquire the tree_log_mutex and
    hold it until after we write the super blocks. But before writing the
    extent buffers dirtied by the transaction and the super blocks we unblock
    the transaction by setting its state to TRANS_STATE_UNBLOCKED and setting
    fs_info->running_transaction to NULL.
    
    This means that from that point until the super blocks are written, new
    transactions can start. However, if a task in a new transaction wants to
    log an inode, it blocks until the transaction commit has written its
    dirty extent buffers and the super blocks, because the tree_log_mutex is
    only released after those operations complete, and starting a new log
    transaction blocks on that mutex (at start_log_trans()).
    
    Writing the dirty extent buffers and the super blocks can take a very
    significant amount of time to complete, but we could allow the tasks
    wanting to log an inode to proceed with most of their steps:
    
    1) create the log trees
    2) log metadata in the trees
    3) write their dirty extent buffers
    
    They only need to wait for the previous transaction commit to complete
    (write its super blocks) before they attempt to write their super blocks,
    otherwise we could end up with a corrupt filesystem after a crash.
    
    So change start_log_trans() to use the root tree's log_mutex to serialize
    the creation of the log root tree instead of using the tree_log_mutex,
    and make btrfs_sync_log() acquire the tree_log_mutex before writing the
    super blocks. This allows inode logging to wait much less time when
    there is a previous transaction that is still committing, often not
    having to wait at all, because by the time we try to sync the log the
    previous transaction has already written its super blocks.
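    
    The locking order described above can be sketched in userspace C. This
    is only a simplified model under stated assumptions, not the kernel
    code: struct fs_model, fs_model_init(), model_start_log_trans() and
    model_sync_log() are hypothetical names, pthread mutexes stand in for
    the kernel mutexes, and the counters merely record which step ran.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model; it only mirrors the lock names used in
 * the commit message and is not the btrfs implementation. */
struct fs_model {
    pthread_mutex_t log_mutex;      /* models the root tree's log_mutex */
    pthread_mutex_t tree_log_mutex; /* held by a commit until supers are written */
    bool log_root_tree_created;
    int log_supers_written;
};

static void fs_model_init(struct fs_model *fs)
{
    pthread_mutex_init(&fs->log_mutex, NULL);
    pthread_mutex_init(&fs->tree_log_mutex, NULL);
    fs->log_root_tree_created = false;
    fs->log_supers_written = 0;
}

/* Models start_log_trans() after the change: creation of the log root
 * tree is serialized by log_mutex only, so it never waits for a
 * committing transaction that still holds tree_log_mutex. */
static void model_start_log_trans(struct fs_model *fs)
{
    pthread_mutex_lock(&fs->log_mutex);
    if (!fs->log_root_tree_created)
        fs->log_root_tree_created = true;
    pthread_mutex_unlock(&fs->log_mutex);
}

/* Models btrfs_sync_log() after the change: logging metadata and writing
 * the log's dirty extent buffers would happen before this point without
 * tree_log_mutex; only the super block write waits for the previous
 * commit to finish. */
static void model_sync_log(struct fs_model *fs)
{
    assert(fs->log_root_tree_created);
    pthread_mutex_lock(&fs->tree_log_mutex);
    fs->log_supers_written++;
    pthread_mutex_unlock(&fs->tree_log_mutex);
}
```

    In this model a caller that holds tree_log_mutex (playing the role of
    the committing transaction) can still run model_start_log_trans()
    without blocking; only model_sync_log() has to wait for the mutex to
    be released.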
    
    This patch belongs to a patch set that consists of the following
    patches:
    
      btrfs: fix race causing unnecessary inode logging during link and rename
      btrfs: fix race that results in logging old extents during a fast fsync
      btrfs: fix race that causes unnecessary logging of ancestor inodes
      btrfs: fix race that makes inode logging fallback to transaction commit
      btrfs: fix race leading to unnecessary transaction commit when logging inode
      btrfs: do not block inode logging for so long during transaction commit
    
    The following script that uses dbench was used to measure the impact of
    the whole patchset:
    
      $ cat test-dbench.sh
      #!/bin/bash
    
      DEV=/dev/nvme0n1
      MNT=/mnt/btrfs
      MOUNT_OPTIONS="-o ssd"
    
      echo "performance" | \
          tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    
      mkfs.btrfs -f -m single -d single $DEV
      mount $MOUNT_OPTIONS $DEV $MNT
    
      dbench -D $MNT -t 300 64
    
      umount $MNT
    
    The test was run on a machine with 12 cores, 64G of RAM, using an NVMe
    device and a non-debug kernel configuration (Debian's default).
    
    Before patch set:
    
     Operation      Count    AvgLat    MaxLat
     ----------------------------------------
     NTCreateX    11277211    0.250    85.340
     Close        8283172     0.002     6.479
     Rename        477515     1.935    86.026
     Unlink       2277936     0.770    87.071
     Deltree          256    15.732    81.379
     Mkdir            128     0.003     0.009
     Qpathinfo    10221180    0.056    44.404
     Qfileinfo    1789967     0.002     4.066
     Qfsinfo      1874399     0.003     9.176
     Sfileinfo     918589     0.061    10.247
     Find         3951758     0.341    54.040
     WriteX       5616547     0.047    85.079
     ReadX        17676028    0.005     9.704
     LockX          36704     0.003     1.800
     UnlockX        36704     0.002     0.687
     Flush         790541    14.115   676.236
    
    Throughput 1179.19 MB/sec  64 clients  64 procs  max_latency=676.240 ms
    
    After patch set:
    
     Operation      Count    AvgLat    MaxLat
     ----------------------------------------
     NTCreateX    12687926    0.171    86.526
     Close        9320780     0.002     8.063
     Rename        537253     1.444    78.576
     Unlink       2561827     0.559    87.228
     Deltree          374    11.499    73.549
     Mkdir            187     0.003     0.005
     Qpathinfo    11500300    0.061    36.801
     Qfileinfo    2017118     0.002     7.189
     Qfsinfo      2108641     0.003     4.825
     Sfileinfo    1033574     0.008     8.065
     Find         4446553     0.408    47.835
     WriteX       6335667     0.045    84.388
     ReadX        19887312    0.003     9.215
     LockX          41312     0.003     1.394
     UnlockX        41312     0.002     1.425
     Flush         889233    13.014   623.259
    
    Throughput 1339.32 MB/sec  64 clients  64 procs  max_latency=623.265 ms
    
    +12.7% throughput, -8.2% max latency
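    
    How the summary percentages relate to the reported throughput and
    max_latency values can be checked with a short calculation. The sketch
    below is an inference, not something the commit states: the quoted
    figures appear to match a symmetric percentage difference (relative to
    the mean of the two values), which gives roughly +12.7% and -8.2%,
    whereas a change relative to the "before" value gives roughly +13.6%
    and -7.8%. The function names pct_change() and pct_symmetric() are
    hypothetical.

```c
#include <assert.h>

/* Change relative to the "before" value, in percent. */
static double pct_change(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}

/* Symmetric percentage difference: the change relative to the mean of
 * the two values, in percent. Assumed (not confirmed by the source) to
 * be the convention behind the summary line. */
static double pct_symmetric(double before, double after)
{
    assert(before + after != 0.0);
    return (after - before) / ((before + after) / 2.0) * 100.0;
}
```

    For example, pct_symmetric(1179.19, 1339.32) covers the throughput
    line and pct_symmetric(676.240, 623.265) the max latency line.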
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>