    btrfs: use delayed items when logging a directory · 30b80f3c
    Filipe Manana authored
    When logging a directory we start by flushing all its delayed items.
    That results in adding dir index items to the subvolume btree, for new
    dentries, and removing dir index items from the subvolume btree for any
    dentries that were deleted.
    
    This makes it straightforward to log a directory simply by iterating over
    all the modified subvolume btree leaves, especially when we used to log
    both dir index keys and dir item keys (before commit 339d0354
    ("btrfs: only copy dir index keys when logging a directory")) and when we
    used to copy old dir index entries for leaves modified in the current
    transaction (before commit 732d591a ("btrfs: stop copying old dir
    items when logging a directory")).
    
    From an efficiency point of view this has a couple of drawbacks:
    
    1) Adds extra latency, due to copying delayed items to the subvolume btree
       and deleting dir index items from the btree.
    
       Further, if there are other tasks accessing the btree, which is common
       (syscalls like creat, mkdir, rename, link, unlink, truncate, reflinks,
       etc, as well as finishing an ordered extent), lock contention can cause
       further delays, both to the task logging a directory and to the other
       tasks accessing the btree;
    
    2) More time spent overall flushing delayed items, if after logging the
       directory further changes are done to the directory in the same
       transaction.
    
       For example, if we add 10 dentries to a directory, fsync it, add 10
       more dentries, fsync it again, then add 10 more dentries and fsync it
       again, then we end up inserting 3 batches of 10 items to the subvolume
       btree. With the changes from this patch, we flush all the delayed items
       to the btree only once - a single batch of 30 items, and outside the
       logging code (transaction commit or when delayed items are flushed
       asynchronously).
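
    The batch arithmetic in the example above can be sketched with a toy
    model. This is only an illustration: the function names are hypothetical
    and nothing here touches btrfs; it just counts how many items each flush
    would insert into the subvolume btree under the two strategies.

```shell
#!/bin/bash
# Toy model only: counts insertion batches, it does not exercise btrfs.

# Old behaviour: every directory fsync flushes the delayed items queued
# since the previous fsync. Prints the size of each flushed batch.
batches_flush_on_fsync() {
    local rounds=$1 per_round=$2 pending=0 out="" r
    for ((r = 1; r <= rounds; r++)); do
        pending=$((pending + per_round))   # dentries added before this fsync
        out+="${pending} "                 # fsync flushes them as one batch
        pending=0
    done
    echo "${out% }"
}

# New behaviour: fsync leaves the delayed items queued, so a single flush
# (transaction commit or async flush) inserts everything at once.
batches_flush_deferred() {
    local rounds=$1 per_round=$2 pending=0 r
    for ((r = 1; r <= rounds; r++)); do
        pending=$((pending + per_round))   # fsync does not flush anymore
    done
    echo "$pending"
}

batches_flush_on_fsync 3 10    # prints: 10 10 10
batches_flush_deferred 3 10    # prints: 30
```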
    
    This change simply skips the flushing of delayed items every time we log a
    directory. Instead we copy the delayed insertion items directly to the log
    tree and delete delayed deletion items directly from the log tree.
    This avoids first changing the subvolume btree and then scanning it for
    new items to copy to the log tree, and detecting deletions by observing
    gaps in consecutive dir index keys in subvolume btree leaves.
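
    As a rough illustration of the difference, the toy sketch below uses
    plain integers in place of dir index keys. The function names are
    hypothetical, and coalescing consecutive deleted indexes into ranges is
    an assumption of this sketch rather than a description of the kernel
    code. The first function infers deletions from gaps between the indexes
    that remain in a leaf; the second takes the deleted indexes directly, as
    a delayed deletion list would provide them.

```shell
#!/bin/bash
# Toy model only: integers stand in for dir index keys.

# Gap-based detection: given the sorted indexes still present in a leaf,
# report every missing range between consecutive indexes.
deletions_from_gaps() {
    local prev="" idx out=""
    for idx in "$@"; do
        if [[ -n $prev ]] && (( idx > prev + 1 )); then
            out+="$((prev + 1))-$((idx - 1)) "
        fi
        prev=$idx
    done
    echo "${out% }"
}

# Direct detection: given the sorted deleted indexes (as a delayed deletion
# list would supply them), coalesce consecutive ones into ranges.
deletions_from_delayed() {
    local start="" end="" idx out=""
    for idx in "$@"; do
        if [[ -z $start ]]; then
            start=$idx; end=$idx
        elif (( idx == end + 1 )); then
            end=$idx
        else
            out+="${start}-${end} "
            start=$idx; end=$idx
        fi
    done
    [[ -n $start ]] && out+="${start}-${end}"
    echo "${out% }"
}

# Indexes 2..8 existed; 4, 5 and 7 were deleted.
deletions_from_gaps 2 3 6 8      # prints: 4-5 7-7
deletions_from_delayed 4 5 7     # prints: 4-5 7-7
```

    Both calls report the same ranges, but the second one needs no scan of
    the remaining keys. The gap-based toy also cannot see deletions before
    the first or after the last index it is given, which is one reason it
    should be read as a sketch only.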
    
    Running the following tests on a non-debug kernel (Debian's default kernel
    config), on a box with an NVMe device, a 12-core Intel CPU and 64G of RAM,
    produced the results below.
    
    The results compare a branch without this patch and all the other patches
    it depends on versus the same branch with the patchset applied.
    
    The patchset consists of the following patches:
    
      btrfs: don't drop dir index range items when logging a directory
      btrfs: remove the root argument from log_new_dir_dentries()
      btrfs: update stale comment for log_new_dir_dentries()
      btrfs: free list element sooner at log_new_dir_dentries()
      btrfs: avoid memory allocation at log_new_dir_dentries() for common case
      btrfs: remove root argument from btrfs_delayed_item_reserve_metadata()
      btrfs: store index number instead of key in struct btrfs_delayed_item
      btrfs: remove unused logic when looking up delayed items
      btrfs: shrink the size of struct btrfs_delayed_item
      btrfs: search for last logged dir index if it's not cached in the inode
      btrfs: move need_log_inode() to above log_conflicting_inodes()
      btrfs: move log_new_dir_dentries() above btrfs_log_inode()
      btrfs: log conflicting inodes without holding log mutex of the initial inode
      btrfs: skip logging parent dir when conflicting inode is not a dir
      btrfs: use delayed items when logging a directory
    
    Custom test script for testing time spent at btrfs_log_inode():
    
       #!/bin/bash
    
       DEV=/dev/nvme0n1
       MNT=/mnt/nvme0n1
    
       # Total number of files to create in the test directory.
       NUM_FILES=10000
       # Fsync after creating or renaming N files.
       FSYNC_AFTER=100
    
       umount $DEV &> /dev/null
       mkfs.btrfs -f $DEV
       mount -o ssd $DEV $MNT
    
       TEST_DIR=$MNT/testdir
       mkdir $TEST_DIR
    
       echo "Creating files..."
       for ((i = 1; i <= $NUM_FILES; i++)); do
               echo -n > $TEST_DIR/file_$i
               if (( ($i % $FSYNC_AFTER) == 0 )); then
                       xfs_io -c "fsync" $TEST_DIR
               fi
       done
    
       sync
    
       echo "Renaming files..."
       for ((i = 1; i <= $NUM_FILES; i++)); do
               mv $TEST_DIR/file_$i $TEST_DIR/file_$i.renamed
               if (( ($i % $FSYNC_AFTER) == 0 )); then
                       xfs_io -c "fsync" $TEST_DIR
               fi
       done
    
       umount $MNT
    
    The following bpftrace script was used to capture the total time, in
    microseconds, spent at btrfs_log_inode():
    
       #!/usr/bin/bpftrace
    
       k:btrfs_log_inode
       {
               @start_log_inode[tid] = nsecs;
       }
    
       kr:btrfs_log_inode
       /@start_log_inode[tid]/
       {
               $dur = (nsecs - @start_log_inode[tid]) / 1000;
               @btrfs_log_inode_total_time = sum($dur);
               delete(@start_log_inode[tid]);
       }
    
       END
       {
               clear(@start_log_inode);
       }
    
    Result before applying patchset:
    
       @btrfs_log_inode_total_time: 622642
    
    Result after applying patchset:
    
       @btrfs_log_inode_total_time: 354134    (-43.1% time spent)
    
    The following dbench script was also used for testing:
    
       #!/bin/bash
    
       NUM_JOBS=$(nproc --all)
    
       DEV=/dev/nvme0n1
       MNT=/mnt/nvme0n1
       MOUNT_OPTIONS="-o ssd"
       MKFS_OPTIONS="-O no-holes -R free-space-tree"
    
       echo "performance" | \
           tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    
       umount $DEV &> /dev/null
       mkfs.btrfs -f $MKFS_OPTIONS $DEV
       mount $MOUNT_OPTIONS $DEV $MNT
    
       dbench -D $MNT --skip-cleanup -t 120 -S $NUM_JOBS
    
       umount $MNT
    
    Before patchset:
    
     Operation      Count    AvgLat    MaxLat
     ----------------------------------------
     NTCreateX    3322265     0.034    21.032
     Close        2440562     0.002     0.994
     Rename        140664     1.150   269.633
     Unlink        670796     1.093   269.678
     Deltree           96     5.481    15.510
     Mkdir             48     0.004     0.052
     Qpathinfo    3010924     0.014     8.127
     Qfileinfo     528055     0.001     0.518
     Qfsinfo       552113     0.003     0.372
     Sfileinfo     270575     0.005     0.688
     Find         1164176     0.052    13.931
     WriteX       1658537     0.019     5.918
     ReadX        5207412     0.003     1.034
     LockX          10818     0.003     0.079
     UnlockX        10818     0.002     0.313
     Flush         232811     1.027   269.735
    
    Throughput 869.867 MB/sec (sync dirs)  12 clients  12 procs  max_latency=269.741 ms
    
    After patchset:
    
     Operation      Count    AvgLat    MaxLat
     ----------------------------------------
     NTCreateX    4152738     0.029    20.863
     Close        3050770     0.002     1.119
     Rename        175829     0.871   211.741
     Unlink        838447     0.845   211.724
     Deltree          120     4.798    14.162
     Mkdir             60     0.003     0.005
     Qpathinfo    3763807     0.011     4.673
     Qfileinfo     660111     0.001     0.400
     Qfsinfo       690141     0.003     0.429
     Sfileinfo     338260     0.005     0.725
     Find         1455273     0.046     6.787
     WriteX       2073307     0.017     5.690
     ReadX        6509193     0.003     1.171
     LockX          13522     0.003     0.077
     UnlockX        13522     0.002     0.125
     Flush         291044     0.811   211.631
    
    Throughput 1089.27 MB/sec (sync dirs)  12 clients  12 procs  max_latency=211.750 ms
    
    (+25.2% throughput, -21.5% max latency)
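
    The percentages quoted for both tests can be rechecked from the raw
    numbers above. The helper below is a hypothetical convenience, not part
    of the test setup; it computes the signed relative change between a
    "before" and an "after" value with awk.

```shell
#!/bin/bash
# pct_change BEFORE AFTER
# Prints the signed percentage change from BEFORE to AFTER, one decimal.
pct_change() {
    awk -v b="$1" -v a="$2" 'BEGIN { printf "%+.1f\n", (a - b) / b * 100 }'
}

pct_change 622642 354134      # btrfs_log_inode() total time: -43.1
pct_change 869.867 1089.27    # dbench throughput (MB/sec):   +25.2
pct_change 269.741 211.750    # dbench max latency (ms):      -21.5
```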
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>