    btrfs: reduce contention on log trees when logging checksums · 3ebac17c
    Filipe Manana authored
    The possibility of extents being shared (through clone and deduplication
    operations) requires special care when logging data checksums, to avoid
    having a log tree with different checksum items that cover ranges which
    overlap (which resulted in missing checksums after replaying a log tree).
    Such problems were fixed in the past by the following commits:
    
    commit 40e046ac ("Btrfs: fix missing data checksums after replaying a
                          log tree")
    
    commit e289f03e ("btrfs: fix corrupt log due to concurrent fsync of
                          inodes with shared extents")
    
    Test case generic/588 exercises the scenario solved by the first commit
    (purely sequential and deterministic) while test case generic/457 often
    triggered the case fixed by the second commit (not deterministic, requires
    specific timings under concurrency).
    
    The problems were addressed by deleting, from the log tree, any existing
    checksums before logging the new ones, and by doing both the deletion and
    the logging of the checksums while locking the checksum range in an
    extent io tree (root->log_csum_range), to deal with the case where we
    have concurrent fsyncs against files with shared extents.
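
    As an illustration, the approach of those two fixes boils down to the
    following pattern when logging the checksums of an extent (a condensed
    sketch in kernel style; the helper names match the btrfs code base, but
    exact signatures vary between kernel versions):

      static int log_csums(struct btrfs_trans_handle *trans,
                           struct btrfs_root *log_root,
                           struct btrfs_ordered_sum *sums)
      {
              const u64 lock_end = sums->bytenr + sums->len - 1;
              struct extent_state *cached_state = NULL;
              int ret;

              /*
               * Lock the checksum range to serialize with concurrent fsyncs
               * of other inodes that share extents in this range.
               */
              lock_extent_bits(&log_root->log_csum_range, sums->bytenr,
                               lock_end, &cached_state);
              /*
               * Delete any existing checksum items in the log tree that
               * overlap this range, so that the insertion below never
               * results in overlapping checksum items.
               */
              ret = btrfs_del_csums(trans, log_root->fs_info, sums->bytenr,
                                    sums->len);
              if (!ret)
                      ret = btrfs_csum_file_blocks(trans, log_root, sums);
              unlock_extent_cached(&log_root->log_csum_range, sums->bytenr,
                                   lock_end, &cached_state);
              return ret;
      }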
    
    That however causes more contention on the leaves of a log tree where we
    store checksums (and all the nodes in the paths leading to them), even
    when we do not have shared extents, or all the shared extents were created
    by past transactions. It also adds a bit of contention on the spin lock
    of the log_csum_range extent io tree of the log root.
    
    This change adds a 'last_reflink_trans' field to the inode to keep track
    of the last transaction where a new extent was shared between inodes
    (through clone and deduplication operations). It is updated for both the
    source and destination inodes of reflink operations whenever a new extent
    (created in the current transaction) becomes shared by the inodes. This
    field is kept in memory only, not persisted in the inode item, similar
    to other existing fields (last_unlink_trans, logged_trans).
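
    In code terms this amounts to a new in-memory field in the btrfs inode
    structure plus an update in the reflink path, roughly like the sketch
    below (the 'extent_gen', 'src' and 'dst' names are illustrative, not
    necessarily the ones used in the actual code):

      /* In struct btrfs_inode (fs/btrfs/btrfs_inode.h): */

      /*
       * ID of the transaction where this inode was last the source or the
       * destination of a clone/dedupe operation. Kept in memory only, never
       * persisted in the inode item.
       */
      u64 last_reflink_trans;

      /* In the clone/dedupe path, for each extent that becomes shared: */

      if (extent_gen == trans->transid) {
              /* A new extent is now shared, tag both inodes. */
              BTRFS_I(src)->last_reflink_trans = trans->transid;
              BTRFS_I(dst)->last_reflink_trans = trans->transid;
      }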
    
    When logging checksums for an extent, if the value of 'last_reflink_trans'
    is smaller than the current transaction's generation/id, we skip locking
    the extent range and deletion of checksums from the log tree, since we
    know we do not have new shared extents. This reduces contention on the
    log tree's leaves where checksums are stored.
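
    In terms of the log_csums() sketch shown earlier, the fast path is a
    single comparison at the start of the function, along these lines (again
    a sketch, not necessarily the literal code):

      /*
       * If this inode was not the source or destination of a reflink
       * operation in the current transaction, it cannot have new shared
       * extents, so there is no risk of logging checksum items with
       * overlapping ranges. Skip the range locking and the deletion of
       * old checksum items.
       */
      if (inode->last_reflink_trans < trans->transid)
              return btrfs_csum_file_blocks(trans, log_root, sums);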
    
    The following script, which uses fio, was used to measure the impact of
    this change:
    
      $ cat test-fsync.sh
      #!/bin/bash
    
      DEV=/dev/sdk
      MNT=/mnt/sdk
      MOUNT_OPTIONS="-o ssd"
      MKFS_OPTIONS="-d single -m single"
    
      if [ $# -ne 3 ]; then
          echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
          exit 1
      fi
    
      NUM_JOBS=$1
      FILE_SIZE=$2
      FSYNC_FREQ=$3
    
      cat <<EOF > /tmp/fio-job.ini
      [writers]
      rw=write
      fsync=$FSYNC_FREQ
      fallocate=none
      group_reporting=1
      direct=0
      bs=64k
      ioengine=sync
      size=$FILE_SIZE
      directory=$MNT
      numjobs=$NUM_JOBS
      EOF
    
      echo "Using config:"
      echo
      cat /tmp/fio-job.ini
      echo
    
      mkfs.btrfs -f $MKFS_OPTIONS $DEV
      mount $MOUNT_OPTIONS $DEV $MNT
      fio /tmp/fio-job.ini
      umount $MNT
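
    The script takes the number of fio jobs, the per job file size and the
    fsync frequency as arguments, so the 16 jobs run below, for example,
    corresponds to running "./test-fsync.sh 16 1G 1".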
    
    The tests were performed for different numbers of jobs, file sizes and
    fsync frequencies. A qemu VM using kvm was used, with 8 cores (the host
    has 12 cores, with the cpu frequency governor set to performance on all
    cores), 16GiB of RAM (the host has 64GiB) and an NVMe device used
    directly (without an intermediary filesystem on the host). While running
    the tests, the host was not used for anything else, to avoid disturbing
    them.
    
    The results obtained were the following (only the last line of fio's
    output is pasted for each run). A significant difference is observable
    starting at 16 jobs in this particular setup and hardware (differences
    highlighted below). The very small differences in the tests with fewer
    than 16 jobs are likely just noise.
    
        **** 1 job, file size 1G, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=23.8MiB/s (24.9MB/s), 23.8MiB/s-23.8MiB/s (24.9MB/s-24.9MB/s), io=1024MiB (1074MB), run=43075-43075msec
    
    after this change:
    
    WRITE: bw=24.4MiB/s (25.6MB/s), 24.4MiB/s-24.4MiB/s (25.6MB/s-25.6MB/s), io=1024MiB (1074MB), run=41938-41938msec
    
        **** 2 jobs, file size 1G, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=37.7MiB/s (39.5MB/s), 37.7MiB/s-37.7MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54351-54351msec
    
    after this change:
    
    WRITE: bw=37.7MiB/s (39.5MB/s), 37.6MiB/s-37.6MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54428-54428msec
    
        **** 4 jobs, file size 1G, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=67.5MiB/s (70.8MB/s), 67.5MiB/s-67.5MiB/s (70.8MB/s-70.8MB/s), io=4096MiB (4295MB), run=60669-60669msec
    
    after this change:
    
    WRITE: bw=68.6MiB/s (71.9MB/s), 68.6MiB/s-68.6MiB/s (71.9MB/s-71.9MB/s), io=4096MiB (4295MB), run=59678-59678msec
    
        **** 8 jobs, file size 1G, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=128MiB/s (134MB/s), 128MiB/s-128MiB/s (134MB/s-134MB/s), io=8192MiB (8590MB), run=64048-64048msec
    
    after this change:
    
    WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=8192MiB (8590MB), run=63405-63405msec
    
        **** 16 jobs, file size 1G, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=78.5MiB/s (82.3MB/s), 78.5MiB/s-78.5MiB/s (82.3MB/s-82.3MB/s), io=16.0GiB (17.2GB), run=208676-208676msec
    
    after this change:
    
    WRITE: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=16.0GiB (17.2GB), run=149295-149295msec
    (+40.1% throughput, -28.5% runtime)
    
        **** 32 jobs, file size 1G, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=58.8MiB/s (61.7MB/s), 58.8MiB/s-58.8MiB/s (61.7MB/s-61.7MB/s), io=32.0GiB (34.4GB), run=557134-557134msec
    
    after this change:
    
    WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430550-430550msec
    (+29.4% throughput, -22.7% runtime)
    
        **** 64 jobs, file size 512M, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=65.8MiB/s (69.0MB/s), 65.8MiB/s-65.8MiB/s (69.0MB/s-69.0MB/s), io=32.0GiB (34.4GB), run=498055-498055msec
    
    after this change:
    
    WRITE: bw=85.1MiB/s (89.2MB/s), 85.1MiB/s-85.1MiB/s (89.2MB/s-89.2MB/s), io=32.0GiB (34.4GB), run=385116-385116msec
    (+29.3% throughput, -22.7% runtime)
    
        **** 128 jobs, file size 256M, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=54.7MiB/s (57.3MB/s), 54.7MiB/s-54.7MiB/s (57.3MB/s-57.3MB/s), io=32.0GiB (34.4GB), run=599373-599373msec
    
    after this change:
    
    WRITE: bw=121MiB/s (126MB/s), 121MiB/s-121MiB/s (126MB/s-126MB/s), io=32.0GiB (34.4GB), run=271907-271907msec
    (+121.2% throughput, -54.6% runtime)
    
        **** 256 jobs, file size 256M, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=69.2MiB/s (72.5MB/s), 69.2MiB/s-69.2MiB/s (72.5MB/s-72.5MB/s), io=64.0GiB (68.7GB), run=947536-947536msec
    
    after this change:
    
    WRITE: bw=121MiB/s (127MB/s), 121MiB/s-121MiB/s (127MB/s-127MB/s), io=64.0GiB (68.7GB), run=541916-541916msec
    (+74.9% throughput, -42.8% runtime)
    
        **** 512 jobs, file size 128M, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=85.4MiB/s (89.5MB/s), 85.4MiB/s-85.4MiB/s (89.5MB/s-89.5MB/s), io=64.0GiB (68.7GB), run=767734-767734msec
    
    after this change:
    
    WRITE: bw=141MiB/s (147MB/s), 141MiB/s-141MiB/s (147MB/s-147MB/s), io=64.0GiB (68.7GB), run=466022-466022msec
    (+65.1% throughput, -39.3% runtime)
    
        **** 1024 jobs, file size 128M, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=115MiB/s (120MB/s), 115MiB/s-115MiB/s (120MB/s-120MB/s), io=128GiB (137GB), run=1143775-1143775msec
    
    after this change:
    
    WRITE: bw=171MiB/s (180MB/s), 171MiB/s-171MiB/s (180MB/s-180MB/s), io=128GiB (137GB), run=764843-764843msec
    (+48.7% throughput, -33.1% runtime)
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>