    btrfs: stop copying old file extents when doing a full fsync · 7f30c072
    Filipe Manana authored
    When logging an inode in full sync mode, we go over every leaf that was
    modified in the current transaction and has items associated to our inode,
    and then copy all those items into the log tree. This includes copying
    file extent items that were created and added to the inode in past
    transactions, which is useless and only makes us use more leaf space in the
    log tree.
    
    It's common to have a file with many file extent items spanning many
    leaves where only a few file extent items are new and need to be logged,
    and in such a case we log all the file extent items we find in the modified
    leaves.
    
    So change the full sync behaviour to skip over file extent items that are
    not needed. Those are the ones that match the following criteria:
    
    1) Have a generation older than the current transaction and the inode
       was not a target of a reflink operation, as that can copy file extent
       items from a past generation from some other inode into our inode, so
       we have to log them;
    
    2) Start at an offset within i_size - we must log anything at or beyond
       i_size, otherwise we would lose prealloc extents after log replay.
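    
    As a rough illustration only, the two criteria above can be combined into
    a single predicate. The function and parameter names below are
    hypothetical and do not come from tree-log.c; this is a bash sketch of
    the decision, not the kernel implementation.

```shell
#!/bin/bash

# Hypothetical helper (not kernel code): returns 0 (true) when a file
# extent item does not need to be copied into the log tree during a full
# sync, according to the two criteria described in the text above.
can_skip_file_extent() {
    local extent_gen=$1   # generation stored in the file extent item
    local trans_gen=$2    # generation of the current transaction
    local reflinked=$3    # 1 if the inode was the target of a reflink
    local offset=$4       # file offset where the extent starts
    local i_size=$5       # current i_size of the inode

    # Criterion 1: the extent is from a past transaction and the inode
    # was not a reflink target (reflink can bring in old extents).
    (( extent_gen < trans_gen )) || return 1
    (( reflinked == 0 )) || return 1

    # Criterion 2: the extent starts within i_size. Anything at or
    # beyond i_size (prealloc extents) must always be logged.
    (( offset < i_size )) || return 1

    return 0
}

# Old extent inside i_size, no reflink: can be skipped.
can_skip_file_extent 5 7 0 4096 1048576 && echo "skip" || echo "log"
# Extent created in the current transaction: must be logged.
can_skip_file_extent 7 7 0 4096 1048576 && echo "skip" || echo "log"
# Old extent starting at i_size (prealloc case): must be logged.
can_skip_file_extent 5 7 0 1048576 1048576 && echo "skip" || echo "log"
```

    Both criteria must hold for an item to be skipped, so a failure of
    either one means the item is logged.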
    
    The following script exercises a scenario where this happens, and it's
    reasonably close to what often happened on a SQL Server workload which
    I had to debug some time ago to fix an issue where a pattern of writes to
    prealloc extents and fsync resulted in fsync failing with -EIO (that was
    commit ea7036de ("btrfs: fix fsync failure and transaction abort
    after writes to prealloc extents")). In that particular case, we had large
    files that had random writes and were often truncated, which made the
    next fsync be a full sync.
    
      $ cat test.sh
      #!/bin/bash
    
      DEV=/dev/sdi
      MNT=/mnt/sdi
    
      MKFS_OPTIONS="-O no-holes -R free-space-tree"
      MOUNT_OPTIONS="-o ssd"
    
      FILE_SIZE=$((1 * 1024 * 1024 * 1024)) # 1G
      # FILE_SIZE=$((2 * 1024 * 1024 * 1024)) # 2G
      # FILE_SIZE=$((512 * 1024 * 1024)) # 512M
    
      mkfs.btrfs -f $MKFS_OPTIONS $DEV
      mount $MOUNT_OPTIONS $DEV $MNT
    
      # Create a file with many extents. Use direct IO to make it faster
      # to create the file - using buffered IO we would have to fsync
      # after each write (terribly slow).
      echo "Creating file with $((FILE_SIZE / 4096)) extents of 4K each..."
      xfs_io -f -d -c "pwrite -b 4K 0 $FILE_SIZE" $MNT/foobar
    
      # Commit the transaction, so every extent after this is from an
      # old generation.
      sync
    
      # Now rewrite only a few extents, which are all far spread apart from
      # each other (e.g. 1G / 32M = 32 extents).
      # After this only a few extents have a new generation, while all other
      # ones have an old generation.
      echo "Rewriting $((FILE_SIZE / (32 * 1024 * 1024))) extents..."
      for ((i = 0; i < $FILE_SIZE; i += $((32 * 1024 * 1024)))); do
          xfs_io -c "pwrite $i 4K" $MNT/foobar >/dev/null
      done
    
      # Fsync, the inode is logged in full sync mode since it was never fsynced
      # before.
      echo "Fsyncing file..."
      xfs_io -c "fsync" $MNT/foobar
    
      umount $MNT
    
    And the following bpftrace program was running when executing the test
    script:
    
      $ cat bpf-script.sh
      #!/usr/bin/bpftrace
    
      k:btrfs_log_inode
      {
          @start_log_inode[tid] = nsecs;
      }
    
      kr:btrfs_log_inode
      /@start_log_inode[tid]/
      {
          @log_inode_dur[tid] = (nsecs - @start_log_inode[tid]) / 1000;
          delete(@start_log_inode[tid]);
      }
    
      k:btrfs_sync_log
      {
          @start_sync_log[tid] = nsecs;
      }
    
      kr:btrfs_sync_log
      /@start_sync_log[tid]/
      {
          $sync_log_dur = (nsecs - @start_sync_log[tid]) / 1000;
          printf("btrfs_log_inode() took %llu us\n", @log_inode_dur[tid]);
          printf("btrfs_sync_log()  took %llu us\n", $sync_log_dur);
          delete(@start_sync_log[tid]);
          delete(@log_inode_dur[tid]);
          exit();
      }
    
    With 512M test file, before this patch:
    
      btrfs_log_inode() took 15218 us
      btrfs_sync_log()  took 1328 us
    
      Log tree has 17 leaves and 1 node, its total size is 294912 bytes.
    
    With 512M test file, after this patch:
    
      btrfs_log_inode() took 14760 us
      btrfs_sync_log()  took 588 us
    
      Log tree has a single leaf, its total size is 16K.
    
    With 1G test file, before this patch:
    
      btrfs_log_inode() took 27301 us
      btrfs_sync_log()  took 1767 us
    
      Log tree has 33 leaves and 1 node, its total size is 557056 bytes.
    
    With 1G test file, after this patch:
    
      btrfs_log_inode() took 26166 us
      btrfs_sync_log()  took 593 us
    
      Log tree has a single leaf, its total size is 16K.
    
    With 2G test file, before this patch:
    
      btrfs_log_inode() took 50892 us
      btrfs_sync_log()  took 3127 us
    
      Log tree has 65 leaves and 1 node, its total size is 1081344 bytes.
    
    With 2G test file, after this patch:
    
      btrfs_log_inode() took 50126 us
      btrfs_sync_log()  took 586 us
    
      Log tree has a single leaf, its total size is 16K.
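    
    The log tree sizes reported above can be cross-checked with simple
    arithmetic, assuming every leaf and node occupies 16K (the size quoted
    for the single-leaf case). The helper name below is hypothetical and
    only serves this check.

```shell
#!/bin/bash

# Assumption: each leaf and each node of the log tree is 16K, matching
# the "single leaf, its total size is 16K" figure in the results.
NODESIZE=$((16 * 1024))

# Hypothetical helper: total size in bytes of a tree with the given
# number of leaves and nodes.
log_tree_bytes() {
    local leaves=$1
    local nodes=$2
    echo $(( (leaves + nodes) * NODESIZE ))
}

log_tree_bytes 17 1   # 512M file, before the patch: 294912
log_tree_bytes 33 1   # 1G file, before the patch:   557056
log_tree_bytes 65 1   # 2G file, before the patch:   1081344
log_tree_bytes 1 0    # any file size, after the patch: 16384
```

    The leaf counts before the patch (17, 33 and 65) are each one more than
    the number of extents the test script rewrites for that file size (16,
    32 and 64), while after the patch the log tree stays at a single leaf
    for all three sizes.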
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>