1. 14 Mar, 2022 6 commits
    • btrfs: stop trying to log subdirectories created in past transactions · de6bc7f5
      Filipe Manana authored
      When logging a directory we are trying to log subdirectories that were
      changed in the current transaction and created in a past transaction.
      This behaviour was introduced by commit 2f2ff0ee ("Btrfs: fix metadata
      inconsistencies after directory fsync") to fix some metadata
      inconsistencies. Those inconsistencies no longer require this behaviour,
      thanks to numerous other changes made over the years.
      
      Besides no longer being needed, this behaviour is also undesirable because:
      
      1) It's not reliable because it's only triggered for the directories
         of dentries (dir items) that happen to be present on a leaf that
         was changed in the current transaction. If a dentry that points to
         a directory resides on a leaf that was not changed in the current
         transaction, then it's not logged, as at log_dir_items() and
         log_new_dir_dentries() we use btrfs_search_forward();
      
      2) It's not required by POSIX or any other standard; it's undefined
         territory. The only way to guarantee that a subdirectory is logged
         is to explicitly fsync it.
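
      As an illustration of point 2, the following is a minimal userspace
      sketch of explicitly fsyncing a directory. It uses only the standard
      open(), fsync() and close() calls; the helper name fsync_dir() is
      hypothetical and is not part of the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Explicitly fsync the directory at 'path'.
 * Returns 0 on success and -1 on failure.
 * fsync_dir() is a hypothetical helper name used for illustration only.
 */
static int fsync_dir(const char *path)
{
    assert(path != NULL);

    /* O_DIRECTORY makes open() fail if 'path' is not a directory. */
    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;

    int ret = fsync(fd);

    close(fd);
    return ret;
}
```

      An application that needs a subdirectory to be durable would call the
      helper on the subdirectory itself, for example fsync_dir("testdir/subdir"),
      rather than relying on an fsync of the parent directory.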
      
      Making the behaviour guaranteed would require scanning all directory
      items, checking which ones point to a directory, and then fsyncing each
      subdirectory that was modified in the current transaction. This could
      be very expensive for large directories with many subdirectories and/or
      large subdirectories.
      
      So remove that obsolete logic.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: stop copying old dir items when logging a directory · 732d591a
      Filipe Manana authored
      When logging a directory, we go over every leaf of the subvolume tree that
      was changed in the current transaction and copy all its dir index keys to
      the log tree.
      
      That includes copying dir index keys created in past transactions. This is
      done mostly for simplicity, as after logging the keys we log an item that
      specifies the start and end ranges of the keys we logged. That item is
      then used during log replay to figure out which keys need to be deleted:
      every key in that range that is found in the subvolume tree but not in
      the log tree needs to be deleted.
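
      The replay rule described above can be sketched in userspace C as
      follows. This is only a model under stated assumptions: index keys are
      represented by their offsets in plain arrays, and the function names
      has_key() and keys_to_delete() are hypothetical, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if 'offset' appears in the array 'keys' (hypothetical helper). */
static bool has_key(const uint64_t *keys, size_t n, uint64_t offset)
{
    for (size_t i = 0; i < n; i++)
        if (keys[i] == offset)
            return true;
    return false;
}

/*
 * Model of the replay rule: every subvolume key inside the logged range
 * [range_start, range_end] that is absent from the log must be deleted.
 * Writes those offsets to 'to_delete' (capacity >= subvol_n) and returns
 * how many were written.
 */
static size_t keys_to_delete(const uint64_t *subvol, size_t subvol_n,
                             const uint64_t *logged, size_t logged_n,
                             uint64_t range_start, uint64_t range_end,
                             uint64_t *to_delete)
{
    assert(range_start <= range_end);

    size_t out = 0;
    for (size_t i = 0; i < subvol_n; i++) {
        uint64_t off = subvol[i];

        /* Keys outside the logged range are left untouched. */
        if (off < range_start || off > range_end)
            continue;
        if (!has_key(logged, logged_n, off))
            to_delete[out++] = off;
    }
    return out;
}
```

      With subvolume keys {2, 3, 4, 5, 6, 9}, logged keys {2, 4, 6} and a
      logged range of [2, 6], the model marks 3 and 5 for deletion and leaves
      9 alone because it lies outside the range.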
      
      Now that we log only dir index keys, and not dir item keys anymore, when
      we remove dentries from a directory (due to unlink and rename operations),
      we can get entire leaves that we changed only for deleting old dir index
      keys, or that have few dir index keys that are new - this is due to the
      fact that the offset for new index keys comes from a monotonically
      increasing counter.
      
      We can avoid logging dir index keys from past transactions, and in order
      to track the deletions, only log range items (BTRFS_DIR_LOG_INDEX_KEY key
      type) when we find gaps between consecutive index keys. This massively
      reduces the amount of logged metadata when we have deleted directory
      entries, even if it's a small percentage of the total number of entries.
      The reduction comes both from logging fewer items and from the fact
      that, instead of logging many dir index items (struct btrfs_dir_item),
      which have a size of 30 bytes plus a file name, we typically log just a
      few range items (struct btrfs_dir_log_item), which take only 8 bytes each.
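
      To make the size difference concrete, here is a rough arithmetic sketch
      using the two sizes quoted above. It counts only item payloads and
      ignores any per-item header overhead; the function and constant names
      are hypothetical, and the item counts in the note below are made up
      purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum {
    DIR_ITEM_FIXED_SIZE = 30, /* struct btrfs_dir_item, without the name */
    DIR_LOG_ITEM_SIZE   = 8   /* struct btrfs_dir_log_item */
};

/* Payload bytes for 'nr_items' dir index items with names of 'name_len' bytes. */
static size_t dir_index_payload(size_t nr_items, size_t name_len)
{
    assert(name_len <= 255); /* btrfs file names are at most 255 bytes */
    return nr_items * (DIR_ITEM_FIXED_SIZE + name_len);
}

/* Payload bytes for 'nr_ranges' range items. */
static size_t range_item_payload(size_t nr_ranges)
{
    return nr_ranges * DIR_LOG_ITEM_SIZE;
}
```

      For example, 100 old dir index items with 10-byte names amount to
      4000 bytes of payload, while 3 range items amount to 24 bytes.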
      
      Even if no entries were deleted from a directory and only new entries
      were added, we typically still get a reduction in the amount of logged
      metadata, because it's very likely the first leaf that got the new
      dir index entries also has several old dir index entries.
      
      So change the logging logic to not log dir index keys created in past
      transactions and log a range item for every gap it finds between each
      pair of consecutive index keys, to ensure deletions are tracked and
      replayed on log replay.
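
      The new logging logic can be modelled with the simplified userspace
      sketch below. It assumes each index key carries the transaction id in
      which it was created and that keys arrive sorted by offset; the struct
      and function names are hypothetical, and the real code in tree-log.c
      operates on btree leaves rather than arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a dir index key: its offset and creating transaction. */
struct index_key {
    uint64_t offset;
    uint64_t transid;
};

/* Hypothetical model of a range item covering deleted offsets [start, end]. */
struct range_item {
    uint64_t start;
    uint64_t end;
};

/*
 * Walk the sorted keys of a changed leaf. Log only keys created in the
 * current transaction, and record a range item for every gap between
 * consecutive keys so that deletions can be replayed.
 * Returns the number of range items written to 'ranges'.
 */
static size_t log_index_keys(const struct index_key *keys, size_t nr_keys,
                             uint64_t cur_transid,
                             uint64_t *logged, size_t *nr_logged,
                             struct range_item *ranges)
{
    size_t nr_ranges = 0;

    *nr_logged = 0;
    for (size_t i = 0; i < nr_keys; i++) {
        if (i > 0) {
            assert(keys[i].offset > keys[i - 1].offset);
            /* A gap between consecutive keys means entries were deleted. */
            if (keys[i].offset > keys[i - 1].offset + 1) {
                ranges[nr_ranges].start = keys[i - 1].offset + 1;
                ranges[nr_ranges].end = keys[i].offset - 1;
                nr_ranges++;
            }
        }
        /* Keys from past transactions are skipped, not copied to the log. */
        if (keys[i].transid == cur_transid)
            logged[(*nr_logged)++] = keys[i].offset;
    }
    return nr_ranges;
}
```

      In this model the old keys are neither logged nor covered by a range
      item, so a replay based on the range items leaves them in place.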
      
      This patch is part of a patchset comprised of the following patches:
      
       1/4 btrfs: don't log unnecessary boundary keys when logging directory
       2/4 btrfs: put initial index value of a directory in a constant
       3/4 btrfs: stop copying old dir items when logging a directory
       4/4 btrfs: stop trying to log subdirectories created in past transactions
      
      The following test was run on a branch without this patchset and on a
      branch with the first three patches applied:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
      
        NUM_FILES=1000000
        NUM_FILE_DELETES=10000
      
        MKFS_OPTIONS="-O no-holes -R free-space-tree"
        MOUNT_OPTIONS="-o ssd"
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        mkdir $MNT/testdir
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        sync
      
        del_inc=$(( $NUM_FILES / $NUM_FILE_DELETES ))
        for ((i = 1; i <= $NUM_FILES; i += $del_inc)); do
            rm -f $MNT/testdir/file_$i
        done
      
        start=$(date +%s%N)
        xfs_io -c "fsync" $MNT/testdir
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
        echo
      
        umount $MNT
      
      The test was run on a non-debug kernel (Debian's default kernel config),
      and the results were the following for various values of NUM_FILES and
      NUM_FILE_DELETES:
      
      ** before, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
      
      dir fsync took 585 ms after deleting 10000 files
      
      ** after, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
      
      dir fsync took 34 ms after deleting 10000 files   (-94.2%)
      
      ** before, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
      
      dir fsync took 50 ms after deleting 1000 files
      
      ** after, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
      
      dir fsync took 7 ms after deleting 1000 files    (-86.0%)
      
      ** before, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 **
      
      dir fsync took 9 ms after deleting 100 files
      
      ** after, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 **
      
      dir fsync took 5 ms after deleting 100 files     (-44.4%)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: put initial index value of a directory in a constant · 528ee697
      Filipe Manana authored
      At btrfs_set_inode_index_count() we refer twice to the number 2 as the
      initial index value for a directory (when it's empty), with a proper
      comment explaining the reason for that value. In the next patch I'll
      have to use that magic value in the directory logging code, so put
      the value in a #define at btrfs_inode.h, to avoid hardcoding the
      magic value again at tree-log.c.
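
      The idea can be sketched as follows. The macro name DIR_START_INDEX and
      the helper next_dir_index() are hypothetical stand-ins chosen for this
      example; the text above does not name the actual #define.

```c
#include <assert.h>
#include <stdint.h>

/* Initial index value of an empty directory (hypothetical macro name). */
#define DIR_START_INDEX 2

/*
 * Return the next index to hand out for a directory.
 * 'has_entries' is non-zero if the directory already has index keys, in
 * which case 'last_index' is the highest index key offset found.
 */
static uint64_t next_dir_index(int has_entries, uint64_t last_index)
{
    if (!has_entries)
        return DIR_START_INDEX;

    assert(last_index >= DIR_START_INDEX);
    return last_index + 1;
}
```

      With a named constant, other code that needs the starting index can
      refer to it instead of repeating the literal 2.
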
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't log unnecessary boundary keys when logging directory · a450a4af
      Filipe Manana authored
      Before we start to log dir index keys from a leaf, we check if there is a
      previous index key, which normally is at the end of a leaf that was not
      changed in the current transaction. Then we log that key and set the start
      of logged range (item of type BTRFS_DIR_LOG_INDEX_KEY) to the offset of
      that key. This is to ensure that if there were deleted index keys between
      that key and the first key we are going to log, those deletions are
      replayed in case we need to replay the log after a power failure.
      However, we don't really need to log that previous key: we can just set
      the start of the logged range to that key's offset plus 1. This achieves
      the same result and avoids logging one dir index key.
      
      The same logic is performed when we finish logging the index keys of a
      leaf and we find that the next leaf has index keys and was not changed in
      the current transaction. We log the first key of that next leaf and use
      its offset as the end of the range we log. This is just to ensure that
      if there were deleted index keys between the last index key we logged and
      the first key of that next leaf, those index keys are deleted if we end
      up replaying the log. However, that is not necessary: we can avoid
      logging that first index key of the next leaf and instead set the end of
      the logged range to the offset of that index key minus 1.
      
      So avoid logging those index keys at the boundaries and adjust the start
      and end offsets of the logged ranges as described above.
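
      The boundary adjustment described above amounts to the following
      arithmetic, shown here as a small userspace sketch. The struct and
      function names are hypothetical; only the "plus 1" and "minus 1" rules
      come from the description.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a logged range of index key offsets. */
struct logged_range {
    uint64_t start;
    uint64_t end;
};

/*
 * Compute the logged range without logging the boundary keys themselves:
 * start just after the previous unchanged key and end just before the
 * first key of the next unchanged leaf.
 */
static struct logged_range adjust_boundaries(uint64_t prev_key_offset,
                                             uint64_t next_key_offset)
{
    /* At least one offset must lie strictly between the two boundary keys. */
    assert(prev_key_offset < next_key_offset);
    assert(next_key_offset - prev_key_offset >= 2);

    struct logged_range r = {
        .start = prev_key_offset + 1,
        .end = next_key_offset - 1,
    };
    return r;
}
```

      For boundary keys at offsets 10 and 20, the logged range becomes
      [11, 19], so neither boundary key needs to be copied to the log.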
      
      This patch is part of a patchset comprised of the following patches:
      
        1/4 btrfs: don't log unnecessary boundary keys when logging directory
        2/4 btrfs: put initial index value of a directory in a constant
        3/4 btrfs: stop copying old dir items when logging a directory
        4/4 btrfs: stop trying to log subdirectories created in past transactions
      
      Performance test results are listed in the changelog of patch 3/4.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reuse existing pointers from btrfs_ioctl · dc408ccd
      Sahil Kang authored
      btrfs_ioctl already contains pointers to the inode and btrfs_root
      structs, so we can pass them into the subfunctions instead of the
      toplevel struct file.
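
      The shape of this refactoring can be illustrated with mock types. All
      struct and function names below are hypothetical stand-ins and do not
      correspond to the kernel's definitions; the sketch only shows a
      top-level handler passing down pointers it already holds.

```c
#include <assert.h>
#include <stddef.h>

/* Mock stand-ins for the kernel structures (hypothetical, for illustration). */
struct mock_root  { int id; };
struct mock_inode { struct mock_root *root; };
struct mock_file  { struct mock_inode *inode; };

/* Before: the sub-handler takes the file and re-derives inode and root. */
static int sub_ioctl_from_file(struct mock_file *file)
{
    struct mock_inode *inode = file->inode;
    struct mock_root *root = inode->root;

    return root->id;
}

/* After: the sub-handler receives the root the caller already looked up. */
static int sub_ioctl_from_root(struct mock_root *root)
{
    return root->id;
}

/* Top-level handler: derives the pointers once and passes them down. */
static int top_ioctl(struct mock_file *file)
{
    assert(file != NULL && file->inode != NULL && file->inode->root != NULL);

    struct mock_inode *inode = file->inode;
    struct mock_root *root = inode->root;

    return sub_ioctl_from_root(root);
}
```

      Both variants return the same result; the second simply avoids
      repeating the pointer lookups in every sub-handler.
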
      Signed-off-by: Sahil Kang <sahil.kang@asilaycomputing.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove write and wait of struct walk_control · c816d705
      Filipe Manana authored
      The ->write and ->wait fields of struct walk_control, used for log trees,
      have not been used since 2008, more specifically since commit d0c803c4
      ("Btrfs: Record dirty pages tree-log pages in an extent_io tree"). So
      just remove them, along with the function btrfs_write_tree_block(),
      which is also no longer used after removing the ->write member.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 13 Mar, 2022 2 commits
  3. 12 Mar, 2022 8 commits
  4. 11 Mar, 2022 24 commits