1. 14 Mar, 2022 40 commits
    • Omar Sandoval's avatar
      fs: export variant of generic_write_checks without iov_iter · f6f7a25a
      Omar Sandoval authored
      Encoded I/O in Btrfs needs to check a write with a given logical size
      without an iov_iter that matches that size (because the iov_iter we have
      is for the compressed data). So, factor out the parts of
      generic_write_check() that don't need an iov_iter into a new
      generic_write_checks_count() function and export that.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f6f7a25a
    • Omar Sandoval's avatar
      fs: export rw_verify_area() · 87112933
      Omar Sandoval authored
      I'm adding btrfs ioctls to read and write compressed data, and rather
      than duplicating the checks in rw_verify_area(), let's just export it.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      87112933
    • Sidong Yang's avatar
      btrfs: qgroup: remove outdated TODO comments · 457b0a3d
      Sidong Yang authored
      These comments are old, outdated and not very specific. It seems that it
      doesn't help to inspire anybody to work on that.  So we remove them.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarSidong Yang <realwakka@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      457b0a3d
    • Sidong Yang's avatar
      btrfs: qgroup: remove duplicated check in adding qgroup relations · a8f6f619
      Sidong Yang authored
      Removes duplicated check when adding qgroup relations.
      btrfs_add_qgroup_relations function adds relations by calling
      add_relation_rb(). add_relation_rb() checks that member/parentid exists
      in current qgroup_tree. But it already checked before calling the
      function. It seems that we don't need to double check.
      
      Add new function __add_relation_rb() that adds relations with
      qgroup structures and makes old function use the new one. And it makes
      btrfs_add_qgroup_relation() function work without double checks by
      calling the new function.
      Signed-off-by: default avatarSidong Yang <realwakka@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add comments ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a8f6f619
    • Dāvis Mosāns's avatar
      btrfs: add lzo workspace buffer length constants · dc4a4bdb
      Dāvis Mosāns authored
      It makes it more readable for length checking and is be used repeatedly.
      Signed-off-by: default avatarDāvis Mosāns <davispuh@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dc4a4bdb
    • Qu Wenruo's avatar
      btrfs: populate extent_map::generation when reading from disk · 40e7efe0
      Qu Wenruo authored
      When btrfs_get_extent() tries to get some file extent from disk, it
      never populates extent_map::generation, leaving the value to be 0.
      
      On the other hand, for extent map generated by IO, it will get its
      generation properly set at finish_ordered_io()
      
       finish_ordered_io()
       |- unpin_extent_cache(gen = trans->transid)
          |- em->generation = gen;
      
      [CAUSE]
      Since extent_map::generation is mostly used by fsync code, and for fsync
      they only care about modified extents, which all have their
      em::generation > 0.
      
      Thus it's fine to not populate em read from disk for fsync.
      
      [CORNER CASE]
      However autodefrag also relies on em::generation to determine if one
      extent needs to be defragged.
      
      This unpopulated extent_map::generation can prevent the following
      autodefrag case from working:
      
      	mkfs.btrfs -f $dev
      	mount $dev $mnt -o autodefrag
      
      	# initial write to queue the inode for autodefrag
      	xfs_io -f -c "pwrite 0 4k" $mnt/file
      	sync
      
      	# Real fragmented write
      	xfs_io -f -s -c "pwrite -b 4096 0 32k" $mnt/file
      	sync
      	echo "=== before autodefrag ==="
      	xfs_io -c "fiemap -v" $mnt/file
      
      	# Drop cache to force em to be read from disk
      	echo 3 > /proc/sys/vm/drop_caches
      	mount -o remount,commit=1 $mnt
      	sleep 3
      	sync
      
      	echo "=== After autodefrag ==="
      	xfs_io -c "fiemap -v" $mnt/file
      	umount $mnt
      
      The result looks like this:
      
        === before autodefrag ===
        /mnt/btrfs/file:
         EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
           0: [0..15]:         26672..26687        16   0x0
           1: [16..31]:        26656..26671        16   0x0
           2: [32..47]:        26640..26655        16   0x0
           3: [48..63]:        26624..26639        16   0x1
        === After autodefrag ===
        /mnt/btrfs/file:
         EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
           0: [0..15]:         26672..26687        16   0x0
           1: [16..31]:        26656..26671        16   0x0
           2: [32..47]:        26640..26655        16   0x0
           3: [48..63]:        26624..26639        16   0x1
      
      This fragmented 32K will not be defragged by autodefrag.
      
      [FIX]
      To make things less weird, just populate extent_map::generation when
      reading file extents from disk.
      
      This would make above fragmented extents to be properly defragged:
      
        == before autodefrag ===
        /mnt/btrfs/file:
         EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
           0: [0..15]:         26672..26687        16   0x0
           1: [16..31]:        26656..26671        16   0x0
           2: [32..47]:        26640..26655        16   0x0
           3: [48..63]:        26624..26639        16   0x1
        === After autodefrag ===
        /mnt/btrfs/file:
         EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
           0: [0..63]:         26688..26751        64   0x1
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      40e7efe0
    • Filipe Manana's avatar
      btrfs: assert we have a write lock when removing and replacing extent maps · 6d3b050e
      Filipe Manana authored
      Removing or replacing an extent map requires holding a write lock on the
      extent map's tree. We currently do that everywhere, except in one of the
      self tests, where it's harmless since there's no concurrency.
      
      In order to catch possible races in the future, assert that we are holding
      a write lock on the extent map tree before removing or replacing an extent
      map in the tree, and update the self test to obtain a write lock before
      removing extent maps.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6d3b050e
    • Filipe Manana's avatar
      btrfs: remove no longer used counter when reading data page · ad3fc794
      Filipe Manana authored
      After commit 92082d40 ("btrfs: integrate page status update for
      data read path into begin/end_page_read"), the 'nr' counter at
      btrfs_do_readpage() is no longer used, we increment it but we never
      read from it. So just remove it.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ad3fc794
    • Filipe Manana's avatar
      btrfs: fix lost error return value when reading a data page · bbf0ea7e
      Filipe Manana authored
      At btrfs_do_readpage(), if we get an error when trying to lookup for an
      extent map, we end up marking the page with the error bit, clearing
      the uptodate bit on it, and doing everything else that should be done.
      However we return success (0) to the caller, when we should return the
      error encoded in the extent map pointer. So fix that by returning the
      error encoded in the pointer.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bbf0ea7e
    • Filipe Manana's avatar
      btrfs: stop checking for NULL return from btrfs_get_extent() · c0347550
      Filipe Manana authored
      At extent_io.c, in the read page and write page code paths, we are testing
      if the return value from btrfs_get_extent() can be NULL. However that is
      not possible, as btrfs_get_extent() always returns either an error pointer
      or a (non-NULL) pointer to an extent map structure.
      
      Everywhere else outside extent_io.c we never check for NULL, we always
      treat any returned value as a non-NULL pointer if it does not encode an
      error.
      
      So check only for the IS_ERR() case at extent_io.c.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c0347550
    • Filipe Manana's avatar
      btrfs: prepare extents to be logged before locking a log tree path · e1f53ed8
      Filipe Manana authored
      When we want to log an extent, in the fast fsync path, we obtain a path
      to the leaf that will hold the file extent item either through a deletion
      search, via btrfs_drop_extents(), or through an insertion search using
      btrfs_insert_empty_item(). After that we fill the file extent item's
      fields one by one directly on the leaf.
      
      Instead of doing that, we could prepare the file extent item before
      obtaining a btree path, and then copy the prepared extent item with a
      single operation once we get the path. This helps avoid some contention
      on the log tree, since we are holding write locks for longer than
      necessary, especially in the case where the path is obtained via
      btrfs_drop_extents() through a deletion search, which always keeps a
      write lock on the nodes at levels 1 and 2 (besides the leaf).
      
      This change does that, we prepare the file extent item that is going to
      be inserted before acquiring a path, and then copy it into a leaf using
      a single copy operation once we get a path.
      
      This change if part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      The following test was run to measure the impact of the whole patchset:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-R free-space-tree -O no-holes"
      
        NUM_JOBS=8
        FILE_SIZE=128M
        RUN_TIME=200
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=randwrite
        fsync=1
        fallocate=none
        group_reporting=1
        direct=0
        bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
        ioengine=sync
        filesize=$FILE_SIZE
        runtime=$RUN_TIME
        time_based
        directory=$MNT
        numjobs=$NUM_JOBS
        thread
        EOF
      
        echo "performance" | \
            tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        echo
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
      
        umount $MNT &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        fio /tmp/fio-job.ini
      
        umount $MNT
      
      The test ran inside a VM (8 cores, 32G of RAM) with the target disk
      mapping to a raw NVMe device, and using a non-debug kernel config
      (Debian's default config).
      
      Before the patchset:
      
      WRITE: bw=116MiB/s (122MB/s), 116MiB/s-116MiB/s (122MB/s-122MB/s), io=22.7GiB (24.4GB), run=200013-200013msec
      
      After the patchset:
      
      WRITE: bw=125MiB/s (131MB/s), 125MiB/s-125MiB/s (131MB/s-131MB/s), io=24.3GiB (26.1GB), run=200007-200007msec
      
      A 7.8% gain on throughput and +7.0% more IO done in the same period of
      time (200 seconds).
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e1f53ed8
    • Filipe Manana's avatar
      btrfs: remove useless path release in the fast fsync path · d8457531
      Filipe Manana authored
      There's no point in calling btrfs_release_path() after finishing the loop
      that logs the modified extents, since log_one_extent() returns with the
      path released. In case the list of extents is empty, the path is already
      released, so there's no need for that case as well.
      So just remove that unnecessary btrfs_release_path() call.
      
      This change if part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      The last patch in the series has some performance test result in its
      changelog.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d8457531
    • Filipe Manana's avatar
      btrfs: remove constraint on number of visited leaves when replacing extents · 7ecb4c31
      Filipe Manana authored
      At btrfs_drop_extents(), we try to replace a range of file extent items
      with a new file extent in a single btree search, to avoid the need to do
      a search for deletion, followed by a path release and followed by yet
      another search for insertion.
      
      When I originally added that optimization, in commit 1acae57b
      ("Btrfs: faster file extent item replace operations"), I left a constraint
      to do the fast replace only if we visited a single leaf. That was because
      in the most common case we find all file extent items that need to be
      deleted (or trimmed) in a single leaf, however it can work for other
      common cases like when we need to delete a few file extent items located
      at the end of a leaf and a few more located at the beginning of the next
      leaf. The key for the new file extent item is greater than the key of
      any deleted or trimmed file extent item from previous leaves, so we are
      fine to use the last leaf that we found as long as we are holding a
      write lock on it - even if the new key ends up at slot 0, as if that's
      the case, the btree search has obtained a write lock on any upper nodes
      that need to have a key pointer updated.
      
      So removed the constraint that limits the optimization to the case where
      we visited only a single leaf.
      
      This change if part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      The last patch in the series has some performance test result in its
      changelog.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7ecb4c31
    • Filipe Manana's avatar
      btrfs: avoid unnecessary computation when deleting items from a leaf · 0cae23b6
      Filipe Manana authored
      When deleting items from a leaf, we always compute the sum of the data
      sizes of the items that are going to be deleted. However we only use
      that sum when the last item to delete is behind the last item in the
      leaf. This unnecessarily wastes CPU time when we are deleting either
      the whole leaf or from some slot > 0 up to the last item in the leaf,
      and both of these cases are common (e.g. truncation operation, either
      as a result of truncate(2) or when logging inodes, deleting checksums
      after removing a large enough extent, etc).
      
      So compute only the sum of the data sizes if the last item to be
      deleted does not match the last item in the leaf.
      
      This change if part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      The last patch in the series has some performance test result in its
      changelog.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0cae23b6
    • Filipe Manana's avatar
      btrfs: avoid unnecessary COW of leaves when deleting items from a leaf · 7c4063d1
      Filipe Manana authored
      When we delete items from a leaf, if we end up with more than two thirds
      of unused leaf space, we try to delete the leaf by moving all its items
      into its left and right neighbour leaves. Sometimes that is not possible
      because there is not enough free space in the left and right leaves, and
      in that case we end up not deleting our leaf.
      
      The way we are doing this is not ideal and can be improved in the
      following ways:
      
      1) When we call push_leaf_left(), we pass a value of 1 byte to the data
         size parameter of push_leaf_left(). This is not realistic value because
         no item can have a size less than 25 bytes, which is the size of struct
         btrfs_item. This means that means that if the left leaf has not enough
         free space to push any item, we end up COWing it even if we end up not
         changing its content at all.
      
         COWing that leaf means allocating a new metadata extent, marking it
         dirty and doing more IO when committing a transaction or when syncing a
         log tree. For a log tree case, it's particularly more important to
         avoid the useless COW operation, as more IO can imply a higher latency
         for an fsync operation.
      
         So instead of passing 1 as the minimum data size for push_leaf_left(),
         pass the size of the first item in our leaf, as we don't want to COW
         the left leaf if we can't at least push the first item of our leaf;
      
      2) When we call push_leaf_right(), we also pass a value of 1 byte as the
         data size parameter of push_leaf_right(). Like the previous case, it
         will also result in COWing the right leaf even if we are not able to
         move any items into it, since there can't be any item with a size
         smaller than 25 bytes (the size of struct btrfs_item).
      
         So instead of passing 1 as the minimum data size to push_leaf_right(),
         pass a size that corresponds to the sum of the size of all the
         remaining items in our leaf. We are not interested in moving less than
         that, because if we do, we are not able to delete our leaf and we have
         COWed the right leaf for nothing. Plus, moving only some of the items
         of our leaf, it means an even less balanced tree.
      
         Just like the previous case, we want to avoid the useless COW of the
         right leaf, this way we don't have to spend time allocating one new
         metadata extent, and doing more IO when committing a transaction or
         syncing a log tree. For the log tree case it's specially more important
         because more IO can result in a higher latency for a fsync operation.
      
      So adjust the minimum data size passed to push_leaf_left() and
      push_leaf_right() as mentioned above.
      
      This change if part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      Not being able to delete a leaf that became less than 1/3 full after
      deleting items from it is actually common. For example, for the fio test
      mentioned in the changelog of patch 6/6, we are only able to delete a
      leaf at btrfs_del_items() about 5.3% of the time, due to its left and
      right neighbour leaves not having enough free space to push all the
      remaining items into them.
      
      The last patch in the series has some performance test result in its
      changelog.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7c4063d1
    • Filipe Manana's avatar
      btrfs: remove unnecessary leaf free space checks when pushing items · b4e098a9
      Filipe Manana authored
      When trying to push items from a leaf into its left and right neighbours,
      we lock the left or right leaf, check if it has the required minimum free
      space, COW the leaf and then check again if it has the minimum required
      free space. This second check is pointless:
      
      1) Most and foremost because it's not needed. We have a write lock on the
         leaf and on its parent node, so no one can come in and change either
         the pre-COW or post-COW version of the leaf for the whole duration of
         the push_leaf_left() and push_leaf_right() calls;
      
      2) The call to btrfs_leaf_free_space() is not trivial, it has a fair
         amount of arithmetic operations and access to fields in the leaf's
         header and items, so it's not very cheap.
      
      So remove the duplicated free space checks.
      
      This change if part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      The last patch in the series has some performance test result in its
      changelog.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b4e098a9
    • Johannes Thumshirn's avatar
      btrfs: stop checking for NULL return from btrfs_get_extent_fiemap() · 6b5b7a41
      Johannes Thumshirn authored
      In get_extent_skip_holes() we're checking the return of
      btrfs_get_extent_fiemap() for an error pointer or NULL, but
      btrfs_get_extent_fiemap() will never return NULL, only error pointers or
      a valid extent_map.
      
      The other caller of btrfs_get_extent_fiemap(), find_desired_extent(),
      correctly only checks for error-pointers.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6b5b7a41
    • Pankaj Raghav's avatar
      btrfs: zoned: remove redundant assignment in btrfs_check_zoned_mode · f716fa47
      Pankaj Raghav authored
      Remove the redundant assignment to zone_info variable in
      btrfs_check_zoned_mode function.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarPankaj Raghav <p.raghav@samsung.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f716fa47
    • David Sterba's avatar
      btrfs: replace BUILD_BUG_ON by static_assert · a55e65b8
      David Sterba authored
      The static_assert introduced in 6bab69c6 ("build_bug.h: add wrapper
      for _Static_assert") has been supported by compilers for a long time
      (gcc 4.6, clang 3.0) and can be used in header files. We don't need to
      put BUILD_BUG_ON to random functions but rather keep it next to the
      definition.
      
      The exception here is the UAPI header btrfs_tree.h that could be
      potentially included by userspace code and the static assert is not
      defined (nor used in any other header).
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a55e65b8
    • Johannes Thumshirn's avatar
      btrfs: zoned: allow DUP on meta-data block groups · 265f7237
      Johannes Thumshirn authored
      Allow creating or reading block-groups on a zoned device with DUP as a
      meta-data profile.
      
      This works because we're using the zoned_meta_io_lock and REQ_OP_WRITE
      operations for meta-data on zoned btrfs, so all writes to meta-data zones
      are aligned to the zone's write-pointer.
      
      Upon loading of the block-group, it is ensured both zones do have the same
      zone capacity and write-pointer offsets, so no extra machinery is needed
      to keep the write-pointers in sync for the meta-data zones. If this
      prerequisite is not met, loading of the block-group is refused.
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      265f7237
    • Johannes Thumshirn's avatar
      btrfs: zoned: prepare for allowing DUP on zoned · dbfcc18f
      Johannes Thumshirn authored
      Allow for a block-group to be placed on more than one physical zone.
      
      This is a preparation for allowing DUP profiles for meta-data on a zoned
      file-system.
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dbfcc18f
    • Johannes Thumshirn's avatar
      btrfs: zoned: make zone finishing multi stripe capable · 4dcbb8ab
      Johannes Thumshirn authored
      Currently finishing of a zone only works if the block group isn't
      spanning more than one zone.
      
      This limitation is purely artificial and can be easily expanded to block
      groups being places across multiple zones.
      
      This is a preparation for allowing DUP and later more complex block-group
      profiles on zoned btrfs.
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4dcbb8ab
    • Johannes Thumshirn's avatar
      btrfs: zoned: make zone activation multi stripe capable · f9a912a3
      Johannes Thumshirn authored
      Currently activation of a zone only works if the block group isn't
      spanning more than one zone.
      
      This limitation is purely artificial and can be easily expanded to block
      groups being places across multiple zones.
      
      This is a preparation for allowing DUP and later more complex block-group
      profiles on zoned btrfs.
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f9a912a3
    • Josef Bacik's avatar
      btrfs: add support for multiple global roots · f7238e50
      Josef Bacik authored
      With extent tree v2 you will be able to create multiple csum, extent,
      and free space trees.  They will be used based on the block group, which
      will now use the block_group_item->chunk_objectid to point to the set of
      global roots that it will use.  When allocating new block groups we'll
      simply mod the gigabyte offset of the block group against the number of
      global roots we have and that will be the block groups global id.
      
      >From there we can take the bytenr that we're modifying in the respective
      tree, look up the block group and get that block groups corresponding
      global root id.  From there we can get to the appropriate global root
      for that bytenr.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f7238e50
    • Josef Bacik's avatar
      btrfs: add code to support the block group root · 9c54e80d
      Josef Bacik authored
      This code adds the on disk structures for the block group root, which
      will hold the block group items for extent tree v2.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9c54e80d
    • Josef Bacik's avatar
      btrfs: abstract out loading the tree root · bd676446
      Josef Bacik authored
      We're going to be adding more roots that need to be loaded from the
      super block, so abstract out the code to read the tree_root from the
      super block, and use this helper for the chunk root as well.  This will
      make it simpler to load the new trees in the future.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bd676446
    • Josef Bacik's avatar
      btrfs: tree-checker: don't fail on empty extent roots for extent tree v2 · c2fa821c
      Josef Bacik authored
      For extent tree v2 we can definitely have empty extent roots, so skip
      this particular check if we have that set.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c2fa821c
    • Josef Bacik's avatar
      btrfs: disable space cache related mount options for extent tree v2 · 63cd070d
      Josef Bacik authored
      We cannot fall back on the slow caching for extent tree v2, which means
      we can't just arbitrarily clear the free space trees at mount time.
      Furthermore we can't do v1 space cache with extent tree v2.  Simply
      ignore these mount options for extent tree v2 as they aren't relevant.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      63cd070d
    • Josef Bacik's avatar
      btrfs: disable snapshot creation/deletion for extent tree v2 · 813febdb
      Josef Bacik authored
      When we stop tracking metadata blocks all of snapshotting will break, so
      disable it until I add the snapshot root and drop tree support.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      813febdb
    • Josef Bacik's avatar
      btrfs: disable scrub for extent-tree-v2 · da32c6d5
      Josef Bacik authored
      Scrub depends on extent references for every block, and with extent tree
      v2 we won't have that, so disable scrub until we can add back the proper
      code to handle extent-tree-v2.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      da32c6d5
    • Josef Bacik's avatar
      btrfs: disable qgroups in extent tree v2 · ef3eccc1
      Josef Bacik authored
      Backref lookups are going to be drastically different with extent tree
      v2, disable qgroups until we do the work to add this support for extent
      tree v2.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ef3eccc1
    • Josef Bacik's avatar
      btrfs: disable device manipulation ioctl's EXTENT_TREE_V2 · 914a519b
      Josef Bacik authored
      Device add, remove, and replace all require balance, which doesn't work
      right now on extent tree v2, so disable these for now.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      914a519b
    • Josef Bacik's avatar
      btrfs: disable balance for extent tree v2 for now · 4b349253
      Josef Bacik authored
      With global root id's it makes it problematic to do backref lookups for
      balance.  This isn't hard to deal with, but future changes are going to
      make it impossible to lookup backrefs on any COWonly roots, so go ahead
      and disable balance for now on extent tree v2 until we can add balance
      support back in future patches.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4b349253
    • Josef Bacik's avatar
      btrfs: add definition for EXTENT_TREE_V2 · 2c7d2a23
      Josef Bacik authored
      This adds the initial definition of the EXTENT_TREE_V2 incompat feature
      flag.  This also hides the support behind CONFIG_BTRFS_DEBUG.
      
      THIS IS A IN DEVELOPMENT FORMAT CHANGE, DO NOT USE UNLESS YOU ARE A
      DEVELOPER OR A TESTER.
      
      The format is in flux and will be added in stages, any fs will need to
      be re-made between updates to the format.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2c7d2a23
    • Filipe Manana's avatar
      btrfs: use single variable to track return value at btrfs_log_inode() · 65faced5
      Filipe Manana authored
      At btrfs_log_inode(), we have two variables to track errors and the
      return value of the function, named 'ret' and 'err'. In some places we
      use 'ret' and if gets a non-zero value we assign its value to 'err'
      and then jump to the 'out' label, while in other places we use 'err'
      directly without 'ret' as an intermediary. This is inconsistent, error
      prone and not necessary. So change that to use only the 'ret' variable,
      making this consistent with most functions in btrfs.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      65faced5
    • Filipe Manana's avatar
      btrfs: avoid inode logging during rename and link when possible · 0f8ce498
      Filipe Manana authored
      During a rename or link operation, we need to determine if an inode was
      previously logged or not, and if it was, do some update to the logged
      inode. We used to rely exclusively on the logged_trans field of struct
      btrfs_inode to determine that, but that was not reliable because the
      value of that field is not persisted in the inode item, so it's lost
      when an inode is evicted and loaded back again. That led to several
      issues in the past, such as not persisting deletions (such as the case
      fixed by commit 803f0f64 ("Btrfs: fix fsync not persisting dentry
      deletions due to inode evictions")), or resulting in losing a file
      after an inode eviction followed by a rename (commit ecc64fab
      ("btrfs: fix lost inode on log replay after mix of fsync, rename and
      inode eviction")), besides other issues.
      
      So the inode_logged() helper was introduced and used to determine if an
      inode was possibly logged before in the current transaction, with the
      caveat that it could return false positives, in the sense that even if an
      inode was not logged before in the current transaction, it could still
      return true, but never to return false in case the inode was logged.
      >From a functional point of view that is fine, but from a performance
      perspective it can introduce significant latencies to rename and link
      operations, as they will end up doing inode logging even when it is not
      necessary.
      
      Recently on a 5.15 kernel, an openSUSE Tumbleweed user reported package
      installations and upgrades, with the zypper tool, were often taking a
      long time to complete. With strace it could be observed that zypper was
      spending about 99% of its time on rename operations, and then with
      further analysis we checked that directory logging was happening too
      frequently. Taking into account that installation/upgrade of some of the
      packages needed a few thousand file renames, the slowdown was very
      noticeable for the user.
      
      The issue was caused indirectly due to an excessive number of inode
      evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14 or
      a 5.16-rc8 kernel. While triggering the inode evictions if something
      outside btrfs' control, btrfs could still behave better by eliminating
      the false positives from the inode_logged() helper.
      
      So change inode_logged() to actually eliminate such false positives caused
      by inode eviction and when an inode was never logged since the filesystem
      was mounted, as both cases relate to when the logged_trans field of struct
      btrfs_inode has a value of zero. When it can not determine if the inode
      was logged based only on the logged_trans value, lookup for the existence
      of the inode item in the log tree - if it's there then we known the inode
      was logged, if it's not there then it can not have been logged in the
      current transaction. Once we determine if the inode was logged, update
      the logged_trans value to avoid future calls to have to search in the log
      tree again.
      
      Alternatively, we could start storing logged_trans in the on disk inode
      item structure (struct btrfs_inode_item) in the unused space it still has,
      but that would be a bit odd because:
      
      1) We only care about logged_trans since the filesystem was mounted, we
         don't care about its value from a previous mount. Having it persisted
         in the inode item structure would not make the best use of the precious
         unused space;
      
      2) In order to get logged_trans persisted before inode eviction, we would
         have to update the delayed inode when we finish logging the inode and
         update its logged_trans in struct btrfs_inode, which makes it a bit
         cumbersome since we need to check if the delayed inode exists, if not
         create it and populate it and deal with any errors (-ENOMEM mostly).
      
      This change is part of a patchset comprised of the following patches:
      
        1/5 btrfs: add helper to delete a dir entry from a log tree
        2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
        3/5 btrfs: avoid logging all directory changes during renames
        4/5 btrfs: stop doing unnecessary log updates during a rename
        5/5 btrfs: avoid inode logging during rename and link when possible
      
      The following test script mimics part of what the zypper tool does during
      package installations/upgrades. It does not triggers inode evictions, but
      it's similar because it triggers false positives from the inode_logged()
      helper, because the inodes have a logged_trans of 0, there's a log tree
      due to a fsync of an unrelated file and the directory inode has its
      last_trans field set to the current transaction:
      
        $ cat test.sh
      
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
      
        NUM_FILES=10000
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        mkdir $MNT/testdir
      
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        sync
      
        # Now do some change to an unrelated file and fsync it.
        # This is just to create a log tree to make sure that inode_logged()
        # does not return false when called against "testdir".
        xfs_io -f -c "pwrite 0 4K" -c "fsync" $MNT/foo
      
        # Do some change to testdir. This is to make sure inode_logged()
        # will return true when called against "testdir", because its
        # logged_trans is 0, it was changed in the current transaction
        # and there's a log tree.
        echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
      
        echo "Renaming $NUM_FILES files..."
        start=$(date +%s%N)
        for ((i = 1; i <= $NUM_FILES; i++)); do
            mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
        done
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "Renames took $dur milliseconds"
      
        umount $MNT
      
      Testing this change on a box using a non-debug kernel (Debian's default
      kernel config) gave the following results:
      
      NUM_FILES=10000, before patchset:                   27837 ms
      NUM_FILES=10000, after patches 1/5 to 4/5 applied:   9236 ms (-66.8%)
      NUM_FILES=10000, after whole patchset applied:       8902 ms (-68.0%)
      
      NUM_FILES=5000, before patchset:                     9127 ms
      NUM_FILES=5000, after patches 1/5 to 4/5 applied:    4640 ms (-49.2%)
      NUM_FILES=5000, after whole patchset applied:        4441 ms (-51.3%)
      
      NUM_FILES=2000, before patchset:                     2528 ms
      NUM_FILES=2000, after patches 1/5 to 4/5 applied:    1983 ms (-21.6%)
      NUM_FILES=2000, after whole patchset applied:        1747 ms (-30.9%)
      
      NUM_FILES=1000, before patchset:                     1085 ms
      NUM_FILES=1000, after patches 1/5 to 4/5 applied:     893 ms (-17.7%)
      NUM_FILES=1000, after whole patchset applied:         867 ms (-20.1%)
      
      Running dbench on the same physical machine with the following script:
      
        $ cat run-dbench.sh
        #!/bin/bash
      
        NUM_JOBS=$(nproc --all)
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-O no-holes -R free-space-tree"
      
        echo "performance" | \
            tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -D $MNT -t 120 $NUM_JOBS
      
        umount $MNT
      
      Before patchset:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    3761352     0.032   143.843
       Close        2762770     0.002     2.273
       Rename        159304     0.291    67.037
       Unlink        759784     0.207   143.998
       Deltree           72     4.028    15.977
       Mkdir             36     0.003     0.006
       Qpathinfo    3409780     0.013     9.678
       Qfileinfo     596772     0.001     0.878
       Qfsinfo       625189     0.003     1.245
       Sfileinfo     306443     0.006     1.840
       Find         13181061     0.063    19.798
       WriteX       1871137     0.021     8.532
       ReadX        5897325     0.003     3.567
       LockX          12252     0.003     0.258
       UnlockX        12252     0.002     0.100
       Flush         263666     3.327   155.632
      
      Throughput 980.047 MB/sec  12 clients  12 procs  max_latency=155.636 ms
      
      After whole patchset applied:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4195584     0.033   107.742
       Close        3081932     0.002     1.935
       Rename        177641     0.218    14.905
       Unlink        847333     0.166   107.822
       Deltree          118     5.315    15.247
       Mkdir             59     0.004     0.048
       Qpathinfo    3802612     0.014    10.302
       Qfileinfo     666748     0.001     1.034
       Qfsinfo       697329     0.003     0.944
       Sfileinfo     341712     0.006     2.099
       Find         1470365     0.065     9.359
       WriteX       2093921     0.021     8.087
       ReadX        6576234     0.003     3.407
       LockX          13660     0.003     0.308
       UnlockX        13660     0.002     0.114
       Flush         294090     2.906   115.539
      
      Throughput 1093.11 MB/sec  12 clients  12 procs  max_latency=115.544 ms
      
      +11.5% throughput    -25.8% max latency   rename max latency -77.8%
      
      Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0f8ce498
    • Filipe Manana's avatar
      btrfs: stop doing unnecessary log updates during a rename · 259c4b96
      Filipe Manana authored
      During a rename, we call __btrfs_unlink_inode(), which will call
      btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(), in order
      to remove an inode reference and a directory entry from the log. These
      are necessary when __btrfs_unlink_inode() is called from the unlink path,
      but not necessary when it's called from a rename context, because:
      
      1) For the btrfs_del_inode_ref_in_log() call, it's pointless to delete the
         inode reference related to the old name, because later in the rename
         path we call btrfs_log_new_name(), which will drop all inode references
         from the log and copy all inode references from the subvolume tree to
         the log tree. So we are doing one unnecessary btree operation which
         adds additional latency and lock contention in case there are other
         tasks accessing the log tree;
      
      2) For the btrfs_del_dir_entries_in_log() call, we are now doing the
         equivalent at btrfs_log_new_name() since the previous patch in the
         series, that has the subject "btrfs: avoid logging all directory
         changes during renames". In fact, having __btrfs_unlink_inode() call
         this function not only adds additional latency and lock contention due
         to the extra btree operation, but also can make btrfs_log_new_name()
         unnecessarily log a range item to track the deletion of the old name,
         since it has no way to known that the directory entry related to the
         old name was previously logged and already deleted by
         __btrfs_unlink_inode() through its call to
         btrfs_del_dir_entries_in_log().
      
      So skip those calls at __btrfs_unlink_inode() when we are doing a rename.
      Skipping them also allows us now to reduce the duration of time we are
      pinning a log transaction during renames, which is always beneficial as
      it's not delaying so much other tasks trying to sync the log tree, in
      particular we end up not holding the log transaction pinned while adding
      the new name (adding inode ref, directory entry, etc).
      
      This change is part of a patchset comprised of the following patches:
      
        1/5 btrfs: add helper to delete a dir entry from a log tree
        2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
        3/5 btrfs: avoid logging all directory changes during renames
        4/5 btrfs: stop doing unnecessary log updates during a rename
        5/5 btrfs: avoid inode logging during rename and link when possible
      
      Just like the previous patch in the series, "btrfs: avoid logging all
      directory changes during renames", the following script mimics part of
      what a package installation/upgrade with zypper does, which is basically
      renaming a lot of files, in some directory under /usr, to a name with a
      suffix of "-RPMDELETE":
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
      
        NUM_FILES=10000
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        mkdir $MNT/testdir
      
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        sync
      
        # Do some change to testdir and fsync it.
        echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
        xfs_io -c "fsync" $MNT/testdir
      
        echo "Renaming $NUM_FILES files..."
        start=$(date +%s%N)
        for ((i = 1; i <= $NUM_FILES; i++)); do
            mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
        done
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "Renames took $dur milliseconds"
      
        umount $MNT
      
      Testing this change on box a using a non-debug kernel (Debian's default
      kernel config) gave the following results:
      
      NUM_FILES=10000, before patchset:                   27399 ms
      NUM_FILES=10000, after patches 1/5 to 3/5 applied:   9093 ms (-66.8%)
      NUM_FILES=10000, after patches 1/5 to 4/5 applied:   9016 ms (-67.1%)
      
      NUM_FILES=5000, before patchset:                     9241 ms
      NUM_FILES=5000, after patches 1/5 to 3/5 applied:    4642 ms (-49.8%)
      NUM_FILES=5000, after patches 1/5 to 4/5 applied:    4553 ms (-50.7%)
      
      NUM_FILES=2000, before patchset:                     2550 ms
      NUM_FILES=2000, after patches 1/5 to 3/5 applied:    1788 ms (-29.9%)
      NUM_FILES=2000, after patches 1/5 to 4/5 applied:    1767 ms (-30.7%)
      
      NUM_FILES=1000, before patchset:                     1088 ms
      NUM_FILES=1000, after patches 1/5 to 3/5 applied:     905 ms (-16.9%)
      NUM_FILES=1000, after patches 1/5 to 4/5 applied:     883 ms (-18.8%)
      
      The next patch in the series (5/5), also contains dbench results after
      applying to whole patchset.
      
      Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      259c4b96
    • Filipe Manana's avatar
      btrfs: avoid logging all directory changes during renames · 88d2beec
      Filipe Manana authored
      When doing a rename of a file, if the file or its old parent directory
      were logged before, we log the new name of the file and then make sure
      we log the old parent directory, to ensure that after a log replay the
      old name of the file is deleted and the new name added.
      
      The logging of the old parent directory can take some time, because it
      will scan all leaves modified in the current transaction, check which
      directory entries were already logged, copy the ones that were not
      logged before, etc. In this rename context all we need to do is make
      sure that the old name of the file is deleted on log replay, so instead
      of triggering a directory log operation, we can just delete the old
      directory entry from the log if it's there, or in case it isn't there,
      just log a range item to signal log replay that the old name must be
      deleted. So change btrfs_log_new_name() to do that.
      
      This scenario is actually not uncommon to trigger, and recently on a
      5.15 kernel, an openSUSE Tumbleweed user reported package installations
      and upgrades, with the zypper tool, were often taking a long time to
      complete, much more than usual. With strace it could be observed that
      zypper was spending over 99% of its time on rename operations, and then
      with further analysis we checked that directory logging was happening
      too frequently and causing high latencies for the rename operations.
      Taking into account that installation/upgrade of some of these packages
      needed about a few thousand file renames, the slowdown was very noticeable
      for the user.
      
      The issue was caused indirectly due to an excessive number of inode
      evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14
      or a 5.16-rc8 kernel. After an inode eviction we can't tell for sure,
      in an efficient way, if an inode was previously logged in the current
      transaction, so we are pessimistic and assume it was, because in case
      it was we need to update the logged inode. More details on that in one
      of the patches in the same series (subject "btrfs: avoid inode logging
      during rename and link when possible"). Either way, in case the parent
      directory was logged before, we currently do more work then necessary
      during a rename, and this change minimizes that amount of work.
      
      The following script mimics part of what a package installation/upgrade
      with zypper does, which is basically renaming a lot of files, in some
      directory under /usr, to a name with a suffix of "-RPMDELETE":
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
      
        NUM_FILES=10000
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        mkdir $MNT/testdir
      
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        sync
      
        # Do some change to testdir and fsync it.
        echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
        xfs_io -c "fsync" $MNT/testdir
      
        echo "Renaming $NUM_FILES files..."
        start=$(date +%s%N)
        for ((i = 1; i <= $NUM_FILES; i++)); do
            mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
        done
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "Renames took $dur milliseconds"
      
        umount $MNT
      
      Testing this change on box using a non-debug kernel (Debian's default
      kernel config) gave the following results:
      
      NUM_FILES=10000, before this patch: 27399 ms
      NUM_FILES=10000, after this patch:   9093 ms (-66.8%)
      
      NUM_FILES=5000, before this patch:   9241 ms
      NUM_FILES=5000, after this patch:    4642 ms (-49.8%)
      
      NUM_FILES=2000, before this patch:   2550 ms
      NUM_FILES=2000, after this patch:    1788 ms (-29.9%)
      
      NUM_FILES=1000, before this patch:   1088 ms
      NUM_FILES=1000, after this patch:     905 ms (-16.9%)
      
      Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      88d2beec
    • Filipe Manana's avatar
      btrfs: pass the dentry to btrfs_log_new_name() instead of the inode · d5f5bd54
      Filipe Manana authored
      In the next patch in the series, there will be the need to access the old
      name, and its length, of an inode when logging the inode during a rename.
      So instead of passing the inode to btrfs_log_new_name() pass the dentry,
      because from the dentry we can get the inode, the name and its length.
      
      This will avoid passing 3 new parameters to btrfs_log_new_name() in the
      next patch - the name, its length and an index number. This way we end
      up passing only 1 new parameter, the index number.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d5f5bd54
    • Filipe Manana's avatar
      btrfs: add helper to delete a dir entry from a log tree · 839061fe
      Filipe Manana authored
      Move the code that finds and deletes a logged dir entry out of
      btrfs_del_dir_entries_in_log() into a helper function. This new helper
      function will be used by another patch in the same series, and serves
      to avoid having duplicated logic.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      839061fe