• Qu Wenruo's avatar
    btrfs: skip update of block group item if used bytes are the same · 7248e0ce
    Qu Wenruo authored
    [BACKGROUND]
    
    When committing a transaction, we will update block group items for all
    dirty block groups.
    
    But in fact, dirty block groups don't always need to update their block
    group items.
    It's pretty common to have a metadata block group which experienced
    several COW operations, but still have the same amount of used bytes.
    
    In that case, we may unnecessarily COW a tree block doing nothing.
    
    [ENHANCEMENT]
    
    This patch will introduce btrfs_block_group::commit_used member to
    remember the last used bytes, and use that new member to skip
    unnecessary block group item update.
    
    This would be more common for large filesystems, where metadata block
    group can be as large as 1GiB, containing at most 64K metadata items.
    
    In that case, if COW added and then deleted one metadata item near the
    end of the block group, then it's completely possible we don't need to
    touch the block group item at all.
    
    [BENCHMARK]
    
    The change itself can have quite a high chance (20~80%) to skip block
    group item updates in lot of workloads.
    
    As a result, it would result shorter time spent on
    btrfs_write_dirty_block_groups(), and overall reduce the execution time
    of the critical section of btrfs_commit_transaction().
    
    Here comes a fio command, which will do random writes in 4K block size,
    causing a very heavy metadata updates.
    
    fio --filename=$mnt/file --size=512M --rw=randwrite --direct=1 --bs=4k \
        --ioengine=libaio --iodepth=64 --runtime=300 --numjobs=4 \
        --name=random_write --fallocate=none --time_based --fsync_on_close=1
    
    The file size (512M) and number of threads (4) means 2GiB file size in
    total, but during the full 300s run time, my dedicated SATA SSD is able
    to write around 20~25GiB, which is over 10 times the file size.
    
    Thus after we fill the initial 2G, we should not cause much block group
    item updates.
    
    Please note, the fio numbers by themselves don't have much change, but
    if we look deeper, there is some reduced execution time, especially for
    the critical section of btrfs_commit_transaction().
    
    I added extra trace_printk() to measure the following per-transaction
    execution time:
    
    - Critical section of btrfs_commit_transaction()
      By re-using the existing update_commit_stats() function, which
      has already calculated the interval correctly.
    
    - The while() loop for btrfs_write_dirty_block_groups()
      Although this includes the execution time of btrfs_run_delayed_refs(),
      it should still be representative overall.
    
    Both result involves transid 7~30, the same amount of transaction
    committed.
    
    The result looks like this:
    
                          |      Before       |     After      |  Diff
    ----------------------+-------------------+----------------+--------
    Transaction interval  | 229247198.5       | 215016933.6    | -6.2%
    Block group interval  | 23133.33333       | 18970.83333    | -18.0%
    
    The change in block group item updates is more obvious, as skipped block
    group item updates also mean less delayed refs.
    
    And the overall execution time for that block group update loop is
    pretty small, thus we can assume the extent tree is already mostly
    cached.  If we can skip an uncached tree block, it would cause more
    obvious change.
    
    Unfortunately the overall reduction in commit transaction critical
    section is much smaller, as the block group item updates loop is not
    really the major part, at least not for the above fio script.
    
    But still we have a observable reduction in the critical section.
    Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
    Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    7248e0ce
block-group.c 125 KB