    btrfs: reduce amount of reserved metadata for delayed item insertion · 763748b2
    Filipe Manana authored
    Whenever we want to create a new dir index item (when creating an inode,
    creating a hard link or renaming a file) we reserve 1 unit of metadata
    space for it in a transaction (that's 256K for a node/leaf size of 16K),
    and then create a delayed insertion item for it to be added later to the
    subvolume's tree. That unit of metadata is kept until the delayed item
    is inserted into the subvolume tree, which may take a while to happen
    (in the worst case, it's done only when the transaction commits). If we
    have multiple dir index items to insert for the same directory, say N
    index items, and they all fit in a single leaf of metadata, then we are
    holding N units of reserved metadata space when all we need is 1 unit.
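
    As a rough illustration of the arithmetic above, the sketch below
    computes the size of one reservation unit and the space held for N
    pending items. It assumes one unit is the node size times a maximum
    b-tree height of 8 levels times 2, which matches the 256K figure for a
    16K node size. The function and constant names are made up for this
    example and are not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: maximum b-tree height used when sizing a reservation. */
#define EXAMPLE_MAX_LEVEL 8

/*
 * Hypothetical helper: bytes reserved for inserting num_items items,
 * assuming each item may need to COW (and split) every node on a full
 * path, i.e. nodesize * max level * 2 per item.
 */
static uint64_t example_insert_reservation(uint32_t nodesize, uint64_t num_items)
{
	assert(nodesize > 0);
	return (uint64_t)nodesize * EXAMPLE_MAX_LEVEL * 2 * num_items;
}

/* Bytes held beyond the single unit needed when all items fit in one leaf. */
static uint64_t example_excess_reservation(uint32_t nodesize, uint64_t num_items)
{
	if (num_items <= 1)
		return 0;
	return example_insert_reservation(nodesize, num_items - 1);
}
```

    With a 16K node size, example_insert_reservation(16384, 1) is 262144
    bytes (256K), and 200 pending items for one directory would hold
    200 * 256K = 50M when a single unit would do.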
    
    This change addresses that. Whenever a new delayed dir index item is
    added, we release the unit of metadata the caller reserved when it
    started the transaction if adding that new dir index item does not
    result in touching one more metadata leaf. Otherwise the reservation
    is kept by transferring it from the transaction block reserve to the
    delayed items block reserve, just like before. Given that with a leaf
    size of 16K we can have a few hundred dir index items in a single leaf
    (the exact value depends on file name lengths), this reduces pressure on
    metadata reservation by releasing unnecessary space much sooner.
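
    The release-or-keep decision described above can be modelled roughly
    as follows. This is a simplified sketch, not the kernel implementation:
    the struct, its fields and the function are hypothetical, and it only
    tracks how many bytes of pending items fit in the current leaf.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-directory bookkeeping; not the kernel's data structure. */
struct example_dir_state {
	uint32_t leaf_data_size;  /* usable bytes in one leaf */
	uint32_t curr_leaf_bytes; /* bytes of pending items in the current leaf */
	uint32_t leaves_reserved; /* units held in the delayed items reserve */
};

/*
 * Account for one new delayed dir index item of item_size bytes (item
 * header included). Returns true if the caller's unit must be kept (moved
 * to the delayed items reserve) because one more leaf is touched, false if
 * the unit can be released right away.
 */
static bool example_add_index_item(struct example_dir_state *s, uint32_t item_size)
{
	assert(item_size > 0 && item_size <= s->leaf_data_size);

	if (s->leaves_reserved == 0 ||
	    s->curr_leaf_bytes + item_size > s->leaf_data_size) {
		/* The item starts a new leaf: keep the reservation for it. */
		s->leaves_reserved++;
		s->curr_leaf_bytes = item_size;
		return true;
	}

	/* The item fits in a leaf we already hold a unit for. */
	s->curr_leaf_bytes += item_size;
	return false;
}
```

    Assuming a 16K leaf with 16283 usable bytes and 75-byte items (a
    25-byte item header, a 30-byte dir item header and a 20-character
    name; these sizes are assumptions for the example), 217 items share
    the first unit and the 218th is the next one to keep its reservation.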
    
    The following fs_mark test showed some improvement when creating many
    files in parallel on a machine with 12 cores running a non-debug kernel
    (Debian's default kernel config):
    
      $ cat test.sh
      #!/bin/bash
    
      DEV=/dev/nvme0n1
      MNT=/mnt/nvme0n1
      MOUNT_OPTIONS="-o ssd"
      FILES=100000
      THREADS=$(nproc --all)
    
      echo "performance" | \
          tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    
      mkfs.btrfs -f $DEV
      mount $MOUNT_OPTIONS $DEV $MNT
    
      OPTS="-S 0 -L 10 -n $FILES -s 0 -t $THREADS -k"
      for ((i = 1; i <= $THREADS; i++)); do
          OPTS="$OPTS -d $MNT/d$i"
      done
    
      fs_mark $OPTS
    
      umount $MNT
    
    Before:
    
    FSUse%        Count         Size    Files/sec     App Overhead
         2      1200000            0     225991.3          5465891
         4      2400000            0     345728.1          5512106
         4      3600000            0     346959.5          5557653
         8      4800000            0     329643.0          5587548
         8      6000000            0     312657.4          5606717
         8      7200000            0     281707.5          5727985
        12      8400000            0      88309.8          5020422
        12      9600000            0      85835.9          5207496
        16     10800000            0      81039.2          5404964
        16     12000000            0      58548.6          5842468
    
    After:
    
    FSUse%        Count         Size    Files/sec     App Overhead
         2      1200000            0     230604.5          5778375
         4      2400000            0     348908.3          5508072
         4      3600000            0     357028.7          5484337
         6      4800000            0     342898.3          5565703
         6      6000000            0     314670.8          5751555
         8      7200000            0     282548.2          5778177
        12      8400000            0      90844.9          5306819
        12      9600000            0      86963.1          5304689
        16     10800000            0      89113.2          5455248
        16     12000000            0      86693.5          5518933
    
    The "after" results were obtained with this patch and all the other
    patches in the same patchset applied. The patchset consists of the
    following changes:
    
      btrfs: balance btree dirty pages and delayed items after a rename
      btrfs: free the path earlier when creating a new inode
      btrfs: balance btree dirty pages and delayed items after clone and dedupe
      btrfs: add assertions when deleting batches of delayed items
      btrfs: deal with deletion errors when deleting delayed items
      btrfs: refactor the delayed item deletion entry point
      btrfs: improve batch deletion of delayed dir index items
      btrfs: assert that delayed item is a dir index item when adding it
      btrfs: improve batch insertion of delayed dir index items
      btrfs: do not BUG_ON() on failure to reserve metadata for delayed item
      btrfs: set delayed item type when initializing it
      btrfs: reduce amount of reserved metadata for delayed item insertion

    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>