1. 07 Jan, 2022 8 commits
    • Filipe Manana's avatar
      btrfs: remove useless condition check before splitting leaf · e5e1c174
      Filipe Manana authored
      When inserting a key, we check if the write_lock_level is less than 1,
      and if so we set it to 1, release the path and retry the tree traversal.
      
      However that is unnecessary, because when ins_len is greater than 0, we
      know that write_lock_level can never be less than 1.
      
      The logic to retry is also buggy, because in case ins_len was decremented,
      due to an exact key match and the search is not meant for item extension
      (path->search_for_extension is 0), we retry without incrementing ins_len,
      which would make the next retry decrement it again by the same amount.
      
      So remove the check for write_lock_level being less than 1 and add an
      assertion to assert it's always >= 1.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e5e1c174
    • Filipe Manana's avatar
      btrfs: try to unlock parent nodes earlier when inserting a key · e2e58d0f
      Filipe Manana authored
      When inserting a new key, we release the write lock on the leaf's parent
      only after doing the binary search on the leaf. This is because if the
      key ends up at slot 0, we will have to update the key at slot 0 of the
      parent node. The same reasoning applies to any other upper level nodes
      when their slot is 0. We also need to keep the parent locked in case the
      leaf does not have enough free space to insert the new key/item, because
      in that case we will split the leaf and we will need to add a new key to
      the parent due to a new leaf resulting from the split operation.
      
      However if the leaf has enough space for the new key and the key does not
      end up at slot 0 of the leaf we could release our write lock on the parent
      before doing the binary search on the leaf to figure out the destination
      slot. That leads to reducing the amount of time other tasks are blocked
      waiting to lock the parent, therefore increasing parallelism when there
      are other tasks that are trying to access other leaves accessible through
      the same parent. This also applies to other upper nodes besides the
      immediate parent, when their slot is 0, since we keep locks on them until
      we figure out if the leaf slot is slot 0 or not.
      
      In fact, having the key ending at up slot 0 when is rare. Typically it
      only happens when the key is less than or equals to the smallest, the
      "left most", key of the entire btree, during a split attempt when we try
      to push to the right sibling leaf or when the caller just wants to update
      the item of an existing key. It's also very common that a leaf has enough
      space to insert a new key, since after a split we move about half of the
      keys from one into the new leaf.
      
      So unlock the parent, and any other upper level nodes, when during a key
      insertion we notice the key is greater then the first key in the leaf and
      the leaf has enough free space. After unlocking the upper level nodes, do
      the binary search using a low boundary of slot 1 and not slot 0, to figure
      out the slot where the key will be inserted (or where the key already is
      in case it exists and the caller wants to modify its item data).
      This extra comparison, with the first key, is cheap and the key is very
      likely already in a cache line because it immediately follows the header
      of the extent buffer and we have recently read the level field of the
      header (which in fact is the last field of the header).
      
      The following fs_mark test was run on a non-debug kernel (debian's default
      kernel config), with a 12 cores intel CPU, and using a NVMe device:
      
        $ cat run-fsmark.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-O no-holes -R free-space-tree"
        FILES=100000
        THREADS=$(nproc --all)
        FILE_SIZE=0
      
        echo "performance" | \
      	tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        OPTS="-S 0 -L 10 -n $FILES -s $FILE_SIZE -t $THREADS -k"
        for ((i = 1; i <= $THREADS; i++)); do
            OPTS="$OPTS -d $MNT/d$i"
        done
      
        fs_mark $OPTS
      
        umount $MNT
      
      Before this change:
      
      FSUse%        Count         Size    Files/sec     App Overhead
           0      1200000            0     165273.6          5958381
           0      2400000            0     190938.3          6284477
           0      3600000            0     181429.1          6044059
           0      4800000            0     173979.2          6223418
           0      6000000            0     139288.0          6384560
           0      7200000            0     163000.4          6520083
           1      8400000            0      57799.2          5388544
           1      9600000            0      66461.6          5552969
           2     10800000            0      49593.5          5163675
           2     12000000            0      57672.1          4889398
      
      After this change:
      
      FSUse%        Count         Size    Files/sec            App Overhead
           0      1200000            0     167987.3 (+1.6%)         6272730
           0      2400000            0     198563.9 (+4.0%)         6048847
           0      3600000            0     197436.6 (+8.8%)         6163637
           0      4800000            0     202880.7 (+16.6%)        6371771
           1      6000000            0     167275.9 (+20.1%)        6556733
           1      7200000            0     204051.2 (+25.2%)        6817091
           1      8400000            0      69622.8 (+20.5%)        5525675
           1      9600000            0      69384.5 (+4.4%)         5700723
           1     10800000            0      61454.1 (+23.9%)        5363754
           3     12000000            0      61908.7 (+7.3%)         5370196
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e2e58d0f
    • Filipe Manana's avatar
      btrfs: allow generic_bin_search() to take low boundary as an argument · fb81212c
      Filipe Manana authored
      Right now generic_bin_search() always uses a low boundary slot of 0, but
      in the next patch we'll want to often skip slot 0 when searching for a
      key. So make generic_bin_search() have the low boundary slot specified
      as an argument, and move the check for the extent buffer level from
      btrfs_bin_search() to generic_bin_search() to avoid adding another
      wrapper around generic_bin_search().
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fb81212c
    • Josef Bacik's avatar
      btrfs: check the root node for uptodate before returning it · 120de408
      Josef Bacik authored
      Now that we clear the extent buffer uptodate if we fail to write it out
      we need to check to see if our root node is uptodate before we search
      down it.  Otherwise we could return stale data (or potentially corrupt
      data that was caught by the write verification step) and think that the
      path is OK to search down.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      120de408
    • Nikolay Borisov's avatar
      btrfs: allow device add if balance is paused · a174c0a2
      Nikolay Borisov authored
      Currently paused balance precludes adding a device since they are both
      considered exclusive ops and we can have at most one running at a time.
      This is problematic in case a filesystem encounters an ENOSPC situation
      while balance is running, in this case the only thing the user can do
      is mount the fs with "skip_balance" which pauses balance and delete some
      data to free up space for balance. However, it should be possible to add
      a new device when balance is paused.
      
      Fix this by allowing device add to proceed when balance is paused.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a174c0a2
    • Nikolay Borisov's avatar
      btrfs: make device add compatible with paused balance in btrfs_exclop_start_try_lock · 621a1ee1
      Nikolay Borisov authored
      This is needed to enable device add to work in cases when a file system
      has been mounted with 'skip_balance' mount option.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      621a1ee1
    • Nikolay Borisov's avatar
      btrfs: introduce exclusive operation BALANCE_PAUSED state · efc0e69c
      Nikolay Borisov authored
      Current set of exclusive operation states is not sufficient to handle
      all practical use cases. In particular there is a need to be able to add
      a device to a filesystem that have paused balance. Currently there is no
      way to distinguish between a running and a paused balance. Fix this by
      introducing BTRFS_EXCLOP_BALANCE_PAUSED which is going to be set in 2
      occasions:
      
      1. When a filesystem is mounted with skip_balance and there is an
         unfinished balance it will now be into BALANCE_PAUSED instead of
         simply BALANCE state.
      
      2. When a running balance is paused.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      efc0e69c
    • Filipe Manana's avatar
      btrfs: make send work with concurrent block group relocation · d96b3424
      Filipe Manana authored
      We don't allow send and balance/relocation to run in parallel in order
      to prevent send failing or silently producing some bad stream. This is
      because while send is using an extent (specially metadata) or about to
      read a metadata extent and expecting it belongs to a specific parent
      node, relocation can run, the transaction used for the relocation is
      committed and the extent gets reallocated while send is still using the
      extent, so it ends up with a different content than expected. This can
      result in just failing to read a metadata extent due to failure of the
      validation checks (parent transid, level, etc), failure to find a
      backreference for a data extent, and other unexpected failures. Besides
      reallocation, there's also a similar problem of an extent getting
      discarded when it's unpinned after the transaction used for block group
      relocation is committed.
      
      The restriction between balance and send was added in commit 9e967495
      ("Btrfs: prevent send failures and crashes due to concurrent relocation"),
      kernel 5.3, while the more general restriction between send and relocation
      was added in commit 1cea5cf0 ("btrfs: ensure relocation never runs
      while we have send operations running"), kernel 5.14.
      
      Both send and relocation can be very long running operations. Relocation
      because it has to do a lot of IO and expensive backreference lookups in
      case there are many snapshots, and send due to read IO when operating on
      very large trees. This makes it inconvenient for users and tools to deal
      with scheduling both operations.
      
      For zoned filesystem we also have automatic block group relocation, so
      send can fail with -EAGAIN when users least expect it or send can end up
      delaying the block group relocation for too long. In the future we might
      also get the automatic block group relocation for non zoned filesystems.
      
      This change makes it possible for send and relocation to run in parallel.
      This is achieved the following way:
      
      1) For all tree searches, send acquires a read lock on the commit root
         semaphore;
      
      2) After each tree search, and before releasing the commit root semaphore,
         the leaf is cloned and placed in the search path (struct btrfs_path);
      
      3) After releasing the commit root semaphore, the changed_cb() callback
         is invoked, which operates on the leaf and writes commands to the pipe
         (or file in case send/receive is not used with a pipe). It's important
         here to not hold a lock on the commit root semaphore, because if we did
         we could deadlock when sending and receiving to the same filesystem
         using a pipe - the send task blocks on the pipe because it's full, the
         receive task, which is the only consumer of the pipe, triggers a
         transaction commit when attempting to create a subvolume or reserve
         space for a write operation for example, but the transaction commit
         blocks trying to write lock the commit root semaphore, resulting in a
         deadlock;
      
      4) Before moving to the next key, or advancing to the next change in case
         of an incremental send, check if a transaction used for relocation was
         committed (or is about to finish its commit). If so, release the search
         path(s) and restart the search, to where we were before, so that we
         don't operate on stale extent buffers. The search restarts are always
         possible because both the send and parent roots are RO, and no one can
         add, remove of update keys (change their offset) in RO trees - the
         only exception is deduplication, but that is still not allowed to run
         in parallel with send;
      
      5) Periodically check if there is contention on the commit root semaphore,
         which means there is a transaction commit trying to write lock it, and
         release the semaphore and reschedule if there is contention, so as to
         avoid causing any significant delays to transaction commits.
      
      This leaves some room for optimizations for send to have less path
      releases and re searching the trees when there's relocation running, but
      for now it's kept simple as it performs quite well (on very large trees
      with resulting send streams in the order of a few hundred gigabytes).
      
      Test case btrfs/187, from fstests, stresses relocation, send and
      deduplication attempting to run in parallel, but without verifying if send
      succeeds and if it produces correct streams. A new test case will be added
      that exercises relocation happening in parallel with send and then checks
      that send succeeds and the resulting streams are correct.
      
      A final note is that for now this still leaves the mutual exclusion
      between send operations and deduplication on files belonging to a root
      used by send operations. A solution for that will be slightly more complex
      but it will eventually be built on top of this change.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d96b3424
  2. 03 Jan, 2022 32 commits